Jupyter Notebook

Validate & register multi-modal data#

Background#

scRNA data has moved beyond just RNA and can also include the measurements of other modalities such as chromatin accessibility, surface proteins or adaptive immune receptors. ECCITE-seq is designed to enable interrogation of single-cell transcriptomes together with surface protein markers in the context of CRISPR screens.

Here, we’ll showcase how to curate and register ECCITE-seq data from Papalexi21 in the form of MuData objects.

Setup#

Hide code cell content
!lamin init --storage ./test-multimodal --schema bionty
πŸ’‘ creating schemas: core==0.46.1 bionty==0.30.0 
βœ… saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 17:18:00)
βœ… saved: Storage(id='QbSXDVao', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal', type='local', updated_at=2023-08-28 17:18:00, created_by_id='DzTjkKse')
βœ… loaded instance: testuser1/test-multimodal
πŸ’‘ did not register local instance on hub (if you want, call `lamin register`)

import lamindb as ln
import lnschema_bionty as lb

lb.settings.species = "human"
ln.settings.verbosity = 3
βœ… loaded instance: testuser1/test-multimodal (lamindb 0.51.0)
βœ… set species: Species(id='uHJU', name='human', taxon_id=9606, scientific_name='homo_sapiens', updated_at=2023-08-28 17:18:02, bionty_source_id='GUPQ', created_by_id='DzTjkKse')
ln.track()
πŸ’‘ notebook imports: lamindb==0.51.0 lnschema_bionty==0.30.0
βœ… saved: Transform(id='yMWSFirS6qv2z8', name='Validate & register multi-modal data', short_name='multimodal', version='0', type=notebook, updated_at=2023-08-28 17:18:02, created_by_id='DzTjkKse')
βœ… saved: Run(id='Euv4SXFiqsmgu8oN0A8R', run_at=2023-08-28 17:18:02, transform_id='yMWSFirS6qv2z8', created_by_id='DzTjkKse')

Papalexi21#

Let’s use a MuData object:

Transform #

Hide code cell content
mdata = ln.dev.datasets.mudata_papalexi21_subset()
mdata
MuData object with n_obs Γ— n_vars = 200 Γ— 300
  var:	'name'
  4 modalities
    rna:	200 x 173
      obs:	'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var:	'name'
    adt:	200 x 4
      obs:	'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var:	'name'
    hto:	200 x 12
      obs:	'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var:	'name'
    gdo:	200 x 111
      obs:	'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var:	'name'

MuData objects build on top of AnnData objects to store and serialize multimodal data. More information can be found on the MuData documentation.

First we register the file:

file = ln.File(
    "papalexi21_subset.h5mu", description="Sub-sampled MuData from Papalexi21"
)
file.save()
βœ… storing file 'FnGGyp7gHg5fcLiPoiyz' at '.lamindb/FnGGyp7gHg5fcLiPoiyz.h5mu'

Now let’s validate and register the 3 feature sets this data contains:

  1. RNA (gene expression)

  2. ADT (antibody derived tags reflecting surface proteins)

  3. obs (metadata)

For the two modalities rna and adt, we use bionty tables as the reference:

Validate #

mdata["rna"].var_names[:5]
Index(['RP5-827C21.6', 'XX-CR54.1', 'SH2D6', 'RP11-379B18.5', 'RP11-778D9.12'], dtype='object', name='index')
lb.Gene.validate(mdata["rna"].var_names, lb.Gene.symbol);
πŸ’‘ using global setting species = human
❗ 173 terms (100.00%) are not validated for symbol: RP5-827C21.6, XX-CR54.1, SH2D6, RP11-379B18.5, RP11-778D9.12, RP11-703G6.1, AC005150.1, RP11-717H13.1, CTC-498J12.1, CTC-467M3.1, ARHGAP26-AS1, GABRA1, HIST1H4K, HLA-DQB1-AS1, RP11-524H19.2, SPACA1, VNN1, AC006042.7, AC002066.1, AC073934.6, ...
genes = lb.Gene.from_values(mdata["rna"].var_names, lb.Gene.symbol)
ln.save(genes)
πŸ’‘ using global setting species = human
βœ… created 77 Gene records from Bionty matching symbol: SH2D6, ARHGAP26-AS1, GABRA1, HLA-DQB1-AS1, SPACA1, VNN1, CTAGE15, PFKFB1, TRPC5, RBPMS-AS1, CA8, CSMD3, ZNF483, AK8, TMEM72-AS1, ARAP1-AS2, CRYAB, HOXC-AS2, LRRIQ1, TUBA3C, ...
βœ… created 12 Gene records from Bionty matching synonyms: CTC-467M3.1, HIST1H4K, CASC1, LARGE, NBPF16, C1orf65, IBA57-AS1, KIAA1239, TMEM75, AP003419.16, FAM65C, C14orf177
❗ ambiguous validation in Bionty for 6 records: HLA-DQB1-AS1, CTAGE15, CTRB2, LGALS9C, PCDHB11, TBC1D3G
❗ did not create Gene records for 84 non-validated symbols: AC002066.1, AC004019.13, AC005150.1, AC006042.7, AC011558.5, AC026471.6, AC073934.6, AC091132.1, AC092295.4, AC092687.5, AE000662.93, AL132989.1, AP000442.4, CTA-373H7.7, CTB-134F13.1, CTB-31O20.9, CTC-498J12.1, CTD-2562J17.2, CTD-3012A18.1, CTD-3065B20.2, ...
mdata["rna"].var_names = lb.Gene.standardize(mdata["rna"].var_names, lb.Gene.symbol)
πŸ’‘ using global setting species = human
πŸ’‘ standardized 89/173 terms
validated = lb.Gene.validate(mdata["rna"].var_names, lb.Gene.symbol)
πŸ’‘ using global setting species = human
βœ… 89 terms (51.40%) are validated for symbol
❗ 84 terms (48.60%) are not validated for symbol: RP5-827C21.6, XX-CR54.1, RP11-379B18.5, RP11-778D9.12, RP11-703G6.1, AC005150.1, RP11-717H13.1, CTC-498J12.1, RP11-524H19.2, AC006042.7, AC002066.1, AC073934.6, RP11-268G12.1, U52111.14, RP11-235C23.5, RP11-12J10.3, RP11-324E6.9, RP11-187A9.3, RP11-365N19.2, RP11-346D14.1, ...
new_genes = [
    lb.Gene(symbol=symbol, species=lb.settings.species)
    for symbol in mdata["rna"].var_names[~validated]
]
ln.save(new_genes)
lb.Gene.validate(mdata["rna"].var_names, lb.Gene.symbol);
πŸ’‘ using global setting species = human
βœ… 173 terms (100.00%) are validated for symbol
feature_set_rna = ln.FeatureSet.from_values(
    mdata["rna"].var_names, field=lb.Gene.symbol
)
πŸ’‘ using global setting species = human
βœ… 173 terms (100.00%) are validated for symbol
πŸ’‘ using global setting species = human
mdata["adt"].var_names
Index(['CD86', 'PDL1', 'PDL2', 'CD366'], dtype='object', name='index')
lb.CellMarker.validate(mdata["adt"].var_names, field=lb.CellMarker.name);
πŸ’‘ using global setting species = human
❗ 4 terms (100.00%) are not validated for name: CD86, PDL1, PDL2, CD366
markers = lb.CellMarker.from_values(mdata["adt"].var_names, field=lb.CellMarker.name)
ln.save(markers)
πŸ’‘ using global setting species = human
βœ… created 4 CellMarker records from Bionty matching name: CD86, PDL1, PDL2, CD366
lb.CellMarker.validate(mdata["adt"].var_names, field=lb.CellMarker.name);
πŸ’‘ using global setting species = human
βœ… 4 terms (100.00%) are validated for name

Register #

feature_set_adt = ln.FeatureSet.from_values(
    mdata["adt"].var_names, field=lb.CellMarker.name
)
πŸ’‘ using global setting species = human
βœ… 4 terms (100.00%) are validated for name
πŸ’‘ using global setting species = human

Link them to file:

file.features.add_feature_set(feature_set_rna, slot="rna")
file.features.add_feature_set(feature_set_adt, slot="adt")

The 3rd feature set is the obs:

obs = mdata["rna"].obs

We’re only interested in a single metadata column:

ln.Feature(name="gene_target", type="category").save()
features = ln.Feature.from_df(obs)
ln.save(features)
feature_set_obs = ln.FeatureSet.from_df(obs)
βœ… 19 terms (100.00%) are validated for name
file.features.add_feature_set(feature_set_obs, slot="obs")
gene_targets = lb.Gene.from_values(obs["gene_target"], lb.Gene.symbol)
ln.save(gene_targets)
file.add_labels(gene_targets, feature="gene_target")
πŸ’‘ using global setting species = human
βœ… created 23 Gene records from Bionty matching symbol: IFNGR1, CAV1, IRF7, ATF2, NFKBIA, STAT1, SPI1, JAK2, STAT2, IFNGR2, CD86, STAT5A, SMAD4, ETV7, IRF1, UBE2L6, PDCD1LG2, BRD4, POU2F2, STAT3, ...
βœ… created 1 Gene record from Bionty matching synonyms: MARCH8
❗ ambiguous validation in Bionty for 4 records: MARCHF8, IRF7, IFNGR2, TNFRSF14
❗ did not create Gene record for 1 non-validated symbol: NT
βœ… linked feature 'gene_target' to registry 'bionty.Gene'
nt = ln.Label(name="NT", description="Non-targeting control of perturbations")
nt.save()
file.add_labels(nt, feature="gene_target")
βœ… linked feature 'gene_target' to registry 'core.Label'
for col in ["orig.ident", "perturbation", "replicate", "Phase", "guide_ID"]:
    labels = [ln.Label(name=name) for name in obs[col].unique()]
    ln.save(labels)
βœ… loaded record with exact same name 

Because none of these labels seem like something we’d want to track in the registry or validate, we don’t link them to the file.

file.features
'rna': FeatureSet(id='tL3Mlmc6gIvEEw2UEsdS', n=184, type='float', registry='bionty.Gene', hash='Y8lsRtXCZKyPPberKAF0', updated_at=2023-08-28 17:18:09, created_by_id='DzTjkKse')
'adt': FeatureSet(id='h7qRLUpMWmIkkWJW2q8M', n=4, type='float', registry='bionty.CellMarker', hash='b-CtyjgPRO0WN27lTOqC', updated_at=2023-08-28 17:18:09, created_by_id='DzTjkKse')
'obs': FeatureSet(id='WyOUtF7IwYAzCAS6xDH5', n=19, registry='core.Feature', hash='rIpru3ak3te_X3q5NgnD', updated_at=2023-08-28 17:18:10, created_by_id='DzTjkKse')
file.describe()
πŸ’‘ File(id='FnGGyp7gHg5fcLiPoiyz', key=None, suffix='.h5mu', accessor='MuData', description='Sub-sampled MuData from Papalexi21', version=None, size=606320, hash='RaivS3NesDOP-6kNIuaC3g', hash_type='md5', created_at=2023-08-28 17:18:03, updated_at=2023-08-28 17:18:03)

Provenance:
    πŸ—ƒοΈ storage: Storage(id='QbSXDVao', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal', type='local', updated_at=2023-08-28 17:18:00, created_by_id='DzTjkKse')
    πŸ’« transform: Transform(id='yMWSFirS6qv2z8', name='Validate & register multi-modal data', short_name='multimodal', version='0', type=notebook, updated_at=2023-08-28 17:18:03, created_by_id='DzTjkKse')
    πŸ‘£ run: Run(id='Euv4SXFiqsmgu8oN0A8R', run_at=2023-08-28 17:18:02, transform_id='yMWSFirS6qv2z8', created_by_id='DzTjkKse')
    πŸ‘€ created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 17:18:00)
Features:
  adt:
    πŸ”— index (4, bionty.CellMarker.id): ['82nG0xqSuEQD', 'kbrA7wdDuqDK', 'L0m6f7FPiDeg', 'BK30rjK34sZd'...]
  rna:
    πŸ”— index (184, bionty.Gene.id): ['JX35T6MPAehY', '40upnBi0oOrP', 'Q1niomE2Unvj', 'aEObkJL9sXMA', 'zUn2i8bxpM0X'...]
  obs (metadata):
    πŸ”— gene_target (bionty.Gene|core.Label)
        πŸ”— gene_target (28, bionty.Gene): ['TNFRSF14', 'IRF7', 'SPI1', 'MARCHF8', 'TNFRSF14']
        πŸ”— gene_target (1, core.Label): ['NT']
file.view_lineage()
https://d33wubrfki0l68.cloudfront.net/930dfdbcffa8554e37d6226827086b35dbbbdec6/b6aa4/_images/6ec8fd652cfa1ae71ca4e7716a3070ec70a0af5ae3f42fd32dfee118d1e1a2e9.svg
Hide code cell content
!lamin delete --force test-multimodal
!rm -r test-multimodal
πŸ’‘ deleting instance testuser1/test-multimodal
βœ…     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-multimodal.env
βœ…     instance cache deleted
βœ…     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal