Validate & register multi-modal data#
Background#
Single-cell sequencing data has moved beyond RNA alone and can also include measurements of other modalities, such as chromatin accessibility, surface proteins, or adaptive immune receptors. ECCITE-seq is designed to enable interrogation of single-cell transcriptomes together with surface protein markers in the context of CRISPR screens.
Here, we'll show how to curate and register ECCITE-seq data from Papalexi21 in the form of MuData objects.
Setup#
!lamin init --storage ./test-multimodal --schema bionty
💡 creating schemas: core==0.46.1 bionty==0.30.0
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 17:18:00)
✅ saved: Storage(id='QbSXDVao', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal', type='local', updated_at=2023-08-28 17:18:00, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/test-multimodal
💡 did not register local instance on hub (if you want, call `lamin register`)
import lamindb as ln
import lnschema_bionty as lb
lb.settings.species = "human"
ln.settings.verbosity = 3
✅ loaded instance: testuser1/test-multimodal (lamindb 0.51.0)
✅ set species: Species(id='uHJU', name='human', taxon_id=9606, scientific_name='homo_sapiens', updated_at=2023-08-28 17:18:02, bionty_source_id='GUPQ', created_by_id='DzTjkKse')
ln.track()
💡 notebook imports: lamindb==0.51.0 lnschema_bionty==0.30.0
✅ saved: Transform(id='yMWSFirS6qv2z8', name='Validate & register multi-modal data', short_name='multimodal', version='0', type=notebook, updated_at=2023-08-28 17:18:02, created_by_id='DzTjkKse')
✅ saved: Run(id='Euv4SXFiqsmgu8oN0A8R', run_at=2023-08-28 17:18:02, transform_id='yMWSFirS6qv2z8', created_by_id='DzTjkKse')
Papalexi21#
Let's use a MuData object:
Transform #
mdata = ln.dev.datasets.mudata_papalexi21_subset()
mdata
MuData object with n_obs × n_vars = 200 × 300
  var: 'name'
  4 modalities
    rna: 200 x 173
      obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var: 'name'
    adt: 200 x 4
      obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var: 'name'
    hto: 200 x 12
      obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var: 'name'
    gdo: 200 x 111
      obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var: 'name'
MuData objects build on top of AnnData objects to store and serialize multimodal data. More information can be found in the MuData documentation.
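As a quick sanity check on the output above, the combined variable count of the object is the sum of the variable counts of its modalities. The sketch below uses a plain dictionary as a stand-in for the MuData object; the dictionary and the helper `combined_shape` are illustrative assumptions and not part of the MuData API. The shapes are copied from the output above.

```python
# Shapes (n_obs, n_vars) copied from the MuData output above.
# A plain dict stands in for the MuData object; this is not the MuData API.
modality_shapes = {
    "rna": (200, 173),
    "adt": (200, 4),
    "hto": (200, 12),
    "gdo": (200, 111),
}


def combined_shape(shapes):
    """Return (n_obs, n_vars) for modalities measured on the same cells."""
    n_obs_values = {n_obs for n_obs, _ in shapes.values()}
    if len(n_obs_values) != 1:
        raise ValueError("this sketch assumes all modalities share the same cells")
    total_vars = sum(n_vars for _, n_vars in shapes.values())
    return n_obs_values.pop(), total_vars


print(combined_shape(modality_shapes))  # prints (200, 300)
```

The result matches the `200 × 300` reported in the first line of the output.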
First we register the file:
file = ln.File(
    "papalexi21_subset.h5mu", description="Sub-sampled MuData from Papalexi21"
)
file.save()
✅ storing file 'FnGGyp7gHg5fcLiPoiyz' at '.lamindb/FnGGyp7gHg5fcLiPoiyz.h5mu'
Now let's validate and register the three feature sets this data contains:
RNA (gene expression)
ADT (antibody derived tags reflecting surface proteins)
obs (metadata)
For the two modalities rna and adt, we use bionty tables as the reference:
Validate #
mdata["rna"].var_names[:5]
Index(['RP5-827C21.6', 'XX-CR54.1', 'SH2D6', 'RP11-379B18.5', 'RP11-778D9.12'], dtype='object', name='index')
lb.Gene.validate(mdata["rna"].var_names, lb.Gene.symbol);
💡 using global setting species = human
❗ 173 terms (100.00%) are not validated for symbol: RP5-827C21.6, XX-CR54.1, SH2D6, RP11-379B18.5, RP11-778D9.12, RP11-703G6.1, AC005150.1, RP11-717H13.1, CTC-498J12.1, CTC-467M3.1, ARHGAP26-AS1, GABRA1, HIST1H4K, HLA-DQB1-AS1, RP11-524H19.2, SPACA1, VNN1, AC006042.7, AC002066.1, AC073934.6, ...
genes = lb.Gene.from_values(mdata["rna"].var_names, lb.Gene.symbol)
ln.save(genes)
💡 using global setting species = human
✅ created 77 Gene records from Bionty matching symbol: SH2D6, ARHGAP26-AS1, GABRA1, HLA-DQB1-AS1, SPACA1, VNN1, CTAGE15, PFKFB1, TRPC5, RBPMS-AS1, CA8, CSMD3, ZNF483, AK8, TMEM72-AS1, ARAP1-AS2, CRYAB, HOXC-AS2, LRRIQ1, TUBA3C, ...
✅ created 12 Gene records from Bionty matching synonyms: CTC-467M3.1, HIST1H4K, CASC1, LARGE, NBPF16, C1orf65, IBA57-AS1, KIAA1239, TMEM75, AP003419.16, FAM65C, C14orf177
❗ ambiguous validation in Bionty for 6 records: HLA-DQB1-AS1, CTAGE15, CTRB2, LGALS9C, PCDHB11, TBC1D3G
❗ did not create Gene records for 84 non-validated symbols: AC002066.1, AC004019.13, AC005150.1, AC006042.7, AC011558.5, AC026471.6, AC073934.6, AC091132.1, AC092295.4, AC092687.5, AE000662.93, AL132989.1, AP000442.4, CTA-373H7.7, CTB-134F13.1, CTB-31O20.9, CTC-498J12.1, CTD-2562J17.2, CTD-3012A18.1, CTD-3065B20.2, ...
mdata["rna"].var_names = lb.Gene.standardize(mdata["rna"].var_names, lb.Gene.symbol)
💡 using global setting species = human
💡 standardized 89/173 terms
validated = lb.Gene.validate(mdata["rna"].var_names, lb.Gene.symbol)
💡 using global setting species = human
✅ 89 terms (51.40%) are validated for symbol
❗ 84 terms (48.60%) are not validated for symbol: RP5-827C21.6, XX-CR54.1, RP11-379B18.5, RP11-778D9.12, RP11-703G6.1, AC005150.1, RP11-717H13.1, CTC-498J12.1, RP11-524H19.2, AC006042.7, AC002066.1, AC073934.6, RP11-268G12.1, U52111.14, RP11-235C23.5, RP11-12J10.3, RP11-324E6.9, RP11-187A9.3, RP11-365N19.2, RP11-346D14.1, ...
new_genes = [
    lb.Gene(symbol=symbol, species=lb.settings.species)
    for symbol in mdata["rna"].var_names[~validated]
]
ln.save(new_genes)
lb.Gene.validate(mdata["rna"].var_names, lb.Gene.symbol);
💡 using global setting species = human
✅ 173 terms (100.00%) are validated for symbol
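The sequence above (validate, create records from the public reference, standardize, validate again, register what is left) can be pictured with a toy model. The sketch below is plain Python written under stated assumptions and is not how `lnschema_bionty` is implemented: `PUBLIC_SYMBOLS`, `PUBLIC_SYNONYMS`, `registry`, and the three helper functions are hypothetical stand-ins, and the synonym mapping is illustrative. It assumes, as the logs suggest, that validation checks symbols against the records already saved in the local instance.

```python
# Toy model of the registry workflow shown above (all names are hypothetical).
# PUBLIC_SYMBOLS / PUBLIC_SYNONYMS stand in for the public gene reference,
# `registry` stands in for the gene records saved in the local instance.
PUBLIC_SYMBOLS = {"SH2D6", "GABRA1", "LARGE1"}
PUBLIC_SYNONYMS = {"LARGE": "LARGE1"}  # synonym -> current symbol (illustrative)

registry = set()  # a fresh instance starts with no gene records


def validate(symbols):
    """One boolean per symbol: True if the symbol is in the local registry."""
    return [symbol in registry for symbol in symbols]


def records_from_values(symbols):
    """Collect public symbols matched directly or through a synonym."""
    matched = set()
    for symbol in symbols:
        if symbol in PUBLIC_SYMBOLS:
            matched.add(symbol)
        elif symbol in PUBLIC_SYNONYMS:
            matched.add(PUBLIC_SYNONYMS[symbol])
    return matched


def standardize(symbols):
    """Replace synonyms by their current symbol, keep other values as they are."""
    return [PUBLIC_SYNONYMS.get(symbol, symbol) for symbol in symbols]


var_names = ["SH2D6", "GABRA1", "LARGE", "RP5-827C21.6"]

step1 = validate(var_names)                 # empty registry: nothing validates
registry |= records_from_values(var_names)  # register matches from the reference
var_names = standardize(var_names)          # "LARGE" becomes "LARGE1"
step2 = validate(var_names)                 # only the unmatched name fails
registry |= {s for s, ok in zip(var_names, step2) if not ok}  # register the rest
step3 = validate(var_names)

print(sum(step1), sum(step2), sum(step3))  # prints 0 3 4
```

In this toy run the validated count goes from 0 to 3 to 4, mirroring the progression from 0 to 89 to 173 validated terms in the logs above.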
feature_set_rna = ln.FeatureSet.from_values(
    mdata["rna"].var_names, field=lb.Gene.symbol
)
💡 using global setting species = human
✅ 173 terms (100.00%) are validated for symbol
💡 using global setting species = human
mdata["adt"].var_names
Index(['CD86', 'PDL1', 'PDL2', 'CD366'], dtype='object', name='index')
lb.CellMarker.validate(mdata["adt"].var_names, field=lb.CellMarker.name);
💡 using global setting species = human
❗ 4 terms (100.00%) are not validated for name: CD86, PDL1, PDL2, CD366
markers = lb.CellMarker.from_values(mdata["adt"].var_names, field=lb.CellMarker.name)
ln.save(markers)
💡 using global setting species = human
✅ created 4 CellMarker records from Bionty matching name: CD86, PDL1, PDL2, CD366
lb.CellMarker.validate(mdata["adt"].var_names, field=lb.CellMarker.name);
💡 using global setting species = human
✅ 4 terms (100.00%) are validated for name
Register #
feature_set_adt = ln.FeatureSet.from_values(
    mdata["adt"].var_names, field=lb.CellMarker.name
)
💡 using global setting species = human
✅ 4 terms (100.00%) are validated for name
💡 using global setting species = human
Link them to file:
file.features.add_feature_set(feature_set_rna, slot="rna")
file.features.add_feature_set(feature_set_adt, slot="adt")
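One way to think about slots is as named entries on the file, each pointing to one feature set that is identified by its members. The sketch below illustrates that idea in plain Python. It is an assumption made for illustration only: the source shows that each FeatureSet carries a `hash` field but does not describe how lamindb computes it, so `feature_set_hash` and the `slots` dictionary are hypothetical.

```python
import hashlib


def feature_set_hash(identifiers):
    """Order-independent digest of a collection of feature identifiers.

    Hypothetical helper; not the hashing scheme used by lamindb.
    """
    joined = "\n".join(sorted(set(identifiers)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# A plain dict stands in for the file's slot -> feature set mapping.
slots = {}
slots["rna"] = feature_set_hash(["SH2D6", "GABRA1", "VNN1"])
slots["adt"] = feature_set_hash(["CD86", "PDL1", "PDL2", "CD366"])

# The same features listed in a different order give the same digest,
# so a feature set is identified by its members rather than their order.
assert slots["adt"] == feature_set_hash(["CD366", "PDL2", "PDL1", "CD86"])

print(sorted(slots))  # prints ['adt', 'rna']
```

Keying by slot keeps the per-modality feature sets separate even though they belong to the same file.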
The 3rd feature set is the obs:
obs = mdata["rna"].obs
We're only interested in a single metadata column:
ln.Feature(name="gene_target", type="category").save()
features = ln.Feature.from_df(obs)
ln.save(features)
feature_set_obs = ln.FeatureSet.from_df(obs)
✅ 19 terms (100.00%) are validated for name
file.features.add_feature_set(feature_set_obs, slot="obs")
gene_targets = lb.Gene.from_values(obs["gene_target"], lb.Gene.symbol)
ln.save(gene_targets)
file.add_labels(gene_targets, feature="gene_target")
💡 using global setting species = human
✅ created 23 Gene records from Bionty matching symbol: IFNGR1, CAV1, IRF7, ATF2, NFKBIA, STAT1, SPI1, JAK2, STAT2, IFNGR2, CD86, STAT5A, SMAD4, ETV7, IRF1, UBE2L6, PDCD1LG2, BRD4, POU2F2, STAT3, ...
✅ created 1 Gene record from Bionty matching synonyms: MARCH8
❗ ambiguous validation in Bionty for 4 records: MARCHF8, IRF7, IFNGR2, TNFRSF14
❗ did not create Gene record for 1 non-validated symbol: NT
✅ linked feature 'gene_target' to registry 'bionty.Gene'
nt = ln.Label(name="NT", description="Non-targeting control of perturbations")
nt.save()
file.add_labels(nt, feature="gene_target")
✅ linked feature 'gene_target' to registry 'core.Label'
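The `gene_target` column mixes two kinds of values: gene symbols that match the gene reference, and the control value `NT`, which does not. That is why the feature ends up linked to both `bionty.Gene` and `core.Label`. A minimal plain-Python sketch of that split follows; `KNOWN_GENES`, the example values, and `split_by_registry` are hypothetical and only illustrate the idea.

```python
# Hypothetical stand-ins: a few gene symbols known to the reference, and a
# handful of values as they might appear in obs["gene_target"].
KNOWN_GENES = {"IFNGR1", "CAV1", "IRF7", "STAT1"}
gene_target_values = ["IFNGR1", "NT", "STAT1", "NT", "IRF7"]


def split_by_registry(values, known_genes):
    """Split the unique values into gene symbols and everything else."""
    unique_values = sorted(set(values))
    genes = [v for v in unique_values if v in known_genes]
    other = [v for v in unique_values if v not in known_genes]
    return genes, other


genes, other = split_by_registry(gene_target_values, KNOWN_GENES)
print(genes)  # prints ['IFNGR1', 'IRF7', 'STAT1']
print(other)  # prints ['NT']
```

Values in the first list correspond to records in a gene registry, while the remainder would be handled as generic labels.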
for col in ["orig.ident", "perturbation", "replicate", "Phase", "guide_ID"]:
    labels = [ln.Label(name=name) for name in obs[col].unique()]
    ln.save(labels)
✅ loaded record with exact same name
Because none of these labels seem like something we'd want to track in the registry or validate, we don't link them to the file.
file.features
'rna': FeatureSet(id='tL3Mlmc6gIvEEw2UEsdS', n=184, type='float', registry='bionty.Gene', hash='Y8lsRtXCZKyPPberKAF0', updated_at=2023-08-28 17:18:09, created_by_id='DzTjkKse')
'adt': FeatureSet(id='h7qRLUpMWmIkkWJW2q8M', n=4, type='float', registry='bionty.CellMarker', hash='b-CtyjgPRO0WN27lTOqC', updated_at=2023-08-28 17:18:09, created_by_id='DzTjkKse')
'obs': FeatureSet(id='WyOUtF7IwYAzCAS6xDH5', n=19, registry='core.Feature', hash='rIpru3ak3te_X3q5NgnD', updated_at=2023-08-28 17:18:10, created_by_id='DzTjkKse')
file.describe()
💡 File(id='FnGGyp7gHg5fcLiPoiyz', key=None, suffix='.h5mu', accessor='MuData', description='Sub-sampled MuData from Papalexi21', version=None, size=606320, hash='RaivS3NesDOP-6kNIuaC3g', hash_type='md5', created_at=2023-08-28 17:18:03, updated_at=2023-08-28 17:18:03)
Provenance:
🗃️ storage: Storage(id='QbSXDVao', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal', type='local', updated_at=2023-08-28 17:18:00, created_by_id='DzTjkKse')
💫 transform: Transform(id='yMWSFirS6qv2z8', name='Validate & register multi-modal data', short_name='multimodal', version='0', type=notebook, updated_at=2023-08-28 17:18:03, created_by_id='DzTjkKse')
👣 run: Run(id='Euv4SXFiqsmgu8oN0A8R', run_at=2023-08-28 17:18:02, transform_id='yMWSFirS6qv2z8', created_by_id='DzTjkKse')
👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 17:18:00)
Features:
adt:
🔗 index (4, bionty.CellMarker.id): ['82nG0xqSuEQD', 'kbrA7wdDuqDK', 'L0m6f7FPiDeg', 'BK30rjK34sZd'...]
rna:
🔗 index (184, bionty.Gene.id): ['JX35T6MPAehY', '40upnBi0oOrP', 'Q1niomE2Unvj', 'aEObkJL9sXMA', 'zUn2i8bxpM0X'...]
obs (metadata):
🔗 gene_target (bionty.Gene|core.Label)
🔗 gene_target (28, bionty.Gene): ['TNFRSF14', 'IRF7', 'SPI1', 'MARCHF8', 'TNFRSF14']
🔗 gene_target (1, core.Label): ['NT']
file.view_lineage()
!lamin delete --force test-multimodal
!rm -r test-multimodal
💡 deleting instance testuser1/test-multimodal
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-multimodal.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
❗ consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal