Analyst’s view#

Here, we’ll take an analyst-centric view of typical file transformations.

For a more general introduction, read Bird’s eye view first.

# initialize a LaminDB instance that includes the Bionty schema
!lamin init --storage ./analysis-usecase --schema bionty

import lamindb as ln
import lnschema_bionty as lb

lb.settings.species = "human"  # globally set species
lb.settings.auto_save_parents = False

✅ loaded instance: testuser1/analysis-usecase (lamindb 0.51.0)

✅ set species: Species(id='uHJU', name='human', taxon_id=9606, scientific_name='homo_sapiens', updated_at=2023-08-28 17:19:38, bionty_source_id='rCA6', created_by_id='DzTjkKse')

ln.track()

💡 notebook imports: lamindb==0.51.0 lnschema_bionty==0.30.0

✅ saved: Transform(id='eNef4Arw8nNMz8', name='Analyst's view', short_name='analysis-flow', version='0', type=notebook, updated_at=2023-08-28 17:19:39, created_by_id='DzTjkKse')

✅ saved: Run(id='AaByItQRNjcGTMIr8PfE', run_at=2023-08-28 17:19:39, transform_id='eNef4Arw8nNMz8', created_by_id='DzTjkKse')

Track cell types, tissues and diseases#

We fetch an example dataset from LaminDB that has a few cell type, tissue and disease annotations:

adata

AnnData object with n_obs × n_vars = 40 × 100
    obs: 'cell_type', 'cell_type_id', 'tissue', 'disease'

adata.var_names[:5]

Index(['ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419',
       'ENSG00000000457', 'ENSG00000000460'],
      dtype='object')

adata.obs[["tissue", "cell_type", "disease"]].value_counts()

tissue  cell_type                disease                   
brain   my new cell type         Alzheimer disease             10
heart   hepatocyte               cardiac ventricle disorder    10
kidney  T cell                   chronic kidney disease        10
liver   hematopoietic stem cell  liver lymphoma                10
Name: count, dtype: int64

Register biological metadata and link to the dataset#

As a first step, we register the AnnData object with LaminDB using from_anndata():

file = ln.File.from_anndata(
    adata, key="mini_anndata_with_obs.h5ad", var_ref=lb.Gene.ensembl_gene_id
)

💡 file will be copied to default storage upon `save()` with key 'mini_anndata_with_obs.h5ad'

💡 parsing feature names of X stored in slot 'var'

💡    using global setting species = human

❗    received 99 unique terms, 1 empty/duplicated term is ignored

❗    99 terms (100.00%) are not validated for ensembl_gene_id: ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084, ENSG00000001167, ENSG00000001460, ENSG00000001461, ENSG00000001497, ENSG00000001561, ENSG00000001617, ENSG00000001626, ENSG00000001629, ENSG00000001630, ENSG00000001631, ENSG00000002016, ...

❗    no validated features, skip creating feature set

💡 parsing feature names of slot 'obs'

❗    4 terms (100.00%) are not validated for name: cell_type, cell_type_id, tissue, disease

❗    no validated features, skip creating feature set
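
The log above reports 99 unique terms out of 100 variable names, with one empty or duplicated term ignored. As a minimal sketch of that bookkeeping in plain Python (this is not LaminDB's implementation; `summarize_terms` and the generated gene IDs are hypothetical stand-ins for `adata.var_names`):

```python
from collections import Counter


def summarize_terms(terms):
    """Return (unique non-empty terms in first-seen order, number of ignored entries)."""
    cleaned = [t for t in terms if t]  # drop empty strings and None
    unique = list(dict.fromkeys(cleaned))  # de-duplicate, keep order
    ignored = len(terms) - len(unique)
    return unique, ignored


# hypothetical gene IDs: 99 distinct ones plus one repeat of the first
var_names = [f"ENSG{i:011d}" for i in range(1, 100)] + ["ENSG00000000001"]

unique, ignored = summarize_terms(var_names)
duplicated = [t for t, n in Counter(var_names).items() if n > 1]

print(len(unique), ignored, duplicated)
```

Running a check like this before registration shows which identifier is repeated, which the log line alone does not reveal.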

file.save()

✅ storing file 'fIXsNWA2g9ZM4aOkUuLM' at 'mini_anndata_with_obs.h5ad'

cell_types = lb.CellType.from_values(adata.obs.cell_type, lb.CellType.name)
tissues = lb.Tissue.from_values(adata.obs.tissue, lb.Tissue.name)
diseases = lb.Disease.from_values(adata.obs.disease, lb.Disease.name)

All of these look good and contain no typos, so let’s save them to their registries:

ln.save(cell_types)
ln.save(tissues)
ln.save(diseases)
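
The describe() output further down lists only three cell types, and “my new cell type” is not among them, which suggests that values without a match in the reference ontology are not turned into labels. The idea can be sketched in plain Python as follows; this is an illustration only, not how `from_values()` is implemented, and `split_validated` and `reference_names` are hypothetical names:

```python
def split_validated(values, reference_names):
    """Split the unique values into those found in the reference and those that are not."""
    unique = list(dict.fromkeys(values))  # de-duplicate, keep first-seen order
    validated = [v for v in unique if v in reference_names]
    not_validated = [v for v in unique if v not in reference_names]
    return validated, not_validated


# hypothetical stand-in for the names known to a cell type ontology
reference_names = {"T cell", "hematopoietic stem cell", "hepatocyte"}

# same composition as adata.obs.cell_type in this example: 4 groups of 10 cells
observed = (
    ["my new cell type"] * 10
    + ["hepatocyte"] * 10
    + ["T cell"] * 10
    + ["hematopoietic stem cell"] * 10
)

validated, not_validated = split_validated(observed, reference_names)
print("validated:", validated)
print("not validated:", not_validated)
```

Inspecting the non-validated list is a cheap way to catch typos or genuinely new terms before anything is written to a registry.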

We also need some features to bucket these labels:

ln.Feature(name="cell_type", type="category").save()
ln.Feature(name="tissue", type="category").save()
ln.Feature(name="disease", type="category").save()

Link the labels to the file, one feature at a time:

file.add_labels(cell_types, feature="cell_type")
file.add_labels(tissues, feature="tissue")
file.add_labels(diseases, feature="disease")

file.describe()

💡 File(id='fIXsNWA2g9ZM4aOkUuLM', key='mini_anndata_with_obs.h5ad', suffix='.h5ad', accessor='AnnData', description=None, version=None, size=46992, hash='IJORtcQUSS11QBqD-nTD0A', hash_type='md5', created_at=2023-08-28 17:19:40, updated_at=2023-08-28 17:19:40)

Provenance:
    🗃️ storage: Storage(id='UMx8jvzI', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-08-28 17:19:36, created_by_id='DzTjkKse')
    💫 transform: Transform(id='eNef4Arw8nNMz8', name='Analyst's view', short_name='analysis-flow', version='0', type=notebook, updated_at=2023-08-28 17:19:40, created_by_id='DzTjkKse')
    👣 run: Run(id='AaByItQRNjcGTMIr8PfE', run_at=2023-08-28 17:19:39, transform_id='eNef4Arw8nNMz8', created_by_id='DzTjkKse')
    👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 17:19:36)
Features:
  external:
    🔗 cell_type (3, bionty.CellType): ['T cell', 'hematopoietic stem cell', 'hepatocyte']
    🔗 disease (4, bionty.Disease): ['liver lymphoma', 'Alzheimer disease', 'chronic kidney disease', 'cardiac ventricle disorder']
    🔗 tissue (4, bionty.Tissue): ['kidney', 'liver', 'heart', 'brain']

file.view_lineage()

https://d33wubrfki0l68.cloudfront.net/41ed24182369e753bdb2cdbb609e148c670dcb68/6fc4e/_images/eb05f11720c6fd05e2ce2ebbc396bacad8bf06e56bb861cec08ba6e0dcf817f0.svg

Examine the currently available cell types and tissues:

lb.CellType.filter().df()

	name	ontology_id	abbr	synonyms	description	bionty_source_id	updated_at	created_by_id
id
BxNjby0x	T cell	CL:0000084	None	T-lymphocyte|T-cell|T lymphocyte	A Type Of Lymphocyte Whose Defining Characteri...	zURR	2023-08-28 17:19:46	DzTjkKse
m91LZBDZ	hematopoietic stem cell	CL:0000037	None	blood forming stem cell|hemopoietic stem cell|HSC	A Stem Cell From Which All Cells Of The Lympho...	zURR	2023-08-28 17:19:46	DzTjkKse
J7hHC8SK	hepatocyte	CL:0000182	None	None	The Main Structural Component Of The Liver. Th...	zURR	2023-08-28 17:19:46	DzTjkKse

lb.Tissue.filter().df()

	name	ontology_id	abbr	synonyms	description	bionty_source_id	updated_at	created_by_id
id
j9lTWyWV	kidney	UBERON:0002113	None	None	A Paired Organ Of The Urinary Tract Which Has ...	T17w	2023-08-28 17:19:46	DzTjkKse
HHKnN309	liver	UBERON:0002107	None	None	An Exocrine Gland Which Secretes Bile And Func...	T17w	2023-08-28 17:19:46	DzTjkKse
sm45H0wI	heart	UBERON:0000948	None	vertebrate heart|chambered heart	A Myogenic Muscular Circulatory Organ Found In...	T17w	2023-08-28 17:19:46	DzTjkKse
7HcGzG0l	brain	UBERON:0000955	None	None	The Brain Is The Center Of The Nervous System ...	T17w	2023-08-28 17:19:46	DzTjkKse

Processing the dataset#

To track our data transformation, we create a new Transform of type “pipeline”:

transform = ln.Transform(
    name="Subset to T-cells and liver lymphoma", version="0.1.0", type="pipeline"
)

Switch tracking to the new transform:

ln.track(transform)

✅ saved: Transform(id='iRkSgqdvTiJQ5g', name='Subset to T-cells and liver lymphoma', version='0.1.0', type='pipeline', updated_at=2023-08-28 17:19:47, created_by_id='DzTjkKse')

✅ saved: Run(id='41GWkomMMNDu5f8nNcBL', run_at=2023-08-28 17:19:47, transform_id='iRkSgqdvTiJQ5g', created_by_id='DzTjkKse')

Get a backed AnnData object#

file = ln.File.filter(key="mini_anndata_with_obs.h5ad").one()

adata = file.backed()
adata

💡 adding file fIXsNWA2g9ZM4aOkUuLM as input for run 41GWkomMMNDu5f8nNcBL, adding parent transform eNef4Arw8nNMz8

AnnDataAccessor object with n_obs × n_vars = 40 × 100
  constructed for the AnnData object mini_anndata_with_obs.h5ad
    obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
    var: ['_index']

adata.obs[["cell_type", "disease"]].value_counts()

cell_type                disease                   
T cell                   chronic kidney disease        10
hematopoietic stem cell  liver lymphoma                10
hepatocyte               cardiac ventricle disorder    10
my new cell type         Alzheimer disease             10
Name: count, dtype: int64

Subset dataset to specific cell types and diseases#

Create the subset:

subset_obs = adata.obs.cell_type.isin(["T cell", "hematopoietic stem cell"]) & (
    adata.obs.disease.isin(["liver lymphoma", "chronic kidney disease"])
)

adata_subset = adata[subset_obs]
adata_subset

AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
  obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
  var: ['_index']

adata_subset.obs[["cell_type", "disease"]].value_counts()

cell_type                disease               
T cell                   chronic kidney disease    10
hematopoietic stem cell  liver lymphoma            10
Name: count, dtype: int64
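
The mask above keeps a cell only if both conditions hold: its cell type is in the first list and its disease is in the second. A dependency-free sketch of the same logic, using plain dictionaries as a hypothetical stand-in for the rows of `adata.obs`:

```python
from collections import Counter

# same composition as the example dataset: 4 groups of 10 cells each
groups = [
    ("brain", "my new cell type", "Alzheimer disease"),
    ("heart", "hepatocyte", "cardiac ventricle disorder"),
    ("kidney", "T cell", "chronic kidney disease"),
    ("liver", "hematopoietic stem cell", "liver lymphoma"),
]
rows = [
    {"tissue": tissue, "cell_type": cell_type, "disease": disease}
    for tissue, cell_type, disease in groups
    for _ in range(10)
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# one boolean per row, True only when both conditions are met
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in rows
]
subset = [row for row, keep in zip(rows, mask) if keep]

# count (cell_type, disease) combinations, analogous to value_counts()
counts = Counter((row["cell_type"], row["disease"]) for row in subset)
print(len(subset))
for combination, n in counts.items():
    print(combination, n)
```

Because the two conditions are combined with a logical AND, a T cell annotated with a disease outside the second list would be dropped even though its cell type matches.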

This subset can now be registered:

file_subset = ln.File.from_anndata(
    adata_subset.to_memory(),
    key="subset/mini_anndata_with_obs.h5ad",
    var_ref=lb.Gene.ensembl_gene_id,
)

/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/anndata/_core/anndata.py:1840: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")

💡 file will be copied to default storage upon `save()` with key 'subset/mini_anndata_with_obs.h5ad'

💡 parsing feature names of X stored in slot 'var'

💡    using global setting species = human

❗    received 99 unique terms, 1 empty/duplicated term is ignored

❗    99 terms (100.00%) are not validated for ensembl_gene_id: ENSG00000000003, ENSG00000000005, ENSG00000000419, ENSG00000000457, ENSG00000000460, ENSG00000000938, ENSG00000000971, ENSG00000001036, ENSG00000001084, ENSG00000001167, ENSG00000001460, ENSG00000001461, ENSG00000001497, ENSG00000001561, ENSG00000001617, ENSG00000001626, ENSG00000001629, ENSG00000001630, ENSG00000001631, ENSG00000002016, ...

❗    no validated features, skip creating feature set

💡 parsing feature names of slot 'obs'

✅    3 terms (75.00%) are validated for name

❗    1 term (25.00%) is not validated for name: cell_type_id

✅    loaded: FeatureSet(id='k3qE0h805glMFKk4kO0Y', n=3, registry='core.Feature', hash='31ErlBj6bWYsixzVupVd', updated_at=2023-08-28 17:19:46, modality_id='g5FeuudV', created_by_id='DzTjkKse')

✅    linked: FeatureSet(id='k3qE0h805glMFKk4kO0Y', n=3, registry='core.Feature', hash='31ErlBj6bWYsixzVupVd', updated_at=2023-08-28 17:19:46, modality_id='g5FeuudV', created_by_id='DzTjkKse')

file_subset.save()

✅ storing file 'ACPBqrlQDcEPwVTQs4dV' at 'subset/mini_anndata_with_obs.h5ad'

Add labels to the features; all of them validate:

cell_types = lb.CellType.from_values(adata.obs.cell_type, lb.CellType.name)
tissues = lb.Tissue.from_values(adata.obs.tissue, lb.Tissue.name)
diseases = lb.Disease.from_values(adata.obs.disease, lb.Disease.name)

file_subset.add_labels(cell_types, feature="cell_type")
file_subset.add_labels(tissues, feature="tissue")
file_subset.add_labels(diseases, feature="disease")

file_subset.describe()

💡 File(id='ACPBqrlQDcEPwVTQs4dV', key='subset/mini_anndata_with_obs.h5ad', suffix='.h5ad', accessor='AnnData', description=None, version=None, size=38992, hash='RgGUx7ndRplZZSmalTAWiw', hash_type='md5', created_at=2023-08-28 17:19:47, updated_at=2023-08-28 17:19:47)

Provenance:
    🗃️ storage: Storage(id='UMx8jvzI', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase', type='local', updated_at=2023-08-28 17:19:36, created_by_id='DzTjkKse')
    🧩 transform: Transform(id='iRkSgqdvTiJQ5g', name='Subset to T-cells and liver lymphoma', version='0.1.0', type='pipeline', updated_at=2023-08-28 17:19:47, created_by_id='DzTjkKse')
    👣 run: Run(id='41GWkomMMNDu5f8nNcBL', run_at=2023-08-28 17:19:47, transform_id='iRkSgqdvTiJQ5g', created_by_id='DzTjkKse')
    👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 17:19:36)
Features:
  obs (metadata):
    🔗 cell_type (3, bionty.CellType): ['T cell', 'hematopoietic stem cell', 'hepatocyte']
    🔗 disease (4, bionty.Disease): ['liver lymphoma', 'Alzheimer disease', 'chronic kidney disease', 'cardiac ventricle disorder']
    🔗 tissue (4, bionty.Tissue): ['kidney', 'brain', 'liver', 'heart']
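
Note that the labels above were derived from the full `adata.obs`, so the subset file is also linked to labels such as “hepatocyte” and “heart” that no longer occur in it. If only the values present in the subset should be linked, the labels would presumably be built from `adata_subset.obs` instead. A plain-Python sketch of collecting per-feature labels from the subset only (`labels_by_feature` and `subset_rows` are hypothetical stand-ins, not LaminDB API):

```python
def labels_by_feature(rows, features):
    """Collect the sorted set of distinct values for each feature."""
    return {feature: sorted({row[feature] for row in rows}) for feature in features}


# hypothetical stand-in for adata_subset.obs: one representative row per group
subset_rows = [
    {"tissue": "kidney", "cell_type": "T cell", "disease": "chronic kidney disease"},
    {
        "tissue": "liver",
        "cell_type": "hematopoietic stem cell",
        "disease": "liver lymphoma",
    },
]

labels = labels_by_feature(subset_rows, ["cell_type", "tissue", "disease"])
for feature, values in labels.items():
    print(feature, values)
```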

Examine data lineage#

Common questions that might arise are:

Which h5ad file is in the subset subfolder?
Which notebook ingested this file?
By whom?
And which file is its parent?

Let’s answer these using LaminDB:

To learn which .h5ad file is in the subset subfolder, query for an .h5ad file whose key starts with “subset” and that is labeled with “hematopoietic stem cell” or “T cell”:

cell_types_bt_lookup = lb.CellType.lookup()

my_subset = ln.File.filter(
    suffix=".h5ad",
    key__startswith="subset",
    cell_types__in=[
        cell_types_bt_lookup.hematopoietic_stem_cell,
        cell_types_bt_lookup.t_cell,
    ],
).first()

my_subset.view_lineage()

https://d33wubrfki0l68.cloudfront.net/cf557d456ab3709e66c72f20e8ffd862f02cd4d4/6162b/_images/a9ccef1e31d01d1d2d9b7fe4480f70dcf2ab72edb8a0ab6df106d770ed382fa0.svg
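
The lineage graph answers the questions above by following links from a file to the run that created it, and from that run to its transform, its creator and its input files. As a toy model of that walk (plain dictionaries filled with the IDs and names shown in the outputs above; this is not LaminDB's schema, and `lineage` is a hypothetical helper):

```python
# file id -> key and the run that created the file
files = {
    "fIXsNWA2g9ZM4aOkUuLM": {
        "key": "mini_anndata_with_obs.h5ad",
        "run": "AaByItQRNjcGTMIr8PfE",
    },
    "ACPBqrlQDcEPwVTQs4dV": {
        "key": "subset/mini_anndata_with_obs.h5ad",
        "run": "41GWkomMMNDu5f8nNcBL",
    },
}

# run id -> transform, creator and input files
runs = {
    "AaByItQRNjcGTMIr8PfE": {
        "transform": "Analyst's view",
        "type": "notebook",
        "created_by": "testuser1",
        "inputs": [],
    },
    "41GWkomMMNDu5f8nNcBL": {
        "transform": "Subset to T-cells and liver lymphoma",
        "type": "pipeline",
        "created_by": "testuser1",
        "inputs": ["fIXsNWA2g9ZM4aOkUuLM"],
    },
}


def lineage(file_id):
    """Walk from a file back through the runs that produced it and its inputs."""
    steps, stack, seen = [], [file_id], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against visiting a file twice
        seen.add(current)
        run = runs[files[current]["run"]]
        steps.append((files[current]["key"], run["transform"], run["created_by"]))
        stack.extend(run["inputs"])
    return steps


for key, transform, user in lineage("ACPBqrlQDcEPwVTQs4dV"):
    print(f"{key} <- {transform} (by {user})")
```

The first step describes the subset file and the pipeline that produced it; the second step is its parent file and the notebook that ingested it.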