Bird’s eye view#
Background#
Data lineage tracks data’s journey, detailing its origins, transformations, and interactions to trace biological insights, verify experimental outcomes, meet regulatory standards, and increase the robustness of research. While tracking data lineage is easier when it is governed by deterministic pipelines, it becomes hard when it is governed by interactive, human-driven analyses.
Here, we’ll backtrace file transformations through notebooks, pipelines & app uploads in a research project based on Schmidt22, which conducted genome-wide CRISPR activation and interference screens in primary human T cells to identify gene networks controlling IL-2 and IFN-γ production.
Setup#
We need an instance:
!lamin init --storage ./mydata
Show code cell output
💡 creating schemas: core==0.46.1
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 17:19:58)
✅ saved: Storage(id='EpWzEqJI', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata', type='local', updated_at=2023-08-28 17:19:58, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/mydata
💡 did not register local instance on hub (if you want, call `lamin register`)
Import lamindb:
import lamindb as ln
✅ loaded instance: testuser1/mydata (lamindb 0.51.0)
We simulate the raw data processing of Schmidt22 with toy data in a real-world setting with multiple collaborators (here, testuser1 and testuser2):
assert ln.setup.settings.user.handle == "testuser1"
bfx_run_output = ln.dev.datasets.dir_scrnaseq_cellranger(
    "perturbseq", basedir=ln.settings.storage, output_only=False
)
ln.track(ln.Transform(name="Chromium 10x upload", type="pipeline"))
ln.File(bfx_run_output.parent / "fastq/perturbseq_R1_001.fastq.gz").save()
ln.File(bfx_run_output.parent / "fastq/perturbseq_R2_001.fastq.gz").save()
Show code cell output
✅ saved: Transform(id='Jq2AGj7WRQWMjB', name='Chromium 10x upload', type='pipeline', updated_at=2023-08-28 17:20:00, created_by_id='DzTjkKse')
✅ saved: Run(id='0WsKZT4nD2dHrK9AhHES', run_at=2023-08-28 17:20:00, transform_id='Jq2AGj7WRQWMjB', created_by_id='DzTjkKse')
💡 file in storage 'mydata' with key 'fastq/perturbseq_R1_001.fastq.gz'
💡 file in storage 'mydata' with key 'fastq/perturbseq_R2_001.fastq.gz'
Track a bioinformatics pipeline#
When working with a pipeline, we register it before running it.
This only needs to happen once and can be done by anyone on your team.
ln.setup.login("testuser2")
✅ logged in with email testuser2@lamin.ai and id bKeW4T6E
❗ record with similar name exist! did you mean to load it?
| name | id | __ratio__ |
|---|---|---|
| Test User1 | DzTjkKse | 90.0 |
✅ saved: User(id='bKeW4T6E', handle='testuser2', email='testuser2@lamin.ai', name='Test User2', updated_at=2023-08-28 17:20:01)
transform = ln.Transform(name="Cell Ranger", version="7.2.0", type="pipeline")
ln.User.filter().df()
| id | handle | email | name | updated_at |
|---|---|---|---|---|
| DzTjkKse | testuser1 | testuser1@lamin.ai | Test User1 | 2023-08-28 17:19:58 |
| bKeW4T6E | testuser2 | testuser2@lamin.ai | Test User2 | 2023-08-28 17:20:01 |
transform
Transform(id='iatQ9SqYbMikfx', name='Cell Ranger', version='7.2.0', type='pipeline', created_by_id='bKeW4T6E')
ln.track(transform)
✅ saved: Transform(id='iatQ9SqYbMikfx', name='Cell Ranger', version='7.2.0', type='pipeline', updated_at=2023-08-28 17:20:02, created_by_id='bKeW4T6E')
✅ saved: Run(id='ZzyZYde5n10Hk8OJhbFR', run_at=2023-08-28 17:20:02, transform_id='iatQ9SqYbMikfx', created_by_id='bKeW4T6E')
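Note that if this pipeline had already been registered by a colleague, we’d look it up and reuse it instead of creating a duplicate record. A minimal sketch, assuming `one_or_none()` is available on query sets in this lamindb version:
# reuse the registered pipeline transform if present, register it only once otherwise (sketch)
transform = ln.Transform.filter(
    name="Cell Ranger", version="7.2.0", type="pipeline"
).one_or_none()
if transform is None:
    transform = ln.Transform(name="Cell Ranger", version="7.2.0", type="pipeline")
ln.track(transform)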
Now, let’s stage a few files from an instrument upload:
files = ln.File.filter(key__startswith="fastq/perturbseq").all()
filepaths = [file.stage() for file in files]
💡 adding file aPI3WSHUubOA0nYG3IQz as input for run ZzyZYde5n10Hk8OJhbFR, adding parent transform Jq2AGj7WRQWMjB
💡 adding file CzwbQbtJnTG7IK1IneYD as input for run ZzyZYde5n10Hk8OJhbFR, adding parent transform Jq2AGj7WRQWMjB
Assume we processed them and obtained 3 output files in a folder 'filtered_feature_bc_matrix':
output_files = ln.File.from_dir("./mydata/perturbseq/filtered_feature_bc_matrix/")
ln.save(output_files)
Show code cell output
✅ created 3 files from directory using storage /home/runner/work/lamin-usecases/lamin-usecases/docs/mydata and key = perturbseq/filtered_feature_bc_matrix/
Let’s look at the data lineage at this stage:
output_files[0].view_lineage()
And let’s keep processing the Cell Ranger outputs in the background:
Show code cell content
transform = ln.Transform(
    name="Preprocess Cell Ranger outputs", version="2.0", type="pipeline"
)
ln.track(transform)
[f.stage() for f in output_files]
filepath = ln.dev.datasets.schmidt22_perturbseq(basedir=ln.settings.storage)
file = ln.File(filepath, description="perturbseq counts")
file.save()
✅ saved: Transform(id='DKAE0HZhZ0W6K2', name='Preprocess Cell Ranger outputs', version='2.0', type='pipeline', updated_at=2023-08-28 17:20:02, created_by_id='bKeW4T6E')
✅ saved: Run(id='4ZHqzdkIHfsKGIzqXGUz', run_at=2023-08-28 17:20:02, transform_id='DKAE0HZhZ0W6K2', created_by_id='bKeW4T6E')
💡 adding file IyOam1v5OInsrqWkN6jF as input for run 4ZHqzdkIHfsKGIzqXGUz, adding parent transform iatQ9SqYbMikfx
💡 adding file eTVtPstSsgY1Oqce7gtP as input for run 4ZHqzdkIHfsKGIzqXGUz, adding parent transform iatQ9SqYbMikfx
💡 adding file TXNXeLEfkFJqP5zJqR9Q as input for run 4ZHqzdkIHfsKGIzqXGUz, adding parent transform iatQ9SqYbMikfx
💡 file in storage 'mydata' with key 'schmidt22_perturbseq.h5ad'
💡 data is AnnDataLike, consider using .from_anndata() to link var_names and obs.columns as features
Track app upload & analytics#
The hidden cell below simulates additional analytic steps, including:
- uploading phenotypic screen data
- scRNA-seq analysis
- analyses of the integrated datasets
Show code cell content
# app upload
ln.setup.login("testuser1")
transform = ln.Transform(name="Upload GWS CRISPRa result", type="app")
ln.track(transform)
filepath = ln.dev.datasets.schmidt22_crispra_gws_IFNG(ln.settings.storage)
file = ln.File(filepath, description="Raw data of schmidt22 crispra GWS")
file.save()
# upload and analyze the GWS data
ln.setup.login("testuser2")
transform = ln.Transform(name="GWS CRIPSRa analysis", type="notebook")
ln.track(transform)
file_wgs = ln.File.filter(key="schmidt22-crispra-gws-IFNG.csv").one()
df = file_wgs.load().set_index("id")
hits_df = df[df["pos|fdr"] < 0.01].copy()
file_hits = ln.File(hits_df, description="hits from schmidt22 crispra GWS")
file_hits.save()
✅ logged in with email testuser1@lamin.ai and id DzTjkKse
✅ saved: Transform(id='uMsYWgNsKABD5b', name='Upload GWS CRISPRa result', type='app', updated_at=2023-08-28 17:20:12, created_by_id='DzTjkKse')
✅ saved: Run(id='5Ygl1RdaJ0ZwzoXDDvrP', run_at=2023-08-28 17:20:12, transform_id='uMsYWgNsKABD5b', created_by_id='DzTjkKse')
💡 file in storage 'mydata' with key 'schmidt22-crispra-gws-IFNG.csv'
✅ logged in with email testuser2@lamin.ai and id bKeW4T6E
✅ saved: Transform(id='cSe7pklxw5metw', name='GWS CRIPSRa analysis', type='notebook', updated_at=2023-08-28 17:20:14, created_by_id='bKeW4T6E')
✅ saved: Run(id='32AgykUQ3uxk2EBkCFAT', run_at=2023-08-28 17:20:14, transform_id='cSe7pklxw5metw', created_by_id='bKeW4T6E')
💡 adding file SEQCQOdg7ZXqx2RKNEs2 as input for run 32AgykUQ3uxk2EBkCFAT, adding parent transform uMsYWgNsKABD5b
💡 file will be copied to default storage upon `save()` with key `None` ('.lamindb/vvihy7JoQRorzeOOpnJP.parquet')
💡 data is a dataframe, consider using .from_df() to link column names as features
✅ storing file 'vvihy7JoQRorzeOOpnJP' at '.lamindb/vvihy7JoQRorzeOOpnJP.parquet'
Let’s see what the data lineage of this looks like:
file = ln.File.filter(description="hits from schmidt22 crispra GWS").one()
file.view_lineage()
In the background, somebody integrated and analyzed the outputs of the app upload and the Cell Ranger pipeline:
Show code cell content
# Let us add analytics on top of the cell ranger pipeline and the phenotypic screening
transform = ln.Transform(
    name="Perform single cell analysis, integrating with CRISPRa screen",
    type="notebook",
)
ln.track(transform)
file_ps = ln.File.filter(description__icontains="perturbseq").one()
adata = file_ps.load()
screen_hits = file_hits.load()
import scanpy as sc
sc.tl.score_genes(adata, adata.var_names.intersection(screen_hits.index).tolist())
filesuffix = "_fig1_score-wgs-hits.png"
sc.pl.umap(adata, color="score", show=False, save=filesuffix)
filepath = f"figures/umap{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()
filesuffix = "fig2_score-wgs-hits-per-cluster.png"
sc.pl.matrixplot(
    adata, groupby="cluster_name", var_names=["score"], show=False, save=filesuffix
)
filepath = f"figures/matrixplot_{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()
✅ saved: Transform(id='kJ6mTQRR6oAPc1', name='Perform single cell analysis, integrating with CRISPRa screen', type='notebook', updated_at=2023-08-28 17:20:15, created_by_id='bKeW4T6E')
✅ saved: Run(id='T76pZaUrBGm2UAXuXc6o', run_at=2023-08-28 17:20:15, transform_id='kJ6mTQRR6oAPc1', created_by_id='bKeW4T6E')
💡 adding file 43cIB0047kXC8hNqf4yp as input for run T76pZaUrBGm2UAXuXc6o, adding parent transform DKAE0HZhZ0W6K2
💡 adding file vvihy7JoQRorzeOOpnJP as input for run T76pZaUrBGm2UAXuXc6o, adding parent transform cSe7pklxw5metw
WARNING: saving figure to file figures/umap_fig1_score-wgs-hits.png
💡 file will be copied to default storage upon `save()` with key 'figures/umap_fig1_score-wgs-hits.png'
✅ storing file 'WRCZohbQnMULyEIrXJi6' at 'figures/umap_fig1_score-wgs-hits.png'
WARNING: saving figure to file figures/matrixplot_fig2_score-wgs-hits-per-cluster.png
💡 file will be copied to default storage upon `save()` with key 'figures/matrixplot_fig2_score-wgs-hits-per-cluster.png'
✅ storing file 'lc5zTAmgRYPdsrgZjKrx' at 'figures/matrixplot_fig2_score-wgs-hits-per-cluster.png'
The outcome is a few figures stored as image files. Let’s query one of them and look at the data lineage:
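For example, one of the figures could be retrieved and traced like this (a short sketch mirroring the queries used below; the `figures/` key prefix comes from how the files were saved above):
# query a figure file by its storage key and draw its lineage graph (sketch)
file = ln.File.filter(key__startswith="figures/").first()
file.view_lineage()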
Track notebooks#
We’d now like to track the current Jupyter notebook to continue the work:
ln.track()
💡 notebook imports: ipython==8.14.0 lamindb==0.51.0 scanpy==1.9.4
✅ saved: Transform(id='1LCd8kco9lZUz8', name='Bird's eye view', short_name='birds-eye', version='0', type=notebook, updated_at=2023-08-28 17:20:17, created_by_id='bKeW4T6E')
✅ saved: Run(id='gt4uSQeZlBTmjfVAe7xJ', run_at=2023-08-28 17:20:17, transform_id='1LCd8kco9lZUz8', created_by_id='bKeW4T6E')
Visualize data lineage#
Let’s load one of the plots:
file = ln.File.filter(key__contains="figures/matrixplot").one()
from IPython.display import Image, display
file.stage()
display(Image(filename=file.path))
💡 adding file lc5zTAmgRYPdsrgZjKrx as input for run gt4uSQeZlBTmjfVAe7xJ, adding parent transform kJ6mTQRR6oAPc1
We see that the image file is tracked as an input of the current notebook. The input is highlighted; the notebook follows at the bottom:
file.view_lineage()
Alternatively, we can look purely at the sequence of transforms and ignore the files:
transform = ln.Transform.search("Bird's eye view", return_queryset=True).first()
transform.parents.df()
| id | name | short_name | version | initial_version_id | type | reference | updated_at | created_by_id |
|---|---|---|---|---|---|---|---|---|
| kJ6mTQRR6oAPc1 | Perform single cell analysis, integrating with... | None | None | None | notebook | None | 2023-08-28 17:20:17 | bKeW4T6E |
transform.view_parents()
Understand runs#
We tracked pipeline and notebook runs through `run_context`, which stores a `Transform` and a `Run` record as a global context. `File` objects are the inputs and outputs of runs.
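Concretely, `ln.track()` populates that global context, and any `File` created afterwards links against the active run. A minimal sketch; inspecting the context via `ln.dev.run_context` and its `transform`/`run` attributes is an assumption for this lamindb version:
# activate a transform as the global context (sketch)
ln.track(ln.Transform(name="My analysis", type="notebook"))
# the context now holds the active Transform and Run (attribute paths assumed)
ln.dev.run_context.transform
ln.dev.run_context.run
# a file created and saved now would record this run as its creating run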
What if I don’t want a global context?
Sometimes we don’t want to create a global run context but instead pass a run manually when creating a file:
run = ln.Run(transform=transform)
ln.File(filepath, run=run)
When does a file appear as a run input?
When accessing a file via `stage()`, `load()`, or `backed()`, two things happen:
1. The current run gets added to `file.input_of`.
2. The transform of that file gets added as a parent of the current transform.
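For example, loading the perturbseq counts file registered earlier while a run context is active registers it as an input of that run. A sketch; `input_of` is the related field named above:
# access a file inside an active run context (sketch)
file = ln.File.filter(description="perturbseq counts").one()
file.load()  # or file.stage() / file.backed()
file.input_of.df()  # the current run now appears among this file's run inputs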
You can switch off auto-tracking of run inputs by setting `ln.settings.track_run_inputs = False` (see: Can I disable tracking run inputs?).
You can also track run inputs on a case-by-case basis via `is_run_input=True`, e.g., here:
file.load(is_run_input=True)
Query by provenance#
We can query or search for the notebook that created the file:
transform = ln.Transform.search("GWS CRIPSRa analysis", return_queryset=True).first()
And then find all the files created by that notebook:
ln.File.filter(transform=transform).df()
| id | storage_id | key | suffix | accessor | description | version | initial_version_id | size | hash | hash_type | transform_id | run_id | updated_at | created_by_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| vvihy7JoQRorzeOOpnJP | EpWzEqJI | None | .parquet | DataFrame | hits from schmidt22 crispra GWS | None | None | 18368 | O2Owo0_QlM9JBS2zAZD4Lw | md5 | cSe7pklxw5metw | 32AgykUQ3uxk2EBkCFAT | 2023-08-28 17:20:14 | bKeW4T6E |
Which transform ingested a given file?
file = ln.File.filter().first()
file.transform
Transform(id='Jq2AGj7WRQWMjB', name='Chromium 10x upload', type='pipeline', updated_at=2023-08-28 17:20:00, created_by_id='DzTjkKse')
And which user?
file.created_by
User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 17:20:12)
Which transforms were created by a given user?
users = ln.User.lookup()
ln.Transform.filter(created_by=users.testuser2).df()
| id | name | short_name | version | initial_version_id | type | reference | updated_at | created_by_id |
|---|---|---|---|---|---|---|---|---|
| iatQ9SqYbMikfx | Cell Ranger | None | 7.2.0 | None | pipeline | None | 2023-08-28 17:20:02 | bKeW4T6E |
| DKAE0HZhZ0W6K2 | Preprocess Cell Ranger outputs | None | 2.0 | None | pipeline | None | 2023-08-28 17:20:10 | bKeW4T6E |
| cSe7pklxw5metw | GWS CRIPSRa analysis | None | None | None | notebook | None | 2023-08-28 17:20:14 | bKeW4T6E |
| kJ6mTQRR6oAPc1 | Perform single cell analysis, integrating with... | None | None | None | notebook | None | 2023-08-28 17:20:17 | bKeW4T6E |
| 1LCd8kco9lZUz8 | Bird's eye view | birds-eye | 0 | None | notebook | None | 2023-08-28 17:20:17 | bKeW4T6E |
Which notebooks were created by a given user?
ln.Transform.filter(created_by=users.testuser2, type="notebook").df()
| id | name | short_name | version | initial_version_id | type | reference | updated_at | created_by_id |
|---|---|---|---|---|---|---|---|---|
| cSe7pklxw5metw | GWS CRIPSRa analysis | None | None | None | notebook | None | 2023-08-28 17:20:14 | bKeW4T6E |
| kJ6mTQRR6oAPc1 | Perform single cell analysis, integrating with... | None | None | None | notebook | None | 2023-08-28 17:20:17 | bKeW4T6E |
| 1LCd8kco9lZUz8 | Bird's eye view | birds-eye | 0 | None | notebook | None | 2023-08-28 17:20:17 | bKeW4T6E |
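Since filters accept Django-style `__` lookups across related fields (as in `key__startswith` above), provenance queries can also be chained. A sketch, assuming traversal through the `transform` relation works as expected:
# all files produced by notebook transforms created by testuser2 (sketch)
ln.File.filter(transform__type="notebook", transform__created_by=users.testuser2).df()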
We can also view all recent additions to the entire database:
ln.view()
Show code cell output
File
| id | storage_id | key | suffix | accessor | description | version | initial_version_id | size | hash | hash_type | transform_id | run_id | updated_at | created_by_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| lc5zTAmgRYPdsrgZjKrx | EpWzEqJI | figures/matrixplot_fig2_score-wgs-hits-per-clu... | .png | None | None | None | None | 28814 | JYIPcat0YWYVCX3RVd3mww | md5 | kJ6mTQRR6oAPc1 | T76pZaUrBGm2UAXuXc6o | 2023-08-28 17:20:17 | bKeW4T6E |
| WRCZohbQnMULyEIrXJi6 | EpWzEqJI | figures/umap_fig1_score-wgs-hits.png | .png | None | None | None | None | 118999 | laQjVk4gh70YFzaUyzbUNg | md5 | kJ6mTQRR6oAPc1 | T76pZaUrBGm2UAXuXc6o | 2023-08-28 17:20:16 | bKeW4T6E |
| vvihy7JoQRorzeOOpnJP | EpWzEqJI | None | .parquet | DataFrame | hits from schmidt22 crispra GWS | None | None | 18368 | O2Owo0_QlM9JBS2zAZD4Lw | md5 | cSe7pklxw5metw | 32AgykUQ3uxk2EBkCFAT | 2023-08-28 17:20:14 | bKeW4T6E |
| SEQCQOdg7ZXqx2RKNEs2 | EpWzEqJI | schmidt22-crispra-gws-IFNG.csv | .csv | None | Raw data of schmidt22 crispra GWS | None | None | 1729685 | cUSH0oQ2w-WccO8_ViKRAQ | md5 | uMsYWgNsKABD5b | 5Ygl1RdaJ0ZwzoXDDvrP | 2023-08-28 17:20:13 | DzTjkKse |
| 43cIB0047kXC8hNqf4yp | EpWzEqJI | schmidt22_perturbseq.h5ad | .h5ad | AnnData | perturbseq counts | None | None | 20659936 | la7EvqEUMDlug9-rpw-udA | md5 | DKAE0HZhZ0W6K2 | 4ZHqzdkIHfsKGIzqXGUz | 2023-08-28 17:20:10 | bKeW4T6E |
| eTVtPstSsgY1Oqce7gtP | EpWzEqJI | perturbseq/filtered_feature_bc_matrix/features... | .tsv.gz | None | None | None | None | 6 | klcONrMGGAzxXC_bDBmy7g | md5 | iatQ9SqYbMikfx | ZzyZYde5n10Hk8OJhbFR | 2023-08-28 17:20:02 | bKeW4T6E |
| TXNXeLEfkFJqP5zJqR9Q | EpWzEqJI | perturbseq/filtered_feature_bc_matrix/matrix.m... | .mtx.gz | None | None | None | None | 6 | ZcwwkrFjOEK1Z4u7kXnZFQ | md5 | iatQ9SqYbMikfx | ZzyZYde5n10Hk8OJhbFR | 2023-08-28 17:20:02 | bKeW4T6E |
Run
| id | transform_id | run_at | created_by_id | reference | reference_type |
|---|---|---|---|---|---|
| 0WsKZT4nD2dHrK9AhHES | Jq2AGj7WRQWMjB | 2023-08-28 17:20:00 | DzTjkKse | None | None |
| ZzyZYde5n10Hk8OJhbFR | iatQ9SqYbMikfx | 2023-08-28 17:20:02 | bKeW4T6E | None | None |
| 4ZHqzdkIHfsKGIzqXGUz | DKAE0HZhZ0W6K2 | 2023-08-28 17:20:02 | bKeW4T6E | None | None |
| 5Ygl1RdaJ0ZwzoXDDvrP | uMsYWgNsKABD5b | 2023-08-28 17:20:12 | DzTjkKse | None | None |
| 32AgykUQ3uxk2EBkCFAT | cSe7pklxw5metw | 2023-08-28 17:20:14 | bKeW4T6E | None | None |
| T76pZaUrBGm2UAXuXc6o | kJ6mTQRR6oAPc1 | 2023-08-28 17:20:15 | bKeW4T6E | None | None |
| gt4uSQeZlBTmjfVAe7xJ | 1LCd8kco9lZUz8 | 2023-08-28 17:20:17 | bKeW4T6E | None | None |
Storage
| id | root | type | region | updated_at | created_by_id |
|---|---|---|---|---|---|
| EpWzEqJI | /home/runner/work/lamin-usecases/lamin-usecase... | local | None | 2023-08-28 17:19:58 | DzTjkKse |
Transform
| id | name | short_name | version | initial_version_id | type | reference | updated_at | created_by_id |
|---|---|---|---|---|---|---|---|---|
| 1LCd8kco9lZUz8 | Bird's eye view | birds-eye | 0 | None | notebook | None | 2023-08-28 17:20:17 | bKeW4T6E |
| kJ6mTQRR6oAPc1 | Perform single cell analysis, integrating with... | None | None | None | notebook | None | 2023-08-28 17:20:17 | bKeW4T6E |
| cSe7pklxw5metw | GWS CRIPSRa analysis | None | None | None | notebook | None | 2023-08-28 17:20:14 | bKeW4T6E |
| uMsYWgNsKABD5b | Upload GWS CRISPRa result | None | None | None | app | None | 2023-08-28 17:20:12 | DzTjkKse |
| DKAE0HZhZ0W6K2 | Preprocess Cell Ranger outputs | None | 2.0 | None | pipeline | None | 2023-08-28 17:20:10 | bKeW4T6E |
| iatQ9SqYbMikfx | Cell Ranger | None | 7.2.0 | None | pipeline | None | 2023-08-28 17:20:02 | bKeW4T6E |
| Jq2AGj7WRQWMjB | Chromium 10x upload | None | None | None | pipeline | None | 2023-08-28 17:20:00 | DzTjkKse |
User
| id | handle | email | name | updated_at |
|---|---|---|---|---|
| bKeW4T6E | testuser2 | testuser2@lamin.ai | Test User2 | 2023-08-28 17:20:14 |
| DzTjkKse | testuser1 | testuser1@lamin.ai | Test User1 | 2023-08-28 17:20:12 |
Show code cell content
!lamin login testuser1
!lamin delete --force mydata
!rm -r ./mydata
✅ logged in with email testuser1@lamin.ai and id DzTjkKse
💡 deleting instance testuser1/mydata
✅ deleted instance settings file: /home/runner/.lamin/instance--testuser1--mydata.env
✅ instance cache deleted
✅ deleted '.lndb' sqlite file
❗ consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/mydata