Bird’s eye view#

Background#

Data lineage records a dataset’s journey: its origins, transformations, and interactions. It helps trace biological insights, verify experimental outcomes, meet regulatory standards, and make research more robust. Tracking lineage is comparatively easy when data flows through deterministic pipelines; it becomes hard when data is shaped by interactive, human-driven analyses.

Here, we’ll backtrace file transformations through notebooks, pipelines, and app uploads in a research project based on Schmidt22, which conducted genome-wide CRISPR activation and interference screens in primary human T cells to identify gene networks controlling IL-2 and IFN-γ production.

Setup#

We need an instance:

!lamin init --storage ./mydata

Import lamindb:

import lamindb as ln

✅ loaded instance: testuser1/mydata (lamindb 0.51.0)

We simulate the raw data processing of Schmidt22 with toy data in a real-world setting with multiple collaborators (here, testuser1 and testuser2):

assert ln.setup.settings.user.handle == "testuser1"

bfx_run_output = ln.dev.datasets.dir_scrnaseq_cellranger(
    "perturbseq", basedir=ln.settings.storage, output_only=False
)
ln.track(ln.Transform(name="Chromium 10x upload", type="pipeline"))
ln.File(bfx_run_output.parent / "fastq/perturbseq_R1_001.fastq.gz").save()
ln.File(bfx_run_output.parent / "fastq/perturbseq_R2_001.fastq.gz").save()

Track a bioinformatics pipeline#

When working with a pipeline, we’ll register it before running it.

This only happens once and could be done by anyone on your team.

ln.setup.login("testuser2")

✅ logged in with email testuser2@lamin.ai and id bKeW4T6E

❗ record with similar name exist! did you mean to load it?

	id	__ratio__
name
Test User1	DzTjkKse	90.0

✅ saved: User(id='bKeW4T6E', handle='testuser2', email='testuser2@lamin.ai', name='Test User2', updated_at=2023-08-28 17:20:01)

transform = ln.Transform(name="Cell Ranger", version="7.2.0", type="pipeline")

ln.User.filter().df()

	handle	email	name	updated_at
id
DzTjkKse	testuser1	testuser1@lamin.ai	Test User1	2023-08-28 17:19:58
bKeW4T6E	testuser2	testuser2@lamin.ai	Test User2	2023-08-28 17:20:01

transform

Transform(id='iatQ9SqYbMikfx', name='Cell Ranger', version='7.2.0', type='pipeline', created_by_id='bKeW4T6E')

ln.track(transform)

✅ saved: Transform(id='iatQ9SqYbMikfx', name='Cell Ranger', version='7.2.0', type='pipeline', updated_at=2023-08-28 17:20:02, created_by_id='bKeW4T6E')

✅ saved: Run(id='ZzyZYde5n10Hk8OJhbFR', run_at=2023-08-28 17:20:02, transform_id='iatQ9SqYbMikfx', created_by_id='bKeW4T6E')

Now, let’s stage a few files from an instrument upload:

files = ln.File.filter(key__startswith="fastq/perturbseq").all()
filepaths = [file.stage() for file in files]

💡 adding file aPI3WSHUubOA0nYG3IQz as input for run ZzyZYde5n10Hk8OJhbFR, adding parent transform Jq2AGj7WRQWMjB

💡 adding file CzwbQbtJnTG7IK1IneYD as input for run ZzyZYde5n10Hk8OJhbFR, adding parent transform Jq2AGj7WRQWMjB
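The log lines above suggest that staging a file while a run is tracked does two things: it registers the file as an input of that run, and it links the transform that created the file as a parent of the current transform. Here is a minimal, library-free sketch of that bookkeeping. The names `ToyTransform`, `ToyRun`, `ToyFile`, and `stage` are hypothetical and only illustrate the idea; this is not lamindb’s implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTransform:
    name: str
    parents: list = field(default_factory=list)  # upstream transforms


@dataclass
class ToyRun:
    transform: ToyTransform
    inputs: list = field(default_factory=list)  # files consumed by this run


@dataclass
class ToyFile:
    key: str
    transform: ToyTransform  # the transform that created this file


def stage(file: ToyFile, current_run: ToyRun) -> str:
    """Record `file` as a run input and link its creating transform as a parent."""
    if file not in current_run.inputs:
        current_run.inputs.append(file)
    if file.transform not in current_run.transform.parents:
        current_run.transform.parents.append(file.transform)
    return file.key  # a real system would return a local path instead


# Toy data mirroring the names used in this notebook
upload = ToyTransform("Chromium 10x upload")
cell_ranger_run = ToyRun(ToyTransform("Cell Ranger"))
fastqs = [
    ToyFile("fastq/perturbseq_R1_001.fastq.gz", upload),
    ToyFile("fastq/perturbseq_R2_001.fastq.gz", upload),
]
staged = [stage(f, cell_ranger_run) for f in fastqs]
```

Both files come from the same upload, so the parent transform is linked only once even though two inputs are recorded.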

Assume we processed them and obtained 3 output files in a folder 'filtered_feature_bc_matrix':

output_files = ln.File.from_dir("./mydata/perturbseq/filtered_feature_bc_matrix/")
ln.save(output_files)

Let’s look at the data lineage at this stage:

output_files[0].view_lineage()

https://d33wubrfki0l68.cloudfront.net/26a4fd237b485c821d53e9417a14bd3635485118/de4a6/_images/2fa55ae8b39b72db09c3e1b281f6c867bb4f7d22bc29c0b6cf9ea4df632a4194.svg
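Conceptually, a lineage view like this one is a backward walk over a graph in which files and transforms alternate. The sketch below assumes a plain dictionary that maps each node to its direct upstream nodes; the dictionary layout, the `backtrace` function, and the label `"Cell Ranger output file"` are hypothetical and chosen for illustration only.

```python
from collections import deque

# Each node maps to its direct upstream nodes (toy data, not read from the database).
upstream = {
    "Cell Ranger output file": ["Cell Ranger"],
    "Cell Ranger": [
        "fastq/perturbseq_R1_001.fastq.gz",
        "fastq/perturbseq_R2_001.fastq.gz",
    ],
    "fastq/perturbseq_R1_001.fastq.gz": ["Chromium 10x upload"],
    "fastq/perturbseq_R2_001.fastq.gz": ["Chromium 10x upload"],
    "Chromium 10x upload": [],
}


def backtrace(node, graph):
    """Return all upstream nodes of `node` in breadth-first order, each once."""
    seen = {node}
    order = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in graph.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


lineage = backtrace("Cell Ranger output file", upstream)
```

The `seen` set ensures that a shared ancestor, here the upload that produced both FASTQ files, appears only once in the result.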

And let’s keep running the Cell Ranger pipeline in the background.

Track app upload & analytics#

The hidden cell below simulates additional analytic steps including:

uploading phenotypic screen data
scRNA-seq analysis
analyses of the integrated datasets

Let’s see what the data lineage of this looks like:

file = ln.File.filter(description="hits from schmidt22 crispra GWS").one()
file.view_lineage()

https://d33wubrfki0l68.cloudfront.net/be7075cfd14022febb82950c4efab0cde76df4df/adfec/_images/c6a4d51d97d57b691b34606f1f2135289e4dc32944669c98f62244083e72417f.svg

In the background, somebody integrated and analyzed the outputs of the app upload and the Cell Ranger pipeline:


# Let us add analytics on top of the cell ranger pipeline and the phenotypic screening
transform = ln.Transform(
    name="Perform single cell analysis, integrating with CRISPRa screen",
    type="notebook",
)
ln.track(transform)

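# Load the perturb-seq counts and the CRISPRa screen hits; both are tracked as run inputs
file_ps = ln.File.filter(description__icontains="perturbseq").one()
adata = file_ps.load()
file_hits = ln.File.filter(description="hits from schmidt22 crispra GWS").one()
screen_hits = file_hits.load()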

import scanpy as sc

sc.tl.score_genes(adata, adata.var_names.intersection(screen_hits.index).tolist())
filesuffix = "_fig1_score-wgs-hits.png"
sc.pl.umap(adata, color="score", show=False, save=filesuffix)
filepath = f"figures/umap{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()
filesuffix = "fig2_score-wgs-hits-per-cluster.png"
sc.pl.matrixplot(
    adata, groupby="cluster_name", var_names=["score"], show=False, save=filesuffix
)
filepath = f"figures/matrixplot_{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()

✅ saved: Transform(id='kJ6mTQRR6oAPc1', name='Perform single cell analysis, integrating with CRISPRa screen', type='notebook', updated_at=2023-08-28 17:20:15, created_by_id='bKeW4T6E')

✅ saved: Run(id='T76pZaUrBGm2UAXuXc6o', run_at=2023-08-28 17:20:15, transform_id='kJ6mTQRR6oAPc1', created_by_id='bKeW4T6E')

💡 adding file 43cIB0047kXC8hNqf4yp as input for run T76pZaUrBGm2UAXuXc6o, adding parent transform DKAE0HZhZ0W6K2

💡 adding file vvihy7JoQRorzeOOpnJP as input for run T76pZaUrBGm2UAXuXc6o, adding parent transform cSe7pklxw5metw

WARNING: saving figure to file figures/umap_fig1_score-wgs-hits.png

💡 file will be copied to default storage upon `save()` with key 'figures/umap_fig1_score-wgs-hits.png'

✅ storing file 'WRCZohbQnMULyEIrXJi6' at 'figures/umap_fig1_score-wgs-hits.png'

WARNING: saving figure to file figures/matrixplot_fig2_score-wgs-hits-per-cluster.png

💡 file will be copied to default storage upon `save()` with key 'figures/matrixplot_fig2_score-wgs-hits-per-cluster.png'

✅ storing file 'lc5zTAmgRYPdsrgZjKrx' at 'figures/matrixplot_fig2_score-wgs-hits-per-cluster.png'

The result is a few figures stored as image files. We’ll query one of them and inspect its data lineage once the current notebook is tracked.

Track notebooks#

We’d now like to track the current Jupyter notebook to continue the work:

ln.track()

💡 notebook imports: ipython==8.14.0 lamindb==0.51.0 scanpy==1.9.4

✅ saved: Transform(id='1LCd8kco9lZUz8', name='Bird's eye view', short_name='birds-eye', version='0', type=notebook, updated_at=2023-08-28 17:20:17, created_by_id='bKeW4T6E')

✅ saved: Run(id='gt4uSQeZlBTmjfVAe7xJ', run_at=2023-08-28 17:20:17, transform_id='1LCd8kco9lZUz8', created_by_id='bKeW4T6E')

Visualize data lineage#

Let’s load one of the plots:

file = ln.File.filter(key__contains="figures/matrixplot").one()

from IPython.display import Image, display

file.stage()
display(Image(filename=file.path))

💡 adding file lc5zTAmgRYPdsrgZjKrx as input for run gt4uSQeZlBTmjfVAe7xJ, adding parent transform kJ6mTQRR6oAPc1

https://d33wubrfki0l68.cloudfront.net/dcbd1e67232f2ede82171ba02237575cc586c2b7/1ceff/_images/45891ad4693b5bfeb52a48b2ab2e5d0a82220b9482360ee1a8757fad581fffdc.png

We see that the image file is tracked as an input of the current notebook. The input is highlighted, and the notebook follows at the bottom:

file.view_lineage()

https://d33wubrfki0l68.cloudfront.net/9093f6fa0c187ce53bacf02cd24b151c304218a8/e94bd/_images/6b81a816efdaa601d3dea731afe3058dc65caba3933f6282e85d39cc4e0408a2.svg

Alternatively, we can look only at the sequence of transforms and ignore the files:

transform = ln.Transform.search("Bird's eye view", return_queryset=True).first()

transform.parents.df()

	name	short_name	version	initial_version_id	type	reference	updated_at	created_by_id
id
kJ6mTQRR6oAPc1	Perform single cell analysis, integrating with...	None	None	None	notebook	None	2023-08-28 17:20:17	bKeW4T6E

transform.view_parents()

https://d33wubrfki0l68.cloudfront.net/a37627386726f7a4970cd3f51a0c7640895c3800/ff3dd/_images/067278d9041e5e75068a3b1b8510ceb6349dfcd1a4d7c954c1de02663cb6bcd6.svg
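A transform-only view like this can be thought of as the file-level graph with the files collapsed away: whenever a run of transform B consumes a file created by transform A, A becomes a parent of B. The sketch below replays that idea with file ids and transform names taken from the logs and tables on this page. The function `transform_parents` and the two input structures are hypothetical, not part of lamindb.

```python
from collections import defaultdict

ANALYSIS = "Perform single cell analysis, integrating with CRISPRa screen"

# file id -> name of the transform that created it
created_by = {
    "43cIB0047kXC8hNqf4yp": "Preprocess Cell Ranger outputs",
    "vvihy7JoQRorzeOOpnJP": "GWS CRIPSRa analysis",
    "lc5zTAmgRYPdsrgZjKrx": ANALYSIS,
}

# (consuming transform, input file id), as reported by the "adding file ... as input" logs
run_inputs = [
    (ANALYSIS, "43cIB0047kXC8hNqf4yp"),
    (ANALYSIS, "vvihy7JoQRorzeOOpnJP"),
    ("Bird's eye view", "lc5zTAmgRYPdsrgZjKrx"),
]


def transform_parents(created_by, run_inputs):
    """Collapse file nodes: map each consuming transform to the creators of its inputs."""
    parents = defaultdict(set)
    for consumer, file_id in run_inputs:
        parents[consumer].add(created_by[file_id])
    return dict(parents)


parents = transform_parents(created_by, run_inputs)
```

With this toy data, `parents["Bird's eye view"]` contains only the single-cell analysis notebook, which matches the one-row result of `transform.parents.df()` above.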

Understand runs#

We tracked pipeline and notebook runs through run_context, which stores a Transform and a Run record as a global context.

File objects are the inputs and outputs of runs.
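To make the idea of a global context concrete, here is a small sketch of how a tracked run could collect inputs and outputs. It assumes a single module-level holder for the current run. `ToyRunContext`, `track`, `record_input`, and `record_output` are hypothetical names used for illustration and should not be read as lamindb’s internals.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ToyRun:
    transform_name: str
    run_at: datetime
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


class ToyRunContext:
    """Holds the currently tracked run (stand-in for a global context)."""

    run: Optional[ToyRun] = None


def track(transform_name: str) -> ToyRun:
    """Start a new run for the given transform and make it the current one."""
    ToyRunContext.run = ToyRun(transform_name, datetime.now(timezone.utc))
    return ToyRunContext.run


def _current_run() -> ToyRun:
    if ToyRunContext.run is None:
        raise RuntimeError("call track() before loading or saving files")
    return ToyRunContext.run


def record_input(key: str) -> None:
    """Loading or staging a file marks it as an input of the current run."""
    _current_run().inputs.append(key)


def record_output(key: str) -> None:
    """Saving a file marks it as an output of the current run."""
    _current_run().outputs.append(key)


run = track("Perform single cell analysis, integrating with CRISPRa screen")
record_input("schmidt22_perturbseq.h5ad")
record_output("figures/umap_fig1_score-wgs-hits.png")
```

Because the context is global, code that loads or saves files does not need to pass the run around explicitly; the trade-off is that only one run can be current at a time.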

Query by provenance#

We can query or search for the notebook that created the file:

transform = ln.Transform.search("GWS CRIPSRa analysis", return_queryset=True).first()

And then find all the files created by that notebook:

ln.File.filter(transform=transform).df()

	storage_id	key	suffix	accessor	description	version	initial_version_id	size	hash	hash_type	transform_id	run_id	updated_at	created_by_id
id
vvihy7JoQRorzeOOpnJP	EpWzEqJI	None	.parquet	DataFrame	hits from schmidt22 crispra GWS	None	None	18368	O2Owo0_QlM9JBS2zAZD4Lw	md5	cSe7pklxw5metw	32AgykUQ3uxk2EBkCFAT	2023-08-28 17:20:14	bKeW4T6E

Which transform ingested a given file?

file = ln.File.filter().first()
file.transform

Transform(id='Jq2AGj7WRQWMjB', name='Chromium 10x upload', type='pipeline', updated_at=2023-08-28 17:20:00, created_by_id='DzTjkKse')

And which user?

file.created_by

User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 17:20:12)

Which transforms were created by a given user?

users = ln.User.lookup()

ln.Transform.filter(created_by=users.testuser2).df()

	name	short_name	version	initial_version_id	type	reference	updated_at	created_by_id
id
iatQ9SqYbMikfx	Cell Ranger	None	7.2.0	None	pipeline	None	2023-08-28 17:20:02	bKeW4T6E
DKAE0HZhZ0W6K2	Preprocess Cell Ranger outputs	None	2.0	None	pipeline	None	2023-08-28 17:20:10	bKeW4T6E
cSe7pklxw5metw	GWS CRIPSRa analysis	None	None	None	notebook	None	2023-08-28 17:20:14	bKeW4T6E
kJ6mTQRR6oAPc1	Perform single cell analysis, integrating with...	None	None	None	notebook	None	2023-08-28 17:20:17	bKeW4T6E
1LCd8kco9lZUz8	Bird's eye view	birds-eye	0	None	notebook	None	2023-08-28 17:20:17	bKeW4T6E

Which notebooks were created by a given user?

ln.Transform.filter(created_by=users.testuser2, type="notebook").df()

	name	short_name	version	initial_version_id	type	reference	updated_at	created_by_id
id
cSe7pklxw5metw	GWS CRIPSRa analysis	None	None	None	notebook	None	2023-08-28 17:20:14	bKeW4T6E
kJ6mTQRR6oAPc1	Perform single cell analysis, integrating with...	None	None	None	notebook	None	2023-08-28 17:20:17	bKeW4T6E
1LCd8kco9lZUz8	Bird's eye view	birds-eye	0	None	notebook	None	2023-08-28 17:20:17	bKeW4T6E
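The queries on this page mix exact matches (such as `type="notebook"`) with double-underscore lookups (such as `key__startswith`, `key__contains`, and `description__icontains`), a convention familiar from Django-style querysets. The sketch below shows how such keyword arguments could be interpreted over plain dictionaries. `matches` and `toy_filter` are hypothetical helpers, the records are toy data loosely modeled on the tables above, and only the four lookup kinds used on this page are handled.

```python
def matches(record: dict, **conditions) -> bool:
    """Check one record against keyword conditions like key__startswith='fastq/'."""
    for expr, expected in conditions.items():
        field_name, _, lookup = expr.partition("__")
        value = record.get(field_name)
        if lookup == "":
            ok = value == expected  # exact match
        elif value is None:
            ok = False  # string lookups never match a missing value
        elif lookup == "startswith":
            ok = value.startswith(expected)
        elif lookup == "contains":
            ok = expected in value
        elif lookup == "icontains":
            ok = expected.lower() in value.lower()  # case-insensitive
        else:
            raise ValueError(f"unsupported lookup: {lookup}")
        if not ok:
            return False
    return True


def toy_filter(records, **conditions):
    """Return the records that satisfy all conditions."""
    return [r for r in records if matches(r, **conditions)]


# Toy records; descriptions set to None are an assumption of this example
records = [
    {"key": "fastq/perturbseq_R1_001.fastq.gz", "description": None},
    {"key": "fastq/perturbseq_R2_001.fastq.gz", "description": None},
    {"key": "schmidt22_perturbseq.h5ad", "description": "perturbseq counts"},
    {"key": None, "description": "hits from schmidt22 crispra GWS"},
    {"key": "figures/matrixplot_fig2_score-wgs-hits-per-cluster.png", "description": None},
]

fastq_files = toy_filter(records, key__startswith="fastq/perturbseq")
plots = toy_filter(records, key__contains="figures/matrixplot")
counts = toy_filter(records, description__icontains="PerturbSeq")
```

Multiple conditions are combined with a logical AND, which is how the combined `created_by=..., type="notebook"` query above narrows the result.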

We can also view all recent additions to the entire database:

ln.view()


File

	storage_id	key	suffix	accessor	description	version	initial_version_id	size	hash	hash_type	transform_id	run_id	updated_at	created_by_id
id
lc5zTAmgRYPdsrgZjKrx	EpWzEqJI	figures/matrixplot_fig2_score-wgs-hits-per-clu...	.png	None	None	None	None	28814	JYIPcat0YWYVCX3RVd3mww	md5	kJ6mTQRR6oAPc1	T76pZaUrBGm2UAXuXc6o	2023-08-28 17:20:17	bKeW4T6E
WRCZohbQnMULyEIrXJi6	EpWzEqJI	figures/umap_fig1_score-wgs-hits.png	.png	None	None	None	None	118999	laQjVk4gh70YFzaUyzbUNg	md5	kJ6mTQRR6oAPc1	T76pZaUrBGm2UAXuXc6o	2023-08-28 17:20:16	bKeW4T6E
vvihy7JoQRorzeOOpnJP	EpWzEqJI	None	.parquet	DataFrame	hits from schmidt22 crispra GWS	None	None	18368	O2Owo0_QlM9JBS2zAZD4Lw	md5	cSe7pklxw5metw	32AgykUQ3uxk2EBkCFAT	2023-08-28 17:20:14	bKeW4T6E
SEQCQOdg7ZXqx2RKNEs2	EpWzEqJI	schmidt22-crispra-gws-IFNG.csv	.csv	None	Raw data of schmidt22 crispra GWS	None	None	1729685	cUSH0oQ2w-WccO8_ViKRAQ	md5	uMsYWgNsKABD5b	5Ygl1RdaJ0ZwzoXDDvrP	2023-08-28 17:20:13	DzTjkKse
43cIB0047kXC8hNqf4yp	EpWzEqJI	schmidt22_perturbseq.h5ad	.h5ad	AnnData	perturbseq counts	None	None	20659936	la7EvqEUMDlug9-rpw-udA	md5	DKAE0HZhZ0W6K2	4ZHqzdkIHfsKGIzqXGUz	2023-08-28 17:20:10	bKeW4T6E
eTVtPstSsgY1Oqce7gtP	EpWzEqJI	perturbseq/filtered_feature_bc_matrix/features...	.tsv.gz	None	None	None	None	6	klcONrMGGAzxXC_bDBmy7g	md5	iatQ9SqYbMikfx	ZzyZYde5n10Hk8OJhbFR	2023-08-28 17:20:02	bKeW4T6E
TXNXeLEfkFJqP5zJqR9Q	EpWzEqJI	perturbseq/filtered_feature_bc_matrix/matrix.m...	.mtx.gz	None	None	None	None	6	ZcwwkrFjOEK1Z4u7kXnZFQ	md5	iatQ9SqYbMikfx	ZzyZYde5n10Hk8OJhbFR	2023-08-28 17:20:02	bKeW4T6E

Run

	transform_id	run_at	created_by_id	reference	reference_type
id
0WsKZT4nD2dHrK9AhHES	Jq2AGj7WRQWMjB	2023-08-28 17:20:00	DzTjkKse	None	None
ZzyZYde5n10Hk8OJhbFR	iatQ9SqYbMikfx	2023-08-28 17:20:02	bKeW4T6E	None	None
4ZHqzdkIHfsKGIzqXGUz	DKAE0HZhZ0W6K2	2023-08-28 17:20:02	bKeW4T6E	None	None
5Ygl1RdaJ0ZwzoXDDvrP	uMsYWgNsKABD5b	2023-08-28 17:20:12	DzTjkKse	None	None
32AgykUQ3uxk2EBkCFAT	cSe7pklxw5metw	2023-08-28 17:20:14	bKeW4T6E	None	None
T76pZaUrBGm2UAXuXc6o	kJ6mTQRR6oAPc1	2023-08-28 17:20:15	bKeW4T6E	None	None
gt4uSQeZlBTmjfVAe7xJ	1LCd8kco9lZUz8	2023-08-28 17:20:17	bKeW4T6E	None	None

Storage

	root	type	region	updated_at	created_by_id
id
EpWzEqJI	/home/runner/work/lamin-usecases/lamin-usecase...	local	None	2023-08-28 17:19:58	DzTjkKse

Transform

	name	short_name	version	initial_version_id	type	reference	updated_at	created_by_id
id
1LCd8kco9lZUz8	Bird's eye view	birds-eye	0	None	notebook	None	2023-08-28 17:20:17	bKeW4T6E
kJ6mTQRR6oAPc1	Perform single cell analysis, integrating with...	None	None	None	notebook	None	2023-08-28 17:20:17	bKeW4T6E
cSe7pklxw5metw	GWS CRIPSRa analysis	None	None	None	notebook	None	2023-08-28 17:20:14	bKeW4T6E
uMsYWgNsKABD5b	Upload GWS CRISPRa result	None	None	None	app	None	2023-08-28 17:20:12	DzTjkKse
DKAE0HZhZ0W6K2	Preprocess Cell Ranger outputs	None	2.0	None	pipeline	None	2023-08-28 17:20:10	bKeW4T6E
iatQ9SqYbMikfx	Cell Ranger	None	7.2.0	None	pipeline	None	2023-08-28 17:20:02	bKeW4T6E
Jq2AGj7WRQWMjB	Chromium 10x upload	None	None	None	pipeline	None	2023-08-28 17:20:00	DzTjkKse

User

	handle	email	name	updated_at
id
bKeW4T6E	testuser2	testuser2@lamin.ai	Test User2	2023-08-28 17:20:14
DzTjkKse	testuser1	testuser1@lamin.ai	Test User1	2023-08-28 17:20:12
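A note on the `hash` and `hash_type` columns in the File tables above: the values are 22 characters long and use `-` and `_`, which is consistent with an MD5 digest encoded as URL-safe base64 with the padding removed. That encoding is an inference from the tables, not something this page confirms. Under that assumption, a content hash of the same shape can be computed with the standard library; `toy_content_hash` is a hypothetical helper name.

```python
import base64
import hashlib


def toy_content_hash(data: bytes) -> str:
    """MD5 digest as URL-safe base64 without '=' padding (assumed encoding)."""
    digest = hashlib.md5(data).digest()  # 16 raw bytes
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")


hash_a = toy_content_hash(b"toy file content")
hash_b = toy_content_hash(b"toy file content")
hash_c = toy_content_hash(b"different content")
```

Identical content yields identical hashes, which is what makes such a value useful for detecting that the same file has been registered twice.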