Bird’s eye view#

Background#

Data lineage tracks data’s journey, detailing its origins, transformations, and interactions to trace biological insights, verify experimental outcomes, meet regulatory standards, and increase the robustness of research. While tracking data lineage is easier when it is governed by deterministic pipelines, it becomes hard when it’s governed by interactive human-driven analyses.

Here, we’ll backtrace file transformations through notebooks, pipelines & app uploads in a research project based on Schmidt22, which conducted genome-wide CRISPR activation and interference screens in primary human T cells to identify gene networks controlling IL-2 and IFN-γ production.

Setup#

We need an instance:

!lamin init --storage ./mydata

Import lamindb:

import lamindb as ln

✅ loaded instance: testuser1/mydata (lamindb 0.51.0)

We simulate the raw data processing of Schmidt22 with toy data in a real-world setting with multiple collaborators (here testuser1 and testuser2):

assert ln.setup.settings.user.handle == "testuser1"

bfx_run_output = ln.dev.datasets.dir_scrnaseq_cellranger(
    "perturbseq", basedir=ln.settings.storage, output_only=False
)
ln.track(ln.Transform(name="Chromium 10x upload", type="pipeline"))
ln.File(bfx_run_output.parent / "fastq/perturbseq_R1_001.fastq.gz").save()
ln.File(bfx_run_output.parent / "fastq/perturbseq_R2_001.fastq.gz").save()

Track a bioinformatics pipeline#

When working with a pipeline, we’ll register it before running it.

This only happens once and could be done by anyone on your team.

ln.setup.login("testuser2")

✅ logged in with email testuser2@lamin.ai and id bKeW4T6E

❗ record with similar name exist! did you mean to load it?

	id	__ratio__
name
Test User1	DzTjkKse	90.0

✅ saved: User(id='bKeW4T6E', handle='testuser2', email='testuser2@lamin.ai', name='Test User2', updated_at=2023-08-28 14:49:15)

transform = ln.Transform(name="Cell Ranger", version="7.2.0", type="pipeline")

ln.User.filter().df()

	handle	email	name	updated_at
id
DzTjkKse	testuser1	testuser1@lamin.ai	Test User1	2023-08-28 14:49:12
bKeW4T6E	testuser2	testuser2@lamin.ai	Test User2	2023-08-28 14:49:15

transform

Transform(id='pWJEqHVhng2Qev', name='Cell Ranger', version='7.2.0', type='pipeline', created_by_id='bKeW4T6E')

ln.track(transform)

✅ saved: Transform(id='pWJEqHVhng2Qev', name='Cell Ranger', version='7.2.0', type='pipeline', updated_at=2023-08-28 14:49:15, created_by_id='bKeW4T6E')

✅ saved: Run(id='uN1Ug3h7Ytfd6qfgBi1s', run_at=2023-08-28 14:49:15, transform_id='pWJEqHVhng2Qev', created_by_id='bKeW4T6E')

Now, let’s stage a few files from an instrument upload:

files = ln.File.filter(key__startswith="fastq/perturbseq").all()
filepaths = [file.stage() for file in files]

💡 adding file RsIKTt02IAOncsRVrHzA as input for run uN1Ug3h7Ytfd6qfgBi1s, adding parent transform aqK0SBzGBxgghN

💡 adding file 4aZhw0PMOhyPqAW3rfN5 as input for run uN1Ug3h7Ytfd6qfgBi1s, adding parent transform aqK0SBzGBxgghN
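
The `key__startswith` argument in the query above follows the Django-style field lookup convention, in which a double underscore separates the field name from a comparison operator. The following is a minimal, self-contained sketch of that convention on plain dictionaries. It is not lamindb’s implementation: `matches` and `RECORDS` are hypothetical names used for illustration, the records carry only two simplified fields, and only the three lookups used on this page are covered.

```python
# Illustrative only: mimics Django-style lookups on plain dicts.
# Keys and descriptions are taken from this page; all other fields are omitted.
RECORDS = [
    {"key": "fastq/perturbseq_R1_001.fastq.gz", "description": None},
    {"key": "fastq/perturbseq_R2_001.fastq.gz", "description": None},
    {"key": "schmidt22_perturbseq.h5ad", "description": "perturbseq counts"},
    {"key": "figures/umap_fig1_score-wgs-hits.png", "description": None},
]


def matches(record, **lookups):
    """Return True if the record satisfies every `field__operator=value` lookup."""
    for expression, expected in lookups.items():
        field, _, operator = expression.partition("__")
        value = record.get(field)
        if operator == "":  # plain equality, e.g. key="..."
            ok = value == expected
        elif value is None:  # a missing value never matches a string lookup
            ok = False
        elif operator == "startswith":
            ok = value.startswith(expected)
        elif operator == "contains":
            ok = expected in value
        elif operator == "icontains":  # case-insensitive containment
            ok = expected.lower() in value.lower()
        else:
            raise ValueError(f"unsupported lookup: {operator}")
        if not ok:
            return False
    return True


fastq_keys = [r["key"] for r in RECORDS if matches(r, key__startswith="fastq/perturbseq")]
count_keys = [r["key"] for r in RECORDS if matches(r, description__icontains="PerturbSeq")]
print(fastq_keys)
print(count_keys)
```

In this sketch, the first query selects the two FASTQ records and the second selects only the record described as “perturbseq counts”, regardless of letter case.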

Assume we processed them and obtained 3 output files in a folder 'filtered_feature_bc_matrix':

output_files = ln.File.from_dir("./mydata/perturbseq/filtered_feature_bc_matrix/")
ln.save(output_files)

Let’s look at the data lineage at this stage:

output_files[0].view_lineage()

https://d33wubrfki0l68.cloudfront.net/82792da073c50a67bef24edee65144dc1bdd066f/fcffe/_images/521dc86b2a77528ba6b8bab7ceb801f3b31fffaae9ea1b52c82a5a41d95c9e6a.svg

And let’s keep running the Cell Ranger pipeline in the background.

Track app upload & analytics#

The hidden cell below simulates additional analytic steps including:

uploading phenotypic screen data
scRNA-seq analysis
analyses of the integrated datasets

Let’s see what the data lineage of this looks like:

file = ln.File.filter(description="hits from schmidt22 crispra GWS").one()
file.view_lineage()

https://d33wubrfki0l68.cloudfront.net/401d876db808cf86ab33e9e635a18c2f6a3f5b86/2415f/_images/3b5755553a580d8875ef25a4805b288662cf9c539720435460ad7f230d628451.svg

In the background, somebody integrated and analyzed the outputs of the app upload and the Cell Ranger pipeline:

# Let us add analytics on top of the cell ranger pipeline and the phenotypic screening
transform = ln.Transform(
    name="Perform single cell analysis, integrating with CRISPRa screen",
    type="notebook",
)
ln.track(transform)

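file_ps = ln.File.filter(description__icontains="perturbseq").one()
adata = file_ps.load()
# query the screen hits file explicitly so that `file_hits` is defined in this cell
file_hits = ln.File.filter(description="hits from schmidt22 crispra GWS").one()
screen_hits = file_hits.load()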

import scanpy as sc

sc.tl.score_genes(adata, adata.var_names.intersection(screen_hits.index).tolist())
filesuffix = "_fig1_score-wgs-hits.png"
sc.pl.umap(adata, color="score", show=False, save=filesuffix)
filepath = f"figures/umap{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()
filesuffix = "fig2_score-wgs-hits-per-cluster.png"
sc.pl.matrixplot(
    adata, groupby="cluster_name", var_names=["score"], show=False, save=filesuffix
)
filepath = f"figures/matrixplot_{filesuffix}"
file = ln.File(filepath, key=filepath)
file.save()

✅ saved: Transform(id='TKln6FHOmDPKwc', name='Perform single cell analysis, integrating with CRISPRa screen', type='notebook', updated_at=2023-08-28 14:49:21, created_by_id='bKeW4T6E')

✅ saved: Run(id='0jnUrtDaLrTNt2SddM9i', run_at=2023-08-28 14:49:21, transform_id='TKln6FHOmDPKwc', created_by_id='bKeW4T6E')

💡 adding file oAVvvsis2fUhMHEU1REc as input for run 0jnUrtDaLrTNt2SddM9i, adding parent transform w7jyueHY75U7cC

💡 adding file Ff6RGhUI48U9OugNQ7wt as input for run 0jnUrtDaLrTNt2SddM9i, adding parent transform GqkNrJ3VnlQO4u

WARNING: saving figure to file figures/umap_fig1_score-wgs-hits.png

💡 file will be copied to default storage upon `save()` with key 'figures/umap_fig1_score-wgs-hits.png'

✅ storing file 'ton5FLCvYHAxyCaEodi5' at 'figures/umap_fig1_score-wgs-hits.png'

WARNING: saving figure to file figures/matrixplot_fig2_score-wgs-hits-per-cluster.png

💡 file will be copied to default storage upon `save()` with key 'figures/matrixplot_fig2_score-wgs-hits-per-cluster.png'

✅ storing file 'O80IXB21Am9j0nBajowm' at 'figures/matrixplot_fig2_score-wgs-hits-per-cluster.png'

The outcome is a few figures stored as image files. We’ll query one of them and look at its data lineage after tracking the current notebook.

Track notebooks#

We’d now like to track the current Jupyter notebook to continue the work:

ln.track()

💡 notebook imports: ipython==8.14.0 lamindb==0.51.0 scanpy==1.9.4

✅ saved: Transform(id='1LCd8kco9lZUz8', name='Bird's eye view', short_name='birds-eye', version='0', type=notebook, updated_at=2023-08-28 14:49:24, created_by_id='bKeW4T6E')

✅ saved: Run(id='sJZVw3N8Tlyia55gOiUS', run_at=2023-08-28 14:49:24, transform_id='1LCd8kco9lZUz8', created_by_id='bKeW4T6E')

Visualize data lineage#

Let’s load one of the plots:

file = ln.File.filter(key__contains="figures/matrixplot").one()

from IPython.display import Image, display

file.stage()
display(Image(filename=file.path))

💡 adding file O80IXB21Am9j0nBajowm as input for run sJZVw3N8Tlyia55gOiUS, adding parent transform TKln6FHOmDPKwc

https://d33wubrfki0l68.cloudfront.net/dcbd1e67232f2ede82171ba02237575cc586c2b7/1ceff/_images/45891ad4693b5bfeb52a48b2ab2e5d0a82220b9482360ee1a8757fad581fffdc.png

We see that the image file is tracked as an input of the current notebook. The input is highlighted, and the notebook follows at the bottom:

file.view_lineage()

https://d33wubrfki0l68.cloudfront.net/cf871dafad50c6669b5e4665eb11d1b2d93b9147/3983b/_images/73de689077a870bc76855b77bfabd2adef7d6caec7edf32b46ab2648156fab2f.svg

Alternatively, we can look only at the sequence of transforms and ignore the files:

transform = ln.Transform.search("Bird's eye view", return_queryset=True).first()

transform.parents.df()

	name	short_name	version	initial_version_id	type	reference	updated_at	created_by_id
id
TKln6FHOmDPKwc	Perform single cell analysis, integrating with...	None	None	None	notebook	None	2023-08-28 14:49:23	bKeW4T6E

transform.view_parents()

https://d33wubrfki0l68.cloudfront.net/2cedf0ec070e1ea45b4f70597c24dbf052aaf7a7/cef84/_images/83b0139b6ff200687fa4c6469ed11ec8d7bc381f8a014c35d497a986eb6464e7.svg
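
The parent links between transforms can also be walked programmatically. Below is a minimal sketch in plain Python rather than lamindb code: `PARENTS` is a hand-written mapping that contains only the links reported by the “adding parent transform” log messages on this page, and `ancestors` is a hypothetical helper that collects all upstream transforms breadth-first.

```python
from collections import deque

# Child transform -> parent transforms, as reported by the log messages on this page.
# Links created in cells that are not shown here are deliberately left out.
PARENTS = {
    "Bird's eye view": [
        "Perform single cell analysis, integrating with CRISPRa screen",
    ],
    "Perform single cell analysis, integrating with CRISPRa screen": [
        "Preprocess Cell Ranger outputs",
        "GWS CRIPSRa analysis",
    ],
    "Cell Ranger": ["Chromium 10x upload"],
}


def ancestors(name, parents):
    """Return all upstream transforms of `name`, nearest first."""
    seen = []
    queue = deque(parents.get(name, []))
    while queue:
        current = queue.popleft()
        if current in seen:  # skip transforms reachable via several paths
            continue
        seen.append(current)
        queue.extend(parents.get(current, []))
    return seen


for upstream in ancestors("Bird's eye view", PARENTS):
    print(upstream)
```

Starting from the current notebook, the sketch first reaches the single-cell analysis notebook and then its two parents. A transform without recorded parents, such as the instrument upload, yields an empty list.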

Understand runs#

We tracked pipeline and notebook runs through run_context, which stores a Transform and a Run record as a global context.

File objects are the inputs and outputs of runs.
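
To make the idea of a global run context concrete, here is a minimal sketch in plain Python. It is a toy model under stated assumptions, not lamindb’s implementation: the classes `Transform`, `Run`, `File`, and `RunContext` and the functions `track`, `save`, and `stage` are simplified stand-ins that only borrow their names from this page, and the output key is a hypothetical example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Transform:
    name: str
    type: str  # e.g. "pipeline" or "notebook"
    parents: set = field(default_factory=set)  # names of upstream transforms


@dataclass
class Run:
    transform: Transform
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


@dataclass
class File:
    key: str
    run: Optional[Run] = None  # the run that created this file


class RunContext:
    """Toy global context holding the current transform and run."""

    transform: Optional[Transform] = None
    run: Optional[Run] = None


def track(transform):
    """Start a new run of `transform` and make it the global context."""
    RunContext.transform = transform
    RunContext.run = Run(transform)
    return RunContext.run


def save(file):
    """Register `file` as an output of the current run."""
    file.run = RunContext.run
    RunContext.run.outputs.append(file)
    return file


def stage(file):
    """Register `file` as an input of the current run and link the parent transform."""
    run = RunContext.run
    if file.run is not None and file.run is not run:
        run.inputs.append(file)
        run.transform.parents.add(file.run.transform.name)
    return file


track(Transform("Chromium 10x upload", "pipeline"))
fastq = save(File("fastq/perturbseq_R1_001.fastq.gz"))

track(Transform("Cell Ranger", "pipeline"))
stage(fastq)
result = save(File("example/output.txt"))  # hypothetical output key

print(RunContext.transform.parents)
print([f.key for f in RunContext.run.inputs])
```

In this toy model, staging the FASTQ file during the second run records it as an input and adds the upload transform as a parent, which mirrors the “adding file … as input for run …, adding parent transform …” messages shown earlier.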

Query by provenance#

We can query or search for the notebook that created the file:

transform = ln.Transform.search("GWS CRIPSRa analysis", return_queryset=True).first()

And then find all the files created by that notebook:

ln.File.filter(transform=transform).df()

	storage_id	key	suffix	accessor	description	version	initial_version_id	size	hash	hash_type	transform_id	run_id	updated_at	created_by_id
id
Ff6RGhUI48U9OugNQ7wt	qwgoTldI	None	.parquet	DataFrame	hits from schmidt22 crispra GWS	None	None	18368	O2Owo0_QlM9JBS2zAZD4Lw	md5	GqkNrJ3VnlQO4u	TqFvLxCX6htdJ1Puou53	2023-08-28 14:49:21	bKeW4T6E

Which transform ingested a given file?

file = ln.File.filter().first()
file.transform

Transform(id='aqK0SBzGBxgghN', name='Chromium 10x upload', type='pipeline', updated_at=2023-08-28 14:49:13, created_by_id='DzTjkKse')

And which user?

file.created_by

User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 14:49:18)
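
These provenance questions amount to following foreign keys from a file to its transform and its creator. The sketch below reproduces that resolution on plain dictionaries; the ids, names, and handles are copied from the tables printed on this page, while `FILES`, `TRANSFORMS`, `USERS`, `provenance`, and `files_created_by` are hypothetical names for illustration and not lamindb API.

```python
# Rows copied from the tables printed on this page (only the columns needed here).
FILES = {
    "Ff6RGhUI48U9OugNQ7wt": {"transform_id": "GqkNrJ3VnlQO4u", "created_by_id": "bKeW4T6E"},
    "IeIPTV8JMbghR3t93wq8": {"transform_id": "KH66KIZvk8QwSG", "created_by_id": "DzTjkKse"},
    "ton5FLCvYHAxyCaEodi5": {"transform_id": "TKln6FHOmDPKwc", "created_by_id": "bKeW4T6E"},
}
TRANSFORMS = {
    "GqkNrJ3VnlQO4u": {"name": "GWS CRIPSRa analysis", "type": "notebook"},
    "KH66KIZvk8QwSG": {"name": "Upload GWS CRISPRa result", "type": "app"},
    "TKln6FHOmDPKwc": {
        "name": "Perform single cell analysis, integrating with CRISPRa screen",
        "type": "notebook",
    },
}
USERS = {"DzTjkKse": "testuser1", "bKeW4T6E": "testuser2"}


def provenance(file_id):
    """Resolve a file id to (transform name, transform type, user handle)."""
    row = FILES[file_id]
    transform = TRANSFORMS[row["transform_id"]]
    return transform["name"], transform["type"], USERS[row["created_by_id"]]


def files_created_by(transform_name):
    """Return the ids of all files whose transform has the given name."""
    return [
        file_id
        for file_id, row in FILES.items()
        if TRANSFORMS[row["transform_id"]]["name"] == transform_name
    ]


print(provenance("IeIPTV8JMbghR3t93wq8"))
print(files_created_by("GWS CRIPSRa analysis"))
```

The first call resolves the raw screen data to the app upload by testuser1, and the second returns the one file created by the GWS analysis notebook.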

Which transforms were created by a given user?

users = ln.User.lookup()

ln.Transform.filter(created_by=users.testuser2).df()

	name	short_name	version	initial_version_id	type	reference	updated_at	created_by_id
id
pWJEqHVhng2Qev	Cell Ranger	None	7.2.0	None	pipeline	None	2023-08-28 14:49:15	bKeW4T6E
w7jyueHY75U7cC	Preprocess Cell Ranger outputs	None	2.0	None	pipeline	None	2023-08-28 14:49:17	bKeW4T6E
GqkNrJ3VnlQO4u	GWS CRIPSRa analysis	None	None	None	notebook	None	2023-08-28 14:49:21	bKeW4T6E
TKln6FHOmDPKwc	Perform single cell analysis, integrating with...	None	None	None	notebook	None	2023-08-28 14:49:23	bKeW4T6E
1LCd8kco9lZUz8	Bird's eye view	birds-eye	0	None	notebook	None	2023-08-28 14:49:24	bKeW4T6E

Which notebooks were created by a given user?

ln.Transform.filter(created_by=users.testuser2, type="notebook").df()

	name	short_name	version	initial_version_id	type	reference	updated_at	created_by_id
id
GqkNrJ3VnlQO4u	GWS CRIPSRa analysis	None	None	None	notebook	None	2023-08-28 14:49:21	bKeW4T6E
TKln6FHOmDPKwc	Perform single cell analysis, integrating with...	None	None	None	notebook	None	2023-08-28 14:49:23	bKeW4T6E
1LCd8kco9lZUz8	Bird's eye view	birds-eye	0	None	notebook	None	2023-08-28 14:49:24	bKeW4T6E

We can also view all recent additions to the entire database:

ln.view()

File

	storage_id	key	suffix	accessor	description	version	initial_version_id	size	hash	hash_type	transform_id	run_id	updated_at	created_by_id
id
O80IXB21Am9j0nBajowm	qwgoTldI	figures/matrixplot_fig2_score-wgs-hits-per-clu...	.png	None	None	None	None	28814	JYIPcat0YWYVCX3RVd3mww	md5	TKln6FHOmDPKwc	0jnUrtDaLrTNt2SddM9i	2023-08-28 14:49:23	bKeW4T6E
ton5FLCvYHAxyCaEodi5	qwgoTldI	figures/umap_fig1_score-wgs-hits.png	.png	None	None	None	None	118999	laQjVk4gh70YFzaUyzbUNg	md5	TKln6FHOmDPKwc	0jnUrtDaLrTNt2SddM9i	2023-08-28 14:49:23	bKeW4T6E
Ff6RGhUI48U9OugNQ7wt	qwgoTldI	None	.parquet	DataFrame	hits from schmidt22 crispra GWS	None	None	18368	O2Owo0_QlM9JBS2zAZD4Lw	md5	GqkNrJ3VnlQO4u	TqFvLxCX6htdJ1Puou53	2023-08-28 14:49:21	bKeW4T6E
IeIPTV8JMbghR3t93wq8	qwgoTldI	schmidt22-crispra-gws-IFNG.csv	.csv	None	Raw data of schmidt22 crispra GWS	None	None	1729685	cUSH0oQ2w-WccO8_ViKRAQ	md5	KH66KIZvk8QwSG	KRLrXtmFL4GOpgPQY1zy	2023-08-28 14:49:19	DzTjkKse
oAVvvsis2fUhMHEU1REc	qwgoTldI	schmidt22_perturbseq.h5ad	.h5ad	AnnData	perturbseq counts	None	None	20659936	la7EvqEUMDlug9-rpw-udA	md5	w7jyueHY75U7cC	Lez62WObKIDwLdBSdwlj	2023-08-28 14:49:17	bKeW4T6E
ZL9ltsP0hzZOs4AlJUFo	qwgoTldI	perturbseq/filtered_feature_bc_matrix/features...	.tsv.gz	None	None	None	None	6	W3ZL00oIwVBT2lN-pUdT2w	md5	pWJEqHVhng2Qev	uN1Ug3h7Ytfd6qfgBi1s	2023-08-28 14:49:15	bKeW4T6E
KHzNnOUnK6l8BTgYSSCg	qwgoTldI	perturbseq/filtered_feature_bc_matrix/matrix.m...	.mtx.gz	None	None	None	None	6	xNBcmvSIeFtVPHZzYqlEkA	md5	pWJEqHVhng2Qev	uN1Ug3h7Ytfd6qfgBi1s	2023-08-28 14:49:15	bKeW4T6E

Run

	transform_id	run_at	created_by_id	reference	reference_type
id
YvCfkwTWWOgVFZymKfvl	aqK0SBzGBxgghN	2023-08-28 14:49:13	DzTjkKse	None	None
uN1Ug3h7Ytfd6qfgBi1s	pWJEqHVhng2Qev	2023-08-28 14:49:15	bKeW4T6E	None	None
Lez62WObKIDwLdBSdwlj	w7jyueHY75U7cC	2023-08-28 14:49:15	bKeW4T6E	None	None
KRLrXtmFL4GOpgPQY1zy	KH66KIZvk8QwSG	2023-08-28 14:49:18	DzTjkKse	None	None
TqFvLxCX6htdJ1Puou53	GqkNrJ3VnlQO4u	2023-08-28 14:49:21	bKeW4T6E	None	None
0jnUrtDaLrTNt2SddM9i	TKln6FHOmDPKwc	2023-08-28 14:49:21	bKeW4T6E	None	None
sJZVw3N8Tlyia55gOiUS	1LCd8kco9lZUz8	2023-08-28 14:49:24	bKeW4T6E	None	None

Storage

	root	type	region	updated_at	created_by_id
id
qwgoTldI	/home/runner/work/lamin-usecases/lamin-usecase...	local	None	2023-08-28 14:49:12	DzTjkKse

Transform

	name	short_name	version	initial_version_id	type	reference	updated_at	created_by_id
id
1LCd8kco9lZUz8	Bird's eye view	birds-eye	0	None	notebook	None	2023-08-28 14:49:24	bKeW4T6E
TKln6FHOmDPKwc	Perform single cell analysis, integrating with...	None	None	None	notebook	None	2023-08-28 14:49:23	bKeW4T6E
GqkNrJ3VnlQO4u	GWS CRIPSRa analysis	None	None	None	notebook	None	2023-08-28 14:49:21	bKeW4T6E
KH66KIZvk8QwSG	Upload GWS CRISPRa result	None	None	None	app	None	2023-08-28 14:49:19	DzTjkKse
w7jyueHY75U7cC	Preprocess Cell Ranger outputs	None	2.0	None	pipeline	None	2023-08-28 14:49:17	bKeW4T6E
pWJEqHVhng2Qev	Cell Ranger	None	7.2.0	None	pipeline	None	2023-08-28 14:49:15	bKeW4T6E
aqK0SBzGBxgghN	Chromium 10x upload	None	None	None	pipeline	None	2023-08-28 14:49:13	DzTjkKse

User

	handle	email	name	updated_at
id
bKeW4T6E	testuser2	testuser2@lamin.ai	Test User2	2023-08-28 14:49:21
DzTjkKse	testuser1	testuser1@lamin.ai	Test User1	2023-08-28 14:49:18