Validate & register bulk RNA-seq datasets#
Background#
Bulk RNA sequencing (RNA-seq) is a high-throughput technique that measures the gene expression levels of thousands of genes simultaneously, providing insights into overall cellular transcription patterns.
Here, weβll demonstrate how to make a bulk RNA-seq count matrix data ware house ready and apply the TVR (transform, validate, register) workflow on it.
Setup#
!lamin init --storage test-bulkrna --schema bionty
Show code cell output
π‘ creating schemas: core==0.46.1 bionty==0.30.0
β
saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 14:45:18)
β
saved: Storage(id='TUwXuXPm', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna', type='local', updated_at=2023-08-28 14:45:18, created_by_id='DzTjkKse')
β
loaded instance: testuser1/test-bulkrna
π‘ did not register local instance on hub (if you want, call `lamin register`)
import lamindb as ln
from pathlib import Path
import lnschema_bionty as lb
import pandas as pd
import anndata as ad
β
loaded instance: testuser1/test-bulkrna (lamindb 0.51.0)
nf-core Salmon count matrix#
We start by simulating a nf-core RNA-seq) run which yields us a count matrix file.
Show code cell content
# pretend we're running a bulk RNA-seq pipeline
ln.track(ln.Transform(name="nf-core RNA-seq", reference="https://nf-co.re/rnaseq"))
# create a directory for its output
Path("./test-bulkrna/output_dir").mkdir(exist_ok=True)
# get the count matrix
path = ln.dev.datasets.file_tsv_rnaseq_nfcore_salmon_merged_gene_counts(
populate_registries=True
)
# move it into the output directory
path = path.rename(f"./test-bulkrna/output_dir/{path.name}")
# register it
ln.File(path, description="Merged Bulk RNA counts").save()
β
saved: Transform(id='U6QWL77oXYAlR8', name='nf-core RNA-seq', type=notebook, reference='https://nf-co.re/rnaseq', updated_at=2023-08-28 14:45:20, created_by_id='DzTjkKse')
β
saved: Run(id='HHH8VwCJoqr38pnTc76F', run_at=2023-08-28 14:45:20, transform_id='U6QWL77oXYAlR8', created_by_id='DzTjkKse')
β file has more than two suffixes (path.suffixes), using only last suffix: '.tsv'
π‘ file in storage 'test-bulkrna' with key 'output_dir/salmon.merged.gene_counts.tsv'
Transform #
ln.track()
π‘ notebook imports: anndata==0.9.2 lamindb==0.51.0 lnschema_bionty==0.30.0 pandas==1.5.3
β
saved: Transform(id='s5V0dNMVwL9iz8', name='Validate & register bulk RNA-seq datasets', short_name='bulkrna', version='0', type=notebook, updated_at=2023-08-28 14:45:24, created_by_id='DzTjkKse')
β
saved: Run(id='R8PPRCUUHV1XUAAzqVCC', run_at=2023-08-28 14:45:24, transform_id='s5V0dNMVwL9iz8', created_by_id='DzTjkKse')
Letβs query the file:
file = ln.File.filter(description="Merged Bulk RNA counts").one()
df = file.load()
π‘ adding file YrBtIDOCtmNZANkK8sl3 as input for run R8PPRCUUHV1XUAAzqVCC, adding parent transform U6QWL77oXYAlR8
If we look at it, we realize it deviates far from the tidy data standard Wickham14, conventions of statistics & machine learning Hastie09, Murphy12 and the major Python & R data packages.
Variables are not in columns and observations are not in rows:
df
gene_id | gene_name | RAP1_IAA_30M_REP1 | RAP1_UNINDUCED_REP1 | RAP1_UNINDUCED_REP2 | WT_REP1 | WT_REP2 | |
---|---|---|---|---|---|---|---|
0 | Gfp_transgene_gene | Gfp_transgene_gene | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 |
1 | HRA1 | HRA1 | 0.0 | 8.572 | 0.0 | 0.0 | 0.0 |
2 | snR18 | snR18 | 3.0 | 8.000 | 4.0 | 8.0 | 8.0 |
3 | tA(UGC)A | TGA1 | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 |
4 | tL(CAA)A | SUP56 | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... |
120 | YAR064W | YAR064W | 0.0 | 2.000 | 0.0 | 0.0 | 0.0 |
121 | YAR066W | YAR066W | 3.0 | 13.000 | 8.0 | 5.0 | 11.0 |
122 | YAR068W | YAR068W | 9.0 | 28.000 | 24.0 | 5.0 | 7.0 |
123 | YAR069C | YAR069C | 0.0 | 0.000 | 0.0 | 0.0 | 1.0 |
124 | YAR070C | YAR070C | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 |
125 rows Γ 7 columns
Letβs change that and move observations into rows:
df = df.T
df
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene_id | Gfp_transgene_gene | HRA1 | snR18 | tA(UGC)A | tL(CAA)A | tP(UGG)A | tS(AGA)A | YAL001C | YAL002W | YAL003W | ... | YAR050W | YAR053W | YAR060C | YAR061W | YAR062W | YAR064W | YAR066W | YAR068W | YAR069C | YAR070C |
gene_name | Gfp_transgene_gene | HRA1 | snR18 | TGA1 | SUP56 | TRN1 | tS(AGA)A | TFC3 | VPS8 | EFB1 | ... | FLO1 | YAR053W | YAR060C | YAR061W | YAR062W | YAR064W | YAR066W | YAR068W | YAR069C | YAR070C |
RAP1_IAA_30M_REP1 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 55.0 | 36.0 | 632.0 | ... | 4.357 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 9.0 | 0.0 | 0.0 |
RAP1_UNINDUCED_REP1 | 0.0 | 8.572 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 72.0 | 33.0 | 810.0 | ... | 15.72 | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 | 13.0 | 28.0 | 0.0 | 0.0 |
RAP1_UNINDUCED_REP2 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 115.0 | 82.0 | 1693.0 | ... | 13.772 | 0.0 | 4.0 | 0.0 | 2.0 | 0.0 | 8.0 | 24.0 | 0.0 | 0.0 |
WT_REP1 | 0.0 | 0.0 | 8.0 | 0.0 | 0.0 | 1.0 | 0.0 | 60.0 | 63.0 | 1115.0 | ... | 13.465 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 5.0 | 5.0 | 0.0 | 0.0 |
WT_REP2 | 0.0 | 0.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 30.0 | 25.0 | 704.0 | ... | 6.891 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 11.0 | 7.0 | 1.0 | 0.0 |
7 rows Γ 125 columns
Now, itβs clear that the first two rows are in fact no observations, but descriptions of the variables (or features) themselves.
Letβs create an AnnData object to model this. First, create a dataframe for the variables:
var = pd.DataFrame({"gene_name": df.loc["gene_name"].values}, index=df.loc["gene_id"])
var.head()
gene_name | |
---|---|
gene_id | |
Gfp_transgene_gene | Gfp_transgene_gene |
HRA1 | HRA1 |
snR18 | snR18 |
tA(UGC)A | TGA1 |
tL(CAA)A | SUP56 |
Now, letβs create an AnnData:
# we're also fixing the datatype here, which was string in the tsv
adata = ad.AnnData(df.iloc[2:].astype("float32"), var=var)
adata
AnnData object with n_obs Γ n_vars = 5 Γ 125
var: 'gene_name'
The AnnData object is in tidy form and complies with conventions of statistics and machine learning:
adata.to_df()
gene_id | Gfp_transgene_gene | HRA1 | snR18 | tA(UGC)A | tL(CAA)A | tP(UGG)A | tS(AGA)A | YAL001C | YAL002W | YAL003W | ... | YAR050W | YAR053W | YAR060C | YAR061W | YAR062W | YAR064W | YAR066W | YAR068W | YAR069C | YAR070C |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RAP1_IAA_30M_REP1 | 0.0 | 0.000 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 55.0 | 36.0 | 632.0 | ... | 4.357 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 9.0 | 0.0 | 0.0 |
RAP1_UNINDUCED_REP1 | 0.0 | 8.572 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 72.0 | 33.0 | 810.0 | ... | 15.720 | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 | 13.0 | 28.0 | 0.0 | 0.0 |
RAP1_UNINDUCED_REP2 | 0.0 | 0.000 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 115.0 | 82.0 | 1693.0 | ... | 13.772 | 0.0 | 4.0 | 0.0 | 2.0 | 0.0 | 8.0 | 24.0 | 0.0 | 0.0 |
WT_REP1 | 0.0 | 0.000 | 8.0 | 0.0 | 0.0 | 1.0 | 0.0 | 60.0 | 63.0 | 1115.0 | ... | 13.465 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 5.0 | 5.0 | 0.0 | 0.0 |
WT_REP2 | 0.0 | 0.000 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 30.0 | 25.0 | 704.0 | ... | 6.891 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 11.0 | 7.0 | 1.0 | 0.0 |
5 rows Γ 125 columns
Validate #
Letβs create a File object from this AnnData. Because this will validate the gene IDs and these are only defined given a species, we have to set a species context:
lb.settings.species = "saccharomyces cerevisiae"
β
set species: Species(id='nn8c', name='saccharomyces cerevisiae', taxon_id=559292, scientific_name='saccharomyces_cerevisiae', updated_at=2023-08-28 14:45:24, bionty_source_id='guBO', created_by_id='DzTjkKse')
Almost all gene IDs are validated:
genes = lb.Gene.from_values(adata.var.index, lb.Gene.stable_id)
π‘ using global setting species = saccharomyces cerevisiae
β
created 123 Gene records from Bionty matching stable_id: HRA1, snR18, tA(UGC)A, tL(CAA)A, tP(UGG)A, tS(AGA)A, YAL001C, YAL002W, YAL003W, YAL004W, YAL005C, YAL007C, YAL008W, YAL009W, YAL010C, YAL011W, YAL012W, YAL013W, YAL014C, YAL015C, ...
β did not create Gene records for 2 non-validated stable_ids: Gfp_transgene_gene, YAR062W
# also register the 2 non-validated genes
ln.save(genes)
assays = lb.ExperimentalFactor.lookup()
assays.rna_seq
ExperimentalFactor(id='J0KT2nxy', name='RNA-Seq', ontology_id='EFO:0008896', description='Rna-Seq Is A Method That Involves Purifying Rna And Making Cdna, Followed By High-Throughput Sequencing.', molecule='RNA assay', instrument='assay by high throughput sequencer', updated_at=2023-08-28 14:45:20, bionty_source_id='4p2Q', created_by_id='DzTjkKse')
modality = ln.Modality.filter(name="rna").one()
Register #
curated_file = ln.File.from_anndata(
adata, description="Curated bulk RNA counts", var_ref=lb.Gene.stable_id
)
π‘ file will be copied to default storage upon `save()` with key `None` ('.lamindb/maE9RTPy2gwgf33OGkKG.h5ad')
π‘ parsing feature names of X stored in slot 'var'
π‘ using global setting species = saccharomyces cerevisiae
β
123 terms (98.40%) are validated for stable_id
β 2 terms (1.60%) are not validated for stable_id: Gfp_transgene_gene, YAR062W
π‘ using global setting species = saccharomyces cerevisiae
β
linked: FeatureSet(id='oUWjZrGot3BTU6qqbQid', n=123, type='float', registry='bionty.Gene', hash='8j8y_AHnWb5huZ2hXCDj', created_by_id='DzTjkKse')
Hence, letβs save this file:
curated_file.save()
β
saved 1 feature set for slot: 'var'
β
storing file 'maE9RTPy2gwgf33OGkKG' at '.lamindb/maE9RTPy2gwgf33OGkKG.h5ad'
curated_file.add_labels(assays.rna_seq, "assay")
curated_file.add_labels(lb.settings.species, "species")
β
linked new feature 'assay' together with new feature set FeatureSet(id='J2T73j2czwiSUd5ezmHl', n=1, registry='core.Feature', hash='jaqP0giT9xll4onlbz1s', updated_at=2023-08-28 14:45:25, modality_id='OP2P7WX5', created_by_id='DzTjkKse')
π‘ no file links to it anymore, deleting feature set FeatureSet(id='J2T73j2czwiSUd5ezmHl', n=1, registry='core.Feature', hash='jaqP0giT9xll4onlbz1s', updated_at=2023-08-28 14:45:25, modality_id='OP2P7WX5', created_by_id='DzTjkKse')
β
linked new feature 'species' together with new feature set FeatureSet(id='Qdyc2YA9U7iWOjtPgyKL', n=2, registry='core.Feature', hash='Avlv4nJXMgzKmZPwyDbp', updated_at=2023-08-28 14:45:25, modality_id='OP2P7WX5', created_by_id='DzTjkKse')
var_feature_set = curated_file.features.get_feature_set("var")
var_feature_set.modality = modality
var_feature_set.save()
curated_file.describe()
π‘ File(id='maE9RTPy2gwgf33OGkKG', key=None, suffix='.h5ad', accessor='AnnData', description='Curated bulk RNA counts', version=None, size=28180, hash='6bieh8XjOCCz6bJToN4u1g', hash_type='md5', created_at=2023-08-28 14:45:25, updated_at=2023-08-28 14:45:25)
Provenance:
ποΈ storage: Storage(id='TUwXuXPm', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna', type='local', updated_at=2023-08-28 14:45:18, created_by_id='DzTjkKse')
π« transform: Transform(id='s5V0dNMVwL9iz8', name='Validate & register bulk RNA-seq datasets', short_name='bulkrna', version='0', type=notebook, updated_at=2023-08-28 14:45:25, created_by_id='DzTjkKse')
π£ run: Run(id='R8PPRCUUHV1XUAAzqVCC', run_at=2023-08-28 14:45:24, transform_id='s5V0dNMVwL9iz8', created_by_id='DzTjkKse')
π€ created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 14:45:18)
Features:
var (X):
π index (123, bionty.Gene.id): ['T3sLw0r7aPcM', 'EgcZn6KaoCKe', 'oWIBkvGTHn0P', 'v7AQgsU06P1c', '0GE0ByouFNSG'...]
external:
π assay (1, bionty.ExperimentalFactor): ['RNA-Seq']
π species (1, bionty.Species): ['saccharomyces cerevisiae']
Example queries#
We have two files in the file registry:
ln.File.filter().df()
storage_id | key | suffix | accessor | description | version | initial_version_id | size | hash | hash_type | transform_id | run_id | updated_at | created_by_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||
YrBtIDOCtmNZANkK8sl3 | TUwXuXPm | output_dir/salmon.merged.gene_counts.tsv | .tsv | None | Merged Bulk RNA counts | None | None | 3787 | xxw0k3au3KtxFcgtbEr4eQ | md5 | U6QWL77oXYAlR8 | HHH8VwCJoqr38pnTc76F | 2023-08-28 14:45:23 | DzTjkKse |
maE9RTPy2gwgf33OGkKG | TUwXuXPm | None | .h5ad | AnnData | Curated bulk RNA counts | None | None | 28180 | 6bieh8XjOCCz6bJToN4u1g | md5 | s5V0dNMVwL9iz8 | R8PPRCUUHV1XUAAzqVCC | 2023-08-28 14:45:25 | DzTjkKse |
curated_file.view_lineage()
Letβs by query by gene:
genes = lb.Gene.lookup()
genes.spo7
Gene(id='tpzwPcGZFK1y', symbol='SPO7', stable_id='YAL009W', ncbi_gene_ids='851224', biotype='protein_coding', description='Putative regulatory subunit of Nem1p-Spo7p phosphatase holoenzyme; regulates nuclear growth by controlling phospholipid biosynthesis, required for normal nuclear envelope morphology, premeiotic replication, and sporulation [Source:SGD;Acc:S000000007]', synonyms='', updated_at=2023-08-28 14:45:25, species_id='nn8c', bionty_source_id='HZRM', created_by_id='DzTjkKse')
# a feature set containing RNA measurements and SPO7 gene expression
feature_set = ln.FeatureSet.filter(genes=genes.spo7, modality__name="rna").first()
# files that link to this feature set
ln.File.filter(feature_sets=feature_set).df()
storage_id | key | suffix | accessor | description | version | initial_version_id | size | hash | hash_type | transform_id | run_id | updated_at | created_by_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||
maE9RTPy2gwgf33OGkKG | TUwXuXPm | None | .h5ad | AnnData | Curated bulk RNA counts | None | None | 28180 | 6bieh8XjOCCz6bJToN4u1g | md5 | s5V0dNMVwL9iz8 | R8PPRCUUHV1XUAAzqVCC | 2023-08-28 14:45:25 | DzTjkKse |
Show code cell content
!lamin delete --force test-bulkrna
!rm -r test-bulkrna
π‘ deleting instance testuser1/test-bulkrna
β
deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-bulkrna.env
β
instance cache deleted
β
deleted '.lndb' sqlite file
β consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna