Validate & register multi-modal data

!lamin init --storage ./test-multimodal --schema bionty
💡 creating schemas: core==0.46.1 bionty==0.30.0 
✅ saved: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 14:46:35)
✅ saved: Storage(id='30snRDOE', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal', type='local', updated_at=2023-08-28 14:46:35, created_by_id='DzTjkKse')
✅ loaded instance: testuser1/test-multimodal
💡 did not register local instance on hub (if you want, call `lamin register`)

import lamindb as ln
import lnschema_bionty as lb

lb.settings.species = "human"
ln.settings.verbosity = 3
✅ loaded instance: testuser1/test-multimodal (lamindb 0.51.0)
✅ set species: Species(id='uHJU', name='human', taxon_id=9606, scientific_name='homo_sapiens', updated_at=2023-08-28 14:46:36, bionty_source_id='r7x6', created_by_id='DzTjkKse')
ln.track()
💡 notebook imports: lamindb==0.51.0 lnschema_bionty==0.30.0
✅ saved: Transform(id='yMWSFirS6qv2z8', name='Validate & register multi-modal data', short_name='multimodal', version='0', type=notebook, updated_at=2023-08-28 14:46:37, created_by_id='DzTjkKse')
✅ saved: Run(id='zJrAYqSzpdptu7ALm1BH', run_at=2023-08-28 14:46:37, transform_id='yMWSFirS6qv2z8', created_by_id='DzTjkKse')

MuData object

Let’s use a MuData object:

mdata = ln.dev.datasets.mudata_papalexi21_subset()
mdata
MuData object with n_obs × n_vars = 200 × 300
  var:	'name'
  4 modalities
    rna:	200 x 173
      obs:	'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var:	'name'
    adt:	200 x 4
      obs:	'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var:	'name'
    hto:	200 x 12
      obs:	'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var:	'name'
    gdo:	200 x 111
      obs:	'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'nCount_GDO', 'nCount_ADT', 'nFeature_ADT', 'percent.mito', 'MULTI_ID', 'HTO_classification', 'guide_ID', 'gene_target', 'NT', 'perturbation', 'replicate', 'S.Score', 'G2M.Score', 'Phase'
      var:	'name'

First we register the file:

file = ln.File(
    "papalexi21_subset.h5mu", description="Sub-sampled MuData from Papalexi21"
)
file.save()
✅ storing file 'QoRgfBKFJ3n4POH9yDp5' at '.lamindb/QoRgfBKFJ3n4POH9yDp5.h5mu'

Register features

Now let’s register the 3 feature sets this data contains:

  1. rna
  2. adt
  3. obs (metadata)

modalities

For the two modalities rna and adt, we use the Bionty registries lb.Gene and lb.CellMarker as the reference:

mdata["rna"].var_names[:5]
Index(['RP5-827C21.6', 'XX-CR54.1', 'SH2D6', 'RP11-379B18.5', 'RP11-778D9.12'], dtype='object', name='index')
lb.Gene.validate(mdata["rna"].var_names, lb.Gene.symbol);
💡 using global setting species = human
173 terms (100.00%) are not validated for symbol: RP5-827C21.6, XX-CR54.1, SH2D6, RP11-379B18.5, RP11-778D9.12, RP11-703G6.1, AC005150.1, RP11-717H13.1, CTC-498J12.1, CTC-467M3.1, ARHGAP26-AS1, GABRA1, HIST1H4K, HLA-DQB1-AS1, RP11-524H19.2, SPACA1, VNN1, AC006042.7, AC002066.1, AC073934.6, ...
genes = lb.Gene.from_values(mdata["rna"].var_names, lb.Gene.symbol)
ln.save(genes)
💡 using global setting species = human
✅ created 77 Gene records from Bionty matching symbol: SH2D6, ARHGAP26-AS1, GABRA1, HLA-DQB1-AS1, SPACA1, VNN1, CTAGE15, PFKFB1, TRPC5, RBPMS-AS1, CA8, CSMD3, ZNF483, AK8, TMEM72-AS1, ARAP1-AS2, CRYAB, HOXC-AS2, LRRIQ1, TUBA3C, ...
✅ created 12 Gene records from Bionty matching synonyms: CTC-467M3.1, HIST1H4K, CASC1, LARGE, NBPF16, C1orf65, IBA57-AS1, KIAA1239, TMEM75, AP003419.16, FAM65C, C14orf177
❗ ambiguous validation in Bionty for 6 records: HLA-DQB1-AS1, CTAGE15, CTRB2, LGALS9C, PCDHB11, TBC1D3G
did not create Gene records for 84 non-validated symbols: AC002066.1, AC004019.13, AC005150.1, AC006042.7, AC011558.5, AC026471.6, AC073934.6, AC091132.1, AC092295.4, AC092687.5, AE000662.93, AL132989.1, AP000442.4, CTA-373H7.7, CTB-134F13.1, CTB-31O20.9, CTC-498J12.1, CTD-2562J17.2, CTD-3012A18.1, CTD-3065B20.2, ...
mdata["rna"].var_names = lb.Gene.standardize(mdata["rna"].var_names, lb.Gene.symbol)
💡 using global setting species = human
💡 standardized 89/173 terms
validated = lb.Gene.validate(mdata["rna"].var_names, lb.Gene.symbol)
💡 using global setting species = human
89 terms (51.40%) are validated for symbol
84 terms (48.60%) are not validated for symbol: RP5-827C21.6, XX-CR54.1, RP11-379B18.5, RP11-778D9.12, RP11-703G6.1, AC005150.1, RP11-717H13.1, CTC-498J12.1, RP11-524H19.2, AC006042.7, AC002066.1, AC073934.6, RP11-268G12.1, U52111.14, RP11-235C23.5, RP11-12J10.3, RP11-324E6.9, RP11-187A9.3, RP11-365N19.2, RP11-346D14.1, ...
new_genes = [
    lb.Gene(symbol=symbol, species=lb.settings.species)
    for symbol in mdata["rna"].var_names[~validated]
]
ln.save(new_genes)
lb.Gene.validate(mdata["rna"].var_names, lb.Gene.symbol);
💡 using global setting species = human
173 terms (100.00%) are validated for symbol
feature_set_rna = ln.FeatureSet.from_values(
    mdata["rna"].var_names, field=lb.Gene.symbol
)
💡 using global setting species = human
173 terms (100.00%) are validated for symbol
💡 using global setting species = human
mdata["adt"].var_names
Index(['CD86', 'PDL1', 'PDL2', 'CD366'], dtype='object', name='index')
lb.CellMarker.validate(mdata["adt"].var_names, field=lb.CellMarker.name);
💡 using global setting species = human
4 terms (100.00%) are not validated for name: CD86, PDL1, PDL2, CD366
markers = lb.CellMarker.from_values(mdata["adt"].var_names, field=lb.CellMarker.name)
ln.save(markers)
💡 using global setting species = human
✅ created 4 CellMarker records from Bionty matching name: CD86, PDL1, PDL2, CD366
lb.CellMarker.validate(mdata["adt"].var_names, field=lb.CellMarker.name);
💡 using global setting species = human
4 terms (100.00%) are validated for name
feature_set_adt = ln.FeatureSet.from_values(
    mdata["adt"].var_names, field=lb.CellMarker.name
)
💡 using global setting species = human
4 terms (100.00%) are validated for name
💡 using global setting species = human
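
The validate, standardize, validate cycle used above for gene symbols and cell markers can be illustrated without a database. The following is a minimal sketch in plain Python, not lamindb's or Bionty's implementation: `REFERENCE_SYMBOLS`, `SYNONYM_TO_SYMBOL`, and both helper functions are hypothetical stand-ins, and the two synonym pairs form a toy table chosen only for illustration.

```python
# Toy reference tables (assumptions for illustration only);
# a real gene registry holds far more entries.
REFERENCE_SYMBOLS = {"SH2D6", "GABRA1", "MARCHF8", "LARGE1"}
SYNONYM_TO_SYMBOL = {"MARCH8": "MARCHF8", "LARGE": "LARGE1"}


def validate(values, reference=REFERENCE_SYMBOLS):
    """Return one boolean per value: True if the value is in the reference set."""
    return [value in reference for value in values]


def standardize(values, synonyms=SYNONYM_TO_SYMBOL):
    """Map known synonyms to their standard symbol; leave other values unchanged."""
    return [synonyms.get(value, value) for value in values]


names = ["SH2D6", "MARCH8", "LARGE", "RP5-827C21.6"]

before = validate(names)             # only exact symbol matches pass
standardized = standardize(names)    # synonyms are replaced by standard symbols
after = validate(standardized)       # more values pass after standardization

# Values that still fail need a manual decision, e.g. creating new records.
unvalidated = [name for name, ok in zip(standardized, after) if not ok]
print(unvalidated)
```

Values that remain unvalidated after standardization play the same role as the 84 symbols above, for which new Gene records were created by hand.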

Link both feature sets to the file:

file.features.add_feature_set(feature_set_rna, slot="rna")
file.features.add_feature_set(feature_set_adt, slot="adt")

metadata

The third feature set is the obs metadata:

obs = mdata["rna"].obs

We’re only interested in a single metadata column, gene_target:

ln.Feature(name="gene_target", type="category").save()
features = ln.Feature.from_df(obs)
ln.save(features)
feature_set_obs = ln.FeatureSet.from_df(obs)
19 terms (100.00%) are validated for name
file.features.add_feature_set(feature_set_obs, slot="obs")
gene_targets = lb.Gene.from_values(obs["gene_target"], lb.Gene.symbol)
ln.save(gene_targets)
file.add_labels(gene_targets, feature="gene_target")
💡 using global setting species = human
✅ created 23 Gene records from Bionty matching symbol: IFNGR1, CAV1, IRF7, ATF2, NFKBIA, STAT1, SPI1, JAK2, STAT2, IFNGR2, CD86, STAT5A, SMAD4, ETV7, IRF1, UBE2L6, PDCD1LG2, BRD4, POU2F2, STAT3, ...
✅ created 1 Gene record from Bionty matching synonyms: MARCH8
❗ ambiguous validation in Bionty for 4 records: MARCHF8, IRF7, IFNGR2, TNFRSF14
did not create Gene record for 1 non-validated symbol: NT
✅ linked feature 'gene_target' to registry 'bionty.Gene'
nt = ln.Label(name="NT", description="Non-targeting control of perturbations")
nt.save()
file.add_labels(nt, feature="gene_target")
✅ linked feature 'gene_target' to registry 'core.Label'
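
The gene_target column mixes gene symbols with the control value NT, which is why the feature ends up linked to two registries. A rough sketch of that partitioning in plain Python, assuming a hypothetical `KNOWN_GENE_SYMBOLS` set in place of the gene registry and a hypothetical `partition_targets` helper:

```python
# Toy stand-in for the gene registry (assumption for illustration only).
KNOWN_GENE_SYMBOLS = {"IFNGR1", "STAT1", "JAK2", "CD86"}


def partition_targets(values, known_symbols=KNOWN_GENE_SYMBOLS):
    """Split the unique values into known gene symbols and everything else."""
    unique = sorted(set(values))
    genes = [value for value in unique if value in known_symbols]
    others = [value for value in unique if value not in known_symbols]
    return genes, others


# A made-up column in the style of obs["gene_target"].
gene_target = ["STAT1", "NT", "JAK2", "NT", "CD86", "STAT1"]

genes, others = partition_targets(gene_target)
print(genes)   # values that would be linked as gene records
print(others)  # values that need a generic label, such as the control
```

In the notebook, the first group corresponds to the Gene records and the second to the NT label.
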
for col in ["orig.ident", "perturbation", "replicate", "Phase", "guide_ID"]:
    labels = [ln.Label(name=name) for name in obs[col].unique()]
    ln.save(labels)
✅ loaded record with exact same name 

Because none of these labels seem like something we’d want to track in the registry or validate, we don’t link them to the file.

file.features
'rna': FeatureSet(id='sxLn4A6nx5RHLZTRLEei', n=184, type='float', registry='bionty.Gene', hash='Y8lsRtXCZKyPPberKAF0', updated_at=2023-08-28 14:46:42, created_by_id='DzTjkKse')
'adt': FeatureSet(id='T8n5iDCAO6MUgEpmCHbR', n=4, type='float', registry='bionty.CellMarker', hash='b-CtyjgPRO0WN27lTOqC', updated_at=2023-08-28 14:46:42, created_by_id='DzTjkKse')
'obs': FeatureSet(id='WaBf3ZlcyQ74RtoNuVKJ', n=19, registry='core.Feature', hash='XM3kBOn5YRTdCTeqd2c9', updated_at=2023-08-28 14:46:42, created_by_id='DzTjkKse')
file.describe()
💡 File(id='QoRgfBKFJ3n4POH9yDp5', key=None, suffix='.h5mu', accessor='MuData', description='Sub-sampled MuData from Papalexi21', version=None, size=606320, hash='RaivS3NesDOP-6kNIuaC3g', hash_type='md5', created_at=2023-08-28 14:46:37, updated_at=2023-08-28 14:46:37)

Provenance:
    🗃️ storage: Storage(id='30snRDOE', root='/home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal', type='local', updated_at=2023-08-28 14:46:35, created_by_id='DzTjkKse')
    💫 transform: Transform(id='yMWSFirS6qv2z8', name='Validate & register multi-modal data', short_name='multimodal', version='0', type=notebook, updated_at=2023-08-28 14:46:37, created_by_id='DzTjkKse')
    👣 run: Run(id='zJrAYqSzpdptu7ALm1BH', run_at=2023-08-28 14:46:37, transform_id='yMWSFirS6qv2z8', created_by_id='DzTjkKse')
    👤 created_by: User(id='DzTjkKse', handle='testuser1', email='testuser1@lamin.ai', name='Test User1', updated_at=2023-08-28 14:46:35)
Features:
  adt:
    🔗 index (4, bionty.CellMarker.id): ['kbrA7wdDuqDK', 'L0m6f7FPiDeg', 'BK30rjK34sZd', '82nG0xqSuEQD'...]
  rna:
    🔗 index (184, bionty.Gene.id): ['eLT0qMWy0Su3', 'DSCx18klY50i', '00LbYBJSgEZv', 'BPdrl1gzhIge', 'UVQiCjuKYOxS'...]
  obs (metadata):
    🔗 gene_target (bionty.Gene|core.Label)
        🔗 gene_target (28, bionty.Gene): ['TNFRSF14', 'UBE2L6', 'POU2F2', 'NFKBIA', 'STAT3']
        🔗 gene_target (1, core.Label): ['NT']
file.view_lineage()
https://d33wubrfki0l68.cloudfront.net/8a1ee57dc9b4d05790a43b70e1e972dc90eb6154/234e0/_images/eb463cedf30a6ce49e32ed1e7108f8fccdf3e0a1f3469fc5939b512fb9ffbb4a.svg
!lamin delete --force test-multimodal
!rm -r test-multimodal
💡 deleting instance testuser1/test-multimodal
✅     deleted instance settings file: /home/runner/.lamin/instance--testuser1--test-multimodal.env
✅     instance cache deleted
✅     deleted '.lndb' sqlite file
❗     consider manually deleting your stored data: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-multimodal