Vertical integration for paired datasets

uniPort also involves a vertical integration method for paired-cell datasets. Here we use uniPort to integrates paired SNARE-seq CellLineMixture datasets from GSE126074.

[1]:
import uniport as up
import numpy as np
import pandas as pd
import scanpy as sc
import episcanpy as epi
from sklearn.preprocessing import MinMaxScaler
[2]:
labels = pd.read_csv('snare/cell_line_meta.txt', sep='\t')
celltype = labels['cell_line'].values
[3]:
adata_peaks = up.load_file('snare/GSE126074_CellLineMixture_SNAREseq_chromatin_counts.tsv')
adata_rna = up.load_file('snare/GSE126074_CellLineMixture_SNAREseq_cDNA_counts.tsv')
print(adata_peaks)
print(adata_rna)
AnnData object with n_obs × n_vars = 1047 × 136771
AnnData object with n_obs × n_vars = 1047 × 18666
[4]:
adata_peaks.obs['cell_type'] = celltype
adata_peaks.obs['domain_id'] = 0
adata_peaks.obs['domain_id'] = adata_peaks.obs['domain_id'].astype('category')
adata_peaks.obs['source'] = 'ATAC'

adata_rna.obs['cell_type'] = celltype
adata_rna.obs['domain_id'] = 1
adata_rna.obs['domain_id'] = adata_rna.obs['domain_id'].astype('category')
adata_rna.obs['source'] = 'RNA'

Preprocess scATAC-seq peaks. Select 2,000 highly variable peaks.

[5]:
adata_peaks.X[adata_peaks.X>1] = 1
epi.pp.select_var_feature(adata_peaks, nb_features=2000, show=False, copy=False)
sc.pp.normalize_total(adata_peaks)
up.batch_scale(adata_peaks)
print(adata_peaks)
AnnData object with n_obs × n_vars = 1047 × 2070
    obs: 'cell_type', 'domain_id', 'source'
    var: 'n_cells', 'prop_shared_cells', 'variability_score'

Preprocess scRNA-seq peaks. Select 2,000 highly variable genes.

[6]:
sc.pp.normalize_total(adata_rna)
sc.pp.log1p(adata_rna)
sc.pp.highly_variable_genes(adata_rna, n_top_genes=2000, inplace=False, subset=True)
up.batch_scale(adata_rna)
print(adata_rna)
AnnData object with n_obs × n_vars = 1047 × 2000
    obs: 'cell_type', 'domain_id', 'source'
    var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'

Project scATAC-seq into latent space with the help of scRNA-seq. The latent representations of scATAC-seq are stored at adata_atac.obs['latent']

[7]:
adata_peaks = up.Run(adatas=[adata_peaks, adata_rna], mode='v', lr=0.001, iteration=10000)
Dataset 0: ATAC
AnnData object with n_obs × n_vars = 1047 × 2070
    obs: 'cell_type', 'domain_id', 'source'
    var: 'n_cells', 'prop_shared_cells', 'variability_score'
Dataset 1: RNA
AnnData object with n_obs × n_vars = 1047 × 2000
    obs: 'cell_type', 'domain_id', 'source'
    var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'
Reference dataset is dataset 1


Epochs: 100%|████████████████████████████| 2500/2500 [16:26<00:00,  2.53it/s, recon_loss=239.627,kl_loss=7.374]

Perform UMAP visualization for scATAC before vertical integration.

[8]:
sc.pp.pca(adata_peaks)
sc.pp.neighbors(adata_peaks)
sc.tl.umap(adata_peaks, min_dist=0.1)
sc.pl.umap(adata_peaks, color=['source', 'cell_type'])
... storing 'cell_type' as categorical
... storing 'source' as categorical
../_images/examples_vertical_snare_cell_line_12_1.png

Perform UMAP visualization for scRNA.

[9]:
sc.pp.pca(adata_rna)
sc.pp.neighbors(adata_rna)
sc.tl.umap(adata_rna, min_dist=0.1)
sc.pl.umap(adata_rna, color=['source', 'cell_type'])
... storing 'cell_type' as categorical
... storing 'source' as categorical
../_images/examples_vertical_snare_cell_line_14_1.png

Perform UMAP visualization for scATAC latent representations after vertical integration. The cell types of scATAC-seq now are more distinguishable than before.

[10]:
sc.pp.neighbors(adata_peaks, use_rep='latent')
sc.tl.umap(adata_peaks, min_dist=0.1)
sc.pl.umap(adata_peaks, color=['source', 'cell_type'])
../_images/examples_vertical_snare_cell_line_16_0.png
[ ]: