Integrate PDAC with scRNA

uniPort can output a global optimal transport (OT) plan, i.e., a cell-cell correspondence matrix, that provides flexible transfer learning for deconvolution of spatial heterogeneous data using scRNA data in OT space, instead of embedding latent space.

[1]:

import uniport as up
import scanpy as sc
import pandas as pd

Read microarray-based ST data of pancreatic ductal adenocarcinoma (PDAC) tissues for integration, the diameter of which stretches for 100 μm. Cell-type deconvolution was applied on 428 spots paired with 1926 single cells, measuring 19,736 genes respectively.

[2]:

labels_rna = pd.read_csv('PDAC_scRNA_label.txt', sep='\t')
celltype = labels_rna['cell_type'].values
print(celltype)

['Acinar cells' 'Ductal' 'Ductal' ... 'Ductal' 'pDCs' 'RBCs']

[3]:

rna = sc.read('PDAC_scRNA.txt').transpose()
spot = up.load_file('PDAC_SPOT.txt')

Add domain_id, cell_type and source obs to AnnData.

[4]:

spot.obs['domain_id'] = 0
spot.obs['domain_id'] = spot.obs['domain_id'].astype('category')
spot.obs['source'] = 'SPOT'

rna.obs['cell_type'] = celltype
rna.obs['domain_id'] = 1
rna.obs['domain_id'] = rna.obs['domain_id'].astype('category')
rna.obs['source'] = 'RNA'

Concatenate SPOT and scRNA-seq with common genes using AnnData.concatenate.

[5]:

adata_cm = spot.concatenate(rna, join='inner', batch_key='domain_id')

Preprocess data with common genes. Select 2,000 highly variable common genes.

Scale data using batch_scale function in uniport (modified from SCALEX).

[6]:

sc.pp.normalize_total(adata_cm)
sc.pp.log1p(adata_cm)
sc.pp.highly_variable_genes(adata_cm, n_top_genes=2000, batch_key='domain_id', inplace=False, subset=True)
up.batch_scale(adata_cm)

... storing 'source' as categorical
... storing 'cell_type' as categorical

Preprocess SPOT data. Select 2,000 highly variable genes as SPOT speicifc.

[7]:

sc.pp.normalize_total(spot)
sc.pp.log1p(spot)
sc.pp.highly_variable_genes(spot, n_top_genes=2000, inplace=False, subset=True)
up.batch_scale(spot)

Preprocess scRNA-seq data. Select 2,000 highly variable genes as RNA specific.

[8]:

sc.pp.normalize_total(rna)
sc.pp.log1p(rna)
sc.pp.highly_variable_genes(rna, n_top_genes=2000, inplace=False, subset=True)
up.batch_scale(rna)

Integrate the SPOT and scRNA-seq data using both common and dataset-specific genes by Run function in uniport. Set save_OT=True and return a OT plan, which is a SPOT by RNA probabilistic matching matrix.

[9]:

adata, OT = up.Run(adatas=[spot,rna], adata_cm=adata_cm, save_OT=True)

Dataset 0: SPOT
AnnData object with n_obs × n_vars = 428 × 2000
    obs: 'domain_id', 'source'
    var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'
Dataset 1: RNA
AnnData object with n_obs × n_vars = 1926 × 2000
    obs: 'cell_type', 'domain_id', 'source'
    var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'
Reference dataset is dataset 1


Data with common HVG
AnnData object with n_obs × n_vars = 2354 × 2000
    obs: 'domain_id', 'source', 'cell_type'
    var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'log1p', 'hvg'


Warning! Saving Optimal Transport plan needs extra 0.01 GB memory, please set save_OT=False if no enough memory!
float32
Size of transport plan between datasets 0 and 1: (428, 1926)

Epochs:  77%|████████████████▊     | 2554/3334 [24:20<07:26,  1.75it/s, recloss=358.15,klloss=6.82,otloss=8.00]

EarlyStopping: run 2555 epoch

Save OT plan for deconvolution.

[10]:

name_idx = adata_cm[adata_cm.obs['source']=='SPOT'].obs_names
name_col = adata_cm[adata_cm.obs['source']=='RNA'].obs_names
OT_pd = pd.DataFrame(OT[0], index=name_idx, columns=name_col)
OT_pd.to_csv('/data/PDAC/OT_PDAC.txt', sep='\t')