Integrate BRCA with scRNA

uniPort can output a global optimal transport (OT) plan, i.e., a cell-cell correspondence matrix, that provides flexible transfer learning for deconvolution of spatial heterogeneous data using scRNA data in OT space, instead of embedding latent space.

[1]:

import uniport as up
import scanpy as sc
import pandas as pd

Read microarray-based ST data of HER2-positive breast cancer (BRCA), containing diffusely infiltrating cells that make it more diﬀicult to deconvolute spots. In total, 2,518 spots with 17,943 genes and 100,064 cells with 29,733 genes were used for integration.

[2]:

labels_rna = pd.read_csv('Whole_miniatlas_meta.csv', sep=',')
celltype = labels_rna['celltype_major'].values
print(celltype)

['Endothelial' 'Endothelial' 'Endothelial' ... 'Myeloid' 'Myeloid'
 'Myeloid']

[3]:

# RNA = sc.read_10x_mtx('sc/')
# RNA.write('RNA.h5ad', compression='gzip')
rna = sc.read('RNA.h5ad')
spot = up.load_file('BRCA_SPOT_Count.txt')

Add domain_id, cell_type and source obs to AnnData.

[4]:

spot.obs['domain_id'] = 0
spot.obs['domain_id'] = spot.obs['domain_id'].astype('category')
spot.obs['source'] = 'SPOT'

rna.obs['cell_type'] = celltype
rna.obs['domain_id'] = 1
rna.obs['domain_id'] = rna.obs['domain_id'].astype('category')
rna.obs['source'] = 'RNA'

Concatenate SPOT and scRNA-seq with common genes using AnnData.concatenate.

[5]:

adata_cm = spot.concatenate(rna, join='inner', batch_key='domain_id')

Preprocess data with common genes. Select 2,000 highly variable common genes.

Scale data using batch_scale function in uniport (modified from SCALEX).

[6]:

sc.pp.normalize_total(adata_cm)
sc.pp.log1p(adata_cm)
sc.pp.highly_variable_genes(adata_cm, n_top_genes=2000, batch_key='domain_id', inplace=False, subset=True)
up.batch_scale(adata_cm)
print(adata_cm)

... storing 'source' as categorical
... storing 'cell_type' as categorical

AnnData object with n_obs × n_vars = 102582 × 2000
    obs: 'domain_id', 'source', 'cell_type'
    var: 'gene_ids-1', 'feature_types-1', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'log1p', 'hvg'

Preprocess SPOT data. Select 2,000 highly variable genes as SPOT speicifc.

[7]:

sc.pp.normalize_total(spot)
sc.pp.log1p(spot)
sc.pp.highly_variable_genes(spot, n_top_genes=2000, inplace=False, subset=True)
up.batch_scale(spot)
print(spot)

AnnData object with n_obs × n_vars = 2518 × 2000
    obs: 'domain_id', 'source'
    var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'

Preprocess scRNA-seq data. Select 2,000 highly variable genes as RNA specific.

[8]:

sc.pp.normalize_total(rna)
sc.pp.log1p(rna)
sc.pp.highly_variable_genes(rna, n_top_genes=2000, inplace=False, subset=True)
up.batch_scale(rna)
print(rna)

AnnData object with n_obs × n_vars = 100064 × 2000
    obs: 'cell_type', 'domain_id', 'source'
    var: 'gene_ids', 'feature_types', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'

[9]:

spot.write('spot_processed.h5ad', compression='gzip')
rna.write('rna_processed.h5ad', compression='gzip')
adata_cm.write('rna_and_spot.h5ad', compression='gzip')

... storing 'source' as categorical
... storing 'cell_type' as categorical
... storing 'source' as categorical

Integrate the SPOT and scRNA-seq data using both common and dataset-specific genes by Run function in uniport. Set save_OT=True and return a OT plan, which is a SPOT by RNA probabilistic matching matrix.

[10]:

adata, OT = up.Run(adatas=[spot,rna], adata_cm=adata_cm, save_OT=True)

Dataset 0: SPOT
AnnData object with n_obs × n_vars = 2518 × 2000
    obs: 'domain_id', 'source'
    var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'
Dataset 1: RNA
AnnData object with n_obs × n_vars = 100064 × 2000
    obs: 'cell_type', 'domain_id', 'source'
    var: 'gene_ids', 'feature_types', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'
Reference dataset is dataset 1


Data with common HVG
AnnData object with n_obs × n_vars = 102582 × 2000
    obs: 'domain_id', 'source', 'cell_type'
    var: 'gene_ids-1', 'feature_types-1', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
    uns: 'log1p', 'hvg'


Warning! Saving Optimal Transport plan needs extra 4.03 GB memory, please set save_OT=False if no enough memory!
float32
Size of transport plan between datasets 0 and 1: (2518, 100064)

Epochs: 100%|███████████████████████████████████████████| 75/75 [12:05<00:00,  9.67s/it, recloss=409.33,klloss=6.93,otloss=10.03]

Save OT plan for deconvolution.

[12]:

name_idx = adata_cm[adata_cm.obs['source']=='SPOT'].obs_names
name_col = adata_cm[adata_cm.obs['source']=='RNA'].obs_names
OT_pd = pd.DataFrame(OT[0], index=name_idx, columns=name_col)
OT_pd.to_csv('OT_BRCA.txt', sep='\t')