Integrate BRCA with scRNA
uniPort can output a global optimal transport (OT) plan, i.e., a cell-cell correspondence matrix, that provides flexible transfer learning for deconvolution of spatial heterogeneous data using scRNA data in OT space, instead of embedding latent space.
[1]:
import uniport as up
import scanpy as sc
import pandas as pd
Read microarray-based ST data of HER2-positive breast cancer (BRCA), containing diffusely infiltrating cells that make it more difficult to deconvolute spots. In total, 2,518 spots with 17,943 genes and 100,064 cells with 29,733 genes were used for integration.
[2]:
labels_rna = pd.read_csv('Whole_miniatlas_meta.csv', sep=',')
celltype = labels_rna['celltype_major'].values
print(celltype)
['Endothelial' 'Endothelial' 'Endothelial' ... 'Myeloid' 'Myeloid'
'Myeloid']
[3]:
# RNA = sc.read_10x_mtx('sc/')
# RNA.write('RNA.h5ad', compression='gzip')
rna = sc.read('RNA.h5ad')
spot = up.load_file('BRCA_SPOT_Count.txt')
Add domain_id, cell_type and source obs to AnnData.
[4]:
spot.obs['domain_id'] = 0
spot.obs['domain_id'] = spot.obs['domain_id'].astype('category')
spot.obs['source'] = 'SPOT'
rna.obs['cell_type'] = celltype
rna.obs['domain_id'] = 1
rna.obs['domain_id'] = rna.obs['domain_id'].astype('category')
rna.obs['source'] = 'RNA'
Concatenate SPOT and scRNA-seq with common genes using AnnData.concatenate
.
[5]:
adata_cm = spot.concatenate(rna, join='inner', batch_key='domain_id')
batch_scale
function in uniport (modified from SCALEX).[6]:
sc.pp.normalize_total(adata_cm)
sc.pp.log1p(adata_cm)
sc.pp.highly_variable_genes(adata_cm, n_top_genes=2000, batch_key='domain_id', inplace=False, subset=True)
up.batch_scale(adata_cm)
print(adata_cm)
... storing 'source' as categorical
... storing 'cell_type' as categorical
AnnData object with n_obs × n_vars = 102582 × 2000
obs: 'domain_id', 'source', 'cell_type'
var: 'gene_ids-1', 'feature_types-1', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
uns: 'log1p', 'hvg'
Preprocess SPOT data. Select 2,000 highly variable genes as SPOT speicifc.
[7]:
sc.pp.normalize_total(spot)
sc.pp.log1p(spot)
sc.pp.highly_variable_genes(spot, n_top_genes=2000, inplace=False, subset=True)
up.batch_scale(spot)
print(spot)
AnnData object with n_obs × n_vars = 2518 × 2000
obs: 'domain_id', 'source'
var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p', 'hvg'
Preprocess scRNA-seq data. Select 2,000 highly variable genes as RNA specific.
[8]:
sc.pp.normalize_total(rna)
sc.pp.log1p(rna)
sc.pp.highly_variable_genes(rna, n_top_genes=2000, inplace=False, subset=True)
up.batch_scale(rna)
print(rna)
AnnData object with n_obs × n_vars = 100064 × 2000
obs: 'cell_type', 'domain_id', 'source'
var: 'gene_ids', 'feature_types', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p', 'hvg'
[9]:
spot.write('spot_processed.h5ad', compression='gzip')
rna.write('rna_processed.h5ad', compression='gzip')
adata_cm.write('rna_and_spot.h5ad', compression='gzip')
... storing 'source' as categorical
... storing 'cell_type' as categorical
... storing 'source' as categorical
Integrate the SPOT and scRNA-seq data using both common and dataset-specific genes by Run
function in uniport. Set save_OT=True and return a OT plan, which is a SPOT by RNA probabilistic matching matrix.
[10]:
adata, OT = up.Run(adatas=[spot,rna], adata_cm=adata_cm, save_OT=True)
Dataset 0: SPOT
AnnData object with n_obs × n_vars = 2518 × 2000
obs: 'domain_id', 'source'
var: 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p', 'hvg'
Dataset 1: RNA
AnnData object with n_obs × n_vars = 100064 × 2000
obs: 'cell_type', 'domain_id', 'source'
var: 'gene_ids', 'feature_types', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
uns: 'log1p', 'hvg'
Reference dataset is dataset 1
Data with common HVG
AnnData object with n_obs × n_vars = 102582 × 2000
obs: 'domain_id', 'source', 'cell_type'
var: 'gene_ids-1', 'feature_types-1', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection'
uns: 'log1p', 'hvg'
Warning! Saving Optimal Transport plan needs extra 4.03 GB memory, please set save_OT=False if no enough memory!
float32
Size of transport plan between datasets 0 and 1: (2518, 100064)
Epochs: 100%|███████████████████████████████████████████| 75/75 [12:05<00:00, 9.67s/it, recloss=409.33,klloss=6.93,otloss=10.03]
Save OT plan for deconvolution.
[12]:
name_idx = adata_cm[adata_cm.obs['source']=='SPOT'].obs_names
name_col = adata_cm[adata_cm.obs['source']=='RNA'].obs_names
OT_pd = pd.DataFrame(OT[0], index=name_idx, columns=name_col)
OT_pd.to_csv('OT_BRCA.txt', sep='\t')