CAMEX can achieve more accurate integration and annotation in both closely related and distant species

This tutorial demonstrates that CAMEX can achieve more accurate integration and annotation in both closely related and distant species.

Here, we use scRNA-seq data collected from four species: adult human visual cortex, frontal cortex, and cerebellum; mouse neocortex; and lizard and turtle pallium. The processed h5ad files can be downloaded from https://drive.google.com/drive/folders/1rwdjEvWFEFw82a0x2JzMi2jXICbUc5eb?usp=sharing

[1]:
import warnings
warnings.filterwarnings("ignore")
[3]:
import os
import time
import torch
import shutil
import warnings
import argparse
import importlib
import scanpy as sc

import pandas as pd
import numpy as np

from CAMEX.base import Dataset
from CAMEX.trainer import Trainer
[4]:
from params import PARAMS
[5]:
t1 = time.time()

make the log directory

[6]:
time_start = time.strftime("%Y-%m-%d-%H-%M-%S")
log_path = f'./log/{time_start}/'
for k, v in PARAMS.items():
    v['time_start'] = time_start
    v['log_path'] = log_path
print(log_path)
./log/2025-06-14-10-39-23/
[7]:
os.makedirs(log_path, exist_ok=True)
shutil.copy('params.py', log_path + 'params_current.py')
print(f'time: {time_start}')
time: 2025-06-14-10-39-23

preprocess scRNA-seq data to construct a heterogeneous graph of cells and genes

[8]:
#  —————————————————————————————————— 1 preprocess
print('start preprocess')
dataset = Dataset(**PARAMS['preprocess'])
# torch.save(dataset, log_path + 'dataset_preprocessed.pt')
# dataset = torch.load(f'{args.path}/log/2023-06-06-09-02-45/dataset_preprocessed.pt')
adata_CAMEX = dataset.adata_whole
dgl_data = dataset.dgl_data
start preprocess
                                raw-brain-human-Lake: reference  raw-brain-mouse-Chen: query  raw-brain-lizard-Tosches: query  raw-brain-turtle-Tosches: query
excitatory neuron                                       14747.0                        906.0                           1910.0                           7151.0
inhibitory neuron                                        6808.0                       1392.0                            242.0                           1490.0
oligodendrocyte                                          4369.0                       3724.0                            551.0                            155.0
cerebellar granule cell                                  3298.0                          NaN                              NaN                              NaN
astrocyte                                                2524.0                       1757.0                            520.0                           6514.0
oligodendrocyte precursor cell                           1358.0                       1792.0                            398.0                           1862.0
Purkinje cell                                            1001.0                          NaN                              NaN                              NaN
microglial cell                                           756.0                        724.0                            278.0                            589.0
endothelial cell                                          219.0                       1197.0                              NaN                              NaN
brain pericyte                                            209.0                          NaN                              NaN                            114.0
ependymal cell                                              NaN                        413.0                              NaN                              NaN
macrophage                                                  NaN                        167.0                              NaN                              NaN
neural progenitor cell                                      NaN                          NaN                            133.0                            717.0
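
The table above lists the number of cells per cell type in each dataset, with NaN where a cell type is absent from a dataset. A similar table can probably be rebuilt from the `batch` and `cell_ontology_class` columns of `adata_CAMEX.obs`. The sketch below uses a small toy DataFrame with shortened batch names; it is an illustration, not necessarily how `Dataset` produces its printout.

```python
import pandas as pd

# Toy stand-in for adata_CAMEX.obs (hypothetical values, shortened batch names)
obs = pd.DataFrame({
    'batch': ['human', 'human', 'human', 'mouse', 'mouse'],
    'cell_ontology_class': ['astrocyte', 'astrocyte', 'Purkinje cell',
                            'astrocyte', 'macrophage'],
})

# Count cells per (batch, cell type), then move batches into columns.
# Combinations that never occur become NaN, as in the table above.
counts = (
    obs.groupby('batch')['cell_ontology_class']
       .value_counts()
       .unstack(level=0)
)
print(counts)
```

On the real data, replacing `obs` with `adata_CAMEX.obs` should give one column per dataset.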

[9]:
print('start train')
trainer = Trainer(adata_CAMEX, dgl_data, **PARAMS['train'])
start train

integration

[10]:
trainer.integration()
--------------------------------------------- integration ---------------------------------------------
epoch: 0, loss: 88.19321090792432
epoch: 1, loss: 26.90171910509651
epoch: 2, loss: 26.61919125215507
epoch: 3, loss: 26.456716372642987
epoch: 4, loss: 26.36408836458936
epoch: 5, loss: 26.273979987627193
epoch: 6, loss: 26.22034063456971
epoch: 7, loss: 26.136304831799166
epoch: 8, loss: 26.09897893740807
epoch: 9, loss: 26.08717859527211

annotation

[11]:
trainer.annotation()
--------------------------------------------- annotation ---------------------------------------------
epoch: 0, loss: 88.01263552904129
train_acc: {'raw-brain-human-Lakecell_acc': 0.9388}, test_acc: {'raw-brain-lizard-Toschescell_acc': 0.933, 'raw-brain-mouse-Chencell_acc': 0.6207, 'raw-brain-turtle-Toschescell_acc': 0.9275}, train_ami:{'raw-brain-human-Lakecell_ami': 0.6674, 'raw-brain-lizard-Toschescell_ami': 0.7326, 'raw-brain-mouse-Chencell_ami': 0.5929, 'raw-brain-turtle-Toschescell_ami': 0.6558}, best_epoch: 0
epoch: 1, loss: 55.156198382377625
train_acc: {'raw-brain-human-Lakecell_acc': 0.9442}, test_acc: {'raw-brain-lizard-Toschescell_acc': 0.9368, 'raw-brain-mouse-Chencell_acc': 0.7285, 'raw-brain-turtle-Toschescell_acc': 0.9235}, train_ami:{'raw-brain-human-Lakecell_ami': 0.6667, 'raw-brain-lizard-Toschescell_ami': 0.7578, 'raw-brain-mouse-Chencell_ami': 0.6837, 'raw-brain-turtle-Toschescell_ami': 0.6535}, best_epoch: 1
epoch: 2, loss: 53.2356573343277
train_acc: {'raw-brain-human-Lakecell_acc': 0.9434}, test_acc: {'raw-brain-lizard-Toschescell_acc': 0.9301, 'raw-brain-mouse-Chencell_acc': 0.7244, 'raw-brain-turtle-Toschescell_acc': 0.9161}, train_ami:{'raw-brain-human-Lakecell_ami': 0.6631, 'raw-brain-lizard-Toschescell_ami': 0.7533, 'raw-brain-mouse-Chencell_ami': 0.6903, 'raw-brain-turtle-Toschescell_ami': 0.6396}, best_epoch: 1
epoch: 3, loss: 52.655654072761536
train_acc: {'raw-brain-human-Lakecell_acc': 0.9479}, test_acc: {'raw-brain-lizard-Toschescell_acc': 0.9201, 'raw-brain-mouse-Chencell_acc': 0.7352, 'raw-brain-turtle-Toschescell_acc': 0.9208}, train_ami:{'raw-brain-human-Lakecell_ami': 0.6667, 'raw-brain-lizard-Toschescell_ami': 0.7362, 'raw-brain-mouse-Chencell_ami': 0.6919, 'raw-brain-turtle-Toschescell_ami': 0.6366}, best_epoch: 1
epoch: 4, loss: 53.30202889442444
train_acc: {'raw-brain-human-Lakecell_acc': 0.9453}, test_acc: {'raw-brain-lizard-Toschescell_acc': 0.9263, 'raw-brain-mouse-Chencell_acc': 0.737, 'raw-brain-turtle-Toschescell_acc': 0.9172}, train_ami:{'raw-brain-human-Lakecell_ami': 0.6666, 'raw-brain-lizard-Toschescell_ami': 0.7458, 'raw-brain-mouse-Chencell_ami': 0.7004, 'raw-brain-turtle-Toschescell_ami': 0.644}, best_epoch: 3
epoch: 5, loss: 52.66183912754059
train_acc: {'raw-brain-human-Lakecell_acc': 0.9491}, test_acc: {'raw-brain-lizard-Toschescell_acc': 0.9209, 'raw-brain-mouse-Chencell_acc': 0.7464, 'raw-brain-turtle-Toschescell_acc': 0.9215}, train_ami:{'raw-brain-human-Lakecell_ami': 0.6694, 'raw-brain-lizard-Toschescell_ami': 0.7351, 'raw-brain-mouse-Chencell_ami': 0.7069, 'raw-brain-turtle-Toschescell_ami': 0.6424}, best_epoch: 3
epoch: 6, loss: 53.20870327949524
train_acc: {'raw-brain-human-Lakecell_acc': 0.95}, test_acc: {'raw-brain-lizard-Toschescell_acc': 0.9221, 'raw-brain-mouse-Chencell_acc': 0.7778, 'raw-brain-turtle-Toschescell_acc': 0.9246}, train_ami:{'raw-brain-human-Lakecell_ami': 0.671, 'raw-brain-lizard-Toschescell_ami': 0.7481, 'raw-brain-mouse-Chencell_ami': 0.7196, 'raw-brain-turtle-Toschescell_ami': 0.6532}, best_epoch: 6
epoch: 7, loss: 52.54552209377289
train_acc: {'raw-brain-human-Lakecell_acc': 0.9479}, test_acc: {'raw-brain-lizard-Toschescell_acc': 0.9244, 'raw-brain-mouse-Chencell_acc': 0.7353, 'raw-brain-turtle-Toschescell_acc': 0.9203}, train_ami:{'raw-brain-human-Lakecell_ami': 0.6663, 'raw-brain-lizard-Toschescell_ami': 0.742, 'raw-brain-mouse-Chencell_ami': 0.6937, 'raw-brain-turtle-Toschescell_ami': 0.637}, best_epoch: 6
epoch: 8, loss: 52.27011561393738
train_acc: {'raw-brain-human-Lakecell_acc': 0.946}, test_acc: {'raw-brain-lizard-Toschescell_acc': 0.9164, 'raw-brain-mouse-Chencell_acc': 0.6963, 'raw-brain-turtle-Toschescell_acc': 0.9093}, train_ami:{'raw-brain-human-Lakecell_ami': 0.6634, 'raw-brain-lizard-Toschescell_ami': 0.7229, 'raw-brain-mouse-Chencell_ami': 0.6655, 'raw-brain-turtle-Toschescell_ami': 0.6217}, best_epoch: 6
epoch: 9, loss: 52.543911933898926
train_acc: {'raw-brain-human-Lakecell_acc': 0.9468}, test_acc: {'raw-brain-lizard-Toschescell_acc': 0.9216, 'raw-brain-mouse-Chencell_acc': 0.7345, 'raw-brain-turtle-Toschescell_acc': 0.9213}, train_ami:{'raw-brain-human-Lakecell_ami': 0.6657, 'raw-brain-lizard-Toschescell_ami': 0.7396, 'raw-brain-mouse-Chencell_ami': 0.701, 'raw-brain-turtle-Toschescell_ami': 0.6424}, best_epoch: 8
[12]:
adata_CAMEX.write_h5ad(log_path + 'adata_CAMEX.h5ad', compression='gzip')
[13]:
t2 = time.time()
[14]:
print(f'time usage: {round(t2-t1)} seconds')
time usage: 1841 seconds

analysis

[15]:
log_path
[15]:
'./log/2025-06-14-10-39-23/'
[16]:
adata_CAMEX = sc.read_h5ad(log_path + 'adata_CAMEX.h5ad')
adata_CAMEX
[16]:
AnnData object with n_obs × n_vars = 69985 × 2000
    obs: 'cell_ontology_class', 'cell_ontology_id', 'cell_type1', 'dataset_name', 'donor', 'organ', 'organism', 'platform', 'region', 'tSNE1', 'tSNE2', 'batch', 'n_genes_by_counts', 'total_counts', 'cell_ontology_class_num', 'cell_class', 'cell_class_num'
    var: 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection', 'mean', 'std'
    uns: 'cell_type', 'data_order', 'dataset_description', 'dataset_type', 'hvg', 'log1p', 'neighbors', 'pca'
    obsm: 'X_CAMEX_Annotation', 'X_CAMEX_Annotation_eval', 'X_CAMEX_Integration', 'X_pca', 'cell_train_class'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'

integration

[47]:
adata_CAMEX.obs.head(5)
[47]:
cell_ontology_class cell_ontology_id cell_type1 dataset_name donor organ organism platform region tSNE1 tSNE2 batch n_genes_by_counts total_counts cell_ontology_class_num cell_class cell_class_num cell_type_pred
index
Gran_cbm1_TTAATCAGTCGC cerebellar granule cell CL:0001031 Gran Lake_2018 1 Brain Homo sapiens snDrop-seq cbm -26.725456 -30.189707 raw-brain-human-Lake 708 1090.0 3 cerebellar granule cell 3 cerebellar granule cell
Gran_cbm1_ACAACGACATCC cerebellar granule cell CL:0001031 Gran Lake_2018 1 Brain Homo sapiens snDrop-seq cbm -31.022097 -22.272980 raw-brain-human-Lake 646 915.0 3 cerebellar granule cell 3 cerebellar granule cell
Gran_cbm1_TATGTCTATATG cerebellar granule cell CL:0001031 Gran Lake_2018 1 Brain Homo sapiens snDrop-seq cbm -35.684959 -24.031462 raw-brain-human-Lake 701 1053.0 3 cerebellar granule cell 3 cerebellar granule cell
Gran_cbm1_TAATGGAAAATA cerebellar granule cell CL:0001031 Gran Lake_2018 1 Brain Homo sapiens snDrop-seq cbm -25.886816 -33.706535 raw-brain-human-Lake 715 1115.0 3 cerebellar granule cell 3 cerebellar granule cell
Gran_cbm1_CTGGACTACAGC cerebellar granule cell CL:0001031 Gran Lake_2018 1 Brain Homo sapiens snDrop-seq cbm -25.068712 -36.672504 raw-brain-human-Lake 651 960.0 3 cerebellar granule cell 3 cerebellar granule cell
[48]:
adata_CAMEX.obsm['X_CAMEX_Integration'].shape
[48]:
(69985, 128)
[49]:
sc.pp.neighbors(adata_CAMEX, use_rep='X_CAMEX_Integration')
[50]:
sc.tl.umap(adata_CAMEX)
[51]:
palette_batch = {'raw-brain-human-Lake': '#FB7800', 'raw-brain-mouse-Chen': '#D62728',
                 'raw-brain-lizard-Tosches': '#31C4C9', 'raw-brain-turtle-Tosches': '#894EA1'}
[52]:
palette_cell = {'cerebellar granule cell': '#D1352B',
                'ependymal cell': '#9B5B33',
                'endothelial cell': '#EE934E',
                'brain pericyte': '#FFFF32',
                'astrocyte': '#CCCC33',
                'oligodendrocyte': '#BBDD78',
                'oligodendrocyte precursor cell': '#7DBFA7',
                'macrophage': '#3C77AF',
                'excitatory neuron': '#AECDE1',
                'inhibitory neuron': '#F5CFE4',
                'neural progenitor cell': '#A71AB9',
                'microglial cell': '#B383B9',
                'Purkinje cell': '#D67475',}
[53]:
sc.pl.umap(adata_CAMEX, color=['batch'], wspace=0.6, palette=palette_batch)
_images/integration_annotation_in_relatives_distant_species_28_0.png
[54]:
sc.pl.umap(adata_CAMEX, color=['cell_ontology_class'], wspace=0.6, palette=palette_cell)
_images/integration_annotation_in_relatives_distant_species_29_0.png

annotation

[55]:
adata_CAMEX
[55]:
AnnData object with n_obs × n_vars = 69985 × 2000
    obs: 'cell_ontology_class', 'cell_ontology_id', 'cell_type1', 'dataset_name', 'donor', 'organ', 'organism', 'platform', 'region', 'tSNE1', 'tSNE2', 'batch', 'n_genes_by_counts', 'total_counts', 'cell_ontology_class_num', 'cell_class', 'cell_class_num', 'cell_type_pred'
    var: 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection', 'mean', 'std'
    uns: 'cell_type', 'data_order', 'dataset_description', 'dataset_type', 'hvg', 'log1p', 'neighbors', 'pca', 'umap', 'batch_colors', 'cell_ontology_class_colors', 'cell_type_pred_colors'
    obsm: 'X_CAMEX_Annotation', 'X_CAMEX_Annotation_eval', 'X_CAMEX_Integration', 'X_pca', 'cell_train_class', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'

adata_CAMEX.uns['cell_type'] is a dictionary that maps each cell type name to its numerical code

[56]:
adata_CAMEX.uns['cell_type']
[56]:
{'Purkinje cell': 6,
 'astrocyte': 4,
 'brain pericyte': 9,
 'cerebellar granule cell': 3,
 'endothelial cell': 8,
 'ependymal cell': 11,
 'excitatory neuron': 0,
 'inhibitory neuron': 1,
 'macrophage': 12,
 'microglial cell': 7,
 'neural progenitor cell': 13,
 'oligodendrocyte': 2,
 'oligodendrocyte precursor cell': 5,
 'unknown': 10}
[57]:
cell_type_d = {v: k for k, v in adata_CAMEX.uns['cell_type'].items()}
cell_type_d
[57]:
{6: 'Purkinje cell',
 4: 'astrocyte',
 9: 'brain pericyte',
 3: 'cerebellar granule cell',
 8: 'endothelial cell',
 11: 'ependymal cell',
 0: 'excitatory neuron',
 1: 'inhibitory neuron',
 12: 'macrophage',
 7: 'microglial cell',
 13: 'neural progenitor cell',
 2: 'oligodendrocyte',
 5: 'oligodendrocyte precursor cell',
 10: 'unknown'}
[ ]:
import torch
import numpy as np
[58]:
y_true = adata_CAMEX.obs['cell_ontology_class_num'].to_numpy().astype(int)  # numerical codes of manually annotated cell types (ground truth); np.int was removed in NumPy 1.24
y_prob = torch.nn.Softmax(dim=-1)(torch.tensor(adata_CAMEX.obsm['cell_train_class'])).numpy()  # apply softmax to model outputs (logits) to obtain probability distributions
y_pred = np.argmax(y_prob, axis=-1).astype(int)  # get the predicted class index with the highest probability
adata_CAMEX.obs.loc[:, 'cell_type_pred'] = [cell_type_d[item] for item in y_pred]  # map predicted class indices to cell type names and store them in adata.obs
adata_CAMEX
[58]:
AnnData object with n_obs × n_vars = 69985 × 2000
    obs: 'cell_ontology_class', 'cell_ontology_id', 'cell_type1', 'dataset_name', 'donor', 'organ', 'organism', 'platform', 'region', 'tSNE1', 'tSNE2', 'batch', 'n_genes_by_counts', 'total_counts', 'cell_ontology_class_num', 'cell_class', 'cell_class_num', 'cell_type_pred'
    var: 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'highly_variable_nbatches', 'highly_variable_intersection', 'mean', 'std'
    uns: 'cell_type', 'data_order', 'dataset_description', 'dataset_type', 'hvg', 'log1p', 'neighbors', 'pca', 'umap', 'batch_colors', 'cell_ontology_class_colors', 'cell_type_pred_colors'
    obsm: 'X_CAMEX_Annotation', 'X_CAMEX_Annotation_eval', 'X_CAMEX_Integration', 'X_pca', 'cell_train_class', 'X_umap'
    varm: 'PCs'
    layers: 'counts'
    obsp: 'connectivities', 'distances'
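
The prediction step above (softmax over the logits, argmax over classes, then a lookup in the code-to-name dictionary) can be illustrated on a toy example. The sketch below uses NumPy instead of torch and made-up logits for two cells and three classes; the three codes match the first entries of `adata_CAMEX.uns['cell_type']`, and everything else is an assumption for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up logits: 2 cells x 3 classes (hypothetical values)
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])

# Subset of the code-to-name mapping shown in the tutorial
code_to_name = {0: 'excitatory neuron', 1: 'inhibitory neuron', 2: 'oligodendrocyte'}

probs = softmax(logits)              # each row sums to 1
pred_codes = probs.argmax(axis=-1)   # index of the most probable class per cell
pred_names = [code_to_name[int(c)] for c in pred_codes]
print(pred_names)
```

Because softmax preserves the ordering within each row, taking the argmax of the raw logits gives the same predicted classes; the softmax is only needed when the probabilities themselves are of interest.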
[61]:
sc.pp.neighbors(adata_CAMEX, use_rep='X_CAMEX_Annotation')
[62]:
sc.tl.umap(adata_CAMEX)

Visualize the batch key in a UMAP plot

[63]:
sc.pl.umap(adata_CAMEX, color=['batch'], wspace=0.6, palette=palette_batch)
_images/integration_annotation_in_relatives_distant_species_41_0.png

Visualize the manual annotation and the CAMEX cell type prediction in a UMAP plot

[64]:
sc.pl.umap(adata_CAMEX, color=['cell_ontology_class', 'cell_type_pred'], wspace=0.4, palette=palette_cell)
_images/integration_annotation_in_relatives_distant_species_43_0.png
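
Besides the visual comparison, the agreement between `cell_ontology_class` and `cell_type_pred` can be summarized per dataset. The sketch below computes a simple per-batch accuracy on a toy DataFrame with shortened batch names. It is an assumption that this matches the `test_acc` values logged during training, since CAMEX may treat some cells differently (for example, cell types labeled as unknown).

```python
import pandas as pd

# Toy stand-in for adata_CAMEX.obs (hypothetical values, shortened batch names)
obs = pd.DataFrame({
    'batch': ['human', 'human', 'mouse', 'mouse', 'mouse', 'mouse'],
    'cell_ontology_class': ['astrocyte', 'oligodendrocyte', 'astrocyte',
                            'astrocyte', 'macrophage', 'oligodendrocyte'],
    'cell_type_pred': ['astrocyte', 'oligodendrocyte', 'astrocyte',
                       'oligodendrocyte', 'microglial cell', 'oligodendrocyte'],
})

# Compare labels as plain strings so categorical and object columns mix safely
is_correct = (obs['cell_ontology_class'].astype(str)
              == obs['cell_type_pred'].astype(str))

# Fraction of correctly predicted cells within each batch
acc_per_batch = is_correct.groupby(obs['batch']).mean()
print(acc_per_batch)
```

On the real data, replacing `obs` with `adata_CAMEX.obs` should give one accuracy value per dataset.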