Spatial transcriptomics (ST) enables the study of gene expression within its spatial context in histopathology samples. To date, a limiting factor has been the resolution of sequencing-based ST products. The introduction of the Visium High Definition (HD) technology opens the door to cell-resolution ST studies. However, challenges remain in the ability to accurately map transcripts to cells and in cell type assignment based on spot data.
ENACT is the first tissue-agnostic pipeline that integrates advanced cell segmentation with Visium HD transcriptomics data to infer cell types across whole tissue sections. Our pipeline incorporates novel bin-to-cell assignment methods, enhancing the accuracy of single-cell transcript estimates. Validated on diverse synthetic and real datasets, our approach demonstrates high effectiveness at predicting cell types and scalability, offering a robust solution for spatially resolved transcriptomics analysis.
This repository has the code for inferring cell types from the sub-cellular transcript counts provided by VisiumHD.
This can be achieved through the following steps:
- Cell segmentation: segment the high-resolution image using neural-network-based image segmentation models such as Stardist.
- Bin-to-cell assignment: obtain cell-wise transcript counts by aggregating the VisiumHD bins associated with each cell.
- Cell type inference: use the cell-wise transcript counts to infer cell labels/phenotypes, either with methods developed for single-cell RNA-seq analysis (CellAssign, CellTypist, or Sargent if installed) or with novel approaches, using comprehensive cell marker databases (such as Panglao or CellMarker) as reference.
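To make the bin-to-cell assignment step concrete, here is a minimal sketch of the simplest possible aggregation: a bin is given to a cell when the bin center falls inside the cell, and the counts of all bins given to a cell are summed. This is an illustration only, not ENACT's implementation. Cells are modelled as circles for simplicity, and the names Cell, Bin, and aggregate_bins_to_cells are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from math import hypot


@dataclass
class Cell:
    """Hypothetical stand-in for a segmented cell: a circle, in micrometers."""
    cell_id: str
    x: float
    y: float
    radius: float


@dataclass
class Bin:
    """One 2 um x 2 um bin: its center and its per-gene transcript counts."""
    x: float
    y: float
    counts: dict


def aggregate_bins_to_cells(cells, bins):
    """Sum the counts of every bin whose center lies inside a cell.

    A bin whose center lies inside several cells is given to the nearest one;
    bins outside all cells are dropped.
    """
    totals = {cell.cell_id: Counter() for cell in cells}
    for b in bins:
        # Distance from the bin center to each cell that contains it.
        hits = [
            (hypot(b.x - c.x, b.y - c.y), c.cell_id)
            for c in cells
            if hypot(b.x - c.x, b.y - c.y) <= c.radius
        ]
        if hits:
            _, nearest = min(hits)
            totals[nearest].update(b.counts)
    return {cell_id: dict(counter) for cell_id, counter in totals.items()}


cells = [Cell("cell_0", 5.0, 5.0, 4.0), Cell("cell_1", 20.0, 5.0, 4.0)]
bins = [
    Bin(5.0, 5.0, {"EPCAM": 2}),
    Bin(7.0, 5.0, {"EPCAM": 1, "CDH1": 1}),
    Bin(19.0, 5.0, {"PTPRC": 3}),
    Bin(12.0, 5.0, {"CDH1": 5}),  # lies between the two cells, so it is dropped
]
cell_counts = aggregate_bins_to_cells(cells, bins)
print(cell_counts)
```

Real cell outlines are polygons and a bin can straddle a cell boundary, which is why ENACT also offers weighted assignment methods.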
Note
Sargent is not currently available on GitHub. For information on how to access Sargent (doi: https://doi.org/10.1016/j.mex.2023.102196), please contact the paper's corresponding authors (nima.nouri@sanofi.com). We provide the results obtained by Sargent on ENACT's Zenodo page under the following folders:
- ENACT_supporting_files/public_data/human_colorectal/paper_results/chunks/naive/sargent_results/
- ENACT_supporting_files/public_data/human_colorectal/paper_results/chunks/weighted_by_area/sargent_results/
- ENACT_supporting_files/public_data/human_colorectal/paper_results/chunks/weighted_by_transcript/sargent_results/
- ENACT_supporting_files/public_data/human_colorectal/paper_results/chunks/weighted_by_cluster/sargent_results/
- System Requirements
- Install ENACT from Source
- Install ENACT with Pip
- Input Files for ENACT
- Defining ENACT Configurations
- Output Files for ENACT
- Running ENACT from Notebook
- Running ENACT from Terminal
- Running Instructions
- Visualizing Results on TissUUmaps
- Reproducing Paper Results
- Creating Synthetic VisiumHD Datasets
- Citing ENACT
ENACT was tested with the following specifications:
- Hardware Requirements: 32 CPUs, 64 GB RAM, 100 GB hard disk (disk and memory requirements vary with the size of the whole slide image; smaller images need significantly less memory)
- Software: Python 3.9, (Optional) GPU (CUDA 11)
git clone https://github.com/Sanofi-OneAI/oneai-dda-spatialtr-enact.git
cd oneai-dda-spatialtr-enact
Start by defining the location and the name of the Conda environment in the Makefile:
ENV_DIR := /home/oneai/envs/ <---- Conda environment location
PY_ENV_NAME := enact_py_env <---- Conda environment name
Next, run the following Make command to create a Conda environment with all of ENACT's dependencies:
make setup_py_env
ENACT can be installed from PyPI using:
pip install enact-SO
ENACT requires only three files, which can be obtained from SpaceRanger’s outputs for each experiment:
- Whole resolution tissue image. This will be segmented to obtain the cell boundaries that will be used to aggregate the transcript counts.
- tissue_positions.parquet. This is the file that specifies the 2um Visium HD bin locations relative to the full resolution image.
- filtered_feature_bc_matrix.h5. This is the .h5 file with the 2um Visium HD bin counts.
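Because a run on a whole slide image can take a long time, it may help to confirm that all three inputs exist before starting. The sketch below is a small pre-flight check using only the standard library. The function name missing_inputs is hypothetical, and the argument names simply mirror the ENACT configuration keys; the example paths assume the 10X Genomics colon cancer sample has been downloaded into the working directory.

```python
from pathlib import Path


def missing_inputs(wsi_path, visiumhd_h5_path, tissue_positions_path):
    """Return the names of the ENACT inputs that do not exist on disk."""
    inputs = {
        "wsi_path": wsi_path,
        "visiumhd_h5_path": visiumhd_h5_path,
        "tissue_positions_path": tissue_positions_path,
    }
    return [name for name, path in inputs.items() if not Path(path).is_file()]


missing = missing_inputs(
    "Visium_HD_Human_Colon_Cancer_tissue_image.btf",
    "binned_outputs/square_002um/filtered_feature_bc_matrix.h5",
    "binned_outputs/square_002um/spatial/tissue_positions.parquet",
)
if missing:
    print("Missing ENACT inputs:", ", ".join(missing))
```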
ENACT users can specify the configurations in one of two ways:
- Passing them within the class constructor:
from enact.pipeline import ENACT
so_hd = ENACT(
cache_dir="/home/oneai/test_cache",
wsi_path="Visium_HD_Human_Colon_Cancer_tissue_image.btf",
visiumhd_h5_path="binned_outputs/square_002um/filtered_feature_bc_matrix.h5",
tissue_positions_path="binned_outputs/square_002um/spatial/tissue_positions.parquet",
)
Full list of ENACT parameters:
- cache_dir (str): Directory to cache ENACT results. This must be specified by the user.
- wsi_path (str): Path to the Whole Slide Image (WSI) file. This must be provided by the user.
- visiumhd_h5_path (str): Path to the Visium HD h5 file containing spatial transcriptomics data. This must be provided by the user.
- tissue_positions_path (str): Path to the tissue positions file that contains spatial locations of barcodes. This must be provided by the user.
- analysis_name (str): Name of the analysis, used for output directories and results. Default: "enact_demo".
- seg_method (str): Cell segmentation method. Default: "stardist". Options: ["stardist"].
- patch_size (int): Size of patches (in pixels) to process the image. Use a smaller patch size to reduce memory requirements. Default: 4000.
- use_hvg (bool): Whether to use highly variable genes (HVG) during the analysis. Default: True. Options: [True].
- n_hvg (int): Number of highly variable genes to use if use_hvg is True. Default: 1000.
- n_clusters (int): Number of clusters. Used only if bin_to_cell_method is "weighted_by_cluster". Default: 4.
- bin_representation (str): Representation type for VisiumHD bins. Default: "polygon". Options: ["polygon"].
- bin_to_cell_method (str): Method to assign bins to cells. Default: "weighted_by_cluster". Options: ["naive", "weighted_by_area", "weighted_by_gene", "weighted_by_cluster"].
- cell_annotation_method (str): Method for annotating cell types. Default: "celltypist". Options: ["celltypist", "sargent" (if installed), "cellassign"].
- cell_typist_model (str): Path to the pre-trained CellTypist model for cell type annotation. Only used if cell_annotation_method is "celltypist". Refer to CellTypist Models for a list of available models. Default: "" (empty string).
- run_synthetic (bool): Whether to run synthetic data generation for testing purposes. Default: False.
- segmentation (bool): Flag to run the image segmentation step. Default: True.
- bin_to_geodataframes (bool): Flag to convert the bins to GeoDataFrames. Default: True.
- bin_to_cell_assignment (bool): Flag to run bin-to-cell assignment. Default: True.
- cell_type_annotation (bool): Flag to run cell type annotation. Default: True.
- cell_markers (dict): A dictionary of cell markers used for annotation. Only used if cell_annotation_method is one of ["sargent", "cellassign"].
- chunks_to_run (list): Specific chunks of data to run the analysis on, typically for debugging. Default: [] (runs all chunks).
- configs_dict (dict): Dictionary containing ENACT configuration parameters. If provided, the values in configs_dict will override any corresponding parameters passed directly to the class constructor. This is useful for running ENACT with a predefined configuration for convenience and consistency. Default: {} (uses the parameters specified in the class constructor).
- Specifying configurations in a yaml file (sample file located under config/configs.yaml):
analysis_name: <analysis-name> <---- custom name for analysis. Will create a folder with that name to store the results
run_synthetic: False <---- True if you want to run bin to cell assignment on synthetic dataset, False otherwise
cache_dir: <path-to-store-enact-outputs> <---- path to store pipeline outputs
paths:
wsi_path: <path-to-whole-slide-image> <---- path to whole slide image
visiumhd_h5_path: <path-to-counts-file> <---- location of the 2um x 2um gene by bin file (filtered_feature_bc_matrix.h5) from 10X Genomics
tissue_positions_path: <path-to-tissue-positions> <---- location of the tissue_positions.parquet file from 10X Genomics
steps:
segmentation: True <---- True if you want to run segmentation
bin_to_geodataframes: True <---- True to convert bin to geodataframes
bin_to_cell_assignment: True <---- True to run bin-to-cell assignment
cell_type_annotation: True <---- True to run cell type annotation
params:
bin_to_cell_method: "weighted_by_cluster" <---- bin-to-cell assignment method. Pick one of ["naive", "weighted_by_area", "weighted_by_gene", "weighted_by_cluster"]
cell_annotation_method: "celltypist" <---- cell annotation method. Pick one of ["cellassign", "celltypist"]
cell_typist_model: "Human_Colorectal_Cancer.pkl" <---- CellTypist model weights to use. Update based on organ of interest if cell_annotation_method is set to "celltypist"
seg_method: "stardist" <---- cell segmentation method. Stardist is the only option for now
patch_size: 4000 <---- defines the patch size. The whole resolution image will be broken into patches of this size. Reduce if you run into memory issues
use_hvg: True <---- True to run the analysis only on the top n highly variable genes. Setting it to False runs ENACT on all genes in the counts file
n_hvg: 1000 <---- number of highly variable genes to use
n_clusters: 4 <---- number of cell clusters to use for the "weighted_by_cluster" method. Default is 4.
cell_markers: <---- cell-gene markers to use for cell annotation. Only applicable if params/cell_annotation_method is "cellassign" or "sargent"
Epithelial: ["CDH1","EPCAM","CLDN1","CD2"]
Enterocytes: ["CD55", "ELF3", "PLIN2", "GSTM3", "KLF5", "CBR1", "APOA1", "CA1", "PDHA1", "EHF"]
Goblet cells: ["MANF", "KRT7", "AQP3", "AGR2", "BACE2", "TFF3", "PHGR1", "MUC4", "MUC13", "GUCA2A"]
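A typo in a method name is easy to miss in either configuration style, so a quick check against the options documented in the parameter list can catch it before a long run starts. The sketch below is hypothetical and not part of ENACT: check_params is an invented helper, the allowed values are copied from the parameter list above, and treating cell_markers as mandatory for "sargent" and "cellassign" is an assumption drawn from the parameter descriptions.

```python
# Allowed values copied from the documented parameter options.
ALLOWED_OPTIONS = {
    "seg_method": ["stardist"],
    "bin_representation": ["polygon"],
    "bin_to_cell_method": ["naive", "weighted_by_area", "weighted_by_gene", "weighted_by_cluster"],
    "cell_annotation_method": ["celltypist", "sargent", "cellassign"],
}

# Annotation methods that rely on user-supplied marker genes.
MARKER_BASED_METHODS = ("sargent", "cellassign")


def check_params(params):
    """Return a list of human-readable problems found in a params dict."""
    problems = []
    for name, allowed in ALLOWED_OPTIONS.items():
        if name in params and params[name] not in allowed:
            problems.append(f"{name}={params[name]!r} is not one of {allowed}")
    # Assumption: marker-based annotators cannot run without cell_markers.
    method = params.get("cell_annotation_method")
    if method in MARKER_BASED_METHODS and not params.get("cell_markers"):
        problems.append(f"cell_markers must be set when cell_annotation_method is {method!r}")
    return problems


params = {
    "bin_to_cell_method": "weighted_by_cluster",
    "cell_annotation_method": "celltypist",
    "cell_typist_model": "Human_Colorectal_Cancer.pkl",
}
for problem in check_params(params):
    print("Config problem:", problem)
```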
ENACT outputs all its results under the cache directory, which is created automatically at run time:
.
└── cache/
    └── <analysis_name>/
        ├── chunks/                                   # ENACT results at a chunk level
        │   ├── bins_gdf/
        │   │   └── patch_<patch_id>.csv
        │   ├── cells_gdf/
        │   │   └── patch_<patch_id>.csv
        │   └── <bin_to_cell_method>/
        │       ├── bin_to_cell_assign/
        │       │   └── patch_<patch_id>.csv
        │       ├── cell_ix_lookup/
        │       │   └── patch_<patch_id>.csv
        │       └── <cell_annotation_method>_results/
        │           ├── cells_adata.csv
        │           └── merged_results.csv
        ├── tmap/                                     # Directory storing files to visualize results on TissUUmaps
        │   ├── <run_name>_adata.h5
        │   ├── <run_name>_tmap.tmap
        │   └── wsi.tif
        └── cells_df.csv                              # cells dataframe, each row is a cell with its coordinates
ENACT breaks down the whole resolution image into "chunks" (or patches) of size patch_size. Results are provided per-chunk under the chunks directory.
- bins_gdf: Folder containing GeoPandas dataframes representing the 2um Visium HD bins within a given patch
- cells_gdf: Folder containing GeoPandas dataframes representing the cells segmented in the tissue
- <bin_to_cell_method>/bin_to_cell_assign: Folder containing dataframes with the transcripts assigned to each cell
- <bin_to_cell_method>/cell_ix_lookup: Folder containing dataframes defining the indices and coordinates of the cells
- <bin_to_cell_method>/<cell_annotation_method>_results/cells_adata.csv: Anndata object containing the results from ENACT (cell coordinates, cell types, transcript counts)
- <bin_to_cell_method>/<cell_annotation_method>_results/merged_results.csv: Dataframe (.csv) containing the results from ENACT (cell coordinates, cell types)
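Since the output locations depend on several configuration values, a small helper can save some path bookkeeping in downstream scripts. The sketch below builds the expected paths from the directory layout described in this section. The function enact_result_paths is hypothetical, and it assumes the top-level folder is the configured cache_dir followed by the analysis name.

```python
from pathlib import Path


def enact_result_paths(cache_dir, analysis_name, bin_to_cell_method, cell_annotation_method):
    """Build the expected locations of ENACT's main result files."""
    analysis_dir = Path(cache_dir) / analysis_name
    # Annotation results live under chunks/<bin_to_cell_method>/<cell_annotation_method>_results/.
    results_dir = analysis_dir / "chunks" / bin_to_cell_method / f"{cell_annotation_method}_results"
    return {
        "cells_df": analysis_dir / "cells_df.csv",
        "merged_results": results_dir / "merged_results.csv",
        "cells_adata": results_dir / "cells_adata.csv",
        "tmap_dir": analysis_dir / "tmap",
    }


paths = enact_result_paths("cache/ENACT_outputs", "colon-demo", "weighted_by_cluster", "celltypist")
print(paths["merged_results"].as_posix())
```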
The demo notebook provides a step-by-step guide on how to install ENACT and run it on public VisiumHD data from a notebook.
This section provides a guide for running ENACT on the Human Colorectal Cancer sample provided on 10X Genomics' website.
Refer to Install ENACT from Source
- Whole slide image: full resolution tissue image
curl -O https://cf.10xgenomics.com/samples/spatial-exp/3.0.0/Visium_HD_Human_Colon_Cancer/Visium_HD_Human_Colon_Cancer_tissue_image.btf
- Visium HD output file. The transcript counts are provided in a .tar.gz file that needs to be extracted:
curl -O https://cf.10xgenomics.com/samples/spatial-exp/3.0.0/Visium_HD_Human_Colon_Cancer/Visium_HD_Human_Colon_Cancer_binned_outputs.tar.gz
tar -xvzf Visium_HD_Human_Colon_Cancer_binned_outputs.tar.gz
Locate the following two files from the extracted outputs file.
.
└── binned_outputs/
    └── square_002um/
        ├── filtered_feature_bc_matrix.h5   <---- Transcript counts file (2um resolution)
        └── spatial/
            └── tissue_positions.parquet    <---- Bin locations relative to the full resolution image
Refer to Running Instructions for a full list of ENACT parameters to change.
Below is a sample configuration file for running ENACT on the Human Colorectal Cancer sample:
analysis_name: "colon-demo"
run_synthetic: False # True if you want to run bin to cell assignment on synthetic dataset, False otherwise.
cache_dir: "cache/ENACT_outputs" # Change according to your desired output location
paths:
wsi_path: "<path_to_data>/Visium_HD_Human_Colon_Cancer_tissue_image.btf" # whole slide image path
visiumhd_h5_path: "<path_to_data>/binned_outputs/square_002um/filtered_feature_bc_matrix.h5" # location of the 2um x 2um gene by bin file (filtered_feature_bc_matrix.h5) from 10X Genomics.
tissue_positions_path: "<path_to_data>/binned_outputs/square_002um/spatial/tissue_positions.parquet" # location of the tissue_positions.parquet file from 10X Genomics
steps:
segmentation: True # True if you want to run segmentation
bin_to_geodataframes: True # True to convert bin to geodataframes
bin_to_cell_assignment: True # True to assign bins to cells
cell_type_annotation: True # True to run cell type annotation
params:
seg_method: "stardist" # Stardist is the only option for now
patch_size: 4000 # Defines the patch size. The whole resolution image will be broken into patches of this size
bin_representation: "polygon" # or point TODO: Remove support for anything else
bin_to_cell_method: "weighted_by_cluster" # or naive
cell_annotation_method: "celltypist"
cell_typist_model: "Human_Colorectal_Cancer.pkl"
use_hvg: True # Only run analysis on highly variable genes + cell markers specified
n_hvg: 1000 # Number of highly variable genes to use
n_clusters: 4
chunks_to_run: []
cell_markers:
# Human Colon
Epithelial: ["CDH1","EPCAM","CLDN1","CD2"]
Enterocytes: ["CD55", "ELF3", "PLIN2", "GSTM3", "KLF5", "CBR1", "APOA1", "CA1", "PDHA1", "EHF"]
Goblet cells: ["MANF", "KRT7", "AQP3", "AGR2", "BACE2", "TFF3", "PHGR1", "MUC4", "MUC13", "GUCA2A"]
Enteroendocrine cells: ["NUCB2", "FABP5", "CPE", "ALCAM", "GCG", "SST", "CHGB", "IAPP", "CHGA", "ENPP2"]
Crypt cells: ["HOPX", "SLC12A2", "MSI1", "SMOC2", "OLFM4", "ASCL2", "PROM1", "BMI1", "EPHB2", "LRIG1"]
Endothelial: ["PECAM1","CD34","KDR","CDH5","PROM1","PDPN","TEK","FLT1","VCAM1","PTPRC","VWF","ENG","MCAM","ICAM1","FLT4"]
Fibroblast: ["COL1A1","COL3A1","COL5A2","PDGFRA","ACTA2","TCF21","FN"]
Smooth muscle cell: ["BGN","MYL9","MYLK","FHL2","ITGA1","ACTA2","EHD2","OGN","SNCG","FABP4"]
B cells: ["CD74", "HMGA1", "CD52", "PTPRC", "HLA-DRA", "CD24", "CXCR4", "SPCS3", "LTB", "IGKC"]
T cells: ["JUNB", "S100A4", "CD52", "PFN1P1", "CD81", "EEF1B2P3", "CXCR4", "CREM", "IL32", "TGIF1"]
NK cells: ["S100A4", "IL32", "CXCR4", "FHL2", "IL2RG", "CD69", "CD7", "NKG7", "CD2", "HOPX"]
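The patch_size value in the configuration above determines how many chunks the image is split into, and therefore how many per-patch result files to expect. The sketch below shows one way such a tiling could be computed. It is an assumption for illustration, not ENACT's code: it uses non-overlapping tiles with smaller tiles at the right and bottom edges, and patch_grid is a hypothetical name.

```python
from math import ceil


def patch_grid(image_width, image_height, patch_size=4000):
    """Yield (x0, y0, x1, y1) pixel bounds of non-overlapping patches, row by row."""
    for row in range(ceil(image_height / patch_size)):
        for col in range(ceil(image_width / patch_size)):
            x0, y0 = col * patch_size, row * patch_size
            # Clip the last row and column of patches to the image bounds.
            yield (x0, y0, min(x0 + patch_size, image_width), min(y0 + patch_size, image_height))


# Example with made-up image dimensions.
patches = list(patch_grid(10000, 6000, patch_size=4000))
print(len(patches), "patches")
```

Halving patch_size roughly quadruples the number of patches while reducing the memory needed to process each one.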
This section provides a guide on running ENACT on your own data.
Refer to Install ENACT from Source
Define the locations of ENACT's required files in the config/configs.yaml file. Refer to Input Files for ENACT.
analysis_name: <analysis-name> <---- custom name for analysis. Will create a folder with that name to store the results
cache_dir: <path-to-store-enact-outputs> <---- path to store pipeline outputs
paths:
wsi_path: <path-to-whole-slide-image> <---- path to whole slide image
visiumhd_h5_path: <path-to-counts-file> <---- location of the 2um x 2um gene by bin file (filtered_feature_bc_matrix.h5) from 10X Genomics.
tissue_positions_path: <path-to-tissue-positions> <---- location of the tissue_positions.parquet file from 10X Genomics
Define the following core parameters in the config/configs.yaml file:
params:
bin_to_cell_method: "weighted_by_cluster" <---- bin-to-cell assignment method. Pick one of ["naive", "weighted_by_area", "weighted_by_gene", "weighted_by_cluster"]
cell_annotation_method: "celltypist" <---- cell annotation method. Pick one of ["cellassign", "celltypist", "sargent" (if installed)]
cell_typist_model: "Human_Colorectal_Cancer.pkl" <---- CellTypist model weights to use. Update based on organ of interest if cell_annotation_method is set to "celltypist"
Refer to Defining ENACT Configurations for a full list of parameters to configure. If using CellTypist, set cell_typist_model to one of the following models based on the organ and species under study: CellTypist models.
Step 4: Define Cell Gene Markers (only applies if cell_annotation_method is "cellassign" or "sargent")
Define the cell gene markers in the config/configs.yaml file. These can be expert-annotated or obtained from open-source databases such as Panglao or CellMarker. Example cell markers for human colorectal cancer samples:
cell_markers:
Epithelial: ["CDH1","EPCAM","CLDN1","CD2"]
Enterocytes: ["CD55", "ELF3", "PLIN2", "GSTM3", "KLF5", "CBR1", "APOA1", "CA1", "PDHA1", "EHF"]
Goblet cells: ["MANF", "KRT7", "AQP3", "AGR2", "BACE2", "TFF3", "PHGR1", "MUC4", "MUC13", "GUCA2A"]
Enteroendocrine cells: ["NUCB2", "FABP5", "CPE", "ALCAM", "GCG", "SST", "CHGB", "IAPP", "CHGA", "ENPP2"]
Crypt cells: ["HOPX", "SLC12A2", "MSI1", "SMOC2", "OLFM4", "ASCL2", "PROM1", "BMI1", "EPHB2", "LRIG1"]
Endothelial: ["PECAM1","CD34","KDR","CDH5","PROM1","PDPN","TEK","FLT1","VCAM1","PTPRC","VWF","ENG","MCAM","ICAM1","FLT4"]
Fibroblast: ["COL1A1","COL3A1","COL5A2","PDGFRA","ACTA2","TCF21","FN"]
Smooth muscle cell: ["BGN","MYL9","MYLK","FHL2","ITGA1","ACTA2","EHD2","OGN","SNCG","FABP4"]
B cells: ["CD74", "HMGA1", "CD52", "PTPRC", "HLA-DRA", "CD24", "CXCR4", "SPCS3", "LTB", "IGKC"]
T cells: ["JUNB", "S100A4", "CD52", "PFN1P1", "CD81", "EEF1B2P3", "CXCR4", "CREM", "IL32", "TGIF1"]
NK cells: ["S100A4", "IL32", "CXCR4", "FHL2", "IL2RG", "CD69", "CD7", "NKG7", "CD2", "HOPX"]
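Marker lists assembled from several sources often share genes across cell types (in the example above, CD2 appears under both Epithelial and NK cells), which can make marker-based annotation less discriminative. The sketch below lists such overlaps so they can be reviewed before a run. It is a hypothetical helper, not part of ENACT, and it uses a subset of the example markers.

```python
from collections import defaultdict


def shared_markers(cell_markers):
    """Map each gene listed under more than one cell type to those cell types."""
    gene_to_types = defaultdict(list)
    for cell_type, genes in cell_markers.items():
        for gene in genes:
            gene_to_types[gene].append(cell_type)
    return {gene: types for gene, types in gene_to_types.items() if len(types) > 1}


# Subset of the example colorectal markers.
cell_markers = {
    "Epithelial": ["CDH1", "EPCAM", "CLDN1", "CD2"],
    "Fibroblast": ["COL1A1", "COL3A1", "COL5A2", "PDGFRA", "ACTA2", "TCF21", "FN"],
    "Smooth muscle cell": ["BGN", "MYL9", "MYLK", "FHL2", "ITGA1", "ACTA2", "EHD2", "OGN", "SNCG", "FABP4"],
    "NK cells": ["S100A4", "IL32", "CXCR4", "FHL2", "IL2RG", "CD69", "CD7", "NKG7", "CD2", "HOPX"],
}
overlaps = shared_markers(cell_markers)
for gene, cell_types in sorted(overlaps.items()):
    print(f"{gene}: {', '.join(cell_types)}")
```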
make run_enact
To view results on TissUUmaps, begin by installing TissUUmaps by following the instructions at: https://tissuumaps.github.io/TissUUmaps-docs/docs/intro/installation.html#.
Once installed, follow the instructions at: https://tissuumaps.github.io/TissUUmaps-docs/docs/starting/projects.html#loading-projects
For convenience, ENACT creates a TissUUmaps project file (.tmap extension) located under the <cache_dir>/tmap/ folder.
This section provides a guide on how to reproduce the ENACT paper results on the 10X Genomics Human Colorectal Cancer VisiumHD sample. Here, ENACT is run on various combinations of bin-to-cell assignment methods and cell annotation algorithms.
Refer to Install ENACT from Source
- Run the following command, which downloads all the supplementary files from ENACT's Zenodo page and programmatically runs ENACT with various combinations of bin-to-cell assignment methods and cell annotation algorithms:
make reproduce_results
- To create a synthetic VisiumHD dataset from Xenium or seqFISH+ data, run and follow the instructions of the notebooks in src/synthetic_data.
- To run the ENACT pipeline with the synthetic data, set the following parameter in the config/configs.yaml file:
run_synthetic: True <---- True if you want to run bin to cell assignment on synthetic dataset, False otherwise.
- Run ENACT:
make run_enact
If you use this repository or its tools in your research, please cite the following:
@article {Kamel2024.10.17.618905,
author = {Kamel, Mena and Song, Yiwen and Solbas, Ana and Villordo, Sergio and Sarangi, Amrut and Senin, Pavel and Mathew, Sunaal and Ayestas, Luis Cano and Wang, Seqian and Classe, Marion and Bar-Joseph, Ziv and Planas, Albert Pla},
title = {ENACT: End-to-End Analysis of Visium High Definition (HD) Data},
elocation-id = {2024.10.17.618905},
year = {2024},
doi = {10.1101/2024.10.17.618905},
publisher = {Cold Spring Harbor Laboratory},
abstract = {Motivation: Spatial transcriptomics (ST) enables the study of gene expression within its spatial context in histopathology samples. To date, a limiting factor has been the resolution of sequencing based ST products. The introduction of the Visium High Definition (HD) technology opens the door to cell resolution ST studies. However, challenges remain in the ability to accurately map transcripts to cells and in assigning cell types based on the transcript data. Results: We developed ENACT, the first tissue-agnostic pipeline that integrates advanced cell segmentation with Visium HD transcriptomics data to infer cell types across whole tissue sections. Our pipeline incorporates novel bin-to-cell assignment methods, enhancing the accuracy of single-cell transcript estimates. Validated on diverse synthetic and real datasets, our approach is both scalable and effective offering a robust solution for spatially resolved transcriptomics analysis. Availability and implementation: ENACT source code is available at https://github.com/Sanofi-Public/enact-pipeline. Experimental data is available at https://zenodo.org/records/13887921. Supplementary information: Supplementary data are available at BiorXiv online.Competing Interest StatementThe authors have declared no competing interest.},
URL = {https://www.biorxiv.org/content/early/2024/10/20/2024.10.17.618905},
eprint = {https://www.biorxiv.org/content/early/2024/10/20/2024.10.17.618905.full.pdf},
journal = {bioRxiv}
}