Final year project experimenting with clustering and topological data analysis of scRNA-seq data using Python and R across two Jupyter notebooks

TomMakesThings/Clustering-and-TDA-of-scRNA-seq-Data

Clustering and Topological Data Analysis of Single-Cell RNA Sequencing Data

Project by TomMakesThings - 2020/2021

Contents
  1. About
  2. Results
  3. Running the Code
  4. Repository Contents

🧬 About 🧬

This repository hosts the datasets, code, interactive graphs and website for my undergraduate final year project. The aim is to experiment with clustering and topological data analysis to detect hidden gene expression in three different types of datasets. For an overview of the work, refer to this repository's GitHub Pages site, or read the PDF report here. If you'd like to experiment with the code yourself, refer to the Running the Code section.

🧬 Results 🧬

Clustering

Different combinations of pre-processing, dimensionality reduction and clustering algorithms were tested, with the best combination varying per dataset.

Benchmark

  • Dataset originally named sc_10x by Tian et al.
  • Contains human lung adenocarcinoma cells from three cell lines
  • Best accuracy was 99.9%, with 901 out of 902 cells assigned to the correct cell line
  • Achieved using standardization and PCA or ICA with three components, followed by agglomerative hierarchical clustering or BIRCH
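Cluster IDs are arbitrary, so scoring a result such as 901 out of 902 requires matching clusters to cell lines first. The sketch below shows one way to do that, by trying every one-to-one mapping and keeping the best. It is an illustration only: the function name, the placeholder cell line names and the toy labels are assumptions, and the notebook may compute accuracy differently.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Fraction of cells correctly assigned under the best cluster-to-label mapping."""
    labels = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(labels):
        raise ValueError("this sketch assumes no more clusters than labels")
    best = 0
    # Try every way of assigning distinct labels to the clusters.
    for perm in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
        best = max(best, correct)
    return best / len(true_labels)


# Toy labels shaped like the benchmark result: 902 cells, three cell lines,
# one cell placed in the wrong cluster. Names and group sizes are made up.
true_labels = ["line_a"] * 300 + ["line_b"] * 301 + ["line_c"] * 301
cluster_ids = [2] * 300 + [0] * 301 + [1] * 300 + [0]

accuracy = clustering_accuracy(true_labels, cluster_ids)
print(f"{accuracy:.1%}")  # 901 / 902 correct, printed as 99.9%
```

Brute-force matching is only practical for a handful of clusters; for datasets with many groups, an assignment solver would be the more realistic choice.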

Splat Simulated

  • Dataset of artificial data simulated with Splat during this project
  • Gene expression imitates sc_10x
  • Ground truth contains four target groups
  • Achieved 100% accuracy, with all 2000 cells correctly grouped
  • Achieved using standardization and PCA or ICA with four components, followed by agglomerative hierarchical clustering, BIRCH or mini-batch k-means
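The combination listed above (standardization, PCA with four components, then agglomerative hierarchical clustering) could be sketched as follows. This assumes scikit-learn, which this README does not name, and uses synthetic blobs as a stand-in for the simulated gene counts. The adjusted Rand index is used here only because it ignores how clusters are numbered; it is not the accuracy measure reported above.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a cells x genes matrix with four groups
# (not the Splat dataset; sizes are arbitrary).
X, y = make_blobs(n_samples=400, n_features=50, centers=4, random_state=0)

# 1. Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# 2. Reduce to four principal components.
X_reduced = PCA(n_components=4, random_state=0).fit_transform(X_scaled)

# 3. Cluster the reduced data into four groups.
labels = AgglomerativeClustering(n_clusters=4).fit_predict(X_reduced)

ari = adjusted_rand_score(y, labels)
print(f"Adjusted Rand index: {ari:.3f}")
```

Swapping step 3 for `Birch(n_clusters=4)` or `MiniBatchKMeans(n_clusters=4)`, or step 2 for `FastICA(n_components=4)`, gives the other combinations mentioned in the list.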

Mouse Cortex

  • Dataset originally named mouse cortex mRNA by Zeisel et al.
  • Contains brain cells from mouse cortex and hippocampus with nine groups and 47 subgroups determined previously using BackSPIN biclustering
  • Unfortunately, I was only able to reach 44% accuracy
  • Standard clustering methods are less reliable with this data, as cells show an overlapping spectrum of gene expression

Topological Data Analysis with Kepler Mapper

Simplicial complexes for each dataset were created with the same hyperparameters so that topological features can be compared.

Benchmark

Splat Simulated

Mouse Cortex

🧬 Running the Code 🧬

The code was written in Python and R across two Jupyter notebooks. For an explanation of each notebook, see the Repository Contents section. The notebooks were developed in Google Colab, a free Jupyter notebook environment that allows you to run code through a browser.

  1. Download the repository by clicking Code ➞ Download ZIP.
  2. Extract the contents of the zip.
  3. Visit https://colab.research.google.com.
  4. Sign in to your Google account.
  5. On Colab, go to File ➞ Upload notebook.
  6. Navigate to Clustering-and-TDA-of-scRNA-seq-Data-main > Jupyter_Notebooks.
  7. Select the notebook to upload.
  8. Optionally switch from CPU to GPU by selecting Change runtime type ➞ Hardware accelerator ➞ GPU ➞ Save. This is recommended if you selected Clustering_and_TDA.ipynb and wish to train a new autoencoder, as it can considerably reduce training time.
  9. Run the code by pressing Runtime ➞ Run all.
  10. If you would like to make any changes, for example running with your own dataset, follow the instructions in the notebook.

🧬 Repository Contents 🧬

The repository consists of two branches: main and gh-pages. The contents of each branch are explained below.

Main Branch

Jupyter Notebooks
Splat_Simulator

The purpose of the notebook Splat_Simulator.ipynb is to produce new, artificial scRNA-seq data. For this project, it was used to create the simulated dataset, though it can easily be altered to make new data for other purposes. To see a fully executed version of the code, click here.

  • Gene counts and group labels are generated using the Splat simulator, which is part of the R package Splatter, so the code contains a mix of Python and inline R.
  • To mimic real biological gene expression, the simulator is seeded with the benchmark dataset, though this could be swapped out to imitate another dataset.
  • After seeding the simulator, datapoints are generated with each belonging to one of four groups.
  • The new data and labels are then saved as CSV files using Python. These can be downloaded and reopened to use in Clustering_and_TDA.ipynb.
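The final step, saving the counts and labels as CSV files and reopening them later, can be illustrated with the standard library alone. This is a minimal sketch: the file names `counts.csv` and `labels.csv`, the function names and the tiny count matrix are all hypothetical, and the notebook itself may well use a dataframe library instead.

```python
import csv
import tempfile
from pathlib import Path


def save_dataset(counts, labels, folder):
    """Write a cells x genes count matrix and per-cell group labels to CSV."""
    folder = Path(folder)
    with open(folder / "counts.csv", "w", newline="") as f:
        csv.writer(f).writerows(counts)
    with open(folder / "labels.csv", "w", newline="") as f:
        csv.writer(f).writerows([label] for label in labels)


def load_dataset(folder):
    """Read the two CSV files back into lists of integers and strings."""
    folder = Path(folder)
    with open(folder / "counts.csv", newline="") as f:
        counts = [[int(value) for value in row] for row in csv.reader(f)]
    with open(folder / "labels.csv", newline="") as f:
        labels = [row[0] for row in csv.reader(f)]
    return counts, labels


# Four made-up cells with three genes each, one cell per group.
counts = [[0, 3, 1], [5, 0, 2], [1, 1, 0], [0, 7, 4]]
labels = ["Group1", "Group2", "Group3", "Group4"]

with tempfile.TemporaryDirectory() as tmp:
    save_dataset(counts, labels, tmp)
    reloaded_counts, reloaded_labels = load_dataset(tmp)

print(reloaded_counts == counts, reloaded_labels == labels)
```

Keeping counts and labels in separate files, with rows in the same cell order, makes it straightforward to load only the counts when labels are not needed.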
Clustering_and_TDA

In the notebook Clustering_and_TDA.ipynb, experiments are performed on the three given datasets. To see a fully executed version of the code with interactive graphs, click here.

  • First, the datasets and their target labels are opened as dataframes. The given datasets are downloaded from a URL so that the notebook can be run with no setup required, although the code has been designed so that it can also be run with your own dataset.
  • Next, a dataset is selected and an autoencoder with customisable hyperparameters is created using PyTorch Lightning to act as a feature extractor for the gene counts.
  • Then clustering is performed to divide cells into groups that show similar gene expression. Several clustering algorithms can be chosen: k-means, agglomerative hierarchical, BIRCH, mini-batch k-means, spectral and Gaussian mixture. The encoding produced by the autoencoder can optionally be used, along with other dimensionality reduction methods such as PCA, ICA or NMF and techniques such as standardization and t-SNE.
  • At the end of the notebook, Kepler Mapper is run on the gene counts to produce a simplicial complex that reveals the topological shape of the high-dimensional data.
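To give a feel for what the Mapper step produces, here is a simplified, standard-library sketch of the general Mapper construction: cover a one-dimensional lens with overlapping intervals, cluster the points inside each interval, and link clusters that share points. It is not Kepler Mapper's implementation or API. The function names, the `overlap` and `radius` parameters and the circle data are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import cos, dist, pi, sin


def interval_cover(values, n_intervals, overlap):
    """Split the range of lens values into equally sized overlapping intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_intervals
    pad = width * overlap / 2
    return [(lo + i * width - pad, lo + (i + 1) * width + pad) for i in range(n_intervals)]


def cluster_within(points, indices, radius):
    """Single-linkage clustering: group points linked by hops no longer than radius."""
    remaining, clusters = set(indices), []
    while remaining:
        stack = [remaining.pop()]
        cluster = set(stack)
        while stack:
            current = stack.pop()
            near = {j for j in remaining if dist(points[current], points[j]) <= radius}
            remaining -= near
            cluster |= near
            stack.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def mapper_graph(points, lens, n_intervals=4, overlap=0.5, radius=0.4):
    """Return (nodes, edges): nodes are sets of point indices, edges join nodes sharing a point."""
    nodes = []
    for low, high in interval_cover(lens, n_intervals, overlap):
        members = [i for i, value in enumerate(lens) if low <= value <= high]
        nodes.extend(cluster_within(points, members, radius))
    edges = [(a, b) for a, b in combinations(range(len(nodes)), 2) if nodes[a] & nodes[b]]
    return nodes, edges


# 40 points on a unit circle, with the x coordinate as the lens.
points = [(cos(2 * pi * k / 40), sin(2 * pi * k / 40)) for k in range(40)]
lens = [x for x, _ in points]

nodes, edges = mapper_graph(points, lens)
print(len(nodes), "nodes,", len(edges), "edges")
```

For the circle, the resulting graph has six nodes joined in a single loop, which is the kind of topological feature (here, one hole) that the simplicial complexes are meant to expose in the gene count data.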

Data
Datasets

This folder contains CSV, text and R object files containing the gene count data, labels and metadata for three scRNA-seq datasets. These are downloaded and opened automatically in notebook Clustering_and_TDA.ipynb.

To find out more about the datasets see the GitHub Pages site.

Benchmark_Autoencoder, Simulated_Autoencoder and Evaluation_Autoencoder
These folders contain zip files that are opened automatically in the notebook Clustering_and_TDA.ipynb and do not need to be manually downloaded. These files allow the state of trained autoencoders to be reloaded for the three datasets, which avoids training new models every time the notebook is run. Within each zip is a model checkpoint file containing the model weights, as well as text files listing the cells / samples selected for the training, validation and testing data, so that training and testing data do not overlap when the notebook is run again.
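The idea of storing the split alongside the checkpoint can be sketched as follows: cell IDs are shuffled once, written to one text file per subset, and read back on later runs instead of being re-sampled. The file names, split fractions and function names here are hypothetical and are not taken from the zip files themselves.

```python
import random
import tempfile
from pathlib import Path

SUBSETS = ("train", "validation", "test")


def split_cells(cell_ids, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle cell IDs and divide them into disjoint train/validation/test lists."""
    shuffled = list(cell_ids)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


def save_split(split, folder):
    """Write one text file per subset, with one cell ID per line."""
    for name, ids in split.items():
        (Path(folder) / f"{name}.txt").write_text("\n".join(ids) + "\n")


def load_split(folder):
    """Reload the subsets written by save_split."""
    return {name: (Path(folder) / f"{name}.txt").read_text().split() for name in SUBSETS}


cells = [f"cell_{i}" for i in range(100)]
with tempfile.TemporaryDirectory() as tmp:
    save_split(split_cells(cells), tmp)
    reloaded = load_split(tmp)

train, validation, test = (set(reloaded[name]) for name in SUBSETS)
print("overlap:", bool(train & validation or train & test or validation & test))
```

Reloading the saved lists matters because a checkpointed model evaluated on a freshly sampled test set could be scored on cells it was trained on.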

GitHub Pages Branch

Graphs

This folder contains interactive HTML graphs from the clustering and topological data analysis experiments.

To view a particular graph, refer to the Graph Finder on the GitHub Pages site.

Website

Other folders provide the HTML, CSS, JavaScript and assets required to host the GitHub Pages site.