This is the Group 16 final project for CSE185. TeeSnee is a t-SNE tool implemented by following the algorithm description in the original t-SNE paper. Its main application is to cluster gene expression data of cells to reveal cell types.
- Sample Data Download
- Installation Instructions
- Basic Usage
- Complete Usage
- Benchmarking Information
- Contributors
Sample data (count matrix) taken from: https://www.10xgenomics.com/resources/datasets/human-brain-cancer-11-mm-capture-area-ffpe-2-standard
See data_processing.py for the commands needed to process the data into a gene-by-cell matrix.
- Install the teesnee program with the following commands:
git clone https://github.com/m1ma0314/CSE185Group16_TeeSnee.git
cd CSE185Group16_TeeSnee
- Python Installation: install Python if it is not already on your system.
- Pip Installation:
python -m ensurepip --upgrade
python -m pip install --upgrade pip
- Installation requires the numpy, pandas, matplotlib, and scikit-learn libraries. You can install these with pip:
pip install -r requirements.txt
- Change permissions of teesnee.py:
chmod 777 teesnee.py
The basic usage of teesnee is:
python teesnee.py [-p target_perplexity] [-z ifzipped] [-o output] filename
To run teesnee on a small test example (using files in this repo):
python teesnee.py -p 100 -o ./ minimal_dataset.csv
The only required input to teesnee is a cell x gene matrix data file. Users may additionally specify the options below:
- -p PERPLEXITY, --target_perplexity PERPLEXITY: Specify the target perplexity. If specified, the tsne function calculates the similarity matrix based on the given perplexity value and generates the t-SNE plot. A higher perplexity value is associated with tighter clusters in the final output plot. Otherwise, the tsne function uses perplexity=100 by default.
- -z ifzipped, --zipped: Unzip the dataset file if this argument is specified. By default, the data file is treated as unzipped and is converted into a matrix for further processing.
- -o FILE, --output FILE: Write output to FILE. By default, output is written to tsneplot.png.
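To make the role of the perplexity option more concrete, here is a minimal sketch of the perplexity calibration step described in the original t-SNE paper: for each point, a Gaussian bandwidth sigma is searched so that the perplexity of its conditional probabilities matches the target. This is an illustration only, not the code in teesnee.py; the function names (`conditional_probs`, `perplexity`, `find_sigma`), the search bounds, and the tolerance are assumptions, and it uses only the standard library to stay self-contained.

```python
import math


def conditional_probs(sq_dists, sigma):
    """Gaussian conditional probabilities p_{j|i} for a single point i.

    sq_dists holds the squared distances from point i to every OTHER point.
    """
    # Shift by the smallest distance before exponentiating for numerical
    # stability; the shift cancels out in the normalisation.
    d_min = min(sq_dists)
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]


def perplexity(probs):
    """Perplexity is 2 raised to the Shannon entropy (in bits)."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy


def find_sigma(sq_dists, target_perplexity, tol=1e-5, max_iter=200):
    """Bisection search (in log space) for the sigma that hits the target.

    Assumed bounds and tolerance; perplexity grows monotonically with sigma,
    so bisection is sufficient.
    """
    lo, hi = 1e-10, 1e10
    sigma = 1.0
    for _ in range(max_iter):
        sigma = math.sqrt(lo * hi)  # geometric midpoint of the bracket
        perp = perplexity(conditional_probs(sq_dists, sigma))
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            hi = sigma  # distribution too flat: shrink sigma
        else:
            lo = sigma  # distribution too peaked: grow sigma
    return sigma


# Example: one point with five neighbours and a target perplexity of 3.
example_sq_dists = [1.0, 4.0, 9.0, 16.0, 25.0]
example_sigma = find_sigma(example_sq_dists, target_perplexity=3.0)
```

Note that the target perplexity must be smaller than the number of other points, since a uniform distribution over n neighbours has perplexity exactly n.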
The output t-SNE plot is in png format.
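For readers who want to see how the command-line interface described in this section could be wired up, here is a hedged sketch using argparse. It is not taken from teesnee.py: the `build_parser` helper is hypothetical, and treating `-z` as a boolean flag is an assumption based on the option description above. The defaults (perplexity 100, output tsneplot.png) follow the text.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the options documented in this README."""
    parser = argparse.ArgumentParser(
        prog="teesnee.py",
        description="Run t-SNE on a cell x gene matrix and save the plot.",
    )
    # Required positional input: the cell x gene matrix data file.
    parser.add_argument("filename", help="cell x gene matrix data file")
    parser.add_argument(
        "-p", "--target_perplexity",
        type=float, default=100, metavar="PERPLEXITY",
        help="target perplexity (default: 100)",
    )
    # Assumption: -z is a simple on/off switch.
    parser.add_argument(
        "-z", "--zipped", action="store_true",
        help="unzip the dataset file before reading it",
    )
    parser.add_argument(
        "-o", "--output", default="tsneplot.png", metavar="FILE",
        help="where to write the png plot (default: tsneplot.png)",
    )
    return parser


# Parse the same arguments as the small test example from Basic Usage.
args = build_parser().parse_args(["-p", "100", "-o", "./", "minimal_dataset.csv"])
```

Passing an explicit argument list to `parse_args` keeps the example runnable without a real command line; a script would normally call `parse_args()` with no arguments.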
The benchmarking analysis is recorded in the folder benchmarking.
Here is the time complexity plot comparing teeSnee with scanpy’s t-SNE.
P.S. teeSnee's runtime still needs optimization TAT
This repository was generated by Annabelle Coles and Mijia Ma, under the guidance of the original t-SNE paper and with inspiration from the article t-SNE from scratch.
Special thanks to CSE185 Professor Dr. Melissa Gymrek, TA Ryan Eveloff, TA Luisa Amaral, TA Himanshu.
Please submit a pull request with any corrections or suggestions. Your suggestions matter a lot to us <3