This repository contains a workflow for finding public RNA-seq samples and studies related to specific diseases.
- bin:
  - Contains MONDO model files (`*.pkl`) that are used to make predictions. These models are pretrained and ready to use with the provided data.
- data:
  - `aggregated_metadata.json.gz`: Compressed JSON file containing metadata about RNA-seq experiments from refine.bio.
  - `true_label__inst_type=study__task=disease.csv.gz`: Compressed CSV file with true labels that includes redundant and non-redundant MONDO terms.
- src:
  - `extract_data.py`: Script to extract descriptions and accession codes from the compressed JSON metadata file.
  - `preprocess.py`: Script to preprocess the extracted descriptions.
  - `embedding_lookup_table.py`: Script to generate embeddings for preprocessed descriptions.
  - `tfidf_calculator.py`: Script to calculate TF-IDF scores for text data.
  - `predict.py`: Script to run predictions using pre-trained MONDO models.
- results: Contains the filtered descriptions and accession codes after preprocessing the metadata.
  - `IDs.tsv`: List of accession codes after filtering out studies with no description.
  - `refinebio_descriptions_filtered.tsv`: Descriptions of the RNA-seq experiments after filtering out studies with no description.
- run:
  - `run_extraction.sh`: Shell script for extracting and filtering descriptions.
  - `run_embedding_lookup_table.sh`: Shell script to generate embeddings for preprocessed descriptions.
  - `run_preprocess.sh`: Shell script to preprocess the extracted descriptions.
  - `run_predictions.sh`: Shell script to run predictions using the MONDO model files.
- README.md: This file, providing an overview of the project.
- Extract Descriptions: The script `extract_data.py` reads and parses the compressed JSON metadata file located in `data/aggregated_metadata.json.gz`. It filters out entries with no descriptions.
  - Output: Filtered descriptions saved in `results/refinebio_descriptions_filtered.tsv`. Accession codes saved in `results/IDs.tsv`.
- Text Preprocessing: The `preprocess.py` script cleans and preprocesses the extracted descriptions by removing URLs, specific strings, file names, and non-UTF-8 characters, and by applying text normalization techniques.
  - Output: Preprocessed descriptions saved in `results/processed_refinebio_descriptions.tsv` for embedding generation.
- Embedding Generation: The `run_embedding_lookup_table.sh` script calls `embedding_lookup_table.py` to generate embeddings for the preprocessed descriptions using a pre-trained language model (BiomedBERT).
  - Output: Embeddings saved in `results/my_custom_embeddings.npz`.
- Predictions: The `predict.py` script runs predictions for each MONDO model file using the generated embeddings and preprocessed descriptions.
  - Output: Prediction results saved in the `prediction_results` folder. This script also requires `txt2onto2.0/data/disease_desc_embedding.npz` to run.
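
The first two steps above (extraction with filtering, then text cleanup) can be sketched in a few lines of Python. This is a minimal illustration, not the code in `extract_data.py` or `preprocess.py`: the function names are hypothetical, the JSON layout (a top-level `experiments` mapping from accession code to a record with a `description` field) is an assumption about the metadata file, and the cleanup covers only URL removal and whitespace/case normalization rather than the full set of rules listed above.

```python
import gzip
import json
import re


def extract_descriptions(metadata_path):
    """Return (accession, description) pairs, skipping empty descriptions.

    Assumption: the JSON has a top-level "experiments" mapping from
    accession code to a record with a "description" field. Adjust the
    keys if the real file is laid out differently.
    """
    with gzip.open(metadata_path, "rt", encoding="utf-8") as handle:
        metadata = json.load(handle)

    pairs = []
    for accession, record in metadata.get("experiments", {}).items():
        # Treat missing, null, and whitespace-only descriptions as empty.
        description = (record.get("description") or "").strip()
        if description:
            pairs.append((accession, description))
    return pairs


# Matches http(s) links and bare "www." links up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def clean_description(text):
    """Illustrative cleanup: drop URLs, collapse whitespace, lowercase."""
    text = URL_PATTERN.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()
```

Under these assumptions, `extract_descriptions("data/aggregated_metadata.json.gz")` would yield the pairs that the workflow writes to `results/IDs.tsv` and `results/refinebio_descriptions_filtered.tsv`, and `clean_description` would be applied to each description before embedding.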
- Clone this repository to your local machine:

      git clone https://github.com/krishnanlab/Workflow_related_studies.git
      cd Workflow_related_studies