Development and application of the generalised DNA fragility prediction engine based only on sequence-context.
Clone the project:
git clone https://github.com/SahakyanLab/DNAFragility_ML.git
Please follow the instructions below on how to acquire the public datasets, set up the directory structure, and install the software necessary to run all the studies from the publication. At the end of this README file, you can find two separate bash script commands that run the majority of the setup and run the calculations sequentially.
The resource-demanding computations were performed on a single NVIDIA RTX A6000 GPU with 40GB RAM. The developed workflows and analyses employed the R programming language 4.3.2 and Python 3.9.12.
Please run the below script to install the latest versions of the R and Python packages necessary to perform the calculations and analyses.
bash ./setup/install_packages.sh
Please also download and install the below software.
- Please clone the repo from this link (Edlib >= 1.2.7). Place the edlib.h and edlib.cpp into lib/edlib/ folder.
- Please download the DNA parameter file and place it into data/parameters folder.
Please note, if you are using Ubuntu, you may have trouble installing the ggpattern R package. However, the below steps have worked for us.
- sudo apt-get update
- sudo apt-get install libmagick++-dev
- sudo apt install libgdal-dev
- sudo apt-get install -y libudunits2-dev
- install.packages("units")
- install.packages("sf")
- install.packages("gridpattern")
- install.packages("ggpattern")
We retrieved all the somatic mutation data of both the non-coding and coding regions associated with cancer from the Catalogue of Somatic Mutations in Cancer (COSMIC) database, including Non-Coding Variants, Cancer Gene Census, and Breakpoints (structural variants) datasets obtained from release v98, May 2023.
These versions can be downloaded from the following links:
- Non-Coding Variants as "Cosmic_NonCodingVariants_PROCESSED.tsv"
- Coding Variants as "Cosmic_MutantCensus_PROCESSED.tsv"
- Cancer Gene Census as "Cosmic_CancerGeneCensus_v98_GRCh38.tsv"
- Breakpoints as "Cosmic_Breakpoints_v98_GRCh38.tsv".
- Classification as "Cosmic_Classification_v98_GRCh38.tsv"
Unpack and extract the relevant files. Place the contents into the COSMIC/data/COSMIC/ folder. Please note that the first two files above were very large because of the SNPs, so we removed the SNPs and renamed the resulting files with the "PROCESSED" suffix. We suggest you do the same, unless you have sufficient memory to load and process the full files.
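Before moving on, it can help to confirm that the five COSMIC files are in place. The following is a minimal sketch of such a check; the file names and folder are the ones listed above, while the helper name `missing_files` is hypothetical and not part of this repository.

```python
from pathlib import Path

# File names exactly as listed in this README.
EXPECTED_COSMIC_FILES = [
    "Cosmic_NonCodingVariants_PROCESSED.tsv",
    "Cosmic_MutantCensus_PROCESSED.tsv",
    "Cosmic_CancerGeneCensus_v98_GRCh38.tsv",
    "Cosmic_Breakpoints_v98_GRCh38.tsv",
    "Cosmic_Classification_v98_GRCh38.tsv",
]


def missing_files(folder, expected=EXPECTED_COSMIC_FILES):
    """Return the expected file names that are not present in `folder`.

    Hypothetical helper for illustration only.
    """
    folder = Path(folder)
    return [name for name in expected if not (folder / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("COSMIC/data/COSMIC")
    print("All COSMIC files found." if not missing else f"Missing: {missing}")
```

An empty list means every expected file was found.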
We obtained SVs and SNPs that had a clinically associated pathogenic or benign label from the variant_summary.txt.gz file, downloaded from the ClinVar database and accessed on 6 December 2023. Please note that this file is updated weekly.
Unpack and extract the relevant files. Place the contents into 04_ClinVar/ folder.
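As an illustration of the label filter described above, here is a minimal sketch that streams variant_summary.txt.gz and keeps rows labelled pathogenic or benign. It assumes the file is tab-delimited with a `ClinicalSignificance` column and that an exact, case-insensitive match on the label is wanted; the function name is hypothetical and the study's own filtering in 04_ClinVar/ may differ.

```python
import csv
import gzip


def iter_labelled_variants(path, labels=("pathogenic", "benign")):
    """Yield rows whose ClinicalSignificance equals one of `labels`.

    Assumption: exact, case-insensitive match. Combined labels such as
    "Likely benign" are therefore skipped in this sketch.
    """
    wanted = {label.lower() for label in labels}
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            significance = (row.get("ClinicalSignificance") or "").strip().lower()
            if significance in wanted:
                yield row
```

Because rows are yielded one at a time, the whole file never has to be held in memory.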
- Gene annotation
- Centromere and Pericentromere
- RepeatMasker
- Telomere
- Housekeeping genes from the HRT Atlas as "Human_Housekeeping_Genes.csv"
- Gene-centric chromosomal fragile sites from HumCFS. Unpack the individual files into the COSMIC/data/annotations/chromosome_bed folder.
- Sites of G4 structures as "G4_PQS.txt". Then, convert this file to "G4_PQS.csv".
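The conversion of "G4_PQS.txt" to "G4_PQS.csv" mentioned in the last item can be done in several ways. The sketch below assumes the downloaded text file is tab-delimited, which is not stated above, so please check the delimiter of your copy first. The function name is hypothetical.

```python
import csv


def delimited_to_csv(src, dst, delimiter="\t"):
    """Rewrite a delimited text file as comma-separated values.

    Returns the number of rows written. The tab delimiter is an assumption.
    """
    rows = 0
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        writer = csv.writer(fout)
        for record in csv.reader(fin, delimiter=delimiter):
            writer.writerow(record)
            rows += 1
    return rows


# Example call, using the file names from this README:
# delimited_to_csv("G4_PQS.txt", "G4_PQS.csv")
```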
To download CpG islands and Isochores from the UCSC Table Browser, please select the following:
- CpG Islands. clade: Mammal, genome: Human, assembly: Jan 2022 (T2T CHM13v2.0/hs1), group: All Tracks, track: CpG Islands, table: hub_3671779_cpgIslandExtUnmasked, output format: BED - browser extensible data, output filename: output_CpG_Islands.csv.
- Isochores. clade: Mammal, genome: Human, assembly: May 2004 (NCBI35/hg17), group: All Tracks, track: Isochores, table: ct_Isochores_9145, output format: BED - browser extensible data, output filename: iso_hg17.bb.
Unpack and extract the relevant files from above. Place the contents into COSMIC/data/annotations/ folder.
- Cancer driver genes from COSMIC release v98, May 2023. Unpack and extract the relevant files. Place the contents into COSMIC/data/COSMIC/ folder.
We obtained the chromothripsis breakpoint cases from ChromothripsisDB. Please download the dataset from Download -> Full Dataset -> Chromothripsis case data
Unpack and extract the relevant files from above. Place the contents into 03_Chromothripsis/data folder.
We retrieved 247 core-validated vertebrate transcription factor binding sites (TFBS) from the JASPAR 2024 database.
- Download the vertebrate dataset from here
- Download the bed files from here as "jaspar_beds"
Unpack and extract the relevant files from above. Place the contents into data/TFBS/ folder.
We processed all datasets in the reference genome version used as per the deposition. When doing comparative analysis, we lifted the genomic coordinates over to the latest T2T genome assembly.
Unpack and extract the relevant files. Place the contents into COSMIC/data/liftover/ folder.
We processed all datasets in the reference genome version used as per the deposition. For Kmertone, the individual fasta files were needed. This GitHub repo is dependent on the results of DNAFragility_dev, where the reference genomes are downloaded already.
The genomic sequence-based octameric features can be downloaded from the DNAfrAIlib repo. The quantum mechanical hexameric parameters can be downloaded from DNAkmerQM.
This is set up automatically if you run the below bash script.
bash get_feature_lib.sh
To run the 00_ML_proof_of_concept work, you need to have two datasets downloaded and processed following the method from DNAFragility_dev. The demonstration used in the paper is based on this study with data deposited on the GEO database. We specifically used DMSO-treated, endogenous DNA fragility in K562 cells. You can also run it on the etoposide-treated DNA fragility in K562 cells enriched at topoisomerase II sites.
For any ML task, you require the genomic sequence range of influence for each of the short-, medium-, and long-range effects. Depending on the dataset used, some datasets had to be pre-processed to handle 5'-3' DNA strand breaks. Hence, running the full DNAFragility_dev study beforehand is strongly advised.
Alternatively, if you wish to skip the DNAFragility_dev process, and just want to process these DNA strand breaks for the present study, please run the below bash script.
bash get_MLdemo_datasets.sh
We used the Homer software for motif discoveries, including de novo ones. We use the R package marge to interface with Homer. Below are two suggestions for downloading and installing the relevant source codes, as Option 1 may fail, depending on your operating system.
Option 1. To install marge, please follow the instructions from the GitHub page here.
Marge relies on a local installation of Homer. To install for your operating system, please follow the instructions from their website here.
Inside the 09_HOMER/lib/ environment, we used the below commands.
# In the terminal, we used the following commands.
mkdir lib
wget -P lib/ http://homer.ucsd.edu/homer/configureHomer.pl
perl /path/to/homer/configureHomer.pl -install homer
perl /path/to/homer/configureHomer.pl -install hg19
vi ~/.bashrc
PATH=$PATH:/path/to/homer/lib
source ~/.bashrc
In the 09_HOMER/Process.R file, we used the below commands.
# In the R environment, we used the following commands.
devtools::install_github('robertamezquita/marge', ref = "master", force = TRUE)
homer_path = "/path/to/homer/lib"
options('homer_path' = homer_path)
library(marge)
check_homer()
Option 2. Depending on your operating system, the above installation may not work. Below is a workaround that has worked in our case. Download the ZIP master file from the GitHub page here. Then, inside marge-master/R/check_homer.R, change the following line from loc <- system('type -a findMotifsGenome.pl', intern = TRUE) to loc <- system('type findMotifsGenome.pl', intern = TRUE). Then, run the below.
# path to local master R package from GitHub
path_to_file = "path/to/marge-master"
devtools::install(
pkg = path_to_file,
quiet = FALSE,
force = TRUE
)
library(marge)
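The one-line change to check_homer.R described in Option 2 can also be scripted, which is handy if you re-download the ZIP file. The following is a minimal sketch using the two lines quoted above; the function name is hypothetical, and it should be run before the devtools::install call.

```python
from pathlib import Path

# The original and replacement lines, as quoted in Option 2.
OLD = "loc <- system('type -a findMotifsGenome.pl', intern = TRUE)"
NEW = "loc <- system('type findMotifsGenome.pl', intern = TRUE)"


def patch_check_homer(path):
    """Replace the `type -a` call with `type`; return True if the file changed.

    Hypothetical helper for illustration only.
    """
    path = Path(path)
    text = path.read_text()
    if OLD not in text:
        return False  # already patched, or the upstream line differs
    path.write_text(text.replace(OLD, NEW))
    return True


# Example call, using the path from Option 2:
# patch_check_homer("path/to/marge-master/R/check_homer.R")
```

A return value of False means nothing was changed, so it is worth inspecting the file by hand in that case.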
If the above steps have been successfully implemented, you can run this optional study by going into the 09_HOMER/ folder, then running the below bash script. Please edit the homer_path inside this file to the path of your saved location.
bash submit.sh
Here, we use 13 statistically significant nullomer sequences from this paper, downloaded from here on 1 February 2024. Under Download, select Genomic MAWs. Download the Genomic_MAWs.tsv file and place it into the 06_Nullomers/data/ folder.
The workflow was the following. First, we wanted to evaluate whether nullomer sequences can bring fragility to a genomic region. Second, we wanted to pinpoint this to the nullomer sequence specifically. To do so, we searched the human genome for sequences that differ from the nullomer by exactly 1 base, introduced a SNP at that position to generate the nullomer, and evaluated the change in sequence fragility.
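The second step can be illustrated with a short sketch: list every sequence at Hamming distance 1 from a nullomer, scan a genome string for those sequences, and report the SNP that would create the nullomer. This is a simplified illustration and not the code in 06_Nullomers/. It assumes uppercase A/C/G/T input, 0-based positions, and the forward strand only, and all function names are hypothetical.

```python
def one_mismatch_variants(nullomer):
    """Map each sequence at Hamming distance 1 from `nullomer` to
    (offset, reference_base, nullomer_base)."""
    variants = {}
    for pos, target in enumerate(nullomer):
        for base in "ACGT":
            if base != target:
                seq = nullomer[:pos] + base + nullomer[pos + 1:]
                variants[seq] = (pos, base, target)
    return variants


def find_snp_sites(genome, nullomer):
    """Return (genome_position, ref_base, alt_base) for every SNP that
    would turn a one-mismatch site into the nullomer."""
    variants = one_mismatch_variants(nullomer)
    k = len(nullomer)
    sites = []
    for start in range(len(genome) - k + 1):
        hit = variants.get(genome[start:start + k])
        if hit is not None:
            offset, ref, alt = hit
            sites.append((start + offset, ref, alt))
    return sites


def apply_snp(genome, position, alt):
    """Return a copy of `genome` with the base at `position` set to `alt`."""
    return genome[:position] + alt + genome[position + 1:]
```

The fragility of the region before and after `apply_snp` can then be compared with whichever predictor is being evaluated.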
If the above steps have been successfully implemented, you can run this optional study by going into the 06_Nullomers/ folder, then running the below bash script.
bash submit.sh
- All cpp files are interfaced via the Rcpp library in R with omp.h when possible. Please ensure you have this installed.
- RcppArmadillo.h and RcppEigen.h are necessary for the feature extraction process. Please ensure you have these installed. By default, they will not be used in case you have not installed them.
- Various model predictions have been deposited if the compressed file size was within the GitHub file size limit. If you wish to view and/or use them, please gunzip the files.
- While this repo can run as a standalone study, the results are dependent on DNAFragility_dev and, when possible, we have deposited the necessary dependent files.
- When you run the run_dnafragility.sh bash script, you will need to include the path to the viennaRNA RNAfold programme as the first argument. Some operating systems allow you to interface it directly via RNAfold, others require the literal path to the programme.
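If the gunzip command is not available on your system, the deposited prediction files can be decompressed from Python instead. The following is a minimal sketch using only the standard library; the function name is hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(path, keep_original=True):
    """Decompress `name.gz` to `name` in the same folder; return the new path."""
    src = Path(path)
    if src.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {src.name}")
    dst = src.with_suffix("")  # strips only the trailing ".gz"
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)  # streams in chunks, low memory use
    if not keep_original:
        src.unlink()
    return dst
```

Unlike the gunzip command, this keeps the compressed file by default.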
If you wish to run all setups, including all the aforementioned bash scripts, please run the below bash script.
bash run_all_setup_files.sh
Please note that many of the calculations were computationally intensive, particularly the 01_LGBM_FullGenome and 05_DeltaFragility folders. Most things were run in parallel in smaller batches. However, if you submit the below bash script, it runs all scripts sequentially. This can take several months to complete.
Most tasks use several tens to hundreds of GB of RAM. The entire study requires between 2 and 4 TB of hard drive space.
You may need to monitor your memory usage, memory cache, and swap to ensure calculations run smoothly.
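Given the storage requirement stated above, a quick check of free disk space before starting can save a failed run. The following is a minimal sketch; the 4 TB default is the upper bound quoted above, and the function names are hypothetical.

```python
import shutil


def free_terabytes(path="."):
    """Free space on the filesystem that holds `path`, in decimal terabytes."""
    return shutil.disk_usage(path).free / 1e12


def has_enough_space(path=".", required_tb=4.0):
    """True if at least `required_tb` terabytes are free at `path`."""
    return free_terabytes(path) >= required_tb


if __name__ == "__main__":
    print(f"Free space: {free_terabytes('.'):.2f} TB")
```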
Arguments
- RNAfold_path: path to the RNAfold function for secondary structure prediction.
- fast_matrix: If TRUE, will use fast RcppArmadillo matrix calculations. Default FALSE.
bash run_dnafragility.sh $RNAfold_path $fast_matrix
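Because RNAfold is on the PATH on some systems and needs a literal path on others, the two arguments can be resolved programmatically before launching the script. The following is a minimal sketch. It assumes that fast_matrix is passed as the literal string TRUE or FALSE, which is inferred from the argument description rather than confirmed, and the function name is hypothetical.

```python
import shutil


def build_command(rnafold_path=None, fast_matrix=False):
    """Return the argument list for run_dnafragility.sh.

    If `rnafold_path` is not given, RNAfold is looked up on the PATH.
    """
    resolved = rnafold_path or shutil.which("RNAfold")
    if resolved is None:
        raise FileNotFoundError(
            "RNAfold not found on PATH; pass its full path explicitly"
        )
    # Assumption: the script expects the literal strings TRUE / FALSE.
    flag = "TRUE" if fast_matrix else "FALSE"
    return ["bash", "run_dnafragility.sh", resolved, flag]


# Example launch:
# import subprocess
# subprocess.run(build_command(), check=True)
```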