Computational analyses for: SMG1:SMG8:SMG9-complex integrity maintains robustness of nonsense-mediated mRNA decay
This repository contains the code, scripts and log files for the high-throughput sequencing analyses of the project:
SMG1:SMG8:SMG9-complex integrity maintains robustness of nonsense-mediated mRNA decay
(available as bioRxiv preprint)
This repository primarily aims to provide transparent insight into the high-throughput analysis steps used in the study of SMG8- and/or SMG9-KO, SMG8-deltaKID and SMG1i treatment in human cells (all high-throughput data were obtained from the colon cancer cell line HCT116).
- Complete analysis of multiple RNA-Seq datasets (provided in FASTQ format; see here for dataset overview and here for individual sample identification). Reads were mapped with STAR to Gencode v42 / GRCh38.primary_assembly supplemented with SIRVomeERCCome (from Lexogen; download). Transcripts were then quantified with Salmon in mapping-based mode, using a decoy-aware transcriptome index and the options --numGibbsSamples 30 --useVBOpt --gcBias --seqBias. Finally, differential gene expression (DGE) was analyzed with DESeq2 and differential transcript expression (DTE) with Swish (pre-revision) or edgeR (post-revision).
- The main Bash script (CRSA_V009.sh or CRSA_V010.sh) runs the complete pipeline or individual modules, selected via command-line options (see CRSA_V009.sh -h). It requires a design file specifying the following:
- reference type (gencode.v42.SIRVomeERCCome was used in this study)
- sequencing design (single- or paired-end reads)
- study name
- folder locations (srvdir for raw file locations, mydir for analyses output)
- location of the experiment file which specifies sample IDs and condition
- Please see the provided design.txt example file for more information about this design file. An example of the tab-delimited experiment.txt file is provided as well. Please see the comments in CRSA_V009.sh or CRSA_V010.sh for further instructions.
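For orientation, the Salmon quantification step listed above could look roughly like the following for a single paired-end sample. This is a minimal sketch and not the code in CRSA_V009.sh or CRSA_V010.sh: the function name `salmon_quant_sample`, the automatic library type detection (`-l A`), the thread count and all paths are assumptions made for illustration. Only the options `--numGibbsSamples 30 --useVBOpt --gcBias --seqBias` are taken from this README.

```shell
#!/usr/bin/env bash
# Sketch only: quantify one paired-end sample with Salmon in mapping-based mode.
# The function name, "-l A" and the default thread count are illustrative
# assumptions; the actual CRSA scripts may pass different arguments.
salmon_quant_sample() {
  local index_dir="$1"      # directory of the decoy-aware Salmon index
  local read1="$2"          # FASTQ file with first mates
  local read2="$3"          # FASTQ file with second mates
  local out_dir="$4"        # output directory for this sample
  local threads="${5:-12}"  # number of threads (assumed default)

  salmon quant \
    -i "$index_dir" \
    -l A \
    -1 "$read1" \
    -2 "$read2" \
    --numGibbsSamples 30 \
    --useVBOpt \
    --gcBias \
    --seqBias \
    -p "$threads" \
    -o "$out_dir"
}

# Example call (placeholder paths, commented out so nothing runs on sourcing):
# salmon_quant_sample /PATH/TO/salmon_index sample_R1.fastq.gz sample_R2.fastq.gz /PATH/TO/output/sample
```

Wrapping the call in a function keeps the fixed options in one place, so every sample is quantified with identical settings.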
- Many modules of the analysis script require specific tools. To run or reproduce the complete analysis, please make sure the following tools are installed and, where required, configured:
- STAR - version 2.7.10b was used for the analyses - with genome indices generated using GRCh38.primary.SIRVomeERCCome.fa and gencode.v42.SIRVomeERCCome.annotation.gtf (both reference files can be found here). The following code was used for genome index generation:
STAR --runMode genomeGenerate --runThreadN 15 --genomeDir /home/volker/reference/gencode.v42.SIRVomeERCCome --genomeFastaFiles /home/volker/reference/Gencode/GRCh38.primary.SIRVomeERCCome.fa --sjdbGTFfile /home/volker/reference/Gencode/gencode.v42.SIRVomeERCCome.annotation.gtf --sjdbOverhang 99
- Alfred - version v0.2.6 was used for the analyses
- samtools - version 1.16.1 (using htslib 1.16) was used for the analyses
- IGV tools - version 2.14.1 or 2.17.2 was used for the analyses - make sure you have the gencode.v42.SIRVomeERCCome.chrom.sizes file (can be found here) located in /PATH/TO/IGV/lib/genomes
- Salmon - version v1.9.0 was used for the analyses - with an index generated using gentrome.v42.SIRV.ERCC.fa.gz and decoys.txt (can be found here). A separate conda environment was created for Salmon. The following code was used for index generation:
salmon index -t /home/volker/reference/Gencode/gentrome.v42.SIRV.ERCC.fa.gz -d /home/volker/reference/Gencode/decoys.txt -p 12 -i /home/volker/reference/Transcriptome/gencode.v42.SIRVomeERCCome --gencode
- Additionally, many analyses were run using a range of R packages (including swish, edgeR, ...). Please see the session info for the individual R scripts for more information.
- All analyses were performed on a 16-core (2x Intel(R) Xeon(R) CPU E5-2687W v2 @ 3.40GHz) workstation with 128 GB RAM running Ubuntu 22.04.2 LTS
- Please make sure to change installation and file paths in the respective scripts to match your local environment
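Before running the pipeline, it may help to confirm that the locally installed tools match the versions listed above. The following is a minimal sketch of such a check. The helper names `extract_version` and `check_version` are hypothetical and not part of this repository, and the example calls assume that `STAR --version`, `salmon --version` and `samtools --version` print the version string in their usual format.

```shell
#!/usr/bin/env bash
# Sketch: compare installed tool versions with the versions listed in this README.
# Both helper functions are hypothetical and not part of the CRSA scripts.

# Extract the first dotted version number (e.g. 2.7.10b or 1.16.1) from a string.
extract_version() {
  printf '%s\n' "$1" | grep -Eo '[0-9]+(\.[0-9]+)+[a-z]?' | head -n 1
}

# Report whether the version found in a tool's output matches the expected one.
# Returns a non-zero status on mismatch.
check_version() {
  local tool="$1" expected="$2" output="$3"
  local found
  found="$(extract_version "$output")"
  if [ "$found" = "$expected" ]; then
    echo "OK: $tool $found"
  else
    echo "MISMATCH: $tool expected $expected, found ${found:-none}"
    return 1
  fi
}

# Example calls (commented out so the sketch has no side effects):
# check_version STAR     2.7.10b "$(STAR --version)"
# check_version salmon   1.9.0   "$(salmon --version)"
# check_version samtools 1.16.1  "$(samtools --version | head -n 1)"
```

A version mismatch does not necessarily break the analysis, but it is a likely first place to look if results differ from those reported.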
The specialized scripts called by the main CRSA_V009.sh (pre-revision) or CRSA_V010.sh (post-revision) script can be found here. The main R script to produce the RNA-Seq-based figures can be found here. This script uses the "SMG89_datasets.csv" file in the same folder to load DESeq2 or edgeR output data. The main R script to produce the proteomics-based figures can be found here. This script uses the "SMG89_proteomics.xlsx" file in the data folder.
We previously used swish to perform DTE analyses but have now switched to edgeR. As a result, e.g. the absolute numbers of differentially expressed transcripts in the bioRxiv version will differ from those in the final version of the manuscript. The main reason for the switch was that edgeR provides a function to better control “read-to-transcript ambiguity” and the false discovery rate (see: https://doi.org/10.1093/nar/gkad1167).
The Salmon, DESeq2 and edgeR output data required to re-run most of the analyses can be found here. Several helper files, log files and QC-related data, as well as a "ready-to-load-in-R" 2024-10-28_SMG189_datasources.rds file, can be found there as well.
Feedback is welcome! For any questions, please email boehmv@uni.koeln.de or create an issue.
TBD
Sabrina Kueckelmann, Sophie Theunissen, Jan-Wilm Lackmann, Marek Franitza, Kerstin Becker, Volker Boehm, Niels H. Gehring (2024) SMG1:SMG8:SMG9-complex integrity maintains robustness of nonsense-mediated mRNA decay. bioRxiv 2024.04.15.589496; doi: https://doi.org/10.1101/2024.04.15.589496