This is the associated GitHub page for the book chapter (please cite this work when using our code from this repo):
Richa Bharti, Dominik G Grimm
Design and Analysis of RNA Sequencing Data
In: Kappelmann-Fenzl M. (eds) Next Generation Sequencing and Data Analysis. Learning Materials in Biosciences. Springer, Cham. (https://doi.org/10.1007/978-3-030-62490-3_11).
A comprehensive outline of RNASeq data analysis including quality assessment, pre-processing alignment/mapping of reads, quantification of genes and isoforms and differential expression (DE) analysis is provided. Step wise optimized codes and data formats alongside brief description and references to the corresponding sections of the book chapter are also provided. Please go through the steps below before you begin to use the workflows.
Any Linux based distro should work. Out test OS is:
Distributor ID: Ubuntu
Description: Ubuntu 18.04.2 LTS
Release: 18.04
Codename: bionic
lsb_release -a
on a Ubuntu based system.
It is no secret that the hardware highly influences the speed of the workflows. And just one same can generate ~10-60 Million reads. The most time consuming tasks are the ones that involve reference based alignment. A modest configuration consists of 16+cores and 16 GB of RAM with TB external hard drive or 48 TB server and 3.6 GHz clock speed if one wants to run it on one’s own workstation. A majority of the diskspace is occupied by reference . Our HW configuration consists of 20 core CPU with 128 GB.
All software should be installed by the user directly as the workflow depends on a lot of external software. Without these the workflow will fail to run.
-
gcc, g++
-
java
-
python 2 python3: pip
-
R version 3.6.2
-
git
git clone https://github.com/grimmlab/BookChapter-RNA-Seq-Analyses.git
After cloning the repository, you should have the following files in your repo directory:
$ls -1
Rscripts
differential_expression.sh
download_genome_and_annotation.sh
download_install_tools.sh
download_RNASeq_reads_and_QC.sh
metadata.txt
project_config.sh
quantification.sh
reference_based_alignment.sh
The workflow centered around two variables which contain the project name and the project path. The project name is the root directory of the project under which the project relevant files are stored. The project path is the path on the system where the project root directory is created. These two variables can be set in a configuration file called as project_conf.sh
. The default values of the project_conf.sh are shown here:
PROJECT_NAME=RNASeq_project
PROJECT_PATH=${PWD}
Under the defined project name root directory, four sub-directories are created. They are:
|-- RNASeq_project
| | analysis
| | data
| | reads
| | tools
- analysis: This will contain the output of the analysis performed during each step of the workflow.
- data: This will contain the genome, annotation and index genome files.
- reads: This will contain the sequencing reads which is used as an input to perform the analysis.
- tools: This will contain all the tools and software required to run the workflow.
-
The first step for running the workflow is to download and install all the required tools used by the workflow. This can be done by running the shell script:
$./download_install_tools.sh
This will first create a working folder structure as shown in the above section. This script will download all the tools required to perform the analysis inside the tools folder as listed below:
~/RNASeq_project/tools$ls -1 DEXSeq FastQC STAR-2.7.3a bedtools2 bowtie2 cufflinks cutadapt htseq rseqqc samtools sratoolkit tophat
-
Assumption here is that RNASeq data is from mouse or human. The genomes of these species are available and well annotated. The genome file is in fasta format and the annotation file is in gtf/gff format. This script will download the current gencode genome (default: mouse) and its annotation into the data folder. Currently, the latest version of gencode is used for the mouse genome. The version can be changed by modifying
GENOME_VERSION
parameter and the genome can be changed to human by modifyingGENOME_TYPE
in theproject_conf.sh
file:GENOME_TYPE=mouse # or human GENOME_VERSION=24 #or 33 for human
This can be done by running the shell script:
$./download_genome_and_annotation.sh
After the execution of the script, the data folder will contain:
~/RNASeq_project/data$ls -1 mouse-gencode-version-24
Output:
~/RNASeq_project/data/mouse-gencode-version-24$ls -1 GRCm38.p6.genome.fa GRCm38.p6.genome.fa.gz gencode.vM24.2wayconspseudos.gff3 gencode.vM24.2wayconspseudos.gff3.gz gencode.vM24.annotation.gff3 gencode.vM24.annotation.gff3.gz gencode.vM24.annotation.gtf gencode.vM24.annotation.gtf.gz gencode.vM24.long_noncoding_RNAs.gff3 gencode.vM24.long_noncoding_RNAs.gff3.gz gencode.vM24.tRNAs.gff3 gencode.vM24.tRNAs.gff3.gz genome_annoatation_md5sum.txt
-
This step is used to download an example data to perform the RNASeq analysis. This can be done by running the shell script:
$./download_RNASeq_reads_and_QC.sh
This script contains three main sub-steps:
a. Downloading the publically available data
Publically available data from PMID: 28738885 is used for the analysis purpose. In the first step data is downloaded using
fastq-dump
binary from SRA toolkit in the tools folder.The
fastq-dump
tool uses the following parameters:fastq-dump \ ${SRA_ID} \ --split-files \ --origfmt \ --gzip
Description:
fastq-dump
is the binary${SRA_ID}
are list of fastq files. These SRA files will be downloaded one at a time--split-files
parameter will produce two reads if the data is paired end else single file--origfmt
parameter contains only original sequence name--gzip
will provide the compressed output using gzipOutput: The
fastq-dump
downloads the following files into the reads/rawreads folder:~/RNASeq_project/reads/rawreads$ls -1 SRR5858228_1.fastq.gz SRR5858229_1.fastq.gz SRR5858230_1.fastq.gz SRR5858231_1.fastq.gz SRR5858232_1.fastq.gz SRR5858233_1.fastq.gz SRR5858234_1.fastq.gz SRR5858235_1.fastq.gz SRR5858236_1.fastq.gz
NOTE! To read more about fastq-dump parameters click here.
b. Quality assessment of downloaded data:
The downloaded SRA fastq files are then assessed for sequence quality using
fastqc
program.The
fastqc
tool uses the following parameters:fastqc \ SRR5858228_1.fastq.gz \ -o rawreads_QA_stats
Description:
fastqc
is the binary.SRR5858229_1.fasta.gz
is one example file as an input for quality assessment. This is run in a for loop which will run on all downloaded samples.-o
is path to output directory, here it isreads_QA_stats
.Output: The
fastqc
will generate the output for all downloaded files into the reads/rawreads_QA_stats folder:~/RNASeq_project/reads/rawreads_QA_stats$ls -1 SRR5858228_1_fastqc.html SRR5858228_1_fastqc.zip SRR5858229_1_fastqc.html SRR5858229_1_fastqc.zip SRR5858230_1_fastqc.html SRR5858230_1_fastqc.zip SRR5858231_1_fastqc.html SRR5858231_1_fastqc.zip SRR5858232_1_fastqc.html SRR5858232_1_fastqc.zip SRR5858233_1_fastqc.html SRR5858233_1_fastqc.zip SRR5858234_1_fastqc.html SRR5858234_1_fastqc.zip SRR5858235_1_fastqc.html SRR5858235_1_fastqc.zip SRR5858236_1_fastqc.html SRR5858236_1_fastqc.zip
Note! For more details about fastqc, read section XXX in book and for parameters check FastQC
c. Quality control of downloaded data:
Based on the assessment of the report quality control (QC) is performed using the
cutadapt
tool.The
cutadapt
tool uses the following parameters:cutadapt \ -q 20 \ --minimum-length 35 \ -a "CTGTCTCTTATACACATCT" \ -o reads/filtered_reads/SRR5858228_1_trimmed.gz \ SRR5858229_1.fastq.gz \ #Input file name > SRR5858229_1_cutadapt_stats.txt # Output stats for given file
Description:
-a
is the adapter sequence for trimming--minimum-length
is the minimum sequence length (35)-q
phred score of 20-o
is the output file nameOutput: The
cutadapt
will generated trimmed files into the reads/filtered_reads folder:~/RNASeq_project/reads/filtered_reads$ls -1 SRR5858228_1_trimmed.gz SRR5858229_1_trimmed.gz SRR5858230_1_trimmed.gz SRR5858231_1_trimmed.gz SRR5858232_1_trimmed.gz SRR5858233_1_trimmed.gz SRR5858234_1_trimmed.gz SRR5858235_1_trimmed.gz SRR5858236_1_trimmed.gz
Note! Details of quality control and how to choose the parameter are chosen is explained in chapter (XXX). To understand parameter in detail check Cutadapt.
-
Aligning reads to genome can be challenging due to millions of reads, small read size (2nd generation sequencing), sequencing errors and variation leads to mismatches and indels. In addition to that eukaryotic genome have introns , this means when genome is used a reference, aligner is aware of splice junction in the reference.
This step will perform reference based RNASeq alignment. This can be done by running the shell script:
$./reference_based_alignment.sh
Currently, several splice aware tools are available. In this workflow we have used the two most commonly used tools TopHat2 and STAR. The choice of
ALIGNER
can be set in theproject_conf.sh
file:ALIGNER=star #tophat or star
The table below shows the tools used for each alignment step for both the STAR and TopHat2 aligners.
Alignment Steps STAR Alignment TopHat2 Alignment Genome indexing STAR (a.) bowtie2 (c.) Reference based alignment STAR (b.) TopHat2 (d.) BAM file indexing samtools (e.) samtools Alignment stats RseQC (f.) RseQC Visualization IGV (g.) IGV a. Genome indexing using STAR:
STAR \ --runThreadN 8 \ --runMode genomeGenerate \ --genomeDir data/STAR_Genome_Index \ --genomeFastaFiles data/mouse-gencode-version-24/GRCm38.p6.genome.fa \ --sjdbGTFfile data/mouse-gencode-version-24/gencode.vM24.annotation.gtf
Description:
STAR
is the binary--runMode
is set togenomeGenerate
which generate genome files--genomeDir
is the path to output directory name--genomeFastaFiles
is the path to genome fasta file with file name--sjdbGTFfile
is the path to the annotation file with file name--runThreadN
is number of threads (default used is 8)Output: The
STAR
genome indexing will generated indexed genome files into the data/STAR_Genome_Index folder:~/RNASeq_project/data/STAR_Genome_Index$ls -1 Genome SA SAindex chrLength.txt chrName.txt chrNameLength.txt chrStart.txt exonGeTrInfo.tab exonInfo.tab geneInfo.tab genomeParameters.txt sjdbInfo.txt sjdbList.fromGTF.out.tab sjdbList.out.tab transcriptInfo.tab
Note! For more detail read section XXXX in the book and for parameters check STAR
b. Alignment using STAR:
STAR \ --runMode alignReads \ --runThreadN 8 \ --genomeDir data/STAR_Genome_Index \ --readFilesIn reads/filtered_reads/SRR5858228_1_trimmed.gz \ --readFilesCommand zcat \ --sjdbGTFfile data/mouse-gencode-version-24/gencode.vM24.annotation.gtf \ --outFileNamePrefix /analysis/mapping/star/SRR5858228_1_trimmed \ --outSAMtype BAM SortedByCoordinate
Description:
STAR
is the binary--runMode
isalignReads
which means map reads to the reference genome--runThreadN
is the number of threads (default used is 8)--genomeDir
is the path to genome indexed directory--readFilesIn
is the path to trimmed reads including file name--readFilesCommand
for gzipped files (*.gz) use zcat--sjdbGTFfile
is the path annotation file with file name--outFileNamePrefix
is output prefix name with its path--outSAMtype
is the output sorted by coordinate, similar to samtools sort commandOutput: The
STAR
genome alignment will generated in files into the analysis/mapping/star folder:This is the output for one sample, similar files will be generated for other samples as well
~/RNASeq_project/analysis/mapping/star$ls -1 Percentage_uniquely_mapped_reads.csv SRR5858228_1_trimmed.Aligned.sortedByCoord.out.bam SRR5858228_1_trimmed.Aligned.sortedByCoord.out.bam.bai SRR5858228_1_trimmed.Log.final.out SRR5858228_1_trimmed.Log.out SRR5858228_1_trimmed.Log.progress.out SRR5858228_1_trimmed.SJ.out.tab SRR5858228_1_trimmed._STARgenome
- SRR5858228_1_trimmed.Aligned.sortedByCoord.out.bam is aligned bam file
- SRR5858228_1_trimmed.SJ.out.tab is a tab-delimited file that provides information about alignments to splice junctions
- Percentage_uniquely_mapped_reads.csv file represents the uniquely mapped reads
These three files are as names suggests provides the information about ongoing samples alignment. Out of these file with Log.final.out extension provides the complete mapping stats
- SRR5858228_1_trimmed.Log.final.out
- SRR5858228_1_trimmed.Log.out
- SRR5858228_1_trimmed.Log.progress.out
Note! For more detail read section XXXX in the book and for parameters check STAR
c. Genome indexing using bowtie for TopHat2:
TopHat2 uses bowtie2 (which is gapped aligner) for genome indexing
bowtie2-build \ -f data/mouse-gencode-version-24/GRCm38.p6.genome.fa \ /data/TOPHAT_Genome_Index/ \ -p 8
Description:
bowtie2-build
is the binary-f
is the path to genome fasta file with file name-p
is to launch a specified number of parallel search threadsTOPHAT_Genome_Index
is the output directory created in data folderOutput: The
bowtie-build
genome indexing will generated indexed genome files into the data/TOPHAT_Genome_Index folder:~/RNASeq_project/data/TOPHAT_Genome_Index$ls -1 GRCm38.p6.genome.1.bt2 GRCm38.p6.genome.2.bt2 GRCm38.p6.genome.3.bt2 GRCm38.p6.genome.4.bt2 GRCm38.p6.genome.fa GRCm38.p6.genome.rev.1.bt2 GRCm38.p6.genome.rev.2.bt2
Note! Bowtie2 itself is not able to perform spliced alignments. Bowtie2 is known for its speed and small-memory efficiency because it index the reference genome using an FM index which is a Burrows–Wheeler transform method. For more detail read section XXXX in the book.
d. Alignment using TopHat2:
tophat \ -o /analysis/mapping/tophat \ -p 8 \ -G data/mouse-gencode-version-24/gencode.vM24.annotation.gtf \ /data/TOPHAT_Genome_Index/GRCm38.p6.genome \ reads/filtered_reads/SRR5858228_1_trimmed.gz
Description:
tophat
is the binary-o
is the name of the directory in which TopHat will write all of its outputs-p
is the number of threads to align reads (default used is 8)-G
is the path to the annotation file with file name/data/TOPHAT_Genome_Index/GRCm38.p6.genome
is the path and prefix name of genome fasta filereads/filtered_reads/SRR5858228_1_trimmed.gz
is the path to trimmed reads including file nameOutput: The
tophat
genome alignment will generated in files into the analysis/mapping/tophat folderTopHat generated many files as listed below. Here is example output for one SRA ID SRR5858228:
- SRR5858228_1_trimmed_accepted_hits.bam represents the alignment file in bam format.
- SRR5858228_1_trimmed_junctions.bed consists of a list of exon junctions in BED formation. An exon junction comprises two blocks where either block is as long as the longest overhang of any read present in the junction sequence. The score is the number of alignments identified for a junction sequence.
- SRR5858228_1_trimmed_insertions.bed consists of a list of discovered insertions. In each case chromLeft represents the last genomic base before individual insertions.
- SRR5858228_1_trimmed_deletions.bed consists of a list of discovered deletions. In each case chromLeft represents the last genomic base before individual deletions.
- SRR5858228_1_trimmed_align_summary.txt contains a summary of alignment rates along with the number of reads and pairs showing multiple alignments.
Note! For more detail read section XXXX in the book and for parameters check TopHat2.
e. BAM file indexing for both STAR or TopHat2:
samtools \ index analysis/mapping/star/SRR5858228_1_trimmed.bam
Description:
samtools
is the binaryindex
will index a coordinate-sorted BAM file for fast random accessOutput: This command will index all the bam files and create
.bai
files in the respective mapping folder.Note! For more detail read section XXXX in the book and for parameters check samtools.
f. Generate stats for alignment:
python3 bam_stat.py \ -q 30 \ -i analysis/mapping/star/SRR5858228_1_trimmed.bam > \ analysis/mapping/star/SRR5858228_1_trimmed_RSeQC_alignment_stats.txt
Description:
bam_stat.py
is the python script to calculate bam stats-i
input bam file with its path-q
is the mapping quality to determine uniquely mapped readRseQC produces table in the respective mapping folder where unique reads are considered if their mapping quality is more than 30.
Output: SRR5858228_1_trimmed.Aligned.sortedByCoord.out._RSeQC_alignment_stats.txt file will be created for SRA sample ID SRR5858228.
#============== # All numbers are READ count #============== Total records: 72432644 QC failed: 0 Optical/PCR duplicate: 0 Non primary hits 11660623 Unmapped reads: 0 mapq < mapq_cut (non-unique): 5935125 mapq >= mapq_cut (unique): 54836896 Read-1: 0 Read-2: 0 Reads map to '+': 27245688 Reads map to '-': 27591208 Non-splice reads: 42688777 Splice reads: 12148119 Reads mapped in proper pairs: 0 Proper-paired reads map to different chrom: 0 Note! For more detail read section XXXX in the book and for parameters check RseQC.
g. Visualization of alignment (BAM) files:
The indexed bam files along with genome and annotation can be loaded into several genome browsers, including the Integrative Genomics Viewer IGV
IGV
. Explaining this is beyond the scope of this chapter and recommend to go through https://software.broadinstitute.org/software/igv/UserGuide. -
This step will perform quantification on alignment files generated form reference based RNASeq alignment. This can be done by running the shell script:
$./quantification.sh
The three ways to perform quantification is XXXXXX aware tools are available. In this workflow we have used the two most commonly used tools XXXXX. The choice of
QUANT_COUNT
can be set in theproject_conf.sh
file:QUANT_COUNT=per_gene_bedtools # per_gene_htseq or per_transcript_cufflinks or per_exon_dexseq
Quantification methods (Counting reads) Tools per gene bedtools (a.) per gene HTSeq (b.) per transcript Cufflinks (c.) per exon DEXSeq (d.) a. Counting reads per gene using bedtools:
bedtools intersect \ -S \ -wa \ -c \ -a genes_only.gff3 \ -b analysis/mapping/star/SRR5858228_1_trimmed.bam \ > analysis/quantification/bedtools-count/SRR5858228_1_trimmed_counts.csv
Description:
bedtools
is the binaryintersect
allows one to screen for overlaps between two sets of genomic features-S
is require for different strandedness-c
is for each entry in genomic features A, report the number of hits in genomic features B while restricting to-f
(-f is for minimum overlap required as a fraction of A. Default is 1E-9 (i.e. 1bp))-a
is gff annotation file where each genomic features in A is compared to genomic features B in search of overlaps-b
is the input bam file with its pathOutput: Output will be generated in analysis/quantification/bedtools-count folder for each file. In the end all the bedtools generated output files are merged to created final file Read_per_features_combined.csv in same folder which will be the input for differential expression analysis.
Note! For more details about bedtools read section XXX in the book and for parameters check bedtools
b. Counting reads per gene using HTSeq:
htseq-count \ -f bam \ -a 10 \ -m intersect-strict \ -s no \ -t exon \ -i gene_id \ analysis/mapping/star/SRR5858228_1_trimmed.bam \ data/mouse-gencode-version-24/gencode.vM24.annotation.gtf > \ analysis/quantification/htseq-count/SRR5858228_1_trimmed_counts.csv
Description:
htseq-count
is the binary-f
is the format of the input data.-a
will skip all reads with MAPQ alignment quality lower than the given minimum value (default: 10).-m
is mode to handle reads overlapping more than one feature. In this case itsintersect-strict
.-s
is set whether the data is from a strand-specific assay. Forstranded=no
, a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature-t
is the feature type (3rd column in GFF file) to be used, all features of other type are ignored (default, suitable for RNA-Seq analysis using an GENCODE GTF file:exon
)-i
is the attribute to be used as feature ID. The default, suitable for RNA-Seq analysis using an GENCODE GTF file, isgene_id
.Output: The outputs will be generated in analysis/quantification/htseq-count folder. In the end all the HTSeq generated output files are combined to created final file Read_per_features_combined.csv in same folder which will be used as an input for differential expression analysis.
Note! For more details about HTSeq read section XXX in the book and for parameters check HTSeq
c. Counting reads per transcript using cufflinks:
cufflinks \ -G data/mouse-gencode-version-24/gencode.vM24.annotation.gtf \ -b data/mouse-gencode-version-24/GRCm38.p6.genome.fa \ -p 8 \ analysis/mapping/star/SRR5858228_1_trimmed.bam \ -o analysis/quantification/cufflinks-count
Description:
cufflinks
is the binary-G
is the path to the annotation file with file name-b
is the path to genome fasta file with file name-p
is the number threads-o
is the name of the directory containing all of its generated outputOutput: The outputs will be generated in analysis/quantification/cufflinks-count folder. The output consists of transcript and gene-level FPKM-tracking files, which contain FPKM values and their confidence intervals. FPKM tracking files are also produced when a set of samples is tested for differential expression using Cuffdiff.
Note! For more details about Cufflinks read section XXX in the book and for parameters check Cufflinks.
d. Counting reads per exon using DEXSeq:
The initial steps of a DEXSeq analysis are done using two Python scripts. Importantly, these preprocessing steps can also be done using tools equivalent to these Python scripts, for example, using GenomicRanges infrastructure (10.2) or Rsubread (10.3). The following two steps describe how to do this steps using the Python scripts that are provided within DEXSeq.
Preparing the annotation
python2 dexseq_prepare_annotation.py \ data/mouse-gencode-version-24/gencode.vM24.annotation.gtf \ analysis/quantification/dexseq-count/DEXSEQ_GTF_annotation.gff
The Python script dexseq_prepare_annotation.py takes an GTF file and translates it into a GFF file with collapsed exon counting bins. For more details see Figure 1 of the DEXSeq paper (Anders, Reyes, and Huber (2012)) for an illustration.
Counting reads
python2 dexseq_count.py \ -p no \ -s no \ -r name \ analysis/quantification/dexseq-count/DEXSEQ_GTF_annotation.gff \ -f bam \ -a 10 \ analysis/mapping/star/SRR5858228_1_trimmed.bam \ analysis/quantification/dexseq-count/SRR5858228_1_trimmed_exon_counts.csv
Description:
-p
is to set the paired end information. If the data is from a paired-end sequencing run, you need to add the option-p yes
-s
If you have used a library preparation protocol that does not preserve strand information (i.e., reads from a given gene can appear equally likely on either strand), you need to inform the script by specifying the option-s no
-r
is to indicate whether your data is sorted by alignment position or by read name-f
input reads format-a
to specify the minimum alignment quality. All reads with a lower quality than specified (with default-a 10
) are skipped.Output: The outputs for both the steps will be generated in analysis/quantification/dexseq-count folder with files for all the samples.
Note! For more details about DEXSeq read section XXX in the book and for parameters check DEXSeq.
-
Differential expression (DE) analysis refers to the identification of genes (or other types of genomic features, such as, transcripts or exons) that are expressed in significantly different quantities in distinct groups of samples, be it biological conditions (drug-treated vs. controls), diseased vs. healthy individuals, different tissues, different stages of development, or something else. This step will perform quantification on alignment files generated form reference based RNASeq alignment. This can be done by running the shell script:
$./differential_expression.sh
The choice of
QUANT_COUNT
set in theproject_conf.sh
file determines the differential expression tool. The table below shows the differential expression tool used for the generated count data:Generated count data Differential expression tools per gene for bedtools output DESeq2 (a.) -> Differentially expressed genes per gene for HTSeq output DESeq2 -> Differentially expressed genes per exon for DEXSeq output DEXSeq (b.) -> Differentially expressed exon per transcript for cufflinks output cuffdiff (c.) -> Differentially expressed isoforms a. Differential expression for count data generated using bedtools/HTSeq:
This analysis of differential expression (DE) will identify of genes that are expressed in significantly different quantities in distinct groups biological conditions (vehicle_treated vs. drug_treated).
Rscript DE_deseq_genes_bedtools.R \ --genecount analysis/quantification/bedtools-count/Read_per_features_combined.csv \ --metadata metadata.txt \ --condition condition \ --outputpath analysis/DE/deseq2/
Description:
--genecount
is combined generated from bedtools or htseq along with its path--metadata
is the sample information file--condition
is the column in the metadata data file that will be used for pairwise comparision using DESeq2.--outputpath
is the path to the output directory where the result table and plots will be generated.Output: Differentially expressed outfiles for bedtools and HTSeq are generated in analysis/DE/deseq2/
~/RNASeq_project/analysis/DE/deseq2$ls -1 PCA-samples_bedtools.pdf diffexpr-maplot_bedtools.pdf pvalue_histogram_50_bedtools.pdf qc-dispersions_bedtools.pdf qc-heatmap-samples_bedtools.pdf vehicle_treated_vs_drug_treated_diffexpr-results_bedtools.csv
- vehicle_treated_vs_drug_treated_diffexpr-results_bedtools.csv is the final table from deseq2 for the pairwise comparison between the two conditions with stats. The counts are normalized to rlog normalization.
- PCA-samples_bedtools.pdf XXXXXXXXXXXXXXXXXX
- diffexpr-maplot_bedtools.pdf XXXXXXXXXXXXXXXXXX
- pvalue_histogram_50_bedtools.pdf XXXXXXXXXXXXXXXXXX
- qc-dispersions_bedtools.pdf XXXXXXXXXXXXXXXXXX
- qc-heatmap-samples_bedtools.pdf XXXXXXXXXXXXXXXXXX
Note! For more details about differential expression read section XXX in the book and for analysis details check DESeq2.
b. Differential expression for count data generated using DEXSeq:
Rscript DE_dexseq_exon.R \ --exoncountpath analysis/quantification/dexseq-count/ \ --gffFile analysis/quantification/dexseq-count/DEXSEQ_GTF_annotation.gff \ --metadata metadata.txt \ --outputpath analysis/DE/dexseq/
Description:
--exoncountpath
is the dexseq generated exon count file path--gffFile
is the dexseq generated gff file with its path--metadata
is the sample information file--outputpath
is the path to the output directory where the result table and plots will be generated.Output: Differentially expressed output files for bedtools and HTSeq are generated in analysis/DE/dexseq/
~/RNASeq_project/analysis/DE/dexseq2$ls -1 diffexpr-results_dexseq.csv dispersion_plot_dexseq.pdf ma_plot_dexseq.pdf
- diffexpr-results_dexseq.csv is the final table from deseq2 for the pairwise comparison between the two conditions with stats.
- dispersion_plot_dexseq.pdf
- ma_plot_dexseq.pdf XXXXX
Note! For more details about differential expression read section XXX in the book and for analysis details check DEXSeq
c. Differential expression for count data generated using cufflinks:
cuffdiff \ -o analysis/DE/cuffdiff \ -L vt,dt \ --FDR 0.01 \ -u data/mouse-gencode-version-24/gencode.vM24.annotation.gtf \ -p 8 \ BAMLIST
Description:
-o
path to the output folder-L
lists the labels to be used as “conditions”-FDR
cutoff for false discovery rate for the DE analysis-u
path to annotation file-p
is the number threadsBAMLIST
is list of all bam file in comma format
In the end these script will create a folders and sub-folders based on the choice of analysis on the current data as shown below:
|-- RNASeq_project
| | analysis
| | |-- mapping
| | | |-- tophat/star
| | |-- quantification
| | | |-- bedtools-count/cufflinks-count/dexseq-count/htseq-count
| | |-- DE
| | | |-- deseq2/dexseq/cuffdiff
| | data
| | |-- mouse-gencode-version-24/
| | |-- STAR_Genome_Index/TOPHAT_Genome_Index/
| | reads
| | |-- rawreads
| | |-- rawreads_QA_stats
| | |-- filtered_reads
| | tools
| | |-- DEXSeq/FastQC/STAR-2.7.3a/bedtools2/bowtie2/cufflinks/cutadapt/htseq/rseqqc/samtools/sratoolkit/tophat