GitHub

RNA-seq Pipeline

This script automates a basic RNA-seq pipeline for processing Illumina reads from sequencing runs, including quality control, alignment, and differential expression analysis. The pipeline assumes that the input files are either paired-end or single-end reads in gzipped FASTQ format, and that the reference genome and annotation file are available in the appropriate directories.

Dependencies

The following tools and software packages are required to run this pipeline:

FastQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Trimmomatic: http://www.usadellab.org/cms/?page=trimmomatic
HISAT2: https://daehwankimlab.github.io/hisat2/download/
SAMtools: http://www.htslib.org/download/
featureCounts: http://subread.sourceforge.net/
DESeq2: https://bioconductor.org/packages/release/bioc/html/DESeq2.html
R: https://www.r-project.org/
Python: https://www.python.org/
GNU Bash: https://www.gnu.org/software/bash/

Usage

Set the following variables in the script:
- FASTQ_DIR: the path to the directory containing the input fastq files
- GENOME_DIR: the path to the directory containing the reference genome files
- ANNOTATION_FILE: the path to the GTF annotation file
- OUTPUT_DIR: the path to the directory where output files will be saved
Run the script in a Linux terminal using bash:
- chmod +x rna_seq_pipeline.sh
- bash RNAseq_pipeline.sh

Pipeline Steps

Check that each sample has exactly two input fastq files with the correct naming convention.
Check whether input fastq files are single-end or paired-end.
Perform quality control and preprocessing using fastqc and trimmomatic.
Perform alignment using hisat2.
Convert the resulting SAM file to BAM format and sort it using samtools.
Count features using featureCounts.
Perform differential expression analysis using DESeq2 in R.
Validate the output using validate.py.

Outputs

The pipeline generates the following output files for each sample:

fastqc_results/: a directory containing the output from fastqc
*_aligned_reads.bam: the sorted BAM file containing aligned reads
*_counts.txt: the feature counts for each gene
*_validated_results.txt: the validated differential expression results

All output files are saved to the specified OUTPUT_DIR, with the sample name included in the file name.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
LICENSE		LICENSE
README.md		README.md
rna_seq_pipeline.sh		rna_seq_pipeline.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RNA-seq Pipeline

Dependencies

Usage

Pipeline Steps

Outputs

About

Releases

Packages

Languages

License

pyRocksy/Bioinformatics

Folders and files

Latest commit

History

Repository files navigation

RNA-seq Pipeline

Dependencies

Usage

Pipeline Steps

Outputs

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages