Lab 08: Putting it all together

Ryan edited this page Jan 14, 2024 · 8 revisions

Putting it all together

The goal of this lab is to use the skills we've built over the past two days to make a simple RNA-Seq analysis pipeline that performs QC and maps reads to a reference genome using the custom-built STAR image.

Viewing the RNA-Seq pipeline

If you inspect exercises/08_putting_it_all_together, you'll see a Nextflow pipeline using the same organization we used in labs 4 and 6.

Inspecting the workflow

In main.nf, we first state that we're using DSL2, then import the modules containing all of our processes for a simplified RNA-Seq pipeline.

nextflow.enable.dsl=2

include { FASTQC     } from "./modules/fastqc.nf"
include { MULTIQC    } from "./modules/multiqc.nf"
include { STAR_INDEX } from "./modules/star.nf"
include { STAR_MAP   } from "./modules/star.nf"

Now let's see our workflow.

workflow {

    if (params.fastq_seqs) {
        // Build the read-pair channel outside the QC block so STAR_MAP can still use it when QC is skipped
        ch_fastqs = Channel.fromFilePairs("${params.fastq_seqs}/*read{1,2}.fastq.gz", checkIfExists: true, flat: true)
        if (!params.skip_qc) {
            FASTQC(ch_fastqs)
            MULTIQC(FASTQC.out.ch_fastqc.collect())
        }
        STAR_INDEX(params.genome, params.annot)
        STAR_MAP(ch_fastqs, STAR_INDEX.out.star_idx)
    }
}

Our workflow first checks, in an if block, whether params.fastq_seqs is set. A second, nested if block makes the FastQC and MultiQC quality-control steps optional (toggled on or off with params.skip_qc). After the optional QC steps, STAR_INDEX builds a genome index, which is passed to the alignment step, STAR_MAP.
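To see what the read-pair channel hands to the downstream processes, a minimal sketch like the one below can help. It assumes a hypothetical directory named data holding files such as sampleA_read1.fastq.gz and sampleA_read2.fastq.gz; neither name comes from the lab materials. With flat: true, each emitted item should be a flat three-element list (a grouping key taken from the shared part of the file names, then the two read files) rather than a key plus a nested list of files.

```groovy
nextflow.enable.dsl=2

workflow {
    // "data" is a hypothetical directory used only for illustration
    Channel
        .fromFilePairs("data/*read{1,2}.fastq.gz", flat: true)
        .view()   // prints each emitted item so the [key, read1, read2] shape is visible
}
```

This flat shape is what lets a process declare its input as a three-element tuple of an id and two paths.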

⭐ Take a moment to see if you can tell where every params parameter is coming from.

Inspecting the configuration

Look at the nextflow.config file.

params {
    fastq_seqs = false
    skip_qc = false
}

process {
    publish_dir = "${params.publish_dir}"
    withLabel: star {
        cpus = 2
    }
}

singularity {
    enabled = true
    cacheDir = "${HOME}/singularity/"
    autoMounts = true
}

report {
    enabled = true
    file    = "${process.publish_dir}/summary/report.html"
}

timeline {
    enabled = true
    file    = "${process.publish_dir}/summary/timeline.html"
}

⭐ Notice anything new?

process {
    publish_dir = "${params.publish_dir}"
    withLabel: star {
        cpus = 2
    }
}

Here we take the variable "publish_dir" from the params scope (it is supplied on the command line) and make it available in the process scope. We've also added a settings block for a label called "star".
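Because publish_dir is only supplied on the command line, the interpolation above has nothing useful to resolve if the option is left off. One way to guard against that, sketched here as an assumption rather than something the lab's config does, is to give the parameter a default in the params scope. The value "results" is hypothetical.

```groovy
params {
    // Hypothetical fallback; a --publish_dir value given on the command line overrides it
    publish_dir = "results"
}
```

Parameters set on the command line take precedence over values defined in nextflow.config, so the default only applies when the option is omitted.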

Using labels to re-use configurations

Let's investigate what the label "star" from our configuration file actually does. In the modules/star.nf file, if we look at the STAR_MAP process, we see the label directive on the second line.

process STAR_MAP {
    label "star"

    publishDir(path: "${publish_dir}/star", mode: "symlink")

    input:
        tuple val(id), path(r1), path(r2)
        path(star_index)

    output:
        tuple val(id), path("*.bam"), emit: ch_bam

    script:
        """
          STAR \
            --genomeDir ${star_index} \
            --readFilesIn ${r1} ${r2} \
            --runThreadN ${task.cpus} \
            --readFilesCommand zcat \
            --outFileNamePrefix ${id} \
            --outSAMtype BAM Unsorted
        """
}

Notice that the script uses the variable task.cpus even though we don't appear to have defined it anywhere in the process. The label "star" associates the cpus value from our nextflow.config file with every process that carries the "star" label. The "publish_dir" variable we define in our config is applied to ALL processes, but labels let us apply very specific settings to a subset of our processes with ease.
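The same mechanism extends to more than one group of processes. The fragment below is a sketch, not the lab's actual configuration: the "qc" label and the memory setting are hypothetical, and a process would only pick up the "qc" settings if it declared label "qc" in its definition.

```groovy
process {
    // Applies to every process that declares: label "star"
    withLabel: star {
        cpus = 2
    }

    // Hypothetical second label for lighter-weight steps
    withLabel: qc {
        cpus = 1
    }

    // Selectors can also target one process by name
    withName: STAR_INDEX {
        memory = '8 GB'   // hypothetical value
    }
}
```

When both kinds of selector match the same process, withName settings take priority over withLabel settings, which makes it possible to set a group-wide default and then override it for one process.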