Nextflow-orchestrated distributed data pipeline for processing and analyzing raw scRNA-seq files

mattfemia/scrnaseq-pipeline

scRNA-seq Pipeline

Nextflow pipeline for reproducible parallel analysis of scRNA-seq data.

Contents

  1. Introduction
  2. Data
  3. Analysis
  4. Pipeline-Environments
    1. Docker
    2. AWS Batch / Terraform
    3. Nextflow / Local
  5. Technology

Introduction

This project provides a standard, pre-built bioinformatics infrastructure for processing single-cell RNA sequencing (scRNA-seq) files generated primarily with 10X Genomics' Chromium controllers and library prep. The pipeline could be modified to skip these pre-processing steps and work directly with FASTQ files; please submit an issue ticket for assistance.

The overall goal is to promote reproducible research and, more generally, reproducible and streamlined bioinformatics workflows.

The pipeline uses Nextflow to orchestrate reproducible parallel analysis of scRNA-seq data across compute environments. It can be run as a dockerized solution, deployed through several batch-processing executors (e.g. AWS Batch, Slurm, GCP), or built and executed locally.

The analysis workflow involves:

  • Demultiplex Illumina-sequencer-generated BCL files
  • Generate FASTQ files using CellRanger
  • QC / MultiQC on FASTA & FASTQ files
  • Post-analysis of raw_feature_bc_matrices
  • Generate final figures/visuals

A sample post-analysis file can be found in src/python/analysis.py; however, the contents of the file can and should be replaced with your own analysis.

Data

The pipeline expects data to be in the data/ directory at the root of the repo. Alternatively, users can point to an AWS S3 bucket or similar blob/file storage by changing the corresponding properties in nextflow.config. For more information on editing this file, see the Nextflow configuration documentation.
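As a rough illustration of the two kinds of data location mentioned above, the following Python sketch classifies a location string as an S3 URI or a local path. The function name `describe_data_location` and the bucket name in the example are hypothetical and not part of this repo; the pipeline itself reads its locations from nextflow.config.

```python
from pathlib import Path
from urllib.parse import urlparse


def describe_data_location(location: str) -> dict:
    """Classify a data location as an S3 URI or a local path (illustrative only)."""
    parsed = urlparse(location)
    if parsed.scheme == "s3":
        # For s3://bucket/prefix, the bucket is the netloc and the prefix is the path.
        return {
            "kind": "s3",
            "bucket": parsed.netloc,
            "prefix": parsed.path.lstrip("/"),
        }
    # Anything without an s3:// scheme is treated as a local path in this sketch.
    return {"kind": "local", "path": str(Path(location))}


# "my-pipeline-bucket" is a made-up bucket name used only for this example.
print(describe_data_location("s3://my-pipeline-bucket/runs/sample-a"))
print(describe_data_location("data"))
```

A check like this could run before launching the pipeline to catch a mistyped location early, but it is not required by Nextflow.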

Analysis

The contents of python/analysis.py provide a basic example of scRNA-seq post-analysis. However, any analysis can replace the contents of this file and run accordingly.
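To give a sense of what a replacement analysis might start from, here is a minimal standard-library sketch that sums counts per barcode from a Matrix Market coordinate file, the text format used for feature-barcode matrices. It is not the code in analysis.py, which is not reproduced here. The function name `umi_totals_per_barcode` is hypothetical, and the sketch assumes rows are features and columns are barcodes.

```python
from collections import defaultdict
from typing import Iterable


def umi_totals_per_barcode(mtx_lines: Iterable[str]) -> dict:
    """Sum counts per column (barcode index) of a Matrix Market coordinate file."""
    totals = defaultdict(int)
    size_line_seen = False
    for line in mtx_lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip the banner and any comment lines
        if not size_line_seen:
            size_line_seen = True  # first data line is "rows cols nonzeros"
            continue
        # Each entry is "row col value" with 1-based indices.
        _feature_idx, barcode_idx, count = line.split()
        totals[int(barcode_idx)] += int(count)
    return dict(totals)


# Tiny made-up matrix: 3 features x 2 barcodes, 4 non-zero entries.
example = """%%MatrixMarket matrix coordinate integer general
3 2 4
1 1 5
2 1 1
3 2 7
1 2 2
""".splitlines()

print(umi_totals_per_barcode(example))  # {1: 6, 2: 9}
```

For a real gzipped matrix file, the lines could be read with `gzip.open(path, "rt")` and passed to the same function. A library such as Scanpy would normally handle this loading step.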

Pipeline-Environments

The pipeline can be run in several ways using the configurations in this repo:

Docker

The pipeline is containerized and can be run as-is. The following commands execute it on the sample data in the data/ directory:

Build:

    docker build -f docker/Dockerfile -t scrna-pipeline .

Run:

    docker run scrna-pipeline

Add your data files to the data/ directory.

A containerized image of the CellRanger pipeline can also be built and deployed locally or through a cloud integration like AWS ECS or AWS Batch.

The Docker image is also publicly hosted on Docker Hub and can be pulled:

Stable version:

    docker pull mattfemia/scrna-pipeline:0.0.1-dev

Latest:

    docker pull mattfemia/scrna-pipeline:0.0.1-dev

AWS Batch / Terraform

For AWS users looking to run the pipeline with AWS Batch, the required infrastructure can be set up automatically by configuring the Terraform files in the terraform/ directory.

AWS Infrastructure

When configured, the following resources will be created and managed in Terraform state:

  • S3 Bucket (Private Access)
  • Batch Compute Environment (with Fargate)
  • Batch Job Queue
  • VPC

Configuration

To configure:

  • Update the provider "aws" {...} block in terraform/main.tf to include your AWS credentials. More information can be found in the Terraform AWS provider documentation
  • Update the bucket name in nextflow.config for the profiles {'batch': ...} and profiles {'s3-data': ...} entries
  • (Optional) If state is managed with Terraform Cloud or with a VCS rather than locally, uncomment and edit the backend "remote" {...} block in terraform/main.tf
  • (Optional) Update the S3 bucket name by editing the resource "aws_s3_bucket" "pipeline_bucket" {...} block

Deploying Infrastructure

A basic walkthrough of Terraform can be found in the Terraform documentation. The following shell commands perform the setup:

  1. Initialize your project directory:

     terraform init

  2. Check the formatting of the files after editing:

     terraform fmt

  3. Deploy the changes (WARNING: You will be charged for AWS resources after deploying):

     terraform apply

  4. Finally, tear down the resources:

     terraform destroy

These commands are also available in terraform/tf.sh.
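The setup sequence can also be driven from Python, which may be convenient when it is part of a larger automation script. The sketch below is not a copy of terraform/tf.sh; the function name `run_terraform_steps` and its `dry_run` flag are hypothetical. It only lists the commands by default and leaves out `terraform destroy`, so teardown remains a deliberate manual step.

```python
import subprocess
from typing import List

# The three setup commands from the walkthrough, in order.
TERRAFORM_STEPS: List[List[str]] = [
    ["terraform", "init"],
    ["terraform", "fmt"],
    ["terraform", "apply"],
]


def run_terraform_steps(workdir: str = "terraform", dry_run: bool = True) -> List[str]:
    """Run (or, by default, only list) the setup commands in order."""
    executed = []
    for argv in TERRAFORM_STEPS:
        executed.append(" ".join(argv))
        if not dry_run:
            # check=True raises CalledProcessError at the first failing command.
            subprocess.run(argv, cwd=workdir, check=True)
    return executed


# Dry run: nothing is executed and no AWS resources are created.
print(run_terraform_steps())  # ['terraform init', 'terraform fmt', 'terraform apply']
```

Calling `run_terraform_steps(dry_run=False)` would execute the commands in the terraform/ directory and, as with the shell commands, would incur AWS charges once `terraform apply` is confirmed.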

Nextflow / Local Pipeline

Requirements

The following steps can be used to run the pipeline locally using Nextflow:

  1. If you don't already have it, install Docker on your computer (see the Docker installation documentation).

  2. Install Nextflow (version 20.07.x or higher):

    curl -s https://get.nextflow.io | bash

  3. (Optional) If Salmon, FastQC, and MultiQC are not installed, you can add them to an existing conda environment by replacing <conda-env> with the environment name and running:

     conda env update --name <conda-env> --file conda.yml --prune
    
  4. Launch the pipeline execution:

     ./nextflow run mattfemia/scrna-pipeline -with-docker
    
  5. When the execution completes, open the report generated at the following path in your browser:

     results/multiqc_report.html 
    

You can see an example report at the following link.
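As a small convenience, the report check from the last step can be scripted. The sketch below assumes the default results/multiqc_report.html path given above; the function name `open_report` and its `launch` flag are hypothetical.

```python
import webbrowser
from pathlib import Path


def open_report(report_path: str = "results/multiqc_report.html", launch: bool = True) -> Path:
    """Return the absolute report path, opening it in a browser if requested."""
    report = Path(report_path).resolve()
    if not report.is_file():
        raise FileNotFoundError(f"No report at {report}; did the run finish?")
    if launch:
        # as_uri() produces a file:// URL that browsers can open.
        webbrowser.open(report.as_uri())
    return report
```

Calling `open_report()` from the repo root after a completed run would open the report; `open_report(launch=False)` only verifies that the file exists.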

Technology

  • Nextflow
  • FastQC
  • MultiQC
  • Salmon
  • CellRanger
  • Python
    • Scanpy
    • unittest
  • Terraform
  • AWS
    • Batch
    • S3
    • VPC
    • Fargate
  • GNU make
  • Conda
  • Travis CI
