spark-genome-alignment-demo

An example of bioinformatics and big data tools working together.

Copy and paste the relevant section below (currently Mac OS X only) to see how the Bowtie aligner can be integrated into an interactive Spark program for bioinformatics work in a big data environment.

Specifically, the steps below do the following:

Build and install prerequisites

Java 1.6+
Apache Maven
Perl JSON module (sudo cpan JSON)
package manager (as needed)
Apache Spark
Scala
Bowtie
Big Data Genomics ADAM

Index the E. coli genome (NC_008253) that ships with Bowtie
Generate a set of positive-control FASTQ reads from NC_008253
Launch spark-shell, the interactive interface to Spark
Align the control reads with Bowtie from spark-shell
Write the aligned reads out in SAM format
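
Before starting, it can help to see which of the prerequisites above are already installed. The following is a minimal sketch, not part of the demo repository: the helper name missing_tools is hypothetical, and the command names passed to it are the ones these tools usually install under, which is an assumption about your setup.

```shell
#!/bin/sh
# missing_tools (hypothetical helper): print the name of every command
# given as an argument that cannot be found on the PATH.
# Prints nothing when all of them are present.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Assumed command names for the prerequisites listed above.
missing_tools java mvn perl brew spark-shell scala bowtie

# The Perl JSON module is a library rather than a command, so try loading it.
perl -MJSON -e 1 2>/dev/null || echo "perl JSON module"
```

Anything printed still needs to be installed; the Mac OS X section below covers Spark, Scala, and Bowtie through Homebrew.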

Set up the environment

Mac OS X

If you haven't already, install Homebrew:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Now we're ready to get to work:

brew install apache-spark
brew install scala
git clone https://github.com/allenday/spark-genome-alignment-demo.git
cd spark-genome-alignment-demo
#we'll assume that wherever you are now is where you want to work
export DEMO=`pwd`
mkdir -p build/data
cd $DEMO/build

#save time on mac, just use the pre-built bowtie from homebrew
brew install homebrew/science/bowtie
bowtie-build $DEMO/data/NC_008253.fna $DEMO/build/data/NC_008253
cat $DEMO/data/NC_008253.fna | sort | tail -50 | perl -ne 'chomp;$q=$_;$q=~s/./B/g;printf qq(\@read%i\n%s\n+\n%s\n), ($., $_, $q)' > $DEMO/build/data/reads.fq

#or do it from source...
#git clone https://github.com/BenLangmead/bowtie.git
#cd $DEMO/build/bowtie
#make
#./bowtie-build genomes/NC_008253.fna $DEMO/build/data/NC_008253
#cat $DEMO/data/NC_008253.fna | sort | tail -50 | perl -ne 'chomp;$q=$_;$q=~s/./B/g;printf qq(\@read%i\n%s\n+\n%s\n), ($., $_, $q)' > $DEMO/build/data/reads.fq

#verify bowtie functions as expected
cat $DEMO/build/data/reads.fq | bowtie $DEMO/build/data/NC_008253 - | md5sum
#should yield ecd5e41dea9692446fa4ae4170d6a1e1 (if md5sum is missing, OS X's built-in md5 computes the same hash)
cd $DEMO/build
git clone https://github.com/bigdatagenomics/adam.git
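#adjust the version to match the apache-spark that Homebrew installed (see: brew info apache-spark)
export SPARK_HOME=/usr/local/Cellar/apache-spark/1.4.1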
cd $DEMO/build/adam
mvn package install
export ADAM_HOME=`pwd`
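
The perl one-liner used above to create reads.fq turns each of the last 50 sorted sequence lines into a four-line FASTQ record: a @readN header, the sequence itself, a + separator, and a placeholder quality string with one B per base. The same transformation can be sketched in awk to make the record layout easier to see. The helper name lines_to_fastq is hypothetical and is not part of the demo repository.

```shell
#!/bin/sh
# lines_to_fastq (hypothetical helper): read sequence lines on stdin and
# write one FASTQ record per line on stdout.
lines_to_fastq() {
  awk '{
    qual = $0
    gsub(/./, "B", qual)   # one placeholder quality character per base
    printf "@read%d\n%s\n+\n%s\n", NR, $0, qual
  }'
}

# Two toy sequences instead of real genome lines.
printf 'ACGT\nTTGACA\n' | lines_to_fastq
```

For the two toy sequences this prints the records @read1 (ACGT, quality BBBB) and @read2 (TTGACA, quality BBBBBB). Because every read is copied verbatim from the reference, each one is a positive control that the aligner should be able to place.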

Run the demo

cat $DEMO/bin/bowtie_pipe_single.scala | $ADAM_HOME/bin/adam-shell
#reset the terminal, which the interactive shell can leave in a garbled state
reset
cat $DEMO/build/data/reads.sam | md5sum
#should yield 6eebbde8d7818136e9ab924d57af8005

#examine the outputs
head $DEMO/build/data/reads.sam
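
Since the control reads were cut directly from NC_008253, every one of them should align. A rough way to check is to tally mapped and unmapped records. This sketch assumes reads.sam follows the standard SAM layout, where header lines start with @ and bit 0x4 of the FLAG in column 2 marks an unmapped read. The helper name sam_summary is hypothetical.

```shell
#!/bin/sh
# sam_summary (hypothetical helper): count mapped and unmapped records in
# SAM input, read from the files given as arguments or from stdin.
sam_summary() {
  awk -F '\t' '
    /^@/ { next }                              # skip header lines
    int($2 / 4) % 2 == 1 { unmapped++; next }  # FLAG bit 0x4 set: unmapped
    { mapped++ }
    END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
  ' "$@"
}

# Toy input: one header line, one mapped read, one unmapped read.
printf '@HD\tVN:1.0\nread1\t0\tref\t1\t255\t4M\t*\t0\t0\tACGT\tBBBB\nread2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tBBBB\n' | sam_summary
```

On the demo output this would be run as sam_summary $DEMO/build/data/reads.sam, and an unmapped count above zero would be worth investigating.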
