An example of bioinformatics and big data tools playing nicely together

allenday/spark-genome-alignment-demo

spark-genome-alignment-demo

An example of bioinformatics and big data tools playing nicely together.

Copy and paste the relevant section below (currently Mac OS X only) to see how the Bowtie aligner can be integrated into an interactive Spark program for bioinformatics work in a big data environment.

Specifically, the steps below do the following:

  1. Build and install prerequisites
  2. Index the E. coli genome (NC_008253) that ships with Bowtie
  3. Generate a set of positive-control FastQ reads from NC_008253
  4. Launch spark-shell, the interactive interface to Spark
  5. Align the control reads with Bowtie from spark-shell
  6. Write the aligned reads out in SAM format

Set up the environment

Mac OS X

If you haven't already, install Homebrew:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Now we're ready to get to work:

brew install apache-spark
brew install scala
git clone https://github.com/allenday/spark-genome-alignment-demo.git
cd spark-genome-alignment-demo
#we'll assume that wherever you are now is where you want to work
export DEMO=`pwd`
mkdir -p build/data
cd $DEMO/build

#save time on mac, just use the pre-built bowtie from homebrew
brew install homebrew/science/bowtie
bowtie-build $DEMO/data/NC_008253.fna $DEMO/build/data/NC_008253
cat $DEMO/data/NC_008253.fna | sort | tail -50 | perl -ne 'chomp;$q=$_;$q=~s/./B/g;printf qq(\@read%i\n%s\n+\n%s\n), ($., $_, $q)' > $DEMO/build/data/reads.fq

#or do it from source...
#git clone https://github.com/BenLangmead/bowtie.git
#cd $DEMO/build/bowtie
#make
#./bowtie-build genomes/NC_008253.fna $DEMO/build/data/NC_008253
#cat $DEMO/data/NC_008253.fna | sort | tail -50 | perl -ne 'chomp;$q=$_;$q=~s/./B/g;printf qq(\@read%i\n%s\n+\n%s\n), ($., $_, $q)' > $DEMO/build/data/reads.fq

#verify bowtie functions as expected
cat $DEMO/build/data/reads.fq | bowtie $DEMO/build/data/NC_008253 - | md5sum
#should yield ecd5e41dea9692446fa4ae4170d6a1e1
cd $DEMO/build
git clone https://github.com/bigdatagenomics/adam.git
export SPARK_HOME=/usr/local/Cellar/apache-spark/1.4.1
cd $DEMO/build/adam
mvn package install
export ADAM_HOME=`pwd`
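
The Perl one-liner above that writes reads.fq turns each sequence line into a four-line FastQ record: a header, the sequence, a `+` separator, and a quality string with one `B` per base. The sketch below reproduces that transformation with awk on two made-up sequence lines so the record layout is easy to inspect. The function name `lines_to_fastq` and the sample sequences are illustrative assumptions, not part of the demo.

```shell
#!/bin/sh
# lines_to_fastq (hypothetical helper): read one sequence per line on stdin
# and print a 4-line FastQ record for each, numbered by input line.
lines_to_fastq() {
  awk '{
    qual = $0
    gsub(/./, "B", qual)   # one "B" per base, like s/./B/g in the Perl version
    printf "@read%d\n%s\n+\n%s\n", NR, $0, qual
  }'
}

# Two made-up sequence lines stand in for lines of NC_008253.fna.
printf 'ACGT\nGGCATT\n' | lines_to_fastq
```

Because every read is an exact substring of the reference, Bowtie should align all of them, which is what makes them usable as positive controls.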

Run the demo

cat $DEMO/bin/bowtie_pipe_single.scala | $ADAM_HOME/bin/adam-shell
reset
cat $DEMO/build/data/reads.sam | md5sum
#should yield 6eebbde8d7818136e9ab924d57af8005

#examine the outputs
head $DEMO/build/data/reads.sam
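
The checksum comparisons above are done by eye and rely on `md5sum`, which may not be present on every Mac. As a minimal sketch, the helper below falls back to the BSD `md5 -q` command when `md5sum` is missing and compares the digest against an expected value. The function names `md5_of` and `check_md5` are hypothetical, and the demonstration runs on a throwaway file rather than the demo outputs.

```shell
#!/bin/sh
# md5_of FILE: print the MD5 hex digest of FILE using whichever tool exists.
md5_of() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum < "$1" | awk '{print $1}'
  else
    md5 -q "$1"   # BSD/macOS md5; -q prints only the digest
  fi
}

# check_md5 FILE EXPECTED: succeed only if the digest of FILE equals EXPECTED.
check_md5() {
  actual=$(md5_of "$1")
  if [ "$actual" = "$2" ]; then
    echo "OK   $1"
  else
    echo "FAIL $1: got $actual, expected $2" >&2
    return 1
  fi
}

# Self-contained demonstration on a temporary file containing "hello".
tmp=$(mktemp)
printf 'hello' > "$tmp"
check_md5 "$tmp" 5d41402abc4b2a76b9719d911017c592
rm -f "$tmp"
```

For the demo output, the equivalent call would be `check_md5 $DEMO/build/data/reads.sam 6eebbde8d7818136e9ab924d57af8005`.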

Further reading

Have a look at bowtie_pipe_single.scala if you're curious about how the integration is done. The script uses the pipe() function of Spark's Resilient Distributed Dataset (RDD) to launch Bowtie as a subprocess, feed it the reads on standard input, and collect the alignments from its standard output.

Holden Karau gives another example of using the RDD pipe() function in the book Learning Spark.
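
To make the pipe() behaviour concrete without starting Spark, here is a shell analogy, not the code from bowtie_pipe_single.scala. Spark's pipe() starts the external command once per partition, writes each element of the partition to the command's standard input as one line, and turns each line of standard output into an element of the resulting RDD. In the sketch, `tool` is a hypothetical stand-in for Bowtie that simply upper-cases its input, and each argument to the hypothetical `pipe_partitions` function plays the role of one partition.

```shell
#!/bin/sh
# tool: hypothetical stand-in for the external program (Bowtie in the demo).
tool() { tr 'a-z' 'A-Z'; }

# pipe_partitions PART...: each argument is one partition whose elements are
# separated by newlines. One subprocess is started per partition, mirroring
# how pipe() launches the command once for each partition.
pipe_partitions() {
  for part in "$@"; do
    printf '%s\n' "$part" | tool
  done
}

# Two partitions: the first holds two elements, the second holds one.
pipe_partitions 'acgt
ggca' 'ttag'
```

The concatenated output lines correspond to the elements of the output RDD, which the demo then writes out as SAM.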
