Autoscaling Example Using AWS

John Vivian edited this page Apr 21, 2018 · 1 revision

Instructions

Dependencies

  • Add your SSH key to the SSH agent if not already added
    • ssh-add
  • Install Toil and dependencies if not already installed
    • sudo apt-get install build-essential python-dev libssl-dev libffi-dev
    • pip install toil[all]

Launch Cluster

toil launch-cluster aws-example --keyPairName yourkeypair@gmail.com --leaderNodeType t2.medium --zone us-west-2a

SSH to Leader

toil ssh-cluster --zone us-west-2a aws-example

Install toil-rnaseq

pip install toil-rnaseq

Create config / manifest template

toil-rnaseq generate

Fill in config and manifest

Run workflow

toil-rnaseq run --retryCount=2 --nodeTypes=c3.8xlarge --maxNodes=1 --batchSystem=mesos --provisioner=aws aws:us-west-2:aws-example-jobstore
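
The run command above ends with a job store locator of the form aws:<region>:<name>. Note that the cluster was launched with a zone (us-west-2a) while the locator uses a region (us-west-2). The following is a minimal sketch, not part of toil-rnaseq, of how the command can be assembled from the zone used at launch. The function names (region_from_zone, aws_job_store, run_command) are hypothetical helpers introduced here for illustration, and the defaults simply mirror the values in the command above.

```python
def region_from_zone(zone):
    """Drop the trailing availability-zone letter: 'us-west-2a' -> 'us-west-2'."""
    if len(zone) < 2 or not zone[-1].isalpha() or not zone[-2].isdigit():
        raise ValueError("not an availability zone name: %r" % zone)
    return zone[:-1]


def aws_job_store(zone, name):
    """Build a job store locator of the form aws:<region>:<name>."""
    return "aws:%s:%s" % (region_from_zone(zone), name)


def run_command(zone, job_store_name, node_type="c3.8xlarge",
                max_nodes=1, retry_count=2):
    """Return the toil-rnaseq run command as an argument list."""
    return [
        "toil-rnaseq", "run",
        "--retryCount=%d" % retry_count,
        "--nodeTypes=%s" % node_type,
        "--maxNodes=%d" % max_nodes,
        "--batchSystem=mesos",
        "--provisioner=aws",
        aws_job_store(zone, job_store_name),
    ]


if __name__ == "__main__":
    # Print the command so it can be pasted into the leader's shell.
    print(" ".join(run_command("us-west-2a", "aws-example-jobstore")))
```

Returning a list rather than a string keeps the arguments separate, so the same value could be handed to subprocess.run without shell quoting concerns.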

Monitor Run via Mesos Interface (after opening TCP port 5050 in the leader's AWS security group)

<Leader IP>:5050
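
Opening port 5050 can be done in the AWS console. As one possible alternative, the sketch below uses boto3's EC2 client and its authorize_security_group_ingress call. This is an assumption about how you might script the step, not something the workflow requires. The helper names (mesos_ingress_rule, open_mesos_port), the security group ID, and the IP address in the usage note are hypothetical placeholders. The rule is restricted to a single address (/32) so the Mesos interface is not exposed publicly.

```python
def mesos_ingress_rule(my_ip, port=5050):
    """Build keyword arguments for EC2's authorize_security_group_ingress call."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "CidrIp": "%s/32" % my_ip,  # /32 limits access to a single address
    }


def open_mesos_port(group_id, my_ip, region="us-west-2"):
    """Apply the rule to a security group. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so mesos_ingress_rule works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    ec2.authorize_security_group_ingress(
        GroupId=group_id, **mesos_ingress_rule(my_ip)
    )
```

A call would look like open_mesos_port("sg-0123456789abcdef0", "203.0.113.7"), where both values are placeholders for the leader's security group and your own public IP.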

Manifest Example

##############################################################################################################
#                                    TOIL RNA-SEQ WORKFLOW MANIFEST FILE                                     #
##############################################################################################################
#   Edit this manifest to include information pertaining to each sample to be run.
#   There are 4 tab-separated columns: filetype, paired/unpaired, UUID, URL(s) to sample
#
#   filetype    Filetype of the sample. Options: "tar", "fq", or "bam" for tarball, fastq/fastq.gz, or BAM
#   paired      Indicates whether the data is paired or single-ended. Options:  "paired" or "single"
#   UUID        This should be a unique identifier for the sample to be processed
#   URL         A URL starting with {scheme} that points to the sample
#
#   If sample is being submitted as a fastq or several fastqs, provide URLs separated by a comma.
#   If providing paired fastqs, alternate the fastqs so every R1 is paired with its R2 as the next URL.
#   Samples must have the same extension - do not mix and match gzip and non-gzipped sample pairs.
#
#   Samples consisting of tarballs with fastq files inside must follow the file name convention of
#   ending in an R1/R2 or _1/_2 followed by one of the 4 extensions: .fastq.gz, .fastq, .fq.gz, .fq
#
#   BAMs are accepted, but must have been aligned from paired reads NOT single-end reads.
#
#   GDC URLs may only point to individual BAM files. No other format is accepted.
#
#   Examples of several combinations are provided below. Lines beginning with # are ignored.
#
#   tar paired  UUID_1  file:///path/to/sample.tar
#   fq  paired  UUID_2  file:///path/to/R1.fq.gz,file:///path/to/R2.fq.gz
#   tar single  UUID_3  http://sample-depot.com/single-end-sample.tar
#   tar paired  UUID_4  s3://my-bucket-name/directory/paired-sample.tar.gz
#   fq  single  UUID_5  s3://my-bucket-name/directory/single-end-file.fq
#   bam paired  UUID_6  gdc://1a5f5e03-4219-4704-8aaf-f132f23f26c7
#
#   Place your samples below, one per line.
fq  paired  EXAMPLE s3://example-aws-bucket/Read1.fq,s3://s3-example-bucket/Read2.fq
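
The rules in the manifest header can be checked before launching a run. The following is a minimal Python 3 sketch under stated assumptions: it is a standalone check, not toil-rnaseq's own parser, and parse_manifest_line is a hypothetical name. It expects real tab characters between the four columns, as the header describes, and enforces only the rules stated there (allowed filetypes, paired or single, an even number of URLs for paired fastqs, no mixing of gzipped and non-gzipped fastqs, BAMs from paired reads only).

```python
from urllib.parse import urlparse

VALID_FILETYPES = {"tar", "fq", "bam"}
VALID_PAIRING = {"paired", "single"}


def parse_manifest_line(line):
    """Return (filetype, pairing, uuid, urls), or None for comments and blank lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None

    fields = stripped.split("\t")
    if len(fields) != 4:
        raise ValueError("expected 4 tab-separated columns, got %d" % len(fields))
    filetype, pairing, uuid, url_field = fields

    if filetype not in VALID_FILETYPES:
        raise ValueError("unknown filetype: %r" % filetype)
    if pairing not in VALID_PAIRING:
        raise ValueError("expected 'paired' or 'single', got %r" % pairing)

    urls = [u.strip() for u in url_field.split(",")]
    for url in urls:
        if not urlparse(url).scheme:
            raise ValueError("URL has no scheme: %r" % url)

    if filetype == "fq":
        # Paired fastqs alternate R1,R2,R1,R2,... so the count must be even.
        if pairing == "paired" and len(urls) % 2 != 0:
            raise ValueError("paired fastq samples need an even number of URLs")
        # All fastqs of one sample must share gzip status.
        if len({u.endswith(".gz") for u in urls}) > 1:
            raise ValueError("do not mix gzipped and non-gzipped fastq files")

    if filetype == "bam" and pairing != "paired":
        raise ValueError("BAM input must have been aligned from paired reads")

    return filetype, pairing, uuid, urls
```

Running each line of the manifest through this function before starting the cluster would catch formatting mistakes locally, which is cheaper than discovering them after nodes have been provisioned.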

Config Example

##############################################################################################################
#                               TOIL RNA-SEQ WORKFLOW CONFIGURATION FILE                                     #
##############################################################################################################

# This configuration file is formatted in YAML. Simply write the value (at least one space) after the colon.
# Edit the values in this configuration file and then rerun the pipeline: "toil-rnaseq run"
# Just Kallisto or STAR/RSEM can be run by supplying only the inputs to those tools
#
# URLs can take the form: http://, ftp://, file://, s3://, gdc://
# Local inputs follow the URL convention: file:///full/path/to/input
# S3 URLs follow the convention: s3://bucket/directory/file.txt
#
# Comments (beginning with #) do not need to be removed. Optional parameters left blank are treated as false.

##############################################################################################################
#                                           REQUIRED OPTIONS                                                 #
##############################################################################################################

# Required: Output location of sample. Can be full path to a directory or an s3:// URL
# WARNING: S3 buckets must exist prior to upload, or it will fail.
output-dir: s3://s3-example-bucket/

##############################################################################################################
#                            WORKFLOW INPUTS (Alignment and Quantification)                                  #
##############################################################################################################

# URL {scheme} to index tarball used by STAR
star-index: http://hgwdev.soe.ucsc.edu/~jtvivian/toil-rnaseq-inputs/starIndex_hg38_no_alt.tar.gz

# URL {scheme} to reference tarball used by RSEM
# Running RSEM requires a star-index as well as an rsem-ref
rsem-ref: http://hgwdev.soe.ucsc.edu/~jtvivian/toil-rnaseq-inputs/rsem_ref_hg38_no_alt.tar.gz

# URL {scheme} to kallisto index file. 
kallisto-index: http://hgwdev.soe.ucsc.edu/~jtvivian/toil-rnaseq-inputs/kallisto_hg38.idx

# URL {scheme} to hera index
hera-index: http://hgwdev.soe.ucsc.edu/~jtvivian/toil-rnaseq-inputs/hera-index.tar.gz

# Maximum file size of input sample (for resource allocation during initial download)
max-sample-size: 20G

##############################################################################################################
#                                   WORKFLOW OPTIONS (Quality Control)                                       #
##############################################################################################################

# If true, will preprocess samples with cutadapt using adapter sequences.
cutadapt: true

# Adapter sequence to trim when running CutAdapt. Defaults set for Illumina
fwd-3pr-adapter: AGATCGGAAGAG

# Adapter sequence to trim (for reverse strand) when running CutAdapt. Defaults set for Illumina
rev-3pr-adapter: AGATCGGAAGAG

# If true, will run FastQC and include QC in sample output
fastqc: true 

##############################################################################################################
#                   CREDENTIAL OPTIONS (for downloading samples from secure locations)                       #
##############################################################################################################        

# Optional: Provide a full path to a 32-byte key used for SSE-C Encryption in Amazon
ssec: 

# Optional: Provide a full path to the token.txt used to download from the GDC
gdc-token: 

##############################################################################################################
#                                   ADDITIONAL FILE OUTPUT OPTIONS                                           #
##############################################################################################################        

# Optional: If true, saves the wiggle file (.bg extension) output by STAR
# WARNING: Requires STAR sorting, which has memory leak issues that can crash the workflow. 
wiggle: 

# Optional: If true, saves the aligned BAM (by coordinate) produced by STAR
# You must also specify an ssec key if you want to upload to an S3 output-dir
# as read data is assumed to be controlled access
save-bam: 

##############################################################################################################
#                                           DEVELOPER OPTIONS                                                #
##############################################################################################################        

# Optional: If true, uses resource requirements appropriate for continuous integration
ci-test:
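
Several constraints in the configuration comments can be checked mechanically before a run. The sketch below is a minimal Python 3 example under stated assumptions: check_config is a hypothetical helper, not part of toil-rnaseq, and it takes the configuration as an already-loaded dictionary (loading the YAML file, for example with a YAML library, is left out). It checks only what the comments above state: output-dir is required and is a full path or s3:// URL, input URLs use one of the listed schemes, RSEM needs a star-index, and saving a BAM to S3 needs an ssec key.

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "ftp", "file", "s3", "gdc"}
INDEX_KEYS = ("star-index", "rsem-ref", "kallisto-index", "hera-index")


def check_config(config):
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    output_dir = config.get("output-dir")
    if not output_dir:
        problems.append("output-dir is required")
    elif not (output_dir.startswith("s3://") or output_dir.startswith("/")):
        problems.append("output-dir must be a full path or an s3:// URL")

    # Index inputs are optional, but any that are set need a supported scheme.
    for key in INDEX_KEYS:
        value = config.get(key)
        if value and urlparse(value).scheme not in ALLOWED_SCHEMES:
            problems.append("%s has an unsupported URL scheme: %s" % (key, value))

    if config.get("rsem-ref") and not config.get("star-index"):
        problems.append("rsem-ref requires star-index")

    saving_to_s3 = str(output_dir or "").startswith("s3://")
    if config.get("save-bam") and saving_to_s3 and not config.get("ssec"):
        problems.append("save-bam with an S3 output-dir requires an ssec key")

    return problems
```

Collecting every problem in a list, rather than raising on the first one, lets all configuration mistakes be fixed in a single pass.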