TOPMed RNA-seq pipeline

The TOPMed RNA-Seq pipeline was converted to CWL to fulfill a deliverable: making a CWL pipeline available through a public Tool Registry Service. The workflow is available through Dockstore.org.

Workflow description

This document describes team Helium's implementation of the TOPMed RNA-seq pipeline as described in commit b65c22b. The CWL workflow is registered publicly on Dockstore here. This CWL workflow has 4 components described below.

A checker workflow registered on Dockstore is also available to verify operation of this pipeline. See information here.

The scripts and settings used for the TOPMed MESA RNA-seq pilot match commit 725a2bc, packaged here.

Intended Audience

The intended audience is any scientist familiar with RNA-seq analysis who wishes to run it on the TOPMed public access data.

Quick Start

Run the pipeline locally with small test input files. Creating these sample input files is described here.

Dockstore CLI, CWLTool, Git, Git LFS and Docker should be installed.
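A quick way to confirm these prerequisites is to look each one up on the PATH. The sketch below assumes the usual executable names (dockstore, cwltool, git, git-lfs, docker). Those names are not stated in this README, so adjust them if your installation differs.

```shell
# Print one "missing: NAME" line for every command that is not on the PATH.
# Returns non-zero if at least one command is missing.
check_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}

# Executable names below are assumptions, not taken from this README.
check_tools dockstore cwltool git git-lfs docker ||
    echo "Install the tools listed above before continuing."
```

No output means every listed command was found.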

Clone this GitHub repository:

git clone https://github.com/heliumdatacommons/cwl_workflows.git

Decompress sample files.

./topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/download_examples.sh

Use this input file or edit the file paths based on your local machine paths.

Run the workflow with CWLTool.

cwltool topmed-workflows/TOPMed_RNAseq_pipeline/rnaseq_pipeline_fastq.cwl \
topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/Dockstore.json
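If the input file was edited by hand, it may help to confirm that every file it names exists before starting a long run. The following is a rough sketch, not part of the pipeline: it assumes the job file lists inputs with "path" keys (CWL also allows "location", which this check ignores) and that relative paths are resolved against the job file's own directory. It uses grep and sed rather than a JSON parser, so treat it as a convenience check only. The function name missing_inputs is hypothetical.

```shell
# List "path" entries in a CWL job file that do not exist on disk.
# Relative paths are checked relative to the directory holding the job file.
missing_inputs() {
    job_file=$1
    job_dir=$(dirname "$job_file")
    grep -o '"path"[[:space:]]*:[[:space:]]*"[^"]*"' "$job_file" |
        sed 's/^"path"[[:space:]]*:[[:space:]]*"\(.*\)"$/\1/' |
        while IFS= read -r p; do
            case $p in
                /*) full=$p ;;
                *)  full=$job_dir/$p ;;
            esac
            [ -e "$full" ] || echo "missing: $p"
        done
}

# Check the example job file if it is present in the current checkout.
job=topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/Dockstore.json
if [ -f "$job" ]; then
    missing_inputs "$job"
fi
```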

Checker Workflow

A checker workflow for the TOPMed RNA-seq pipeline is published on Dockstore here. It is described in more detail in this README.md

Sample data sets

The sample data sets intended to be used as input are available through this BioProject.

Direct DataSets link.

Creating downsampled datasets for testing is described here.

Pipeline components

OUTPUTS describes the files generated by the TOPMed RNA-Seq pipeline, for each sample.

Alignment: STAR 2.5.3a
- STAR CWL File
- Python script run by the CWL file in the Docker container: run_STAR.py
- INPUT: STAR Index and sample FASTQs. See example input file.
  - See here to create STAR Index
- OUTPUT: Aligned RNA-seq reads in BAM format.
Post-processing: Picard 2.9.0 MarkDuplicates
- Picard MarkDuplicates CWL File
- Python script run by the CWL file in the Docker container: run_MarkDuplicates.py
- INPUT: Aligned BAM file from STAR. See example input
- OUTPUT: Marked duplicates BAM file.
  - A BAM index file must then be created with Samtools index (CWL File, example input).
Gene quantification and quality control: RNA-SeQC 1.1.9
- RNA-SeQC CWL File
- Python script run by the CWL file in the Docker container: run_rnaseqc.py
- INPUT: Genome FASTA, GTF file, Aligned BAM file from STAR. See example input
- OUTPUT:
  - Gene-level expression quantifications based on a collapsed version of a reference transcript annotation, provided as read counts and TPM.
  - Standard quality control metrics derived from the aligned reads.
Transcript quantification: RSEM 1.3.0
- RSEM CWL File
- Python script run by the CWL file in the Docker container: run_RSEM.py
- INPUT: RSEM reference files, BAM with reads aligned to transcriptome from STAR. See example input
  - See here to create RSEM reference directory.
- OUTPUT: Transcript-level expression quantifications, provided as TPM, expected read counts, and isoform percentages.
Utilities: SAMtools 1.6 and HTSlib 1.6
- Samtools index is used to create .bai files for input .bam files. CWL File, example input
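Outside of CWL, the indexing step can be sketched with the same docker run pattern this README uses for samtools faidx. The helper name index_bam and the BAM file name in the example call are hypothetical. samtools index writes the .bai file next to the BAM it is given.

```shell
# Index a BAM file with the samtools bundled in the pipeline image.
# Arguments: host directory to mount, BAM file name inside that directory.
index_bam() {
    host_dir=$1
    bam_name=$2
    docker run --rm -v "$host_dir":/input_files heliumdatacommons/topmed-rnaseq \
        samtools index "/input_files/$bam_name"
}

# Example call (requires Docker; the file name is hypothetical):
# index_bam ~/input_files sample.md.bam
```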

Alternative Approaches

Many other software packages are available that provide functionality similar to this pipeline. For detailed information on RNA-seq analysis steps and other software options, please see A survey of best practices for RNA-seq data analysis.

Docker Image

Currently, the GTEx pipeline Docker container is republished on Docker Hub.

Original: Dockerfile
Local: Dockerfile
Docker Hub Link

Obtaining the Docker image

Docker should be installed. See here if not.

Pull the image from Docker Hub:

docker pull heliumdatacommons/topmed-rnaseq:latest
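After pulling, one way to confirm the image works is to ask a bundled tool for its version and compare it with the versions listed under Pipeline components. This sketch assumes the tools are on the image's PATH, as the docker run commands elsewhere in this README suggest, and that each tool accepts --version (samtools and STAR do). The helper name image_tool_version is hypothetical.

```shell
# Run "TOOL --version" inside the given image and print whatever it reports.
image_tool_version() {
    image=$1
    tool=$2
    docker run --rm "$image" "$tool" --version
}

# Example calls (require Docker and the pulled image):
# image_tool_version heliumdatacommons/topmed-rnaseq:latest samtools
# image_tool_version heliumdatacommons/topmed-rnaseq:latest STAR
```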

Create required inputs

The following steps assume:

You have downloaded the following files:

$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_26/gencode.v26.annotation.gtf.gz
$ gunzip gencode.v26.annotation.gtf.gz

$ wget https://personal.broadinstitute.org/francois/topmed/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz
$ tar -xzf Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.tar.gz

You have obtained the Docker container described here

Create .fai file

Create the index file using samtools faidx.

~/input_files contains the Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta file.

docker run --rm -v ~/input_files:/input_files heliumdatacommons/topmed-rnaseq \
    samtools faidx /input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta
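samtools faidx writes the index next to the FASTA with a .fai suffix. The index is a tab-separated text file whose first two columns are the sequence name and its length, so it can be inspected directly. The helper below is an illustrative sketch (fai_lengths is a hypothetical name) that prints those two columns.

```shell
# Print each sequence name and its length from a samtools .fai index.
# Columns in a .fai file: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH.
fai_lengths() {
    awk -F '\t' '{ print $1 "\t" $2 }' "$1"
}

# Inspect the index created above if it exists on this machine.
fai=~/input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta.fai
if [ -f "$fai" ]; then
    fai_lengths "$fai"
fi
```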

Create .dict file

Create the dictionary file using Picard CreateSequenceDictionary.

~/input_files contains the Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta file.

docker run --rm -v ~/input_files:/input_files heliumdatacommons/topmed-rnaseq \
    java -jar /opt/picard-tools/picard.jar CreateSequenceDictionary \
    R=/input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta \
    O=/input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.dict
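The .dict file is a SAM-style header with one @SQ record per sequence, so it should describe the same number of sequences as the .fai index. A minimal consistency check, assuming both files were built from the same FASTA (compare_dict_fai is a hypothetical helper name):

```shell
# Compare the number of @SQ records in a .dict file with the number of
# entries in a .fai file. Prints "ok: N sequences" or a mismatch message.
compare_dict_fai() {
    dict_file=$1
    fai_file=$2
    n_dict=$(grep -c '^@SQ' "$dict_file")
    n_fai=$(awk 'END { print NR }' "$fai_file")
    if [ "$n_dict" -eq "$n_fai" ]; then
        echo "ok: $n_dict sequences"
    else
        echo "mismatch: $n_dict in .dict, $n_fai in .fai"
        return 1
    fi
}

# Example call:
# compare_dict_fai ~/input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.dict \
#     ~/input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta.fai
```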

Create STAR Index

Create .fai and .dict files for the Genome FASTA (both described above).
The GTF file, Genome FASTA file, .fai and .dict should all be in the same directory. Use this directory as a volume mount when running Docker. We used input_files below.

Run the following command:

docker run --rm -v ~/input_files:/input_files heliumdatacommons/topmed-rnaseq \
    STAR --runMode genomeGenerate \
    --genomeDir /input_files/star_index \
    --genomeFastaFiles /input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta \
    --sjdbGTFfile /input_files/gencode.v26.annotation.gtf \
    --sjdbOverhang 100 --runThreadN 10

Upon completion, your STAR Index will be in the ~/input_files/star_index directory.
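To confirm that index generation finished, the genome directory can be checked for a few core files. The file list here (Genome, SA, SAindex, genomeParameters.txt) is an assumption based on typical genomeGenerate output, not something this README specifies, and check_star_index is a hypothetical helper. If STAR reports that it cannot write to the genome directory, creating it first with mkdir -p may help, since older STAR releases are known to expect the directory to exist.

```shell
# Report which core STAR index files are absent or empty in a genome directory.
# Returns non-zero if any of them is missing.
check_star_index() {
    index_dir=$1
    status=0
    for f in Genome SA SAindex genomeParameters.txt; do
        if [ ! -s "$index_dir/$f" ]; then
            echo "missing or empty: $f"
            status=1
        fi
    done
    return "$status"
}

# Example call:
# check_star_index ~/input_files/star_index
```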

Create RSEM Reference

Create .fai and .dict files for the Genome FASTA (both described above).
The GTF file, Genome FASTA file, .fai and .dict should all be in the same directory. Use this directory as a volume mount when running Docker.
Create RSEM reference using rsem-prepare-reference:

docker run --rm -v ~/input_files:/input_files heliumdatacommons/topmed-rnaseq:latest \
    rsem-prepare-reference --num-threads 4 \
    --gtf /input_files/gencode.v26.annotation.gtf \
    /input_files/Homo_sapiens_assembly38_noALT_noHLA_noDecoy_ERCC.fasta \
    /input_files/rsem_reference

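Upon completion, the RSEM reference files will be in the ~/input_files directory, named with the rsem_reference prefix (for example, rsem_reference.grp), because rsem-prepare-reference treats its final argument as a reference name rather than a directory.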
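Because rsem-prepare-reference takes a reference name (a file-name prefix) as its last argument, one way to confirm the run succeeded is to look for files that start with that prefix. The extensions checked below (.grp, .ti, .transcripts.fa, .seq) are the ones RSEM normally writes. The helper name rsem_reference_files is hypothetical.

```shell
# For a given RSEM reference prefix, report whether each expected file exists.
rsem_reference_files() {
    prefix=$1
    for ext in grp ti transcripts.fa seq; do
        if [ -s "$prefix.$ext" ]; then
            echo "found: $prefix.$ext"
        else
            echo "missing: $prefix.$ext"
        fi
    done
}

# Example call:
# rsem_reference_files ~/input_files/rsem_reference
```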
