Nextflow pipeline for demultiplexing Illumina Next-Gen sequencing data.
Clone this repository:
git clone --recursive https://github.com/NYU-Molecular-Pathology/demux-nf.git
The included deploy recipe should be used to create a new directory for demultiplexing based on an existing sequencing run directory. Include arguments that describe the configuration of your sequencing run.
cd demux-nf
make deploy RUNID=170809_NB501073_0019_AH5FFYBGX3 SAMPLESHEET=SampleSheet.csv SEQTYPE=Archer
Arguments:
- RUNID: the identifier given to the run by the sequencer
- SAMPLESHEET: the samplesheet required for demultiplexing with bcl2fastq
- SEQTYPE: the type of sequencing; currently only Archer or NGS580 are used
- SEQDIR: parent directory where the sequencer outputs its data (pre-configured for NYU server locations)
- PRODDIR: parent directory where demultiplexing output should be stored (pre-configured for NYU server locations)
This first checks that the specified run exists on the server, then clones the repository into a new directory at the given production output location and configures it for demultiplexing with the commands described below.
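The existence check can be sketched in Python as follows. This is an illustration only, not the logic in the Makefile: the function names and the run ID pattern are assumptions, with the pattern inferred from the example RUNID above (date, instrument, run counter, flowcell).

```python
import os
import re

# Assumed layout, inferred from the example RUNID above:
# <date>_<instrument>_<run counter>_<flowcell>
RUN_ID_PATTERN = re.compile(r"^\d{6,8}_[A-Za-z0-9]+_\d+_[A-Za-z0-9-]+$")


def looks_like_run_id(run_id):
    """Return True if run_id matches the assumed run ID layout."""
    return RUN_ID_PATTERN.match(run_id) is not None


def run_exists(seqdir, run_id):
    """Return True if <seqdir>/<run_id> is an existing directory."""
    return os.path.isdir(os.path.join(seqdir, run_id))
```

For example, `run_exists("/path/to/sequencer/data", "170809_NB501073_0019_AH5FFYBGX3")` returns True only if that run directory is present under the given parent directory.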
Assuming you used make deploy or make config to prepare your demultiplexing directory, the following command runs the workflow automatically, based on the pre-defined settings and settings from your current system.
make run
Extra parameters to be passed to Nextflow can be supplied with the EP argument:
make run EP='--samplesheet SampleSheet.csv --runDir /path/to/sequencer/data/170809_NB501073_0019_AH5FFYBGX3'
To submit the parent Nextflow pipeline as a job on the HPC cluster:
make submit
# with a different submission queue:
make submit SUBQ=fn_long
# with a different submission time:
make submit SUBQ=cpu_long SUBTIME='--time=6-00:00:00'
For alternative run methods, consult the Makefile.
Demultiplexing metadata for the workflow can be provided through several methods, evaluated in the following order:
- parameters can be supplied directly to Nextflow via CLI
nextflow run main.nf --runID 12345
- if the file config.json is present, non-null parameters will be retrieved:
{
"runDir": "/path/to/sequencer/data/170809_NB501073_0019_AH5FFYBGX3",
"samplesheet": "SampleSheet.csv",
"runID": "170809_NB501073_0019_AH5FFYBGX3"
}
  - this file is generated automatically during the deploy step, using the included config.py script
- the following items in the current directory will be used if present:
  - SampleSheet.csv: default samplesheet file
  - runDir: default sequencing run source directory (can be a symlink)
  - runID.txt: a text file, the first line of which will be used as the run ID
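The lookup order above can be restated as a short Python sketch. The pipeline itself resolves these values in Nextflow, so this is only an illustration: the function name resolve_metadata is hypothetical, and it assumes that a value found in an earlier source takes precedence over later ones.

```python
import json
import os


def resolve_metadata(cli_params, directory="."):
    """Merge run metadata from three sources; earlier sources win (assumed)."""
    resolved = {"runID": None, "samplesheet": None, "runDir": None}

    # 1. Values passed directly on the command line.
    for key in resolved:
        if cli_params.get(key) is not None:
            resolved[key] = cli_params[key]

    # 2. Non-null values from config.json, if the file is present.
    config_path = os.path.join(directory, "config.json")
    if os.path.isfile(config_path):
        with open(config_path) as handle:
            config = json.load(handle)
        for key in resolved:
            if resolved[key] is None and config.get(key) is not None:
                resolved[key] = config[key]

    # 3. Default items in the directory, used only for values still unset.
    samplesheet = os.path.join(directory, "SampleSheet.csv")
    if resolved["samplesheet"] is None and os.path.isfile(samplesheet):
        resolved["samplesheet"] = samplesheet

    run_dir = os.path.join(directory, "runDir")
    # os.path.isdir follows symlinks, so a symlinked runDir is accepted.
    if resolved["runDir"] is None and os.path.isdir(run_dir):
        resolved["runDir"] = run_dir

    run_id_file = os.path.join(directory, "runID.txt")
    if resolved["runID"] is None and os.path.isfile(run_id_file):
        with open(run_id_file) as handle:
            # Only the first line of runID.txt is used.
            resolved["runID"] = handle.readline().strip()

    return resolved
```

Calling `resolve_metadata({"runID": "12345"})` in a deployed directory would keep the run ID given on the command line and fill the samplesheet and run directory from config.json or the default files.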
- (re)initialize configurations (overwrites old config.json):
make config RUNDIR=/path/to/sequencer/data/170809_NB501073_0019_AH5FFYBGX3 SAMPLESHEET=SampleSheet.csv RUNID=170809_NB501073_0019_AH5FFYBGX3
- update an existing directory to the latest version of this repo:
make update
- clean up workflow intermediary files to save space (workflow cannot be resumed after this):
make finalize
- clean up output from all old workflows (saves current workflow output):
make clean
- delete the output from all workflows:
make clean-all
- mark that the demultiplexing succeeded and the results passed QC for downstream analysis:
make passed
- deploy a new NGS580 analysis using the current results:
make deploy-NGS580
- make a 'deliverables' directory with just the results for a specific client's samples:
make deliverable CLIENT=somelab SHEET=list_of_clients_samples.txt
Required:
- Java 8 (Nextflow)
- Python 2.7+
- GNU make
Optional; must be installed to system or available with Singularity containers:
- bcl2fastq version 2.17.1
- FastQC version 0.11.7
- R (3.3.0+, with knitr and rmarkdown libraries)
- Pandoc 1.13.1+
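A quick way to see which of these tools are available on the current system is sketched below. It is not part of this repository: the executable names (for example fastqc and Rscript) are assumptions about how the tools are installed, it checks presence only and not versions, and shutil.which requires Python 3.3 or later.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
REQUIRED = ["java", "python", "make"]
OPTIONAL = ["bcl2fastq", "fastqc", "Rscript", "pandoc"]


def missing_tools(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    print("missing required tools:", missing_tools(REQUIRED))
    print("missing optional tools:", missing_tools(OPTIONAL))
```

Optional tools reported as missing may still be provided through the Singularity containers.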