This pipeline requires Nextflow 21.04.0 or higher. Other dependencies are containerized with Singularity and Docker.
By default, this pipeline uses the profile for the Kiel Medcluster. To run it locally on your own computer, set -profile local. Important: Before running the pipeline, change the parameters in conf/local.config to match your local hardware specifications.
Reference databases for Metaphlan4, Kraken2 and HUMAnN3 are needed (HUMAnN3.6 additionally needs a Metaphlan4 database). On the Kiel Medcluster, these are already set in the respective config file.
Metaphlan DB: --metaphlan_db
HUMAnN DB: --humann_db
Kraken DB: --kraken2_db
Salmon DB: --salmon_db
Sylph DB: --sylph_db
The pipeline is module-based. In its most basic form, a run executes only the QC module.
Run the pipeline with:
nextflow run ikmb/TOFU-MAaPO --reads '/path/to/fastqfiles/*_R{1,2}_001.fastq.gz'
Either use:
--reads
Either a glob matching your fastq.gz files, or the path to a csv-file with the columns id, read1 and read2 that lists all samples you want to process. For single-end mode, use only the columns "id" and "read1".
or:
--sra
NCBI SRA accession ID. The pipeline automatically downloads all fastq files for your query. You must provide the personal API key of your NCBI account with --apikey. Lists are also possible: "--sra ['ERR908507', 'ERR908506', 'ERR908505']". WARNING: The Nextflow API call to NCBI that the pipeline uses is not free of bugs. Expect more samples to be processed than are in the input list; some samples might also be missing.
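As a minimal sketch of the csv input accepted by --reads, the following writes a sample sheet with the columns id, read1 and read2 and prints the matching pipeline call. The sample IDs and file paths are placeholders, and the comma-separated header line is an assumption based on the column names given above.

```shell
# Write a paired-end sample sheet; sample IDs and paths are placeholders.
cat > samples.csv <<'EOF'
id,read1,read2
sampleA,/path/to/fastqfiles/sampleA_R1_001.fastq.gz,/path/to/fastqfiles/sampleA_R2_001.fastq.gz
sampleB,/path/to/fastqfiles/sampleB_R1_001.fastq.gz,/path/to/fastqfiles/sampleB_R2_001.fastq.gz
EOF

# Print the call instead of launching it, so the line can be checked first.
RUN_CMD="nextflow run ikmb/TOFU-MAaPO --reads samples.csv"
echo "$RUN_CMD"
```

For single-end data, the same sheet would carry only the id and read1 columns.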
For analysis, the following modules are available:
--metaphlan
Run Metaphlan4, a tool for profiling the composition of microbial communities
--humann
Run HUMAnN3, a tool for profiling the abundance of microbial metabolic pathways and other molecular functions
--kraken
Run Kraken2, a tool for taxonomic classification. On the Medcluster, a RefSeq virus database is preconfigured.
--bracken
Run Bracken (Bayesian Reestimation of Abundance with KrakEN) after Kraken2. The Kraken2 DB must be Bracken-ready.
--salmon
Run salmon.
--sylph
Run sylph.
--assembly
Run an extended genome assembly workflow with MAGScoT Bin Refinement.
--updatemetaphlan
Download the Metaphlan4 database to the directory set in parameter metaphlan_db.
--updatehumann
Download the HUMAnN3 database to the directory set in parameter humann_db. HUMAnN3 requires the Metaphlan4 database, too.
--updategtdbtk
Download the GTDB-Tk reference data to the directory set in parameter gtdbtk_reference.
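The module switches above can be combined with the database parameters in a single call. The sketch below assumes a local run with the Metaphlan4 and HUMAnN3 modules; both database directories are placeholder paths, and the command is only printed, not launched.

```shell
# Placeholder database locations; point these at your own directories.
METAPHLAN_DB="/path/to/metaphlan_db"
HUMANN_DB="/path/to/humann_db"

# HUMAnN3 also needs the Metaphlan4 database, so both paths are passed.
RUN_CMD="nextflow run ikmb/TOFU-MAaPO -profile local \
--reads '/path/to/fastqfiles/*_R{1,2}_001.fastq.gz' \
--metaphlan --humann \
--metaphlan_db $METAPHLAN_DB --humann_db $HUMANN_DB"

# Print the assembled call; once the paths are real, run it with: eval "$RUN_CMD"
echo "$RUN_CMD"
```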
--genome
Set the host genome. On the IKMB Medcluster, valid options are human, mouse or chimp. On other systems, the host genome needs to be pre-configured; see "How to add a host genome to the pipeline?".
--cleanreads
Publish QC'ed fastq.gz files. Disabled by default.
--no_qc
Skip the QC module. Only use this if your input reads are the output of --cleanreads.
--fastp
QC is performed with fastp
--metaphlan_db
Directory of Metaphlan database. REQUIRED!
--publish_metaphlanbam
Publish the bam file output of Metaphlan.
--metaphlan_db
Directory of Metaphlan database, which HUMAnN3 needs as well. REQUIRED!
--humann_db
Directory of HUMAnN database. REQUIRED!
--assemblymode
Set whether co-assembly (group or all) or single-sample assembly (single, the default) is performed. The option group is only available if the input is a csv-file with a column "group". For co-assembly, no more than 100 samples per group (in "group" mode) or per run (in "all" mode) are recommended due to hardware restrictions.
--binner
Comma-separated list of binning tools to use. Options are: concoct, maxbin, semibin, metabat and vamb. For best performance, choose several. By default, all of them are used.
--contigsminlength
Set a minimum length of contig. Smaller contigs will be discarded. Default: 2000.
--semibin_environment
Set the trained environment for SemiBin. Default is human_gut. See the SemiBin documentation for other options. Choose global if no other environment is appropriate.
--skip_gtdbtk
Skip GTDB-Tk. By default, both genome assembly modules run GTDB-Tk for taxonomic profiling.
--skip_checkm
Skip the CheckM bin quality check.
--gtdbtk_reference
Reference database for GTDB-Tk. It must be set (already set on the Kiel Medcluster).
--publish_megahit
Publish the results of MEGAHIT.
--publish_rawbins
Publish the results of all used binning tools in the genome assembly workflow.
--vamb_groupsize
Only used when binning with vamb is performed and assemblymode is "single". Set a subgrouping size for vamb; the default is 100. This is a temporary fix that enables the pipeline to handle very large cohorts on medium-sized hardware. For best results, adjust the group size to the total sample size of your cohort.
--magscot_min_sharing
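Minimum share of markers that bin sets must have in common to be considered for merging. See the MAGScoT documentation for the default.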
--magscot_score_a
Scoring parameter a [default=1]
--magscot_score_b
Scoring parameter b [default=0.5]
--magscot_score_c
Scoring parameter c [default=0.5]
--magscot_threshold
Scoring minimum completeness threshold [default=0.5]
--magscot_min_markers
Minimum number of unique markers in bins to be considered as seed for bin merging [default=25]
--magscot_iterations
Number of merging iterations to perform. [default=2]
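The assembly options above can be sketched as a grouped co-assembly run. This assumes the "group" column is simply appended to the id, read1, read2 sample sheet; the column order, sample IDs, group names and paths are all placeholders, and the command is only printed.

```shell
# Sample sheet with an extra "group" column, as required by --assemblymode group.
cat > samples_grouped.csv <<'EOF'
id,read1,read2,group
sampleA,/path/to/fastqfiles/sampleA_R1_001.fastq.gz,/path/to/fastqfiles/sampleA_R2_001.fastq.gz,cohort1
sampleB,/path/to/fastqfiles/sampleB_R1_001.fastq.gz,/path/to/fastqfiles/sampleB_R2_001.fastq.gz,cohort1
sampleC,/path/to/fastqfiles/sampleC_R1_001.fastq.gz,/path/to/fastqfiles/sampleC_R2_001.fastq.gz,cohort2
EOF

# Co-assemble per group, restrict binning to three tools, keep contigs >= 2000 bp.
RUN_CMD="nextflow run ikmb/TOFU-MAaPO --reads samples_grouped.csv \
--assembly --assemblymode group \
--binner metabat,semibin,vamb --contigsminlength 2000"

# Print the assembled call for inspection.
echo "$RUN_CMD"
```

Restricting --binner shortens the run, although the description of --binner suggests that several binners give the best results.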
--single_end
Set the pipeline to process single-end reads.
--outdir
Set a custom output directory, default is "results".
-resume
Resume the pipeline. The run continues from already completed, cached processes.
-profile
Change the configuration of the pipeline. Valid options are medcluster (default), local or custom. You can add a new profile for your compute system by editing the file custom.config in the folder conf or create a new one and add it in the file nextflow.config under 'profiles'.
-work-dir
Set a custom work directory, default is "work".
-r
Use a specific branch or release version of the pipeline.
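Nextflow's own options take a single dash, while pipeline parameters take two. A minimal sketch of a resumed local run follows; the revision and work directory are placeholders that must be replaced before the printed command is used.

```shell
# Placeholders: replace with a real branch or release name and a real directory.
REVISION="<branch-or-tag>"
WORK_DIR="/path/to/work"

# Single-dash options belong to Nextflow, double-dash options to the pipeline.
RUN_CMD="nextflow run ikmb/TOFU-MAaPO -r $REVISION -profile local \
-work-dir $WORK_DIR -resume \
--reads '/path/to/fastqfiles/*_R{1,2}_001.fastq.gz' --outdir results"

# Print the assembled call for inspection.
echo "$RUN_CMD"
```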
--publish_rawreads
Publish unprocessed/raw files downloaded from SRA in the output directory.
--kraken2_db
Directory of the Kraken2 database to use. It must be Bracken-ready if used with Bracken. REQUIRED!
--salmon_db
Directory of the salmon database to use. REQUIRED!
--salmon_reference
Path to tab-separated taxonomy file corresponding to the used salmon database. Not required if used with default database. Two column file with header line containing in the first column the bin names used in the salmon database and in the second column the taxonomic assignment by GTDB-Tk in the format "d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli".
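A minimal sketch of such a taxonomy file is shown below. The header names and the bin name bin_001 are hypothetical; only the two-column, tab-separated layout with a header line and the GTDB-Tk style lineage come from the description above.

```shell
# Header line (column names are placeholders), then one row per bin.
printf 'bin\ttaxonomy\n' > salmon_reference.tsv
printf 'bin_001\td__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli\n' >> salmon_reference.tsv

# Check that every line has exactly two tab-separated columns.
awk -F'\t' 'NF != 2 { bad = 1 } END { exit bad }' salmon_reference.tsv \
  && echo "salmon_reference.tsv has two columns on every line"
```

The file would then be passed with --salmon_reference salmon_reference.tsv alongside --salmon_db.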
--salmon_processing
NOT RECOMMENDED! Shortcut for high-throughput data processing with salmon. Skips QC; no other modules are available in this mode.
--sylph_db
Set the path to a sylph database.
--sylph_merge
All sylph profiling will be done in one process. Produces a single output for all samples combined.
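--bracken_length
Read length passed to Bracken. Default: 100.
--bracken_level
Taxonomic level passed to Bracken. Default: "S".
--bracken_threshold
Read threshold passed to Bracken. Default: 0.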
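As a sketch of how the Bracken parameters might be overridden, the call below runs Kraken2 followed by Bracken at genus level. The values 150, G and 10 are illustrative only, the database path is a placeholder, and it is an assumption that the pipeline hands these values to Bracken unchanged.

```shell
# Placeholder path to a Bracken-ready Kraken2 database.
KRAKEN2_DB="/path/to/kraken2_db"

# Illustrative overrides of the defaults (100, "S", 0).
RUN_CMD="nextflow run ikmb/TOFU-MAaPO -profile local \
--reads '/path/to/fastqfiles/*_R{1,2}_001.fastq.gz' \
--kraken --bracken --kraken2_db $KRAKEN2_DB \
--bracken_length 150 --bracken_level G --bracken_threshold 10"

# Print the assembled call for inspection.
echo "$RUN_CMD"
```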