
Usage:

This pipeline requires Nextflow 21.04.0 or higher. Other dependencies are containerized with Singularity and Docker.

By default, this pipeline uses the profile for the Kiel Medcluster. Should you choose to run it locally on your own computer, set -profile local. Important: adjust the parameters in conf/local.config to your local hardware specifications before running the pipeline.

Reference databases for Metaphlan4, Kraken2 and HUMAnN3 are needed (HUMAnN3.6 also requires a Metaphlan4 database). On the Kiel Medcluster, these are already set in the respective config file.
Metaphlan DB: --metaphlan_db
HUMAnN DB: --humann_db
Kraken DB: --kraken2_db
Salmon DB: --salmon_db
Sylph DB: --sylph_db

The pipeline is module-based; in its most basic run it executes only the QC module.

Run the pipeline with:

nextflow run ikmb/TOFU-MAaPO --reads '/path/to/fastqfiles/*_R{1,2}_001.fastq.gz'

Input:

Either use:
--reads A glob pattern matching your fastq.gz files, or the path to a CSV file with the columns id, read1, read2 that lists all samples you want to process (see the example below). For single-end mode, use only the columns id and read1.
or:
--sra NCBI SRA accession ID. The pipeline will automatically download all fastq files for your query. It is mandatory to provide the personal API key of your NCBI account with --apikey. Lists are also possible: "--sra ['ERR908507', 'ERR908506', 'ERR908505']". WARNING: The Nextflow API call to NCBI is not free of bugs. Expect more samples to be processed than are in the input list; some samples might also be missing.
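
For example, a paired-end sample sheet passed to --reads could look like this (sample names and file paths are placeholders):

id,read1,read2
sampleA,/path/to/sampleA_R1_001.fastq.gz,/path/to/sampleA_R2_001.fastq.gz
sampleB,/path/to/sampleB_R1_001.fastq.gz,/path/to/sampleB_R2_001.fastq.gz

and would be used like this:

nextflow run ikmb/TOFU-MAaPO --reads '/path/to/samplesheet.csv'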

Available modules:

The following analysis modules are available:
--metaphlan Run Metaphlan4, a tool for profiling the composition of microbial communities.
--humann Run HUMAnN3, a tool for profiling the abundance of microbial metabolic pathways and other molecular functions.
--kraken Run Kraken2, a taxonomic classification tool, with a RefSeq virus database that is preconfigured on the Medcluster.
--bracken Run Bracken (Bayesian Reestimation of Abundance with KrakEN) after Kraken2. The Kraken2 database must be Bracken-ready.
--salmon Run salmon.
--sylph Run sylph.
--assembly Run an extended genome assembly workflow with MAGScoT Bin Refinement.
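
For example, a run that performs QC followed by taxonomic and functional profiling could look like this (the database paths are placeholders; on the Kiel Medcluster they are already preset in the config):

nextflow run ikmb/TOFU-MAaPO --reads '/path/to/fastqfiles/*_R{1,2}_001.fastq.gz' --metaphlan --humann --metaphlan_db /path/to/metaphlan_db --humann_db /path/to/humann_db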

Initialization options:

--updatemetaphlan Download the Metaphlan4 database to the directory set in parameter metaphlan_db.
--updatehumann Download the HUMAnN3 database to the directory set in parameter humann_db. HUMAnN3 requires the Metaphlan4 database, too.
--updategtdbtk Download the GTDB-Tk reference data to the directory set in parameter gtdbtk_reference.
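
A one-time initialization run to download the databases might look like this (the target directories are placeholders and can also be preset in a config file):

nextflow run ikmb/TOFU-MAaPO --updatemetaphlan --metaphlan_db /path/to/metaphlan_db
nextflow run ikmb/TOFU-MAaPO --updatehumann --humann_db /path/to/humann_db --metaphlan_db /path/to/metaphlan_db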

QC options:

--genome Set the host genome. On the IKMB Medcluster, valid options are human, mouse or chimp. In other cases the host genome needs to be pre-configured (see the documentation on how to add a host genome to the pipeline).
--cleanreads Publish QC'ed fastq.gz files. Disabled by default.
--no_qc Skip the QC module. Only use this if your input reads are the output of --cleanreads.
--fastp Perform QC with fastp.
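
A QC-focused example that removes host reads and publishes the cleaned files (assuming the human host genome is configured, as on the IKMB Medcluster):

nextflow run ikmb/TOFU-MAaPO --reads '/path/to/fastqfiles/*_R{1,2}_001.fastq.gz' --genome human --cleanreads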

Metaphlan options:

--metaphlan_db Directory of Metaphlan database. REQUIRED!
--publish_metaphlanbam Publish the bam file output of Metaphlan.

HUMAnN options:

--metaphlan_db Directory of Metaphlan database. REQUIRED!
--humann_db Directory of HUMAnN database. REQUIRED!

Assembly options:

--assemblymode Set whether co-assembly (group or all) or single-sample assembly (single, the default) should be performed; see the example below this list. The option group is only available if the input is a CSV file with a "group" column. For co-assembly, no more than 100 samples per group (in group mode) or per run (in all mode) are recommended due to hardware restrictions.
--binner Comma-separated list of binning tools to use. Options are: concoct, maxbin, semibin, metabat and vamb. For best performance, choose multiple. By default, all of them are used.
--contigsminlength Set the minimum contig length. Shorter contigs will be discarded. Default: 2000.
--semibin_environment Set the trained environment for SemiBin. Default is human_gut. See the SemiBin documentation for other options. Choose global if no other environment is appropriate.
--skip_gtdbtk Skip GTDB-Tk. By default, both genome assembly modules run GTDB-Tk for taxonomic profiling.
--skip_checkm Skip the CheckM bin quality check.
--gtdbtk_reference Path to the GTDB-Tk reference database. Needs to be set (already set on the Kiel Medcluster).
--publish_megahit Publish the MEGAHIT assembly results.
--publish_rawbins Publish the results of all used binning tools in the genome assembly workflow.
--vamb_groupsize Only used when binning with vamb is performed and assemblymode is "single". Sets a subgrouping size for vamb; default is 100. This is a temporary fix to enable the pipeline to handle very large cohorts on medium-sized hardware. For best results, adjust the group size to the total sample size of your cohort.
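
A sketch of a co-assembly run in group mode (sample sheet and reference paths are placeholders; the CSV must contain a "group" column):

nextflow run ikmb/TOFU-MAaPO --reads '/path/to/samplesheet.csv' --assembly --assemblymode group --binner metabat,semibin,vamb --gtdbtk_reference /path/to/gtdbtk_data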

MAGScoT options:

--magscot_min_sharing Scoring parameter a [default=1]
--magscot_score_a Scoring parameter a [default=1]
--magscot_score_b Scoring parameter b [default=0.5]
--magscot_score_c Scoring parameter c [default=0.5]
--magscot_threshold Scoring minimum completeness threshold [default=0.5]
--magscot_min_markers Minimum number of unique markers in bins to be considered as seed for bin merging [default=25]
--magscot_iterations Number of merging iterations to perform. [default=2]
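
These defaults can be overridden on the command line, for example (the values shown are illustrative only):

nextflow run ikmb/TOFU-MAaPO --reads '/path/to/samplesheet.csv' --assembly --magscot_threshold 0.7 --magscot_min_markers 50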

Other options:

--single_end Set the pipeline for single end reads.
--outdir Set a custom output directory, default is "results".
-resume Resume the pipeline; the run will continue from already completed, cached processes.
-profile Change the configuration of the pipeline. Valid options are medcluster (default), local or custom. You can add a new profile for your compute system by editing the file custom.config in the folder conf, or create a new one and add it to the file nextflow.config under 'profiles'.
-work-dir Set a custom work directory, default is "work".
-r Use a specific branch or release version of the pipeline.
--publish_rawreads Publish unprocessed/raw files downloaded from SRA in the output directory.
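
For instance, a local run with a custom output directory that resumes a previous execution could be started like this (the paths are placeholders):

nextflow run ikmb/TOFU-MAaPO -profile local --reads '/path/to/fastqfiles/*_R{1,2}_001.fastq.gz' --outdir /path/to/results -resume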

Kraken2 options:

--kraken2_db Directory of the Kraken2 database to use. Should be Bracken-ready for use with Bracken. REQUIRED!

Salmon options:

--salmon_db Directory of the salmon database to use. REQUIRED!
--salmon_reference Path to a tab-separated taxonomy file corresponding to the salmon database in use (see the sketch below). Not required when using the default database. A two-column file with a header line: the first column contains the bin names used in the salmon database, the second column the taxonomic assignment by GTDB-Tk in the format "d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli".
--salmon_processing NOT RECOMMENDED! Shortcut for high-throughput data processing with salmon; skips QC, and no other modules are available in this mode.
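
A minimal sketch of such a taxonomy file (tab-separated; the header names and bin names are illustrative placeholders):

bin	GTDB_taxonomy
bin_001	d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli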

Sylph options:

--sylph_db Set the path to a sylph database.
--sylph_merge All sylph profiling will be done in one process. Produces a single output for all samples combined.
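
An example sylph run producing one combined profile for all samples (the database path is a placeholder):

nextflow run ikmb/TOFU-MAaPO --reads '/path/to/fastqfiles/*_R{1,2}_001.fastq.gz' --sylph --sylph_db /path/to/sylph_db --sylph_merge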

Bracken options and their defaults:

--bracken_length = 100
--bracken_level = "S"
--bracken_threshold = 0
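
For example, a Kraken2 run with Bracken abundance reestimation and a non-default read length could look like this (the database path is a placeholder and must point to a Bracken-ready Kraken2 database):

nextflow run ikmb/TOFU-MAaPO --reads '/path/to/fastqfiles/*_R{1,2}_001.fastq.gz' --kraken --bracken --kraken2_db /path/to/kraken2_db --bracken_length 150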