Workflow

Anna Price edited this page Sep 9, 2020 · 4 revisions

The first stage of the workflow creates two input channels: EditFasta and OldFasta. The EditFasta channel is created from params.addFasta and should contain the new fasta files to be added to the database. The channel uses the mapping .map{ file -> tuple(file.getParent().getName(), file) } to pair each fasta file with the name of the directory it sits in. The fasta files should be sorted into one directory per taxon, where the directory name is the taxon name with words separated by underscores. The OldFasta channel is created from params.previousDatabase and should contain fasta files from a previous database build, which have already been mapped to their tax ID. If there is no previous database build, params.previousDatabase should be set to null, which creates an empty OldFasta channel.
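The directory-name mapping can be illustrated outside Nextflow. The following is a minimal Python sketch of the same idea, not the pipeline's own code; the function name key_by_parent_dir and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def key_by_parent_dir(path):
    """Return (parent directory name, path).

    Mirrors the idea of tuple(file.getParent().getName(), file):
    the key is the name of the directory the fasta file sits in.
    """
    p = PurePosixPath(path)
    return (p.parent.name, p)


# Hypothetical layout: one directory per taxon, words joined by underscores.
example_paths = [
    "new_fastas/Mycobacterium_avium/sample1.fasta",
    "new_fastas/Mycobacterium_avium/sample2.fasta",
    "new_fastas/Mycobacterium_kansasii/sample3.fasta",
]
pairs = [key_by_parent_dir(p) for p in example_paths]
for taxon, fasta in pairs:
    print(taxon, fasta.name)
```

Each fasta file is keyed by its immediate parent directory only, so any directories higher up the path do not affect the taxon name.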

The main workflow then calls three component workflows: prepareNewFasta, selectFasta and krakenBuild.

prepareNewFasta

The prepareNewFasta workflow component takes the new fasta files to be added to the database (i.e. the EditFasta channel), looks up the tax ID for each, and adds it to the sequence headers and the filename. Fasta files from previous database builds (i.e. the OldFasta channel) skip this stage.

prepareNewFasta contains one process called autoDatabase_addTaxon. This process takes an individual fasta and its taxon/directory name as input and uses the taxadd script to add the tax ID to the sequence headers and filenames. The taxadd script uses the NCBI taxonomy names.dmp to look up the tax ID. The output of autoDatabase_addTaxon is the tax ID mapped fasta.
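The taxadd script itself is not reproduced on this page. As a rough illustration of a names.dmp lookup, the Python sketch below parses lines in the NCBI names.dmp layout (fields separated by a tab, a pipe and a tab) and maps scientific names to tax IDs. The function names are hypothetical, and the step that converts underscores in the directory name to spaces is an assumption about how the taxon name is matched.

```python
def load_scientific_names(lines):
    """Build {scientific name: tax ID} from names.dmp-style lines."""
    name_to_taxid = {}
    for line in lines:
        # Drop the trailing "\t|" terminator, then split on the pipe separator.
        fields = [f.strip() for f in line.rstrip("\n").rstrip("|").split("|")]
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        taxid, name, _unique_name, name_class = fields[:4]
        if name_class == "scientific name":
            name_to_taxid[name] = int(taxid)
    return name_to_taxid


def taxid_for_directory(dir_name, name_to_taxid):
    """Look up a tax ID from a directory name such as 'Mycobacterium_avium'.

    Assumption: underscores in the directory name stand for spaces.
    Returns None when the name is not found.
    """
    return name_to_taxid.get(dir_name.replace("_", " "))


# Illustrative rows in the names.dmp layout.
sample_rows = [
    "1764\t|\tMycobacterium avium\t|\t\t|\tscientific name\t|\n",
    "1773\t|\tMycobacterium tuberculosis\t|\t\t|\tscientific name\t|\n",
    "1773\t|\tMTB\t|\t\t|\tacronym\t|\n",
]
lookup = load_scientific_names(sample_rows)
print(taxid_for_directory("Mycobacterium_avium", lookup))
```

Only rows with the name class "scientific name" are kept in this sketch, so synonyms and acronyms do not resolve to a tax ID.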

Once the prepareNewFasta workflow component has finished executing, the AllFasta channel is created, containing the output fasta files from autoDatabase_addTaxon and the fasta files from OldFasta. Each fasta file is mapped to its tax ID, which is taken from the start of the filename using .map{ file -> tuple(file.getName().split("_")[0], file) }. The files are then grouped by tax ID using .groupTuple(sort: true).
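To make the keying and grouping step concrete, here is a minimal Python sketch of the same logic. It is an analogy for the Nextflow operators rather than the pipeline's code; group_by_taxid and the example filenames are hypothetical, and sorting each group only loosely mirrors groupTuple(sort: true).

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_taxid(paths):
    """Group fasta paths by the tax ID prefix of their filename.

    The tax ID is everything before the first underscore in the
    filename, as in file.getName().split("_")[0].
    """
    groups = defaultdict(list)
    for path in paths:
        taxid = PurePosixPath(path).name.split("_")[0]
        groups[taxid].append(path)
    # Sort the files within each group for a deterministic order.
    return {taxid: sorted(files) for taxid, files in groups.items()}


# Hypothetical filenames that already carry a tax ID prefix.
grouped = group_by_taxid([
    "work/1773_sampleB.fasta",
    "old/1764_sampleC.fasta",
    "old/1773_sampleA.fasta",
])
print(grouped)
```

Because the key is taken from the filename alone, files from the new and previous builds end up in the same group whenever they share a tax ID prefix.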

selectFasta

The selectFasta workflow component takes the AllFasta channel and selects high quality assemblies using Mash. Parallelisation is by taxon.

The first process, autoDatabase_mash, takes the AllFasta channel and the tax ID as input, calculates the pairwise Mash distances for each taxon, and outputs text files containing these distances, named in the form ${taxid}_mashdist.txt.

The next process, autoDatabase_qc, is the quality control stage of the workflow, which selects the high quality assemblies that will go on to form the database. This process takes the output text file from autoDatabase_mash for each taxon and uses the fastaselect script to output text files listing the high quality assemblies for each taxon. The fastaselect script builds a Mash distance matrix, finds the average distance for each assembly, and finds the mode of these averages to 2 s.f. The filenames of assemblies whose average distance is within 10% of the mode are then recorded in a text file. These are the high quality assemblies which will go on to build the database. If there are fewer than three samples for a taxon, the assembly/assemblies for this taxon are added to the database with no quality control.
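The selection rule described above can be sketched in a few lines of Python. This is not the fastaselect script. It is a minimal illustration under several assumptions: each assembly's average excludes its distance to itself, "within 10%" is measured relative to the mode, and ties for the mode are broken by first occurrence. The names round_sig and select_assemblies are hypothetical.

```python
from collections import Counter
from math import floor, log10


def round_sig(x, sig=2):
    """Round x to the given number of significant figures."""
    if x == 0:
        return 0.0
    return round(x, sig - 1 - floor(log10(abs(x))))


def select_assemblies(names, distances, tolerance=0.10):
    """Pick assemblies whose mean distance is within 10% of the mode.

    names: assembly names for one taxon.
    distances: {(a, b): distance} with one entry per unordered pair.
    """
    # Fewer than three samples: keep everything, no quality control.
    if len(names) < 3:
        return sorted(names)
    totals = {n: 0.0 for n in names}
    for (a, b), d in distances.items():
        totals[a] += d
        totals[b] += d
    # Mean distance to the other assemblies (self-distance excluded).
    means = {n: totals[n] / (len(names) - 1) for n in names}
    # Mode of the means after rounding to 2 significant figures.
    mode = Counter(round_sig(m) for m in means.values()).most_common(1)[0][0]
    return sorted(n for n in names if abs(means[n] - mode) <= tolerance * mode)


# Toy distances: a, b and c are close to each other, d is an outlier.
toy = {
    ("a", "b"): 0.01, ("a", "c"): 0.01, ("b", "c"): 0.01,
    ("a", "d"): 0.2, ("b", "d"): 0.2, ("c", "d"): 0.2,
}
selected = select_assemblies(["a", "b", "c", "d"], toy)
print(selected)
```

In the toy data, a, b and c each have a mean distance of about 0.073, which is also the mode, while d has a mean of 0.2 and is excluded.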

The final process in selectFasta is autoDatabase_cleanFasta. It is a serial process which takes the lists of high quality assemblies output by autoDatabase_qc, together with the AllFasta channel, and moves the fasta files listed in the text files to the assemblies directory. This creates a channel containing all the high quality assemblies, which can be passed to the Kraken database building stage.

krakenBuild

The krakenBuild workflow component takes the high quality assemblies and builds a database using Kraken2.

The autoDatabase_kraken2Build process takes the high quality fasta files from selectFasta as input, collecting them using .collect(), and outputs the .k2d Kraken2 database files. The script for autoDatabase_kraken2Build first downloads the May 2020 taxonomy from the NCBI and amends the taxon information for Mycobacterium tomidae in names.dmp and nodes.dmp (as tomidae is currently absent from the NCBI taxonomy, it is assigned a tax ID equal to the largest existing tax ID + 1). It then moves names.dmp and nodes.dmp to the taxonomy directory. The fasta files are added to the Kraken2 library using kraken2-build --add-to-library. Once all the files have been added to the library, the database is built using kraken2-build --build, with the number of CPUs set to 24 using the cpus 24 directive.
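The "largest tax ID + 1" rule can be shown with a short Python sketch. This is not the script used by autoDatabase_kraken2Build. It only illustrates finding the next free tax ID in names.dmp-style lines and formatting a names.dmp row for a new scientific name. The function names and the toy rows are hypothetical, and the matching nodes.dmp row, which has more columns, is left out.

```python
def next_free_taxid(dmp_lines):
    """Return the largest tax ID found in the first column, plus one."""
    taxids = [
        int(line.split("|", 1)[0].strip())
        for line in dmp_lines
        if line.strip()  # ignore blank lines
    ]
    return max(taxids) + 1


def names_row(taxid, scientific_name):
    """Format one names.dmp row for a scientific name."""
    return f"{taxid}\t|\t{scientific_name}\t|\t\t|\tscientific name\t|\n"


# Toy rows with made-up tax IDs, in the names.dmp layout.
toy_names = [
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "25\t|\tExample taxon\t|\t\t|\tscientific name\t|\n",
    "10\t|\tAnother taxon\t|\t\t|\tscientific name\t|\n",
]
new_id = next_free_taxid(toy_names)
print(new_id, repr(names_row(new_id, "Mycobacterium tomidae")))
```

Taking the maximum over the whole file, rather than the last line, avoids relying on the rows being sorted by tax ID.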
