Work in progress
Pesica is a collection of tools to detect and extract pESI plasmid contigs from Salmonella Infantis genomes. The two main steps in the pipeline are:
- Screening: Detect the presence of marker genes in the input genomes that indicate the presence of the pESI plasmid
- Extraction: If pESI is detected in the previous step, extract the plasmid contigs from the input genomes
After using the pipeline, you will have a fasta file containing the extracted pESI-like contigs and their annotations.
Pesica is a collection of snakemake workflows bundled together with a command line interface. conda is required (and mamba is recommended) to manage the various virtual environments used by the pipeline.
- Create the pesica conda environment:
conda/mamba env create -f env.yml
- Install the light and full dbs from bakta
  - See the database download section of the Bakta README for instructions
  - If you already have an existing bakta database, you can skip this step
- Edit workflow/config.yml. For basic uses of the workflow, you'll only need to edit the following:
  - input_directory is the directory that contains the genomes that you would like to run the pipeline on.
    - For now, the pipeline has only been tested on gzipped files (ie, those ending in gz, not tar.gz). It should work for uncompressed fastas, but this hasn't been tested.
  - run_name can be whatever name you choose to help organize the workflow products inside the outputs directory.
  - Edit the light_db and full_db paths in the bakta section with the path where you downloaded the bakta directories to. The light_db and full_db values should be absolute paths ending in db-light and db.
    - Example: shared/databases/db-light or shared/databases/db
  - You can leave the other values alone.
- Activate the virtual environment:
conda activate pesica
- Run the setup script:
python pesica setup
  - This will extract the included blastdb, download the required bakta databases, and set up the virtual environments used in the rest of the pipeline
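The config.yml requirements described in the setup steps above can be checked before running anything. The following is a minimal sketch, not part of pesica: it assumes the config has already been parsed into a plain dictionary with top-level input_directory and run_name keys and a bakta section holding light_db and full_db, and the function names check_config and is_supported_input are hypothetical.

```python
from pathlib import PurePosixPath


def check_config(config):
    """Return a list of problems found in a parsed config.yml mapping.

    Assumed layout (taken from the setup notes, not from pesica's code):
    top-level input_directory and run_name, plus a bakta section
    containing light_db and full_db.
    """
    problems = []
    for key in ("input_directory", "run_name"):
        if not config.get(key):
            problems.append(f"{key} is not set")

    bakta = config.get("bakta", {})
    # Each database path should be absolute and end in the expected directory name.
    for key, expected_name in (("light_db", "db-light"), ("full_db", "db")):
        value = bakta.get(key)
        if not value:
            problems.append(f"bakta.{key} is not set")
            continue
        path = PurePosixPath(value)
        if not path.is_absolute():
            problems.append(f"bakta.{key} is not an absolute path: {value}")
        if path.name != expected_name:
            problems.append(f"bakta.{key} should end in {expected_name}: {value}")
    return problems


def is_supported_input(filename):
    """True for gzipped files (assumed to end in .gz) that are not tar.gz archives."""
    name = filename.lower()
    return name.endswith(".gz") and not name.endswith(".tar.gz")
```

An empty list from check_config means the values match the description above; it does not confirm that the directories exist or that the databases are complete.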
After setting the config values and activating the pesica virtual environment as described above, you should be able to run pesica using:
python pesica <subcommand> <options>
Here, subcommand is one of screen, extract, or ordinate (Note: for now, only screen and extract are implemented). Each subcommand (and pesica itself) has a help page that you can access with --help or -h:
python pesica --help
# To get help on the screen command:
python pesica screen -h
To get started, run the screen command in dry-run mode by doing:
python pesica screen --dry-run
# Alternatively:
python pesica screen -n
This will run the screen workflow in dry-run mode, which shows what the pipeline would do with the inputs without actually running it. If this exits without error, you can then run the pipeline by repeating the command without the dry-run flag:
python pesica screen
Assuming everything completes correctly, you should see an output directory outputs/<run_name>/screen, where run_name is the name that was set in the config.yml file. This directory contains the following files:
- <genome_name>.bakta - a directory containing bakta outputs for each input genome. In particular, it contains <genome_name>.tsv, the gene annotations from bakta that will be parsed downstream
- pesi_presence.tsv - a tsv with a row for each input genome, with columns:
  - genome - the name of the genome
  - has_repA - whether or not the pESI repA gene was detected in the genome
  - pesi_gene_count - the number of pESI marker genes detected in the genome
  - has_pesi - whether or not the genome is predicted to contain pESI. This value is True if pESI repA was detected in the genome and there were at least 2 pESI marker genes detected in the genome
- genomes_with_pesi.txt - a list of the genomes that were predicted to contain pESI
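The has_pesi rule described for pesi_presence.tsv can be written out in a few lines. This is a sketch of the rule as stated above, not pesica's actual implementation: the function names are hypothetical, and it assumes the tsv has a header row with the column names listed and that booleans are written as the strings True and False.

```python
import csv
import io

MIN_MARKER_GENES = 2  # threshold stated in the has_pesi column description


def predict_pesi(has_repa, pesi_gene_count):
    """pESI is predicted only when repA is present and enough marker genes were found."""
    return has_repa and pesi_gene_count >= MIN_MARKER_GENES


def genomes_with_pesi(tsv_text):
    """Return the genome names whose row satisfies the has_pesi rule."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = []
    for row in reader:
        # Assumption: booleans are stored as the literal strings "True"/"False".
        has_repa = row["has_repA"] == "True"
        if predict_pesi(has_repa, int(row["pesi_gene_count"])):
            hits.append(row["genome"])
    return hits
```

Applied to the contents of pesi_presence.tsv, genomes_with_pesi should give the same names as genomes_with_pesi.txt if the assumptions about the file layout hold.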
After running this first step, we can use the extract subcommand to extract the plasmid genomes from the input genomes where pESI was detected. We first check that all the required files are present with a dry-run:
python pesica extract --dry-run
After checking for errors, we can run it with:
python pesica extract
This will output several files that are detailed in the extract subcommand details, but the most important ones for the next steps are:
- outputs/<run_name>/pesi_genomes - a directory containing the plasmid genomes extracted from the input genomes
- outputs/<run_name>/pesi_annotations - directories containing the bakta annotations for the extracted plasmid genomes.
- Since these are just snakemake workflows with a basic CLI, the pipeline needs to be run from inside this directory.
- If you are familiar with snakemake, you can skip the CLI and run the workflow directly with your own commands and options. The commands should look something like snakemake --use-conda --snakefile workflow/rules/<subcommand>.smk, but you can add whatever options you want besides that.
- The screen step isn't too resource intensive, but the extract step can be. It is recommended to run it with more cores, such as through an interactive session (eg, salloc -n 32). Job submission to slurm will be added later.
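For those running snakemake directly, the command shape from the notes above can be assembled programmatically. This is only a sketch: build_snakemake_command is a hypothetical helper, and the --cores and -n options are standard snakemake flags added here for illustration rather than anything pesica requires.

```python
import shlex


def build_snakemake_command(subcommand, cores=1, dry_run=False):
    """Build the argument list for running one workflow directly with snakemake."""
    command = [
        "snakemake",
        "--use-conda",
        "--snakefile", f"workflow/rules/{subcommand}.smk",
        "--cores", str(cores),
    ]
    if dry_run:
        command.append("-n")  # snakemake's dry-run flag
    return command


# Show the command that would be run for the extract step on 32 cores.
print(shlex.join(build_snakemake_command("extract", cores=32)))
```

The returned list can be passed to subprocess.run from inside the repository directory, since the snakefile path is relative to it.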
Searches input sequences for pESI contigs.
For those sequences found to contain pESI, the plasmid-only contigs will be extracted
- Snakemake - workflow management
- Click - command line interface
- Bakta - gene annotation
- BLAST - pESI gene search
- Panaroo - pangenome insertion and clustering
- Documentation site
- HPC/Slurm support
This error appears to mostly occur when a dry-run is attempted (ie, snakemake -np). This error should not impact the actual runs of the pipeline.