Note: PathOGiST is currently not compatible with OSX.
We recommend you create a conda environment for PathOGiST, and install PathOGiST through conda. First set up Bioconda as per the instructions here. PathOGiST requires Python 3.5 or newer:
conda create --name pathogist
And then activate the environment and install PathOGiST:
source activate pathogist
conda install pathogist
When inside the pathogist conda environment, you can then simply run PATHOGIST -h, for example.
Note that you will need to install CPLEX separately, as CPLEX is proprietary software.
The run subcommand runs the PathOGiST pipeline from start to finish (i.e. distance matrix creation -> correlation clustering -> consensus clustering).
The main input file is a YAML configuration file, which you can create with the command
PATHOGIST run [path to where you want your config] --new_config
The configuration file will look like this.
Modify the configuration by adding paths to files, changing parameters, etc. You can add your own keys to the YAML configuration file, and delete the default keys which aren't relevant to your experiment.
The input to each genotyping entry should be a file containing absolute paths to your call files, one path per line.
For example, mlst_calls.txt should look something like:
/absolute/path/to/SRR00001.calls
/absolute/path/to/SRR00002.calls
/absolute/path/to/SRR00003.calls
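If your call files sit together in one directory, a list file like the one above can be generated rather than typed by hand. The following is a minimal sketch, not part of PathOGiST: the function name write_calls_list and the "*.calls" pattern are assumptions chosen to match the example, so adjust the pattern to whatever your genotyping tool produces.

```python
from pathlib import Path


def write_calls_list(calls_dir, list_path, pattern="*.calls"):
    """Write the absolute path of every matching call file, one per line.

    Hypothetical helper: `pattern` assumes call files end in ".calls",
    as in the example above.
    """
    # resolve() makes the directory absolute, so every globbed path is absolute too
    call_files = sorted(Path(calls_dir).resolve().glob(pattern))
    if not call_files:
        raise FileNotFoundError(
            "no files matching {!r} in {}".format(pattern, calls_dir)
        )
    Path(list_path).write_text("".join("{}\n".format(p) for p in call_files))
    return call_files
```

Calling write_calls_list("mlst_calls_dir", "mlst_calls.txt") (both names are placeholders) would then produce a file whose path can be entered in the configuration.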
The output of PathOGiST is a TSV file containing the final consensus cluster assignment for each sample.
The correlation subcommand clusters bacterial samples based on a distance matrix.
The inputs to correlation clustering are:
- A distance matrix in the form of a TSV file
- A threshold cutoff value for the construction of the similarity matrix
The output is a TSV file containing the cluster assignments of the samples described by the distance matrix.
You can run correlation clustering with the following command:
PATHOGIST correlation [distance matrix] [threshold] [output path]
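To build intuition for the threshold argument, here is a small sketch of one common way a cutoff turns distances into signed similarities: subtract each distance from the threshold. This construction is an assumption made for illustration and is not taken from the text above; the function name and sample names are hypothetical. Under this reading, pairs closer than the threshold get a positive weight (evidence for sharing a cluster) and pairs farther apart get a negative weight.

```python
def similarity_from_distance(distances, threshold):
    """Map each pairwise distance d to the signed similarity threshold - d.

    Assumed construction for illustration only; `distances` is a dict
    keyed by (sample_a, sample_b) pairs.
    """
    return {pair: threshold - d for pair, d in distances.items()}


# Hypothetical pairwise distances between three samples
distances = {("A", "B"): 2, ("A", "C"): 40, ("B", "C"): 38}
similarities = similarity_from_distance(distances, threshold=10)

for pair, weight in sorted(similarities.items()):
    relation = "same cluster favoured" if weight > 0 else "separation favoured"
    print(pair, weight, relation)
```

With a threshold of 10, only the pair (A, B) receives a positive weight, so raising or lowering the threshold directly controls how coarse the resulting clusters tend to be.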
The distance subcommand creates distance matrices from genotyping calls, e.g. SNPs, MLSTs, CNVs, etc. Currently, it is only compatible with SNP calls from Snippy, MLST calls from MentaLiST, and CNV calls from Prince.
The input is:
- A text file containing paths to genotyping call files.
The output is a distance matrix represented as a TSV file.
You can run this subcommand like so:
PATHOGIST distance [path/to/calls_file.tsv] [one of SNP/MLST/CNV] [output path]
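When you have several data types, it can be convenient to assemble the distance commands in a script. The sketch below only builds the argument lists in the order shown above; the helper name distance_command and the file names are hypothetical, and it assumes PATHOGIST is on your PATH (i.e. the conda environment is active) when you actually run the commands.

```python
import shlex

ALLOWED_TYPES = ("SNP", "MLST", "CNV")


def distance_command(calls_file, data_type, output_path):
    """Build the argument list for one `PATHOGIST distance` invocation."""
    if data_type not in ALLOWED_TYPES:
        raise ValueError("data_type must be one of SNP/MLST/CNV")
    return ["PATHOGIST", "distance", calls_file, data_type, output_path]


# Hypothetical input and output file names, one set per data type
commands = [
    distance_command(
        "{}_calls.txt".format(t.lower()), t, "{}_dist.tsv".format(t.lower())
    )
    for t in ALLOWED_TYPES
]

for cmd in commands:
    # Print a shell-safe version of each command for inspection
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) from the standard library to execute it; printing first makes it easy to check the paths before launching anything.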
The input for consensus clustering is three files:
- A text file containing paths to distance matrices in .tsv format.
- A text file containing paths to clustering assignments in .tsv format.
- A text file containing the names of the clusterings which are 'finest'.
The output is a TSV file containing the cluster assignments of the samples which are common to all the input distance matrices.
You can run consensus clustering with the following command:
PATHOGIST consensus [distances] [clusterings] [fine_clusterings] [output path]
Each line of the input files should correspond to a specific data type, e.g. SNPs, MLSTs, or CNVs.
Absolute paths to distance matrices and cluster assignments should be prepended with the name of the clustering and an equal sign, i.e. [name]=[absolute path to file].
An example:
Distances file
SNP=/path/to/snp_dist
MLST=/path/to/mlst_dist
CNV=/path/to/cnv_dist
Clusterings file
SNP=/path/to/snp_clust
MLST=/path/to/mlst_clust
CNV=/path/to/cnv_clust
Fine clusterings file
SNP
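The three input files in the example above can also be written programmatically. This is a minimal sketch under stated assumptions: the function name write_consensus_inputs and the output file names are hypothetical, and the check that the distances and clusterings files name the same data types is an assumption about sensible input, not a documented requirement.

```python
from pathlib import Path


def write_consensus_inputs(distances, clusterings, finest, out_dir):
    """Write the distances, clusterings, and fine clusterings files.

    `distances` and `clusterings` map a name (e.g. "SNP") to an absolute
    path; `finest` lists the names of the finest clusterings.
    """
    if set(distances) != set(clusterings):  # assumed consistency check
        raise ValueError("distances and clusterings must use the same names")
    unknown = set(finest) - set(clusterings)
    if unknown:
        raise ValueError("unknown fine clusterings: {}".format(sorted(unknown)))
    for path in list(distances.values()) + list(clusterings.values()):
        if not Path(path).is_absolute():
            raise ValueError("path is not absolute: {}".format(path))

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    files = {
        "distances": out_dir / "distances.txt",
        "clusterings": out_dir / "clusterings.txt",
        "fine_clusterings": out_dir / "fine_clusterings.txt",
    }
    # One [name]=[absolute path to file] entry per line
    files["distances"].write_text(
        "".join("{}={}\n".format(n, p) for n, p in distances.items())
    )
    files["clusterings"].write_text(
        "".join("{}={}\n".format(n, p) for n, p in clusterings.items())
    )
    # The fine clusterings file holds names only
    files["fine_clusterings"].write_text("".join("{}\n".format(n) for n in finest))
    return files
```

The returned paths can be passed, in order, as the [distances], [clusterings], and [fine_clusterings] arguments of the consensus command.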
To cite PathOGiST in publications, please use: