Tree-based joint lineage inference and somatic mutation calling
Assume that we start with a set of mutations in the file recallVAF_filtered.vcf.gz.
The typical treecall workflow consists of three steps:
- Infer an initial tree relating the samples
# Using a neighbor-joining approach,
python treecall.py nbjoin -m 60 recallVAF_filtered.vcf.gz recall.nbjoin
# Or using a top-down partitioning approach,
python treecall.py part -m 60 recallVAF_filtered.vcf.gz recall.part
- Jointly genotype the variants with the help of a lineage tree
python treecall.py gtype -t recall.partition.nj.nwk \
-m 60 \
recallVAF_filtered.vcf.gz \
recall.partition.gtcall
- Annotate the lineage tree with genotype calls
python treecall.py annot -t recall.partition.nj.nwk \
recall.partition.gtcall \
recall.partition.annotation
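The three steps above can also be assembled programmatically. The following is a minimal sketch, not part of treecall itself: `treecall_commands` and `run_all` are hypothetical helper names, and the file naming simply mirrors the example commands above. The name of the tree file written by the first step is not documented here, so the sketch takes it as an explicit argument instead of guessing it.

```python
import subprocess


def treecall_commands(vcf, tree, prefix, mut_rate=60, method="part"):
    """Build the argument lists for the three workflow steps.

    vcf      -- input VCF, e.g. recallVAF_filtered.vcf.gz
    tree     -- Newick tree passed to gtype/annot via -t (supplied by the caller)
    prefix   -- output basename, e.g. recall.partition
    mut_rate -- value passed to -m (Phred scale)
    method   -- tree inference sub-command: "nbjoin" or "part"
    """
    if method not in ("nbjoin", "part"):
        raise ValueError("method must be 'nbjoin' or 'part'")
    base = ["python", "treecall.py"]
    gtcall = prefix + ".gtcall"  # naming mirrors the example above (assumption)
    return [
        # Step 1: infer an initial tree relating the samples
        base + [method, "-m", str(mut_rate), vcf, prefix],
        # Step 2: jointly genotype the variants using the lineage tree
        base + ["gtype", "-t", tree, "-m", str(mut_rate), vcf, gtcall],
        # Step 3: annotate the lineage tree with the genotype calls
        base + ["annot", "-t", tree, gtcall, prefix + ".annotation"],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Calling `treecall_commands("recallVAF_filtered.vcf.gz", "recall.partition.nj.nwk", "recall.partition")` returns the same three command lines shown above (with `part` as the first step), which can be inspected before passing them to `run_all`.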
More information about the individual commands is below.
usage: treecall.py [-h] <command> ...
positional arguments:
<command> sub-commands
tview view tree
compare compare tree topology
compat calculate pairwise compatibility between all pairs of sites
nbjoin neighbor-joining
part a top-down method that recursively partition samples based on partition cost
gtype genotype samples with help of a lineage tree
annot annotate lineage tree with genotype calls
optional arguments:
-h, --help show this help message and exit
usage: treecall.py tview [-h] [-a STR] [-l FILE] <nwk>
positional arguments:
<nwk> input tree in Newick format
optional arguments:
-h, --help show this help message and exit
-a STR node attributes to print given by a comma separated list
-l FILE leaves label
usage: treecall.py compare [-h] -t FILE [FILE ...] -r FILE
optional arguments:
-h, --help show this help message and exit
-t FILE [FILE ...] input tree(s), in Newick format
-r FILE reference tree, in Newick format
usage: treecall.py compat [-h] [-v INT] <vcf> <output>
positional arguments:
<vcf> input vcf/vcf.gz file, "-" for stdin
<output> output compatibility matrix
optional arguments:
-h, --help show this help message and exit
-v INT minimum evidence in Phred scale for a site to be considered, default 60
usage: treecall.py nbjoin [-h] [-m INT] [-e INT] [-v INT] <vcf> output
positional arguments:
<vcf> input vcf/vcf.gz file, "-" for stdin
output output basename
optional arguments:
-h, --help show this help message and exit
-m INT mutation rate in Phred scale, default 80
-e INT heterozygous rate in Phred scale, default 30
-v INT minimum evidence in Phred scale for a site to be considered, default 60
usage: treecall.py part [-h] [-m INT] [-e INT] [-v INT] <vcf> <output>
positional arguments:
<vcf> input vcf/vcf.gz file, "-" for stdin
<output> output basename
optional arguments:
-h, --help show this help message and exit
-m INT mutation rate in Phred scale, default 80
-e INT heterozygous rate in Phred scale, default 30
-v INT minimum evidence in Phred scale for a site to be considered, default 60
usage: treecall.py gtype [-h] -t FILE [-n INT] [-m INT] [-e INT] <vcf> <output>
positional arguments:
<vcf> input vcf/vcf.gz file, "-" for stdin
<output> output basename
optional arguments:
-h, --help show this help message and exit
-t FILE lineage tree
-n INT number of sites processed once, default 1000
-m INT mutation rate in Phred scale, default 80
-e INT heterozygous rate in Phred scale, default 30, 0 for uninformative
usage: treecall.py annot [-h] -t FILE <gtcall> <outnwk>
positional arguments:
<gtcall> input gtype calls, "-" for stdin
<outnwk> output tree in Newick format
optional arguments:
-h, --help show this help message and exit
-t FILE lineage tree
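Several options above (-m, -e, -v) take values "in Phred scale". The help text does not spell out the conversion, so the sketch below assumes the standard Phred convention Q = -10 * log10(p); the function names are illustrative only. Under that assumption, the default -m 80 corresponds to a rate of 1e-8 and -m 60 (used in the workflow example) to 1e-6.

```python
import math


def phred_to_prob(q):
    """Convert a Phred-scaled value Q to a probability p = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)


def prob_to_phred(p):
    """Convert a probability p in (0, 1] to the Phred scale Q = -10*log10(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -10.0 * math.log10(p)
```

This makes it easier to choose a value for -m or -e when a rate is known as a plain probability.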
Some of the output files are explained below.
The columns in the *.gtcall file are as follows:
- chromosome
- position
- reference allele
- null_P
- mut_P
- MLE_null_base_gtype
- MLE_null_base_gtype_P
- MLE_mut_base_gtype
- MLE_mut_alt_gtype
- MLE_mut_base_gtype_P
- MLE_mut_location
- MLE_mut_samples
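The column list above can be used to load a *.gtcall file into named records. This is a minimal sketch under assumptions that the text does not confirm: the file is taken to be tab-delimited with one site per line and no header row, lines starting with "#" are skipped, and all values are kept as strings. `read_gtcall` and `GtCall` are hypothetical names.

```python
import csv
import io
from collections import namedtuple

# Field names follow the column list above, in order.
GTCALL_COLUMNS = [
    "chromosome", "position", "reference_allele", "null_P", "mut_P",
    "MLE_null_base_gtype", "MLE_null_base_gtype_P", "MLE_mut_base_gtype",
    "MLE_mut_alt_gtype", "MLE_mut_base_gtype_P", "MLE_mut_location",
    "MLE_mut_samples",
]
GtCall = namedtuple("GtCall", GTCALL_COLUMNS)


def read_gtcall(handle):
    """Yield one GtCall record per data line of an open text handle.

    Assumes tab-delimited fields (not confirmed by the documentation).
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment-style lines
        if len(row) != len(GTCALL_COLUMNS):
            raise ValueError(
                "expected %d columns, got %d" % (len(GTCALL_COLUMNS), len(row))
            )
        yield GtCall(*row)
```

With a real file, this would be used as `with open("recall.partition.gtcall") as fh: calls = list(read_gtcall(fh))`, after which fields are available by name, for example `calls[0].MLE_mut_samples`.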