Skip to content

Commit

Permalink
introduced SANGER mode (#6)
Browse files Browse the repository at this point in the history
introduced SANGER mode (#6)
  • Loading branch information
jonas-fuchs committed Mar 28, 2023
2 parents c1af79f + 20f6316 commit a506564
Show file tree
Hide file tree
Showing 12 changed files with 246 additions and 148 deletions.
20 changes: 10 additions & 10 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,30 +2,30 @@

# varVAMP

[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![License: GPL v3](https://img.shields.io/github/license/jonas-fuchs/varvamp)](https://www.gnu.org/licenses/gpl-3.0) <img src=https://img.shields.io/github/v/release/jonas-fuchs/varvamp /> <img src=https://img.shields.io/badge/language-%3Epython3.9-red />

For a lot of virus genera it is difficult to design pan-specific primers. varVAMP solves this, by introducing ambiguous characters into primers and minimizes mismatches at the 3' end. Primers might not work for some sequences of your input alignment but should recognize the large majority.

**varVAMP comes in three different flavors:**

<img src="./docs/varvamp.png" alt="varVAMP logo" />
<img src="https://github.com/jonas-fuchs/varVAMP/blob/master/docs/varvamp.png" alt="varVAMP logo" />

**SANGER** *(coming soon)*: varVAMP searches for the very best primers and reports back an amplicon which can be used for PCR-based screening approaches.
**SANGER**: varVAMP searches for the very best primers and reports back non-overlapping amplicons which can be used for PCR-based screening approaches.

**TILED**: varVAMP uses a graph based approach to design overlapping amplicons that tile the entire viral genome. This designs amplicons that are suitable for Oxford Nanopore or Illumina based full-genome sequencing.

**QPCR** *(coming soon)*: varVAMP searches for small amplicons with an internal primer for the probe. It minimizes temperature differences between the primers.

This program is currently being developed and in an alpha state. You are welcome to use this software. If you successfully design primers, drop me a mail. It might be possible to collaborate!
This program is currently being developed and in an alpha state. You are welcome to use this software. If you successfully design primers, drop me a mail. It might be possible to collaborate! Ideas and suggestions are highly welcome.

# Documentation

* [Installation](docs/installation.md)
* [Preparing the data](docs/preparing_the_data.md)
* [Usage](docs/usage.md)
* [Output](docs/output.md)
* [How it works](docs/how_varvamp_works.md)
* [FAQ](docs/FAQ.md)
* [Installation](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/installation.md)
* [Preparing the data](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/preparing_the_data.md)
* [Usage](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/usage.md)
* [Output](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/output.md)
* [How it works](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/how_varvamp_works.md)
* [FAQ](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/FAQ.md)

---

Expand Down
12 changes: 8 additions & 4 deletions docs/FAQ.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,18 +4,22 @@

Start with setting your optimal length of your amplicon, the max length that you can tolerate and the overlap that you want to achieve. If you want to allow ambiguous characters, set them to 2 as a start. Set the threshold to the mean sequence identity of your alignment. Run varvamp and then optimize the output.

2. **How do I optimize the output?**
2. **How do I optimize the output for TILED mode?**

It all depends on how many conserved regions varVAMP is able to find! There are two main parameters that influence this. The number of ambiguous bases allowed within a primer and the threshold for consensus nucleotides. Setting the threshold higher or the number of ambiguous bases lower will result in less conserved regions. If you have set the parameters below and get a decent output, increase the threshold until the output gets worse. This will increase the specificity of your primers. Likewise, if you do not have a good output, consider increasing the number of ambiguous bases before you lower the threshold. The console output varVAMP will also give you some suggestions.

3. **varVAMP reported primer dimers. What now?**

In your case varVAMP could not find suitable replacement primers. You can either rerun varVAMP and try different settings or you can perform a third pool that contains a amplicon that has one of the conflicting dimers. Notably, varVAMP also reports the dimer melting temperature. If it is still reasonable low, using a hot start polymerase might still lead to successful PCR amplification.
In your case varVAMP could not find suitable replacement primers in the TILED mode. You can either rerun varVAMP and try different settings or you can perform a third pool that contains a amplicon that has one of the conflicting dimers. Notably, varVAMP also reports the dimer melting temperature. If it is still reasonable low, using a hot start polymerase might still lead to successful PCR amplification.

4. **I have multiple amplicons after SANGER mode. Which should I use?**

4. **How fast is varVAMP?**
varVAMP sorts all amplicons by score and always takes the best one of non-overlapping amplicons. If you are not interested in a specific gene region, amplicon_0 is your best candidate!

5. **How fast is varVAMP?**

varVAMP is pretty fast given the complexity of the problem. Running time is depended on the alignment length, number of sequences and the running mode. While the TILED is rather slow, qPCR and SANGER can be faster. An alignment with a few hundred sequences and with a genome size of 10 kb will likely run in under a minute for the TILED mode. For large e.g. DNA viruses (200 kb) it takes considerably longer, but should still finish in minutes. Running time optimizations are planned.

varVAMP is pretty fast given the complexity of the problem. Running time is depended on the alignment length, number of sequences and the running mode. While the TILED is rather slow, qPCR and SANGER are faster. An alignment with a few hundred sequences and with a genome size of 10 kb will likely run in under a minute for the TILED mode. For large e.g. DNA viruses (200 kb) it takes considerably longer, but should still finish in minutes. Running time optimizations are planned.



Expand Down
2 changes: 1 addition & 1 deletion docs/how_varvamp_works.md
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@ To search for the best scoring amplicon, varVAMP uses a graph based approach.
7. Lastly, the best scoring scheme is evaluated for primer dimers in their respective pools. If a primer dimer pair is found, varVAMP evaluates for each primer their overlapping previously not considered primers (primer search step 2) and again minimizes the score. The scheme and all primers are updated. If no alternative primers can be found, varVAMP issues a warning and reports the unsolvable primer dimers.

#### Sanger sequencing
coming soon
varVAMP sorts all amplicons by their score and takes the non-overlapping amplicon with the lowest score! As varVAMP gives a size penalty to amplicons, varVAMP automatically finds amplicons with low primer scores close to your optimal length (if possible).

#### qPCR
coming soon
Expand Down
4 changes: 4 additions & 0 deletions docs/installation.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,14 @@ varVAMP runs on UNIX Systems, MacOSX and Windows with Python3 >=3.9 installed.

## Installation

From PyPI:

```shell
pip install varvamp
```

That was already it. To check if it worked:

```shell
varvamp -v
```
Expand Down
2 changes: 1 addition & 1 deletion docs/preparing_the_data.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@ mafft my_sequences.fasta > my_alignment.fasta


### BUT...
If your sequences are two diverse, also varVAMP will not perform well. Therefore, it is important that, phylogenically, the input alignment makes sense. To analyze this you can calculate a tree with tools like [iqtree](http://www.iqtree.org/) and then use [TreeCluster](https://github.com/niemasd/TreeCluster) to get phylogenically related sequence clusters. However, this can be also computationally intensive.
If your sequences are too diverse, also varVAMP will not perform well. Therefore, it is important that, phylogenically, the input alignment makes sense. To analyze this you can calculate a tree with tools like [iqtree](http://www.iqtree.org/) and then use [TreeCluster](https://github.com/niemasd/TreeCluster) to get phylogenically related sequence clusters. However, this can be also computationally intensive.

We have had good experience in using varVAMP with the sequence identity based clustering algorithm [vsearch](https://github.com/torognes/vsearch). A good starting point is a sequence identity between 0.8 and 0.85. For such clusters varVAMP should perform reasonably well.

Expand Down
35 changes: 17 additions & 18 deletions docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,34 +15,33 @@ In this case varVAMP uses as standard settings:
* ```MAX_LENGTH``` = 2000 (maximum amplicon length)
* ```OVERLAP``` = 100 (minimum overlap length)
* ```THRESHOLD``` = 0.9 (nucleotide consensus threshold)
* ```ALLOWED_AMBIGUOUS``` = 4 (number of allowed ambiguous characters in a primer)
* ```N_AMBIG``` = 4 (number of allowed ambiguous characters in a primer)
* ```MODE``` = TILED

These settings are quite relaxed and can produce decent results for diverse viruses (80-90 % sequence identity). However, you can likely optimize the result.

**Full usage:**
```shell
varvamp <alignment> <output dir> [OPTIONS]
usage: varvamp <alignment> <output dir> <mode> [options]
```

```
varvamp: variable virus amplicon design
positional arguments:
input alignment file and dir to write results
input alignment file and dir to write results
mode varvamp modes: TILED, SANGER
optional arguments:
-h, --help show this help message and exit
-ol OPT_LENGTH, --opt-length OPT_LENGTH
optimal length of the amplicons
-ml MAX_LENGTH, --max-length MAX_LENGTH
max length of the amplicons
-o OVERLAP, --overlap OVERLAP
min overlap of the amplicons
-t THRESHOLD, --threshold THRESHOLD
threshold for nucleotides in alignment to be considered conserved
-a ALLOWED_AMBIGUOUS, --allowed-ambiguous ALLOWED_AMBIGUOUS
number of ambiguous characters that are allowed within a primer
--console, --no-console
show varvamp console output (default: True)
-v, --version show program's version number and exit
-h, --help show this help message and exit
-ol , --opt-length optimal length of the amplicons
-ml , --max-length max length of the amplicons
-t , --threshold threshold for conserved nucleotides
-a , --n-ambig max number of ambiguous characters in a primer
-o , --overlap TILED: min overlap of the amplicons
-n , --report-n SANGER: report the top n best hits
--verb, --no-verb show varvamp console output (default: True)
-v, --version show program's version number and exit
```

## Further customization (advanced)
Expand Down
2 changes: 1 addition & 1 deletion varvamp/__init__.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
"""Tool to design amplicons for highly variable virusgenomes"""
_program = "varvamp"
__version__ = "0.3"
__version__ = "0.4"
143 changes: 87 additions & 56 deletions varvamp/command.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,70 +9,88 @@
import argparse

# varVAMP
from . import _program
from varvamp import __version__
from varvamp.scripts import logging
from varvamp.scripts import alignment
from varvamp.scripts import config
from varvamp.scripts import consensus
from varvamp.scripts import conserved
from varvamp.scripts import primers
from varvamp.scripts import reporting
from varvamp.scripts import scheme
from varvamp import __version__
from . import _program


# DEFs
def get_args(sysargs):
"""
arg parsing for varvamp
"""
parser = argparse.ArgumentParser(
prog=_program,
description='varvamp: variable virus amplicon design',
usage='''varvamp <alignment> <output dir> [options]''')
usage='''varvamp <alignment> <output dir> <mode> [options]''')

parser.add_argument(
"input",
nargs=2,
help="alignment file and dir to write results"
)
parser.add_argument(
"mode",
type=str,
nargs='?',
default="TILED",
help="varvamp modes: TILED, SANGER",
)
parser.add_argument(
"-ol",
"--opt-length",
help="optimal length of the amplicons",
metavar="",
type=int,
default=config.AMPLICON_OPT_LENGTH
default=1000
)
parser.add_argument(
"-ml",
"--max-length",
help="max length of the amplicons",
metavar="",
type=int,
default=config.AMPLICON_MAX_LENGTH
)
parser.add_argument(
"-o",
"--overlap",
type=float,
default=config.AMPLICON_MIN_OVERLAP,
help="min overlap of the amplicons"
default=2000
)
parser.add_argument(
"-t",
"--threshold",
metavar="",
type=float,
default=config.FREQUENCY_THRESHOLD,
help="threshold for nucleotides in alignment to be considered conserved"
default=0.9,
help="threshold for conserved nucleotides"
)
parser.add_argument(
"-a",
"--allowed-ambiguous",
"--n-ambig",
metavar="",
type=int,
default=2,
help="max number of ambiguous characters in a primer"
)
parser.add_argument(
"-o",
"--overlap",
type=int,
metavar="",
default=100,
help="TILED: min overlap of the amplicons"
)
parser.add_argument(
"-n",
"--report-n",
type=int,
default=config.PRIMER_ALLOWED_N_AMB,
help="number of ambiguous characters that are allowed within a primer"
metavar="",
default=float("inf"),
help="SANGER: report the top n best hits"
)
parser.add_argument(
"--console",
"--verb",
action=argparse.BooleanOptionalAction,
default=True,
help="show varvamp console output"
Expand All @@ -97,8 +115,9 @@ def main(sysargs=sys.argv[1:]):
"""
# start varVAMP
args = get_args(sysargs)
if not args.console:
if not args.verb:
sys.stdout = open(os.devnull, 'w')
args.mode = args.mode.upper()
start_time = time.process_time()
results_dir, data_dir, log_file = logging.create_dir_structure(args.input[1])
logging.raise_arg_errors(args, log_file)
Expand Down Expand Up @@ -136,7 +155,7 @@ def main(sysargs=sys.argv[1:]):
# generate conserved region list
conserved_regions = conserved.find_regions(
ambiguous_consensus,
args.allowed_ambiguous
args.n_ambig
)
if not conserved_regions:
logging.raise_error(
Expand Down Expand Up @@ -201,46 +220,57 @@ def main(sysargs=sys.argv[1:]):
log_file,
exit=True
)
amplicon_graph = scheme.create_amplicon_graph(amplicons, args.overlap)
logging.varvamp_progress(
log_file,
progress=0.8,
job="Finding potential amplicons.",
progress_text=str(len(amplicons)) + " potential amplicons"
)
# search for amplicon scheme
coverage, amplicon_scheme = scheme.find_best_covering_scheme(
amplicons,
amplicon_graph,
all_primers
progress_text=f"{len(amplicons)} potential amplicons"
)
dimers_not_solved = scheme.check_and_solve_heterodimers(
amplicon_scheme,
left_primer_candidates,
right_primer_candidates,
all_primers)
if dimers_not_solved:
logging.raise_error(
f"varVAMP found {len(dimers_not_solved)} primer dimers without replacements. Check the dimer file and perform the PCR for incomaptible amplicons in a sperate reaction.",
log_file

if args.mode == "TILED":
amplicon_graph = scheme.create_amplicon_graph(amplicons, args.overlap)
# search for amplicon scheme
coverage, amplicon_scheme = scheme.find_best_covering_scheme(
amplicons,
amplicon_graph,
all_primers
)
reporting.write_dimers(dir, dimers_not_solved)
percent_coverage = round(coverage/len(ambiguous_consensus)*100, 2)
logging.varvamp_progress(
log_file,
progress=0.9,
job="Creating amplicon scheme.",
progress_text=f"{percent_coverage} % total coverage with {len(amplicon_scheme[0]) + len(amplicon_scheme[1])} amplicons"
)
if percent_coverage < 70:
logging.raise_error(
"coverage < 70 %. Possible solutions:\n"
"\t - lower threshold\n"
"\t - increase amplicons lengths\n"
"\t - increase number of ambiguous nucleotides\n"
"\t - relax primer settings (not recommended)\n",
log_file
dimers_not_solved = scheme.check_and_solve_heterodimers(
amplicon_scheme,
left_primer_candidates,
right_primer_candidates,
all_primers)
if dimers_not_solved:
logging.raise_error(
f"varVAMP found {len(dimers_not_solved)} primer dimers without replacements. Check the dimer file and perform the PCR for incomaptible amplicons in a sperate reaction.",
log_file
)
reporting.write_dimers(dir, dimers_not_solved)
percent_coverage = round(coverage/len(ambiguous_consensus)*100, 2)
logging.varvamp_progress(
log_file,
progress=0.9,
job="Creating amplicon scheme.",
progress_text=f"{percent_coverage} % total coverage with {len(amplicon_scheme[0]) + len(amplicon_scheme[1])} amplicons"
)
if percent_coverage < 70:
logging.raise_error(
"coverage < 70 %. Possible solutions:\n"
"\t - lower threshold\n"
"\t - increase amplicons lengths\n"
"\t - increase number of ambiguous nucleotides\n"
"\t - relax primer settings (not recommended)\n",
log_file
)
elif args.mode == "SANGER":
amplicon_scheme = scheme.find_best_amplicons(amplicons, all_primers, args.report_n)
logging.varvamp_progress(
log_file,
progress=0.9,
job="Finding low scoring amplicons.",
progress_text="The lowest scoring amplicon is amplicon_0."
)

# write files
reporting.write_alignment(data_dir, alignment_cleaned)
reporting.write_fasta(data_dir, "majority_consensus", majority_consensus)
Expand All @@ -250,7 +280,8 @@ def main(sysargs=sys.argv[1:]):
reporting.write_scheme_to_files(
results_dir,
amplicon_scheme,
ambiguous_consensus
ambiguous_consensus,
args.mode
)
reporting.varvamp_plot(
results_dir,
Expand Down
Loading

0 comments on commit a506564

Please sign in to comment.