introduced SANGER mode (#6)

jonas-fuchs · Mar 28, 2023 · a506564 · a506564
2 parents c1af79f + 20f6316
commit a506564
Show file tree

Hide file tree

Showing 12 changed files with 246 additions and 148 deletions.
diff --git a/README.md b/README.md
@@ -2,30 +2,30 @@
 
 # varVAMP
 
-[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
+[![License: GPL v3](https://img.shields.io/github/license/jonas-fuchs/varvamp)](https://www.gnu.org/licenses/gpl-3.0) <img src=https://img.shields.io/github/v/release/jonas-fuchs/varvamp /> <img src=https://img.shields.io/badge/language-%3Epython3.9-red />
 
 For a lot of virus genera it is difficult to design pan-specific primers. varVAMP solves this, by introducing ambiguous characters into primers and minimizes mismatches at the 3' end. Primers might not work for some sequences of your input alignment but should recognize the large majority.
 
 **varVAMP comes in three different flavors:**
 
-<img src="./docs/varvamp.png" alt="varVAMP logo" />
+<img src="https://github.com/jonas-fuchs/varVAMP/blob/master/docs/varvamp.png" alt="varVAMP logo" />
 
-**SANGER** *(coming soon)*: varVAMP searches for the very best primers and reports back an amplicon which can be used for PCR-based screening approaches.
+**SANGER**: varVAMP searches for the very best primers and reports back non-overlapping amplicons which can be used for PCR-based screening approaches.
 
 **TILED**: varVAMP uses a graph based approach to design overlapping amplicons that tile the entire viral genome. This designs amplicons that are suitable for Oxford Nanopore or Illumina based full-genome sequencing.
 
 **QPCR** *(coming soon)*: varVAMP searches for small amplicons with an internal primer for the probe. It minimizes temperature differences between the primers.
 
-This program is currently being developed and in an alpha state. You are welcome to use this software. If you successfully design primers, drop me a mail. It might be possible to collaborate!
+This program is currently being developed and in an alpha state. You are welcome to use this software. If you successfully design primers, drop me a mail. It might be possible to collaborate! Ideas and suggestions are highly welcome.
 
 # Documentation
 
-* [Installation](docs/installation.md)
-* [Preparing the data](docs/preparing_the_data.md)
-* [Usage](docs/usage.md)
-* [Output](docs/output.md)
-* [How it works](docs/how_varvamp_works.md)
-* [FAQ](docs/FAQ.md)
+* [Installation](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/installation.md)
+* [Preparing the data](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/preparing_the_data.md)
+* [Usage](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/usage.md)
+* [Output](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/output.md)
+* [How it works](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/how_varvamp_works.md)
+* [FAQ](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/FAQ.md)
 
 ---
 

diff --git a/docs/FAQ.md b/docs/FAQ.md
@@ -4,18 +4,22 @@
 
 Start with setting your optimal length of your amplicon, the max length that you can tolerate and the overlap that you want to achieve. If you want to allow ambiguous characters, set them to 2 as a start. Set the threshold to the mean sequence identity of your alignment. Run varvamp and then optimize the output.
 
-2. **How do I optimize the output?**
+2. **How do I optimize the output for TILED mode?**
 
 It all depends on how many conserved regions varVAMP is able to find! There are two main parameters that influence this. The number of ambiguous bases allowed within a primer and the threshold for consensus nucleotides. Setting the threshold higher or the number of ambiguous bases lower will result in less conserved regions. If you have set the parameters below and get a decent output, increase the threshold until the output gets worse. This will increase the specificity of your primers. Likewise, if you do not have a good output, consider increasing the number of ambiguous bases before you lower the threshold. The console output varVAMP will also give you some suggestions.
 
 3. **varVAMP reported primer dimers. What now?**
 
-In your case varVAMP could not find suitable replacement primers. You can either rerun varVAMP and try different settings or you can perform a third pool that contains a amplicon that has one of the conflicting dimers. Notably, varVAMP also reports the dimer melting temperature. If it is still reasonable low, using a hot start polymerase might still lead to successful PCR amplification.
+In your case varVAMP could not find suitable replacement primers in the TILED mode. You can either rerun varVAMP and try different settings or you can perform a third pool that contains a amplicon that has one of the conflicting dimers. Notably, varVAMP also reports the dimer melting temperature. If it is still reasonable low, using a hot start polymerase might still lead to successful PCR amplification.
 
+4. **I have multiple amplicons after SANGER mode. Which should I use?**
 
-4. **How fast is varVAMP?**
+varVAMP sorts all amplicons by score and always takes the best one of non-overlapping amplicons. If you are not interested in a specific gene region, amplicon_0 is your best candidate!
+
+5. **How fast is varVAMP?**
+
+varVAMP is pretty fast given the complexity of the problem. Running time is depended on the alignment length, number of sequences and the running mode. While the TILED is rather slow, qPCR and SANGER can be faster. An alignment with a few hundred sequences and with a genome size of 10 kb will likely run in under a minute for the TILED mode. For large e.g. DNA viruses (200 kb) it takes considerably longer, but should still finish in minutes. Running time optimizations are planned.
 
-varVAMP is pretty fast given the complexity of the problem. Running time is depended on the alignment length, number of sequences and the running mode. While the TILED is rather slow, qPCR and SANGER are faster. An alignment with a few hundred sequences and with a genome size of 10 kb will likely run in under a minute for the TILED mode. For large e.g. DNA viruses (200 kb) it takes considerably longer, but should still finish in minutes. Running time optimizations are planned.
 
 
 

diff --git a/docs/how_varvamp_works.md b/docs/how_varvamp_works.md
@@ -37,7 +37,7 @@ To search for the best scoring amplicon, varVAMP uses a graph based approach.
 7. Lastly, the best scoring scheme is evaluated for primer dimers in their respective pools. If a primer dimer pair is found, varVAMP evaluates for each primer their overlapping previously not considered primers (primer search step 2) and again minimizes the score. The scheme and all primers are updated. If no alternative primers can be found, varVAMP issues a warning and reports the unsolvable primer dimers.
 
 #### Sanger sequencing
-coming soon
+varVAMP sorts all amplicons by their score and takes the non-overlapping amplicon with the lowest score! As varVAMP gives a size penalty to amplicons, varVAMP automatically finds amplicons with low primer scores close to your optimal length (if possible).
 
 #### qPCR
 coming soon

diff --git a/docs/installation.md b/docs/installation.md
@@ -3,10 +3,14 @@ varVAMP runs on UNIX Systems, MacOSX and Windows with Python3 >=3.9 installed.
 
 ## Installation
 
+From PyPI:
+
 ```shell
 pip install varvamp
 ```
+
 That was already it. To check if it worked:
+
 ```shell
 varvamp -v
 ```

diff --git a/docs/preparing_the_data.md b/docs/preparing_the_data.md
@@ -12,7 +12,7 @@ mafft my_sequences.fasta > my_alignment.fasta
 
 
 ### BUT...
-If your sequences are two diverse, also varVAMP will not perform well. Therefore, it is important that, phylogenically, the input alignment  makes sense. To analyze this you can calculate a tree  with tools like [iqtree](http://www.iqtree.org/) and then use [TreeCluster](https://github.com/niemasd/TreeCluster) to get phylogenically related sequence clusters. However, this can be also computationally intensive.
+If your sequences are too diverse, also varVAMP will not perform well. Therefore, it is important that, phylogenically, the input alignment  makes sense. To analyze this you can calculate a tree  with tools like [iqtree](http://www.iqtree.org/) and then use [TreeCluster](https://github.com/niemasd/TreeCluster) to get phylogenically related sequence clusters. However, this can be also computationally intensive.
 
 We have had good experience in using varVAMP with the sequence identity based clustering algorithm [vsearch](https://github.com/torognes/vsearch). A good starting point is a sequence identity between 0.8 and 0.85. For such clusters varVAMP should perform reasonably well.
 

diff --git a/docs/usage.md b/docs/usage.md
@@ -15,34 +15,33 @@ In this case varVAMP uses as standard settings:
 * ```MAX_LENGTH``` = 2000 (maximum amplicon length)
 * ```OVERLAP``` = 100 (minimum overlap length)
 * ```THRESHOLD``` = 0.9 (nucleotide consensus threshold)
-* ```ALLOWED_AMBIGUOUS``` = 4 (number of allowed ambiguous characters in a primer)
+* ```N_AMBIG``` = 4 (number of allowed ambiguous characters in a primer)
+* ```MODE``` = TILED
 
 These settings are quite relaxed and can produce decent results for diverse viruses (80-90 % sequence identity). However, you can likely optimize the result.
 
 **Full usage:**
 ```shell
-varvamp <alignment> <output dir> [OPTIONS]
+usage: varvamp <alignment> <output dir> <mode> [options]
 ```
+
 ```
+varvamp: variable virus amplicon design
+
 positional arguments:
-  input                 alignment file and dir to write results
+  input                alignment file and dir to write results
+  mode                 varvamp modes: TILED, SANGER
 
 optional arguments:
-  -h, --help            show this help message and exit
-  -ol OPT_LENGTH, --opt-length OPT_LENGTH
-                        optimal length of the amplicons
-  -ml MAX_LENGTH, --max-length MAX_LENGTH
-                        max length of the amplicons
-  -o OVERLAP, --overlap OVERLAP
-                        min overlap of the amplicons
-  -t THRESHOLD, --threshold THRESHOLD
-                        threshold for nucleotides in alignment to be considered conserved
-  -a ALLOWED_AMBIGUOUS, --allowed-ambiguous ALLOWED_AMBIGUOUS
-                        number of ambiguous characters that are allowed within a primer
-  --console, --no-console
-                        show varvamp console output (default: True)
-  -v, --version         show program's version number and exit
-
+  -h, --help           show this help message and exit
+  -ol , --opt-length   optimal length of the amplicons
+  -ml , --max-length   max length of the amplicons
+  -t , --threshold     threshold for conserved nucleotides
+  -a , --n-ambig       max number of ambiguous characters in a primer
+  -o , --overlap       TILED: min overlap of the amplicons
+  -n , --report-n      SANGER: report the top n best hits
+  --verb, --no-verb    show varvamp console output (default: True)
+  -v, --version        show program's version number and exit
 ```
 
 ## Further customization (advanced)

diff --git a/varvamp/__init__.py b/varvamp/__init__.py
@@ -1,3 +1,3 @@
 """Tool to design amplicons for highly variable virusgenomes"""
 _program = "varvamp"
-__version__ = "0.3"
+__version__ = "0.4"
diff --git a/varvamp/command.py b/varvamp/command.py
@@ -9,70 +9,88 @@
 import argparse
 
 # varVAMP
-from . import _program
-from varvamp import __version__
 from varvamp.scripts import logging
 from varvamp.scripts import alignment
-from varvamp.scripts import config
 from varvamp.scripts import consensus
 from varvamp.scripts import conserved
 from varvamp.scripts import primers
 from varvamp.scripts import reporting
 from varvamp.scripts import scheme
+from varvamp import __version__
+from . import _program
 
 
-# DEFs
 def get_args(sysargs):
     """
     arg parsing for varvamp
     """
     parser = argparse.ArgumentParser(
         prog=_program,
         description='varvamp: variable virus amplicon design',
-        usage='''varvamp <alignment> <output dir> [options]''')
+        usage='''varvamp <alignment> <output dir> <mode> [options]''')
 
     parser.add_argument(
         "input",
         nargs=2,
         help="alignment file and dir to write results"
     )
+    parser.add_argument(
+        "mode",
+        type=str,
+        nargs='?',
+        default="TILED",
+        help="varvamp modes: TILED, SANGER",
+    )
     parser.add_argument(
         "-ol",
         "--opt-length",
         help="optimal length of the amplicons",
+        metavar="",
         type=int,
-        default=config.AMPLICON_OPT_LENGTH
+        default=1000
     )
     parser.add_argument(
         "-ml",
         "--max-length",
         help="max length of the amplicons",
+        metavar="",
         type=int,
-        default=config.AMPLICON_MAX_LENGTH
-    )
-    parser.add_argument(
-        "-o",
-        "--overlap",
-        type=float,
-        default=config.AMPLICON_MIN_OVERLAP,
-        help="min overlap of the amplicons"
+        default=2000
     )
     parser.add_argument(
         "-t",
         "--threshold",
+        metavar="",
         type=float,
-        default=config.FREQUENCY_THRESHOLD,
-        help="threshold for nucleotides in alignment to be considered conserved"
+        default=0.9,
+        help="threshold for conserved nucleotides"
     )
     parser.add_argument(
         "-a",
-        "--allowed-ambiguous",
+        "--n-ambig",
+        metavar="",
+        type=int,
+        default=2,
+        help="max number of ambiguous characters in a primer"
+    )
+    parser.add_argument(
+        "-o",
+        "--overlap",
+        type=int,
+        metavar="",
+        default=100,
+        help="TILED: min overlap of the amplicons"
+    )
+    parser.add_argument(
+        "-n",
+        "--report-n",
         type=int,
-        default=config.PRIMER_ALLOWED_N_AMB,
-        help="number of ambiguous characters that are allowed within a primer"
+        metavar="",
+        default=float("inf"),
+        help="SANGER: report the top n best hits"
     )
     parser.add_argument(
-        "--console",
+        "--verb",
         action=argparse.BooleanOptionalAction,
         default=True,
         help="show varvamp console output"
@@ -97,8 +115,9 @@ def main(sysargs=sys.argv[1:]):
     """
     # start varVAMP
     args = get_args(sysargs)
-    if not args.console:
+    if not args.verb:
         sys.stdout = open(os.devnull, 'w')
+    args.mode = args.mode.upper()
     start_time = time.process_time()
     results_dir, data_dir, log_file = logging.create_dir_structure(args.input[1])
     logging.raise_arg_errors(args, log_file)
@@ -136,7 +155,7 @@ def main(sysargs=sys.argv[1:]):
     # generate conserved region list
     conserved_regions = conserved.find_regions(
         ambiguous_consensus,
-        args.allowed_ambiguous
+        args.n_ambig
     )
     if not conserved_regions:
         logging.raise_error(
@@ -201,46 +220,57 @@ def main(sysargs=sys.argv[1:]):
             log_file,
             exit=True
         )
-    amplicon_graph = scheme.create_amplicon_graph(amplicons, args.overlap)
     logging.varvamp_progress(
         log_file,
         progress=0.8,
         job="Finding potential amplicons.",
-        progress_text=str(len(amplicons)) + " potential amplicons"
-    )
-    # search for amplicon scheme
-    coverage, amplicon_scheme = scheme.find_best_covering_scheme(
-        amplicons,
-        amplicon_graph,
-        all_primers
+        progress_text=f"{len(amplicons)} potential amplicons"
     )
-    dimers_not_solved = scheme.check_and_solve_heterodimers(
-        amplicon_scheme,
-        left_primer_candidates,
-        right_primer_candidates,
-        all_primers)
-    if dimers_not_solved:
-        logging.raise_error(
-            f"varVAMP found {len(dimers_not_solved)} primer dimers without replacements. Check the dimer file and perform the PCR for incomaptible amplicons in a sperate reaction.",
-            log_file
+
+    if args.mode == "TILED":
+        amplicon_graph = scheme.create_amplicon_graph(amplicons, args.overlap)
+        # search for amplicon scheme
+        coverage, amplicon_scheme = scheme.find_best_covering_scheme(
+            amplicons,
+            amplicon_graph,
+            all_primers
         )
-        reporting.write_dimers(dir, dimers_not_solved)
-    percent_coverage = round(coverage/len(ambiguous_consensus)*100, 2)
-    logging.varvamp_progress(
-        log_file,
-        progress=0.9,
-        job="Creating amplicon scheme.",
-        progress_text=f"{percent_coverage} % total coverage with {len(amplicon_scheme[0]) + len(amplicon_scheme[1])} amplicons"
-    )
-    if percent_coverage < 70:
-        logging.raise_error(
-            "coverage < 70 %. Possible solutions:\n"
-            "\t - lower threshold\n"
-            "\t - increase amplicons lengths\n"
-            "\t - increase number of ambiguous nucleotides\n"
-            "\t - relax primer settings (not recommended)\n",
-            log_file
+        dimers_not_solved = scheme.check_and_solve_heterodimers(
+            amplicon_scheme,
+            left_primer_candidates,
+            right_primer_candidates,
+            all_primers)
+        if dimers_not_solved:
+            logging.raise_error(
+                f"varVAMP found {len(dimers_not_solved)} primer dimers without replacements. Check the dimer file and perform the PCR for incomaptible amplicons in a sperate reaction.",
+                log_file
+            )
+            reporting.write_dimers(dir, dimers_not_solved)
+        percent_coverage = round(coverage/len(ambiguous_consensus)*100, 2)
+        logging.varvamp_progress(
+            log_file,
+            progress=0.9,
+            job="Creating amplicon scheme.",
+            progress_text=f"{percent_coverage} % total coverage with {len(amplicon_scheme[0]) + len(amplicon_scheme[1])} amplicons"
         )
+        if percent_coverage < 70:
+            logging.raise_error(
+                "coverage < 70 %. Possible solutions:\n"
+                "\t - lower threshold\n"
+                "\t - increase amplicons lengths\n"
+                "\t - increase number of ambiguous nucleotides\n"
+                "\t - relax primer settings (not recommended)\n",
+                log_file
+            )
+    elif args.mode == "SANGER":
+        amplicon_scheme = scheme.find_best_amplicons(amplicons, all_primers, args.report_n)
+        logging.varvamp_progress(
+            log_file,
+            progress=0.9,
+            job="Finding low scoring amplicons.",
+            progress_text="The lowest scoring amplicon is amplicon_0."
+        )
+
     # write files
     reporting.write_alignment(data_dir, alignment_cleaned)
     reporting.write_fasta(data_dir, "majority_consensus", majority_consensus)
@@ -250,7 +280,8 @@ def main(sysargs=sys.argv[1:]):
     reporting.write_scheme_to_files(
         results_dir,
         amplicon_scheme,
-        ambiguous_consensus
+        ambiguous_consensus,
+        args.mode
     )
     reporting.varvamp_plot(
         results_dir,