diff --git a/README.md b/README.md index c1084d6..3bab491 100644 --- a/README.md +++ b/README.md @@ -2,30 +2,30 @@ # varVAMP -[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0) +[![License: GPL v3](https://img.shields.io/github/license/jonas-fuchs/varvamp)](https://www.gnu.org/licenses/gpl-3.0) For a lot of virus genera it is difficult to design pan-specific primers. varVAMP solves this, by introducing ambiguous characters into primers and minimizes mismatches at the 3' end. Primers might not work for some sequences of your input alignment but should recognize the large majority. **varVAMP comes in three different flavors:** -varVAMP logo +varVAMP logo -**SANGER** *(coming soon)*: varVAMP searches for the very best primers and reports back an amplicon which can be used for PCR-based screening approaches. +**SANGER**: varVAMP searches for the very best primers and reports back non-overlapping amplicons which can be used for PCR-based screening approaches. **TILED**: varVAMP uses a graph based approach to design overlapping amplicons that tile the entire viral genome. This designs amplicons that are suitable for Oxford Nanopore or Illumina based full-genome sequencing. **QPCR** *(coming soon)*: varVAMP searches for small amplicons with an internal primer for the probe. It minimizes temperature differences between the primers. -This program is currently being developed and in an alpha state. You are welcome to use this software. If you successfully design primers, drop me a mail. It might be possible to collaborate! +This program is currently being developed and in an alpha state. You are welcome to use this software. If you successfully design primers, drop me a mail. It might be possible to collaborate! Ideas and suggestions are highly welcome. # Documentation -* [Installation](docs/installation.md) -* [Preparing the data](docs/preparing_the_data.md) -* [Usage](docs/usage.md) -* [Output](docs/output.md) -* [How it works](docs/how_varvamp_works.md) -* [FAQ](docs/FAQ.md) +* [Installation](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/installation.md) +* [Preparing the data](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/preparing_the_data.md) +* [Usage](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/usage.md) +* [Output](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/output.md) +* [How it works](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/how_varvamp_works.md) +* [FAQ](https://github.com/jonas-fuchs/varVAMP/blob/master/docs/FAQ.md) --- diff --git a/docs/FAQ.md b/docs/FAQ.md index 8ee18f9..c00dae1 100644 --- a/docs/FAQ.md +++ b/docs/FAQ.md @@ -4,18 +4,22 @@ Start with setting your optimal length of your amplicon, the max length that you can tolerate and the overlap that you want to achieve. If you want to allow ambiguous characters, set them to 2 as a start. Set the threshold to the mean sequence identity of your alignment. Run varvamp and then optimize the output. -2. **How do I optimize the output?** +2. **How do I optimize the output for TILED mode?** It all depends on how many conserved regions varVAMP is able to find! There are two main parameters that influence this. The number of ambiguous bases allowed within a primer and the threshold for consensus nucleotides. Setting the threshold higher or the number of ambiguous bases lower will result in less conserved regions. If you have set the parameters below and get a decent output, increase the threshold until the output gets worse. This will increase the specificity of your primers. Likewise, if you do not have a good output, consider increasing the number of ambiguous bases before you lower the threshold. The console output varVAMP will also give you some suggestions. 3. **varVAMP reported primer dimers. What now?** -In your case varVAMP could not find suitable replacement primers. You can either rerun varVAMP and try different settings or you can perform a third pool that contains a amplicon that has one of the conflicting dimers. Notably, varVAMP also reports the dimer melting temperature. If it is still reasonable low, using a hot start polymerase might still lead to successful PCR amplification. +In your case varVAMP could not find suitable replacement primers in the TILED mode. You can either rerun varVAMP and try different settings or you can perform a third pool that contains a amplicon that has one of the conflicting dimers. Notably, varVAMP also reports the dimer melting temperature. If it is still reasonable low, using a hot start polymerase might still lead to successful PCR amplification. +4. **I have multiple amplicons after SANGER mode. Which should I use?** -4. **How fast is varVAMP?** +varVAMP sorts all amplicons by score and always takes the best one of non-overlapping amplicons. If you are not interested in a specific gene region, amplicon_0 is your best candidate! + +5. **How fast is varVAMP?** + +varVAMP is pretty fast given the complexity of the problem. Running time is depended on the alignment length, number of sequences and the running mode. While the TILED is rather slow, qPCR and SANGER can be faster. An alignment with a few hundred sequences and with a genome size of 10 kb will likely run in under a minute for the TILED mode. For large e.g. DNA viruses (200 kb) it takes considerably longer, but should still finish in minutes. Running time optimizations are planned. -varVAMP is pretty fast given the complexity of the problem. Running time is depended on the alignment length, number of sequences and the running mode. While the TILED is rather slow, qPCR and SANGER are faster. An alignment with a few hundred sequences and with a genome size of 10 kb will likely run in under a minute for the TILED mode. For large e.g. DNA viruses (200 kb) it takes considerably longer, but should still finish in minutes. Running time optimizations are planned. diff --git a/docs/how_varvamp_works.md b/docs/how_varvamp_works.md index 63a53dd..0493ae6 100644 --- a/docs/how_varvamp_works.md +++ b/docs/how_varvamp_works.md @@ -37,7 +37,7 @@ To search for the best scoring amplicon, varVAMP uses a graph based approach. 7. Lastly, the best scoring scheme is evaluated for primer dimers in their respective pools. If a primer dimer pair is found, varVAMP evaluates for each primer their overlapping previously not considered primers (primer search step 2) and again minimizes the score. The scheme and all primers are updated. If no alternative primers can be found, varVAMP issues a warning and reports the unsolvable primer dimers. #### Sanger sequencing -coming soon +varVAMP sorts all amplicons by their score and takes the non-overlapping amplicon with the lowest score! As varVAMP gives a size penalty to amplicons, varVAMP automatically finds amplicons with low primer scores close to your optimal length (if possible). #### qPCR coming soon diff --git a/docs/installation.md b/docs/installation.md index b7ef2b6..b98a51d 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -3,10 +3,14 @@ varVAMP runs on UNIX Systems, MacOSX and Windows with Python3 >=3.9 installed. ## Installation +From PyPI: + ```shell pip install varvamp ``` + That was already it. To check if it worked: + ```shell varvamp -v ``` diff --git a/docs/preparing_the_data.md b/docs/preparing_the_data.md index 9757aae..a850e3a 100644 --- a/docs/preparing_the_data.md +++ b/docs/preparing_the_data.md @@ -12,7 +12,7 @@ mafft my_sequences.fasta > my_alignment.fasta ### BUT... -If your sequences are two diverse, also varVAMP will not perform well. Therefore, it is important that, phylogenically, the input alignment makes sense. To analyze this you can calculate a tree with tools like [iqtree](http://www.iqtree.org/) and then use [TreeCluster](https://github.com/niemasd/TreeCluster) to get phylogenically related sequence clusters. However, this can be also computationally intensive. +If your sequences are too diverse, also varVAMP will not perform well. Therefore, it is important that, phylogenically, the input alignment makes sense. To analyze this you can calculate a tree with tools like [iqtree](http://www.iqtree.org/) and then use [TreeCluster](https://github.com/niemasd/TreeCluster) to get phylogenically related sequence clusters. However, this can be also computationally intensive. We have had good experience in using varVAMP with the sequence identity based clustering algorithm [vsearch](https://github.com/torognes/vsearch). A good starting point is a sequence identity between 0.8 and 0.85. For such clusters varVAMP should perform reasonably well. diff --git a/docs/usage.md b/docs/usage.md index 723df01..e22cae2 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -15,34 +15,33 @@ In this case varVAMP uses as standard settings: * ```MAX_LENGTH``` = 2000 (maximum amplicon length) * ```OVERLAP``` = 100 (minimum overlap length) * ```THRESHOLD``` = 0.9 (nucleotide consensus threshold) -* ```ALLOWED_AMBIGUOUS``` = 4 (number of allowed ambiguous characters in a primer) +* ```N_AMBIG``` = 4 (number of allowed ambiguous characters in a primer) +* ```MODE``` = TILED These settings are quite relaxed and can produce decent results for diverse viruses (80-90 % sequence identity). However, you can likely optimize the result. **Full usage:** ```shell -varvamp [OPTIONS] +usage: varvamp [options] ``` + ``` +varvamp: variable virus amplicon design + positional arguments: - input alignment file and dir to write results + input alignment file and dir to write results + mode varvamp modes: TILED, SANGER optional arguments: - -h, --help show this help message and exit - -ol OPT_LENGTH, --opt-length OPT_LENGTH - optimal length of the amplicons - -ml MAX_LENGTH, --max-length MAX_LENGTH - max length of the amplicons - -o OVERLAP, --overlap OVERLAP - min overlap of the amplicons - -t THRESHOLD, --threshold THRESHOLD - threshold for nucleotides in alignment to be considered conserved - -a ALLOWED_AMBIGUOUS, --allowed-ambiguous ALLOWED_AMBIGUOUS - number of ambiguous characters that are allowed within a primer - --console, --no-console - show varvamp console output (default: True) - -v, --version show program's version number and exit - + -h, --help show this help message and exit + -ol , --opt-length optimal length of the amplicons + -ml , --max-length max length of the amplicons + -t , --threshold threshold for conserved nucleotides + -a , --n-ambig max number of ambiguous characters in a primer + -o , --overlap TILED: min overlap of the amplicons + -n , --report-n SANGER: report the top n best hits + --verb, --no-verb show varvamp console output (default: True) + -v, --version show program's version number and exit ``` ## Further customization (advanced) diff --git a/varvamp/__init__.py b/varvamp/__init__.py index 136d60d..36ac97e 100644 --- a/varvamp/__init__.py +++ b/varvamp/__init__.py @@ -1,3 +1,3 @@ """Tool to design amplicons for highly variable virusgenomes""" _program = "varvamp" -__version__ = "0.3" +__version__ = "0.4" diff --git a/varvamp/command.py b/varvamp/command.py index 0869739..66b1d56 100644 --- a/varvamp/command.py +++ b/varvamp/command.py @@ -9,19 +9,17 @@ import argparse # varVAMP -from . import _program -from varvamp import __version__ from varvamp.scripts import logging from varvamp.scripts import alignment -from varvamp.scripts import config from varvamp.scripts import consensus from varvamp.scripts import conserved from varvamp.scripts import primers from varvamp.scripts import reporting from varvamp.scripts import scheme +from varvamp import __version__ +from . import _program -# DEFs def get_args(sysargs): """ arg parsing for varvamp @@ -29,50 +27,70 @@ def get_args(sysargs): parser = argparse.ArgumentParser( prog=_program, description='varvamp: variable virus amplicon design', - usage='''varvamp [options]''') + usage='''varvamp [options]''') parser.add_argument( "input", nargs=2, help="alignment file and dir to write results" ) + parser.add_argument( + "mode", + type=str, + nargs='?', + default="TILED", + help="varvamp modes: TILED, SANGER", + ) parser.add_argument( "-ol", "--opt-length", help="optimal length of the amplicons", + metavar="", type=int, - default=config.AMPLICON_OPT_LENGTH + default=1000 ) parser.add_argument( "-ml", "--max-length", help="max length of the amplicons", + metavar="", type=int, - default=config.AMPLICON_MAX_LENGTH - ) - parser.add_argument( - "-o", - "--overlap", - type=float, - default=config.AMPLICON_MIN_OVERLAP, - help="min overlap of the amplicons" + default=2000 ) parser.add_argument( "-t", "--threshold", + metavar="", type=float, - default=config.FREQUENCY_THRESHOLD, - help="threshold for nucleotides in alignment to be considered conserved" + default=0.9, + help="threshold for conserved nucleotides" ) parser.add_argument( "-a", - "--allowed-ambiguous", + "--n-ambig", + metavar="", + type=int, + default=2, + help="max number of ambiguous characters in a primer" + ) + parser.add_argument( + "-o", + "--overlap", + type=int, + metavar="", + default=100, + help="TILED: min overlap of the amplicons" + ) + parser.add_argument( + "-n", + "--report-n", type=int, - default=config.PRIMER_ALLOWED_N_AMB, - help="number of ambiguous characters that are allowed within a primer" + metavar="", + default=float("inf"), + help="SANGER: report the top n best hits" ) parser.add_argument( - "--console", + "--verb", action=argparse.BooleanOptionalAction, default=True, help="show varvamp console output" @@ -97,8 +115,9 @@ def main(sysargs=sys.argv[1:]): """ # start varVAMP args = get_args(sysargs) - if not args.console: + if not args.verb: sys.stdout = open(os.devnull, 'w') + args.mode = args.mode.upper() start_time = time.process_time() results_dir, data_dir, log_file = logging.create_dir_structure(args.input[1]) logging.raise_arg_errors(args, log_file) @@ -136,7 +155,7 @@ def main(sysargs=sys.argv[1:]): # generate conserved region list conserved_regions = conserved.find_regions( ambiguous_consensus, - args.allowed_ambiguous + args.n_ambig ) if not conserved_regions: logging.raise_error( @@ -201,46 +220,57 @@ def main(sysargs=sys.argv[1:]): log_file, exit=True ) - amplicon_graph = scheme.create_amplicon_graph(amplicons, args.overlap) logging.varvamp_progress( log_file, progress=0.8, job="Finding potential amplicons.", - progress_text=str(len(amplicons)) + " potential amplicons" - ) - # search for amplicon scheme - coverage, amplicon_scheme = scheme.find_best_covering_scheme( - amplicons, - amplicon_graph, - all_primers + progress_text=f"{len(amplicons)} potential amplicons" ) - dimers_not_solved = scheme.check_and_solve_heterodimers( - amplicon_scheme, - left_primer_candidates, - right_primer_candidates, - all_primers) - if dimers_not_solved: - logging.raise_error( - f"varVAMP found {len(dimers_not_solved)} primer dimers without replacements. Check the dimer file and perform the PCR for incomaptible amplicons in a sperate reaction.", - log_file + + if args.mode == "TILED": + amplicon_graph = scheme.create_amplicon_graph(amplicons, args.overlap) + # search for amplicon scheme + coverage, amplicon_scheme = scheme.find_best_covering_scheme( + amplicons, + amplicon_graph, + all_primers ) - reporting.write_dimers(dir, dimers_not_solved) - percent_coverage = round(coverage/len(ambiguous_consensus)*100, 2) - logging.varvamp_progress( - log_file, - progress=0.9, - job="Creating amplicon scheme.", - progress_text=f"{percent_coverage} % total coverage with {len(amplicon_scheme[0]) + len(amplicon_scheme[1])} amplicons" - ) - if percent_coverage < 70: - logging.raise_error( - "coverage < 70 %. Possible solutions:\n" - "\t - lower threshold\n" - "\t - increase amplicons lengths\n" - "\t - increase number of ambiguous nucleotides\n" - "\t - relax primer settings (not recommended)\n", - log_file + dimers_not_solved = scheme.check_and_solve_heterodimers( + amplicon_scheme, + left_primer_candidates, + right_primer_candidates, + all_primers) + if dimers_not_solved: + logging.raise_error( + f"varVAMP found {len(dimers_not_solved)} primer dimers without replacements. Check the dimer file and perform the PCR for incomaptible amplicons in a sperate reaction.", + log_file + ) + reporting.write_dimers(dir, dimers_not_solved) + percent_coverage = round(coverage/len(ambiguous_consensus)*100, 2) + logging.varvamp_progress( + log_file, + progress=0.9, + job="Creating amplicon scheme.", + progress_text=f"{percent_coverage} % total coverage with {len(amplicon_scheme[0]) + len(amplicon_scheme[1])} amplicons" ) + if percent_coverage < 70: + logging.raise_error( + "coverage < 70 %. Possible solutions:\n" + "\t - lower threshold\n" + "\t - increase amplicons lengths\n" + "\t - increase number of ambiguous nucleotides\n" + "\t - relax primer settings (not recommended)\n", + log_file + ) + elif args.mode == "SANGER": + amplicon_scheme = scheme.find_best_amplicons(amplicons, all_primers, args.report_n) + logging.varvamp_progress( + log_file, + progress=0.9, + job="Finding low scoring amplicons.", + progress_text="The lowest scoring amplicon is amplicon_0." + ) + # write files reporting.write_alignment(data_dir, alignment_cleaned) reporting.write_fasta(data_dir, "majority_consensus", majority_consensus) @@ -250,7 +280,8 @@ def main(sysargs=sys.argv[1:]): reporting.write_scheme_to_files( results_dir, amplicon_scheme, - ambiguous_consensus + ambiguous_consensus, + args.mode ) reporting.varvamp_plot( results_dir, diff --git a/varvamp/scripts/config.py b/varvamp/scripts/config.py index a78435d..b29d3e3 100644 --- a/varvamp/scripts/config.py +++ b/varvamp/scripts/config.py @@ -6,10 +6,6 @@ # CAN BE CHANGED -# alignment and consensus creation threshold -FREQUENCY_THRESHOLD = 0.9 # freq at which a nucleotide is considered conserved -PRIMER_ALLOWED_N_AMB = 4 # allowed number of ambiguous chars in primer - # basic primer parameters PRIMER_TMP = (57, 63, 60) # temperatur (min, max, opt) PRIMER_GC_RANGE = (40, 60, 50) # gc (min, max, opt) @@ -36,10 +32,6 @@ PRIMER_3_PENALTY = (10, 10, 10) # penalties for 3' mismatches PRIMER_PERMUTATION_PENALTY = 0.1 # penalty for the number of permutations -# amplicon settings -AMPLICON_MIN_OVERLAP = 100 -AMPLICON_OPT_LENGTH = 1000 -AMPLICON_MAX_LENGTH = 2000 # DO NOT CHANGE # nucleotide definitions diff --git a/varvamp/scripts/logging.py b/varvamp/scripts/logging.py index 2ab6d18..9a66027 100644 --- a/varvamp/scripts/logging.py +++ b/varvamp/scripts/logging.py @@ -50,7 +50,7 @@ def varvamp_progress(log_file, start_time=None, progress=0, job="", progress_tex else: if progress == 1: stop_time = str(round(time.process_time() - start_time, 2)) - progress_text = f"all done \n\n\rvarVAMP created an amplicon scheme in {stop_time} sec!\n{datetime.datetime.now()}" + progress_text = f"all done \n\n\rvarVAMP finished in {stop_time} sec!\n{datetime.datetime.now()}" job = "Finalizing output." print( "\rJob:\t\t " + job + "\nProgress: \t [{0}] {1}%".format("█"*block + "-"*(barLength-block), progress*100) + "\t" + progress_text, @@ -91,13 +91,13 @@ def raise_arg_errors(args, log_file): log_file, exit=True ) - if args.allowed_ambiguous < 0: + if args.n_ambig < 0: raise_error( "number of ambiguous chars can not be negative", log_file, exit=True ) - if args.allowed_ambiguous > 4: + if args.n_ambig > 4: raise_error( "high number of ambiguous nucleotides in primer leads to a high " "degeneracy. Consider reducing.", @@ -115,39 +115,59 @@ def raise_arg_errors(args, log_file): log_file, exit=True ) - if args.opt_length < 200 or args.max_length < 200: + if args.mode != "TILED" and args.mode != "SANGER": raise_error( - "your amplicon lengths might be to small. Consider increasing", - log_file - ) - if args.overlap < 0: - raise_error( - "overlap size can not be negative.", - log_file, - exit=True - ) - if args.overlap < 50: - raise_error( - "small overlaps might hinder downstream analyses. Consider increasing.", - log_file - ) - if args.overlap > args.max_length/2 - config.PRIMER_SIZES[1]: - raise_error( - "min overlap must be lower than half of your maximum length - maximum primer length. To achieve optimal results reduce it to at least half of your optimal length", - log_file, - exit=True - ) - if args.overlap > args.opt_length: - raise_error( - "overlap can not be higher than your optimal length.", + "non valid varvamp mode. Use TILED or SANGER.", log_file, exit=True ) - if args.overlap > args.opt_length/2: - raise_error( - "your intended overlap is higher than half of your optimal length. This reduces how well varvamps will find overlapping amplicons. Consider decreasing.", - log_file - ) + if args.mode == "SANGER": + if args.report_n < 1: + raise_error( + "number of reported amplicons cannot be below 1.", + log_file, + exit=True + ) + # TILED specific warnings + if args.mode == "TILED": + if args.report_n != float("inf"): + raise_error( + "n-report is for SANGER and ignored for TILED", + log_file + ) + if args.opt_length < 200 or args.max_length < 200: + raise_error( + "your amplicon lengths might be to small. Consider increasing", + log_file + ) + if args.overlap < 0: + raise_error( + "overlap size can not be negative.", + log_file, + exit=True + ) + if args.overlap < 50: + raise_error( + "small overlaps might hinder downstream analyses. Consider increasing.", + log_file + ) + if args.overlap > args.max_length/2 - config.PRIMER_SIZES[1]: + raise_error( + "min overlap must be lower than half of your maximum length - maximum primer length. To achieve optimal results reduce it to at least half of your optimal length", + log_file, + exit=True + ) + if args.overlap > args.opt_length: + raise_error( + "overlap can not be higher than your optimal length.", + log_file, + exit=True + ) + if args.overlap > args.opt_length/2: + raise_error( + "your intended overlap is higher than half of your optimal length. This reduces how well varvamps will find overlapping amplicons. Consider decreasing.", + log_file + ) def confirm_config(args, log_file): @@ -159,12 +179,6 @@ def confirm_config(args, log_file): # check if all variables exists all_vars = [ - # arg dependent - "FREQUENCY_THRESHOLD", - "PRIMER_ALLOWED_N_AMB", - "AMPLICON_OPT_LENGTH", - "AMPLICON_MAX_LENGTH", - "AMPLICON_MIN_OVERLAP", # arg independent "PRIMER_TMP", "PRIMER_GC_RANGE", @@ -306,16 +320,36 @@ def confirm_config(args, log_file): var_dic = vars(config) with open(log_file, 'a') as f: print( - "settings that can be adjusted via arguments\n", - f"PRIMER_OPT_LENGTH = {args.opt_length}", - f"PRIMER_MAX_LENGTH = {args.max_length}", - f"PRIMER_MIN_OVERLAP = {args.overlap}", - f"PRIMER_THRESHOLD = {args.threshold}", - f"PRIMER_ALLOWED_N_AMB = {args.allowed_ambiguous}", - "\nconfig settings\n", + f"MODE = {args.mode}", + sep="\n", + file=f + ) + print( + "\nsettings that can be adjusted via arguments\n", + f"OPT_LENGTH = {args.opt_length}", + f"MAX_LENGTH = {args.max_length}", + f"THRESHOLD = {args.threshold}", + f"ALLOWED_N_AMB = {args.n_ambig}", sep="\n", file=f ) + if args.mode == "TILED": + print( + f"MIN_OVERLAP = {args.overlap}", + sep="\n", + file=f + ) + if args.mode == "SANGER": + print( + f"REPORT_N_AMPLICONS = {args.report_n}", + sep="\n", + file=f + ) + print( + "\nconfig settings\n", + sep="\n", + file = f + ) for var in all_vars[5:]: print(f"{var} = {var_dic[var]}", file=f) - print("\nprogress\n", file=f) + print("\nprogress", file=f) diff --git a/varvamp/scripts/reporting.py b/varvamp/scripts/reporting.py index f4e8c9e..b0be4d6 100644 --- a/varvamp/scripts/reporting.py +++ b/varvamp/scripts/reporting.py @@ -102,7 +102,7 @@ def get_permutations(seq): return[''.join(p) for p in itertools.product(*splits)] -def write_scheme_to_files(dir, amplicon_scheme, ambiguous_consensus): +def write_scheme_to_files(dir, amplicon_scheme, ambiguous_consensus, mode): """ write all relevant bed files and a tsv file with all primer stats """ @@ -134,7 +134,11 @@ def write_scheme_to_files(dir, amplicon_scheme, ambiguous_consensus): right = (primer_names[1], amplicon_scheme[pool][amp][primer_names[1]]) # write amplicon bed - print("ambiguous_consensus", left[1][1], right[1][2], new_name, pool, sep="\t", file=bed) + if mode == "TILED": + print("ambiguous_consensus", left[1][1], right[1][2], new_name, pool, sep="\t", file=bed) + elif mode == "SANGER": + print("ambiguous_consensus", left[1][1], right[1][2], new_name, round(left[1][3] + right[1][3], 1), sep="\t", file=bed) + # write primer assignments tabular file print(left[0], right[0], sep="\t", file=tabular) @@ -346,7 +350,7 @@ def varvamp_plot(dir, threshold, alignment_cleaned, conserved_regions, all_prime ax[idx].set_title(primer[0], loc="left") ax[idx].xaxis.set_ticks(np.arange(primer[1][1], primer[1][1]+len(x), 1)) ax[idx].xaxis.set_ticklabels(x, rotation=45) - ax[idx].set_ylabel(ylabel="% of sequences") + ax[idx].set_ylabel(ylabel="freq of sequences") ax[idx].set_xlabel("position") ax[idx].set_ylim(0, 1-threshold) # - to pdf diff --git a/varvamp/scripts/scheme.py b/varvamp/scripts/scheme.py index 1aad401..efc9184 100644 --- a/varvamp/scripts/scheme.py +++ b/varvamp/scripts/scheme.py @@ -388,3 +388,33 @@ def check_and_solve_heterodimers(amplicon_scheme, left_primer_candidates, right_ primer_dimers = test_scheme_for_dimers(amplicon_scheme) return not_solvable + + +def find_best_amplicons(amplicons, all_primers, n): + """ + find the best scoring non-overlapping amplicons from all found amplicons + """ + # sort amplicons + sorted_amplicons = sorted(amplicons.items(), key=lambda x: x[1][5]) + to_retain = [sorted_amplicons[0]] + amplicon_range = list(range(sorted_amplicons[0][1][0], sorted_amplicons[0][1][1]+1)) + amplicon_set = set(amplicon_range) + # find lowest non-overlapping + for amp in sorted_amplicons: + amp_positions = list(range(amp[1][0], amp[1][1]+1)) + if not any(x in amp_positions for x in amplicon_set): + amplicon_set.update(amp_positions) + to_retain.append(amp) + # build the final dictionary + scheme_dictionary = {0: {}} + counter = 1 + for amp in to_retain: + scheme_dictionary[0][amp[0]] = {} + scheme_dictionary[0][amp[0]][amp[1][2]] = all_primers["+"][amp[1][2]] + scheme_dictionary[0][amp[0]][amp[1][3]] = all_primers["-"][amp[1][3]] + if n != float("inf"): + if counter == n: + break + counter += 1 + + return scheme_dictionary