Ryan Wick edited this page Jan 15, 2024 · 27 revisions

Polypolish

The problem

While long-read sequencing platforms (like Oxford Nanopore) have gotten a lot better in recent years, long-read-only assemblies still suffer from some consensus sequence errors. Homopolymer-length errors are the most common type, e.g. AAAAAAAA becoming AAAAAAA. One can use short reads (like from an Illumina platform) to correct errors in a long-read assembly, a process known as short-read polishing. There are a number of short-read polishing tools, including HyPo, NextPolish, ntEdit, Pilon, POLCA and Racon.

However, errors in repeat sequences can be difficult to fix. Most of those short-read polishing tools rely on alignments generated from tools like BWA-MEM. When run with default settings, aligners put each read in a single best location (randomly chosen in the case of a tie). So if the assembly has an error in a repeat, reads may not align to it because they can get a better alignment in other instances of the repeat. For example, consider a genome with a two-copy exact repeat (I'll call them copy A and copy B), and the assembly of this genome has an error in copy A. When aligning short reads to the assembly, all reads which originated from the error-containing region of copy A will instead align to the corresponding region of copy B, because they can achieve a more accurate alignment there. This leaves no reads aligned over the error, and a short-read polishing tool will therefore have no information with which to fix the error.
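The copy A / copy B scenario can be reproduced with a small toy simulation. This is only a sketch under stated assumptions, not how BWA-MEM or any real aligner works: the sequences are made up, the assembly error is a single substitution rather than a homopolymer-length error, reads are error-free 8-mers taken from every position of the true genome, and "alignment" is an ungapped placement with the fewest mismatches (the first such placement wins ties, where a real aligner would choose randomly).

```python
from collections import Counter

K = 8  # toy read length (assumption)

def mismatches(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_placement(read, ref):
    """Start of the ungapped placement with the fewest mismatches.
    min() returns the first minimum, so ties go to the leftmost placement."""
    starts = range(len(ref) - len(read) + 1)
    return min(starts, key=lambda s: mismatches(read, ref[s:s + len(read)]))

# Made-up true genome: unique flank, repeat copy A, unique spacer, repeat copy B, unique flank.
repeat = "ACGGTCATTGCAGGATCCTA"
genome = "TTCAGCCA" + repeat + "CGCGAATC" + repeat + "AACCTGAG"

# The assembly matches the genome except for one wrong base in the middle of copy A.
err = 8 + 10                      # copy A starts at index 8; error at repeat offset 10
assembly = genome[:err] + "T" + genome[err + 1:]

# One error-free read from every position of the true genome.
reads = [genome[i:i + K] for i in range(len(genome) - K + 1)]

# Place each read in its single best location and record per-base depth.
depth = Counter()
for read in reads:
    start = best_placement(read, assembly)
    depth.update(range(start, start + K))

depth_at_error = depth[err]        # depth over the error in copy A
depth_in_copy_b = depth[err + 28]  # same repeat offset in copy B (28 bases downstream)
print("depth over error in copy A:", depth_at_error)
print("depth at same offset in copy B:", depth_in_copy_b)
```

In this toy, every read that came from the error-containing part of copy A finds a perfect match in copy B, so copy B collects the reads from both copies and the erroneous base in copy A is left without any coverage.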

Long-read assemblies with short-read polishing can be very accurate, and their non-repeat sequences may in fact be perfect. But due to the scenario described above, errors often remain in repeat sequences. This problem keeps truly error-free genome assemblies out of reach.

The solution

Polypolish is a short-read polishing tool that differs from existing tools in an important way: it uses short-read alignments where each read is aligned to all possible locations. This means that errors in repeats will be covered by short-read alignments, and Polypolish can therefore fix those errors. For an illustrated walk-through of how it works, check out the Toy example page of this wiki.
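To see why all-location alignments help, here is a toy sketch. It is not Polypolish's actual algorithm (see the Toy example page for that): the sequences are made up, the assembly error is a single substitution, reads are error-free 8-mers, a read is "aligned" to every ungapped placement with at most one mismatch (the `MAX_MISMATCHES` threshold is an assumption), and the base call is a simple majority vote, which is also an assumption made for illustration.

```python
from collections import Counter

K = 8               # toy read length (assumption)
MAX_MISMATCHES = 1  # toy threshold for keeping a placement (assumption)

def all_placements(read, ref, max_mismatches):
    """Every ungapped start position where the read has at most max_mismatches."""
    hits = []
    for start in range(len(ref) - len(read) + 1):
        window = ref[start:start + len(read)]
        if sum(x != y for x, y in zip(read, window)) <= max_mismatches:
            hits.append(start)
    return hits

# Made-up true genome with a two-copy exact repeat (copy A and copy B).
repeat = "ACGGTCATTGCAGGATCCTA"
genome = "TTCAGCCA" + repeat + "CGCGAATC" + repeat + "AACCTGAG"

# The assembly has one wrong base in the middle of copy A.
err = 8 + 10
assembly = genome[:err] + "T" + genome[err + 1:]

# Tally the read bases that land on the erroneous position when each read
# is placed at ALL acceptable locations rather than one best location.
pileup = Counter()
for i in range(len(genome) - K + 1):
    read = genome[i:i + K]
    for start in all_placements(read, assembly, MAX_MISMATCHES):
        if start <= err < start + K:
            pileup[read[err - start]] += 1

consensus_base = pileup.most_common(1)[0][0]
print("assembly base:", assembly[err])
print("read bases over the error:", dict(pileup))
print("majority base:", consensus_base)
```

Because reads from both copies of the repeat are now allowed to sit over copy A, the erroneous position is covered, and the covering reads carry the true base.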

Some caveats

No polishing tool is perfect, Polypolish included. This means that you should make your long-read assembly as accurate as possible before doing any short-read polishing. Trycycler can deliver clean assemblies which are free from medium-to-large scale errors. Long-read polishing with Medaka can further improve accuracy, so this should be done before running Polypolish.

Since different short-read polishers use different algorithms, I have found that using a combination of tools can deliver the most accurate assemblies. Aside from Polypolish, I like both ntEdit and POLCA – try using them in addition to Polypolish.

Nothing about Polypolish is intrinsically specific to bacterial genomes – its approach should work on eukaryotes too. However, I've only ever used it on bacterial genomes, and repeat-rich eukaryote genomes might cause issues (see this question in the FAQ). Try at your own risk!

Where to begin?

Check out the Software requirements and Installation pages to get Polypolish up and running. Then the How to run Polypolish page will show you how to use it.

If you want to read more or need to cite Polypolish, here is its corresponding manuscript: Wick RR, Holt KE. Polypolish: short-read polishing of long-read bacterial genome assemblies. PLOS Computational Biology. 2022. doi:10.1371/journal.pcbi.1009802.