Ryan Wick edited this page Jul 13, 2021 · 27 revisions

Polypolish

The problem

While long-read sequencing technologies (like Oxford Nanopore) have gotten a lot better in recent years, long-read-only assemblies still suffer from some consensus sequence errors. Homopolymer-length errors are the most common type, e.g. AAAAAAAA becoming AAAAAAA. One can use short reads (such as those from an Illumina platform) to correct errors in a long-read assembly, a process known as short-read polishing. There are a number of short-read polishing tools, including HyPo, NextPolish, ntEdit, Pilon, POLCA and Racon.

However, repeat sequences can cause a problem. Almost all of those polishing tools rely on short-read alignments generated by tools like BWA. When run with default settings, aligners put each read in a single best location (randomly chosen in the case of a tie). So if the assembly has an error in a repeat, reads may not align to it because they can get a better alignment in other instances of the repeat. For example, consider a genome with a two-copy exact repeat (I'll call them copy A and copy B) where the assembly has an error in copy A. When aligning short reads to the assembly, all reads which originated from the error-containing region of copy A will instead align to the corresponding region of copy B, because they can achieve a better-scoring alignment there. This leaves no reads aligned over the error, and a short-read polishing tool will therefore have no information with which to fix the error.
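The copy A / copy B scenario can be reproduced with a toy simulation. This is only a sketch under simplifying assumptions: the sequences are random, the assembly error is a substitution rather than a homopolymer-length error, and a brute-force mismatch counter stands in for a real aligner such as BWA. All names in it (`hits`, `best_hit`, `READ_LEN` and so on) are made up for illustration.

```python
import random

READ_LEN = 20  # hypothetical read length, far shorter than real Illumina reads


def random_seq(rng, length):
    return "".join(rng.choice("ACGT") for _ in range(length))


def hits(read, ref, max_mm=1):
    """All (position, mismatches) placements of read on ref with <= max_mm mismatches."""
    found = []
    for pos in range(len(ref) - len(read) + 1):
        mm = sum(a != b for a, b in zip(read, ref[pos:pos + len(read)]))
        if mm <= max_mm:
            found.append((pos, mm))
    return found


def best_hit(read, ref, rng):
    """Mimic default aligner behaviour: one best placement, ties broken at random."""
    found = hits(read, ref)
    if not found:
        return None
    fewest = min(mm for _, mm in found)
    return rng.choice([pos for pos, mm in found if mm == fewest])


rng = random.Random(0)
flank1, repeat, flank2, flank3 = (random_seq(rng, n) for n in (40, 60, 40, 40))
genome = flank1 + repeat + flank2 + repeat + flank3  # copy A, then copy B

# Put a substitution error in the middle of copy A of the assembly.
error_pos = len(flank1) + 30
wrong_base = {"A": "C", "C": "G", "G": "T", "T": "A"}[genome[error_pos]]
assembly = genome[:error_pos] + wrong_base + genome[error_pos + 1:]

# Error-free reads starting at every position of the true genome.
reads = [genome[i:i + READ_LEN] for i in range(len(genome) - READ_LEN + 1)]

depth_at_error = 0
for read in reads:
    pos = best_hit(read, assembly, rng)
    if pos is not None and pos <= error_pos < pos + READ_LEN:
        depth_at_error += 1

print("Reads covering the error (best hit only):", depth_at_error)
```

Every read that spans the error matches copy B exactly and copy A with one mismatch, so the best-hit rule sends all of them to copy B and the reported depth over the error is 0.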

Long-read assemblies with short-read polishing can be very accurate, and their non-repeat sequences may in fact be perfect. But due to the scenario described above, errors often remain in repeat sequences. This frustratingly keeps truly error-free genome assemblies out of reach.

The solution

Polypolish is a short-read polishing tool that differs from existing tools in an important way: it takes as input short-read alignments where each read is aligned to all possible locations. This means that errors in repeats will be covered by short-read alignments, and Polypolish can therefore fix those errors. For an illustrated walk-through of how it works, check out the Toy example page of this wiki.
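To see why all-location alignments help, here is a toy sketch of the idea. It is not Polypolish's actual algorithm: a simple per-position majority vote stands in for it, a brute-force mismatch counter stands in for a real aligner, and the error is a substitution rather than a homopolymer-length error. The sequences, the read length and the function names are all assumptions made for illustration.

```python
import random
from collections import Counter

READ_LEN = 20  # hypothetical read length
rng = random.Random(0)


def random_seq(length):
    return "".join(rng.choice("ACGT") for _ in range(length))


def all_hits(read, ref, max_mm=1):
    """Every start position where read matches ref with at most max_mm mismatches."""
    return [pos for pos in range(len(ref) - len(read) + 1)
            if sum(a != b for a, b in zip(read, ref[pos:pos + len(read)])) <= max_mm]


# True genome with a two-copy exact repeat (copy A, then copy B).
flank1, repeat, flank2, flank3 = (random_seq(n) for n in (40, 60, 40, 40))
genome = flank1 + repeat + flank2 + repeat + flank3

# Assembly with a substitution error in the middle of copy A.
error_pos = len(flank1) + 30
wrong_base = {"A": "C", "C": "G", "G": "T", "T": "A"}[genome[error_pos]]
assembly = genome[:error_pos] + wrong_base + genome[error_pos + 1:]

# Error-free reads starting at every position of the true genome.
reads = [genome[i:i + READ_LEN] for i in range(len(genome) - READ_LEN + 1)]

# Pile up read bases at every location each read can align to.
pileup = [Counter() for _ in assembly]
for read in reads:
    for pos in all_hits(read, assembly):
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1

# Majority vote per position; keep the assembly base where nothing aligned.
polished = "".join(counts.most_common(1)[0][0] if counts else base
                   for counts, base in zip(pileup, assembly))

print("Depth over the error:", sum(pileup[error_pos].values()))
print("Error fixed:", polished == genome)
```

Because reads from both copies are now placed on copy A as well as copy B, the erroneous position gets coverage, and every read covering it carries the correct base.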

In my tests, Polypolish outperformed other short-read polishing tools on long-read assemblies of bacterial genomes. A manuscript is in the works – stay tuned!

Some caveats

No polishing tool is perfect, Polypolish included. This means that you should make your long-read assembly as accurate as possible before doing any short-read polishing. Trycycler can deliver clean assemblies which are free from medium-to-large-scale errors. Long-read polishing with Medaka can further improve accuracy, so this should be done before running Polypolish.

Since different short-read polishers use different algorithms, I have found that using a combination of tools can deliver the most accurate assemblies. Aside from Polypolish, my favourite short-read polishers are HyPo, ntEdit, Pilon and POLCA – try using them in combination with Polypolish. In particular, I've seen very nice results by polishing first with Polypolish, then with HyPo and finally with ntEdit.

Polypolish can be greedy with RAM, often requiring more than 10 GB to run. If you have a particularly repeat-rich genome (e.g. Shigella) then even 50 GB of RAM or more may be needed. This is because aligning short reads to all possible locations in a repeat-rich genome can yield a lot of alignments. A reimplementation of Polypolish's algorithm in a more efficient language (e.g. C++ or Rust) is probably called for.
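A rough back-of-the-envelope sketch shows why repeats inflate the number of alignments. It assumes an idealised exact repeat in which every read from any copy aligns to every copy. The function name and the read counts are hypothetical, not measurements from a real genome.

```python
def alignment_counts(copies, reads_per_copy):
    """Alignments produced by reads from an exact repeat with the given copy number."""
    total_reads = copies * reads_per_copy
    best_only = total_reads               # one placement per read
    all_locations = total_reads * copies  # every read placed on every copy
    return best_only, all_locations


for copies in (1, 2, 10, 100):
    best_only, all_locations = alignment_counts(copies, reads_per_copy=1000)
    print(f"{copies:>3} copies: {best_only:>7} best-only vs "
          f"{all_locations:>9} all-location alignments")
```

Under these assumptions the all-location count grows with the square of the copy number, while the best-only count grows linearly.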

Nothing about Polypolish is specific to bacterial genomes – its approach should work on eukaryotes too. However, I've only ever used it on bacterial genomes, and due to the aforementioned RAM usage, it may be impractical to run Polypolish on a eukaryotic genome. Try at your own risk!

Where to begin?

Check out the Software requirements and Installation pages to get Polypolish up and running. Then the How to run Polypolish page will show you how to use it. Don't worry – it's pretty simple!