Skip to content

Insert size filter

Ryan Wick edited this page Apr 30, 2024 · 23 revisions

Before running polypolish polish, you can (and probably should) run polypolish filter to exclude some alignments based on their insert size. This should reduce the number of excessive alignments, particularly near the edges of repeat sequences, improving Polypolish's ability to fix errors in those regions.

The goal

Insert-size filtering aims to use read pairing to remove spurious alignments, i.e. alignments of a read to the wrong instance of a repeat.

Consider this toy example, where read A-1 aligns to three locations (because it's in a repeat) and read A-2 (its pair) aligns to only one location:

insert size filter (before)

Based on our expectations of paired-end orientation and insert size, we can assume that the first alignment and last alignment for read A-1 are spurious, and only the middle one should be kept:

insert size filter (after)

How it works

The first thing the script does is figure out the insert size distribution for the read set. It does this by looking exclusively at read pairs which have exactly one alignment per read. It will also automatically determine the read orientation (typically FR for paired-end reads) if it wasn't specified with the --orientation option. The script then sets a low threshold and a high threshold using the insert size distribution. These will be the 0.1st and 99.9th percentiles by default but can be adjusted with the --low and --high options. These thresholds now define what a 'good' insert size looks like: larger than the low threshold and smaller than the high threshold.

The script then assesses each alignment using this logic:

  • If there are no pair alignments, it passes. I.e. if the script can't use read pairs to assess the alignment, it is kept.
  • If there is exactly one alignment for this read, it passes. I.e. the script is not going to throw out the only alignment for a read.
  • If there are multiple alignments for this read and at least one pair alignment, then the alignment passes if and only if it fits (good insert size and correct orientation) with any of the pair alignments.

Alignments that pass the filter are written unchanged to the output SAM file. Alignments that fail the filter are written with a ZP:Z:fail tag. Polypolish will then know to ignore these alignments in its algorithm. Incidentally, you can use this tag yourself if you want to write other alignment filters to run before Polypolish.

You might wonder why the script doesn't simply exclude failed alignments from the output file. This is because SAM files from BWA only contain sequence for one alignment per read (the rest use * to save space), so excluding alignments could lead to reads without sequence in the output SAM file, which would cause Polypolish to crash.

Benefit

Polish v0.4.3 vs v0.5.0

This figure compares Polypolish v0.4.3 (no insert size alignment filter) with Polypolish v0.5.0 (with insert size alignment filter) using the data from the Polypolish manuscript. Results are shown for Polypolish alone and Polypolish+POLCA (the combination recommended in the manuscript) using both simulated reads (A) and real reads (B), following the same methods used in the manuscript.

The difference is small but positive: for every assembly, Polypolish v0.5.0 produced an equivalent or more accurate sequence than Polypolish v0.4.3.

Mate-pair read sets

I haven't yet tried this script with mate-pair read sets, because these are uncommon for bacterial genomes. However, insert-size filtering should work especially well with mate-pair reads. A large insert size increases the chance that at least one read in each pair has a unique alignment (i.e. decreases the chance that both reads in a pair fall in a repeat). Since mate-pair libraries can have multi-kilobase insert sizes, they should provide a lot of alignment-culling power to polypolish filter.