Skip to content

Read extraction

Ryan Wick edited this page Apr 17, 2018 · 4 revisions

Often when investigating an incomplete/suspicious/off part of an assembly, it is useful to gather up the long reads which align to that region. This page describes a method for doing just that.

First, you'll need to save the sequence of interest to a fasta file. If you're examining an assembly graph in Bandage, you can save selected contigs to fasta or save a path through the graph to fasta. If you're analysing the assembly with some other tool, it too hopefully has output-to-fasta options.

Next, use this one-liner monstrosity to save reads which align to the sequence of interest:

zgrep -A 1 --no-group-separator -E $(minimap2 -x map-ont -c -t 8 sequence.fasta long_reads.fastq.gz | awk '{if ($2 >= 2000 && $11 >= 1000 && $10/$11 >= 0.85) print $1;}' | sort | uniq | paste -s -d '|') long_reads.fastq.gz | sed 's|^@|>|' > reads.fasta

To use:

  • Replace long_reads.fastq.gz with the actual path to your reads.
  • Replace sequence.fasta with the actual path to your sequence of interest.
  • Replace 2000 with the minimum allowed read length.
  • Replace 1000 with the minimum allowed alignment length.
  • Replace 0.85 with the minimum allowed alignment identity.

Explanation:

  • minimap2 -x map10k -c -t 8 sequence.fasta long_reads.fastq.gz | awk '{if ($2 >= 2000 && $11 >= 1000 && $10/$11 >= 0.85) print $1;}' | sort | uniq | paste -s -d '|' produces a pipe-delimited list of string names which is used as a grep query.
    • minimap2 -x map10k -c -t 8 sequence.fasta long_reads.fastq.gz is the actual alignment command.
      • -x map10k is the minimap2 preset for Nanopore-read-to-reference alignment.
      • -c tells minimap2 to do a base-by-base alignment, which allows us to get the alignment identity.
      • -t 8 is the thread count, but minimap2 is pretty fast so this doesn't matter too much.
    • awk '{if ($2 >= 2000 && $11 >= 1000 && $10/$11 >= 0.85) print $1;}' gives the read name if the alignment is good enough.
      • A description of the columns in minimap2's PAF output can be found here.
      • $2 >= 2000 restricts output to when the read length is long enough.
      • $11 >= 1000 restricts output to when the alignment length is long enough.
      • $10/$11 >= 0.85 restricts output to when the alignment identity is good enough.
      • Adjust these thresholds as appropriate. E.g. if you have very deep long reads, you can get away with more stringent thresholds.
    • | sort | uniq ensures we don't save the same read multiple times.
    • | paste -s -d '|' replaces newlines with pipes, so the result is ready to pass to grep.
  • zgrep -A 1 --no-group-separator -E $(...) long_reads.fastq.gz | sed 's|^@|>|' gets the header and sequences lines for relevant reads out of the file of full reads.
    • zgrep works on gzipped files - change to just grep if your long reads aren't gzipped.
    • -A 1 makes grep return both the matching line (header) and the following line (sequence).
    • --no-group-separator turns off the -- lines between matches so we can more easily turn the result into a fasta file.
    • -E uses extended regex which lets us have multiple queries separated by pipes.
    • $(...) is the output of the minimap/sort/awk/paste command which produces the grep queries.
    • sed 's|^@|>| replaces @ with > so the result is fasta format.
  • reads.fasta is made by this command and will contain the relevant long reads.

Mac version

I struggled to make that command work on my Mac (some of the command line tools are frustratingly different). Here's an alternative that works on my Mac:

gunzip -c long_reads.fastq.gz | grep -A 1 --no-group-separator -E $(minimap2 -x map10k -c -t 8 sequence.fasta long_reads.fastq.gz | awk '{if ($2 >= 2000 && $11 >= 1000 && $10/$11 >= 0.85) print $1;}' | sort | uniq | tr '\n' '|' | sed 's/.$//') | sed 's|^@|>|' > reads.fasta

Note that you might need to brew install grep to get the GNU version of grep instead of the BSD one Macs come with.

Sorting by position

Sometimes, after extracting the reads, I like to sort them by their order on the reference/query sequence. This command does that:

minimap2 -a -x map10k -t 4 assembly.fasta reads.fasta | samtools view -F 258 -F 2048 | sort -u -k1,1 | sort -k3,3 -k4,4n | awk '{print ">"$1"\n"$10}' > reads_sorted.fasta

This command also flips reverse-strand reads to the forward strand, making reads on the same strand. This can make the reads a bit easier to work with, especially if you're going to be viewing their alignments in Bandage.

Explanation:

  • minimap2 -a -x map10k -t 4 assembly.fasta reads.fasta aligns the reads to the sequence of interest, this time outputting in SAM format.
  • samtools view -F 258 -F 2048 gets rid of non-primary alignments
  • sort -u -k1,1 makes sure we have just one line per read
  • sort -k3,3 -k4,4n sorts the SAM by reference and position
  • awk '{print ">"$1"\n"$10}' converts the SAM to FASTA