-
Notifications
You must be signed in to change notification settings - Fork 15
/
3.02-bioinf-file-formats.Rmd
1732 lines (1472 loc) · 89.3 KB
/
3.02-bioinf-file-formats.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Bioinformatic file formats
Almost all the high-throughput sequencing data you will deal with should
arrive in just a few different formats. There are some specialized formats
(like those output by the program TASSEL, etc.) but we will largely ignore
those, focusing instead on the formats used in production by the 1000 genomes
and 10K vertebrate genomes projects. In a sentence, the most important are:
FASTA, FASTQ, SAM, BAM, and VCF.
Plan: Go over these, and for each, pay special attention to compressed and indexed
forms and explain why that is so important. I think that we should probably
talk about the programs that are available for manipulating each of these, as well,
but I won't add that until we all have access to an HPC or other proper
Unix environment.
I was originally going to do tools for manipulating files in the different
formats, but I will do that in a separate chapter(s) later.
## Sequences
## FASTQ {#fastq}
The FASTQ format is the standard format for lightly-processed data
coming off an Illumina machine. If you are doing whole genome sequencing,
it is typical for the Illumina processing pipeline to have separated all
the reads with different barcodes into different, separate files. Typically
this means that all the reads from one library prep for one sample will
be in a single file. The barcodes and any associated additional adapter
sequence is typically also removed from the reads. If you are processing
RAD data, the reads may not be separated by barcode, because, due to the
vagaries of the RAD library prep, the barcodes might appear on different ends
of some reads than expected.
A typical file extension for FASTQ files is `.fq`.
Almost all FASTQ files you get from a sequencer should be gzipped to save space.
Thus, in order to view the file you will have to uncompress it. Since you would,
in most circumstances, want to visually inspect just a few lines, it is best
to do that with `gzcat` and pipe the output to `head`.
As we have seen, paired-end sequencing produces two separate reads of a DNA
fragment. Those two different reads are usually stored in two separate files named
in such a manner as to transparently denote whether it contains sequences obtained
from read 1 or read 2.
For example `bird_08_B03_R1.fq.gz` and `bird_08_B03_R2.fq.gz`. Read 1 and Read 2 from
a paired read must occupy the sames lines in their respective files, i.e., lines
10001-10004 in `bird_08_B03_R1.fq.gz` and lines 10001-10004 in `bird_08_B03_R2.fq.gz`
should both pertain to the same DNA fragment that was sequenced. That is what is
meant by "paired-end" sequencing: the sequences come in pairs from different
ends of the same fragment.
The FASTQ format is _very_ simple: information about each read occupies just four lines.
This means that the number of lines in a proper FASTQ file must always be a multiple of four.
Briefly, the four lines of information about each read are always in the same order as follows:
1. An Identifier line
2. The DNA sequence as A's, C's, G's and T's.
3. A line that is almost always simply a `+` sign, but can optionally be followed
by a repeat of the ID line.
4. An ASCII-encoded, Phred-scaled base quality score. This gives an estimated
measure of certainty about each base call in the sequence.
The code block below shows three reads worth (twelve lines) of
information from a FASTQ file.
Take a moment to identify the four different lines for each read.
```
@K00364:64:HTYYCBBXX:1:1108:4635:14133/1
TAGAATACGCCAGGTGTAAGAATAGTAGAATACGCCAGGTGTAAGAATAGTAGAACACGCCAGGTGTAAGAATAGTAGAA
+
AAAFFJJJJJFFJFJJJFJJFFJFJFJJJJJJJJFFJJJFJJJFJJAJJFJJFJJFJJJ7JJJF-<JAFJJ<F<AJAJJF
@K00364:64:HTYYCBBXX:1:1108:5081:25527/1
AAAACACCAAAAGAAAGATGCCCAGGGTCCCTGCTCATCTGCGTGAAAGTGACTTAGGCATGCTGCAAGGAGGCATGAGG
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@K00364:64:HTYYCBBXX:1:1108:5852:47295/1
AGGTGGCTCTAGAAGGTTCTCGGACCGAGAAACAGCCTCGTACATTTGCAATGATTTCAATTCATTTTGACCATTACGAA
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
```
Lines 2 and 3 are self-explanatory, but we will expound upon lines 1 and 4 below.
### Line 1: Illumina identifier lines {#illumina-ids}
The identifier line can be just about any string that starts with an `@`, but, from Illumina data
you might see something like this:
```
@K00364:64:HTYYCBBXX:1:1108:4635:14133/1
```
The colons (and the `/`) are field separators. The separate parts of the line
above are interpreted something along the lines as follows (keeping in mind that
Illumina occasionally changes the format and that there may be additional
optional fields):
```
@ : mandatory character that starts the ID line
K00364 : Unique sequencing machine ID
64 : Run number on instrument
HTYYCBBXX : Unique flow cell identifier
1 : Lane number
1108 : Tile number (section of the lane)
4635 : x-coordinate of the cluster within the tile
14133 : y-coordinate of the cluster within the tile
1 : Whether the sequence is from read 1 or read 2
```
For Illumina data processed with versions 1.8+ of Casava, the identifier lines are a little
different, like this:
```
@A00600:191:HY75HDSX2:1:2624:27480:35744 2:N:0:AAGACCGT+CAATCGAC
```
in which we have:
```
@ : mandatory character that starts the ID line
A00600 : Unique sequencing machine ID
191 : Run number on instrument
HY75HDSX2 : Unique flow cell identifier
1 : Lane number
2624 : Tile number (section of the lane)
27480 : x-coordinate of the cluster within the tile
35744 : y-coordinate of the cluster within the tile
2 : Whether the sequence is from read 1 or read 2
N : N if the sequence was not flagged as crappy by the maching, Y otherwise
0 : 0 when no control bits are turned on
AAGACCGT+CAATCGAC : Index/barcode of the sample
```
Question: For paired reads, do you expect the x- and y-coordinates for read 1 and read 2 to
be the same?
### Line 4: Base quality scores {#bqscores}
The base quality scores give an estimate of the probability that the base call is incorrect.
This comes from data the sequencer collects on the brightness and compactness of the cluster
radiance, other factors, and Illumina's own models for base call accuracy. If we let $p$
be the probability that a base is called incorrectly, then $Q$, the Phred-scaled base quality
score, is:
$$
Q = \lfloor-10\log_{10}p\rfloor,
$$
where $\lfloor x \rfloor$ means "the largest integer smaller than $x$.
To get the estimate of the probability that the base is called incorrectly from
the Phred scaled score, you invert the above equation:
$$
p = \frac{1}{10^{Q/10}}
$$
Base quality scores from Illumina range from 0 to 40.
The easiest values to use as guideposts are $Q = 0, 10, 20, 30, 40$, which correspond to
error probabilites of 1, 1 in 10, 1 in 100, 1 in 1,000, and 1 in 10,000, respectively.
All this is fine and well, but when we look at the quality score line above we see
something like `7JJJF-<JAFJJ<F<AJAJJF`. What gives? Well, from a file storage and
parsing perspective, it makes sense to only use a single character to store the
quality score for every base. So, that is what has been done: each of those
single characters represents a quality score---a number between 0 and 40, inclusive.
The values that have been used are the _decimal representations of ASCII text characters_ minus 33.
The decimal representation or each character can be found in Figure \@ref(fig:ascii).
```{r ascii, echo=FALSE, fig.align='center', dpi=270, fig.cap='This lovely ASCII table shows the binary, hexadecimal, octal and decimal representations of ASCII characters (in the corners of each square; see the legend rectangle at bottom. Table produced from TeX code written and developed by Victor Eijkhout available at [https://ctan.math.illinois.edu/info/ascii-chart/ascii.tex](https://ctan.math.illinois.edu/info/ascii-chart/ascii.tex)'}
knitr::include_graphics("figs/ascii-crop.png", auto_pdf = TRUE)
```
The decimal representation is in the upper left of each character's rectangle.
Find the characters corresponding to base quality scores of 0, 10, 20, 30, and 40. Remember that the
base quality score is the character's decimal representation _minus 33_.
Here is another question: why do you think the scale starts with ASCII character 33?
### A FASTQ 'tidyverse' Interlude
Here we demonstrate some R code while exploring FASTQ files. First, in order to
do these exercises you will want to download and launch the RStudio project,
`big-fastq-play`,
that has the data
in it. Please work within that RStudio project to do these exercises.
Here is a direct download link to it:
[https://docs.google.com/uc?export=download&id=1iD8tz_KSOHDBpXvXssqo3noPZAm1Qsjp](https://docs.google.com/uc?export=download&id=1iD8tz_KSOHDBpXvXssqo3noPZAm1Qsjp)
Note that this might stress out older laptops as it loads 100s of Mb of sequence
data into memory.
#### Reading FASTQs with `read_lines`
Load packages:
```r
library(tidyverse)
# if you don't have this package, get it
# install.packages("viridis")
library(viridis)
```
Read the FASTQ, make it a matrix, then make it a tibble
```r
# use read_lines to read the R1 fastq file line by line;
# then make a 4 column matrix, filling by rows
# then drop column 3, which corresponds to the "+" line
R1 <- read_lines("data/Battle_Creek_01_chinook_R1.fq.gz") %>%
matrix(ncol = 4, byrow = TRUE) %>%
.[,-3]
# add colnames
colnames(R1) <- c("ID", "seq", "qual")
# now make a tibble out that. We will assign
# it back to the variable R1, to note carry
# extra memory around
R1 <- as_tibble(R1)
# Look at it:
R1
# OK, 1 million reads.
```
Look at the first ID line:
```
@E00430:101:HKT7WCCXY:1:1101:6411:1204 1:N:0:NGATGT
```
Aha! This is a slightly different format than the above. The part after the
space has colon-separated fields that are:
```
1 : which read of the pair
N : has this been filtered (Y/N)
0 : control number (always 0 on HiSeq and NextSeq)
NGATGT : barcode on the read
```
OK, our mission is to actually look at the locations of these reads on different
tiles. To do that we will want to access the x and y coordinates and the
tiles, etc. In the tidyverse, this means giving each of those things its own
column.
#### tidyr::separate()
How do we break those colon-separated fields into columns? This is a job for
`tidyr::separate` which breaks a text string on user-defined separators into
columns in a tibble.
Here we can use it to break the ID into two parts split on the space, and then
break the first part of the ID into its constituent parts:
```r
# first we break on the space,
# then we break the ID on the colons, but keep the original "id" for later
R1 %>%
separate(ID, into = c("id", "part2"), sep = " ") %>%
separate(
id,
into = c("machine", "run", "flow_cell", "lane", "tile", "x", "y"),
sep = ":",
remove = FALSE
)
```
Wow! That was cool. Now, your mission is to pipe that (with `%>%`)
into another `separate()` command that breaks part2 into
`read`, `filter`, `cnum`, and `barcode`, and save that into `R1_sep`
Here is a starter:
```r
R1_sep <- R1 %>%
separate(ID, into = c("id", "part2"), sep = " ") %>%
separate(
id,
into = c("machine", "run", "flow_cell", "lane", "tile", "x", "y"),
sep = ":",
remove = FALSE
) %>%
...your code here...
# when you are done with that, look at it
R1_sep
```
Doh! There is one thing to note. Look at the first few columns of that:
```r
# A tibble: 1,000,000 x 11
id machine run flow_cell lane tile x y
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 @E00430:10… @E00430 101 HKT7WCCXY 1 1101 6411 1204
2 @E00430:10… @E00430 101 HKT7WCCXY 1 1101 7324 1204
3 @E00430:10… @E00430 101 HKT7WCCXY 1 1101 8582 1204
4 @E00430:10… @E00430 101 HKT7WCCXY 1 1101 9841 1204
5 @E00430:10… @E00430 101 HKT7WCCXY 1 1101 10186 1204
```
The x- and the y-coordinates are listed as characters (`<chr>`) but they should
be numeric. This shows the default behavior of `separate()`: it just breaks each
field into a column of strings. However, you can ask `separate()` to make a good-faith
guess of the type of each column and convert it to that. This works suitably in this
situation, so, let's repeat the last command, but convert types automatically:
```{r, eval=FALSE}
R1_sep <- R1 %>%
separate(ID, into = c("id", "part2"), sep = " ") %>%
separate(
id,
into = c("machine", "run", "flow_cell", "lane", "tile", "x", "y"),
sep = ":",
remove = FALSE,
convert = TRUE
) %>%
separate(part2, into = c("read", "filter", "cnum", "barcode"), sep = ":", convert = TRUE)
```
#### Counting tiles
Let's see how many tiles these reads came from. Basically we just want to count the
number of rows with different values for tile. Read the documentation for
`dplyr::count` with
```r
?count
```
and when you are done count up the number of reads in each tile:
```r
R1_sep %>%
...your code here...
```
Turns out they are all from the same tile...
#### Parsing quality scores
Now, we want to turn the quality-score ASCII-characters into Phred-scaled
qualities, ultimately taking the average over each sequence of those.
Check out this function:
```{r}
utf8ToInt("!*JGH")
```
We can use that to get Phred-scaled values by subtracting 33 from the result. Let's check:
```{r}
utf8ToInt("!+5?IJ") - 33
```
Note that `J` seems to be used for "below 1 in 10,000" as it is the highest I have ever seen.
So, what we want to do is make a new column called `mean_qual` that gives the
mean of the Phred-scaled qualities. Any time you need to make a new column in a
tibble, where the result in each row depends only on the values in current columns
in that same row, that is a job for `mutate()`.
However, in this case, because the `utf8ToInt()` function doesn't take vector input,
computing the `mean_qual` for every row requres using one of the `map()` family of functions.
It is a little beyond
what we want to delve into at the moment. But here is what it looks like:
```{r, eval=FALSE}
R1_sepq <- R1_sep %>%
mutate(mean_qual = map_dbl(.x = qual, .f = function(x) mean(utf8ToInt(x) - 33))
```
Check out the distribution of those mean quality scores:
```r
ggplot(R1_sepq, aes(x = mean_qual)) +
geom_histogram(binwidth = 1)
```
#### Investigating the spatial distribution of reads and quality scores
Now, use the above as a template to investigate the distribution of the
x-values:
```r
ggplot(R1_sepq, aes(...your code here...)) +
geom_histogram(bins = 500)
```
And, do the same for the y-values.
```r
# Dsn of y
ggplot(R1_sepq, aes(x = y)) +
geom_histogram(bins = 500) +
coord_flip()
```
Hmm... for fun, make a 2-D hex-bin plot
```r
ggplot(R1_sepq, aes(x = x, y = y)) +
geom_hex(bins = 100) +
scale_fill_viridis_c()
```
That is super interesting. It looks like the flowcell or camera must have a mild
issue where the smears are between y = 20,000 and 30,000.
It is natural to wonder if the quality scores of the reads that did actually get
recovered from those regions have lower quality scores. This makes a hexbin plot
of the mean quality scores:
```r
# hexbin of the mean quality score
ggplot(R1_sepq, aes(x = x, y = y, z = mean_qual)) +
stat_summary_hex(bins = 100) +
scale_fill_viridis_c()
```
Cool.
### Comparing read 1 to read 2
One question that came up in class is whether the quality of read 2 is typically lower
than that of read 1. We can totally answer that with the data we have. It would involve,
1. reading in all the read 2 reads and separating columns.
2. computing the read 2 mean quality scores
3. joining (see `left_join()`) on the `id` columns and then making a scatter plot.
Go for it...
## FASTA {#fasta}
The FASTQ format, described above, is tailored for representing short DNA sequences---and their associated
quality scores---that have been generated from high-throughput sequencing machines.
A simpler, leaner format is used to represent longer DNA sequences that have typically
been established from a lot of sequencing, and which no longer travel with their
quality scores. This is the FASTA format, which you will typically see storing
the DNA sequence from _reference genomes_. FASTA files typically use the file extensions `.fa`, `.fasta`,
or `.fna`, the latter denoting it as a FASTA file of nucleotides.
In an ideal world, a reference genome
would contain a single, uninterrupted sequence of DNA for every chromosome. While the resources
for some well-studied species include "chromosomal-level assemblies" which have much
sequence organized into chromosomes in a FASTA file, even these genome assemblies often
include a large number of short fragments of sequence that are known to belong
to the species, but whose location and orientation in the genome remain unknown.
More often, in conservation genetics, the reference genome for an organism you are working
on might be the product of a recent, small-scale, assembly of a _low-coverage genome_.
In this case, the genome may be represented by thousands, or tens of thousands, of _scaffolds_,
only a few of which might be longer than one or a few megabases. All of these scaffolds
go into the FASTA file of the reference genome.
Here are the first 10 lines of the FASTA holding a reference genome for
Chinook salmon:
```
>CM008994.1 Oncorhynchus tshawytscha isolate JC-2011-M1
AGTGTAGTAGTATCTTACCTATATAGGGGACAGTGTAGTAGTATCTTACTTATTTGGGGGACAATGCTCTAGTGTAGTAG
AATCTTACCTTTATAGGGGACAGTGCTGGAGTGCACTGGTATCTTACCTATATAGGGGACAGTGCTGGAGTGTAGTAGTG
TCTCGGCCCACAGCCGGCAGGCCTCAGTCTTAGTTAGACTCTCCACTCCATAAGAAAGCTGGTACTCCATCTTGGACAGG
ACATAGACAGGGACCACCTGCAGGACACACACGCAGGTTTACTAAGGGTTTACTCAACACAGTGAACAGCATATACCAGA
TGTGTGGTACATGTTTACAGAGAAGGAGtatattaaaaacagaaaactgTTTTGGttgaaatatttgtttttgtctgaAG
CCCGAAAAACACATGAAATTCAAAAGATAATTTGACCTACGCACTAACTAGGCTTTTCAGCAGCTCAACTACTGTCCGTT
TATTGATCTACTGTACTGCAACAACATATGTACTCACACAACAGACTATATTGGATTCAGACAGGACCTATAGGTTACCA
TGCTTCCTCTCTACAGGACCTATAGGTTACCATGCTTCCTCTCTACAAGGTCTATAGGTTACCATGCGTCCTCTCTACAG
GACCTATAGGTTACCATGCTTCCTCTCTACAGGGCCTATAGGTTACCATGCTTCCTCTCTACAGGACCTGTAGGTTACCA
```
The format is straightforward: any line starting with `>` is interpreted as a
line holding the identifier of the following sequence. In this case, the identifier
is `CM008994.1`. The remainder of the line (following the first white space) are
further comments about the sequence, but will not typically be carried through
downstream analysis (such as aligment) pipelines. In this case `CM008994.1` is
the name of an assembled chromosome in this reference genome. The remaining lines give
the DNA sequence of that assembled chromosome.
It is convention with FASTA files that lines of DNA sequence should be less
than 80 characters, but this is inconsistently enforced by different analysis
programs. However, most of the FASTA files you will see will have lines that
are 80 characters long.
In the above fragment, we see DNA sequence that is either upper or lower case.
A common convention is that the lowercase bases are segments of DNA that
have been inferred (by, for example RepeatMasker) to include repetitive DNA.
It is worth noting this if you are trying to design assays from sequence
data! However, not all reference genomes have repeat-sequences denoted
in this fashion.
Most reference genomes contain gaps. Sometimes the length of these gaps
can be accurately known, in which case each missing base pair is replaced by
an `N`. Sometimes gaps of unknown length are represented by a string of `N`'s of
some fixed length (like 100).
Finally, it is worth reiterating that the sequence in a reference-genome FASTA file represents
the sequence only one strand of a double-stranded molecule. In chromosomal-scale assemblies
there is a convention to use the strand that has its 5' end at the telomere of the
short arm of the chromosome [@cartwrightMultiplePersonalitiesWatson2011].
Obviously, such a convention cannot be enforced
in a low-coverage genome in thousands of pieces. In such a genome, different
scaffolds will represent sequence on different strands; however the sequence
in the FASTA file, whichever strand it is upon, is taken to be the reference,
and that sequence is referred to as the _forward_ strand of the reference genome.
### Genomic ranges
Almost every aspect of genomics or bioinformatics involves talking about the
"address" or "location" of a piece of DNA in the reference genome. These locations
within a reference genome can almost universally be described in terms of a "genomic range"
with a format that looks like:
```
SegmentName:start-stop
```
For example,
```
CM008994.1:1000001-1001000
```
denotes the 1 Kb chunk of DNA starting from position 1000001 and proceeding
to (and including!) position 1001000.
Such nomenclature is often referred to as the _genomic coordinates_ of a
segment.
In most applications we will encounter, the first position in a chromosome is
labeled 1. This is called a _base 1 coordinate system_. In some genomic
applications, a _base 0 coordinate system_ is employed; however, for the most
part such a system is only employed internally in the guts of code of software that we
will use, while the user interface of the software consistently uses a
base 1 coordinate system.
Sometimes you will have to convert between base 0 and base1 coordinate systems.
Thinking about how to do this is easy if you keep a simple picture in your head---I
like to think of it as,
the base-1 coordinate system counts "beads" that are the actual base pairs,
while the base-0 system counts "fences" that separate the beads. In the following picture,
the fences and their corresponding numbers are red, while the beads and their corresponding
numbers are black.
```{r, echo=FALSE}
knitr::include_graphics("figs/base-0-and-1-cropped.svg")
```
From this picture, it is pretty clear that if you have a base-1
range from $x$ to $y$ (with $x \leq y$), then you could refer to that
by the base-0 range from $x-1$ to $y$ (those are the "fences" that contain the
beads).
Likewise, if you have a base-0 range from $v$ to $w$ (with $v<w$), then the corresponding
base-1 range would be from $v+1$ to $w$, as those are the beads that are contained by the
fences at $v$ and $w$.
### Extracting genomic ranges from a FASTA file
Commonly (for example, when designing primers for assays) it is necessary
to pick out a precise genomic range from a reference genome. This
is something that you should _never_ try to do by hand. That is too
slow and too error prone. Rather the software package `samtools` (which
will be discussed in detail later) provides the `faidx` utility to _index_ a FASTA
file. It then uses that index to provide lightning fast access to
specific genomic coordinates, returning them in a new FASTA file with
identifiers giving the genomic ranges. Here is an example using `samtools faidx` to
extract four DNA sequences of length 150 from within the Chinook salmon genome
excerpted above:
```sh
# assume the working directory is where the fasta file resides
# create the index
samtools faidx GCA_002831465.1_CHI06_genomic.fna
# that created the file: GCA_002831465.1_CHI06_genomic.fna.fai
# which holds four columns that constitute the index
# now we extract the four sequences:
samtools faidx \
GCA_002831465.1_CHI06_genomic.fna \
CM009007.1:3913989-3914138 \
CM009011.1:2392339-2392488 \
CM009013.1:11855194-11855343 \
CM009019.1:1760297-1760446
# the output is like so:
>CM009007.1:3913989-3914138
TTACCGAtggaacattttgaaaaacacaaCAATAAAGCCTTGTGTCCTATTGTTTGTATT
TGCTTCGTGCTGTTAATGGTAgttgcacttgattcagcagccgtAGCGCCGGGAAggcag
tgttcccattttgaaaaaTGTCATGTCTGA
>CM009011.1:2392339-2392488
gatgcctctagcactgaggatgccttagaccgctgtgccactcgggaggccttcaGCCTA
ACTCTAACTGTAAGTAAATTGTGTGTATTTTTGGGTACATTTCGCTGGTCCCCACAAGGG
GAAAGggctattttaggtttagggttaagg
>CM009013.1:11855194-11855343
TGAGGTTTCTGACTTCATTTTCATTCACAGCAGTTACTGTATGCCTCGGTCAAATTGAAA
GGAAAGTAAAGTAACCATGTGGAGCTGtatggtgtactgtactgtactgtattgtactgt
attgtgtgGGACGTGAGGCAGGTCCAGATA
>CM009019.1:1760297-1760446
ttcccagaatctctatgttaaccaaggtgtttgcaaatgtaacatcagtaggggagagag
aggaaataaagggggaagaggtatttatgactgtcataaacctacccctcaggccaacgt
catgacactcccgttaatcacacagactGG
```
### Downloading reference genomes from NCBI
I might want to write blurb here about how much nicer I find it is to
use the ftp link: [http://ftp.ncbi.nlm.nih.gov/genomes/genbank/](http://ftp.ncbi.nlm.nih.gov/genomes/genbank/) to find genomes on Genbank. Although now that the number of genomes is
growing so quickly, it can take quite a while for the pages to load. But I still find
it easier to navigate to exactly the files I want using this system than the actual
user interface that they have.
## Alignments {#sambamfiles}
A major task in bioinformatics is _aligning_ reads from a sequencing machine
to a reference genome. We will discuss the operational features of that task
in a later chapter, but here we treat the topic of the SAM, or
Sequence Alignment Map, file format which is widely used to represent the results of
sequence alignment. We attempt to motivate this topic by first considering a handful
of the intricacies that arise during sequence alignment, before proceeding
to a discussion of the various parts of the SAM file that are employed to handle
the many and complex ways in which DNA alignments can occur and be represented.
This will necessarily be an incomplete and relatively humane introduction to SAM files.
For the adventurous a more complete---albeit astonishingly terse---description of the
[SAM format specification](https://samtools.github.io/hts-specs/SAMv1.pdf)
is maintained and regularly updated.
### How might I align to thee? Let me count the ways...
We are going to consider the alignment of very short (10 bp)
paired-end reads from the ends of a short (50 bp) fragment from
the fourth line of the FASTA file printed above. In other words,
those 80 bp of the reference genome are:
```
5' ACATAGACAGGGACCACCTGCAGGACACACACGCAGGTTTACTAAGGGTTTACTCAACACAGTGAACAGCATATACCAGA 3'
```
And we will be considering double-stranded DNA occupying the middle 50 base
pairs of that piece of reference genome. That piece of double stranded
DNA looks like:
```
5' ACCTGCAGGACACACACGCAGGTTTACTAAGGGTTTACTCAACACAGTGA 3'
||||||||||||||||||||||||||||||||||||||||||||||||||
3' TGGACGTCCTGTGTGTGCGTCCAAATGATTCCCAAATGAGTTGTGTCACT 5'
```
If we print it alongside (underneath, really) our reference genome,
we can see where it lines up:
```
5' ACATAGACAGGGACCACCTGCAGGACACACACGCAGGTTTACTAAGGGTTTACTCAACACAGTGAACAGCATATACCAGA 3'
5' ACCTGCAGGACACACACGCAGGTTTACTAAGGGTTTACTCAACACAGTGA 3'
||||||||||||||||||||||||||||||||||||||||||||||||||
3' TGGACGTCCTGTGTGTGCGTCCAAATGATTCCCAAATGAGTTGTGTCACT 5'
```
Now, remember that any template being sequenced on an Illumina machine is
going to be single-stranded, and we have no control over which strand, from
a double-stranded fragment of DNA, will get sequenced. Futhermore, recall that
for this tiny example, we are assuming that the reads are only 10 bp long.
Ergo, if everything has gone according to plan, we can expect to see two
different possible _templates_, where I have denoted the base pairs
that do not get sequenced with `-`'s
```
either:
5' ACCTGCAGGA------------------------------AACACAGTGA 3'
or:
3' TGGACGTCCT------------------------------TTGTGTCACT 5'
```
If we see the top situation, we have a situation in which the template that reached
the lawn on the Illumina machine comes from the strand that is represented
in the reference genome. This is called the _forward strand_. On the other hand,
the bottom situation is one in which the template is from the reverse complement
of the strand represented by the reference. This is called the _reverse strand_.
Now, things start to get a little more interesting, because we don't get to look at the
the entire template as one contiguous piece of DNA in the 5' to 3' direction.
Rather, we get to "see" one end of it by reading Read 1 in the 5' to 3' direction,
and then we "see" the other end of it by reading Read 2, also in the 5' to 3' direction,
_but Read 2 is read off the complementary strand_.
So, if we take the template from the top situation:
```
the original template is:
5' ACCTGCAGGA------------------------------AACACAGTGA 3'
So the resulting reads are:
Read 1: 5' ACCTGCAGGA 3' --> from 5' to 3' on the template
Read 2: 5' TCACTGTGTT 3' --> the reverse complement of the
read on the 3' end of the template
```
And if we take the template from the bottom scenario:
```
the original template is:
3' TGGACGTCCT------------------------------TTGTGTCACT 5'
So the resulting reads are:
Read 1: 5' TCACTGTGTT 3' --> from 5' to 3' on the template
Read 2: 5' ACCTGCAGGA 3' --> the reverse complement of the
read on the 3' end of the template
```
Aha! Regardless of which strand of DNA the original template
comes from, sequences must be read off of it in a 5' to 3' direction
(as that is how the biochemistry works). So, there are only two
possible sequences you will see, and these correspond to reads from 5' to 3'
off of each strand. So, the only difference that happens when
the template is from the forward or the reverse strand (relative to the
reference), is whether Read 1 is from the forward strand and Read 2 is
from the reverse strand, or whether Read 1 is from the reverse strand
and Read 2 is from the forward strand. The actual pair of sequences
you will end up seeing is still the same.
So, to repeat, with a segment of DNA that is a faithful copy of the reference genome,
there are only two read sequences that you might see, and as we will show
below _Read 1 and Read 2 must
align to opposite strands of the reference._
What does a faithful segment from the reference genome look like
in alignment? Well, in the top case we have:
```
Read 1: 5' ACCTGCAGGA 3'
5' ACATAGACAGGGACCACCTGCAGGACACACACGCAGGTTTACTAAGGGTTTACTCAACACAGTGAACAGCATATACCAGA 3'
forward-strand
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
reverse-strand
3' TGTATCTGTCCCTGGTGGACGTCCTGTGTGTGCGTCCAAATGATTCCCAAATGAGTTGTGTCACTTGTCGTATATGGTCT 5'
Read 2: 3' TTGTGTCACT 5'
```
And in the bottom case we have:
```
Read 2: 5' ACCTGCAGGA 3'
5' ACATAGACAGGGACCACCTGCAGGACACACACGCAGGTTTACTAAGGGTTTACTCAACACAGTGAACAGCATATACCAGA 3'
forward-strand
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
reverse-strand
3' TGTATCTGTCCCTGGTGGACGTCCTGTGTGTGCGTCCAAATGATTCCCAAATGAGTTGTGTCACTTGTCGTATATGGTCT 5'
Read 1: 3' TTGTGTCACT 5'
```
Note that, although one of the reads always aligns to the reverse strand, the position
at which it is deemed to align is still read off of the position on the forward strand.
(Thank goodness for that! Just
think how atrocious it would be if we counted positions of fragments mapping to the reverse strand
by starting from the reverse strands's 5' end, on the other end of the chromosome, and counting
forward from there!!)
Note that the _alignment position_ is one of the most important pieces of information
about an alignment. It gets recorded in the **POS** column of an alignment. It is recorded as the
first position (counting from 1 on the 5' end of the forward reference strand) at which
the alignment starts. Of course, the name of the reference sequence the read maps to is
essential. In a SAM file this is called the **RNAME** or reference name.
Both of the last two alignments illustrated above involve paired end reads that align
"properly," because one read in the pair aligns to the forward strand and one read
aligns to the reverse strand of the reference genome. As we saw above, that is just what
we would expect if the template we were sequencing is a faithful copy (apart from a few SNPs
or indels) of either the forward or the reverse strand of the reference sequence. In alignment
parlance we say that each of the reads is "mapped in a proper pair." This is obviously
an important piece of information about an alignment and it is recorded in a SAM file
in the so-called **FLAG** column for each alignment. (More on these flags later...)
How can a read pair not be properly mapped? There are a few possibilities:
(@) One read of the pair gets aligned, but the other does not. For example something
that in our schematic would look like this:
```
5' ACATAGACAGGGACCACCTGCAGGACACACACGCAGGTTTACTAAGGGTTTACTCAACACAGTGAACAGCATATACCAGA 3'
5' ACCTGCAGGA 3'
```
(@) Both reads of the pair map to the same strand. If our paired end reads looked like:
```
Read 1: 5' ACCTGCAGGA 3'
Read 2: 5' AACACAGTGA 5'
```
then they would both align nicely to just the forward strand of the reference genome:
```
5' ACATAGACAGGGACCACCTGCAGGACACACACGCAGGTTTACTAAGGGTTTACTCAACACAGTGAACAGCATATACCAGA 3'
5' ACCTGCAGGA------------------------------AACACAGTGA 3'
```
And, as we saw above, this would indicate the the template must not
conform to the reference genome in some way.
This may occur if there is a rearrangement (like an inversion, etc.) in the
genome being sequenced, relative to the reference genome.
(@) The two different reads of a pair get aligned to different chromosomes/scaffolds or
they get aligned so far apart on the same chromosome/scaffold that the alignment program
determines the pair to be aberrant. This evaluation requires that the program have
a lot of other paired end reads from which to estimate the distribution, in the sequencing library,
of the _template length_---the length of the original template. The template length
for each read pair is calculated from the mapping positions of the two reads and
is stored in the **TLEN** column of the SAM file.
What is another way I might align to thee? Well, one possbility is that a read pair
might align to many different places in the genome (this can happen if the reads are from
a repetitive element in the genome, for example). In such cases, there is typically
a "best" or "most likely" alignment, which is called the _primary alignment_. The SAM
file output might record other "less good" alignments, which are called _secondary alignments_
and whose status as such is recorded in the **FLAG** column. The aligner `bwa mem` has an
option to allow you to output all secondary alignments. Since you don't typically output and inspect
all secondary alignments (something that would be an unbearable task), most aligners
provide some measure of confidence about the alignment of a read pair. The aligner, _bwa_, for example,
looks at all possible alignments and computes a score for each. Then it evaluates confidence
in the primary alignment by comparing its score to the sum of the scores of all the alignments.
This provides the _mapping quality score_ found in the
**MAPQ** column of a SAM file. It can be interpreted, roughly, as the probability that
the given alignment of the read pair is incorrect. These can be small probabilities,
and are represented as Phred scaled values (using integers, not characters!) in the
SAM file.
The last way that a read might align to a reference is by not perfectly matching
every base pair in the reference. Perhaps only the first part of the read matches
base pairs in the reference, or maybe the read contains an insertion or a deletion.
For example, if instead of appearing like `5' ACCTGCAGGA 3'`, one of our reads
had an insertion of AGA, giving: `5' ACCAGAGTGCAGGA 3'`, this fragment would
still align to the reference, at 10 bp, and we might record that alignment,
but would still want a compact way of denoting the position and length of the
insertion---a task handled by the **CIGAR** column.
To express all these different ways in which an alignment can occur,
each read occupies a single line in a SAM file. This row holds information about
the read's alignment to the reference genome in a number of TAB-delimited columns.
There are 11 required columns in each alignment row, after which different aligners
may provide additional columns of information.
Table \@ref(tab:sam-cols) gives a brief description of the 11 required columns (intimations
of most of which occurred in **ALL CAPS BOLDFACE** in the preceding paragraphs. Some,
like **POS** are relatively self-explanatory. Others, like **FLAG** and **CIGAR**
benefit from further explanation as given in the subsections below.
```{r, echo=FALSE, message=FALSE}
tab <- readr::read_delim("table_inputs/sam-columns-table.txt", delim = "&", trim_ws = TRUE)
pander::pander(
tab,
booktabs = TRUE,
caption = '(\\#tab:sam-cols) Brief description of the 11 required columns in a SAM file.',
justify = "left")
```
### Play with simple alignments
Now, everyone **who has a Mac** should clone the RStudio project repository on GitHub at [https://github.com/eriqande/alignment-play](https://github.com/eriqande/alignment-play)
(by opening a new RStudio project with the "From Version Control" --> GitHub option, for example).
This has a notebook that will let us do simple alignments and familiarize
ourselves with the output in SAM format.
### SAM Flags
The FLAG column expresses the status of
the alignment associated with a given read (and its _mate_ in paired-end
sequencing) in terms of a combination of
12 yes-or-no statements. The combination of all of these "yesses" and "nos" for a
given aligned read is
called its SAM _flag_. The yes-or-no status of any single one of the twelve statements is called a "bit" because it can be
thought of as a single binary digit
whose value can be 0 (No/False) or 1 (Yes/True). Sometimes a picture can be helpful:
we can represent each statement as a circle which is shaded if it is true and open
if it is false. Thus, if all 12 statements are false you would see $\bitsopen~\bitsopen~\bitsopen$. However,
if statements 1, 2, 5, and 7 are true then you would see $\bitsmany$. In computer parlance
we would say that bits 1, 3, 5, and 7 are "set" if they indicate Yes/True. As these are bits
in a binary number, each bit is associated with a power of 2 as shown in Table \@ref(tab:sam-flags),
which also lists the meaning of each bit.
```{r, echo=FALSE, message=FALSE}
tab <- readr::read_delim("table_inputs/sam-flag-table.txt", delim = "&", trim_ws = TRUE)
pander::pander(
tab,
booktabs = TRUE,
caption = '(\\#tab:sam-flags) SAM flag bits in a nutshell. The description of these in the SAM specification is more general, but if we restrict ourselves to paired-end Illumina data, each bit can be interpreted by the meanings shown here. The "bit-grams" show a visual representation of each bit with open circles meaning 0 or False and filled circles denoting 1 or True. The bit grams are broken into three groups of four, which show the values that correspond to different place-columns in the hexadecimal representation of the bit masks.',
justify = "left")
```
If we think of the 12 bits as coming in three groups of four we can easily represent them as hexadecimal
numbers. Hexadecimal number are numbers in base-16. They are expressed with a leading "0x" but otherwise behave
like decimal numbers, except that instead of a 1's place, 10's place, and 100's place, and so on, we have
a 1's place, a 16's place, and 256's place, and so forth.
In the first group of four bits (reading from right to left) the bits correspond to
0x1, 0x2, 0x4, and 0x8, in hexadecimal. The next set of four bits correspond to 0x10, 0x20, 0x40, and 0x80,
and the last set of four bits correspond to 0x100, 0x200, 0x400, 0x800. It can be worthwhile becoming comfortable with
these hexadecimal names of each bit.
In the SAM format,
the **FLAG** field records the decimal (integer) equivalent of the binary number that represents
the yes-or-no answers to the 12 different statements. It is relatively easy to do arithmetic with
the hexadecimal flags to
find the decimal equivalent: add up the numbers in each of the three hexadecimal value places
(the 1's, 16's, and 256's places)
and multiply the result by 16 raised to the number of zeros right of the "x" in the hexadecimal number. For example if the
bits set on an alignment are 0x1 & 0x2 & 0x10 & 0x40, then they sum column-wise to 0x3, and 0x50, so the
value listed in the FLAG field of a SAM file would be $3 + 5 \cdot 16 = 83$.
While it is probably possible to get good at computing these 12-bit combinations
from hexadecimal in your
head, it is also quite convenient to use the Broad Institute's wonderful
[SAM flag calculator](https://broadinstitute.github.io/picard/explain-flags.html).
We will leave our discussion of the various SAM flag values by noting that the large SAM-flag bits
(0x100, 0x200, 0x400, and 0x800) all signify something "not good" about the alignment.
The same goes for 0x4 and 0x8. On the other hand, when you are dealing with paired-end
data, one of the reads has to be read 1 and the other read 2, and that is known from their
read names and the FASTQ file that they are in. So, we expect that 0x40 and 0x80 should always be set,
trivially. With paired-end data, we are always comforted to see bits 0x1 and 0x2 set, as departures from
that condition indicate that the pairing of the read alignments does not make sense given the
sequence in the reference genome. As
we saw in our discussion of how a template can properly map to a reference, you should be able to convince
yourself that, in a properly mapped alignment, exactly one of the two bits 0x10 and 0x20 should be set for
one read in the pair, and the other should be set for the other.
Therefore, in good, happy, properly paired reads, from a typical whole genome sequencing library
preparation, we should find either:
```
read 1 : 0x1 & 0x2 & 0x10 & 0x40 = 83
read 2 : 0x1 & 0x2 & 0x20 & 0x80 = 163
```
or
```
read 1 : 0x1 & 0x2 & 0x20 & 0x40 = 99
read 2 : 0x1 & 0x2 & 0x10 & 0x80 = 147
```
So, now that we know all about SAM flags and the values that they take,
what should we do with them? First, investigating the distribution of
SAM flags is an important way of assessing the nature and reliability of
the alignments you have made (this is what _samtools flagstat_ is for, as discussed
in a later chapter). Second, you might wonder if you should do some sort
of filtering of your alignments before you do variant calling. With most modern
variant callers, the answer to that is, "No." Modern variant callers take account of
the information in the SAM flags to weight information from different alignments, so,
leaving bad alignments in your SAM file should not have a large effect on the final
results. Furthermore, filtering out your data might make it hard to follow up on interesting patterns in
your data, for example, the occurrence of improperly aligning reads can be used to
infer the presence of inversions. If all those improperly paired reads had been discarded,
they could not be used in such an endeavor.
### The CIGAR string {#cigar}
**CIGAR** is an acronym for Compressed Idiosyncratic Gapped Alignment Report. It provides a
space-economical way of describing the manner in which a single read aligns to a reference
genome. It is particularly important for recording the presence of insertions or deletions within
the read, relative to the reference genome. This is done by counting up, along the alignment, the
number of base pairs that: _match_ (`M`) the reference; that are _inserted_ (`I`) into the read and
absent from the reference; and that are _deleted_ (`D`) from the read, but present in the reference.
To arrive at the syntax of the CIGAR string you catenate a series of Number-Letter pairs that describe the
sequence of matches, insertions and deletions that describe an alignment.
Some examples are in order. We return to our 80 base-pair reference from above and consider
the alignment to it of a 10 bp read that looks like `5' ACCTGCAGGA 3'`:
```
5' ACATAGACAGGGACCACCTGCAGGACACACACGCAGGTTTACTAAGGGTTTACTCAACACAGTGAACAGCATATACCAGA 3'
5' ACCTGCAGGA 3'
```
Such an alignment has no insertions or deletions, or other weird things, going on. So its
CIGAR string would be `10M`, signifying 10 matching base pairs. A _very important_ thing
to note about this is that the `M` refers to bases that match in _position_ in the alignment
_even though they might not match the specific nucleotide types_. For example, even if
bases 3 and 5 in the read don't match the exact base nucleotides in the alignment, like this:
```
5' ACATAGACAGGGACCACCTGCAGGACACACACGCAGGTTTACTAAGGGTTTACTCAACACAGTGAACAGCATATACCAGA 3'
5' ACTTACAGGA 3'
```
its CIGAR string will typically still be `10M`. (The SAM format allows for an `X` to denote
mismatches in the base nucleotides between a reference and a read, but I have never seen
it used in practice.)
Now, on the other hand, if our read carried a deletion of bases 3 and 4. It would look
like `5' ACGCAGGA 3'` and we might represent it in an alignment like:
```
5' ACATAGACAGGGACCACCTGCAGGACACACACGCAGGTTTACTAAGGGTTTACTCAACACAGTGAACAGCATATACCAGA 3'
5' AC--GCAGGA 3'
```
where the `-`'s have replaced the two deleted bases. The CIGAR string for this
alignment would be `2M2D6M`.
Continuing to add onto this example, suppose that not only have bases 3 and 4 been deleted,
but also a four-base insertion of `ACGT` occurs in the read between positions 8 and 9 (of the original
read). That would appear like:
```
5' ACATAGACAGGGACCACCTGCAG----GACACACACGCAGGTTTACTAAGGGTTTACTCAACACAGTGAACAGCATATACCAGA 3'
5' AC--GCAGACGTGA 3'
```
where `-`'s have been added to the reference at the position of the insertion in the read.
The CIGAR string for this arrangement would be `2M2D4M4I2M` which can be hard to parse, visually,
if your eyes are getting as old as mine, but it translates to:
```
2 bp Match
2 bp Deletion
4 bp Match
4 bp Insert
2 bp Match
```
In addition to `M`, `D` and `I` (and `X`) there are also `S` and `H`, which are
typically seen with longer sequences. They refer to _soft-_ and _hard-clipping_,
respectively, which are situations in which a terminal piece of the read, from
either near the 3' or 5' end, does not align to the reference, but the central
part, or the other end of the read does. Hard clipping removes the clipped sequence
from the read as represented in the SEQ column, while soft clipping does not
remove the clipped sequence from the SEQ column representation.
One important thing to understand about CIGAR strings is that they always represent
the alignment as it appears in the 5' to 3' direction. As a consequence, it is the
same whether you are reading it off the read in the 5' to 3' direction _or_ if you are reading it
off from how the reverse complement of the read would align to the opposite strand of the reference.
Another picture is in order: if we saw a situation like the following,
with a deletion in Read 1 (which aligns to the reverse strand),
the CIGAR string would be, from 5' to 3' on Read 1, `6M2D2M`,
which is just what we would have if we were to align the reverse
complement of Read 1, called `Comp R1` below to the forward strand
of the reference.
```
Read 2: 5' ACCTGCAGGA 3' Comp R1: 5' AA--CAGTGA 3'
5' ACATAGACAGGGACCACCTGCAGGACACACACGCAGGTTTACTAAGGGTTTACTCAACACAGTGAACAGCATATACCAGA 3'
forward-strand
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||