Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

insert sizes change when BAM->CRAM->BAM #365

Closed
RichardCorbett opened this issue Mar 18, 2019 · 4 comments
Closed

insert sizes change when BAM->CRAM->BAM #365

RichardCorbett opened this issue Mar 18, 2019 · 4 comments
Labels
Milestone

Comments

@RichardCorbett
Copy link

RichardCorbett commented Mar 18, 2019

Hi folks,

We're looking into using CRAM for archiving our genome BAM files and recently noticed that although BWA mem alignments go through CRAM compression unedited, the minimap2 alignments occasionally (about 0.25% of reads) experience a change in insert size estimate in the post compression BAM.

Alignment command from BAM header:
minimap2 PN:minimap2 CL:minimap2 -t 32 -R @RG\tID:foo\tSM:GM24385_DNA_2.bam\tPL:ILLUMINA -ax sr /projects/rcorbettprj2/minION/AM38_1/GRCh37-lite.mmi GM24385_DNA_2.bam.fq1 GM24385_DNA_2.bam.fq2 VN:2.15-r905

I'm currently regenerating the alignments and can post an example in a hour or two if it is helpful.

The changes in insert size appear to be exclusively in reads with significant clipping, in case that is of interest.

If you think this is more likely a problem with CRAM rather than minimap2 I can post a ticket over there instead.

thanks,
RIchard

Before compression example:

HISEQ104_146:3:2106:23663:54436 101 1 66509 0 * = 66509 28 ATAGTATATATTATAATATATTATTTTATATTACATAATATATTTTATATATCACATATTACATGTTACATACATAATGTTAAAATATATATTATATTATGATATATTATATATCATACATATTATGTATCCTATATTGTATAATATATA AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJFFJJJFJJJJJJJJJJJJJJJJJJJJFFJJJJJAJJJJJJAFJJFAJAJJJJFJJJJFJJJJJFA<FJFFF--AFFFAFFJJ-F<AFA7<-FAFA RG:Z:1
HISEQ104_146:3:2106:23663:54436 153 1 66509 1 13S28M109S = 66509 -28 GTATCCTATATTGTATAATATATATTATATATTATATATTAGGTGTGTTGTCTATTATATGGTATATATAATTACATATGAAATGTATTATTATATATAACGATATGAATATGTAATTATATATTATTATTTATATTATATATAACTATA<7<F7JJJFF<JJJJJJJJJJJJFFJJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA s1:i:30 s2:i:30 RG:Z:1 NM:i:1 AS:i:46 de:f:0.0357 cm:i:1 nn:i:0 tp:A:P ms:i:46

Same read pair, after going BAM->CRAM->BAM

HISEQ104_146:3:2106:23663:54436 101 1 66509 0 * = 66509 0 ATAGTATATATTATAATATATTATTTTATATTACATAATATATTTTATATATCACATATTACATGTTACATACATAATGTTAAAATATATATTATATTATGATATATTATATATCATACATATTATGTATCCTATATTGTATAATATATA AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJFFJJJFJJJJJJJJJJJJJJJJJJJJFFJJJJJAJJJJJJAFJJFAJAJJJJFJJJJFJJJJJFA<FJFFF--AFFFAFFJJ-F<AFA7<-FAFA RG:Z:1
HISEQ104_146:3:2106:23663:54436 153 1 66509 1 13S28M109S = 66509 0 GTATCCTATATTGTATAATATATATTATATATTATATATTAGGTGTGTTGTCTATTATATGGTATATATAATTACATATGAAATGTATTATTATATATAACGATATGAATATGTAATTATATATTATTATTTATATTATATATAACTATA<7<F7JJJFF<JJJJJJJJJJJJFFJJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA s1:i:30 s2:i:30 NM:i:1 AS:i:46 de:f:0.0357 cm:i:1 nn:i:0 tp:A:P ms:i:46 RG:Z:1

@jmarshall
Copy link
Contributor

In this example read pair, the first read (the one with flags = 101) is unmapped. So minimap2 ought to be outputting the insert size as 0 for both reads, and the CRAM roundtrip has effectively updated the field to its correct value.

CRAM has some capabilities for recording incorrect field values so they can be preserved across roundtrips, but apparently not for TLEN (@jkbonfield?).

Do all the alignment records experiencing this change have one or both reads in the pair being unmapped? If so, this would appear to be indeed a minimap2 problem. (The SAM spec is pretty clear that TLEN should only be non-zero if both ends are mapped (to the same reference sequence, in fact), so this should be rare enough that IMHO it doesn't by itself warrant extending CRAM's incorrectness-preserving facilities to cover TLEN.)

@jkbonfield
Copy link

I think CRAM's TLEN preserving is finding the pair, checking the observed template size (this actually is 28, based on leftmost and rightmost coordinates from one read only), checking if it matches the stored figure (it does) and if so lets it go. Otherwise it marks the pair as "detached" which forces the values to be stored verbatim. (That is normally only used for read-pairs that span a slice boundary.) I checked and forcing it to store in detached mode does then round-trip properly.

The decode follows the same logic, and it gets +28 and -28, but is then scuppered by this piece of code:

		// set mate unmapped if needed
		if (s->crecs[cr->mate_line].flags & BAM_FUNMAP) {
		    cr->flags |= BAM_FMUNMAP;
		    cr->tlen = 0;
		}
		if (cr->flags & BAM_FUNMAP) {
		    cr->tlen = 0;
		}

This is correcting fields. Such "corrections" in this case are also causing the round-trip failure, although it still comes under the semantically correct vs identical debate. The SAM spec states the field is zero for a single segment template, but that's perhaps ambiguous when we have a paired template with only one mapped even though the intention is clear.

Note: There are some fields CRAM simply cannot stored though for unmapped data. Eg storing CIGAR and MAPQ is conditional on the read being mapped, but I've seen both set before on unmapped data due to bugs in aligners.

@RichardCorbett
Copy link
Author

Thanks folks.
I haven't confirmed this rigorously, but the few hundred reads with differences after a CRAM roundtrip I did check had 1 of the two reads unaligned.

Here are a couple more pre-roundtrip examples that were corrected by CRAM to add the the example at the top.

HISEQ104_146:3:2106:23663:54436 101 1 66509 0 * = 66509 28 ATAGTATATATTATAATATATTATTTTATATTACATAATATATTTTATATATCACATATTACATGTTACATACATAATGTTAAAATATATATTATATTATGATATATTATATATCATACATATTATGTATCCTATATTGTATAATATATA AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJFFJJJFJJJJJJJJJJJJJJJJJJJJFFJJJJJAJJJJJJAFJJFAJAJJJJFJJJJFJJJJJFA<FJFFF--AFFFAFFJJ-F<AFA7<-FAFA RG:Z:1
HISEQ104_146:3:2106:23663:54436 153 1 66509 1 13S28M109S = 66509 -28 GTATCCTATATTGTATAATATATATTATATATTATATATTAGGTGTGTTGTCTATTATATGGTATATATAATTACATATGAAATGTATTATTATATATAACGATATGAATATGTAATTATATATTATTATTTATATTATATATAACTATA<7<F7JJJFF<JJJJJJJJJJJJFFJJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA s1:i:30 s2:i:30 RG:Z:1 NM:i:1 AS:i:46 de:f:0.0357 cm:i:1 nn:i:0 tp:A:P ms:i:46

HISEQ101_180:2:1102:30157:58655 101 1 381113 0 * = 381113 150 ACATTAGGTGTATCTCCTAGTGCTGTCGGTCCGGCACCCTTCACGTGTGGGCGCAACAGCGGCGGGGGGCTGTGGGGTTTACCGGGGATGTCTGATGACTGCGCAAGAGGGAAGTGGCAGGCGGGTTAAATAGCGGCAAGGACACACGGA <A-FFFA<JJJFJAJF-FJ-AFJJ<J7A-------77----7-7<A---7--7-7--<<----7--<-7--77---777---7<---7-----7---7-7--7-A-A---7--7-----77-77-7-----<---77--A----7---7- RG:Z:1
HISEQ101_180:2:1102:30157:58655 153 1 381113 0 150M = 381113 -150 AGAAATAGGAAAACTTTTCCACTGTTGGTGGGACTGTAAACTAGTTCAACCATTGTGGAAGTCAGTGTGGCGATTTCTTAGGGATCTATAACTAGAAATACCATCTGACCCAGCCAACCCATTACTGGGTATATATCCAAAGGATTATAA F7--F-7---7-7----7-A7-F-<-A7-F777---<A7-A-A-FFFF7-FAF-F7-FFFJFFF7-7--<-FJ7-----FAA7AJJAF-JA<FF7<<A<AF<F<-<7-<7AFF<FA-AJJFFAF--F<JJJ-FFF-F-FJJJJJFAA-AA s1:i:25 s2:i:25 RG:Z:1 NM:i:7 AS:i:230 de:f:0.0467 cm:i:1 nn:i:0 tp:A:P ms:i:230

HISEQ101_180:2:1102:30157:58655 101 1 381113 0 * = 381113 150 ACATTAGGTGTATCTCCTAGTGCTGTCGGTCCGGCACCCTTCACGTGTGGGCGCAACAGCGGCGGGGGGCTGTGGGGTTTACCGGGGATGTCTGATGACTGCGCAAGAGGGAAGTGGCAGGCGGGTTAAATAGCGGCAAGGACACACGGA <A-FFFA<JJJFJAJF-FJ-AFJJ<J7A-------77----7-7<A---7--7-7--<<----7--<-7--77---777---7<---7-----7---7-7--7-A-A---7--7-----77-77-7-----<---77--A----7---7- RG:Z:1
HISEQ101_180:2:1102:30157:58655 153 1 381113 0 150M = 381113 -150 AGAAATAGGAAAACTTTTCCACTGTTGGTGGGACTGTAAACTAGTTCAACCATTGTGGAAGTCAGTGTGGCGATTTCTTAGGGATCTATAACTAGAAATACCATCTGACCCAGCCAACCCATTACTGGGTATATATCCAAAGGATTATAA F7--F-7---7-7----7-A7-F-<-A7-F777---<A7-A-A-FFFF7-FAF-F7-FFFJFFF7-7--<-FJ7-----FAA7AJJAF-JA<FF7<<A<AF<F<-<7-<7AFF<FA-AJJFFAF--F<JJJ-FFF-F-FJJJJJFAA-AA s1:i:25 s2:i:25 RG:Z:1 NM:i:7 AS:i:230 de:f:0.0467 cm:i:1 nn:i:0 tp:A:P ms:i:230

Each of these shows that one of the two reads is unaligned, but there is a non-zero insert size reported for the aligned read.

jmarshall added a commit to jmarshall/minimap2 that referenced this issue Apr 4, 2019
this_rid/this_pos will be copied from r_prev(=r_next)'s values when this
read is unmapped (i.e., r is NULL). In this case, we can write RNEXT as
'=' but should not calculate TLEN from these placeholder values.
Similarly when the mate is unmapped (i.e., r_next is NULL).

Fixes lh3#365.
@lh3 lh3 closed this as completed in #373 Apr 5, 2019
lh3 pushed a commit that referenced this issue Apr 5, 2019
this_rid/this_pos will be copied from r_prev(=r_next)'s values when this
read is unmapped (i.e., r is NULL). In this case, we can write RNEXT as
'=' but should not calculate TLEN from these placeholder values.
Similarly when the mate is unmapped (i.e., r_next is NULL).

Fixes #365.
@lh3 lh3 added the bug label Apr 5, 2019
@lh3
Copy link
Owner

lh3 commented Apr 5, 2019

Thanks for reporting the issue. @jmarshall's #373 should fix this issue.

@lh3 lh3 added this to the 2.17 milestone Apr 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants