-
Notifications
You must be signed in to change notification settings - Fork 424
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
insert sizes change when BAM->CRAM->BAM #365
Comments
In this example read pair, the first read (the one with flags = 101) is unmapped. So minimap2 ought to be outputting the insert size as 0 for both reads, and the CRAM roundtrip has effectively updated the field to its correct value. CRAM has some capabilities for recording incorrect field values so they can be preserved across roundtrips, but apparently not for TLEN (@jkbonfield?). Do all the alignment records experiencing this change have one or both reads in the pair being unmapped? If so, this would appear to be indeed a minimap2 problem. (The SAM spec is pretty clear that TLEN should only be non-zero if both ends are mapped (to the same reference sequence, in fact), so this should be rare enough that IMHO it doesn't by itself warrant extending CRAM's incorrectness-preserving facilities to cover TLEN.) |
I think CRAM's TLEN preserving is finding the pair, checking the observed template size (this actually is 28, based on leftmost and rightmost coordinates from one read only), checking if it matches the stored figure (it does) and if so lets it go. Otherwise it marks the pair as "detached" which forces the values to be stored verbatim. (That is normally only used for read-pairs that span a slice boundary.) I checked and forcing it to store in detached mode does then round-trip properly. The decode follows the same logic, and it gets +28 and -28, but is then scuppered by this piece of code:
This is correcting fields. Such "corrections" in this case are also causing the round-trip failure, although it still comes under the semantically correct vs identical debate. The SAM spec states the field is zero for a single segment template, but that's perhaps ambiguous when we have a paired template with only one mapped even though the intention is clear. Note: There are some fields CRAM simply cannot stored though for unmapped data. Eg storing CIGAR and MAPQ is conditional on the read being mapped, but I've seen both set before on unmapped data due to bugs in aligners. |
Thanks folks. Here are a couple more pre-roundtrip examples that were corrected by CRAM to add the the example at the top.
Each of these shows that one of the two reads is unaligned, but there is a non-zero insert size reported for the aligned read. |
this_rid/this_pos will be copied from r_prev(=r_next)'s values when this read is unmapped (i.e., r is NULL). In this case, we can write RNEXT as '=' but should not calculate TLEN from these placeholder values. Similarly when the mate is unmapped (i.e., r_next is NULL). Fixes lh3#365.
this_rid/this_pos will be copied from r_prev(=r_next)'s values when this read is unmapped (i.e., r is NULL). In this case, we can write RNEXT as '=' but should not calculate TLEN from these placeholder values. Similarly when the mate is unmapped (i.e., r_next is NULL). Fixes #365.
Thanks for reporting the issue. @jmarshall's #373 should fix this issue. |
Hi folks,
We're looking into using CRAM for archiving our genome BAM files and recently noticed that although BWA mem alignments go through CRAM compression unedited, the minimap2 alignments occasionally (about 0.25% of reads) experience a change in insert size estimate in the post compression BAM.
Alignment command from BAM header:
minimap2 PN:minimap2 CL:minimap2 -t 32 -R @RG\tID:foo\tSM:GM24385_DNA_2.bam\tPL:ILLUMINA -ax sr /projects/rcorbettprj2/minION/AM38_1/GRCh37-lite.mmi GM24385_DNA_2.bam.fq1 GM24385_DNA_2.bam.fq2 VN:2.15-r905
I'm currently regenerating the alignments and can post an example in a hour or two if it is helpful.
The changes in insert size appear to be exclusively in reads with significant clipping, in case that is of interest.
If you think this is more likely a problem with CRAM rather than minimap2 I can post a ticket over there instead.
thanks,
RIchard
Before compression example:
Same read pair, after going BAM->CRAM->BAM
The text was updated successfully, but these errors were encountered: