Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

csi indexing of *.bam.sv.bam - invalid file pointer #520

Closed
keiranmraine opened this issue Aug 19, 2021 · 13 comments
Closed

csi indexing of *.bam.sv.bam - invalid file pointer #520

keiranmraine opened this issue Aug 19, 2021 · 13 comments

Comments

@keiranmraine
Copy link

I'm finding that the csi index generated for the *.bam.sv.bam file in some instances results in an invalid file pointer:

INFO    2021-08-17 07:24:30     AssemblyEvidenceSource  Starting assembly on chunk 164 (9:98840289-9:108840288)
ERROR   2021-08-17 07:24:31     AssemblyEvidenceSource  Error assembling chunk 164 (9:98840289-9:108840288)
htsjdk.samtools.util.RuntimeIOException: java.io.IOException: Invalid file pointer: 686201611812555 for /var/spool/workspace/PD13371b.sample.dupmarked.bam.gridss.working/PD13371b.sample.dupmarked.bam.sv.bam
        at htsjdk.samtools.BAMFileReader$BAMFileIterator.readNextRecord(BAMFileReader.java:972)
        at htsjdk.samtools.BAMFileReader$BAMFileIterator.access$1800(BAMFileReader.java:804)
        at htsjdk.samtools.BAMFileReader$BAMFileIterator$AsyncBamDecoder.performReadAhead(BAMFileReader.java:932)
        at htsjdk.samtools.util.AsyncReadTaskRunner.readNextBatch(AsyncReadTaskRunner.java:226)
        at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Invalid file pointer: 686201611812555 for /var/spool/workspace/PD13371b.sample.dupmarked.bam.gridss.working/PD13371b.sample.dupmarked.bam.sv.bam
        at htsjdk.samtools.util.BlockCompressedInputStream.seek(BlockCompressedInputStream.java:382)
        at htsjdk.samtools.BAMFileReader$BAMFileIndexIterator.advanceToNextRecordStart(BAMFileReader.java:1136)
        at htsjdk.samtools.BAMFileReader$BAMFileIterator.readNextRecord(BAMFileReader.java:952)
        ... 7 more

Rerunning gives the same result, however if I reindex the file with the 1.10 samtools included in the docker image the processing can be resumed.

Under GRCh38 this works fine, but GRCh37 it fails on the same sample. Just logging here incase others encounter this.

Example command:

gridss \
--steps preprocess,assemble,call \
--externalaligner \
--reference ref/genome.fa \
--blacklist ref/gridss/blacklist-2011-05-04-ENCFF001TDO.bed \
--threads 8 \
--labels SAMPLE \
--skipsoftcliprealignment \
--assembly result/PD13371b.assembly.bam \
--output result/SAMPLE.gridss.vcf.gz \
--workingdir workspace \
SAMPLE.sample.dupmarked.bam
@d-cameron
Copy link
Member

Can you validate that the index itself is corrupt? If it is, then it looks like I can't rely on htsjdk-generated .csi index files :(

@keiranmraine
Copy link
Author

I'm waiting on a reoccurrence.

@keiranmraine
Copy link
Author

This is reproducible on this data set, fails on 37, fine on 38.

@cboursnell
Copy link

I now get this error. I am using:

  • samtools Version: 1.10 (using htslib 1.10.2)
  • hg38
  • gridss-2.12.1-gridss-jar-with-dependencies.jar

@keiranmraine
Copy link
Author

keiranmraine commented Oct 19, 2021

FYI, I'm seeing while attempting to run our CI pipeline on large scale data with docker pull gridss/gridss:2.12.2, now confirmed on GRCh38 too.

@keiranmraine
Copy link
Author

Additional info. Within the container attempt to count reads in region where the failure occured:

large-vm$ docker run --rm -ti -v $PWD/work:/data/work:rw gridss/gridss:2.12.2 bash
# cd /data/work/workspace/PD13371b.sample.dupmarked.bam.gridss.working/ 
# samtools view -c PD13371b.sample.dupmarked.bam.sv.bam 2:100499380-110499379
0

No error, but also no reads.

Accessing the the BAM+CSI file with samtools 1.13 I get an appropriate response:

dev-box$ samtools view -c PD13371b.sample.dupmarked.bam.sv.bam 2:100499380-110499379
424383

It appears here that the CSI is valid for samtools 1.13 but not samtools 1.10. Went back into the container:

large-vm$ docker run --rm -ti -v $PWD/work:/data/work:rw gridss/gridss:2.12.2 bash
# cd /data/work/workspace/PD13371b.sample.dupmarked.bam.gridss.working/
# samtools view -c PD13371b.sample.dupmarked.bam.sv.bam 2:100499380-110499379
0
# mv PD13371b.sample.dupmarked.bam.sv.bam.csi PD13371b.sample.dupmarked.bam.sv.bam.csi.orig
# samtools index -c PD13371b.sample.dupmarked.bam.sv.bam
# samtools view -c PD13371b.sample.dupmarked.bam.sv.bam 2:100499380-110499379
424383
# # md5sum PD13371b.sample.dupmarked.bam.sv.bam.csi*
059ee8f08237ca0a62adab738aa5fa39  PD13371b.sample.dupmarked.bam.sv.bam.csi
e7f7226f555d6153d2784a5fc57030ec  PD13371b.sample.dupmarked.bam.sv.bam.csi.orig

I'm not sure what is reading the CSI index within gridss, but it appears that samtools 1.13 can read it, but not samtools 1.10.

@keiranmraine
Copy link
Author

I think I've isolated it down to the one place that the gridss shell script is generating csi indexes instead of bai indexes (during sort). Currently building and testing this patch:

diff --git a/scripts/gridss b/scripts/gridss
index 7ffec1a5..9f3e6a58 100644
--- a/scripts/gridss
+++ b/scripts/gridss
@@ -801,7 +801,7 @@ if [[ $do_preprocess == true ]] ; then
                                        | $timecmd $samtools_sort \
                                                        -T $tmp_prefix.coordinate-tmp \
                                                        -Obam \
-                                                       -o $tmp_prefix.coordinate.bam \
+                                                       -o $tmp_prefix.coordinate.bam##idx##$tmp_prefix.coordinate.bam.bai \
                                                        $preprocess_sort_args \
                                                        /dev/stdin \
                                        ; } 1>&2 2>> $logfile
@@ -809,8 +809,8 @@ if [[ $do_preprocess == true ]] ; then
                                        if [[ $skipsoftcliprealignment == "true" ]] ; then
                                                write_status "Skipping  SoftClipsToSplitReads   $f"
                                                mv $tmp_prefix.coordinate.bam $prefix.sv.bam
-                                               mv $tmp_prefix.coordinate.bam.csi $prefix.sv.bam.csi
-                                               touch $prefix.sv.bam.csi # make sure the index file is older so we don't get htsjdk WARNING spam
+                                               mv $tmp_prefix.coordinate.bam.bai $prefix.sv.bam.bai
+                                               touch $prefix.sv.bam.bai # make sure the index file is older so we don't get htsjdk WARNING spam
                                        else
                                                write_status "Running   SoftClipsToSplitReads   $f"
                                                rm -f $tmp_prefix.sc2sr.suppsorted.sv-tmp*

From samtools docs:

The --write-index option enables automatic index creation while writing out BAM, CRAM or bgzf SAM files. ... By default SAM and BAM will use CSI indices while CRAM will use CRAI indices. If you need to create BAI indices note that it is possible to specify the name of the index being written to, and hence the format, by using the filename##idx##indexname notation.

@keiranmraine
Copy link
Author

@keiranmraine
Copy link
Author

@d-cameron I've taken a branch at the v2.12.2 tag and the docker image fails to build before I've made changes:

$ git log | head -n 5
commit 28fd5fa9f656edae45eb18fe8423dad573419897
Author: Daniel Cameron <d-cameron@github.com>
Date:   Thu Oct 14 17:21:37 2021 +1100

    virusbreakend: not using --skipsoftcliprealignment
Results :

Failed tests: 
  SswJniAlignerTest>SmithWatermanAlignerTest.should_align_deletion:15->SmithWatermanAlignerTest.create:11->create:8
  SswJniAlignerTest>SmithWatermanAlignerTest.should_use_soft_clips:29->SmithWatermanAlignerTest.create:11->create:8
  SswJniAlignerTest>SmithWatermanAlignerTest.should_use_zero_based_reference_offset:23->SmithWatermanAlignerTest.create:11->create:8

Tests run: 1542, Failures: 3, Errors: 0, Skipped: 28

The more detail on the failed tests:

Running au.edu.wehi.idsv.alignment.SswJniAlignerTest
Tests run: 3, Failures: 3, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec <<< FAILURE! - in au.edu.wehi.idsv.alignment.SswJniAlignerTest
should_align_deletion(au.edu.wehi.idsv.alignment.SswJniAlignerTest)  Time elapsed: 0.004 sec  <<< FAILURE!
java.lang.AssertionError
	at au.edu.wehi.idsv.alignment.SswJniAlignerTest.create(SswJniAlignerTest.java:8)

should_use_zero_based_reference_offset(au.edu.wehi.idsv.alignment.SswJniAlignerTest)  Time elapsed: 0 sec  <<< FAILURE!
java.lang.AssertionError
	at au.edu.wehi.idsv.alignment.SswJniAlignerTest.create(SswJniAlignerTest.java:8)

should_use_soft_clips(au.edu.wehi.idsv.alignment.SswJniAlignerTest)  Time elapsed: 0.001 sec  <<< FAILURE!
java.lang.AssertionError
	at au.edu.wehi.idsv.alignment.SswJniAlignerTest.create(SswJniAlignerTest.java:8)

@d-cameron
Copy link
Member

Thanks for the feedback. Going with updated samtools version requirements

@secondspass
Copy link

@d-cameron I just checked the latest gridss docker image and it doesn't look like the samtools version was properly updated. If you look:

student@student-VirtualBox:~/Downloads/gridss$ sudo docker run -it --entrypoint /bin/bash gridss/gridss:latest
root@a095790330fc:/data# apt info samtools
Package: samtools
Version: 1.10-3
Status: install ok installed
Priority: optional
Section: science
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Original-Maintainer: Debian Med Packaging Team <debian-med-packaging@lists.alioth.debian.org>
Installed-Size: 1174 kB
Depends: libc6 (>= 2.29), libhts3 (>= 1.10), libncurses6 (>= 6), libtinfo6 (>= 6), zlib1g (>= 1:1.1.4)
Recommends: python3, cwltool
Breaks: bio-tradis (<= 1.4.5+dfsg-1), iva (<= 1.0.9+ds-6)
Homepage: http://www.htslib.org/
Download-Size: unknown
APT-Manual-Installed: yes
APT-Sources: /var/lib/dpkg/status
Description: processing sequence alignments in SAM, BAM and CRAM formats
 Samtools is a set of utilities that manipulate nucleotide sequence alignments
 in the binary BAM format. It imports from and exports to the ascii SAM
 (Sequence Alignment/Map) and CRAM formats, does sorting, merging and indexing,
 and allows one to retrieve reads in any regions swiftly. It is designed to work
 on a stream, and is able to open a BAM or CRAM (not SAM) file on a remote FTP
 or HTTP server.

@keiranmraine
Copy link
Author

See #549

@afurches
Copy link

afurches commented Dec 13, 2021

Hi @d-cameron ,

I tried rebuilding the docker image on 12/07/21 and again today, 12/13/21, using "latest" and "v2.13.0" tags and still get the samtools version error when running the Assembly step. Here's an excerpt from .err:

Mon Dec 13 14:26:02 EST 2021: Using GRIDSS jar /opt/gridss/gridss-2.13.0-gridss-jar-with-dependencies.jar
...
Mon Dec 13 14:26:07 EST 2021: Found /usr/bin/Rscript
Mon Dec 13 14:26:07 EST 2021: Found /usr/bin/samtools
Mon Dec 13 14:26:07 EST 2021: Found /usr/bin/java
Mon Dec 13 14:26:07 EST 2021: Found /usr/bin/bwa
Mon Dec 13 14:26:08 EST 2021: samtools version: 1.10+htslib-1.10.2-3
Mon Dec 13 14:26:08 EST 2021: samtools 1.13 or later is required.

My HPC support folks tell me:

...problem is the version of samtools. Currently, the latest docker image has samtools v1.10 where it actually needs v1.13. But the samtools in the docker image is installed with apt-get, and the docker image is based on Ubuntu 20.04 whose package repositories only has samtools 1.10. Supposedly the gridss maintainer had updated the gridss docker image to include a newer samtools but we checked the docker image and it doesn't have that newer version.

How can I work around this?
Thanks,
Anna

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants