Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FastxFile truncates gzip compressed fastq files #738

Closed
davidtsao opened this issue Oct 31, 2018 · 12 comments
Closed

FastxFile truncates gzip compressed fastq files #738

davidtsao opened this issue Oct 31, 2018 · 12 comments

Comments

@davidtsao
Copy link

I've noticed that using pysam.FastxFile to iterate over a bgzip'd fastq file only returns a subset of fastq records. The expected result is that all records are iterated over.

Example:

example.fastq.gz contains 1,009,470 reads.

When I iterate over the compressed fastq file using pysam, I only get 26,107 reads instead of the expected 1,009,470 reads.

In [1]: import pysam

In [2]: len([x for x in pysam.FastxFile('example.fastq.gz')])
Out[2]: 26107

I've traced the issue back to bgzip not decompressing all the contents of the fastq file.

For example, I get the expected ~4M lines using gunzip to decompress:

$ bgzip -c -d example.fastq.gz | wc -l 
 4037880

But using bgzip to decompress only returns ~104k lines, which is the same result obtained from pysam (26107 * 4 = 104428):

$ bgzip -c -d example.fastq.gz | wc -l
  104428
@AndreasHeger
Copy link
Contributor

Hi, I guess the issue is that the the gzip'ed file is actually a concatenation of multiple gzip files?
This is no problem for gzip, but bgzip and pysam will stop at the end of the first component thinking that the file is complete.

@davidtsao
Copy link
Author

The fastq.gz files I'm working with came directly from Illumina's fastq generation process, so unfortunately I have no control over it.

My current workarounds are to either 1) not use pysam for iterating over fastq records or 2) re-compress the fastq.gz file (e.g. gunzip example.fastq.gz && gzip example.fastq)

Perhaps there could be a way to log a warning when this occurs, or a note in the documentation? It led to a hard to trace bug in my code. I'm happy to submit a pull request for a documentation change.

@AndreasHeger
Copy link
Contributor

If you could provide a PR for docs, that would be great, thanks!
I agree that this is a serious issue and leave it open - if I have time will look into this to see if the gzip iterator is recoverable from a bogus EOF.

@cgjosephlee
Copy link

cgjosephlee commented Apr 30, 2019

Any follow-ups of this issue?

@jmarshall
Copy link
Member

Without an example file, it is unclear whether this might be early stopping due to extra BGZF trailer blocks (samtools/htslib#45) or early stopping due to multiple GZIP members in a plain GZIP (non-bgzipped) file (samtools/htslib#742) or due to another reason.

The former has been fixed since HTSlib 1.4 so it is probably not that.

The latter has been fixed on HTSlib develop but after the 1.9 release. So if this problem is that, it will be fixed in an upcoming release.

@cgjosephlee
Copy link

I guess I'm in the second problem, looking forward to it. Thank you.

@zmunro
Copy link

zmunro commented Dec 17, 2019

Is a fix on the way for this? I am still experiencing this issue.

@jmarshall
Copy link
Member

There is now an HTSlib release containing the fix for the second problem noted in #738 (comment), so when pysam's HTSlib is updated that problem will go away.

If you would like to provide an example file exhibiting the issue you're experiencing, we can investigate whether the problem is one of those two or a different problem still needing fixing.

@zmunro
Copy link

zmunro commented Dec 17, 2019

@jmarshall Here is a link to a fastqgz file that I have made publicly downloadable from my google drive. This is the file that I was having issues with. Here is the code I was running on it:

num_reads = 0
with pysam.FastxFile(path_to_fastqgz) as fastqgz_file:
    for entry in fastqgz_file:
        num_reads +=1
print("num reads: " + str(num_reads))

And the output is 22 with the fastqgz file I provided. However if you decompress the fastqgz file and instead run the code on a fastq, the output is 154470.
I am using pysam==0.15.3, and python3.7.4

@bioinformed
Copy link
Member

bioinformed commented Dec 17, 2019 via email

@jmarshall
Copy link
Member

@zmunro: I can confirm that your file is one that is fixed by samtools/htslib#744 so will work in an upcoming pysam release incorporating the recent HTSlib release.

@jmarshall
Copy link
Member

This has been fixed since pysam 0.16.0 (when used with HTSlib 1.10 or later).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants