FastxFile truncates gzip compressed fastq files #738
Comments
Hi, I guess the issue is that the gzip'ed file is actually a concatenation of multiple gzip files?
The fastq.gz files I'm working with came directly from Illumina's fastq generation process, so unfortunately I have no control over it. My current workarounds are to either 1) not use pysam for iterating over fastq records or 2) re-compress the fastq.gz file (e.g. Perhaps there could be a way to log a warning when this occurs, or a note in the documentation? It led to a hard-to-trace bug in my code. I'm happy to submit a pull request for a documentation change.
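For the first workaround (not using pysam to iterate), a minimal sketch using only Python's standard library follows. It relies on the `gzip` module, which reads concatenated (multi-member) gzip files through to the end. The function name `iter_fastq` is hypothetical, and the sketch assumes the simple layout of exactly four lines per record with no wrapped sequences.

```python
import gzip
from typing import Iterator, Tuple


def iter_fastq(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a gzip-compressed FASTQ file.

    Hypothetical helper, not part of pysam. Assumes exactly four lines
    per record (header, sequence, '+' separator, quality).
    """
    # gzip.open decompresses every member of a multi-member gzip file.
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # skip the '+' separator line
            quality = handle.readline().rstrip("\n")
            # Drop the leading '@' from the header line.
            yield header.rstrip("\n")[1:], sequence, quality
```

Counting records is then `sum(1 for _ in iter_fastq(path))`. This is slower than an HTSlib-based reader, but it does not depend on which HTSlib version pysam was built against.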
If you could provide a PR for docs, that would be great, thanks!
Any follow-up on this issue?
Without an example file, it is unclear whether this might be early stopping due to extra BGZF trailer blocks (samtools/htslib#45) or early stopping due to multiple GZIP members in a plain GZIP (non-bgzipped) file (samtools/htslib#742) or due to another reason. The former has been fixed since HTSlib 1.4 so it is probably not that. The latter has been fixed on the HTSlib develop branch, but only after the 1.9 release. So if this problem is that, it will be fixed in an upcoming release.
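To check whether a given file has more than one gzip member, something like the following sketch could be used. The helper name `count_gzip_members` is hypothetical; it uses `zlib.decompressobj` in gzip mode and restarts on the leftover bytes after each member ends. It holds the whole input in memory, so it is only meant for small test files or a sample of a larger one. Note that a BGZF file is itself a series of gzip members, so a count above one does not by itself tell the two cases apart.

```python
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the gzip members in a byte string (hypothetical helper)."""
    members = 0
    while data:
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError("truncated or corrupt gzip member")
        members += 1
        # Bytes after the end of this member belong to the next one.
        data = decomp.unused_data
    return members
```

For example, `count_gzip_members(open(path, "rb").read())` on a small file reports how many members it contains.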
I guess I'm hitting the second problem; looking forward to the fix. Thank you.
Is a fix on the way for this? I am still experiencing this issue.
There is now an HTSlib release containing the fix for the second problem noted in #738 (comment), so when pysam's HTSlib is updated that problem will go away. If you would like to provide an example file exhibiting the issue you're experiencing, we can investigate whether the problem is one of those two or a different problem still needing fixing.
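@jmarshall Here is a link to a fastqgz file that I have made publicly downloadable from my google drive. This is the file that I was having issues with. Here is the code I was running on it:
    import pysam

    num_reads = 0
    with pysam.FastxFile(path_to_fastqgz) as fastqgz_file:
        for entry in fastqgz_file:
            num_reads += 1
    print("num reads: " + str(num_reads))
And the output is 22 with the fastqgz file I provided. However, if you decompress the fastqgz file and instead run the code on a fastq, the output is 154470.
I am using pysam==0.15.3, and python3.7.4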
We are working on a (large) update to pysam to support htslib 1.10.x.
There is no specific ETA yet.
-Kevin
On Tue, Dec 17, 2019 at 2:11 PM Zachary Munro ***@***.***> wrote:
@jmarshall <https://github.com/jmarshall> Here is a link <https://drive.google.com/file/d/16TDoy9hibyYCSL7NH4Z9HNV9wsfLsq_g/view?usp=sharing> to a fastqgz file that I have made publicly downloadable from my google drive. This is the file that I was having issues with. Here is the code I was running on it:
    import pysam

    num_reads = 0
    with pysam.FastxFile(path_to_fastqgz) as fastqgz_file:
        for entry in fastqgz_file:
            num_reads += 1
    print("num reads: " + str(num_reads))
And the output is 22 with the fastqgz file I provided. However, if you decompress the fastqgz file and instead run the code on a fastq, the output is 154470.
I am using pysam==0.15.3, and python3.7.4
@zmunro: I can confirm that your file is one that is fixed by samtools/htslib#744, so it will work in an upcoming pysam release incorporating the recent HTSlib release.
This has been fixed since pysam 0.16.0 (when used with HTSlib 1.10 or later).
I've noticed that using pysam.FastxFile to iterate over a bgzip'd fastq file only returns a subset of fastq records. The expected result is that all records are iterated over.
Example:
example.fastq.gz contains 1,009,470 reads.
When I iterate over the compressed fastq file using pysam, I only get 26,107 reads instead of the expected 1,009,470 reads.
I've traced the issue back to bgzip not decompressing all the contents of the fastq file.
For example, I get the expected ~4M lines using gunzip to decompress:
But using bgzip to decompress only returns ~104k lines, which is the same result obtained from pysam (26107 * 4 = 104428):
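As a synthetic illustration of the line and record arithmetic above (this is not the reporter's original commands or data), the sketch below builds a two-member gzip stream of made-up FASTQ records and compares the record count when only the first member is decoded with the count when every member is decoded. All names here are hypothetical, and stopping after the first member is only one way such a truncation can arise; it is not a claim about exactly where the affected HTSlib versions stopped.

```python
import gzip
import zlib

FASTQ_RECORD = b"@read%d\nACGT\n+\nIIII\n"  # synthetic 4-line record


def make_member(start: int, count: int) -> bytes:
    """Compress `count` synthetic FASTQ records into one gzip member."""
    return gzip.compress(
        b"".join(FASTQ_RECORD % i for i in range(start, start + count))
    )


def records_first_member(data: bytes) -> int:
    """Decode only the first gzip member and count 4-line records."""
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    return decomp.decompress(data).count(b"\n") // 4


def records_all_members(data: bytes) -> int:
    """Decode every gzip member and count 4-line records."""
    return gzip.decompress(data).count(b"\n") // 4


# Two members concatenated, as when several .gz files are joined with cat.
data = make_member(0, 3) + make_member(3, 5)
truncated = records_first_member(data)
complete = records_all_members(data)
print(truncated, complete)
```

A reader that stops early reports fewer records than a full decompression, and dividing the decompressed line count by four gives the record count in both cases.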