-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Error in Stage median_cov_target
#25
Comments
Hi @ifiddes-10x, It looks like the error might be occurring within the Could you please tell me a little more about how you have set up the pipeline? The messages "chromosome: STR-AAAAAC not found in NA24143.bam" etc don't necessarily indicate a problem, if for example there are no AAAAAC expansions in your genome. The message is generated by bedtools, but I could probably suppress it. |
I am using a bed file of only a few specific STRs that we are interested in. Could that be the problem? Should I find a genome wide one? Other than that, I am doing all default parameters, with my custom STR BED file, on group of hg38 BAMs like so, in a subfolder I made to reduce clutter in the main folder: |
STRetch works by remapping reads to a custom reference genome with STR decoys. It works on the premise that these reads were likely to have been mis-mapped to the standard reference genome, because the expanded sequence is no present there. The first stage in the pipeline is extracting all reads that are likely to contain STRs. So, all reads mapping to any STR locus, flanking region, and unmapped reads. |
I see, I misunderstood. Thanks for the STR BED file! |
@ifiddes-10x The hg38 bed file is here: Usage: At the moment STRetch ships with hg19, so it will remap against the hg19 + STR decoys, then give you results in terms of hg19. It is definitely time for me to generate and upload a hg38 reference, but it might take me a while to create and test appropriately. |
Sorry, where did you upload the hg19 BED file?
On Wed, Jan 31, 2018 at 5:35 PM Harriet Dashnow ***@***.***> wrote:
@ifiddes-10x <https://github.com/ifiddes-10x> The hg38 bed file is here:
https://figshare.com/s/198f7a51bd95ce16b3db
Usage:
bpipe run STRetch/pipelines/STRetch_wgs_bam_pipeline.groovy
hg38.simpleRepeat_period1-6.dedup.sorted.bed sample1.bam sample2.bam
At the moment STRetch ships with hg19, so it will remap against the hg19 +
STR decoys, then give you results in terms of hg19. It is definitely time
for me to generate and upload a hg38 reference, but it might take me a
while to create and test appropriately.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#25 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AhnBndFi6Ec6NBeWOxW28Wi_E2el9Ib3ks5tQRT0gaJpZM4R0zXt>
.
--
|
Your input data is in hg38 so you will need to use the file linked above. Once reads are extracted the pipeline will switch to hg19. You should already have those reference files from the installation script |
I have some inputs that are mapped to hg38, some to hg19. Is the hg19 STR
file one of the ones that the install script placed in the reference-data
folder?
On Wed, Jan 31, 2018 at 5:55 PM Harriet Dashnow ***@***.***> wrote:
Your input data is in hg38 so you will need to use the file linked above.
Once reads are extracted the pipeline will switch to hg19. You should
already have those reference files from the installation script
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#25 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AhnBnR5gxwLywaZWJ9BVNm2THvO7pfW-ks5tQRmYgaJpZM4R0zXt>
.
--
|
Yep |
Alright thanks!
On Wed, Jan 31, 2018 at 5:59 PM Harriet Dashnow ***@***.***> wrote:
Yep
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#25 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AhnBndlxBoEF5eN6daWEj2A-V9wzNaM1ks5tQRqPgaJpZM4R0zXt>
.
--
|
I am still getting the same error in
|
That's an error I haven't seen before. |
That appears to be the case. The run folder ends up with 128 smaller BAMs
from the `samtools collate` step, but the output FASTQ are empty. So
something is wrong with this step `cat <(
/mnt/home/ian/yard/STR/STRetch/tools/bin/samtools view -hu -L
combined_strs.slop.bed ../NA24385_longranger.GRCh37.bam ) <(
/mnt/home/ian/yard/STR/STRetch/tools/bin/samtools view -u -f 4
../NA24385_longranger.GRCh37.bam ) |
/mnt/home/ian/yard/STR/STRetch/tools/bin/samtools collate -Ou -n 128 -
NA24385_L001_R1.fastq | /mnt/home/ian/yard/STR/STRetch/tools/bin/bedtools
bamtofastq -i - -fq >(gzip -c > NA24385_L001_R1.fastq.gz) -fq2 >(gzip -c >
NA24385_L001_R2.fastq.gz)`. I can check any of the 128 smaller BAMs and see
that they contain valid, paired reads. I am running the above step by hand
to see if I can see what happens.
On Thu, Feb 1, 2018 at 2:37 PM Harriet Dashnow ***@***.***> wrote:
That's an error I haven't seen before.
You can see which commands it was trying to run incommandlog.txt in the
working directory.
I looks like goleft is complaining about reads with zero insert sizes. And
later the STR_locus_counts is complaining about a NoneType object when
reading the bamfile.
Any chance the intermediate bam file is empty?
Check the bam file feeding into that stage has reads in it, and that they
have pairs.
—
You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
<#25 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AhnBnSxrJ7EAA_oARtvPcu253bc0jmnHks5tQjy_gaJpZM4R0zXt>
.
--
|
Hi Ian, Okay, that makes sense. If that step is failing, no reads are getting extracted so the later stages fail. I've had the same error reported by another user, also using hg38. I'm taking a look at his data today, so hopefully I can nut it out. |
Hey Ian, I have an idea, and if this is the problem I'm going to feel a bit silly! Are you using the pipeline on a cluster where jobs are given limited walltime? I suspect the walltime for that job is set to the default, 2 hours. So it's getting cut off mid-way through. I previously only tested the STRetch WGS pipelines on a cluster that doesn't restrict walltime. If you wouldn't mind editing the following: In
And just to be on the safe side, let's increase the walltime to 48 hours. In
|
I am running it on a local machine. I will try your changes. I ran the
command from the command log by hand and got 'Error reading input file':
```
bespin1 Thu Feb 01 14:48 test $cat <(
/mnt/home/ian/yard/STR/STRetch/tools/bin/samtools view -hu -L
combined_strs.slop.bed ../NA24385_longranger.GRCh37.bam ) <(
/mnt/home/ian/yard/STR/STRetch/tools/bin/samtools view -u -f 4
../NA24385_longranger.GRCh37.bam ) |
/mnt/home/ian/yard/STR/STRetch/tools/bin/samtools collate -Ou -n 128 -
NA24385_L001_R1.fastq | /mnt/home/ian/yard/STR/STRetch/tools/bin/bedtools
bamtofastq -i - -fq >(gzip -c > NA24385_L001_R1.fastq.gz) -fq2 >(gzip -c >
NA24385_L001_R2.fastq.gz)
Error reading input file
```
On Thu, Feb 1, 2018 at 4:27 PM Harriet Dashnow ***@***.***> wrote:
Hey Ian,
I have an idea, and if this is the problem I'm going to feel a bit silly!
Are you using the pipeline on a cluster where jobs are given limited
walltime?
I suspect the walltime for that job is set to the default, 2 hours. So
it's getting cut off mid-way through. I previously only tested the STRetch
WGS pipelines on a cluster that doesn't restrict walltime.
If you wouldn't mind editing the following:
In STRetch/pipelines/pipeline_stages.groovy, add , "bwamem" to the end of
the triple quotes in extract_reads_region. It should be line 175 if you
have the version that's currently in master. The stage should now look like
this:
extract_reads_region = {
doc "Extract reads from bam region + unaligned"
def fastaname = get_fname(REF)
produce(branch.sample + '_L001_R1.fastq.gz', branch.sample + '_L001_R2.fastq.gz') {
exec """
cat <( $samtools view -hu -L $input.bed $input.bam )
<( $samtools view -u -f 4 $input.bam ) |
$samtools collate -Ou -n 128 - $output.prefix |
$bedtools bamtofastq -i - -fq >(gzip -c > $output1.gz) -fq2 >(gzip -c > $output2.gz)
""", "bwamem"
}
}
And just to be on the safe side, let's increase the walltime to 48 hours.
In STRetch/pipelines/bpipe.config, change the walltime for bwa, so it
looks like this:
bwamem {
walltime="48:00:00"
procs=8
}```
I'm testing this too, but obviously it will take a day or two to get a result!
—
You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
<#25 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AhnBnbUc2P3WYgZcy0K1GmQc5QKywnVTks5tQlaCgaJpZM4R0zXt>
.
--
|
Those changes did not work. In the end, I am still seeing lots of small
(~30mb) bam files, showing that `samtools collate` is doing its job
properly and shuffling the reads, but the `bedtools bamtofastq` step
doesn't work -- I end up with two empty .fastq.gz files. `goleft` then
fails on these empty files.
|
Thanks, Ian. I've been able to replicate the error now. I'll let you know when I have a fix. |
I have made some changes to the code to try to address this issue. Could you please update STRetch, re-install it (including the reference data) and the try again? From the STRetch directory:
|
I am trying to use the
STRetch_wgs_bam_pipeline.groovy
pipeline with a previously generated BAM against GRCh38, targeting a specific set of STRs I have identified in a BED file. After taking an hour to walk the BAM (why does it not use random access with the index?), it crashes with an error about the reference:Am I misunderstanding how to use this pipeline? Do I need to be mapped against a special reference?
The text was updated successfully, but these errors were encountered: