Losing information after update from R9 to R10 flowcell #160
Comments
Can you send over the BAM of one or a few of the failed R10 tests? Just the reads covering your variant of interest will be enough.
Sure! Is there a way to send it to you without going through GitHub?
Please send it to the two email addresses listed at https://github.com/HKU-BAL/Clair3#readme
Hi,
Hi, thanks for the answers. I'll send the BAM ASAP.
I've just sent you one of the negative BAMs by email.
A little update regarding this issue: it looks like I catch my variant of interest 100% of the time (across 100 tests) when I set the --ref_pct_full option to a threshold of 0.7. As it's an experimental option, I don't know how risky it is to play with it. Is there a risk of making the information in my VCF output incorrect when I raise the threshold from 0.1 to 0.7 in this context? I'll continue to investigate to find out which threshold makes me find or lose my variant of interest. In the meantime, I have to say that I don't clearly understand what this option does in the process.
Can you explain the purpose of this option, and why it changes my results so much? Best regards,
Changing --ref_pct_full to 0.7 is a good try. It runs slower because more variant candidates go into the slower full-alignment-based calling, but calling performance is monotonically better, and most of the time much better. If the slowdown is acceptable, or you value calling performance far more than speed, setting both --var_pct_full and --ref_pct_full to a higher value is suggested.
Very interesting, many thanks for this answer. I don't really care about speed; what I really need is to be sure I see my variants of interest every time I run Clair3, no matter which sample was sequenced. So I have two questions: (1) Is it good practice to set both --var_pct_full and --ref_pct_full to 1 in order to catch all my variants of interest? Could that be problematic for the information I'll have in my final VCF? (2) Why didn't this problem seem to appear when I ran Clair3 on an R9-sequenced sample, given that the same pipeline protocol was applied? Is it a question of read quality, or a potential mismatch between the chemistry and the Clair3 model on R10?
The full-alignment model runs slower but is more sensitive and accurate than the pileup model in repetitive and low-complexity regions. Setting --var_pct_full and --ref_pct_full to 1 means sending all reference-call and variant-call candidates to the full-alignment model. If the 100 tests you've done are at relatively challenging genome positions, setting --var_pct_full and --ref_pct_full to 1 will certainly lead to more correct answers. The default parameters work best for whole-genome calling. In your use case, if you are only genotyping the 100 positions of interest, setting both --var_pct_full and --ref_pct_full to a higher value, or even 1, is reasonable. Regarding why R9 got more correct answers than R10 in your tests, there are a few possible explanations; let me look into the case you sent and get back to you later.
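To make the role of the two percentages concrete, here is a minimal Python sketch of fraction-based candidate routing. It is a toy illustration, not Clair3's actual code: the function name `select_for_full_alignment`, the `(position, quality)` tuples, the example values, and the rule of sending the lowest-quality fraction of pileup calls to the full-alignment model are all assumptions made for illustration.

```python
import math

def select_for_full_alignment(candidates, pct):
    """Return the lowest-quality `pct` fraction of pileup candidates.

    candidates: list of (position, pileup_quality) tuples (hypothetical layout).
    pct: fraction in [0, 1], playing the role of --var_pct_full / --ref_pct_full.
    """
    if not 0.0 <= pct <= 1.0:
        raise ValueError("pct must be between 0 and 1")
    # Small epsilon guards against float noise such as 3 * 0.7 = 2.0999999999999996.
    n_selected = min(len(candidates), math.ceil(len(candidates) * pct - 1e-9))
    # Assumption: the least confident pileup calls are re-examined first.
    ranked = sorted(candidates, key=lambda c: c[1])
    return ranked[:n_selected]

# Made-up pileup calls: (position, quality)
pileup_calls = [(101, 3.0), (205, 25.0), (330, 8.5), (412, 17.0), (550, 1.2),
                (618, 30.0), (702, 12.0), (845, 22.0), (910, 5.0), (977, 28.0)]

low_fraction = select_for_full_alignment(pileup_calls, 0.1)
everything = select_for_full_alignment(pileup_calls, 1.0)
print(len(low_fraction), "candidate(s) at 0.1;", len(everything), "at 1.0")
```

With a fraction of 1, every candidate is routed to the slower model, which matches the description of sending all reference-call and variant-call candidates to full alignment.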
Hi Boris, quick question. What's the depth of your datasets? Is it amplicon data with high depth (much higher than 60X, for example)? Thanks,
Hi All, Thank you!
@mproberts99 Thank you for the information. Working with colleagues from ONT, we noticed that the R10 models were trained at a relatively lower coverage than the R9 model. For high-depth amplicon data, an immediate solution is to set --var_pct_full and --ref_pct_full to 1, so that all variant candidates are handled by the full-alignment model, which always subsamples the input to at most 89-fold. In the long run, we are working with colleagues from ONT to either increase the training coverage or fine-tune the code to handle high-depth amplicon input. Will post more updates here.
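As a rough illustration of what an 89-fold cap means for high-depth amplicon data, the sketch below computes the fraction of reads that would be kept at a given depth. The function name `keep_fraction`, the example depths, and the uniform random-subsampling assumption are illustrative only; the actual subsampling inside Clair3 may work differently.

```python
MAX_DEPTH = 89  # cap quoted in the comment above

def keep_fraction(observed_depth, max_depth=MAX_DEPTH):
    """Fraction of reads retained if depth is uniformly subsampled to max_depth."""
    if observed_depth <= 0:
        raise ValueError("observed_depth must be positive")
    return min(1.0, max_depth / observed_depth)

# Example depths are made up for illustration.
for depth in (60, 89, 890):
    print(f"{depth}X -> keep {keep_fraction(depth):.1%} of reads")
```

Under this assumption, input at or below 89X is left untouched, while an 890X amplicon would be reduced to roughly a tenth of its reads before full-alignment calling.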
Hi everyone, @philres Prior to variant calling, I subsample 1000 reads. What I'm seeing, given that I only take reads with quality higher than Q7, is that in every case I get a depth higher than 950X, as calculated by mosdepth. @mproberts99 I had this hypothesis too, that the subsampling skews our results by losing information. What I saw from many bootstraps is that I always get all the requested variant information when running on R9, unlike the R10 tests (where I lose information in 9% of cases, without touching any options). Curiously, I had exactly the opposite problem with another sample: I was able to catch my variants of interest in R10, but not in R9. Since I'm only sequencing a single gene, I suppose it is more related to pipeline options/models than to the subsampling, even though the subsampling can still introduce variability in the results by picking reads of higher or lower quality, which might play a role in the Clair3 process (?). @aquaskyline Thanks again, I'm happy to confirm that by setting --var_pct_full, --ref_pct_full and --var_pct_phasing to 1, I catch every variant of interest 100% of the time in my VCF output using the R10 flowcells.
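For reference, the settings described here can be assembled into a Clair3 invocation along the lines sketched below. The file paths, thread count and model directory are placeholders, and the helper name `build_clair3_command` is hypothetical; only the option names are taken from the Clair3 README. The command is built but not executed.

```python
import shlex

def build_clair3_command(bam, ref, model_path, output_dir, threads=4,
                         var_pct_full=1, ref_pct_full=1, var_pct_phasing=1):
    """Build (but do not run) a run_clair3.sh command line as a list of arguments."""
    return [
        "run_clair3.sh",
        f"--bam_fn={bam}",
        f"--ref_fn={ref}",
        f"--threads={threads}",
        "--platform=ont",
        f"--model_path={model_path}",
        f"--output={output_dir}",
        # Experimental options discussed in this thread, all raised to 1.
        f"--var_pct_full={var_pct_full}",
        f"--ref_pct_full={ref_pct_full}",
        f"--var_pct_phasing={var_pct_phasing}",
    ]

cmd = build_clair3_command(
    bam="sample.bam",                 # placeholder path
    ref="reference.fa",               # placeholder path
    model_path="/path/to/r10_model",  # placeholder model directory
    output_dir="clair3_out",          # placeholder output directory
)
print(shlex.join(cmd))
```

To actually run it, `cmd` could be passed to `subprocess.run(cmd, check=True)` on a machine where Clair3 is installed.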
Hi, I would like to share a problem that I have when I do variant calling in the BRCA1 gene. I've done 2 iterations of my pipeline on the same sequenced sample, and I get 2 different results. Let me explain: my pipeline takes a subsample of 1000 reads above Q15 from the raw FASTQ files, and we do variant calling with Clair3 with --var_pct_full, --ref_pct_full and --var_pct_phasing equal to 1. In the first iteration, I got this result for the variant:
It appears in pileup.vcf.gz:
It appears in full_alignment.vcf.gz:
I can see in the log that this variant appears as "RefCall" in pileup mode, and is then found again as 0/1 in the full-alignment pipeline. Here is the Clair3 log:
But in the second iteration of my pipeline, this variant doesn't appear in the full-alignment VCF file. It appears in pileup.vcf.gz:
and the Clair3 log is here:
I don't understand. Do you know why I lose my variant from one iteration to the next in the full-alignment pipeline? Many thanks in advance.
Weird. I saw log differences in the phasing step. Could you run a third iteration and see if the log changes again?
Hi, I've taken the time to run several double iterations. We'll stay in touch,
If possible, could you please share the data with me for troubleshooting?
Hi,
I'm facing a curious problem. I've sequenced a sample twice, first with the R9 flowcell version and then with the R10 flowcell version, and it looks like I'm losing some information in the Clair3 results for a particular variant I'm interested in. Here are some results:
Both sequencing runs were processed with:
1. Command line with the R9.4.1 flowcells and Kit 12 chemistry:
Result: I found my variant in 100% (out of 100 tests) of my VCF outputs.
2. Command line with the R10.4.1 flowcells and Kit 14 chemistry:
Result: I found my variant in 91% (out of 100 tests) of my VCF outputs.
Do you know a potential reason why I'm losing my variant of interest in 9% of cases when upgrading the flowcell version from R9 to R10? I've also tested the R10 protocol with the r1041_e82_260bps_hac_g632 model, and none of the VCF outputs contained my variant of interest.
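A detection rate like the 100% and 91% figures above can be tallied with a small script. The sketch below is an assumption about how such a check might look, not the pipeline actually used here: the function names, the example variant and the inline VCF records are made up for illustration. It counts a variant as found when a record matches chromosome, position, REF and ALT and its genotype contains a non-reference allele.

```python
def variant_found(vcf_text, chrom, pos, ref, alt):
    """True if the VCF text holds a non-reference call for the given variant."""
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 10:
            continue  # no sample column; skip incomplete records
        if (fields[0], int(fields[1]), fields[3]) != (chrom, pos, ref):
            continue
        if alt not in fields[4].split(","):
            continue
        fmt_keys = fields[8].split(":")
        sample = fields[9].split(":")
        gt = sample[fmt_keys.index("GT")] if "GT" in fmt_keys else "./."
        alleles = gt.replace("|", "/").split("/")
        if any(a not in ("0", ".") for a in alleles):
            return True
    return False

def detection_rate(vcf_texts, chrom, pos, ref, alt):
    """Fraction of VCF outputs in which the variant is called."""
    if not vcf_texts:
        raise ValueError("need at least one VCF output")
    hits = sum(variant_found(t, chrom, pos, ref, alt) for t in vcf_texts)
    return hits / len(vcf_texts)

# Made-up records for a hypothetical variant chr17:43000000 A>G.
HEADER = "##fileformat=VCFv4.2\n"
HIT = HEADER + "chr17\t43000000\t.\tA\tG\t20.1\tPASS\t.\tGT:GQ\t0/1:20\n"
REFCALL = HEADER + "chr17\t43000000\t.\tA\tG\t5.0\tRefCall\t.\tGT:GQ\t0/0:5\n"
MISSING = HEADER

runs = [HIT, HIT, REFCALL, MISSING]
rate = detection_rate(runs, "chr17", 43000000, "A", "G")
print(f"variant found in {rate:.0%} of runs")
```

Treating a `0/0` record as not found keeps RefCall entries from being counted as detections, which matters when comparing pileup and full-alignment outputs.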
Many thanks in advance.
Best regards,
Boris