
Deep sequencing #62

Closed
KBT59 opened this issue Apr 5, 2018 · 22 comments


@KBT59

KBT59 commented Apr 5, 2018

I am interested in training DeepVariant for deep sequencing on a capture panel. We are interested in lower frequency variants - say 1%. Depth of coverage is on the order of 1000 to 1700 for the data I am using. I have set the default height of the pileup tensors to 2000 via
https://github.com/google/deepvariant/blob/r0.5/deepvariant/make_examples.py#L177

In a set with 497 confirmed 'true' variants I'm getting a much smaller number of variants out of make_examples:

I0404 17:02:18.420840 140137671104256 make_examples.py:1032] Found 487 candidate variants
I0404 17:02:18.421224 140137671104256 make_examples.py:620] ----- VariantCounts -----
I0404 17:02:18.421346 140137671104256 make_examples.py:624] All: 29/29 (100.00%)
I0404 17:02:18.421475 140137671104256 make_examples.py:624] SNPs: 27/29 (93.10%)
I0404 17:02:18.421593 140137671104256 make_examples.py:624] Indels: 2/29 (6.90%)
I0404 17:02:18.421717 140137671104256 make_examples.py:624] BiAllelic: 29/29 (100.00%)
I0404 17:02:18.421834 140137671104256 make_examples.py:624] MultiAllelic: 0/29 (0.00%)
I0404 17:02:18.421953 140137671104256 make_examples.py:624] HomRef: 28/29 (96.55%)
I0404 17:02:18.422069 140137671104256 make_examples.py:624] Het: 1/29 (3.45%)

What, besides setting the pileup height to match my data, should I be looking at?
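As a quick back-of-the-envelope check on the numbers above (depths and allele fraction taken from this comment; a sketch, not DeepVariant code):

```python
# How many reads would support a 1% allele-fraction variant
# at the coverage depths described above?
allele_fraction = 0.01
for depth in (1000, 1700):
    print(depth, round(depth * allele_fraction))  # roughly 10 and 17 supporting reads
```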

@pichuan
Collaborator

pichuan commented Apr 5, 2018

Hi KBT59,
Because the released models are trained with the default pileup image height, just changing --pileup_image_height at inference time won't really give you better results. Currently DeepVariant is a germline variant caller, so it's not designed to call variants at 1% frequency.

@pichuan
Collaborator

pichuan commented Apr 5, 2018

Hi again,
I didn't read carefully so I missed that you said you want to train a model.
If you want make_examples to create more candidates, the other flags to consider are vsc_min_count_snps, vsc_min_count_indels, vsc_min_fraction_snps, and vsc_min_fraction_indels. With the default values of these VSC (Very Sensitive Caller) flags, you simply won't get candidates generated for low-allele-fraction variants at all. So I would suggest experimenting with those flags and seeing if more candidates come out.

Thanks! Let us know how it goes.
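As a hedged sketch, the flags above could be combined into a make_examples invocation like the following; the threshold values and file paths are illustrative placeholders, not recommended settings:

```python
# Sketch: assemble a make_examples command line with lowered candidate
# thresholds. Flag names come from the discussion above; values and
# paths are placeholders chosen for a ~1% allele-fraction target.
make_examples_cmd = [
    "python", "deepvariant/make_examples.py",
    "--mode", "calling",
    "--ref", "reference.fasta",           # placeholder
    "--reads", "sample.bam",              # placeholder
    "--examples", "out.examples.tfrecord",
    "--vsc_min_count_snps", "2",
    "--vsc_min_fraction_snps", "0.01",    # illustrative, not a recommendation
    "--vsc_min_count_indels", "2",
    "--vsc_min_fraction_indels", "0.01",
]
print(" ".join(make_examples_cmd))
```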

@KBT59
Author

KBT59 commented Apr 9, 2018 via email

@pichuan
Collaborator

pichuan commented Apr 10, 2018

Lowering the fractions makes sense. Since you're doing something very experimental, you'll need to look at your own metrics to decide what thresholds make sense. In particular, you'll want to confirm that your new settings give you enough sensitivity: if a variant is not picked up by the Very Sensitive Caller, it won't be called later on.
There's a chance that the current model won't work well on your use case at all (and you might need to use a different kind of model), but it's worth a try.
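For a concrete sense of what "enough sensitivity" means here, a minimal sketch of measuring candidate recall against a truth set; the positions below are made up for illustration, and in practice they would come from the truth VCF and the candidates make_examples emits:

```python
# Sketch: candidate-level recall as (truth positions recovered) / (truth positions).
truth = {("chr1", 100), ("chr1", 250), ("chr2", 40)}       # illustrative truth set
candidates = {("chr1", 100), ("chr2", 40), ("chr2", 99)}   # illustrative candidates
recall = len(truth & candidates) / len(truth)
print(f"{recall:.2f}")  # 0.67
```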

@KBT59
Author

KBT59 commented Apr 10, 2018 via email

@pichuan
Collaborator

pichuan commented Apr 10, 2018

I think you'll want:
tfrecord_path: "/home2/myModelAttempt/output/5PRR-RD_S86.examples.tfrecord-?????-of-00064"
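For context, the `?????-of-00064` suffix is a sharded-file pattern: make_examples run with 64 shards writes one file per shard, and each `?` matches a single digit of the shard index. A small sketch (filenames generated to mimic the path above):

```python
import fnmatch

# Each "?" in the pattern matches exactly one character, so the
# five-digit shard index of every file matches "?????".
pattern = "5PRR-RD_S86.examples.tfrecord-?????-of-00064"
shards = [f"5PRR-RD_S86.examples.tfrecord-{i:05d}-of-00064" for i in range(64)]
matched = [s for s in shards if fnmatch.fnmatch(s, pattern)]
print(len(matched))  # 64 -- the pattern picks up every shard
```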

@KBT59
Author

KBT59 commented Apr 10, 2018 via email

@pichuan
Collaborator

pichuan commented Apr 12, 2018 via email

@KBT59
Author

KBT59 commented Apr 13, 2018 via email

@pichuan
Collaborator

pichuan commented Apr 13, 2018

Hi,
Originally I was thinking a small or synthetic dataset could be subsampled from your data. I don't actually want the full data anyway (that wouldn't be a small thing I can try), but I understand if you can't even subsample from your real data.
How about at least posting the commands you used?

From earlier discussions, it sounds like the main thing you're changing about the data representation is the pileup_image_height. You can actually do the same thing on the QuickStart or CaseStudy data too. It will just look like a taller image with the bottom being mostly empty.
(You can use logic like this https://github.com/google/deepvariant/blob/r0.6/docs/visualizing_examples.ipynb to visualize them)
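The decoding step that notebook performs boils down to reshaping raw bytes by the stored shape. Here's a minimal sketch with a placeholder buffer; the shape values are illustrative, and a taller --pileup_image_height would change the first dimension:

```python
import numpy as np

# A DeepVariant example stores the pileup tensor as raw uint8 bytes
# plus its shape. The buffer below is a zero-filled stand-in for the
# encoded bytes; the dimensions are illustrative only.
height, width, channels = 100, 221, 6
encoded = bytes(height * width * channels)  # placeholder for the encoded tensor
image = np.frombuffer(encoded, dtype=np.uint8).reshape(height, width, channels)
print(image.shape)  # (100, 221, 6)
```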

I suspect you can reproduce the same error on the CaseStudy data if you follow the same steps.

Once you're able to do that, post every step (similar to QuickStart and CaseStudy) here, and note where you hit the error.

@KBT59
Author

KBT59 commented Apr 13, 2018 via email

@KBT59
Author

KBT59 commented Apr 13, 2018 via email

@pichuan
Collaborator

pichuan commented Apr 14, 2018

Hi,
I'm not seeing the zip file.

@KBT59
Author

KBT59 commented Apr 16, 2018 via email

@KBT59
Author

KBT59 commented May 1, 2018 via email

@pgrosu

pgrosu commented May 1, 2018

Hi Brad,

Sometimes smtp (email) servers block zip files. Just put it on Google Drive or DropBox and share the link to it.

~p

@KBT59
Author

KBT59 commented May 1, 2018 via email

@pgrosu

pgrosu commented May 1, 2018

If they are small you might attach them individually directly in Github as shown here:

https://blog.github.com/2015-09-25-attach-files-to-comments/

@KBT59
Author

KBT59 commented May 1, 2018

These are the files I mentioned above.

bundle.zip

@pichuan
Collaborator

pichuan commented May 2, 2018

Hi,
I'll take a look. Give me a few days. Please feel free to ping back if you don't hear from me by end of this week.

@pichuan
Collaborator

pichuan commented May 11, 2018

Update:
I can confirm that I'm able to reproduce your error. We're working on a fix. Stay tuned!

@depristo

I've figured out what's going on here and have some good news and bad news.

First, the bad news is that setting the height to 2000 isn't going to work in the short run. This is a limitation coming from inception_v3 itself. At such large image sizes, we would have to run with spatial_squeeze=False to avoid this exception. By doing so we'd essentially end up with a "tile" of deepvariant predictions every 64 rows in the image, and then have to pool them together somehow, which makes sense in the general object detection case but not for us in DeepVariant.

The good news is that the maximum supported height is 362, so you can get a lot more information into your images than the default value of 100. Give 362 a try and let us know if that works.

I should point out that we use a reservoir sampler to create these images. So a height of 362 means you'll get a random sampling of 362 - 5 [for the reference] reads from your very deep sequencing. It's not ideal if you want to detect things occurring in only 1 or 2 reads, but you get a reasonable number of reads if you are looking for things >1% or so frequency in the reads.
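The reservoir sampling described above can be sketched as follows (an illustration of the idea, not DeepVariant's actual sampler):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Uniformly sample up to k items from a stream of unknown length.

    Sketch of the reservoir-sampling idea described above (Algorithm R),
    not DeepVariant's implementation.
    """
    rng = rng or random.Random(0)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

# Height 362 leaves 362 - 5 = 357 rows for reads after the reference rows,
# so at ~1500x depth each read is kept with probability ~357/1500 (~24%).
kept = reservoir_sample(range(1500), 362 - 5)
print(len(kept))  # 357
```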

Hope that helps!

Mark
