Support for multiple alternate alleles. #193

Closed
arq5x opened this issue Aug 14, 2013 · 15 comments

Comments

@arq5x
Owner

arq5x commented Aug 14, 2013

No description provided.

@roryk
Collaborator

roryk commented Oct 4, 2013

Hi Aaron,

Just pinging on this issue: we have a couple of users for whom being able to handle multiple alternate alleles would be really helpful. I saw in #161 that there has been discussion about how to handle it. Was there ever a consensus?

@arq5x
Owner Author

arq5x commented Oct 5, 2013

Hi Rory,

Support for multiple alleles is still in the works. We now know that both SnpEff and VEP can provide distinct annotations for the impact of each allele on each transcript. This will be next on the list after standardizing SO terms and allowing auto_* tools to support families without defined parents (both are being worked on as we speak).

The one thing I haven't thought through fully is how to represent genotypes for each sample in the case of multiple alleles.

For example, consider the following multi-allele variant:

REF  ALT  S1   S2   S3   S4   S5
G    A,T  0/1  0/2  0/0  1/1  2/2

We would split this into two variant rows, yet I find it a bit confusing how to properly report the genotypes for samples that ARE variant, but for the allele now stored in the other row. One convention would be to just store such genotypes as 0/0 (HOM_REF); I have marked these below with a *. It seems like that is the best way, yet the genotypes are technically incorrect. I guess another way would be to mark them as unknown instead. Thoughts?

REF  ALT  S1    S2    S3   S4    S5
G    A    0/1   0/0*  0/0  1/1   0/0*
G    T    0/0*  0/1   0/0  0/0*  1/1
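The two conventions discussed above can be sketched in a few lines of Python. This is only an illustration, not GEMINI code: the function name `split_genotypes` and its `starred` parameter are hypothetical, genotypes are assumed to be unphased strings using `/`, and a genotype is treated as "starred" whenever it carries an ALT allele that belongs to a different row.

```python
def split_genotypes(n_alts, genotypes, starred="unknown"):
    """Return one list of bi-allelic genotypes per ALT allele.

    n_alts:    number of ALT alleles in the original record
    genotypes: unphased genotype strings such as "0/2" (assumption)
    starred:   "ref" stores starred genotypes as 0/0,
               "unknown" stores them as ./.
    """
    placeholder = "0/0" if starred == "ref" else "./."
    rows = []
    for alt_index in range(1, n_alts + 1):
        own = str(alt_index)
        row = []
        for gt in genotypes:
            alleles = gt.split("/")
            # Starred: the sample carries an ALT stored in another row.
            if any(a not in ("0", ".", own) for a in alleles):
                row.append(placeholder)
            else:
                # Renumber this row's ALT allele to 1.
                row.append("/".join("1" if a == own else a for a in alleles))
        rows.append(row)
    return rows


samples = ["0/1", "0/2", "0/0", "1/1", "2/2"]
as_ref = split_genotypes(2, samples, starred="ref")
as_unknown = split_genotypes(2, samples, starred="unknown")
```

Note that in this sketch a compound heterozygote such as 1/2 would be starred in both rows, which loses information; rewriting allele by allele would be one alternative.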

@arq5x
Owner Author

arq5x commented Oct 7, 2013

I also realize that correctly calculating the AAF column for each allele will be tricky. It seems like setting the starred genotypes above to unknown is the way to go.
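To see why the AAF depends on the convention, here is a minimal sketch that counts ALT alleles over called alleles only. The function name `alt_allele_freq` is hypothetical and this is not how GEMINI necessarily computes the column; missing alleles (`.`) are simply left out of the denominator.

```python
def alt_allele_freq(genotypes):
    """Fraction of called alleles that are non-reference.

    Missing alleles (".") are excluded from the denominator.
    Returns None when no allele was called.
    """
    alt = called = 0
    for gt in genotypes:
        for allele in gt.split("/"):
            if allele == ".":
                continue
            called += 1
            if allele != "0":
                alt += 1
    return alt / called if called else None


# Row for ALT allele A from the example above, under each convention.
row_a_as_ref = ["0/1", "0/0", "0/0", "1/1", "0/0"]
row_a_as_unknown = ["0/1", "./.", "0/0", "1/1", "./."]

aaf_ref = alt_allele_freq(row_a_as_ref)          # 3 of 10 called alleles
aaf_unknown = alt_allele_freq(row_a_as_unknown)  # 3 of 6 called alleles
```

For allele A the two conventions give 0.3 and 0.5 respectively, so the choice of how to store the starred genotypes directly changes the denominator.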

@hmkim

hmkim commented Nov 21, 2013

Refer to https://github.com/ekg/vcflib#vcfbreakmulti

I'm working on this issue in my own project.
If anybody has suggestions for handling multiple ALT alleles, please discuss!

Thanks.

@arq5x
Owner Author

arq5x commented Mar 3, 2015

An update. @brentp and I discussed the changes that need to be made to support multiple alternate alleles and I think we have a decent preliminary plan to get this in place in the near future. If we run into any complications, we will revisit the discussion here.

@chapmanb
Collaborator

chapmanb commented Mar 3, 2015

Aaron and Brent;
I knew y'all would tackle this as soon as I added a workaround to bcbio. If it helps at all, here is how I handle it now:

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/multiallelic.py

It decomposes multi-allelic inputs into bi-allelic, redoes the effects prediction, then uploads into GEMINI. It's great that you all are working on this.

@brentp
Collaborator

brentp commented Mar 3, 2015

@chapmanb the plan is to split multi-allelic variants into separate rows in the variants table.
This will require users to specify "Allele" (#161 (comment)) when calling VEP so that we know which effect belongs to which allele. SnpEff reports this by default now. [Does this sound reasonable to you?]

This will require changing the index on the variants table to be a joint index on variant_id, allele.
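As a rough illustration of what a joint index on variant_id, allele could look like in SQLite, here is a sketch using Python's `sqlite3` module. The table definition below is a cut-down, hypothetical schema for demonstration only, not the actual GEMINI variants table, and the index name is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, minimal stand-in for the variants table.
conn.execute(
    "CREATE TABLE variants ("
    "variant_id INTEGER, allele TEXT, chrom TEXT, start INTEGER, ref TEXT)"
)

# Joint index: a (variant_id, allele) pair identifies exactly one row.
conn.execute(
    "CREATE UNIQUE INDEX variants_id_allele_idx "
    "ON variants (variant_id, allele)"
)

# One multi-allelic site (G -> A,T) stored as two rows.
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?)",
    [(1, "A", "chr1", 100, "G"), (1, "T", "chr1", 100, "G")],
)

rows = conn.execute(
    "SELECT allele FROM variants WHERE variant_id = ? ORDER BY allele", (1,)
).fetchall()
```

With the index declared `UNIQUE`, inserting a second row with the same variant_id and allele raises `sqlite3.IntegrityError`; whether uniqueness should be enforced this way is a design choice the thread does not settle.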

There are a number of other changes, including many that I'm likely not aware of, but I hope to figure those out as I get familiar with the code-base.

@brentp
Collaborator

brentp commented Mar 3, 2015

Link to the GA4GH issue about multiple alts, taken from @chapmanb's implementation: ga4gh/ga4gh-schemas#169

@chapmanb
Collaborator

chapmanb commented Mar 4, 2015

Brent;
That sounds perfect, thank you again for taking this on. Let me know if I can help with anything as you dig into it. I haven't really thought deeply about all of the issues beyond reading that GA4GH thread, but it will be cool to have a better representation in place.

@oleraj

oleraj commented Mar 4, 2015

Hi Brent and Aaron, I've been following this thread and others related to multi-allelic representation in GEMINI as it is one thing I consider high priority for me as a user. So, I'm glad to hear this is being worked on!
One other thought: I was wondering whether you are also considering converting alleles to their minimal representation when doing the comparisons with 1kg, ExAC, etc. to populate the in_1kg, aaf_1kg_*, etc. columns in the variants table. If the multi-allelic variants are used as represented in the VCF file, you could run into the issue of not finding a match for the variant in the reference database, even if there really is one. This is discussed in more detail in this blog post: http://www.cureffi.org/2014/04/24/converting-genetic-variants-to-their-minimal-representation/ . (I saw this posted on the gemini-variation forum (by Mark Cowley) and thought it also made sense to include it in the conversation here.) For this to work when matching on CHROM + POS + REF + ALT, it seems that both the input VCF and the reference VCF (e.g., ExAC) alleles would need to be in their minimal representation, even if only in the way they are stored internally, without modifying the actual VCF files.
Andrew
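For readers unfamiliar with the idea, a minimal sketch of reducing a single REF/ALT pair to its minimal representation follows. It uses the common approach of trimming shared trailing bases first and then shared leading bases (advancing POS for each leading base removed). The function name `minimal_representation` is hypothetical, and this is not necessarily what GEMINI implements.

```python
def minimal_representation(pos, ref, alt):
    """Trim bases shared by REF and ALT, keeping at least one base in each.

    pos is the 1-based position of the first REF base.
    """
    # Trim shared trailing bases; POS is unaffected.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim shared leading bases; POS moves right by one each time.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# After splitting a multi-allelic record CTCC -> CCC,C,CCCC at POS 1001:
deletion = minimal_representation(1001, "CTCC", "CCC")
long_deletion = minimal_representation(1001, "CTCC", "C")
snv = minimal_representation(1001, "CTCC", "CCCC")
```

Here the third allele reduces to a simple T to C change at position 1002, which is the form in which it would most likely appear in a reference database. A lookup key could then be built from CHROM plus the returned tuple, applied identically to both the input and the reference alleles.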

@arq5x
Owner Author

arq5x commented Mar 5, 2015

Hi Andrew,
Yes, we will definitely need to make sure that comparisons to variant annotation datasets (e.g., 1kg and ExAC) also account for alternate alleles. We are still looking into the best way to do this, but the strategy outlined in the link you sent is a good primer. Thanks for the input.
Aaron

@brentp
Collaborator

brentp commented Mar 5, 2015

@oleraj I have this implemented as a simple function, along with the code to replicate what vt decompose does. Together, these should reduce false negatives.

brentp added a commit to brentp/gemini that referenced this issue Mar 5, 2015
@brentp
Collaborator

brentp commented Mar 5, 2015

The linked branch still needs to parse out the per-alt effects from SnpEff/VEP, but the initial utility code is there.

@oleraj

oleraj commented Mar 5, 2015

Great, looking forward to these updates. Thanks for your work on this.

@brentp
Collaborator

brentp commented Mar 18, 2015

This has been merged into master.

@arq5x arq5x closed this as completed Mar 19, 2015