
This query gets a wrong server response back. #433

Closed
jingchunzhu opened this issue Jun 2, 2015 · 29 comments

@jingchunzhu

The following query against umd gets 2520 variants back; many are not within the start and end interval:
python client_dev.py -vv variants-search --variantSetIds umd --referenceName 17 --start 41227747 --end 41244748 --pageSize 1000 http://ec2-54-148-207-224.us-west-2.compute.amazonaws.com:8000/v0.6.e6d6074
2520 variants

The strangest thing is that the xena browser client (in JavaScript) also gets the "same" wrong answer back ("same" in the sense that the same number of variants is returned).

However, the curl command seems to get the correct response back:

curl --data '{ "callSetIds": [], "end": 41244748, "referenceName": "17", "start": 41227747, "variantSetIds": ["umd"], "pageSize": 1000}' --header 'Content-Type: application/json' http://ec2-54-148-207-224.us-west-2.compute.amazonaws.com:8000/v0.6.e6d6074/variants/search

@jingchunzhu
Author

All three clients are now consistently getting the incorrect response back.

The pageSize parameter in the curl command does not seem to be used. Once you manually add a pageToken parameter, it also triggers the incorrect response from the server.

curl --data '{ "callSetIds": [], "end": 41244748, "referenceName": "17", "start": 41227747, "variantSetIds": ["umd"], "pageSize": 1000}' --header 'Content-Type: application/json' http://ec2-54-148-207-224.us-west-2.compute.amazonaws.com:8000/v0.6.e6d6074/variants/search

@dcolligan
Member

Which version of the server is this? Is it an old one that has the bug we fixed in this update? https://github.com/ga4gh/server/releases/tag/v0.1.1

It looks like you are running 0.1.0b1, not 0.1.1. Try updating and see what happens.

@jingchunzhu
Author

Querying the umd database on chr17 between positions 41227747 and 41244748, the server found 2520 variants; many are not in this interval.

Here is a random copy-and-paste of a few that are outside the interval (both start and end):

variant_start variant_end
41226453 41226454
41226456 41226457
41226473 41226474
41226487 41226488
41226487 41226488
41226487 41226488
41226487 41226488
41226487 41226488
41226487 41226488

You should see the same thing using the python client.

server implementation used: http://ec2-54-148-207-224.us-west-2.compute.amazonaws.com:8000/v0.5.1

GA4GH reference server 0.1.1

Protocol version 0.5.1

Operations available

Method Path
GET /v0.5.1
POST /v0.5.1/callsets/search
POST /v0.5.1/readgroupsets/search
POST /v0.5.1/reads/search
POST /v0.5.1/referencesets/search
POST /v0.5.1/variants/search
POST /v0.5.1/variantsets/search
Uptime

Running since 13 hours ago (04:23:35 04 Jun 2015)
Configuration

Key Value
DEBUG True
REQUEST_VALIDATION False
RESPONSE_VALIDATION False
Data

VariantSets

Clinvar
ex_lovd
lovd
1000_genomes
exac
bic
umd
ReadGroupSets

@dcolligan
Member

@jeromekelleher do you have any guesses why this might be happening? It seems like backend.py should iterate past these variants that are getting returned, since they're outside the requested range...

@jeromekelleher
Contributor

Yes, it looks like this is a bug in our paging code. Thanks for reporting this @jingchunzhu.

@dcolligan, I've no guesses right now. We'll need to get the data and try to reproduce this locally so we can get to the bottom of it. @jingchunzhu, can you provide us with some information on how we can download this dataset so we can debug the problem please?

@jingchunzhu
Author

Jerome,

Use the data on the GA4GH server at:
http://ec2-54-148-207-224.us-west-2.compute.amazonaws.com:7000/v0.6.e6d6074

Python command line:
python client_dev.py -vv variants-search --variantSetIds umd --referenceName 17 --start 41227747 --end 41244748 --pageSize 1000 http://ec2-54-148-207-224.us-west-2.compute.amazonaws.com:7000/v0.6.e6d6074

jing


@dcolligan
Member

@jingchunzhu, it would be better if we actually had access to the data that this server is using, both so we could cross-reference the returned results with the actual data and so we could set breakpoints on our locally running server.

@jeromekelleher
Contributor

Yes, we really do need to get the data to reproduce this, sorry @jingchunzhu. Is there a download link that we can use to get at it?

@jingchunzhu
Author

Charlie, can we share the umd BRCA vcf files with Jerome for figuring out the server bug?

Jing


@jingchunzhu
Author

Yes, I believe so. Do you want to know where it's located on the server? Or do you want me to send the file to him directly?

-Charlie


@jingchunzhu
Author

Jerome,

Attached is the data file (in vcf).

jing


@jeromekelleher
Contributor

I'm afraid attachments don't work for GitHub; you can either post a download link here or send the file to me over email at jk@well.ox.ac.uk. Thanks.

@pgrosu

pgrosu commented Jun 9, 2015

@jingchunzhu, @jeromekelleher You can attach files to comments on GitHub if you rename your file with a .png extension and drag-and-drop it into the comment section. Then anyone can download the PNG file, rename it back to the original extension, and see the complete contents.

@dcolligan
Member

@jingchunzhu can you also email me the index file you are using for that vcf file?

@jingchunzhu
Author

Charlie,

Can you help?

Jing


@jingchunzhu
Author

Here's the index file. Is there anything else you guys need?

-Charlie


@dcolligan
Member

I didn't receive it. You'll need to mail it to me as an attachment.

@jeromekelleher
Contributor

I think you can remake the index using tabix, @dcolligan; it shouldn't be necessary to get the original index (unless we're having trouble reproducing the problem).
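
For reference, a minimal sketch of rebuilding the index with pysam (which wraps tabix); umd.vcf.gz is a placeholder name here, and the file must be bgzip-compressed:

import pysam

# Rebuild the .tbi index for the VCF; force=True overwrites any existing index.
pysam.tabix_index("umd.vcf.gz", preset="vcf", force=True)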

@dcolligan dcolligan self-assigned this Jun 15, 2015
@dcolligan
Member

For this particular query, it looks like 529 of the 2520 variants returned to the client are defective, all because both the start and end of the returned variant are less than the request's start attribute. It looks like a problem with the paging code, since 1991 variants are returned by the lower-level query to HtslibVariantSet.getVariants (and none of those have start and end less than startPosition). Also, all the defective variants are returned in the first page of results: the page size is 1000, the first defective variant is the (0-indexed) 23rd returned, and the final one is the 562nd.

For this particular request, there are 5 POST /v0.5.1/variants/search requests, returning the following numbers of results; each is under 1000 because the response buffer fills up past its allowed byte limit:

20 RESULTS
583 RESULTS
178 RESULTS
873 RESULTS
866 RESULTS

Setting MAX_RESPONSE_LENGTH in serverconfig.py to something very high yields the correct number of results.

Continuing to investigate...
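
To illustrate the mechanism, here is a rough sketch (not the server's actual code) of a page-building loop in which a byte budget, not just pageSize, can end a page early; variant.toJsonString() is the serialization method mentioned later in this thread:

def build_page(variants, page_size, max_response_length):
    # Collect serialized variants until either the page is full or the
    # byte budget is exhausted; the budget is what truncates the pages above.
    page, num_bytes = [], 0
    for variant in variants:
        serialized = variant.toJsonString()
        page.append(serialized)
        num_bytes += len(serialized)
        if len(page) >= page_size or num_bytes >= max_response_length:
            break
    return page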

@dcolligan
Member

@jeromekelleher it looks like our algorithm only works if the VCF is ordered, and this VCF appears to be unordered (as in, when we pull in the variants they are unordered by (variant.start, variant.end))... I think that's the problem
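
A quick way to check the ordering claim with pysam (a sketch; umd.vcf.gz is a placeholder filename for the bgzip-compressed, indexed file):

import pysam

vcf = pysam.VariantFile("umd.vcf.gz")
previous = None
for record in vcf.fetch("17"):
    key = (record.start, record.stop)  # pysam uses a 0-based start; stop is the end coordinate
    if previous is not None and key < previous:
        print("out of order: %s follows %s" % (key, previous))
    previous = key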

@diekhans
Contributor

Suggest requiring the VCF to be ordered. Range searches are going to be slower if it is not ordered.


@bioinformed
Contributor

VCF and BCF files are required by the specification to be position ordered. We would be well within our rights to enforce that requirement and fail loudly. From the spec:

  1. POS - position: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. It is permitted to have multiple records with the same POS. Telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig. (Integer, Required)


@lh3
Member

lh3 commented Jun 16, 2015

Isn't the server using htslib? Bcftools/tabix should refuse to create the index if the input is not sorted. If it is not reporting the error, there is a bug.

@dcolligan
Member

Ok, the variants we get out of the VCF file are ordered by (variant.start) within the reference, just not by (variant.start, variant.end). This both conforms to the specification and, so far as I can tell, doesn't present a problem for our paging algorithm, so there goes my previous theory...

@dcolligan
Member

@jeromekelleher is there a way of deriving a unique id from a GAVariant? There are duplicate variant.ids within the result set, and also duplicate (referenceName, start, end)s within the set. Computing a sha over the variant.toJsonString() doesn't work since the fields within the JSON string aren't predictably ordered. I suppose as a last resort I could create a method that converted all the variant's attributes to a string in a predictable way, then hash it, but is there something easier? It's hard to determine what is going wrong with the paging if I can't easily tell one variant from another...
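
For what it's worth, a minimal sketch of the "last resort" approach: serializing with sorted keys makes the digest independent of field order (toJsonDict is assumed here; substitute whatever dict conversion the protocol objects actually provide):

import hashlib
import json

def variant_digest(variant):
    # sort_keys canonicalizes the field order, so equal variants hash equally.
    canonical = json.dumps(variant.toJsonDict(), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()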

@jeromekelleher
Contributor

This is tricky all right, and I can't see exactly what the problem is. I think our test coverage is a bit thin for the interval paging functionality (which is really quite complicated), so I've started an extra module to test the general code using randomly generated intervals. I'll keep going at this until it uncovers the same issue as we have here, and I'll report back once there's something to update on.

Sorry for the delay in fixing this @jingchunzhu.

@jeromekelleher
Contributor

OK, starting to see the light here. This is definitely a fault in our interval paging algorithm. We're not correctly handling the case where an interval whose start coordinate is less than the query start coordinate still overlaps the search interval. Working on a fix.
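
To make the case concrete, a sketch (not the actual server code) of the interval logic involved: with half-open intervals, a variant whose start precedes the query start can still overlap the query window, so a paging scheme that resumes purely from the last-seen start coordinate will revisit such variants:

def overlaps(variant_start, variant_end, query_start, query_end):
    # Half-open [start, end): a variant starting before query_start still
    # overlaps whenever it extends past query_start.
    return variant_start < query_end and variant_end > query_start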

@jeromekelleher
Contributor

This has been addressed in #457, and we'll try to make a bugfix release containing this and a few other bits and pieces ASAP. Thanks again for the report @jingchunzhu.
