this query get wrong server response back. #433
Comments
All three clients are now consistently getting the incorrect response back. The pageSize parameter in the curl command does not seem to be used. Once you manually add the pageToken parameter, curl also gets the incorrect response back from the server. curl --data '{ "callSetIds": [], "end": 41244748, "referenceName": "17", "start": 41227747, "variantSetIds": ["umd"], "pageSize": 1000}' --header 'Content-Type: application/json' http://ec2-54-148-207-224.us-west-2.compute.amazonaws.com:8000/v0.6.e6d6074/variants/search |
Which version of the server is this? Is it an old one that has the bug we fixed in this update? https://github.com/ga4gh/server/releases/tag/v0.1.1 It looks like you are running 0.1.0b1, not 0.1.1. Try updating and see what happens. |
When querying the umd database on chr17 between positions 41227747 and 41244748, the server found 2520 variants, many of which are not in this interval. Here are a few, copied at random, that are outside the interval (both start and end): variant_start variant_end You should see the same thing using the Python client. Server implementation used: http://ec2-54-148-207-224.us-west-2.compute.amazonaws.com:8000/v0.5.1 GA4GH reference server 0.1.1, protocol version 0.5.1, running since 13 hours ago (04:23:35 04 Jun 2015). VariantSets: Clinvar |
@jeromekelleher do you have any guesses why this might be happening? it seems like backend.py should iterate past these variants that are getting returned, since they're outside the requested range... |
Yes, it looks like this is a bug in our paging code. Thanks for reporting this @jingchunzhu. @dcolligan, I've no guesses right now. We'll need to get the data and try to reproduce this locally so we can get to the bottom of it. @jingchunzhu, can you provide us with some information on how we can download this dataset so we can debug the problem please? |
Jerome; use data on GA4GH server at : python command line : python client_dev.py -vv variants-search jing On Mon, Jun 8, 2015 at 2:16 AM, Jerome Kelleher notifications@github.com
|
@jingchunzhu , it would be better if we actually had access to the data that this server is using, both because we could cross-reference the returned results with the actual data and set breakpoints on our locally running server. |
Yes, we really do need to get the data to reproduce this, sorry @jingchunzhu. Is there a download link that we can use to get at it? |
Charlie, can we share the umd BRCA vcf files with Jerome for figuring out Jing On Tue, Jun 9, 2015 at 7:04 AM, Jerome Kelleher notifications@github.com
|
Yes, I believe so. Do you want to know where it's located on the server? Or -Charlie On Tue, Jun 9, 2015 at 11:25 AM, Jing Zhu jingchunzhu@gmail.com wrote:
|
Jerome, Attached is the data file (in vcf). jing On Tue, Jun 9, 2015 at 11:59 AM, Charles Markello cmarkell@ucsc.edu wrote:
|
I'm afraid attachments don't work for github --- you can either post a download link here or send the file to me over email at jk@well.ox.ac.uk. Thanks. |
@jingchunzhu, @jeromekelleher You can attach files into comments on GitHub if you rename your file to with a |
@jingchunzhu can you also email me the index file you are using for that vcf file? |
Charlie, Can you help? Jing On Wed, Jun 10, 2015 at 6:38 AM, Danny Colligan notifications@github.com
|
Here's the index file. Is there anything else you guys need? -Charlie On Wed, Jun 10, 2015 at 7:23 AM, Jing Zhu jingchunzhu@gmail.com wrote:
|
I didn't receive it. You'll need to mail it to me as an attachment. |
I think you can remake the index using tabix @dcolligan, it shouldn't be necessary to get the original index (unless we're having trouble reproducing the problem). |
For this particular query, it looks like 529 out of 2520 variants returned to the client are defective, all because the start and end of the returned variant are less than the request's start attribute. It looks like a problem with the paging code, as 1991 variants are returned by the lower-level query to For this particular request, there are 5
Setting Continuing to investigate... |
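The check described above (counting returned variants that lie entirely outside the requested range) can be sketched roughly as follows. This is only an illustration: the function name `classify`, the dict layout with `start` and `end` keys, the half-open coordinate convention, and the sample records are all assumptions, not the server's code or the actual umd data.

```python
def classify(variants, query_start, query_end):
    """Count variants by position relative to the half-open query [query_start, query_end).

    Assumes each variant is a dict with integer "start" and "end" keys,
    also treated as a half-open interval (an assumption for this sketch).
    """
    counts = {"in_range": 0, "before": 0, "after": 0}
    for v in variants:
        if v["end"] <= query_start:
            # Both start and end lie below the requested start: the defective case.
            counts["before"] += 1
        elif v["start"] >= query_end:
            counts["after"] += 1
        else:
            counts["in_range"] += 1
    return counts


# Made-up records, not taken from the umd data set.
sample = [
    {"start": 41200000, "end": 41200001},  # entirely before the query
    {"start": 41227740, "end": 41227750},  # starts before, but overlaps
    {"start": 41230000, "end": 41230001},  # inside the query
    {"start": 41250000, "end": 41250001},  # entirely after the query
]
counts = classify(sample, 41227747, 41244748)
```

Running the same count over a real client response would show how many of the returned variants fall into the "before" bucket.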
@jeromekelleher it looks like our algorithm only works if the VCF is ordered, and this VCF appears to be unordered (as in, when we pull in the variants they are unordered by (variant.start, variant.end))... I think that's the problem |
Suggest requiring VCF to be ordered. Range searches are going to be slower Danny Colligan notifications@github.com writes:
|
VCF and BCF files are required by the specification to be position
On Mon, Jun 15, 2015 at 7:12 PM, Mark Diekhans notifications@github.com
|
Isn't the server using htslib? Bcftools/tabix should refuse to create the index if the input is not sorted. If it is not reporting the error, there is a bug. |
OK, the variants we get out of the VCF file are ordered by (variant.start) within the reference, just not by (variant.start, variant.end). This both conforms to the specification and, so far as I can tell, doesn't present a problem for our paging algorithm, so there goes my previous theory... |
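The difference between the two orderings is easy to check mechanically. A minimal sketch, assuming variants are represented as `(start, end)` tuples; the helper name `is_sorted_by` and the sample values are made up for illustration:

```python
def is_sorted_by(variants, key):
    """True if the sequence is non-decreasing under the given key function."""
    keys = [key(v) for v in variants]
    return all(a <= b for a, b in zip(keys, keys[1:]))


# Two records share a start position, and the longer one comes first.
variants = [(100, 150), (100, 120), (130, 131)]

by_start = is_sorted_by(variants, key=lambda v: v[0])
by_start_end = is_sorted_by(variants, key=lambda v: (v[0], v[1]))
```

Here `by_start` is true while `by_start_end` is false, which is the situation described for this VCF: sorted by position, but not by the (start, end) pair.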
@jeromekelleher is there a way of deriving a unique id from a |
This is tricky all right, and I can't see exactly what the problem is. I think our test coverage is a bit thin for the interval paging functionality (which is really quite complicated), so I've started an extra module to test the general code using randomly generated intervals. I'll keep going at this until it uncovers the same issue as we have here, and report back once there's something to report. Sorry for the delay in fixing this @jingchunzhu. |
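A randomized test of this kind might look roughly like the sketch below: generate random intervals, run a paged search, and compare the concatenated pages with a brute-force scan. Everything here is hypothetical. `paged_search` is a stand-in written for this example (its page token is just a list index), not the server's paging code and not the test module mentioned above.

```python
import random


def overlaps(interval, start, end):
    """True if half-open interval (s, e) overlaps the half-open query [start, end)."""
    s, e = interval
    return s < end and e > start


def brute_force(intervals, start, end):
    """Reference answer: scan every interval, no paging."""
    return [iv for iv in intervals if overlaps(iv, start, end)]


def paged_search(intervals, start, end, page_size):
    """Stand-in paged search over intervals sorted by start coordinate.

    Yields lists of at most page_size hits; the "page token" is the index
    at which the next page resumes scanning.
    """
    token = 0
    while token is not None:
        page, i, token = [], token, None
        while i < len(intervals):
            iv = intervals[i]
            i += 1
            if iv[0] >= end:
                break  # sorted by start, so nothing further can overlap
            if overlaps(iv, start, end):
                page.append(iv)
                if len(page) == page_size:
                    token = i  # resume here on the next page
                    break
        if page:
            yield page


def check_random(trials=200, seed=1):
    """Compare paged results against brute force on random inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        starts = [rng.randint(0, 100) for _ in range(rng.randint(0, 30))]
        intervals = sorted((s, s + rng.randint(1, 20)) for s in starts)
        start = rng.randint(0, 100)
        end = start + rng.randint(1, 30)
        page_size = rng.randint(1, 5)
        got = [iv for page in paged_search(intervals, start, end, page_size) for iv in page]
        assert got == brute_force(intervals, start, end)
    return trials
```

Small coordinate ranges and small page sizes are used on purpose, so that intervals straddling the query start and results split across several pages both occur often.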
OK, starting to see the light here. This is definitely a fault in our interval paging algorithm. We're not correctly handling intervals whose start coordinate is less than the query start coordinate but which still overlap the search interval. Working on a fix. |
This has been addressed in #457, and we'll try to make a bugfix release containing this and a few other bits and pieces ASAP. Thanks again for the report @jingchunzhu. |
The following query against umd gets 2520 variants back, many of which are not within the start and end interval:
python client_dev.py -vv variants-search --variantSetIds umd --referenceName 17 --start 41227747 --end 41244748 --pageSize 1000 http://ec2-54-148-207-224.us-west-2.compute.amazonaws.com:8000/v0.6.e6d6074
2520 variants
The strangest thing is that the xena browser client (in JavaScript) also gets the "same" wrong answer back (same in the sense that the same number of variants is returned).
However, the curl command seems to get the correct response back.
curl --data '{ "callSetIds": [], "end": 41244748, "referenceName": "17", "start": 41227747, "variantSetIds": ["umd"], "pageSize": 1000}' --header 'Content-Type: application/json' http://ec2-54-148-207-224.us-west-2.compute.amazonaws.com:8000/v0.6.e6d6074/variants/search
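One possible reason the curl result looks different is that a single curl call fetches only one page, whereas a client presumably keeps requesting pages until the server stops returning a token. The sketch below sends the same JSON body from Python and follows page tokens. The helper names (`build_request`, `post_json`, `search_all`) are invented for this example, and the response fields `variants` and `nextPageToken` are assumptions about the reply format, not something shown in this thread.

```python
import json
import urllib.request


def build_request(page_token=None):
    """Build the same search body as the curl command, optionally with a pageToken."""
    body = {
        "callSetIds": [],
        "variantSetIds": ["umd"],
        "referenceName": "17",
        "start": 41227747,
        "end": 41244748,
        "pageSize": 1000,
    }
    if page_token is not None:
        body["pageToken"] = page_token
    return body


def post_json(url, body):
    """POST a JSON body and decode the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def search_all(url, post=post_json):
    """Collect variants from every page; `post` can be swapped out for testing."""
    variants = []
    token = None
    while True:
        reply = post(url, build_request(token))
        variants.extend(reply.get("variants", []))
        token = reply.get("nextPageToken")  # assumed field name
        if not token:
            return variants
```

Calling `search_all` with the `/variants/search` URL above would issue one request per page, while passing a stub as `post` lets the paging loop be exercised without a server.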