
This query gets a wrong server response back. #433

Closed
jingchunzhu opened this issue Jun 2, 2015 · 29 comments

@jingchunzhu

The following query against umd gets 2520 variants back; many are not within the start and end interval:
python client_dev.py -vv variants-search --variantSetIds umd --referenceName 17 --start 41227747 --end 41244748 --pageSize 1000 http://ec2-54-148-207-224.us-west-2.compute.amazonaws.com:8000/v0.6.e6d6074
2520 variants

The strangest thing is that the xena browser client (in JavaScript) also gets the "same" wrong answer back ("same" in the sense that the same number of variants is returned).

However, the curl command seems to get the correct response back:

curl --data '{ "callSetIds": [], "end": 41244748, "referenceName": "17", "start": 41227747, "variantSetIds": ["umd"], "pageSize": 1000}' --header 'Content-Type: application/json' http://ec2-54-148-207-224.us-west-2.compute.amazonaws.com:8000/v0.6.e6d6074/variants/search

@jingchunzhu
Author

All three clients are now consistently getting the incorrect response back.

The pageSize parameter in the curl command does not seem to be used. Once you manually add a pageToken parameter, it also triggers the incorrect response from the server.

curl --data '{ "callSetIds": [], "end": 41244748, "referenceName": "17", "start": 41227747, "variantSetIds": ["umd"], "pageSize": 1000}' --header 'Content-Type: application/json' http://ec2-54-148-207-224.us-west-2.compute.amazonaws.com:8000/v0.6.e6d6074/variants/search

@dcolligan
Member

Which version of the server is this? Is it an old one that has the bug we fixed in this update? https://github.com/ga4gh/server/releases/tag/v0.1.1

It looks like you are running 0.1.0b1, not 0.1.1. Try updating and see what happens.

@jingchunzhu
Author

Querying the umd database on chr17 between positions 41227747 and 41244748, the server found 2520 variants; many are not in this interval.

Here is a random copy-and-paste of a few that are outside the interval (both start and end):

variant_start variant_end
41226453 41226454
41226456 41226457
41226473 41226474
41226487 41226488
41226487 41226488
41226487 41226488
41226487 41226488
41226487 41226488
41226487 41226488

You should see the same thing using the python client.

server implementation used: http://ec2-54-148-207-224.us-west-2.compute.amazonaws.com:8000/v0.5.1

GA4GH reference server 0.1.1

Protocol version 0.5.1

Operations available

Method Path
GET /v0.5.1
POST /v0.5.1/callsets/search
POST /v0.5.1/readgroupsets/search
POST /v0.5.1/reads/search
POST /v0.5.1/referencesets/search
POST /v0.5.1/variants/search
POST /v0.5.1/variantsets/search
Uptime

Running since 13 hours ago (04:23:35 04 Jun 2015)
Configuration

Key Value
DEBUG True
REQUEST_VALIDATION False
RESPONSE_VALIDATION False
Data

VariantSets

Clinvar
ex_lovd
lovd
1000_genomes
exac
bic
umd
ReadGroupSets

@dcolligan
Member

@jeromekelleher do you have any guesses why this might be happening? It seems like backend.py should iterate past these variants that are getting returned, since they're outside the requested range...

@jeromekelleher
Contributor

Yes, it looks like this is a bug in our paging code. Thanks for reporting this @jingchunzhu.

@dcolligan, I've no guesses right now. We'll need to get the data and try to reproduce this locally so we can get to the bottom of it. @jingchunzhu, can you provide us with some information on how we can download this dataset so we can debug the problem please?

@jingchunzhu
Author

Jerome,

Use the data on the GA4GH server at:
http://ec2-54-148-207-224.us-west-2.compute.amazonaws.com:7000/v0.6.e6d6074

Python command line:
python client_dev.py -vv variants-search --variantSetIds umd --referenceName 17 --start 41227747 --end 41244748 --pageSize 1000 http://ec2-54-148-207-224.us-west-2.compute.amazonaws.com:7000/v0.6.e6d6074

jing


@dcolligan
Member

@jingchunzhu, it would be better if we actually had access to the data that this server is using, both so we could cross-reference the returned results with the actual data and so we could set breakpoints on our locally running server.

@jeromekelleher
Contributor

Yes, we really do need to get the data to reproduce this, sorry @jingchunzhu. Is there a download link that we can use to get at it?

@jingchunzhu
Author

Charlie, can we share the umd BRCA vcf files with Jerome for figuring out the server bug?

Jing


@jingchunzhu
Author

Yes, I believe so. Do you want to know where it's located on the server? Or do you want me to send the file to him directly?

-Charlie


@jingchunzhu
Author

Jerome,

Attached is the data file (in vcf).

jing


@jeromekelleher
Contributor

I'm afraid attachments don't work for GitHub; you can either post a download link here or send the file to me over email at jk@well.ox.ac.uk. Thanks.

@pgrosu

pgrosu commented Jun 9, 2015

@jingchunzhu, @jeromekelleher You can attach files to comments on GitHub if you rename your file with a .png extension and drag-and-drop it into the comment section. Then anyone can download the PNG file, rename it back to the original extension, and see the complete contents.

@dcolligan
Member

@jingchunzhu can you also email me the index file you are using for that vcf file?

@jingchunzhu
Author

Charlie,

Can you help?

Jing


@jingchunzhu
Author

Here's the index file. Is there anything else you guys need?

-Charlie


@dcolligan
Member

I didn't receive it. You'll need to mail it to me as an attachment.

@jeromekelleher
Contributor

I think you can remake the index using tabix, @dcolligan; it shouldn't be necessary to get the original index (unless we're having trouble reproducing the problem).
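
For reference, a minimal sketch of rebuilding the index with pysam (which wraps tabix); umd.vcf.gz is a placeholder name here, and the file must be bgzip-compressed:

import pysam

# Rebuild the .tbi index for the VCF; force=True overwrites any existing index.
pysam.tabix_index("umd.vcf.gz", preset="vcf", force=True)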

@dcolligan dcolligan self-assigned this Jun 15, 2015
@dcolligan
Member

For this particular query, it looks like 529 of the 2520 variants returned to the client are defective, all because both the start and end of the returned variant are less than the request's start attribute. It looks like a problem with the paging code, since 1991 variants are returned by the lower-level query to HtslibVariantSet.getVariants (and none of those have start and end less than startPosition). Also, all the defective variants are returned in the first page of results: the page size is 1000, the first defective variant is the (0-indexed) 23rd returned, and the final one is the 562nd.

For this particular request, there are 5 POST /v0.5.1/variants/search requests, returning the following numbers of results; each is under 1000 because the response buffer fills up past its allowed byte limit:

20 RESULTS
583 RESULTS
178 RESULTS
873 RESULTS
866 RESULTS

Setting MAX_RESPONSE_LENGTH in serverconfig.py to something very high yields the correct number of results.

Continuing to investigate...
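
To illustrate the mechanism, here is a rough sketch (not the server's actual code) of a page-building loop in which a byte budget, not just pageSize, can end a page early; variant.toJsonString() is the serialization method mentioned later in this thread:

def build_page(variants, page_size, max_response_length):
    # Collect serialized variants until either the page is full or the
    # byte budget is exhausted; the budget is what truncates the pages above.
    page, num_bytes = [], 0
    for variant in variants:
        serialized = variant.toJsonString()
        page.append(serialized)
        num_bytes += len(serialized)
        if len(page) >= page_size or num_bytes >= max_response_length:
            break
    return page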

@dcolligan
Member

@jeromekelleher it looks like our algorithm only works if the VCF is ordered, and this VCF appears to be unordered (as in, when we pull in the variants they are unordered by (variant.start, variant.end))... I think that's the problem
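
A quick way to check the ordering claim with pysam (a sketch; umd.vcf.gz is a placeholder filename for the bgzip-compressed, indexed file):

import pysam

vcf = pysam.VariantFile("umd.vcf.gz")
previous = None
for record in vcf.fetch("17"):
    key = (record.start, record.stop)  # pysam uses a 0-based start; stop is the end coordinate
    if previous is not None and key < previous:
        print("out of order: %s follows %s" % (key, previous))
    previous = key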

@diekhans
Contributor

Suggest requiring the VCF to be ordered. Range searches are going to be slower if it is not ordered.


@bioinformed
Contributor

VCF and BCF files are required by the specification to be position ordered. We would be well within our rights to enforce that requirement and fail loudly. From the spec:

  1. POS - position: The reference position, with the 1st base having position 1. Positions are sorted numerically, in increasing order, within each reference sequence CHROM. It is permitted to have multiple records with the same POS. Telomeres are indicated by using positions 0 or N+1, where N is the length of the corresponding chromosome or contig. (Integer, Required)


@lh3
Member

lh3 commented Jun 16, 2015

Isn't the server using htslib? Bcftools/tabix should refuse to create the index if the input is not sorted. If it is not reporting the error, there is a bug.

@dcolligan
Member

Ok, the variants we get out of the VCF file are ordered by (variant.start) within the reference, just not by (variant.start, variant.end). This both conforms to the specification and, so far as I can tell, doesn't present a problem for our paging algorithm, so there goes my previous theory...

@dcolligan
Member

@jeromekelleher is there a way of deriving a unique id from a GAVariant? There are duplicate variant.ids within the result set, and also duplicate (referenceName, start, end)s within the set. Computing a sha over the variant.toJsonString() doesn't work since the fields within the JSON string aren't predictably ordered. I suppose as a last resort I could create a method that converted all the variant's attributes to a string in a predictable way, then hash it, but is there something easier? It's hard to determine what is going wrong with the paging if I can't easily tell one variant from another...
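
For what it's worth, a minimal sketch of the "last resort" approach: serializing with sorted keys makes the digest independent of field order (toJsonDict is assumed here; substitute whatever dict conversion the protocol objects actually provide):

import hashlib
import json

def variant_digest(variant):
    # sort_keys canonicalizes the field order, so equal variants hash equally.
    canonical = json.dumps(variant.toJsonDict(), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()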

@jeromekelleher
Contributor

This is tricky all right, and I can't see exactly what the problem is. I think our test coverage is a bit thin for the interval paging functionality (which is really quite complicated), so I've started an extra module to test the general code using randomly generated intervals. I'll keep going at this until it uncovers the same issue as we have here, and I'll report back once there's something to update on.

Sorry for the delay in fixing this @jingchunzhu.

@jeromekelleher
Contributor

OK, starting to see the light here. This is definitely a fault in our interval paging algorithm. We're not correctly handling the case where an interval whose start coordinate is less than the query start coordinate still overlaps the search interval. Working on a fix.
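
To make the case concrete, a sketch (not the actual server code) of the interval logic involved: with half-open intervals, a variant whose start precedes the query start can still overlap the query window, so a paging scheme that resumes purely from the last-seen start coordinate will revisit such variants:

def overlaps(variant_start, variant_end, query_start, query_end):
    # Half-open [start, end): a variant starting before query_start still
    # overlaps whenever it extends past query_start.
    return variant_start < query_end and variant_end > query_start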

@jeromekelleher
Contributor

This has been addressed in #457, and we'll try to make a bugfix release containing this and a few other bits and pieces ASAP. Thanks again for the report @jingchunzhu.
