Add `from` support to top_hits aggregator. #6299
Comments
+1 |
pagination for top_hits aggregator, definitely +1 |
Judging by commit a989229, it seems that you added pagination for hits, but what about the groups (buckets) themselves? For example, if I group books by author and the search query returns 9 unique authors, can I show a first page with 5 authors and a next page with the other 4 authors? |
Already tested, your assumption is correct. We need pagination support for buckets. |
Paging is tricky. We might be able to expose it when sorting by term (would it work for you?), but if you are sorting by counts or by sub aggregation, then #1305 would make counts wrong and ordering inconsistent across pages. |
Yes, that would work for me. Please expose this feature. |
I need at least sort by term and sort by docs count. Is approximate count/paging possible? |
I'm a bit reluctant to add paging support when sorting by counts, given that it would give more accurate results on the 1st result of the 2nd page than on the last one of the 1st page. Please also note that the way it would work behind the scenes would not be different from what you could do on the client side: request 10 buckets for the 1st page, then 20 for the 2nd page and disregard the first 10 buckets, and so on for subsequent pages. |
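The client-side approach described in this comment can be sketched roughly as follows. This is a minimal sketch, not Elasticsearch code: the helper names `build_page_request` and `slice_page`, the aggregation name `groups`, and the field name `author` are assumptions made for illustration. It only builds a request body and slices an already-parsed bucket list, so no client library is involved.

```python
def build_page_request(field, page, page_size):
    """Build a terms aggregation body that asks for enough buckets
    to cover pages 1..page (hypothetical helper, 1-based page)."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    return {
        "size": 0,  # no search hits needed, only the aggregation
        "aggs": {
            "groups": {  # assumed aggregation name
                "terms": {"field": field, "size": page * page_size}
            }
        },
    }


def slice_page(buckets, page, page_size):
    """Drop the buckets that belong to earlier pages and keep this page."""
    start = (page - 1) * page_size
    return buckets[start:start + page_size]


# Example with 9 authors and a page size of 5, as in the question above.
authors = [{"key": "author%d" % i} for i in range(1, 10)]
first_page = slice_page(authors, 1, 5)   # 5 buckets
second_page = slice_page(authors, 2, 5)  # remaining 4 buckets
```

Note that every later page re-requests all earlier buckets, which is why this is no cheaper than what the server would do internally.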
Yes, it can be implemented on the client (in my system it already is, but it's not very efficient). If there is only one way to sort/page buckets, then clients can either decide at runtime which implementation to use (which requires a lot of code and conditions) or implement grouping themselves altogether. Maybe I can help with some experiments to see if it can be done at all? |
I think this feature would be easy to implement, what I'm more concerned about here is to expose a feature that would be error-prone. :( |
In our system it is already implemented, but we have memory issues when requesting all possible buckets, so we need a way to navigate through the buckets efficiently on the server side. I thought it would help to get the buckets paged from the server. Edit: the memory issues are on the Elasticsearch server, not on the client side. |
You already provide at least one feature, that can be inaccurate - cardinality aggregations. Also, from documentation about terms aggregation: |
This feature already has accuracy issues indeed, but in my opinion paging will make it even worse. For example, let's imagine that your top terms are term1, term2, ..., term10. If your page size is 5, it could happen that Elasticsearch returns term1, term2, term3, term4 and term6 on the first page (6 instead of 5 because of inaccuracy), and then term6, term7, term8, term9 and term10 (as expected). So you would have one term that would be completely invisible to your users (term5) and another one that would appear twice (term6). I think this is too confusing. |
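The scenario in this comment can be checked with a few lines of Python. This is only an illustration of the example given above (the page contents are copied from the comment, not produced by Elasticsearch); it shows which term is never displayed and which one is displayed twice.

```python
# True top terms, in order, as assumed in the comment above.
expected = ["term%d" % i for i in range(1, 11)]

# Pages as returned in the hypothetical inaccurate scenario.
page1 = ["term1", "term2", "term3", "term4", "term6"]
page2 = ["term6", "term7", "term8", "term9", "term10"]

shown = page1 + page2

# Terms the user never sees on any page.
missing = [t for t in expected if t not in shown]

# Terms that appear on more than one page.
duplicated = sorted({t for t in shown if shown.count(t) > 1})

print(missing)     # ['term5']
print(duplicated)  # ['term6']
```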
Too bad, without sorting by counts this feature is only half useful. Can this issue be re-evaluated as a separate task, connected to issue #256? |
And how about this: you provide a Java interface for the sorter, and through a plugin we can add our own sorters for possibly inaccurate results? |
In my concrete case, I have millions of documents that I want to aggregate by key, and Elasticsearch fails at this point. In the worst case there are over 200,000 buckets with about 10 matching documents per bucket. I thought an effective way to reduce memory consumption would be to page through the buckets on the server side. Is there any other solution besides adding more RAM? |
Also, have you seen SolrCloud's implementation of facets? Something like: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201209.mbox/%3Calpine.DEB.2.02.1209261450570.2316@frisbee%3E |
@artemredkin there is a difference between returning exact counts for specific terms, and guaranteeing total ordering. The first can be done with another round after picking the top N, the second can not. |
@Kumen memory usage currently depends mostly on your number of buckets, not on the size of the page that is requested. If you are not running Elasticsearch 1.2 yet, I would recommend upgrading, as memory usage of the terms aggregation improved significantly in this release. |
I am currently using version 1.3 (manual build from the branch). Thanks. |
@kimchy I may be horribly wrong here (will do more digging today), but Solr's field collapsing works in a distributed environment and provides paging/ordering (at least I do not see any indication in their documentation that it is not supported). Plus, @jpountz pointed to #1305 as the source of the ordering problem. |
@artemredkin guaranteed total ordering can't be solved (with a 2 way execution) unless all the values are streamed, so by definition, pagination will not be "exact", that is the problem. You can say that for the top N (or paginated N), the count for each term will be exact by executing another round, but not the total order of them. It is an interesting problem, specifically with the fact that we would love to solve it in a somewhat performant manner. We obviously would love to solve it, if we manage to come up with a way to do it that is :) |
@kimchy Aren't top N groups exactly what we need for pagination? I was going to implement it on the client in 2 hops (it would be a 3-way execution, yes?). On the first hop, get (1.5 * size of group page) worth of terms, maybe with cardinality, sorting them; on the second hop, get those terms with top_hits. For the second page, get (3 * size of group page) worth of terms, and so on. Is it wrong :) ? Or just too slow for inclusion inside Elasticsearch itself? |
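The two-hop idea proposed in this comment might look something like the sketch below. It is an assumption-laden illustration, not a confirmed implementation: the helper names `first_hop_size` and `build_second_hop`, the aggregation names `groups` and `top_group_hits`, and the field name `author` are all hypothetical, and restricting the second hop with a `terms` query is just one possible way to "get those terms". The over-fetch factor of 1.5 comes from the comment itself.

```python
import math


def first_hop_size(page, page_size, factor=1.5):
    """Number of terms to request on the first hop for a 1-based page.
    Page 1 -> 1.5 * page_size, page 2 -> 3 * page_size, and so on."""
    return math.ceil(factor * page * page_size)


def build_second_hop(field, terms, hits_per_group):
    """Build a request body that fetches top hits only for the chosen terms
    (hypothetical helper; request shape is an assumption)."""
    terms = list(terms)
    if not terms:
        raise ValueError("at least one term is required")
    return {
        "size": 0,  # hits are returned per group via top_hits instead
        "query": {"terms": {field: terms}},
        "aggs": {
            "groups": {
                "terms": {"field": field, "size": len(terms)},
                "aggs": {
                    "top_group_hits": {"top_hits": {"size": hits_per_group}}
                },
            }
        },
    }


# Second page with 5 groups per page: over-fetch 15 terms on the first hop,
# then keep the 5 terms of this page for the second hop.
second_hop = build_second_hop("author", ["a6", "a7", "a8", "a9", "a10"], 3)
```

As the replies point out, the second hop can make the counts of the selected terms exact, but it cannot guarantee that the selected terms are the right ones in the right order.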
I didn't mean to stop this discussion by closing this issue... Adding pagination in the The 'result grouping' approach in ES relies on the terms aggregation to determine the correct groups and an inner Pagination can be simulated by using terms aggregation's |
Hm,
|
Yes, you can sort by a |
was added pagination to top_hits aggregation