Add `from` support to top_hits aggregator. #6299

martijnvg · 2014-05-23T15:17:27Z

No description provided.

jpountz · 2014-05-23T15:22:02Z

+1

Kumen · 2014-05-26T06:41:43Z

pagination for top_hits aggregator, definitely +1


artemredkin · 2014-05-27T08:03:15Z

Judging by commit a989229, it seems that you added pagination for hits, but what about the groups (buckets) themselves? For example, if I group books by author and the search query returns 9 unique authors, can I show a first page with 5 authors and a next page with the other 4 authors?

Kumen · 2014-05-27T08:12:58Z

Already tested; your assumption is correct. We need pagination support for buckets.

jpountz · 2014-05-27T09:33:22Z

Paging is tricky. We might be able to expose it when sorting by term (would it work for you?), but if you are sorting by counts or by sub aggregation, then #1305 would make counts wrong and ordering inconsistent across pages.

Kumen · 2014-05-27T09:41:09Z

Yes, that would work for me. Please expose this feature.

artemredkin · 2014-05-27T09:41:58Z

I need at least sort by term and sort by docs count. Is approximate count/paging possible?

jpountz · 2014-05-27T09:54:02Z

I'm a bit reluctant to add paging support when sorting by counts, given that it would give more accurate results for the 1st result of the 2nd page than for the last one of the 1st page. Please also note that the way it would work behind the scenes would be no different from what you could do on the client side: first request 10 buckets for the 1st page, then request 20 for the 2nd page and disregard the first 10 buckets, and so on for subsequent pages.
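A minimal sketch of the client-side approach described above, assuming a terms aggregation whose `size` is simply grown to cover every page up to the requested one. The function names, the `groups` aggregation name and the `author` field are illustrative, not part of Elasticsearch.

```python
def terms_page_request(field, page, page_size):
    """Build a request body asking for enough buckets to cover `page` (1-based)."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    return {
        "size": 0,  # only aggregations are needed, no search hits
        "aggs": {
            "groups": {
                # Hypothetical aggregation name; size grows with the page number.
                "terms": {"field": field, "size": page * page_size}
            }
        },
    }


def slice_page(buckets, page, page_size):
    """Disregard the buckets of earlier pages and keep only the requested page."""
    start = (page - 1) * page_size
    return buckets[start:start + page_size]


# Stand-in for the buckets of a response: 9 authors, as in the example above.
buckets = [{"key": f"author{i}", "doc_count": 100 - i} for i in range(1, 10)]
second_page = slice_page(buckets, page=2, page_size=5)
```

With 9 authors and a page size of 5, the second page holds the remaining 4 buckets. The cost of each request still grows with the page number, which is the inefficiency mentioned in this thread.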

artemredkin · 2014-05-27T10:00:34Z

Yes, it can be implemented on the client (in my system it already is, but it's not very efficient). If there is only one way to sort/page buckets, then clients can either decide at runtime which implementation to use (which requires a lot of code and conditions) or implement grouping themselves altogether. Maybe I can help with some experiments to see if it can be done at all?

jpountz · 2014-05-27T10:10:38Z

I think this feature would be easy to implement. What I'm more concerned about here is exposing a feature that would be error-prone. :(

Kumen · 2014-05-27T10:14:36Z

In our system it is already implemented. But we have memory issues when requesting all possible buckets, so we need a way to navigate through the buckets efficiently on the server side. I thought it would help to get the buckets paged from the server.

Edit: the memory issues are on the Elasticsearch server, not on the client side.

artemredkin · 2014-05-27T10:59:17Z

You already provide at least one feature that can be inaccurate: cardinality aggregations. Also, from the documentation on the terms aggregation:

> The higher the requested size is, the more accurate the results will be, but also, the more expensive it will be to compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data transfers between the nodes and the client).

So you already have an error-prone feature (clients can set size to 0 on a high-cardinality field and shoot themselves in the foot).
Another consideration: even the simple solution you proposed (dropping n-1 pages on the Elasticsearch side) can be advantageous, since we can run Elasticsearch on more powerful machines than our backends.

jpountz · 2014-05-27T18:27:42Z

This feature already has accuracy issues indeed, but in my opinion paging will make it even worse. For example, let's imagine that your top terms are term1, term2, ..., term10. If your page size is 5, it could happen that Elasticsearch returns term1, term2, term3, term4 and term6 on the first page (6 instead of 5 because of inaccuracy), and then term6, term7, term8, term9 and term10 (as expected). So you would have one term that would be completely invisible to your users (term5) and another one that would appear twice (term6). I think this is too confusing.
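The scenario above can be reproduced with a small simulation. This is only a sketch under simplifying assumptions: two shards, each reporting just its local top `size` terms, and a coordinator that sums what it receives. The per-shard counts are made up so that term5 is never in a shard's local top 5; none of this is Elasticsearch code.

```python
from collections import Counter

# Hypothetical per-shard document counts; globally the order is term1 > ... > term10.
SHARDS = [
    Counter({"term1": 50, "term2": 45, "term3": 40, "term4": 35, "term6": 20,
             "term5": 15, "term7": 5, "term8": 4, "term9": 3, "term10": 2}),
    Counter({"term1": 50, "term2": 45, "term3": 40, "term4": 35, "term7": 16,
             "term5": 15, "term8": 10, "term9": 8, "term10": 6, "term6": 3}),
]


def top_terms(shards, size):
    """Merge the local top `size` terms of every shard and keep the global top `size`."""
    merged = Counter()
    for shard in shards:
        for term, count in shard.most_common(size):
            merged[term] += count
    return [term for term, _ in merged.most_common(size)]


def page(shards, page_number, page_size):
    """Emulate paging by requesting page_number * page_size terms and dropping earlier pages."""
    terms = top_terms(shards, page_number * page_size)
    return terms[(page_number - 1) * page_size:]


first = page(SHARDS, 1, 5)   # term1, term2, term3, term4, term6
second = page(SHARDS, 2, 5)  # term6, term7, term8, term9, term10
```

Here term5 appears on neither page and term6 appears on both, because the first request only sees each shard's local top 5 while the second request sees all 10 terms.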

artemredkin · 2014-05-27T18:36:17Z

Too bad, without sorting by counts this feature is only half useful. Can this issue be re-evaluated as a separate task, connected to issue #256?

artemredkin · 2014-05-28T05:59:50Z

And how about this: you provide a Java interface for the sorter, and through a plugin we can add our own sorters for possibly inaccurate results?

Kumen · 2014-05-28T06:23:50Z

In my concrete problem, I have millions of documents that I want to aggregate by key, and Elasticsearch fails at this point. In the worst case there are over 200,000 buckets with about 10 matching documents per bucket. I thought an effective way to minimize memory consumption would be to page through the buckets on the server side. Is there any other solution besides adding more RAM?

artemredkin · 2014-05-28T07:48:26Z

Also, have you seen SolrCloud's implementation of facets? Something like: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201209.mbox/%3Calpine.DEB.2.02.1209261450570.2316@frisbee%3E

kimchy · 2014-05-28T09:14:25Z

@artemredkin there is a difference between returning exact counts for specific terms, and guaranteeing total ordering. The first can be done with another round after picking the top N, the second can not.

jpountz · 2014-05-28T09:18:57Z

@Kumen memory usage currently depends mostly on your number of buckets, not on the size of the page that is requested. If you are not running Elasticsearch 1.2 yet, I would recommend upgrading, as memory usage of the terms aggregation improved significantly in this release.

Kumen · 2014-05-28T09:29:16Z

I am currently using version 1.3 (manual build from a branch).
So I need more memory.

thanks

artemredkin · 2014-05-28T09:34:57Z

@kimchy I may be horribly wrong here (will do more digging today), but Solr's field collapsing works in a distributed environment and provides paging/ordering (at least I do not see any indication in their documentation that it is not supported). Plus, @jpountz pointed to #1305 as the source of the ordering problem.
In your opinion, can this problem (ordering of groups by count) be solved at all (leveraging the cardinality agg, for example)? Maybe in later releases?

kimchy · 2014-05-28T09:42:17Z

@artemredkin guaranteed total ordering can't be solved (with a 2 way execution) unless all the values are streamed, so by definition, pagination will not be "exact", that is the problem. You can say that for the top N (or paginated N), the count for each term will be exact by executing another round, but not the total order of them.

It is an interesting problem, specifically with the fact that we would love to solve it in a somewhat performant manner. We obviously would love to solve it, if we manage to come up with a way to do it that is :)

artemredkin · 2014-05-28T10:32:42Z

@kimchy Aren't top N groups exactly what we need for pagination? I was going to implement it on the client in 2 hops (it would be a 3-way execution, yes?). On the first hop, get (1.5 * size of group page) worth of terms, maybe with cardinality, sorting them, and on the second hop, get those terms with top_hits. For the second page, get (3 * size of group page) worth of terms, and so on. Is it wrong :) ? Or just slow for inclusion inside Elasticsearch itself?
Anyway, thanks for explaining things, it would be awesome if you solve this :).


martijnvg · 2014-05-30T10:56:57Z

I didn't mean to stop this discussion by closing this issue...

Adding pagination to the terms aggregation is tricky, as @kimchy and @jpountz describe, and its correctness would depend on the ordering. The correctness of the ordering in turn depends on which order the terms aggregation is set to. If the terms aggregation's order is set to _term or to specific inner metric aggregations (min or max metric bucket, but not avg metric bucket), the ordering is correct.

The 'result grouping' approach in ES relies on the terms aggregation to determine the correct groups and an inner max aggregation for ordering of the groups:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-top-hits-aggregation.html#_field_collapse_example
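As a rough sketch of that approach combined with the `from` option added by this issue, a request body could be built like this. The aggregation names (`groups`, `best_score`, `group_hits`), the `domain` field and the helper function are hypothetical, and the script syntax of the max aggregation is an assumption that differs between Elasticsearch versions.

```python
import json


def collapsed_search(query, group_field, hits_from=0, hits_size=1, groups=10):
    """Build a 'result grouping' request: terms buckets ordered by their best score."""
    return {
        "query": query,
        "size": 0,  # regular hits are not needed, only the grouped ones
        "aggs": {
            "groups": {
                "terms": {
                    "field": group_field,
                    "size": groups,
                    # Order groups by the inner max aggregation defined below.
                    "order": {"best_score": "desc"},
                },
                "aggs": {
                    # Assumed script form; older versions take a plain string script.
                    "best_score": {"max": {"script": {"source": "_score"}}},
                    # `from` pages through the hits inside each group.
                    "group_hits": {"top_hits": {"from": hits_from, "size": hits_size}},
                },
            }
        },
    }


body = collapsed_search({"match": {"body": "elasticsearch"}}, "domain",
                        hits_from=3, hits_size=3)
print(json.dumps(body, indent=2))
```

Note that `from` here pages through the documents within each bucket only; it does not page through the buckets themselves, which is the limitation discussed in this thread.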

Pagination can be simulated by using the terms aggregation's exclude option. On subsequent search requests, the previously emitted term buckets should be added to the exclude option, so that previously seen groups don't end up in the next aggregation response.
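A minimal sketch of that idea, assuming `exclude` accepts a list of exact term values (versions that only accept a regular expression would need a pattern built from the seen terms instead). The helper names and the `groups` aggregation name are hypothetical.

```python
def next_page_request(field, page_size, seen_terms):
    """Build a terms aggregation body that skips groups returned on earlier pages."""
    # `_term` ordering is the one named in this thread; newer versions may name it differently.
    terms = {"field": field, "size": page_size, "order": {"_term": "asc"}}
    if seen_terms:
        # Assumption: `exclude` takes a list of exact terms.
        terms["exclude"] = sorted(seen_terms)
    return {"size": 0, "aggs": {"groups": {"terms": terms}}}


def remember(seen_terms, buckets):
    """Return the seen terms extended with the keys of the buckets just returned."""
    return set(seen_terms) | {bucket["key"] for bucket in buckets}


seen = set()
first_request = next_page_request("author", 5, seen)
# Stand-in for the buckets of the first response.
seen = remember(seen, [{"key": "borges"}, {"key": "adams"}])
second_request = next_page_request("author", 5, seen)
```

The exclude list grows with every page, so this suits a modest number of pages rather than deep paging.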

artemredkin · 2014-05-30T11:01:36Z

Hm,

> or to specific inner metric aggregations

does this mean, that I can use 'cardinality' sub-agg to sort term groups?

martijnvg · 2014-05-30T11:57:38Z

> does this mean, that I can use 'cardinality' sub-agg to sort term groups?

Yes, you can sort by a cardinality inner metric aggregation, but the ordering of the buckets depends on the accuracy of the cardinality aggregation.


martijnvg added enhancement labels May 23, 2014

javanna mentioned this issue May 26, 2014

Field Collapsing/Combining #256

Closed

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue May 26, 2014

Added pagination support to top_hits aggregation by adding from o…

a989229

…ption. Closes elastic#6299

martijnvg mentioned this issue May 26, 2014

Added pagination support to top_hits aggregation by adding from option #6312

Closed

martijnvg closed this as completed in aab38fb May 30, 2014

martijnvg added a commit that referenced this issue May 30, 2014

Aggregations: added pagination support to top_hits aggregation by a…

2af1c0f

…dding `from` option. Closes #6299

clintongormley changed the title ~~Add from support to top_hits aggregator.~~ Aggregations: Add from support to top_hits aggregator. Jul 16, 2014

kelaban mentioned this issue Oct 2, 2014

Feature: Terms Aggregation with From Parameter #7956

Closed

clintongormley added the :Top Hits label Jun 7, 2015

clintongormley changed the title ~~Aggregations: Add from support to top_hits aggregator.~~ Add from support to top_hits aggregator. Jun 7, 2015

colings86 added :Analytics/Aggregations Aggregations and removed :Analytics/Aggregations Aggregations labels Mar 31, 2017

dpblh pushed a commit to dpblh/elastic4s that referenced this issue Jun 5, 2018

https://github.com/elastic/elasticsearch/issues/6299

e1b39d7

was added pagination to top_hits aggregation

sksamuel pushed a commit to Philippus/elastic4s that referenced this issue Jun 13, 2018

elastic/elasticsearch#6299 (#1399)

de428f9

was added pagination to top_hits aggregation
