Field Collapsing/Combining #256

ppearcy · 2010-07-13T17:02:32Z

Ability to collapse on a field. For example, I want the most relevant result from all different report types. Or similarly, the most recent result of each report type. Or maybe, I want to de-dup on headline.

So, the sort order would dictate which one from the group is returned. Similar to what is discussed here:
http://blog.jteam.nl/2009/10/20/result-grouping-field-collapsing-with-solr/

From my understanding, it seems that in order for field collapsing to be efficient, the result set must be relatively small.

This is also referred to as "Combine" on some other search products.

Omega359 · 2010-08-13T19:51:42Z

Count this comment as a vote to have this feature added.

kwloafman · 2010-09-05T17:46:50Z

I could make good use of this feature. Go for it!

Fiedzia · 2010-09-30T11:12:46Z

+1 vote for that

ekalyoncu · 2010-10-29T13:02:34Z

Yes, it's a really cool feature.

ekalyoncu · 2010-10-29T13:04:16Z

In SOLR, grouping is not supported for distributed search. If it's implemented, it could be a big plus for ElasticSearch.

giorgiovinci · 2010-10-29T13:04:30Z

The only workaround is to "group" the results on the client side, is that correct?
+1 For this. To have the logic on the server is what we need!

jeroenr · 2010-11-02T13:00:18Z

+1 This sounds really useful

apatrida · 2010-11-09T14:42:48Z

This is probably a broader topic of collapsing (dropping dupes based on sort order although many times one field isn't enough to decide a good dedupe), or full rollups where you retain the individual documents within an aggregate replacement document ("5 books by this author").

There are fun issues with each, such as do you try to satisfy the requested window results? How does paging work when things are missing? Does the total document count get adjusted (but is still wrong as you don't know what other pages hold)? ...

Fiedzia · 2010-11-09T15:23:05Z

For me this should work like "select distinct" in SQL, so I expect duplicates to be removed everywhere, including the total document count, pagination and the window result.

apatrida · 2010-11-09T15:37:17Z

At that point, it's a full group-by, and in SQL you are getting aggregate values back in functions, and sometimes undefined values if you ask for non-aggregate fields ... In the search engine, how are the other fields besides the rollup key being treated? Is it a grouping into a master aggregate document listing all the children, or at least the fact that there are children, such as what Endeca does? Or is it a deduping where the first one at highest relevancy wins, even if many of the other fields differ outside of the key (you need compound keys then, as deduping on a single field isn't enough to make that desirable)?

ppearcy · 2010-12-14T20:38:26Z

Hey,
Just wanted to say that we are using our own poor man's version of this to satisfy some requirements by just requesting 10x the amount requested and collapsing down client side. Complete hack, but works 99% of the time.
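
A minimal sketch of the over-fetch-and-collapse workaround described above, assuming the hits arrive already sorted and each hit is a plain dict. The function name `collapse_hits`, the `report_type` field and the sample data are hypothetical and only illustrate the idea; this is not an Elasticsearch API.

```python
from typing import Callable, Dict, Hashable, Iterable, List

Hit = Dict[str, object]


def collapse_hits(hits: Iterable[Hit], key: Callable[[Hit], Hashable], wanted: int) -> List[Hit]:
    """Keep the first hit for each key value, preserving the incoming sort order.

    Because the input is already sorted, "first" means "best" under that sort.
    Stops as soon as `wanted` distinct keys have been collected.
    """
    seen = set()
    collapsed: List[Hit] = []
    for hit in hits:
        k = key(hit)
        if k in seen:
            continue  # a better-ranked hit with this key was already kept
        seen.add(k)
        collapsed.append(hit)
        if len(collapsed) == wanted:
            break
    return collapsed


# Hypothetical over-fetched page, sorted by descending score.
hits = [
    {"id": 1, "report_type": "annual", "score": 9.1},
    {"id": 2, "report_type": "annual", "score": 8.7},
    {"id": 3, "report_type": "quarterly", "score": 8.2},
    {"id": 4, "report_type": "annual", "score": 7.9},
    {"id": 5, "report_type": "monthly", "score": 7.5},
]

top = collapse_hits(hits, key=lambda h: h["report_type"], wanted=2)
print([h["id"] for h in top])
```

If the over-fetched page contains fewer distinct keys than `wanted`, the result comes back short, which is the case where this hack fails.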

We're now applying this and adding facets to it with a two-phased approach. We first get the list of doc ids, and then we pass them in as a term list and facet on that query.

Was curious if there was any more efficient method of doing this?

Thanks,
Paul

dmartinpro · 2011-04-04T10:03:21Z

+1 vote for this issue too.
This is a really useful feature. Think about an e-commerce shop that indexes every SKU. When searching for a product, a customer should see products in the results list, not SKUs.

till · 2011-05-10T14:19:34Z

subscribe

tfreitas · 2011-05-10T16:58:54Z

+1

vincenttheeten · 2011-05-13T11:03:03Z

plz don't make us switch to SOLR just for this feature
+1

kimchy · 2011-05-13T11:54:53Z

Note that Solr does not implement it for distributed search (as far as I know), and the implementation is problematic (in my view).

till · 2011-05-13T18:13:18Z

Are you referring to the "field collapse patch" floating around in their Jira? I haven't checked if that made it into a recent release, so I don't know how up to date my info is. I just noticed that queries using the "field collapse patch" are an order of magnitude slower than queries without it.

mikemccand · 2011-05-18T18:52:42Z

Note that there is now (finally!) a new grouping module in Lucene -- see https://issues.apache.org/jira/browse/LUCENE-1421

It's been back-ported to 3.x, under lucene/contrib/grouping.

So in theory exposing this in ElasticSearch should be straightforward? (And, if it's not, I'd really like to know about that so we can fix it!).

There is some performance hit but not as bad as I had expected. See the 3 TermGroupXXX charts here: http://people.apache.org/~mikemccand/lucenebench -- it's ~ 2.3x-2.5X slower than the straight TermQuery, when grouping by a field with 100, 10K, 1M unique values (though, the sort and groupSort are relevance; maybe when sorting by other fields this is slower). This should also be the worst-case slowdown since TermQuery is such an "easy" query; queries which are "hard" and don't produce many results should see less net impact from the grouping overhead, I expect.

kimchy · 2011-05-19T21:02:37Z

Cool!, saw that a few days ago, will definitely have a look.

tfreitas · 2011-06-03T22:53:46Z

Hi, with the release of Lucene 3.2, one of its features is:
"A new grouping module, under lucene/contrib/grouping, enables search results to be grouped by a single-valued indexed field."
http://wiki.apache.org/lucene-java/ReleaseNote32

darxriggs · 2011-06-11T23:34:31Z

+1

aaronbinns · 2011-06-13T18:55:10Z

+1

0xPIT · 2011-06-13T20:48:04Z

++1

mkreidenweis · 2011-06-14T07:26:32Z

+1

bbock · 2011-06-14T07:56:12Z

+1

selaux · 2011-06-14T08:14:31Z

+1

jmayr · 2011-06-14T08:28:24Z

+1

mikemccand · 2011-06-14T10:37:51Z

I'm also working on making it easy(ier) to distribute grouping, by adding static merge methods to TopDocs/TopGroups. Ie, each shard can run the 1st pass collector, send top groups back to front end, front end merges the top groups (SearchGroup.merge) and issues request to all shards to run 2nd pass collector, gets results back, merges with TopGroups.merge. This is all under https://issues.apache.org/jira/browse/LUCENE-3191
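
The two-pass flow described above can be sketched in plain Python. This is only an illustration under stated assumptions: shards are in-memory lists of dicts, ranking is by a `score` field, and the function names (`first_pass`, `merge_groups`, `second_pass`, `merge_top_groups`) are hypothetical stand-ins, not the Lucene `SearchGroup.merge` / `TopGroups.merge` signatures.

```python
from collections import defaultdict


def first_pass(shard_docs, group_field, top_n_groups):
    """Per shard: find the best score per group and keep the top N groups."""
    best = {}
    for doc in shard_docs:
        g = doc[group_field]
        if g not in best or doc["score"] > best[g]:
            best[g] = doc["score"]
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)[:top_n_groups]


def merge_groups(per_shard_groups, top_n_groups):
    """Front end: merge the shard-level top groups into the global top groups."""
    best = {}
    for groups in per_shard_groups:
        for g, score in groups:
            if g not in best or score > best[g]:
                best[g] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [g for g, _ in ranked[:top_n_groups]]


def second_pass(shard_docs, group_field, groups, docs_per_group):
    """Per shard: collect the top docs inside each globally chosen group."""
    wanted = set(groups)
    by_group = defaultdict(list)
    for doc in shard_docs:
        if doc[group_field] in wanted:
            by_group[doc[group_field]].append(doc)
    return {
        g: sorted(docs, key=lambda d: d["score"], reverse=True)[:docs_per_group]
        for g, docs in by_group.items()
    }


def merge_top_groups(per_shard_results, groups, docs_per_group):
    """Front end: merge the per-shard docs of each group, keeping group order."""
    merged = []
    for g in groups:
        docs = [d for result in per_shard_results for d in result.get(g, [])]
        docs.sort(key=lambda d: d["score"], reverse=True)
        merged.append((g, docs[:docs_per_group]))
    return merged


# Hypothetical data for two shards.
shard_a = [
    {"id": "a1", "author": "x", "score": 3.0},
    {"id": "a2", "author": "y", "score": 1.0},
    {"id": "a3", "author": "x", "score": 2.5},
]
shard_b = [
    {"id": "b1", "author": "y", "score": 4.0},
    {"id": "b2", "author": "z", "score": 0.5},
    {"id": "b3", "author": "x", "score": 2.8},
]
shards = [shard_a, shard_b]

groups = merge_groups([first_pass(s, "author", 2) for s in shards], 2)
result = merge_top_groups([second_pass(s, "author", groups, 2) for s in shards], groups, 2)
print([(g, [d["id"] for d in docs]) for g, docs in result])
```

The point of the second round trip is that a shard only learns which groups matter globally after the front end has merged the first-pass results.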

spinscale · 2011-06-15T06:52:21Z

+1

stevencasey · 2011-06-17T03:46:54Z

+1

any news on whether https://issues.apache.org/jira/browse/LUCENE-1421 as mentioned by mikemccand will work in elasticsearch?

dearlordylord · 2014-04-03T13:42:54Z

+1

imarsman · 2014-04-04T07:24:41Z

This would be incredibly useful for the application I am writing for my company. I am, however, amazed at how capable Elasticsearch is already that I feel it would be rude not to say thank-you before adding my YES to this request for this feature to be added.

petard · 2014-04-04T09:14:24Z

+1

grishick · 2014-04-12T00:23:30Z

+1 this is a tie breaker for us right now when evaluating ES vs Solr

Limfocit · 2014-04-29T09:43:58Z

+1

zeelax · 2014-05-12T10:35:27Z

+1

clintongormley · 2014-05-12T12:02:44Z

See #6124, which looks like it will handle all field-collapsing requirements, in a distributed manner.

thejohnfreeman · 2014-05-13T03:05:23Z

While neat, is it possible to perform aggregations against all collapsed documents? For example, collapse a set of books on the author field, then aggregate terms in the publisher field, to find the most common publishers by number of distinct authors?

mattweber · 2014-05-13T03:46:18Z

@thejohnfreeman I imagine #6124 is just the first step, but considering this is a bucket aggregator, what you describe should be possible. Keep an eye on the PR.

martijnvg · 2014-05-23T14:08:10Z

Let me +1 this issue for the last time :)

The top_hits aggregation will handle the field collapse requirements and #6124 is the first step.

@thejohnfreeman Right now, top_hits can only be used as a leaf aggregation. Can your example also be implemented via two nested terms aggregations (first on the author field and then on publisher) and a top_hits aggregation as the leaf?
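
As a rough sketch of the request shape suggested here, the search body below nests two terms aggregations with a top_hits leaf, built as a Python dict and serialized to JSON. The field names `author` and `publisher` come from the example in this thread; the aggregation names (`by_author`, `by_publisher`, `top_book`) and the sizes are assumptions chosen for illustration.

```python
import json

# Search body: bucket by author, then by publisher, then keep the best hit per bucket.
request_body = {
    "size": 0,  # suppress the regular hit list; only aggregation results are wanted
    "aggs": {
        "by_author": {
            "terms": {"field": "author"},
            "aggs": {
                "by_publisher": {
                    "terms": {"field": "publisher"},
                    "aggs": {
                        # top_hits must sit at the leaf of the aggregation tree
                        "top_book": {"top_hits": {"size": 1}}
                    },
                }
            },
        }
    },
}

payload = json.dumps(request_body, indent=2)
print(payload)
```

This returns the top book per author and publisher pair; it does not by itself count distinct authors per publisher, so it only partly covers the original question.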

artemredkin · 2014-05-23T14:51:03Z

What about paging? As far as I can tell, there is no way to page agg results.

martijnvg · 2014-05-23T15:16:13Z

@artemredkin Pagination isn't supported yet, but it shouldn't be too difficult to add that.

brusic · 2014-05-23T15:34:55Z

+1

:)

artemredkin · 2014-05-23T16:16:19Z

Cool!
You are awesome :)

artemredkin · 2014-05-25T17:06:50Z

should I add an issue for pagination?

javanna · 2014-05-26T06:32:46Z

Hi @artemredkin we already have issue #6299 for it ;)

artemredkin · 2014-05-26T06:34:35Z

Got it, thanks!

vvaradhan · 2014-06-26T18:27:41Z

Is there a master-snapshot version available through maven? I can start on my development till 1.3.0 gets officially released.

Also, what would be a likely release date of 1.3.0?

SaSa1983 · 2014-06-26T18:48:29Z

You can build the 1.3.0 branch

It contains the aggregations feature

dadoonet · 2014-06-26T21:28:37Z

@vvaradhan 1.3.0-SNAPSHOT is available on Sonatype repo: https://oss.sonatype.org/#nexus-search;gav~org.elasticsearch~elasticsearch~1.3.0-SNAPSHOT~~

HTH

mikemccabe · 2014-08-13T09:48:58Z

Released in http://www.elasticsearch.org/downloads/1-3-0/ - #6124 is referenced in release notes.

JnBrymn-EB · 2015-06-14T01:16:12Z

No traffic on this in almost a year. Should it be presumed that this issue is closed by #6124 ?

brusic · 2015-06-14T01:24:50Z

Correct.

yao23 mentioned this issue May 9, 2014

Terms facet results retrieve and pick certain number products for top users #6109

Closed

martijnvg mentioned this issue May 12, 2014

Add top_hits aggregation #6124

Closed

martijnvg closed this as completed May 23, 2014

artemredkin mentioned this issue May 27, 2014

Add from support to top_hits aggregator. #6299

Closed
