Add K-means clustering feature #5512

geekpete · 2014-03-24T11:26:29Z

Add k-means clustering to allow detection of clusters in data sets.
http://en.wikipedia.org/wiki/K-means_clustering

Would be useful for geo points, and for other use cases too.

Thanks to https://github.com/koobs for suggesting this one in Sydney Elastic Training.

savioteles · 2014-05-20T17:23:01Z

It would be great! I really need this feature. Is there any estimate of when you will start coding?

geekpete · 2014-05-20T23:16:51Z

Not that I'm seeking to drop her in it, but Britta https://twitter.com/a2tirb would definitely have the madskillz to build this feature. I'm just not sure of her priorities/bandwidth/interest to attack this feature with sticks.

I'm also not sure how new features are selected or voted up for prioritisation by elasticsearch overlords either.

brwe · 2014-05-21T09:50:04Z

If I recall correctly, @geekpete proposed to have that in the context of aggregations, that is, build clusters and then use these as buckets inside the aggregations framework. Indeed, this would be an extremely useful feature.

While it would be great fun to implement, unfortunately I do not think we will implement it in the near future. Anyone coming up with a pull request for this is of course more than welcome :-)

For now I can only point you to the carrot2 plugin which does an excellent job in clustering search results.

koobs · 2014-05-22T02:00:36Z

I'll add the comment that 'k' clusters ought to be user-suppliable as an argument to the aggregation for maximum value, with possible k-values being:

N
Random
Basic Rule-Of-Thumb
Document (m) by Term(n) matrix
Some other cool method :)

For context, I brought this up @ the ElasticSearch Training in response to a brief conversation about search vs 'insight' in relation to data, the former where you know what you're looking for, the latter where you don't, or might not. The specific example was geospatial result sets with arbitrary demography data fields. It was a great session @brwe!

mishakogan · 2014-08-21T16:03:51Z

I would also like to cast my vote for some kind of automated clustering feature. Carrot2 is great, but as far as I understand it can only work on small amounts of data. It would be great to have something that clusters ALL the data all the time. Maybe a custom clustering analyzer?

clintongormley · 2014-11-10T19:47:40Z

@brwe would #8110 help here?

brwe · 2014-11-12T11:03:06Z

@clintongormley not really. Bucket reducers from #8110 would run on the final aggregation but clustering needs the documents.

jpountz · 2014-11-12T11:20:54Z

@brwe I think implementing clustering as a reducer could help reduce the cost very significantly? K-means is costly so running such an algorithm on a dataset containing lots of documents could be very slow. On the other hand, if we take geo-clustering as an example, we could make it very fast (though a bit lossy) by working on top of the output of the geo-hash grid aggregation as a bucket reducer?
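To make the bucket-reducer idea concrete, here is a minimal sketch of k-means run over pre-aggregated grid cells instead of raw documents. Everything in it is an assumption for illustration: `weighted_kmeans` and the `cells` input shape (a cell centre plus a doc count) are hypothetical names, not an Elasticsearch API, and plain Euclidean distance on coordinates is used for simplicity. Each cell pulls on its cluster centroid in proportion to its doc count, which is where the "fast but a bit lossy" trade-off comes from.

```python
import random


def weighted_kmeans(cells, k, iterations=20, seed=0):
    """Cluster grid cells given as ((x, y), doc_count); return k centroids."""
    rng = random.Random(seed)
    # Start from k distinct cells chosen at random.
    centroids = [point for point, _ in rng.sample(cells, k)]
    for _ in range(iterations):
        # Per cluster: weighted sum of x, weighted sum of y, total doc count.
        sums = [[0.0, 0.0, 0] for _ in range(k)]
        for (x, y), count in cells:
            nearest = min(
                range(k),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
            )
            sums[nearest][0] += x * count
            sums[nearest][1] += y * count
            sums[nearest][2] += count
        # Move each centroid to the weighted mean of its cells;
        # keep the old centroid if no cell was assigned to it.
        centroids = [
            (sx / n, sy / n) if n else centroids[i]
            for i, (sx, sy, n) in enumerate(sums)
        ]
    return centroids


# Four hypothetical grid cells forming two well-separated groups.
cells = [((0.0, 0.0), 10), ((0.0, 1.0), 10), ((10.0, 10.0), 5), ((10.0, 11.0), 5)]
print(sorted(weighted_kmeans(cells, k=2)))
```

The cost depends on the number of grid cells rather than the number of documents, at the price of losing each document's position within its cell.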

brwe · 2014-11-14T10:40:31Z

True, I should distinguish use cases. For up to 2d it might help indeed. For text clustering I do not see it.

yehosef · 2015-06-29T08:09:25Z

just found this - would be great. +1

dsingley · 2015-10-09T15:20:23Z

+1

ghost · 2015-10-12T19:02:34Z

I was searching for this... this would be a great feature. Other mining algorithms too.

colings86 · 2015-11-13T10:53:08Z

Implementing this as a pipeline aggregation should now be possible. In that case we would first collect values into buckets using other aggregations and then use the pipeline aggregation to create clusters from those buckets.

lessless · 2015-12-20T14:35:27Z

that would be mad!

lessless · 2015-12-30T09:54:18Z

@koobs is there a recording of this session somewhere out there?

koobs · 2015-12-30T12:43:28Z

@lessless I hope not :)

irony · 2016-01-27T17:57:09Z

This would really be awesome!

audriusbugas · 2016-03-10T06:47:05Z

+1

reinier-pv · 2016-03-10T13:02:43Z

👍

trupin · 2016-04-09T10:43:46Z

+1

chenryn · 2016-04-28T07:18:23Z

+1

marfago · 2016-05-04T15:54:12Z

+1

hkulekci · 2016-05-09T15:38:27Z

+1

amazium · 2016-06-23T10:40:05Z

+1

geekpete · 2018-01-09T08:55:42Z

Could we collate potential future features on a special section of the roadmap perhaps?
This ticket could be closed and referenced to the "potential future features" area of the roadmap. This might help to clear a number of other github tickets that don't have major focus if priority is on other work at the moment.

colings86 · 2018-03-13T11:54:59Z

Stalled waiting on #26659

/cc @elastic/es-search-aggs

lessless · 2018-09-10T00:53:40Z

Still desired :)

LaurentChardin · 2018-10-11T10:41:55Z

Indeed !! very desired !

lessless · 2018-10-11T10:52:42Z

@colings86 should the "stalled" label be removed now? #26659 was closed in favor of #28993, which is now merged.

colings86 · 2018-10-12T09:33:14Z

It is true that because #28993 is merged the "stalled" label can be removed.

Destroy666x · 2018-10-22T15:17:10Z

I also confirm it's very desired and I'd be happy to see it.

ivssh · 2018-12-03T17:48:40Z

+1 for this

ThomasSolti · 2019-01-22T11:10:40Z

+1

barracuda317 · 2019-02-22T15:45:11Z

Is the size parameter in https://www.elastic.co/guide/en/elasticsearch/reference/7.0/search-aggregations-bucket-geotilegrid-aggregation.html something like k-means-clustering for geo-search?

polyfractal · 2019-02-22T16:16:54Z

@barracuda317 Not really, no. GeoTile just overlays a fixed grid over the area and aggregates documents into those grid cells. The grids are constructed regardless of the data distribution (think of it more like a heatmap).

Clustering like k-means dynamically identifies regions of data that are "similar" and groups them together into a cluster. Clustering can give you individual clusters that are different shapes, sizes, and densities. For a practical example, a clustering algo might group all the values inside a city together, then cluster the rural countryside together as a different group (much larger but also more sparse).

jamesdorfman · 2019-04-17T21:19:53Z

@colings86 is there a reason why this was never completed? If the issue is time commitment, I'm very interested in this feature and would like to try finishing the implementation.

mayya-sharipova · 2019-04-18T08:30:06Z

@jamesdorfman Can you please describe your use case? Are you interested in having k-means clustering on geo data (they can be up to 8 dims)?

lessless · 2019-04-18T11:10:31Z

@mayya-sharipova +1 to exactly that use case

jamesdorfman · 2019-04-19T04:16:49Z

@mayya-sharipova yes, I was specifically interested in implementing the geo data use case. This thread made it seem as though this specific feature is highly desired.

Furthermore, I'm not completely certain about how difficult this will be to implement, so I also think that the restricted use case of clustering only geo data is a good starting point.

LaurentChardin · 2019-04-21T21:05:49Z

Another use case would be to group the products in an index into price ranges, using k-means clustering to propose price clusters for use in price selection.

jamesdorfman · 2019-05-01T06:09:38Z

Upon further research and experimentation it seems that a more straightforward approach would be to implement an agglomerative hierarchical clustering algorithm, rather than k-means clustering.

K-means involves creating k buckets, and then reassigning data points at each iteration of the algorithm. On the other hand, in agglomerative hierarchical clustering each point is initially placed in its own cluster. Then, these clusters are merged together on subsequent iterations.
https://en.wikipedia.org/wiki/Hierarchical_clustering

I am currently working on implementing this clustering feature as a histogram multi-bucket aggregation. The k-means approach would involve moving documents between buckets at each iteration; however, the hierarchical method would simply entail creating a bucket for each document and then merging them until the desired number of buckets is reached. This functionality is very similar to the existing Auto Date Histogram Aggregation, where buckets are created and then merged. Since bucket merging functionality was already created for that aggregation, this approach is significantly easier to implement.

Furthermore, it seems that both methods can produce clusters of similar quality. https://www.cs.utah.edu/~piyush/teaching/4-10-print.pdf

Please let me know if this line of reasoning makes sense :)
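The merge-based approach described above can be sketched for one-dimensional values as follows. This is only an illustration under stated assumptions: `merge_buckets` is a hypothetical helper, not the code used by the Auto Date Histogram Aggregation or by any pull request, and it uses a naive quadratic search for the closest adjacent pair.

```python
def merge_buckets(values, target_buckets):
    """Agglomerative 1-D clustering; returns a list of (centroid, doc_count)."""
    # Every value starts in its own bucket; sorting makes neighbours adjacent.
    buckets = [(float(v), 1) for v in sorted(values)]
    while len(buckets) > max(target_buckets, 1):
        # Find the adjacent pair whose centroids are closest together.
        i = min(
            range(len(buckets) - 1),
            key=lambda j: buckets[j + 1][0] - buckets[j][0],
        )
        (c1, n1), (c2, n2) = buckets[i], buckets[i + 1]
        # The merged centroid is the count-weighted mean of the two buckets.
        merged = ((c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2)
        buckets[i:i + 2] = [merged]
    return buckets


print(merge_buckets([1, 2, 3, 100, 101, 500], target_buckets=3))
# [(2.0, 3), (100.5, 2), (500.0, 1)]
```

No document ever moves between buckets after a merge, which is the property that makes this simpler to fit into a bucket aggregation than the reassignment step of k-means.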

arshad171 · 2019-09-04T11:33:52Z

@jamesdorfman Isn't Agglomerative Hierarchical Clustering expensive in terms of time and space complexity compared to K-means?

Agglomerative clustering:
time complexity: O(n^3)
space complexity: O(n^2)

K-means:
time complexity: O(n * k * m)
space complexity: O((n + k)m)

And since K-means has implementations that support incremental learning, the space complexity can be further reduced to make it constant.
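As a rough illustration of the incremental variant, here is a sketch of sequential (online) k-means. The comment does not name a specific algorithm, so this choice is an assumption, and `online_kmeans` is a hypothetical name. Each point is seen once and then discarded, so memory holds only the k centroids and their counts, independent of n.

```python
def online_kmeans(stream, initial_centroids):
    """Sequential k-means: update centroids one point at a time."""
    centroids = [list(c) for c in initial_centroids]
    counts = [0] * len(centroids)
    for point in stream:
        # Assign the point to its nearest centroid (squared Euclidean distance).
        nearest = min(
            range(len(centroids)),
            key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
        )
        counts[nearest] += 1
        # Move that centroid towards the point by 1/count, a running mean.
        rate = 1.0 / counts[nearest]
        centroids[nearest] = [
            c + rate * (p - c) for p, c in zip(point, centroids[nearest])
        ]
    return centroids, counts


stream = iter([(0, 0), (10, 10), (2, 0), (10, 12)])
print(online_kmeans(stream, initial_centroids=[(0, 0), (10, 10)]))
# ([[1.0, 0.0], [10.0, 11.0]], [2, 2])
```

The result depends on the order of the stream and on the initial centroids, so it generally differs from what batch k-means would produce on the same data.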

sroui · 2021-01-01T19:16:26Z

+1

geekpete · 2023-09-10T00:37:22Z

Do we think this issue is still relevant? I see it's now 10 years ago that I first opened it.

ddavidebor · 2023-09-12T22:07:58Z

This brings back so many memories...

wchaparro · 2024-01-05T14:57:40Z

Closing this as not planned in Aggregations. If we are going to develop this - it will go into ES|QL, where we are focusing future development.

koobs · 2024-01-15T06:25:09Z

😢

geekpete changed the title ~~Add k-means clustering functionality~~ Add k-means clustering feature Mar 24, 2014

geekpete changed the title ~~Add k-means clustering feature~~ Add K-means clustering feature Mar 24, 2014

clintongormley assigned brwe Jul 10, 2014

clintongormley added >feature discuss :Analytics/Aggregations Aggregations labels Oct 14, 2015

colings86 added the stalled label Mar 13, 2018

colings86 mentioned this issue Mar 13, 2018

Enhancement: Range agg specified as max bucket count rather than explicit ranges #24254

Closed

colings86 removed the stalled label Oct 12, 2018

jamesdorfman mentioned this issue May 9, 2019

Add Variable Width Histogram Aggregation #42035

Merged

polyfractal mentioned this issue Jan 10, 2020

Multi-pass aggregation support #50863

Open

rjernst added the Team:Analytics label May 4, 2020

wchaparro closed this as not planned Jan 5, 2024
