
add ability to generate UUID on documents #1492

Closed
djschny opened this issue Apr 26, 2016 · 10 comments

Comments

@djschny

djschny commented Apr 26, 2016

Beats events come from various sources (log files, packets, etc.) and ultimately make their way to Elasticsearch. From a best-practice standpoint, all source data would have some type of unique ID on it so that duplicate documents are avoided downstream. However, we know this is not always the case.

Therefore, adding the ability for *beats to place a UUID on documents (regardless of the output used) would greatly simplify pipelines for end users and help them completely avoid the duplicate-document problem when replaying/retrying indexing operations. The goal is for the UUID to be used as the _id in Elasticsearch. The benefits are as follows:

  • By placing UUIDs on documents at the earliest possible stage of a data pipeline, we avoid duplicates at every stage after that.
  • It also allows for simplified replay logic inside *beats, since the worst case is that a document is updated with exactly the same data.
  • Further, customer data pipelines built on Kafka or Logstash can leverage that UUID for retry, dedup, and other processing.
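The idea in the bullets above can be sketched in plain Python (this is illustrative, not Beats code; the `ensure_id` helper and `_id` field name are assumptions): assign the ID once at the source, and every later stage or retry reuses it.

```python
import uuid

def ensure_id(event):
    """Assign a random UUID once, at the earliest pipeline stage.
    Every later stage (Kafka, Logstash, an output retry) reuses it,
    so a replay at most rewrites the same document under the same _id."""
    event.setdefault("_id", str(uuid.uuid4()))
    return event

event = {"message": "connection refused", "source": "/var/log/app.log"}
ensure_id(event)
first = event["_id"]
ensure_id(event)              # simulated retry/replay
assert event["_id"] == first  # same _id -> no duplicate document
```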
@tsg
Contributor

tsg commented Apr 27, 2016

@djschny would passing a UUID as the _id field to Elasticsearch get rid of duplicates? If it's really that simple, we should have done it long ago :-).

Either way, I agree with all your points; we should add this.

@djschny
Author

djschny commented Apr 27, 2016

> @djschny would passing a UUID as the _id field to Elasticsearch get rid of duplicates?

Yep, it should, but I believe v5.0.0 might require the ID to be passed only in the URL. I'll need to check that.
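For reference, the bulk API also accepts the ID inside each action line, not just in the URL of a single-document request. A sketch that only builds the NDJSON request body (no network; targets the modern bulk format, where older 5.x actions also carried a `_type`):

```python
import json
import uuid

def bulk_create_lines(index, events):
    """Build an Elasticsearch _bulk request body (NDJSON) with an
    explicit _id per document. Using the 'create' op means a replayed
    document is rejected with a 409 instead of being indexed twice."""
    lines = []
    for event in events:
        action = {"create": {"_index": index, "_id": str(uuid.uuid4())}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(event))
    return "\n".join(lines) + "\n"

body = bulk_create_lines("logs", [{"message": "a"}, {"message": "b"}])
```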

@mkocikowski

I've been looking at the same thing when using journald and the __CURSOR field for idempotent indexing.
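The journald approach could look like the following sketch: `__CURSOR` is unique and stable per journal entry, so hashing it into a name-based (v5) UUID yields the same _id every time the journal is re-read. The namespace choice and helper name here are assumptions for illustration, not actual code from that work.

```python
import uuid

# Arbitrary fixed namespace for journald-derived IDs (an assumption;
# any constant UUID reused consistently would do).
JOURNALD_NS = uuid.NAMESPACE_DNS

def id_from_cursor(cursor):
    """Map a journald __CURSOR (unique per entry) to a stable UUID,
    so re-reading the journal re-indexes each entry under the same _id."""
    return str(uuid.uuid5(JOURNALD_NS, cursor))

dummy_cursor = "s=0123abcd;i=2c81"  # dummy value for illustration
assert id_from_cursor(dummy_cursor) == id_from_cursor(dummy_cursor)
```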

@jpvillalobos

+1

@nicolasguyomar

+1
That enhancement would allow us to get rid of Logstash, which we run as a shipper only to use the uuid filter, as mentioned here: https://www.elastic.co/fr/blog/just-enough-kafka-for-the-elastic-stack-part2

@cdahlqvist

Now that the Elastic Stack ingest components support at-least-once delivery guarantees, having the ability to prevent duplicates by adding a unique identifier to each event at the source would be great.

We should try to ensure that the default (if applicable) is an efficient identifier from Elasticsearch's point of view.
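One shape such an identifier could take, as a sketch under the assumption that IDs sharing a time-ordered prefix are cheaper for Elasticsearch to index and compress than fully random UUIDs (the helper name and exact layout are made up for illustration):

```python
import os
import time

def sortable_id():
    """A roughly time-ordered ID: a millisecond-timestamp hex prefix
    plus a random suffix. IDs generated close together share a prefix,
    so they cluster in the index rather than scattering like uuid4."""
    ts = int(time.time() * 1000)
    return "{:012x}{}".format(ts, os.urandom(8).hex())

a, b = sortable_id(), sortable_id()
assert a[:6] == b[:6]  # generated close in time -> shared prefix
```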

@alexandrejuma

+1

@xuyangxy

Excuse me, for the filebeat --> kafka path, how do I add a random UUID field? What configuration changes do I need to write?

@urso urso self-assigned this Aug 7, 2019
@shushantan

shushantan commented Nov 12, 2019

+1
Thanks!
Will we have the ability to generate a UUID on documents without relying on Logstash?

@urso

urso commented Dec 11, 2019

Different strategies to add document IDs have been implemented for the upcoming releases.
See the related meta issue and referenced PRs for details: #14363
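For readers landing here later, the implemented strategies can be configured roughly like this in filebeat.yml. This is a sketch based on the `add_id` and `fingerprint` processors referenced from that meta issue; check the current Filebeat documentation for exact option names and defaults.

```yaml
filebeat.inputs:
  - type: log
    paths: ["/var/log/app/*.log"]

processors:
  # Attach a random, unique ID to each event
  # (written to @metadata._id by default):
  - add_id: ~
  # Or derive a stable, content-based ID instead:
  # - fingerprint:
  #     fields: ["message", "log.file.path"]
  #     target_field: "@metadata._id"

output.elasticsearch:
  hosts: ["localhost:9200"]
```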

@urso urso closed this as completed Dec 11, 2019