
add ability to generate UUID on documents #1492

Closed
djschny opened this issue Apr 26, 2016 · 10 comments

Comments

@djschny

djschny commented Apr 26, 2016

Beats events come from various sources (log files, packets, etc.) and ultimately make their way to Elasticsearch. From a best-practice standpoint, all source data would have some type of unique ID on it so that duplicate documents are avoided downstream. However, we know this is not always the case.

Therefore, adding the ability for *beats to place a UUID on documents (regardless of the output used) would greatly simplify pipelines for end users and help them completely avoid the duplicate-document problem when replaying/retrying indexing operations. The goal is for the UUID to be used as the _id in Elasticsearch. The benefits are as follows:

  • By placing UUIDs on documents at the earliest possible stage of a data pipeline, we avoid duplicates at every stage after that.
  • It also allows for simplified replay logic inside *beats, since the worst case is that a document is updated with exactly the same data.
  • Further, customer data pipelines built on Kafka or Logstash can leverage that UUID for retry, dedup, and other processing.
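The idea in the bullets above can be sketched in plain Python (this is illustrative, not Beats code; the `ensure_id` helper and `_id` field name are assumptions): assign the ID once at the source, and every later stage or retry reuses it.

```python
import uuid

def ensure_id(event):
    """Assign a random UUID once, at the earliest pipeline stage.
    Every later stage (Kafka, Logstash, an output retry) reuses it,
    so a replay at most rewrites the same document under the same _id."""
    event.setdefault("_id", str(uuid.uuid4()))
    return event

event = {"message": "connection refused", "source": "/var/log/app.log"}
ensure_id(event)
first = event["_id"]
ensure_id(event)              # simulated retry/replay
assert event["_id"] == first  # same _id -> no duplicate document
```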
@tsg
Contributor

tsg commented Apr 27, 2016

@djschny would passing a UUID as the _id field to Elasticsearch get rid of duplicates? If it's really that simple, we should have done it long ago :-).

Either way, I agree with all your points; we should add this.

@djschny
Author

djschny commented Apr 27, 2016

> @djschny would passing a UUID as the _id field to Elasticsearch get rid of duplicates?

Yep, it should, but I believe v5.0.0 might require the ID to be passed only in the URL. I'll need to check that.
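For reference, the bulk API also accepts the ID inside each action line, not just in the URL of a single-document request. A sketch that only builds the NDJSON request body (no network; targets the modern bulk format, where older 5.x actions also carried a `_type`):

```python
import json
import uuid

def bulk_create_lines(index, events):
    """Build an Elasticsearch _bulk request body (NDJSON) with an
    explicit _id per document. Using the 'create' op means a replayed
    document is rejected with a 409 instead of being indexed twice."""
    lines = []
    for event in events:
        action = {"create": {"_index": index, "_id": str(uuid.uuid4())}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(event))
    return "\n".join(lines) + "\n"

body = bulk_create_lines("logs", [{"message": "a"}, {"message": "b"}])
```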

@mkocikowski

I've been looking at the same thing when using journald and the __CURSOR field for idempotent indexing.
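The journald approach could look like the following sketch: `__CURSOR` is unique and stable per journal entry, so hashing it into a name-based (v5) UUID yields the same _id every time the journal is re-read. The namespace choice and helper name here are assumptions for illustration, not actual code from that work.

```python
import uuid

# Arbitrary fixed namespace for journald-derived IDs (an assumption;
# any constant UUID reused consistently would do).
JOURNALD_NS = uuid.NAMESPACE_DNS

def id_from_cursor(cursor):
    """Map a journald __CURSOR (unique per entry) to a stable UUID,
    so re-reading the journal re-indexes each entry under the same _id."""
    return str(uuid.uuid5(JOURNALD_NS, cursor))

dummy_cursor = "s=0123abcd;i=2c81"  # dummy value for illustration
assert id_from_cursor(dummy_cursor) == id_from_cursor(dummy_cursor)
```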

@jpvillalobos

+1

@nicolasguyomar

+1
That enhancement would allow us to get rid of Logstash, which we run as a shipper only to use the uuid filter, as mentioned here: https://www.elastic.co/fr/blog/just-enough-kafka-for-the-elastic-stack-part2

@cdahlqvist

Now that the Elastic Stack ingest components support at-least-once delivery guarantees, having the ability to prevent duplicates by adding a unique identifier to each event at the source would be great.

We should try to ensure that the default (if applicable) is an efficient identifier from Elasticsearch's point of view.
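One shape such an identifier could take, as a sketch under the assumption that IDs sharing a time-ordered prefix are cheaper for Elasticsearch to index and compress than fully random UUIDs (the helper name and exact layout are made up for illustration):

```python
import os
import time

def sortable_id():
    """A roughly time-ordered ID: a millisecond-timestamp hex prefix
    plus a random suffix. IDs generated close together share a prefix,
    so they cluster in the index rather than scattering like uuid4."""
    ts = int(time.time() * 1000)
    return "{:012x}{}".format(ts, os.urandom(8).hex())

a, b = sortable_id(), sortable_id()
assert a[:6] == b[:6]  # generated close in time -> shared prefix
```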

@alexandrejuma

+1

@xuyangxy

Excuse me, for the filebeat --> kafka path, how do I add a random UUID field? What configuration changes do I need to write?

@urso urso self-assigned this Aug 7, 2019
@shushantan

shushantan commented Nov 12, 2019

+1
Thanks!
Will we have the ability to generate a UUID on documents without relying on Logstash?

@urso

urso commented Dec 11, 2019

Different strategies to add document IDs have been implemented for the upcoming releases.
See the related meta issue and referenced PRs for details: #14363
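For readers landing here later, the implemented strategies can be configured roughly like this in filebeat.yml. This is a sketch based on the `add_id` and `fingerprint` processors referenced from that meta issue; check the current Filebeat documentation for exact option names and defaults.

```yaml
filebeat.inputs:
  - type: log
    paths: ["/var/log/app/*.log"]

processors:
  # Attach a random, unique ID to each event
  # (written to @metadata._id by default):
  - add_id: ~
  # Or derive a stable, content-based ID instead:
  # - fingerprint:
  #     fields: ["message", "log.file.path"]
  #     target_field: "@metadata._id"

output.elasticsearch:
  hosts: ["localhost:9200"]
```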

@urso urso closed this as completed Dec 11, 2019