[Docs] Discuss deduplication strategies in the Integrations Developer Guide #11266

Open
chrisberkhout opened this issue Sep 27, 2024 · 1 comment

@chrisberkhout
Contributor

We have several strategies for handling duplicate events:

  • Use clean pagination logic that avoids ingesting duplicates.
  • Use the fingerprint processor to set an _id value.
  • Use a latest transform, as we do for IOC data.
  • Just tolerate duplicates.
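
To make the `_id` strategy concrete, here is a minimal Python sketch of the idea behind it: hash a chosen set of identifying fields so that the same event always gets the same ID. It is a plain-Python stand-in, not the fingerprint processor's configuration, and the function names and field choices (`fingerprint_id`, `deduplicate`, `fields`) are hypothetical.

```python
import hashlib
import json


def fingerprint_id(event: dict, fields: list[str]) -> str:
    """Derive a deterministic ID from selected fields of an event.

    `fields` is a hypothetical choice of identifying fields. Missing
    fields are hashed as null so their absence still affects the ID.
    """
    identity = {name: event.get(name) for name in fields}
    # Sorted keys and fixed separators keep the serialization stable,
    # so equal values always produce the same hash.
    payload = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def deduplicate(events: list[dict], fields: list[str]) -> dict[str, dict]:
    """Keep the first event seen for each fingerprint."""
    unique: dict[str, dict] = {}
    for event in events:
        unique.setdefault(fingerprint_id(event, fields), event)
    return unique
```

The choice of fields matters: including a volatile field such as an ingest timestamp would give every copy a different ID and defeat the deduplication.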

The nature of the data set may make a certain strategy preferable. Some relevant questions:

  • Is the data set append-only or are events updated?
  • What is the impact of duplicates? (e.g. do they inflate counts or cause excess alerts?)
  • Do we receive information about deletions (soft deletes)?
  • Do we need to expire old events?
  • Do we want to retain a history of changes or just the latest state?

The transform approach has some IOC-specific support. Other uses are possible, but see elastic/kibana#134321 and elastic/kibana#137278.
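
For readers unfamiliar with the latest-transform idea, the sketch below shows the logic in plain Python: group events by a unique key and keep only the most recent one per key. It is an illustration of the concept, not the transform's actual configuration; `latest_by_key` and its parameters are hypothetical names, and the sort field is assumed to hold values that compare correctly (for example ISO 8601 timestamps in a single time zone).

```python
def latest_by_key(
    events: list[dict], unique_key: list[str], sort_field: str
) -> dict[tuple, dict]:
    """Return the most recent event for each unique key.

    `unique_key` lists the fields that identify an entity, and
    `sort_field` names the field that orders its versions.
    """
    latest: dict[tuple, dict] = {}
    for event in events:
        key = tuple(event[name] for name in unique_key)
        current = latest.get(key)
        # On a tie, the event seen later in the input wins.
        if current is None or event[sort_field] >= current[sort_field]:
            latest[key] = event
    return latest
```

The result holds current state only. Change history is discarded, which is why the questions above about history and deletions bear on whether this strategy fits a data set.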

The First Class Data Streams Elasticsearch Changes document may be relevant.

There may be ways to improve on our current deduplication strategies, but first we can describe the existing ones in the Integrations Developer Guide and recommend when to use each.

chrisberkhout self-assigned this on Sep 27, 2024
@chrisberkhout
Contributor Author

The _id strategy depends on how a data stream's backing indexes are rolled over. How is that done now? Can it be controlled?
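
A toy model may help show why rollover matters here. The Python sketch below assumes that a data stream is append-only and that `_id` uniqueness is checked only within the current write index, not across all backing indexes. The class `ToyDataStream` and its methods are hypothetical and only illustrate that assumption.

```python
class ToyDataStream:
    """Toy model of an append-only stream with per-index _id uniqueness."""

    def __init__(self) -> None:
        # Each backing index maps _id to document; the last is the write index.
        self.backing_indexes: list[dict[str, dict]] = [{}]

    def rollover(self) -> None:
        """Start a new, empty write index."""
        self.backing_indexes.append({})

    def create(self, doc_id: str, doc: dict) -> bool:
        """Add a document; reject it if the write index already has the _id."""
        write_index = self.backing_indexes[-1]
        if doc_id in write_index:
            return False  # conflict: the duplicate is rejected
        write_index[doc_id] = doc
        return True

    def count(self, doc_id: str) -> int:
        """Count copies of an _id across all backing indexes."""
        return sum(doc_id in index for index in self.backing_indexes)
```

In this model a repeated `_id` is rejected until a rollover happens, after which the same `_id` is accepted again and the stream holds two copies. Under that assumption, the strategy only deduplicates events that arrive within the lifetime of one backing index.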

Should we be using more plain indexes, for example when the data is more like inventory than like logs?
