filebeat aws s3 input docs - Add doc _id generation details
Document the details about how the input generates Elasticsearch
document _id values.
andrewkroh committed Dec 19, 2024
1 parent 323c69e commit 11e049b
Showing 1 changed file with 72 additions and 0 deletions.
72 changes: 72 additions & 0 deletions x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc
@@ -88,6 +88,78 @@ that require a different endpoint.
expand_event_list_from_field: Records
----

[float]
=== Document ID Generation

To prevent duplicate events in Elasticsearch, the input generates a custom
document `_id` for each event instead of relying on Elasticsearch to generate
one automatically. Each document in an Elasticsearch index must have a unique
`_id`, and {beatname_uc} uses this property to avoid ingesting duplicate
events.

The custom `_id` is based on several pieces of information from the S3 object:
the Last-Modified timestamp, the bucket ARN, the object key, and the byte
offset of the data in the event.
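
As a rough illustration of why these four values yield a stable identifier,
the following Python sketch derives an ID from them. The function name
`make_doc_id`, the use of SHA-256, the truncation length, and the output layout
are all assumptions made for this example; they are not the documented format
that the input produces.

```python
import hashlib
from datetime import datetime, timezone


def make_doc_id(last_modified: datetime, bucket_arn: str, key: str, offset: int) -> str:
    """Build a deterministic ID from the S3 object identity and a byte offset.

    Hypothetical format: the hash function, truncation, and layout are
    assumptions for illustration only.
    """
    # Join the object-level fields with a separator so that field
    # boundaries cannot shift between bucket ARN and key.
    object_identity = "|".join(
        [last_modified.astimezone(timezone.utc).isoformat(), bucket_arn, key]
    )
    object_hash = hashlib.sha256(object_identity.encode("utf-8")).hexdigest()[:16]
    # The byte offset distinguishes events that come from the same object.
    return f"{object_hash}-{offset:012d}"


modified = datetime(2024, 12, 19, 12, 0, 0, tzinfo=timezone.utc)
first = make_doc_id(modified, "arn:aws:s3:::example-bucket", "logs/a.json", 0)
retry = make_doc_id(modified, "arn:aws:s3:::example-bucket", "logs/a.json", 0)
other = make_doc_id(modified, "arn:aws:s3:::example-bucket", "logs/a.json", 512)
```

Reading the same object again produces the same ID for the event at a given
offset (`first == retry`), while an event at a different offset gets a
different ID (`first != other`).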

Duplicate prevention is particularly useful when {beatname_uc} needs to retry
an operation. {beatname_uc} guarantees at-least-once delivery, meaning it
retries any failed or incomplete operation, so the same event can be sent more
than once. Retries may be triggered by issues with the host, {beatname_uc},
network connectivity, or services such as Elasticsearch, SQS, or S3.
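
The effect of a retry can be sketched with a toy model in Python. Here a plain
dictionary stands in for an index, and `deliver` is a hypothetical helper that
resends a batch once to imitate a lost acknowledgement. None of this reflects
the actual {beatname_uc} or Elasticsearch code; it only shows why a stable ID
makes a resend harmless.

```python
import uuid


def deliver(events, write, ack_lost=True):
    """Send every event; resend the whole batch once if the ack is lost."""
    attempts = 2 if ack_lost else 1
    for _ in range(attempts):
        for doc_id, doc in events:
            write(doc_id, doc)


# Two events from one object, with made-up IDs for the example.
events = [
    ("obj1-000000000000", {"message": "first"}),
    ("obj1-000000000128", {"message": "second"}),
]

# Explicit IDs: the resend hits keys that already exist, so nothing is added.
with_ids = {}
deliver(events, lambda doc_id, doc: with_ids.setdefault(doc_id, doc))

# Auto-generated IDs: every write gets a fresh key, so the resend duplicates.
auto_ids = {}
deliver(events, lambda doc_id, doc: auto_ids.setdefault(str(uuid.uuid4()), doc))
```

After the simulated retry, `with_ids` holds two documents and `auto_ids` holds
four.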

[float]
==== Limitations of `_id`-Based Deduplication

There are some limitations to consider when using `_id`-based deduplication in
Elasticsearch:

* Deduplication works only within a single index. The same `_id` can exist in
different indices, which matters if you write to a data stream or an index
alias. If an event is retried after the write target rolls over, it is written
to the new backing index and stored as a duplicate.

* Indexing operations in Elasticsearch may take longer when an `_id` is
specified. Elasticsearch needs to check if the ID already exists before
writing, which can increase the time required for indexing.
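
The rollover limitation can be sketched with a small in-memory model in
Python. The `RollingAlias` class is hypothetical and is not an Elasticsearch
API; it only assumes that writes go to the newest backing index and that the
ID check is limited to that index.

```python
class RollingAlias:
    """Toy model of an alias whose writes go to the newest backing index."""

    def __init__(self):
        self.backing = [{}]  # one dict per backing index

    def rollover(self):
        # Later writes land in a new, empty backing index.
        self.backing.append({})

    def index(self, doc_id, doc):
        """Store doc unless doc_id already exists in the current write index."""
        write_index = self.backing[-1]
        if doc_id in write_index:
            return False  # rejected as a duplicate
        write_index[doc_id] = doc
        return True

    def count(self):
        return sum(len(index) for index in self.backing)


alias = RollingAlias()
stored_first = alias.index("obj1-000000000000", {"message": "hello"})
stored_retry = alias.index("obj1-000000000000", {"message": "hello"})

alias.rollover()
stored_after_rollover = alias.index("obj1-000000000000", {"message": "hello"})
total = alias.count()
```

In this model the retry before the rollover is rejected, but the retry after
the rollover is accepted, leaving two copies of the event across the backing
indices.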

[float]
==== Disabling Duplicate Prevention

If you want to disable the `_id`-based deduplication, you can remove the
document `_id` using the `drop_fields` processor in {beatname_uc}.

["source","yaml",subs="attributes"]
----
{beatname_lc}.inputs:
  - type: aws-s3
    queue_url: https://queue.amazonaws.com/80398EXAMPLE/MyQueue
    processors:
      - drop_fields:
          fields:
            - '@metadata._id'
          ignore_missing: true
----

Alternatively, you can remove the `_id` field using an Elasticsearch Ingest
Node pipeline.

["source","json",subs="attributes"]
----
{
  "processors": [
    {
      "remove": {
        "if": "ctx.input?.type == \"aws-s3\"",
        "field": "_id",
        "ignore_missing": true
      }
    }
  ]
}
----

[float]
=== Configuration

The `aws-s3` input supports the following configuration options plus the
<<{beatname_lc}-input-{type}-common-options>> described later.
