filebeat aws s3 input docs - Add doc _id generation details

Document the details about how the input generates Elasticsearch document _id values.
elastic · Dec 19, 2024 · 11e049b
1 parent 323c69e
commit 11e049b
Showing 1 changed file with 72 additions and 0 deletions.
diff --git a/x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc b/x-pack/filebeat/docs/inputs/input-aws-s3.asciidoc
@@ -88,6 +88,78 @@ that require a different endpoint.
   expand_event_list_from_field: Records
 ----
 
+[float]
+=== Document ID Generation
+
+To prevent duplicate events in Elasticsearch, the input generates a custom
+document `_id` for each event rather than relying on Elasticsearch to
+generate one automatically. Each document in an Elasticsearch index must have a
+unique `_id`, and {beatname_uc} uses this property to avoid ingesting
+duplicate events.
+
+The custom `_id` is based on several pieces of information from the S3 object:
+the Last-Modified timestamp, the bucket ARN, the object key, and the byte
+offset of the data in the event.
+
+Duplicate prevention is particularly useful in scenarios where {beatname_uc}
+needs to retry an operation. {beatname_uc} guarantees at-least-once delivery,
+meaning it will retry any failed or incomplete operations. These retries may be
+triggered by issues with the host, {beatname_uc}, network connectivity, or
+services such as Elasticsearch, SQS, or S3.
+
+[float]
+==== Limitations of `_id`-Based Deduplication
+
+There are some limitations to consider when using `_id`-based deduplication in
+Elasticsearch:
+
+* Deduplication works only within a single index. The same `_id` can exist in
+  different indices, which matters if you use data streams or index aliases.
+  After a rollover, a retried event may be indexed into the new backing index.
+
+* Indexing operations in Elasticsearch may take longer when an `_id` is
+  specified, because Elasticsearch must check whether the ID already exists
+  before writing the document.
+
+[float]
+==== Disabling Duplicate Prevention
+
+To disable `_id`-based deduplication, remove the document `_id` with the
+`drop_fields` processor in {beatname_uc}.
+
+["source","yaml",subs="attributes"]
+----
+{beatname_lc}.inputs:
+  - type: aws-s3
+    queue_url: https://queue.amazonaws.com/80398EXAMPLE/MyQueue
+    processors:
+      - drop_fields:
+          fields:
+            - '@metadata._id'
+          ignore_missing: true
+----
+
+Alternatively, you can remove the `_id` field using an Elasticsearch ingest
+pipeline.
+
+["source","json",subs="attributes"]
+----
+{
+  "processors": [
+    {
+      "remove": {
+        "if": "ctx.input?.type == \"aws-s3\"",
+        "field": "_id",
+        "ignore_missing": true
+      }
+    }
+  ]
+}
+----
+
+[float]
+=== Configuration
+
 The `aws-s3` input supports the following configuration options plus the
 <<{beatname_lc}-input-{type}-common-options>> described later.
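
Note on the `_id` scheme documented in this diff: the following is a minimal Python sketch of how a deterministic ID could be derived from the four inputs the text names (Last-Modified timestamp, bucket ARN, object key, byte offset). The hash function, field order, separator, and zero-padding are assumptions chosen for illustration, not the input's actual format, and `object_hash`, `event_id`, and `index_event` are hypothetical helpers. The dictionaries stand in for Elasticsearch indices and assume a create-only write that rejects an `_id` that already exists.

```python
import hashlib
from datetime import datetime, timezone


def object_hash(bucket_arn: str, object_key: str) -> str:
    """Short, stable hash identifying one S3 object (illustrative scheme only)."""
    digest = hashlib.sha256()
    digest.update(bucket_arn.encode("utf-8"))
    digest.update(object_key.encode("utf-8"))
    return digest.hexdigest()[:10]


def event_id(last_modified: datetime, bucket_arn: str, object_key: str, offset: int) -> str:
    """Combine the four inputs into one deterministic ID (assumed layout)."""
    epoch_seconds = int(last_modified.timestamp())
    return f"{epoch_seconds}-{object_hash(bucket_arn, object_key)}-{offset:012d}"


def index_event(index: dict, doc_id: str, body: str) -> bool:
    """Mimic a create-only write: return False if the _id is already present."""
    if doc_id in index:
        return False
    index[doc_id] = body
    return True


# Hypothetical object metadata, used only for this example.
modified = datetime(2024, 12, 19, 12, 0, 0, tzinfo=timezone.utc)
arn = "arn:aws:s3:::example-bucket"
key = "logs/2024/12/19/app.log"

first = event_id(modified, arn, key, 0)
retry = event_id(modified, arn, key, 0)         # same event, delivered again
next_event = event_id(modified, arn, key, 512)  # later event in the same object

backing_index_1 = {}
backing_index_2 = {}  # stands in for the write index after a rollover

stored_first = index_event(backing_index_1, first, "line 1")
stored_retry = index_event(backing_index_1, retry, "line 1")
stored_after_rollover = index_event(backing_index_2, retry, "line 1")
```

In this sketch the retried event is rejected by the first index but accepted by the second, which mirrors the single-index limitation described in the diff. Because the ID depends only on object metadata and offset, a retry always recomputes the same value without any shared state.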