[Filebeat] Refactor AWS S3 input with workers #27199

Merged
merged 10 commits into elastic:master from feature/fb/aws-s3-refactor on Aug 12, 2021

Conversation

andrewkroh
Member

@andrewkroh andrewkroh commented Aug 2, 2021

What does this PR do?

[Diagram: sqs-workers]

This changes the AWS S3 input to allow it to process more SQS messages in parallel
by having workers that are fully utilized while there are SQS messages to process.

The previous design processed SQS messages in batches ranging from 1 to 10 in size.
It waited until all messages were processed before requesting more. This left some
workers idle toward the end of processing the batch (as seen in the monitoring metrics
below). This also limited the maximum number of messages processed in parallel to
10 because that is the largest request size allowed by SQS.

[Screenshot: monitoring metrics, 2021-07-29]

The refactored input uses ephemeral goroutines as workers to process SQS messages. It
receives as many SQS messages as there are free workers. The total number of workers
is controlled by max_number_of_messages (same as before but without an upper limit).
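
A minimal sketch of this model (the helper names and types here are illustrative, not the input's actual API): a token channel caps the number of in-flight SQS messages at max_number_of_messages, and each received message is handled by its own short-lived goroutine.

```go
package sketch

import (
	"context"
	"sync"
)

type sqsMessage struct{ Body string }

// receiveMessages and processMessage are illustrative stand-ins for the real
// SQS ReceiveMessage call and the S3 event handling logic.
func receiveMessages(ctx context.Context, max int) ([]sqsMessage, error) { return nil, ctx.Err() }
func processMessage(ctx context.Context, msg sqsMessage)                 {}

func receiveLoop(ctx context.Context, maxInflight int) {
	tokens := make(chan struct{}, maxInflight)
	for i := 0; i < maxInflight; i++ {
		tokens <- struct{}{}
	}

	var wg sync.WaitGroup
	defer wg.Wait()

	for ctx.Err() == nil {
		// Block until at least one worker slot is free.
		select {
		case <-tokens:
		case <-ctx.Done():
			return
		}

		// Grab any other free slots and request that many messages in one
		// call (a single SQS ReceiveMessage request returns at most 10).
		n := 1
		for n < 10 && len(tokens) > 0 {
			<-tokens
			n++
		}

		msgs, err := receiveMessages(ctx, n)
		if err != nil {
			return
		}

		// Return tokens for slots the receive call did not fill.
		for i := len(msgs); i < n; i++ {
			tokens <- struct{}{}
		}

		// One ephemeral goroutine per message; the slot is freed only when
		// the message has been fully handled.
		for _, msg := range msgs {
			wg.Add(1)
			go func(m sqsMessage) {
				defer wg.Done()
				processMessage(ctx, m)
				tokens <- struct{}{}
			}(msg)
		}
	}
}
```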

[Diagram: s3-worker]

Other changes

Prevent poison pill messages

When an S3 object processing error occurs, the SQS message is returned to the queue after
the visibility timeout expires. This allows it to be reprocessed or moved to
the SQS dead letter queue (if configured). But if no dead letter queue policy is
configured and the error is permanent (reprocessing won't fix it), then the message
would continuously be reprocessed. On error the input will now check the
ApproximateReceiveCount attribute of the SQS message and delete it if it exceeds
the configured maximum retries.
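
A hypothetical sketch of that guard (the attribute name is how SQS reports the receive count; maxReceiveCount and the deleteMessage helper are illustrative, not the input's actual code):

```go
package sketch

import "strconv"

type sqsMessage struct{ Attributes map[string]string }

// deleteMessage is an illustrative stand-in for the SQS DeleteMessage call.
func deleteMessage(msg sqsMessage) error { return nil }

// onProcessingError decides what to do with a message whose S3 object failed
// to process. SQS reports ApproximateReceiveCount as a string attribute.
func onProcessingError(msg sqsMessage, procErr error, maxReceiveCount int) error {
	count, err := strconv.Atoi(msg.Attributes["ApproximateReceiveCount"])
	if err != nil {
		return procErr
	}
	if count >= maxReceiveCount {
		// Likely a poison pill: delete it so it cannot keep reappearing
		// after every visibility timeout.
		return deleteMessage(msg)
	}
	// Otherwise leave it; it becomes visible again after the visibility
	// timeout and may be retried or routed to a dead letter queue.
	return procErr
}
```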

Removal of api_timeout from S3 GetObject calls

The api_timeout has been removed from the S3 GetObject call. It limited the
maximum amount of time available for processing the object, since the response body is
processed as a stream while the request is open. Requests can still time out on the
server side due to inactivity.

Improved debug logs

The log messages have been enriched with more data about the related SQS message and
S3 object. For example, the SQS message_id, s3_bucket, and s3_object are
included in some messages.

DEBUG [aws-s3.sqs_s3_event] awss3/s3.go:127 End S3 object processing. {"id": "test_id", "queue_url": "https://sqs.us-east-1.amazonaws.com/144492464627/filebeat-s3-integtest-lxlmx6", "message_id": "a11de9f9-0a68-4c4e-a09d-979b87602958", "s3_bucket": "filebeat-s3-integtest-lxlmx6", "s3_object": "events-array.json", "elapsed_time_ns": 23262327}
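
A small illustrative helper (not the input's actual code) showing how fields like these can be bound to a logger once and reused for every line about a given message; it assumes the Beats logp package:

```go
package sketch

import (
	"time"

	"github.com/elastic/beats/v7/libbeat/logp"
)

// logObjectDone shows the kind of contextual fields attached to each message.
func logObjectDone(queueURL, messageID, bucket, key string, elapsed time.Duration) {
	log := logp.NewLogger("aws-s3.sqs_s3_event").With(
		"queue_url", queueURL,
		"message_id", messageID,
		"s3_bucket", bucket,
		"s3_object", key,
	)
	log.Debugw("End S3 object processing.", "elapsed_time_ns", elapsed.Nanoseconds())
}
```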

Increased test coverage

The refactored input has about 88% test coverage.

The specific AWS API methods used by the input were turned into interfaces to allow
for easier testing. The unit tests mock the AWS interfaces.

The parts of the input were separated into the three components listed below. There's a defined
interface for each to allow for mock testing there too. To test the interactions between
these components, go-mock is used to generate mocks and then assert the expectations
(a minimal sketch of this interface layout follows the list).

  1. The SQS receiver. (sqs.go)
  2. The S3 Notification Event handler. (sqs_s3_event.go)
  3. The S3 Object reader. (s3.go)
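
A minimal sketch of that layout, with simplified method sets (these are stand-ins, not the exact interfaces used by the input):

```go
package sketch

import (
	"context"
	"io"
)

type sqsMessage struct {
	Body       string
	Attributes map[string]string
}

// sqsAPI abstracts the SQS calls the receiver (sqs.go) needs, so tests can
// substitute a mock instead of talking to AWS.
type sqsAPI interface {
	ReceiveMessage(ctx context.Context, maxMessages int) ([]sqsMessage, error)
	DeleteMessage(ctx context.Context, msg sqsMessage) error
	ChangeMessageVisibility(ctx context.Context, msg sqsMessage, timeoutSeconds int) error
}

// s3API abstracts the S3 calls the object reader (s3.go) needs.
type s3API interface {
	GetObject(ctx context.Context, bucket, key string) (io.ReadCloser, error)
}

// sqsProcessor is the contract between the SQS receiver and the
// S3 notification event handler (sqs_s3_event.go).
type sqsProcessor interface {
	ProcessSQS(ctx context.Context, msg sqsMessage) error
}
```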

Terraform setup for integration test

Setup for executing the integration tests is now handled by Terraform.
See _meta/terraform/README.md for instructions.

Benchmark test

I added a benchmark that tests the input in isolation with mocked SQS and S3 responses.
It uses a 7 KB CloudTrail json.gz file containing about 60 messages as its input.
This removes any variability related to the network, but it also means the numbers do not
reflect real-world rates. They can be used to measure the effect of future changes.
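
A hypothetical shape for such a benchmark (mockSQS, mockS3, and runInputOnce are illustrative stand-ins, not the real test helpers):

```go
package sketch

import (
	"fmt"
	"testing"
	"time"
)

// mockSQS and mockS3 stand in for clients that return canned responses (the
// same small CloudTrail json.gz object every time), so no network is involved.
type mockSQS struct{}
type mockS3 struct{}

// runInputOnce drives the input until the mocked queue is drained and returns
// the number of events that were published.
func runInputOnce(q mockSQS, s mockS3, maxInflight int) int { return 0 }

func BenchmarkInputSQS(b *testing.B) {
	for _, maxInflight := range []int{1, 8, 64, 512} {
		b.Run(fmt.Sprintf("inflight=%d", maxInflight), func(b *testing.B) {
			start := time.Now()
			events := 0
			for i := 0; i < b.N; i++ {
				events += runInputOnce(mockSQS{}, mockS3{}, maxInflight)
			}
			b.ReportMetric(float64(events)/time.Since(start).Seconds(), "events/sec")
		})
	}
}
```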

+-------------------+--------------------+------------------+--------------------+------+
| MAX MSGS INFLIGHT |   EVENTS PER SEC   | S3 BYTES PER SEC |     TIME (SEC)     | CPUS |
+-------------------+--------------------+------------------+--------------------+------+
|                 1 | 36380.253532518276 | 1.2 MB           |        1.243366816 |   12 |
|                 2 | 61727.549671738896 | 2.1 MB           | 1.0951187980000001 |   12 |
|                 4 |  86218.70431874547 | 3.0 MB           |        1.208577661 |   12 |
|                 8 | 131900.69854257331 | 4.5 MB           |        1.179751144 |   12 |
|                16 |   151824.404438336 | 5.2 MB           |        1.083857372 |   12 |
|                32 | 155548.56015625654 | 5.3 MB           |        1.170502638 |   12 |
|                64 | 166188.27838709904 | 5.7 MB           |         1.17403587 |   12 |
|               128 | 185429.33410590928 | 6.4 MB           |        3.380193339 |   12 |
|               256 | 186181.15705271234 | 6.4 MB           |         1.66313823 |   12 |
|               512 |  197793.9906993712 | 7.3 MB           |        1.230735065 |   12 |
|              1024 | 211492.91373843007 | 7.4 MB           |        1.246637513 |   12 |
+-------------------+--------------------+------------------+--------------------+------+

Relates #25750

Why is it important?

This enables easier vertical scaling of the Filebeat aws-s3 input and helps it better utilize the available CPU.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • Update documentation
  • Make max receive message retries configurable.
  • Streaming JSON parser code needs to be cleaned up
  • Check for any recent changes since May that were not merged into the refactoring
  • Changelog

Related issues

@botelastic botelastic bot added the needs_team (Indicates that the issue/PR needs a Team:* label) label Aug 2, 2021
@andrewkroh andrewkroh force-pushed the feature/fb/aws-s3-refactor branch from 2776dfd to d9e2eb3 Compare August 2, 2021 19:11
@botelastic botelastic bot removed the needs_team (Indicates that the issue/PR needs a Team:* label) label Aug 2, 2021
@andrewkroh andrewkroh marked this pull request as ready for review August 2, 2021 19:17
@elasticmachine
Collaborator

Pinging @elastic/integrations (Team:Integrations)

@elasticmachine
Collaborator

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

@elasticmachine
Collaborator

elasticmachine commented Aug 2, 2021

💚 Build Succeeded


Build stats

  • Start Time: 2021-08-11T20:10:48.769+0000

  • Duration: 218 min 16 sec

  • Commit: d246c04

Test stats 🧪

Test Results
Failed 0
Passed 53000
Skipped 5318
Total 58318


💚 Flaky test report

Tests succeeded.


@kaiyan-sheng kaiyan-sheng requested a review from aspacca August 2, 2021 20:55
@aspacca

aspacca commented Aug 4, 2021

LGTM

Comment on lines +34 to +35
cd x-pack/filebeat/input/awss3
go test -tags aws,integration -run TestInputRun -v .
Contributor

@kaiyan-sheng would this work with our existing AWS CI integration? That would be great.

Contributor

I think we have the AWS tests running weekly right now in the CI, and this would work. @v1v Do you know if this is easy to add?

Member

The AWS tests run on a weekly basis, as you said, and it runs the commands defined in:

  • cloud:
      cloud: "mage build test"
      withModule: true ## run the ITs only if the changeset affects a specific module.
      dirs: ## run the cloud tests for the given modules.
        - "x-pack/metricbeat/module/aws"
      when: ## Override the top-level when.
        parameters:
          - "awsCloudTests"
        comments:
          - "/test x-pack/metricbeat for aws cloud"
        labels:
          - "aws"
        stage: extended

mage build test runs in the x-pack/metricbeat folder, so if the above-mentioned command is part of a mage target then it should work.


the test requires applying a terraform file (and destroying it at the end):
https://github.com/elastic/beats/blob/master/x-pack/filebeat/input/awss3/_meta/terraform/README.md#usage

will this happen?

Comment on lines +236 to +237
acker := newEventACKTracker(ctx)
defer acker.Wait()

Actually, one doubt here: do we really want to wait for all the events to be ACKed before the worker shuts down?

Member Author

Yes. S3 event notifications for ObjectCreated sometimes contain information for more than one object. This function returns after all the events generated from each of those objects (usually it's just one object) have been ACKed. Then the caller of this function can stop updating the SQS visibility timeout for this one SQS message and decide, based on whether an error occurred while processing the S3 object, whether it should delete the SQS message or try to process it again.


Yes, I understand the reason for waiting in the PR, but I was wondering whether we should wait for the ACKs as part of handling the SQS message.
The current implementation doesn't do this: in theory the ACKs could be blocked for a certain amount of time (a network issue reaching Elasticsearch, for example), and the Filebeat queue should be able to catch up later. Whereas here we will block starting new workers, if I got it right.

Member Author

The reason for waiting until all ACKs are received before deleting the SQS message (and freeing the worker) is that the queued events are non-persistent by default. If we delete the SQS message before receiving all ACKs then Filebeat is vulnerable to data loss if the process is killed before the queue is emptied (events are ACKed).

By waiting until all ACKs are received we ensure that another instance of Filebeat will process the same message again if this process is killed before all events are ACKed.
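
Expressed as a rough sketch (only newEventACKTracker appears in this PR; the other helpers below are illustrative stand-ins), the ordering being discussed is:

```go
package sketch

import "context"

type sqsMessage struct{ Body string }

// ackTracker waits until every event published for this SQS message has been
// acknowledged by the output.
type ackTracker struct{}

func newEventACKTracker(ctx context.Context) *ackTracker                      { return &ackTracker{} }
func (a *ackTracker) Wait()                                                   {}
func processS3Events(ctx context.Context, m sqsMessage, a *ackTracker) error  { return nil }
func deleteSQSMessage(ctx context.Context, m sqsMessage)                      {}

func handleMessage(ctx context.Context, msg sqsMessage) {
	acker := newEventACKTracker(ctx)
	err := processS3Events(ctx, msg, acker) // publishes events tied to this acker
	acker.Wait()                            // block until everything is ACKed (or ctx is canceled)

	if err == nil {
		// Only now is it safe to delete: the data has left the in-memory queue.
		deleteSQSMessage(ctx, msg)
	}
	// On error the message is left alone so it reappears after the visibility
	// timeout and can be retried (subject to the receive-count guard).
}
```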


I agree we should not delete the SQS message before receiving all ACKs, but by making the worker wait for the ACKs we prevent it from filling the Filebeat queue in cases of backpressure. If the ACK wait / keepalive / delete-message handling was done somewhere else (in a goroutine), that would free the worker to keep processing other SQS messages and filling the queue.

Member Author

We could write it that way.

The current implementation is derived from having a single tuning knob of max_number_of_messages that controls the number of inflight messages. If we change the model then we'd add another option that controls the number of workers independently of the max_number_of_messages.

I'm hesitant to change it now without doing some testing to see if that is actually a problem. I think that if you size the max_number_of_messages parameter appropriately based on the files you're processing and your queue.mem.events size then you should be able to keep the internal memory queue full in this model without much of a problem.


ctrl, ctx := gomock.WithContext(ctx, t)
defer ctrl.Finish()
mockS3Pager := newMockS3Pager(ctrl, 1, fakeObjects)
Member Author

@andrewkroh andrewkroh Aug 5, 2021

@aspacca I updated the S3 interface to account for your need to call ListObjects. I included a helper to mock the S3 pagination calls for your tests.

Contributor

@leehinman leehinman left a comment

LGTM. Just a question about the acker wait function.

x-pack/filebeat/input/awss3/acker.go (resolved)
@aspacca aspacca requested a review from faec August 9, 2021 15:06
@aspacca

aspacca commented Aug 11, 2021

/test

@andrewkroh andrewkroh requested a review from a team August 11, 2021 14:43
@aspacca

aspacca commented Aug 11, 2021

LGTM

Please, @andrewkroh, could you wait for #27126 to be merged and adapt any needed changes in your PR?

The E2E test failures should be fixed if you merge the latest master.

@andrewkroh
Member Author

Sure, I'll wait for that and then fix any merge conflicts.

Contributor

@kaiyan-sheng kaiyan-sheng left a comment

Thank you @andrewkroh for this enhancement. Should filebeat.reference.yml also get updated in the aws-s3 input section? Other than that, everything looks good.

@andrewkroh
Member Author

I'll add the two new settings to filebeat.reference.yml when I do the merge for #27126.

  • sqs.max_receive_count
  • sqs.wait_time

@elastic elastic deleted a comment from mergify bot Aug 11, 2021
@elastic elastic deleted a comment from mergify bot Aug 11, 2021
@andrewkroh andrewkroh merged commit 7c76158 into elastic:master Aug 12, 2021
@aspacca aspacca added the backport-v7.15.0 (Automated backport with mergify) label Aug 12, 2021
mergify bot pushed a commit that referenced this pull request Aug 12, 2021
* Refactor AWS S3 input with workers

* Use InitializeAWSConfig

* Add s3Lister interface for mocking pagination of S3 ListObjects calls

* Add new config parameters to reference.yml

* Optimize uploading b/c it was slow in aws v2 sdk

(cherry picked from commit 7c76158)

# Conflicts:
#	x-pack/filebeat/input/awss3/collector.go
#	x-pack/filebeat/input/awss3/collector_test.go
andrewkroh added a commit that referenced this pull request Aug 13, 2021
* Refactor AWS S3 input with workers

andrewkroh added a commit that referenced this pull request Aug 13, 2021
* Refactor AWS S3 input with workers

Labels
backport-v7.15.0 (Automated backport with mergify), enhancement, Filebeat, Team:Integrations (Label for the Integrations team)
8 participants