[filebeat][azure-blob-storage] - Fixed concurrency & flakey tests issue #36124

Merged
14 commits merged into elastic:main from abs/concurrency_fix
Aug 4, 2023

Conversation

ShourieG
Contributor

Type of change

  • Bug

What does this PR do?

This PR fixes the concurrency issues in the Azure Blob Storage input and the flaky tests.

Why is it important?

Concurrent operations were failing at scale; this fix addresses that.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
    - [ ] I have made corresponding changes to the documentation
    - [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

@ShourieG ShourieG requested a review from a team as a code owner July 20, 2023 09:05
@botelastic botelastic bot added the needs_team label Jul 20, 2023
@ShourieG ShourieG self-assigned this Jul 20, 2023
@elasticmachine
Collaborator

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

@botelastic botelastic bot removed the needs_team label Jul 20, 2023
@ShourieG ShourieG added needs_team bugfix labels Jul 20, 2023
@botelastic botelastic bot removed the needs_team label Jul 20, 2023
@ShourieG ShourieG added needs_team 8.10-candidate labels Jul 20, 2023
@botelastic botelastic bot removed the needs_team label Jul 20, 2023
@mergify
Contributor

mergify bot commented Jul 20, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @ShourieG? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@ShourieG ShourieG changed the title Abs/concurrency fix [filebeat][azure-blob-storage] - Fixed concurrency & flakey tests issue Jul 20, 2023
@elasticmachine
Collaborator

elasticmachine commented Jul 20, 2023

💚 Build Succeeded

Build stats

  • Start Time: 2023-08-04T02:47:19.171+0000

  • Duration: 78 min 12 sec

Test stats 🧪

Test Results
Failed 0
Passed 3116
Skipped 176
Total 3292

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@mergify
Contributor

mergify bot commented Jul 21, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b abs/concurrency_fix upstream/abs/concurrency_fix
git merge upstream/main
git push upstream abs/concurrency_fix

Contributor

@efd6 efd6 left a comment

Have you confirmed that the tests here fail without the fix?

x-pack/filebeat/input/azureblobstorage/input_test.go (outdated, resolved)
x-pack/filebeat/input/azureblobstorage/job.go (outdated, resolved)
x-pack/filebeat/input/azureblobstorage/job.go (outdated, resolved)
@ShourieG
Contributor Author

@efd6 The concurrency error only happens when using multiple workers at scale, with thousands of records. I'm not sure how we can simulate that without an integration test.

@efd6
Contributor

efd6 commented Jul 24, 2023

Would it not be possible to make a mock handler that just responds with arbitrary random object data and then run a test with multiple workers pulling from that mock? There does not need to be any comparison of the values, just an absence of a concurrent map-write throw.
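
A minimal sketch of that idea, assuming a plain HTTP mock rather than the real Azure client. The names `randomBlobHandler`, `fetch`, and `runMock` are hypothetical and do not come from the PR; the point is only that several workers pull random payloads concurrently and the run finishes without a concurrent map-write throw.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
)

// randomBlobHandler is a hypothetical mock endpoint: every request is
// answered with size bytes of random data.
func randomBlobHandler(size int) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		buf := make([]byte, size)
		if _, err := rand.Read(buf); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Write(buf)
	})
}

// fetch downloads one blob and returns its length in bytes.
func fetch(url string) (int, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	return len(body), err
}

// runMock starts the mock server, lets workers goroutines fetch perWorker
// blobs each, and returns how many blobs were recorded. The shared map is
// guarded by a mutex, so no concurrent map write can occur.
func runMock(workers, perWorker, size int) (int, error) {
	srv := httptest.NewServer(randomBlobHandler(size))
	defer srv.Close()

	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		firstErr error
	)
	sizes := make(map[string]int)
	for id := 0; id < workers; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				name := fmt.Sprintf("blob-%d-%d", id, i)
				n, err := fetch(srv.URL + "/" + name)
				if err == nil && n != size {
					err = fmt.Errorf("%s: got %d bytes, want %d", name, n, size)
				}
				mu.Lock()
				if err != nil {
					if firstErr == nil {
						firstErr = err
					}
				} else {
					sizes[name] = n
				}
				mu.Unlock()
			}
		}(id)
	}
	wg.Wait()
	return len(sizes), firstErr
}

func main() {
	n, err := runMock(8, 25, 64)
	if err != nil {
		panic(err)
	}
	fmt.Println("blobs fetched:", n)
}
```

The blob contents are never compared; only the count and the absence of a crash matter, which matches the suggestion above.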

@ShourieG
Contributor Author

Would it not be possible to make a mock handler that just responds with arbitrary random object data and then run a test with multiple workers pulling from that mock? There does not need to be any comparison of the values, just an absence of a concurrent map-write throw.

Will try this and update.

@ShourieG
Contributor Author

@efd6 I've updated the PR with suitable concurrency tests that use randomly generated blobs.

@narph narph requested a review from efd6 July 31, 2023 10:39
Contributor

@efd6 efd6 left a comment

Can you confirm that the new tests fail without the fix?

I have run the tests with the main branch's code in job.go and state.go, and the tests pass unless -race is passed. In the case that -race is passed, only the 5000 workers case fails. This failure is consistent, but appears to be purely a timeout effect caused by the added load the race detector imposes; increasing the timeout on the test to 200s allows it to pass.

x-pack/filebeat/input/azureblobstorage/input_test.go (outdated, resolved)
@ShourieG
Contributor Author

ShourieG commented Aug 1, 2023

Can you confirm that the new tests fail without the fix?

I have run the tests with the main branch's code in job.go and state.go, and the tests pass unless -race is passed. In the case that -race is passed, only the 5000 workers case fails. This failure is consistent, but appears to be purely a timeout effect caused by the added load the race detector imposes; increasing the timeout on the test to 200s allows it to pass.

@efd6
No, the new tests pass even without the fix, mainly because we are using a stateless input here. Locally I also tried a stateful input, similar to the CEL input tests, by passing a custom v2 publisher. The concurrency error seems to happen deep within the cursor op, specifically the updateCursorOp in the v2 cursor code. I'm not sure how we can mock this behaviour, since mocking a complete v2 cursor means mocking the resource and state store packages and the harvester completely.

@efd6
Contributor

efd6 commented Aug 1, 2023

I suggest writing a new beat client implementation (these are small, so it should not be onerous) that mutates the event. This was the cause of the issue, so it should be detectable that way.
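
A rough sketch of what such a client could look like. The `event` and `client` types below are simplified, hypothetical stand-ins, not the libbeat API, and cloning the fields per event is only one assumed way to keep the mutation safe; it is not taken from the PR's diff.

```go
package main

import (
	"fmt"
	"sync"
)

// event and client are hypothetical stand-ins for the real beat types.
type event struct {
	Fields map[string]interface{}
}

type client interface {
	Publish(event)
}

// mutatingClient writes into every event it receives, much as a downstream
// processor might. If two goroutines publish events that share one Fields
// map, these writes race with each other.
type mutatingClient struct {
	mu        sync.Mutex
	published int
}

func (c *mutatingClient) Publish(e event) {
	e.Fields["seen"] = true // unsynchronized write into the event's own map

	c.mu.Lock()
	c.published++
	c.mu.Unlock()
}

// cloneFields returns a shallow copy so that each event owns its map.
func cloneFields(m map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(m)+1)
	for k, v := range m {
		out[k] = v
	}
	return out
}

// publishAll has workers goroutines publish perWorker events each. Every
// event carries a private copy of the shared template, so the client's
// mutation never touches shared state.
func publishAll(c client, workers, perWorker int) {
	template := map[string]interface{}{"container": "demo"}
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				c.Publish(event{Fields: cloneFields(template)})
			}
		}()
	}
	wg.Wait()
}

func main() {
	c := &mutatingClient{}
	publishAll(c, 100, 50)
	fmt.Println("published:", c.published)
}
```

Passing `template` directly instead of `cloneFields(template)` would make every worker write to the same map, which is the kind of data race the race detector reports and which can also abort the program with a concurrent map-write error.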

@ShourieG
Contributor Author

ShourieG commented Aug 2, 2023

@efd6 I've updated the PR with the suggested changes. For me, the 3000-worker case does not time out and passes with the -race flag. I've tested this with and without the fix: without it the data race occurs, and with it the race is resolved.

@ShourieG
Contributor Author

ShourieG commented Aug 4, 2023

@efd6 I've updated the test case code.

@ShourieG ShourieG merged commit cd9ad24 into elastic:main Aug 4, 2023
@ShourieG ShourieG deleted the abs/concurrency_fix branch August 8, 2023 09:37
Scholar-Li pushed a commit to Scholar-Li/beats that referenced this pull request Feb 5, 2024
…ue (elastic#36124)

## Related issues

- Relates elastic#35983


3 participants