
Fix hang in bulk helper semaphore when server responses are slower than flushInterval #2027

Merged
merged 5 commits into from
Feb 6, 2024

Conversation

@JoshMock (Member) commented Oct 3, 2023

A possible fix for #1562.

The failing state in the above issue is reached when a server's response times are slower than flushInterval. Here is what happens in this situation, in order:

  1. The flushBytes block awaits semaphore()
  2. While awaiting, onFlushTimeout is triggered
  3. onFlushTimeout "steals" the bulk body rows that the flushBytes block is waiting to send by emptying bulkBody before awaiting a semaphore
  4. When the flushBytes block's semaphore resolves, it tries to send(bulkBody) but bulkBody is empty
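The four steps above can be modeled with a small script. This is a simplified sketch, not the helper's actual source: `bulkBody`, `semaphore`, and `send` here are stand-ins for the real internals, and the timing is contrived so the timeout always wins the race.

```javascript
// Simplified model of the race described above, not the helper's source.
async function demo () {
  const bulkBody = ['row1', 'row2']
  const sent = []

  // simulate a slow server holding the only semaphore slot for one tick
  const semaphore = () => new Promise(resolve => setTimeout(resolve, 0))

  // assume the helper only sends non-empty bodies
  const send = (body) => { if (body.length > 0) sent.push(body.slice()) }

  // flushBytes path (step 1): awaits the semaphore, then sends a
  // *reference* to bulkBody
  const flushBytes = (async () => {
    await semaphore()
    send(bulkBody) // step 4: bulkBody is empty by now, so nothing is sent
  })()

  // onFlushTimeout fires while flushBytes is still awaiting (steps 2-3):
  // it "steals" the rows by copying and emptying bulkBody first
  const stolen = bulkBody.slice()
  bulkBody.length = 0
  send(stolen)

  await flushBytes
  return sent // only the timeout's copy was ever sent
}
```

Running this, only the timeout path sends anything; the flushBytes path wakes up to an empty array, which is the state the issue describes.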

Having flushBytes set its rows aside before awaiting the semaphore, as onFlushTimeout already does, appears to resolve the situation where a Promise sits unresolved indefinitely.

@delvedor I'm particularly interested in your thoughts on this solution, since you wrote the helper and chose the semaphore strategy.

Before it was waiting until after semaphore resolved, then sending with
a reference to bulkBody. If flushInterval is reached after `await
semaphore()` but before `send(bulkBody)`, onFlushTimeout is "stealing"
bulkBody so that there is nothing left in bulkBody for the flushBytes
block to send, causing an indefinite hang for a promise that does not
resolve.
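The fix described in this commit message can be sketched as follows. This is an assumed shape, not the helper's actual source; `state`, `semaphore`, and `send` are hypothetical stand-ins for the helper's internal state.

```javascript
// Sketch of the fixed flushBytes flow (assumed shape). The rows are set
// aside and bulkBody is emptied *before* awaiting the semaphore, so a
// concurrent onFlushTimeout finds nothing left to steal.
async function flushBytesFixed (state) {
  const toSend = state.bulkBody.slice() // copy the rows first
  state.bulkBody.length = 0             // ...then empty the shared array

  await state.semaphore()
  return state.send(toSend)             // toSend cannot have been emptied
}
```

Because the copy happens synchronously before the `await`, no other code can run between the copy and the emptying, so the two flush paths can no longer race over the same rows.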
@JoshMock (Member, Author) commented Oct 3, 2023

I've published this to npm as @elastic/elasticsearch-canary@8.10.1-canary.1 so that it can be tested by a few affected people.

pquentin previously approved these changes Oct 4, 2023

@pquentin (Member) left a comment

Your explanation makes sense, but I'm unable to map it back to the code, sorry.

@JoshMock (Member, Author) commented Oct 4, 2023

@pquentin Thanks. 🖤 Mostly tagged dynamic clients team members for visibility because this bug came up in conversation in our Monday team meeting.

It still bends my mind a bit, too, and I can't find a clear explanation for why an empty bulkBody causes the promise to not resolve. But that's a relatively small gap to jump, especially when the fix does solve the problem in every scenario I've been able to recreate. 😆

github-actions bot (Contributor) commented Jan 3, 2024

This pull request is stale because it has been open 90 days with no activity. Remove the stale label, or leave a comment, or this will be closed in 14 days.

@github-actions github-actions bot added the stale label Jan 3, 2024
@stale stale bot removed the stale label Jan 16, 2024
ezimuel previously approved these changes Feb 5, 2024

@ezimuel (Contributor) left a comment

The proposed solution of storing bulkBody before the await semaphore() makes sense. I'm wondering if the issue came about because of the slice() usage, since it creates a shallow copy of the array.
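For reference, slice() with no arguments does return a new array, so emptying the original afterwards does not affect the copy; only the row objects themselves are shared between the two arrays. A minimal illustration (the row shapes here are just examples):

```javascript
// slice() makes a shallow copy: a new array whose elements are the same
// object references as the original's.
const original = [{ index: { _id: 1 } }, { index: { _id: 2 } }]
const copy = original.slice()

original.length = 0 // empty the original, as onFlushTimeout does

// copy is unaffected by the emptying, but still shares the element objects
```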

@pquentin pquentin marked this pull request as ready for review February 6, 2024 05:39
@pquentin pquentin dismissed stale reviews from ezimuel and themself via bd72631 February 6, 2024 05:39
@pquentin pquentin merged commit 1607a0d into main Feb 6, 2024
11 checks passed
@pquentin pquentin deleted the semaphore-hang branch February 6, 2024 05:58
github-actions bot pushed a commit that referenced this pull request Feb 6, 2024
…an flushInterval (#2027)

* Set version to 8.10.1

* Add tests for bulk helper with various flush and server timeouts

* Copy and empty bulkBody when flushBytes is reached

Before it was waiting until after semaphore resolved, then sending with
a reference to bulkBody. If flushInterval is reached after `await
semaphore()` but before `send(bulkBody)`, onFlushTimeout is "stealing"
bulkBody so that there is nothing left in bulkBody for the flushBytes
block to send, causing an indefinite hang for a promise that does not
resolve.

* comment typo fixes

---------

Co-authored-by: Quentin Pradet <quentin.pradet@elastic.co>
(cherry picked from commit 1607a0d)
pquentin pushed a commit that referenced this pull request Feb 6, 2024
…an flushInterval (#2027) (#2129)

* Set version to 8.10.1

* Add tests for bulk helper with various flush and server timeouts

* Copy and empty bulkBody when flushBytes is reached

Before it was waiting until after semaphore resolved, then sending with
a reference to bulkBody. If flushInterval is reached after `await
semaphore()` but before `send(bulkBody)`, onFlushTimeout is "stealing"
bulkBody so that there is nothing left in bulkBody for the flushBytes
block to send, causing an indefinite hang for a promise that does not
resolve.

* comment typo fixes

---------

Co-authored-by: Quentin Pradet <quentin.pradet@elastic.co>
(cherry picked from commit 1607a0d)

Co-authored-by: Josh Mock <joshua.mock@elastic.co>
JoshMock added a commit to elastic/elasticsearch-serverless-js that referenced this pull request Apr 3, 2024
Applies patches from elastic/elasticsearch-js#2199 and elastic/elasticsearch-js#2027,
adding support for an onSuccess callback and fixing a bug that would cause the helper to
hang when the flushInterval was lower than the request timeout.

---------

Co-authored-by: JoshMock <160161+JoshMock@users.noreply.github.com>