Fix for EH receiver timeout while opening #21324

JamesBirdsall · 2021-05-12T00:48:23Z

The CBSChannel is a shared resource on the MessagingFactory, used by all senders/receivers opened on the same EventHubClient. If a sender/receiver tries to send a token while the existing session to the $cbs node is closing, a race can occur that leaves the MessagingFactory in a bad state. New senders and receivers will then time out while trying to open because the auth step stalls:

1. The existing CBS session is closing.
2. CBSChannel.sendToken is called, which calls CBSChannel.innerSendToken, which calls FaultTolerantObject.runOnOpenedObject.
3. FaultTolerantObject.runOnOpenedObject detects that the RequestResponseChannel it is wrapping is in state CLOSING, so it decides to create a new one and sets this.creatingNewInnerObject to true, then calls RequestResponseOpener.run, which sees that RequestResponseOpener.isOpened is true and short-circuits, doing nothing at all.
4. Later, RequestResponseOpener gets the callback from RequestResponseChannel and sets RequestResponseOpener.isOpened to false.
5. Still later, the user tries to create a new receiver on the same EventHubClient.
6. MessageReceiver tries to create a receive link:
   a. The first step is to send a token via CBSChannel.sendToken, which chains down to FaultTolerantObject.runOnOpenedObject.
   b. FaultTolerantObject.runOnOpenedObject sees that this.creatingNewInnerObject is true (left over from step 3) and just queues the action, assuming it will be handled when the channel is finally opened, but nobody is opening the channel…
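The sequence above can be reproduced in a small single-threaded model. This is only a sketch under stated assumptions: ModelOpener, ModelFaultTolerantObject, ChannelState, and StalledOpenDemo are hypothetical stand-ins invented for illustration, not the real azure-eventhubs classes, and the sketch assumes the first action is queued as well as the second. It shows how a stale isOpened flag plus a leftover creatingNewInnerObject flag leaves queued actions with nobody opening the channel.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, simplified model of the race; NOT the real SDK classes.
enum ChannelState { OPENED, CLOSING, CLOSED }

class ModelOpener {
    boolean isOpened = true;  // stale until the close callback runs
    int opensStarted = 0;     // counts how many real opens were kicked off

    void run() {
        if (isOpened) {
            return;           // short-circuit: believes the channel is still open
        }
        opensStarted++;
    }
}

class ModelFaultTolerantObject {
    final ModelOpener opener;
    ChannelState innerState;
    boolean creatingNewInnerObject = false;
    final Queue<Runnable> pending = new ArrayDeque<>();

    ModelFaultTolerantObject(ModelOpener opener, ChannelState innerState) {
        this.opener = opener;
        this.innerState = innerState;
    }

    void runOnOpenedObject(Runnable action) {
        if (innerState == ChannelState.OPENED) {
            action.run();
            return;
        }
        pending.add(action);                 // wait for a new inner object
        if (!creatingNewInnerObject) {
            creatingNewInnerObject = true;   // never reset in this model
            opener.run();                    // may silently do nothing
        }
    }
}

class StalledOpenDemo {
    static ModelFaultTolerantObject simulate() {
        ModelOpener opener = new ModelOpener();
        ModelFaultTolerantObject fto =
            new ModelFaultTolerantObject(opener, ChannelState.CLOSING);

        fto.runOnOpenedObject(() -> { });    // steps 2-3: opener short-circuits
        opener.isOpened = false;             // step 4: close callback arrives late
        fto.innerState = ChannelState.CLOSED;
        fto.runOnOpenedObject(() -> { });    // steps 5-6: action is only queued
        return fto;
    }

    public static void main(String[] args) {
        ModelFaultTolerantObject fto = simulate();
        // Prints: pending=2 opensStarted=0
        System.out.println("pending=" + fto.pending.size()
            + " opensStarted=" + fto.opener.opensStarted);
    }
}
```

In the model, both actions remain queued and no open is ever started, which corresponds to the receiver timing out during the auth step.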

This resembles a previous race condition, which was caused by tracking the same state in two different places that fell out of sync, but it is not the same. In this case, RequestResponseOpener.isOpened tracks more than just the state of the inner RequestResponseChannel, so we don't want to switch to using only the state of the RequestResponseChannel. The proposed fix is for RequestResponseOpener.run to also check the state of the inner RequestResponseChannel: if the state is mixed (isOpened is still true but the RequestResponseChannel is CLOSING or CLOSED), it uses a continuation to replay the call to run() once the close callback for the existing channel has finished cleanup and set isOpened back to false.
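A minimal sketch of that idea follows, assuming a single-threaded model in which an open completes immediately. SketchOpener, InnerState, onCloseComplete, and ReplayOpenDemo are hypothetical names used only for illustration; the actual change in this PR may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "replay run() after the close callback"; NOT the real SDK code.
enum InnerState { OPENED, CLOSING, CLOSED }

class SketchOpener {
    boolean isOpened = true;                     // not yet cleared by the close callback
    InnerState innerState = InnerState.CLOSING;  // inner channel is already going away
    int opensStarted = 0;
    private final List<Runnable> afterClose = new ArrayList<>();

    void run() {
        if (isOpened) {
            if (innerState == InnerState.OPENED) {
                return;                          // genuinely open, nothing to do
            }
            // Mixed state: isOpened is stale. Defer instead of dropping the request.
            afterClose.add(this::run);
            return;
        }
        opensStarted++;
        // Assumption for the model: the open succeeds synchronously.
        innerState = InnerState.OPENED;
        isOpened = true;
    }

    // Stand-in for the close callback of the existing channel.
    void onCloseComplete() {
        innerState = InnerState.CLOSED;
        isOpened = false;                        // cleanup finished
        List<Runnable> replay = new ArrayList<>(afterClose);
        afterClose.clear();
        replay.forEach(Runnable::run);           // replay the deferred run() calls
    }
}

class ReplayOpenDemo {
    public static void main(String[] args) {
        SketchOpener opener = new SketchOpener();
        opener.run();
        // Prints: before close: opensStarted=0
        System.out.println("before close: opensStarted=" + opener.opensStarted);
        opener.onCloseComplete();
        // Prints: after close: opensStarted=1
        System.out.println("after close: opensStarted=" + opener.opensStarted);
    }
}
```

The request made during the mixed state is not lost: it is replayed once isOpened has been reset, so exactly one new open is started.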

check-enforcer · 2021-06-08T22:14:03Z

This pull request is protected by Check Enforcer.

What is Check Enforcer?

Check Enforcer helps ensure all pull requests are covered by at least one check-run (typically an Azure Pipeline). When all check-runs associated with this pull request pass then Check Enforcer itself will pass.

Why am I getting this message?

You are getting this message because Check Enforcer did not detect any check-runs being associated with this pull request within five minutes. This may indicate that your pull request is not covered by any pipelines and so Check Enforcer is correctly blocking the pull request being merged.

What should I do now?

If the check-enforcer check-run is not passing and all other check-runs associated with this PR are passing (excluding license-cla) then you could try telling Check Enforcer to evaluate your pull request again. You can do this by adding a comment to this pull request as follows:
/check-enforcer evaluate
Typically evaluation only takes a few seconds. If you know that your pull request is not covered by a pipeline and this is expected, you can override Check Enforcer using the following command:
/check-enforcer override
Note that using the override command triggers alerts so that follow-up investigations can occur (PRs still need to be approved as normal).

What if I am onboarding a new service?

Often, new services do not have validation pipelines associated with them. To bootstrap pipelines for a new service, you can issue the following command as a pull request comment:
/azp run prepare-pipelines
This will run a pipeline that analyzes the source tree and creates the pipelines necessary to build and validate your pull request. Once the pipeline has been created you can trigger the pipeline using the following comment:
/azp run java - [service] - ci

JamesBirdsall added 2 commits May 11, 2021 17:08

Fix race on reopening CBSChannel

6bd7dab

Avoid more races

f393c42

JamesBirdsall requested review from conniey, mssfang, srnagar and YijunXieMS as code owners May 12, 2021 00:48

ghost added the Event Hubs label May 12, 2021

JamesBirdsall self-assigned this May 12, 2021

JamesBirdsall requested a review from sjkwak May 12, 2021 00:48

JamesBirdsall added 8 commits May 28, 2021 10:54

Merge branch 'master' of github.com:azure/azure-sdk-for-java into timeout300

7fc1efa

Merge branch 'master' of github.com:azure/azure-sdk-for-java into timeout300

982cf9d

Fix maximumSilentTime validation issue hitting Spark

145bc97

Roll track1 vers to 3.3.0-beta.1

f23c5fd

Pom typo

72f218a

Revert version changes to avoid polluting fix PR

743ef65

Merge branch 'master' of github.com:azure/azure-sdk-for-java into timeout300

9275a17

Merge branch 'master' of github.com:azure/azure-sdk-for-java into timeout300

5214c2e

JamesBirdsall added 9 commits June 8, 2021 16:04

Improve and simplify race condition fix

08c91fd

More logging

ede1025

Merge branch 'master' of github.com:azure/azure-sdk-for-java into timeout300

adc0bc4

Merge branch 'master' of github.com:azure/azure-sdk-for-java into timeout300

ba41489

Improve tracing to track down race issue

f392207

Throw away RequestResponseChannel on open error

03a950a

Merge branch 'main' of github.com:azure/azure-sdk-for-java into timeout300

e2293e7

Remove extra debugging logs

69d535a

Remove unneeded import

bbe31f5

nyaghma approved these changes Jun 22, 2021

sjkwak approved these changes Jun 22, 2021

JamesBirdsall merged commit 7335011 into Azure:main Jun 22, 2021

JamesBirdsall deleted the timeout300 branch June 22, 2021 23:12
