[core-amqp][event-hubs] Fix "too much pending tasks" by making acquire lock calls cancellable #14844

chradek · 2021-04-12T19:32:00Z

Fixes Event Hubs "Too much pending tasks" error: #14606

Summary

This PR fixes the "too much pending tasks" error by updating the acquire('lock') operations to support timeouts and cancellation. Previously, we were essentially wrapping our acquire lock calls in timeouts, but had no way to actually clear the pending task. So while the overall operation would fail with a timeout error, the lock was not released, and over time tasks would queue up faster than the lock could be released.

Detailed scenario

1. EventHubConsumerClient.subscribe() called.
2. The EventProcessor._runLoopWithLoadBalancing method is invoked, which starts a loop that continues until the subscription or EventHubConsumerClient is closed.
3. At the start of each loop iteration, we attempt to get the partitionIds.
4. If there isn't already a $management request/response link open on the connection, we wait to acquire a lock to initialize the request/response link:

azure-sdk-for-js/sdk/eventhub/event-hubs/src/managementClient.ts

Lines 412 to 414 in fadd5b9

    return defaultLock.acquire(this.managementLock, () => {
      return this._init();
    });

5. Once a lock is acquired, if the request/response link is still not created, attempt to create it.
6. One of these steps is to negotiate a claim (this is for auth):

azure-sdk-for-js/sdk/eventhub/event-hubs/src/managementClient.ts

Lines 314 to 315 in fadd5b9

    await this._context.readyToOpenLink();
    await this._negotiateClaim();

7. At this point, we'll wait to acquire another lock to create the CBS session (for auth):

azure-sdk-for-js/sdk/eventhub/event-hubs/src/linkEntity.ts

Lines 129 to 131 in fadd5b9

    await defaultLock.acquire(this._context.cbsSession.cbsLock, () => {
      return this._context.cbsSession.init();
    });

Now, we have a timeout around acquiring the managementLock (used in step 4), and if we time out, we will try again until our retries are exhausted. However, our timeout wraps the defaultLock.acquire call, so it doesn't actually release the lock.

Under normal "transient" conditions, we'd expect the negotiateClaim call to time out when trying to create a connection to the service, the managementLock to be released, and, assuming the network conditions have returned to normal, the next attempt to initialize the management request/response link to succeed. However, in the scenario we're testing, we're unable to recreate the connection for an extended period of time. What's interesting is that we're not seeing ECONNREFUSED errors as one might expect. Instead, it appears from the client side that it's attempting to connect to a black hole that never responds.

Another wrinkle is that once the getPartitionIds call eventually throws a timeout error, the EventProcessor loop will sleep for some duration and then move on to the next iteration, starting with calling getPartitionIds again. This leads to more attempts to acquire the management lock, which means that over time we hit the limit on pending tasks.
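To make the failure mode concrete, here is a minimal sketch of why wrapping an acquire call in a timeout does not help. `NaiveLock`, `withTimeout`, and `demo` are hypothetical stand-ins written for this illustration; they are not the async-lock or core-amqp implementations, and the limit of 3 pending tasks is an arbitrary assumption.

```typescript
// Hypothetical queue-based lock with a cap on pending tasks (illustration only).
class NaiveLock {
  private held = false;
  private readonly queue: Array<() => void> = [];

  constructor(private readonly maxPending: number) {}

  get pendingCount(): number {
    return this.queue.length;
  }

  async acquire<T>(task: () => Promise<T>): Promise<T> {
    if (this.held) {
      if (this.queue.length >= this.maxPending) {
        throw new Error("Too much pending tasks");
      }
      // Wait in line. Nothing outside this class can remove the entry.
      await new Promise<void>((resolve) => this.queue.push(() => resolve()));
    } else {
      this.held = true;
    }
    try {
      return await task();
    } finally {
      // Hand the lock to the next waiter, or release it if nobody is waiting.
      const next = this.queue.shift();
      if (next) {
        next();
      } else {
        this.held = false;
      }
    }
  }
}

// The timeout only rejects the wrapper promise; the queued task is untouched.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("OperationTimeoutError")), ms);
    promise.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      }
    );
  });
}

async function demo(): Promise<{ timedOut: number; stillPending: number; lastError: string }> {
  const lock = new NaiveLock(3);
  // The first holder never finishes, like a connection attempt to a black hole.
  void lock.acquire(() => new Promise<void>(() => {}));

  let timedOut = 0;
  let lastError = "";
  // Each loop iteration mimics one retry of getPartitionIds.
  for (let attempt = 0; attempt < 4; attempt++) {
    try {
      await withTimeout(lock.acquire(async () => {}), 10);
    } catch (err) {
      lastError = (err as Error).message;
      if (lastError === "OperationTimeoutError") {
        timedOut++;
      }
    }
  }
  return { timedOut, stillPending: lock.pendingCount, lastError };
}
```

With these assumed numbers, the first three attempts time out but each leaves its task in the queue, so the fourth attempt is rejected with "Too much pending tasks" even though every earlier caller has already given up.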

New lock type

@azure/core-amqp currently uses the AsyncLock from async-lock to implement locking. At first glance, it does support setting a timeout for acquiring a lock. However, there are two issues:

1. The error thrown is a generic Error, so we'd have to inspect the error message and convert it to an OperationTimeoutError so that it could be retryable. Since error messages aren't guaranteed to be stable, this poses some risk.
2. When a pending task reaches its timeout, it isn't actually removed from the queue. Instead, once it acquires the lock, it immediately yields to the next item in the queue. This means we would still see the "too much pending tasks" error even when using the timeout.

AsyncLock also doesn't support cancellation via something like abortSignal, which is something we need to support in all our async operations.

Thus, I've added CancellableAsyncLock to @azure/core-amqp. It supports both cancellation (which throws an AbortError) and timeouts (which throw an OperationTimeoutError). We should actually be able to get rid of AsyncLock now, but since that's technically a breaking change, I currently export both types.
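The idea behind such a lock can be sketched roughly as follows. This is not the CancellableAsyncLock source: `SketchCancellableLock`, `AcquireOptions`, `namedError`, the minimal `AbortSignalLike` shape, and the error messages are all assumptions made for this illustration. The point it shows is that a waiter that times out or is aborted removes its own entry from the queue.

```typescript
// Minimal signal shape assumed for this sketch.
interface AbortSignalLike {
  readonly aborted: boolean;
  addEventListener(type: "abort", listener: () => void): void;
  removeEventListener(type: "abort", listener: () => void): void;
}

interface AcquireOptions {
  abortSignal: AbortSignalLike | undefined;
  timeoutInMs: number | undefined;
}

interface Waiter {
  grant: () => void;
}

function namedError(name: string, message: string): Error {
  const err = new Error(message);
  err.name = name;
  return err;
}

class SketchCancellableLock {
  private held = false;
  private readonly waiters: Waiter[] = [];

  get pendingCount(): number {
    return this.waiters.length;
  }

  async acquire<T>(task: () => Promise<T>, options: AcquireOptions): Promise<T> {
    if (options.abortSignal?.aborted) {
      throw namedError("AbortError", "The operation was aborted.");
    }
    if (this.held) {
      await this.waitForTurn(options);
    } else {
      this.held = true;
    }
    try {
      return await task();
    } finally {
      // Hand the lock to the next waiter, or release it if the queue is empty.
      const next = this.waiters.shift();
      if (next) {
        next.grant();
      } else {
        this.held = false;
      }
    }
  }

  private waitForTurn({ abortSignal, timeoutInMs }: AcquireOptions): Promise<void> {
    return new Promise<void>((resolve, reject) => {
      let timer: ReturnType<typeof setTimeout> | undefined;
      const cleanup = (): void => {
        if (timer !== undefined) clearTimeout(timer);
        abortSignal?.removeEventListener("abort", onAbort);
      };
      const waiter: Waiter = {
        grant: () => {
          cleanup();
          resolve();
        },
      };
      // Giving up removes this waiter from the queue, so it no longer counts as pending.
      const giveUp = (err: Error): void => {
        cleanup();
        const index = this.waiters.indexOf(waiter);
        if (index !== -1) this.waiters.splice(index, 1);
        reject(err);
      };
      const onAbort = (): void => giveUp(namedError("AbortError", "The operation was aborted."));

      this.waiters.push(waiter);
      abortSignal?.addEventListener("abort", onAbort);
      if (timeoutInMs !== undefined) {
        timer = setTimeout(
          () => giveUp(namedError("OperationTimeoutError", "Timed out waiting for the lock.")),
          timeoutInMs
        );
      }
    });
  }
}

async function lockDemo(): Promise<{ errorName: string; pendingAfterTimeout: number }> {
  const lock = new SketchCancellableLock();
  // The first holder never finishes.
  void lock.acquire(() => new Promise<void>(() => {}), { abortSignal: undefined, timeoutInMs: undefined });
  let errorName = "";
  try {
    await lock.acquire(async () => {}, { abortSignal: undefined, timeoutInMs: 10 });
  } catch (err) {
    errorName = (err as Error).name;
  }
  return { errorName, pendingAfterTimeout: lock.pendingCount };
}
```

In this sketch the second caller fails with an error named OperationTimeoutError and the pending count drops back to zero, so repeated retries cannot pile up. Note that it only cancels the wait for the lock; a task that is already running is not interrupted.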

chradek · 2021-04-16T17:06:52Z

Looks like core-rest is failing and will fail until #14899 is merged.

richardpark-msft

If you disagree, let's chat, but I think we should be very aggressive about making the abortSignal and timeout values non-optional. They can be undefined, but I want it to be a compile-time error if they aren't passed, since (in our stacks) that's probably a mistake.
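As a rough illustration of the difference being discussed, the sketch below contrasts optional properties with required properties that may be undefined. The interface and function names (`LooseAcquireOptions`, `StrictAcquireProperties`, `summarize`) and the minimal `AbortSignalLike` shape are hypothetical and chosen only for this example.

```typescript
// Minimal signal shape assumed for this sketch.
interface AbortSignalLike {
  readonly aborted: boolean;
}

// Optional properties: a caller can forget them and the compiler stays silent.
interface LooseAcquireOptions {
  abortSignal?: AbortSignalLike;
  timeoutInMs?: number;
}

// Required properties that allow undefined: the caller must state the choice explicitly.
interface StrictAcquireProperties {
  abortSignal: AbortSignalLike | undefined;
  timeoutInMs: number | undefined;
}

function summarize(props: StrictAcquireProperties): string {
  return `timeout=${props.timeoutInMs ?? "none"}, cancellable=${props.abortSignal !== undefined}`;
}

// Compiles: nothing forces the caller to think about cancellation or timeouts.
const loose: LooseAcquireOptions = {};

// Compiles: both fields are present, even though one is explicitly undefined.
const strict: StrictAcquireProperties = { abortSignal: undefined, timeoutInMs: 60000 };

// Would not compile, because abortSignal is missing:
// const broken: StrictAcquireProperties = { timeoutInMs: 60000 };

const summary = summarize(strict);
```

The design trade-off is a slightly noisier call site in exchange for the compiler flagging every place where a timeout or abortSignal was not threaded through.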

richardpark-msft · 2021-04-16T17:49:32Z

sdk/core/core-amqp/review/core-amqp.api.md

@@ -351,6 +362,9 @@ export function createSasTokenProvider(data: {
    sharedAccessSignature: string;
 } | NamedKeyCredential | SASCredential): SasTokenProvider;

+// @public
+export const defaultCancellableLock: CancellableAsyncLock;


I think this is one case where we should just leave the instantiation and usage to the calling package, rather than declaring a default global variable.

I know it's made to work cleanly, etc... but is there a specific reason we'd need to do this?

I believe it's just a matter of having 1 of these in core-amqp, or 1 in core-amqp, service-bus, and event-hubs. I don't have a strong opinion either way since core-amqp is kind of a weird 'public but not really but it is' package.

Aren't we replacing the use of defaultLock with this defaultCancellableLock? If so, I would prefer to maintain the same pattern for now. We can always investigate having the caller instantiate it in a separate task.

richardpark-msft · 2021-04-16T18:00:05Z

sdk/eventhub/event-hubs/src/eventHubReceiver.ts

+      {
+        connectionId: {
+          enumerable: true,
+          get: () => {


This is very cool and makes me think we should always make these properties dynamic (ie, Service Bus too).

In a separate PR I was thinking it'd be nice to just pass in a logger (and maybe it's time to make it so the logger properly prints out the header).

I don't follow how the change here helps us... Was there something in the PR description I missed?

Ah sorry, I didn't mention this in the PR. Basically, before this change, sometimes in our logs we'd see the wrong connection-id when referring to a receiver because we copied the value of the id. With this change, we're always referencing the id from our context object. Now our logs are accurate.

I noticed this while trying to troubleshoot the too many pending tasks issue.
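The difference between copying the id and reading it through a getter can be sketched like this. The `ConnectionContext` shape, the helper names, and the "connection-1"/"connection-2" values are made up for the illustration; only the `enumerable`/`get` property descriptor mirrors the diff above.

```typescript
interface ConnectionContext {
  connectionId: string;
}

// Copies the value once; later changes to the context are not reflected.
function createCopiedLogInfo(context: ConnectionContext): { connectionId: string } {
  return { connectionId: context.connectionId };
}

// Defines a getter, so every read goes back to the context object.
function createLiveLogInfo(context: ConnectionContext): { readonly connectionId: string } {
  const info = {} as { readonly connectionId: string };
  Object.defineProperties(info, {
    connectionId: {
      enumerable: true,
      get: () => context.connectionId,
    },
  });
  return info;
}

const context: ConnectionContext = { connectionId: "connection-1" };
const copied = createCopiedLogInfo(context);
const live = createLiveLogInfo(context);

// Simulate the context switching to a new connection, for example after a reconnect.
context.connectionId = "connection-2";

const copiedId = copied.connectionId; // still the old id
const liveId = live.connectionId; // follows the context
```

After the context changes, `copiedId` is still "connection-1" while `liveId` is "connection-2", which is the stale-log symptom described above and its fix.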

Wowza, should we follow this format everywhere else we log the connection id? Or only certain scenarios need this? If yes for either question, can you log an issue to investigate all other places where we may have to change?

We should look into following this method anywhere we would copy the connectionId value today. Issue created:
#14923

richardpark-msft · 2021-04-16T21:55:18Z

(approved after discussion with @chradek and he's handling all the contentious discussions)

ramya-rao-a · 2021-04-16T22:24:17Z

sdk/eventhub/event-hubs/src/managementClient.ts

@@ -407,16 +407,13 @@ export class ManagementClient extends LinkEntity {
          const initOperationStartTime = Date.now();

          try {
-            await waitForTimeoutOrAbortOrResolve({


Nit: Now that we are not using waitForTimeoutOrAbortOrResolve, you can delete the timeoutAbortSignalUtils.ts file. Or do that in a separate PR.

chradek · 2021-04-19T23:05:11Z

/azp run js - event-hubs - tests

azure-pipelines · 2021-04-19T23:05:22Z

Azure Pipelines successfully started running 1 pipeline(s).

…e lock calls cancellable (Azure#14844)

* [core-amqp] adds defaultCancellableLock
* [core-amqp] make cbs acquireLock call cancellable
* [core-amqp] update CancellableAsyncLock to throw OperationTimeoutError if acquireTimeoutInMs is reached
* [core-amqp] fix eslint errors
* [event-hubs] add timeouts to acquire calls
* pass abortSignal to init/negotiateClaim methods
* update pnpm-lock.yaml
* [core-amqp] make flaky tests not flaky
* [core-amqp] make fields required
* [core-amqp] AcquireOptions -> AcquireLockProperties, add defaultCancellableLock back, remove unneeded code from allSettled helper util
* [core-amqp] parameter rename cleanup
* [event-hubs] add timeout to link initialization calls
* update pnpm-lock.yaml
* [event-hubs] improve timeout to cbsSession.negotiateClaimLock
* [core-amqp] add isOpen() to CbsClient
* [event-hubs] remove unneeded AbortError branch from event hub receiver
* [core-amqp] fix flaky test in node 15
* [event-hubs] use cbs isOpen()
* [core-amqp] add timeout to CbsClient init and negotiateClaim methods
* [event-hubs] pass timeout through to CbsClient init and negotiateClaim methods

ghost added the Azure.Core label Apr 12, 2021

chradek added 6 commits April 15, 2021 19:04

[core-amqp] adds defaultCancellableLock

0702bd6

[core-amqp] make cbs acquireLock call cancellable

43ac04d

[core-amqp] update CancellableAsyncLock to throw OperationTimeoutError if acquireTimeoutInMs is reached

61d6231

[core-amqp] fix eslint errors

3047567

[event-hubs] add timeouts to acquire calls

a46c318

pass abortSignal to init/negotiateClaim methods

17a2b84

chradek force-pushed the eh-fix-management-pending branch from 8e1055e to 17a2b84 Compare April 16, 2021 02:04

chradek added 3 commits April 15, 2021 19:17

update pnpm-lock.yaml

961880c

[core-amqp] make flaky tests not flaky

f059122

Merge remote-tracking branch 'upstream/master' into eh-fix-management-pending

3a27f38

chradek changed the title ~~[core-amqp] adds defaultCancellableLock~~ [core-amqp][event-hubs] Fix "too much pending tasks" by making acquire lock calls cancellable Apr 16, 2021

chradek marked this pull request as ready for review April 16, 2021 17:05

chradek requested review from bterlson, HarshaNalluru, ramya-rao-a, richardpark-msft and xirzec as code owners April 16, 2021 17:05

richardpark-msft requested changes Apr 16, 2021

richardpark-msft approved these changes Apr 16, 2021

[core-amqp] make fields required

f656209

ramya-rao-a reviewed Apr 16, 2021

ramya-rao-a reviewed Apr 16, 2021

chradek mentioned this pull request Apr 19, 2021

[event-hubs][service-bus][core-amqp] fix incorrect logging #14923

Closed

chradek added 3 commits April 19, 2021 10:00

[core-amqp] AcquireOptions -> AcquireLockProperties, add defaultCancellableLock back, remove unneeded code from allSettled helper util

e9a387b

[core-amqp] parameter rename cleanup

f6c4b1a

[event-hubs] add timeout to link initialization calls

260143c

chradek added 2 commits April 19, 2021 13:08

Merge remote-tracking branch 'upstream/master' into eh-fix-management-pending

ad272ed

update pnpm-lock.yaml

e70f463

ramya-rao-a reviewed Apr 19, 2021

chradek added 7 commits April 19, 2021 14:09

[event-hubs] improve timeout to cbsSession.negotiateClaimLock

6af37eb

[core-amqp] add isOpen() to CbsClient

3952638

[event-hubs] remove unneeded AbortError branch from event hub receiver

329e2d2

[core-amqp] fix flaky test in node 15

9c25659

[event-hubs] use cbs isOpen()

1fdaee7

[core-amqp] add timeout to CbsClient init and negotiateClaim methods

4a4511e

[event-hubs] pass timeout through to CbsClient init and negotiateClaim methods

83b72f0

ramya-rao-a approved these changes Apr 19, 2021

chradek merged commit 4027106 into Azure:master Apr 19, 2021

ramya-rao-a mentioned this pull request Apr 26, 2021

[event-hubs] fixes sendBatch race condition causing TypeError to be thrown #15021

Merged

This was referenced Apr 29, 2021

[event-hubs] add tests for cancellation #15094

Merged

Receiving "Too much pending tasks" error when we use the event-hubs library to process messages #14606

Closed

On Consumer - Error: Too much pending tasks #14704

Closed

chradek mentioned this pull request May 17, 2021

[service-bus] add cancellation to init and use defaultCancellableLock #15311

Closed
