
Receiving "Too much pending tasks" error when we use the event-hubs library to process messages #14606

Closed
2 of 6 tasks
hilfor opened this issue Mar 30, 2021 · 11 comments
Labels
  • Client: This issue points to a problem in the data-plane of the library.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • Event Hubs
  • needs-team-attention: Workflow: This issue needs attention from Azure service team or SDK team.
  • question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that.

Comments


hilfor commented Mar 30, 2021

  • Package Name: @azure/event-hubs
  • Package Version: 5.4.0
  • Operating system: Linux container
  • nodejs
    • version: 14.16.0
  • browser
    • name/version:
  • typescript
    • version: 3.9.9
  • Is the bug related to documentation in

Describe the bug
After a few days of processing events, the following error starts showing up: Too much pending tasks.
These errors are printed in the processError consumer callback.

Once this error starts showing up, the consumer can no longer receive any new events.

Before posting this issue, I checked other related issues like #5944 and #7674, but those issues are about sending requests, while in our case the container only receives and processes events.

The issue resolves itself if we restart the container.

Stack trace

Error: Too much pending tasks
    at AsyncLock.acquire (/app/node_modules/async-lock/lib/index.js:160:15)
    at ManagementClient.<anonymous> (/app/node_modules/@azure/event-hubs/dist/index.js:687:56)
    at Generator.next (<anonymous>)
    at /app/node_modules/tslib/tslib.js:114:75
    at new Promise (<anonymous>)
    at Object.__awaiter (/app/node_modules/tslib/tslib.js:110:16)
    at /app/node_modules/@azure/event-hubs/dist/index.js:656:91
    at new Promise (<anonymous>)
    at Object.sendOperationPromise [as operation] (/app/node_modules/@azure/event-hubs/dist/index.js:656:52)
    at Object.<anonymous> (/app/node_modules/@azure/core-amqp/dist/index.js:1227:39)
    at Generator.next (<anonymous>)
    at /app/node_modules/tslib/tslib.js:114:75
    at new Promise (<anonymous>)
    at Object.__awaiter (/app/node_modules/tslib/tslib.js:110:16)
    at Object.retry (/app/node_modules/@azure/core-amqp/dist/index.js:1202:18)
    at ManagementClient.<anonymous> (/app/node_modules/@azure/event-hubs/dist/index.js:732:40)
    at Generator.next (<anonymous>)
    at /app/node_modules/tslib/tslib.js:114:75
    at new Promise (<anonymous>)
    at Object.__awaiter (/app/node_modules/tslib/tslib.js:110:16)
    at ManagementClient._makeManagementRequest (/app/node_modules/@azure/event-hubs/dist/index.js:652:22)
    at ManagementClient.<anonymous> (/app/node_modules/@azure/event-hubs/dist/index.js:497:41)

To Reproduce
This container is running in Azure, so maybe a connection was lost between the container and the EventHub instance, but I can't be sure.


chradek commented Mar 30, 2021

@hilfor
Thanks for logging this issue.

This is an interesting error you're seeing. We use a 'lock' when creating connections/links to prevent edge cases that would result in more connections to the service being created than needed. We configure the lock so there can be up to 10,000 pending tasks. The error you're seeing indicates that there are at least 10,000 pending tasks waiting on a lock.
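For illustration, here is a minimal sketch of how the underlying async-lock package produces this error; the key name and the function wrapper are illustrative, not the SDK's actual code:

const AsyncLock = require("async-lock");

// The lock is described above as allowing up to 10,000 pending tasks.
const lock = new AsyncLock({ maxPending: 10000 });

function initManagementLink(init) {
  // Every caller queues on the same key. If the task that currently holds the
  // lock never completes (for example, a connection attempt that hangs), the
  // queue only grows, and once it exceeds maxPending, acquire() rejects with
  // "Too much pending tasks".
  return lock.acquire("managementLock", init);
}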

This leads me to believe that either locks aren't being released properly in all cases, or there are a lot of operations happening in parallel.

Can you provide a sample of how you're receiving messages? I can use that to see if anything immediately sticks out. Also, how many partitions are you trying to read from? (I wouldn't expect a problem to be due to the number of partitions unless maybe you had close to 10,000 partitions!)

Do you see your processError handlers firing during your run at all? I'd be interested in what sort of errors you see there. I'm really curious if we're seeing the client disconnect over time.

If possible, would you be able to turn on logging by setting the DEBUG environment variable to azure:*? This may result in a lot of logs if it takes days for this issue to manifest. It'd be helpful if we could get the tail end of the logs near where the issue happens if possible.
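For reference, a small sketch of enabling verbose client logging programmatically through @azure/logger (which @azure/event-hubs uses); this is an alternative to the DEBUG environment variable, though the exact log coverage may differ:

const { setLogLevel } = require("@azure/logger");

// Roughly comparable to setting DEBUG=azure:* before starting the process.
setLogLevel("verbose");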

I can run some simple test cases to verify that we don't have code paths that would prevent locks from being released. Any of the above info you can provide would be greatly appreciated!


hilfor commented Apr 4, 2021

Hi @chradek , sorry for the late response.

We are creating an EventHub listener as follows:

// Imports added for completeness:
const { ContainerClient } = require("@azure/storage-blob");
const { BlobCheckpointStore } = require("@azure/eventhubs-checkpointstore-blob");
const { EventHubConsumerClient } = require("@azure/event-hubs");

const containerClient = new ContainerClient(storageConnectionString, storageContainerName);
const checkpointStore = new BlobCheckpointStore(containerClient);
const consumerClient = new EventHubConsumerClient(eventHubConsumerGroup, eventHubConnectionString, eventHubName, checkpointStore);

consumerClient.subscribe(
  {
    processEvents: async (events, context) => {},
    processError: async (err, context) => {} // --> the errors are printed in here
  },
  {
    maxBatchSize: 20,
    maxWaitTimeInSeconds: 20
  }
)

The number of EventHub partitions is 8 :) so it is nowhere near 10,000.

I will check if we can turn on logging with DEBUG, but it might take some time for the issue to manifest :|


hilfor commented Apr 4, 2021

@chradek , I've also noticed that just before these errors start showing up, we can see the following error message:

Error : MessagingError: The request with message_id "5550f132-8297-4827-b6e3-5472457aa17c" timed out. Please try again later.


chradek commented Apr 5, 2021

@hilfor
Can you tell approximately how long you see MessagingError before you start seeing Too much pending tasks?

I believe I'm starting to see the "pending tasks" on management operations (e.g. getPartitionIds) increase very slowly over time when I prevent the service from being able to respond. This increase in pending tasks is surprising, but doesn't appear to be happening if the connection to the service is healthy.

Any logs you have around when you start seeing MessagingError indicating time outs might help us determine what's going on. If you can keep track of the timestamps where you were able to receive messages and when you stopped receiving messages, that can be helpful to take to the service team for investigation as well.
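For illustration, a hedged sketch of handlers that record timestamps for this purpose; consumerClient and the handler shape are assumed from the subscribe() sample shared earlier in this thread:

consumerClient.subscribe({
  processEvents: async (events, context) => {
    if (events.length > 0) {
      // Record when events were last received from each partition.
      console.log(new Date().toISOString(), `partition ${context.partitionId}: received ${events.length} events`);
    }
  },
  processError: async (err, context) => {
    // Record when MessagingError / "Too much pending tasks" first appear.
    console.error(new Date().toISOString(), `partition ${context.partitionId}:`, err.name, err.message);
  }
});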


chradek commented Apr 6, 2021

Just wanting to dump my findings thus far here.

So as I mentioned earlier, I am able to see the "Too much pending tasks" error occur when I prevent the service from responding to the client. Note that the client still attempts to connect to the service, but creating the connection times out.

Here's what I'm seeing happen:

  1. EventHubConsumerClient.subscribe() called.
  2. The EventProcessor._runLoopWithLoadBalancing method is invoked, which starts a loop that continues until the subscription or EventHubConsumerClient is closed.
  3. At the start of each loop iteration, we attempt to get the partitionIds.
  4. If there isn't already a $management request/response link open on the connection, we wait to acquire a lock to initialize the request/response link:
    return defaultLock.acquire(this.managementLock, () => {
      return this._init();
    });
  5. Once a lock is acquired, if the request/response link is still not created, attempt to create it.
  6. One of these steps is to negotiate a claim (this is for auth):
    await this._context.readyToOpenLink();
    await this._negotiateClaim();
  7. At this point, we'll wait to acquire another lock to create the CBS session (for auth):
    await defaultLock.acquire(this._context.cbsSession.cbsLock, () => {
      return this._context.cbsSession.init();
    });

Now, we have a timeout around acquiring the managementLock (used in step 4), and if we time out, we will try again until our retries are exhausted. However, our timeout wraps the defaultLock.acquire call, so it doesn't actually release the lock.
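For illustration, a minimal sketch of that pattern with assumed helper and variable names (not the SDK's actual source):

async function getManagementLink(defaultLock, init) {
  // The acquire() call joins the lock's pending queue immediately...
  const linkReady = defaultLock.acquire("managementLock", init);
  const timeout = new Promise((_, reject) =>
    setTimeout(() => reject(new Error("OperationTimeoutError")), 60000)
  );
  // ...but if the timeout wins this race, only the caller gives up; the queued
  // task stays pending inside async-lock, so the pending count never shrinks,
  // and every retry adds another entry to the queue.
  return Promise.race([linkReady, timeout]);
}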

Under normal "transient" conditions, we'd expect the negotiateClaim call to time out when trying to create a connection to the service, the managementLock to get released, and, assuming the network conditions have returned to normal, the next attempt to initialize the management request/response link to succeed. However, in the scenario we're testing, we're unable to recreate the connection for an extended period of time. What's interesting is that we're not seeing ECONNREFUSED errors as one might expect. Instead it appears from the client side that it's attempting to connect to a black hole that never responds.

So, another wrinkle is that once the getPartitionIds call eventually throws a timeout error, the EventProcessor loop will sleep for some duration and then move on to the next iteration, starting with calling getPartitionIds again. This leads to more attempts to acquire the management lock, which means over time we eventually hit the limit.
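A simplified sketch of that loop behavior (the names and the sleep interval are illustrative, not the SDK's actual source):

async function runLoopWithLoadBalancing(consumerClient, isClosed) {
  while (!isClosed()) {
    try {
      // Each iteration asks for the partition IDs. While the $management link
      // is not open and the connection is black-holed, this queues yet another
      // defaultLock.acquire("managementLock", ...) that may never resolve.
      await consumerClient.getPartitionIds();
    } catch (err) {
      // Timeout errors end up here and are surfaced via processError.
    }
    // Sleep, then try again; over days the pending queue exceeds the
    // 10,000-task limit and "Too much pending tasks" is thrown instead.
    await new Promise((resolve) => setTimeout(resolve, 30000));
  }
}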

Possible solutions?

AsyncLock accepts a timeout when attempting to acquire a lock, so we could set that and handle any timeout errors as needed. This would prevent us from reaching the conditions that cause the "Too much pending tasks" error.
If we do this, we should also pass the abortSignal through the _init() calls so that the original method invocation that acquired a lock doesn't take too much time as well.
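A hedged sketch of that proposal, using async-lock's per-acquire timeout option (not necessarily how the eventual fix was implemented):

async function acquireManagementLock(defaultLock, init) {
  try {
    // Reject this waiter if the lock cannot be acquired within 60 seconds,
    // instead of leaving it queued indefinitely.
    return await defaultLock.acquire("managementLock", init, { timeout: 60000 });
  } catch (err) {
    // Translate the lock timeout into a retryable error for the caller.
    throw new Error(`Could not acquire the management lock: ${err.message}`);
  }
}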

I'm not sure that this would do anything to help with the MessagingError OperationTimeouts also being seen. On the one hand, being able to cancel in-flight requests to create a connection so the next _init() call doesn't think we're already connecting could help. However, I can't say that definitively without better understanding how the client is getting into this state. Logs with timestamps would help so we can ask the service what they see on their side.


hilfor commented Apr 6, 2021

hi @chradek

I've checked the logs and we're seeing these timeout errors for about a week before Too much pending tasks errors start showing up.
This timeout message appears every 5 minutes.
This message is printed inside the processError callback.
Unfortunately, we're not seeing any other errors around or near the timeouts.

We can also see that once Too much pending tasks starts showing up, the timeout errors stop.
Please note the timestamps on the graphs in the screenshots below.
[two screenshots: graphs showing when the timeout errors and the Too much pending tasks errors occurred]


chradek commented Apr 6, 2021

@hilfor
Thanks for the info! While you're seeing these MessagingError errors, are you seeing any incoming events? Not just that the processEvents handler is called, but that you're seeing actual events? In my testing I was seeing an 'all or nothing' type of scenario where if I was getting the MessagingError, I also was not receiving events, so I'm curious if you are seeing the same.

@hilfor
Copy link
Author

hilfor commented Apr 6, 2021

@chradek
I've checked the logs and we have the exact same scenario.
I just want to point out that these screenshots were of a service that has 5 other consumers (connected to different EventHubs) working without any issues.


chradek commented Apr 6, 2021

@hilfor
Thanks for confirming. I'm working on some changes that will prevent the too much pending tasks error from occurring, but I'm still unsure how much they will affect the MessagingErrors you're seeing. I'll keep this issue posted on those changes.

Would you be able to create a support case so we can work with the service team as well to determine why your connection to Event Hubs from your container seems to have entered an unrecoverable state? If you can narrow down the time range between when you were able to receive events and when you stopped receiving events, that would be useful info for the case.


chradek commented Apr 30, 2021

#14844 implements a fix for this issue and has been released as part of @azure/event-hubs version 5.5.1. Please feel free to reach out if you encounter any other issue!

chradek closed this as completed Apr 30, 2021

amit12cool commented May 27, 2021

Hi @chradek, I'm facing the same issue with MessagingError and Too much pending tasks. I have gathered the logs on my end and noted the times these errors arrived. I see a gap of one week between the timeout and the too much pending tasks error messages. Check the logs below.

The first timeout error appeared on 13th May:

2021-05-13 05:03:10    
{"level":"error", "time":"2021-05-12T23:33:10.991Z","msg":"Error in listening on eventHub: iot-prd01-evh and partitionId : MessagingError: The request with message_id \"5db3cdec-6b8b-46f0-b333-f7cf3f00c7aa\" timed out. Please try again later.}"}

The first Too much pending tasks error appeared on 20th May:

2021-05-20 04:20:53   
{"level":"error", "time":"2021-05-25T22:50:53.372Z","msg":"Error in listening on eventHub: iot-prd01-evh and partitionId : Error: Too much pending tasks}"}

@chradek I just found this error log from the same day we got the timeout error. It seems it's trying to connect to the hub and getting an error:

{"level":"error", "time":"2021-05-20T16:09:48.407Z","errno":-3001,"code":"EAI_AGAIN","syscall":"getaddrinfo","hostname":".azure-devices.net","stack":"Error: getaddrinfo EAI_AGAIN .azure-devices.net\n at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:67:26)","type":"Error","msg":"getaddrinfo EAI_AGAIN sil-xms-prd01-iothub.azure-devices.net"}

Another point is that we lost some events as soon as we got the timeout error and then the too much pending tasks error. We have checkpointing in place, which should have prevented events from being lost, but I wonder whether the timeout has something to do with events being missed intermittently.
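For context, a minimal sketch of checkpointing inside processEvents (assumed, not the poster's actual code; consumerClient is an EventHubConsumerClient configured with a BlobCheckpointStore, as in the earlier sample):

consumerClient.subscribe({
  processEvents: async (events, context) => {
    for (const event of events) {
      // ... handle the event ...
    }
    if (events.length > 0) {
      // Persist the position of the last processed event so a restarted
      // consumer resumes from the checkpoint instead of skipping events.
      await context.updateCheckpoint(events[events.length - 1]);
    }
  },
  processError: async (err, context) => {
    console.error(new Date().toISOString(), err);
  }
});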
