Receiving "Too much pending tasks" error when we use the event-hubs library to process messages #14606
Comments
@hilfor This is an interesting error you're seeing. We use a 'lock' when creating connections/links to prevent edge cases that would result in more connections to the service being created than needed. We configure the lock so there can be up to 10,000 pending tasks waiting on it at a time. This leads me to believe that either locks aren't being released properly in all cases, or there are a lot of operations happening in parallel.

Can you provide a sample of how you're receiving messages? I can use that to see if anything immediately sticks out. Also, how many partitions are you trying to read from? (I wouldn't expect a problem to be due to the number of partitions unless maybe you had close to 10,000 partitions!)

If possible, would you be able to turn on verbose logging and share what you see around the time the errors start? I can run some simple test cases to verify that we don't have code paths that would prevent locks from being released. Any of the above info you can provide would be greatly appreciated!
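For reference on the logging question above, client-side logging in the Azure SDK for JavaScript can typically be turned on either with the `AZURE_LOG_LEVEL` environment variable or programmatically through `@azure/logger`. A minimal sketch (the `[azure-sdk]` prefix is just an example, not something from this thread):

```javascript
// Minimal sketch: enable verbose SDK logging for @azure/event-hubs (v5+),
// which emits its logs through @azure/logger.
const { setLogLevel, AzureLogger } = require("@azure/logger");

setLogLevel("verbose");

// Optional: redirect SDK log lines (they go to stderr by default).
AzureLogger.log = (...args) => {
  console.log("[azure-sdk]", ...args);
};

// Environment-variable alternative, with no code changes:
//   AZURE_LOG_LEVEL=verbose node app.js
```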
Hi @chradek, sorry for the late response. We are creating an EventHub listener in the following order:

```javascript
const { ContainerClient } = require("@azure/storage-blob");
const { BlobCheckpointStore } = require("@azure/eventhubs-checkpointstore-blob");
const { EventHubConsumerClient } = require("@azure/event-hubs");

const containerClient = new ContainerClient(storageConnectionString, storageContainerName);
const checkpointStore = new BlobCheckpointStore(containerClient);
const consumerClient = new EventHubConsumerClient(
  eventHubConsumerGroup,
  eventHubConnectionString,
  eventHubName,
  checkpointStore
);

consumerClient.subscribe(
  {
    processEvents: async (events, context) => {},
    processError: async (err, context) => {} // --> the errors are printed in here
  },
  {
    maxBatchSize: 20,
    maxWaitTimeInSeconds: 20
  }
);
```

The number of EventHub partitions is 8 :) so it is not even close to 10,000. I will check if we can turn on logging.
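To make any shared logs easier to correlate, the `processError` handler above can capture more than the bare message. A small sketch (the output format is just an example; the fields come from the v5 `MessagingError` and `PartitionContext` types, and availability of some of them may vary by package version):

```javascript
// Sketch of a processError handler that records enough detail to correlate
// errors with partitions; field names follow the @azure/event-hubs v5 API.
const handlers = {
  processEvents: async (events, context) => {
    // ... process the batch ...
  },
  processError: async (err, context) => {
    console.error(
      JSON.stringify({
        time: new Date().toISOString(),
        name: err.name,           // e.g. "MessagingError"
        code: err.code,           // e.g. "OperationTimeoutError" (when present)
        retryable: err.retryable, // MessagingError flag (when present)
        partitionId: context.partitionId,
        consumerGroup: context.consumerGroup,
        message: err.message
      })
    );
  }
};
```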
@chradek, I've also noticed that just before these errors start showing up, we can see the following error message:
@hilfor I believe I'm starting to see the "pending tasks" on management operations (e.g. getPartitionIds) increase very slowly over time when I prevent the service from being able to respond. This increase in pending tasks is surprising, but it doesn't appear to happen if the connection to the service is healthy. Any logs you have from around the time you start seeing these errors would be very helpful.
Just wanting to dump my findings thus far here. As I mentioned earlier, I am able to see the "Too much pending tasks" error occur when I prevent the service from responding to the client. Note that the client still attempts to connect to the service, but creating the connection times out. Here's what I'm seeing happen:

Now, we have a timeout around acquiring the managementLock (used in step 4), and if we time out, we will try again until our retries are exhausted. However, our timeout wraps the entire operation, so when it fires it doesn't remove our pending task from the lock's queue. Under normal "transient" conditions, we'd expect the negotiateClaim call to time out when trying to create a connection to the service, the managementLock to get released, and, assuming the network conditions have returned to normal, the next attempt to initialize the management request/response link to succeed. However, in the scenario we're testing, we're unable to recreate the connection for an extended period of time. What's interesting is that we're not seeing the pending task count go back down in this case.

Another wrinkle is that eventually, once the getPartitionIds call throws a timeout error, the EventProcessor loop will sleep for some duration and then move on to the next iteration, starting with calling getPartitionIds again. This leads to more attempts to acquire the management lock, which means over time we eventually hit the limit.

Possible solutions: AsyncLock accepts a timeout when attempting to acquire a lock, so we could set that and handle any timeout errors as needed. This would prevent us from reaching the conditions that cause the "Too much pending tasks" error. I'm not sure that this would do anything to help with the MessagingError OperationTimeouts also being seen. On the one hand, being able to cancel in-flight requests to create a connection would let the next attempt start fresh instead of piling up behind the old one.
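To make the failure mode above concrete, here is a standalone sketch using the `async-lock` package directly (not the SDK's actual code); the key name, the deliberately tiny `maxPending`, and the timings are made up for illustration. It shows how acquisitions queued behind a stuck lock holder can hit the pending-task limit, and how a per-acquire timeout keeps queued tasks from waiting forever:

```javascript
// Sketch: reproduce async-lock's pending-task limit behind a stuck holder.
const AsyncLock = require("async-lock");

const lock = new AsyncLock({ maxPending: 5 }); // tiny limit to reproduce quickly

// Stand-in for management-link initialization that never finishes while the
// service is unreachable: the lock is acquired and never released.
lock.acquire("management", () => new Promise(() => {}));

for (let i = 0; i < 10; i++) {
  // Each retry queues another pending acquisition. Once more than maxPending
  // are waiting, acquire() rejects with the "Too much pending tasks" error.
  // The timeout option makes the queued acquisitions fail fast instead of
  // waiting indefinitely behind the stuck holder.
  lock
    .acquire("management", async () => {}, { timeout: 1000 })
    .catch((err) => console.log(`attempt ${i}: ${err.message}`));
}
```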
Hi @chradek, I've checked the logs and we're seeing these timeout errors for about a week before the "Too much pending tasks" errors start showing up. We can also see that once the "Too much pending tasks" errors appear, the consumer stops receiving any new events.
@hilfor Would you be able to create a support case so we can work with the service team as well to determine why your connection to Event Hubs from your container seems to have entered an unrecoverable state? If you can narrow down the time range around when you were able to receive events and when you stopped receiving events, that would be useful info for the case.
#14844 implements a fix for this issue and has been released as part of a newer version of @azure/event-hubs.
Hi @chradek, I'm facing the same issue. The first timeout error came on 13th May.
The first "Too much pending tasks" error came on 20th May.
@chradek I just found this error log from the same day we got the timeout error. It looks like the client is trying to connect to the hub and the DNS lookup is failing:

```json
{"level":"error","time":"2021-05-20T16:09:48.407Z","errno":-3001,"code":"EAI_AGAIN","syscall":"getaddrinfo","hostname":".azure-devices.net","stack":"Error: getaddrinfo EAI_AGAIN .azure-devices.net\n    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:67:26)","type":"Error","msg":"getaddrinfo EAI_AGAIN sil-xms-prd01-iothub.azure-devices.net"}
```

Another point is that we lost some of the events as soon as we got the timeout error and then the "Too much pending tasks" error. We have checkpointing in place, which should prevent events from being lost, but I wonder whether the timeout has something to do with events going missing intermittently.
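On the checkpointing point above: with the v5 `EventHubConsumerClient`, checkpoints are only written when `processEvents` explicitly calls `context.updateCheckpoint(...)`; after a restart the consumer resumes from the last checkpoint, and events received but not yet checkpointed should be redelivered rather than lost (as long as they are within the hub's retention period). A minimal sketch of that pattern (checkpoint frequency and error handling here are illustrative only):

```javascript
// Sketch of checkpointing from processEvents with the v5 API: checkpoint the
// last event of each batch so a restarted consumer resumes from that point.
const handlers = {
  processEvents: async (events, context) => {
    if (events.length === 0) {
      return;
    }
    for (const event of events) {
      // ... handle the event ...
    }
    // Persist progress to the blob checkpoint store.
    await context.updateCheckpoint(events[events.length - 1]);
  },
  processError: async (err, context) => {
    console.error(`Error on partition ${context.partitionId}:`, err);
  }
};
```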
Describe the bug
After a few days of processing events, the following errors start showing up: "Too much pending tasks". These errors are printed in the `processError` consumer callback. Once this error starts showing up, the consumer can no longer receive any new events.
Before posting this issue, I checked out other related issues like #5944 and #7674, but those issues talk about sending requests, while in our case the container only receives and processes events.
The issue resolves itself if we restart the container.
Stack trace
To Reproduce
This container is running in Azure, so maybe a connection was lost between the container and the EventHub instance, but I can't be sure.