
Receiving "Too much pending tasks" error when we use the event-hubs library to process messages #14606

Closed
2 of 6 tasks
hilfor opened this issue Mar 30, 2021 · 11 comments
Labels
  • Client: This issue points to a problem in the data-plane of the library.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • Event Hubs
  • needs-team-attention: Workflow: This issue needs attention from Azure service team or SDK team.
  • question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that.

Comments


hilfor commented Mar 30, 2021

  • Package Name: @azure/event-hubs
  • Package Version: 5.4.0
  • Operating system: Linux container
  • nodejs
    • version: 14.16.0
  • browser
    • name/version:
  • typescript
    • version: 3.9.9
  • Is the bug related to documentation in

Describe the bug
After a few days of processing events, the following error starts showing up: Too much pending tasks.
These errors are printed in the processError consumer callback.

Once this error starts showing up, the consumer can no longer receive any new events.

Before posting this issue, I checked other related issues like #5944 and #7674, but those issues are about sending requests, while in our case the container only receives and processes events.

The issue resolves itself if we restart the container.

Stack trace

Error: Too much pending tasks
    at AsyncLock.acquire (/app/node_modules/async-lock/lib/index.js:160:15)
    at ManagementClient.<anonymous> (/app/node_modules/@azure/event-hubs/dist/index.js:687:56)
    at Generator.next (<anonymous>)
    at /app/node_modules/tslib/tslib.js:114:75
    at new Promise (<anonymous>)
    at Object.__awaiter (/app/node_modules/tslib/tslib.js:110:16)
    at /app/node_modules/@azure/event-hubs/dist/index.js:656:91
    at new Promise (<anonymous>)
    at Object.sendOperationPromise [as operation] (/app/node_modules/@azure/event-hubs/dist/index.js:656:52)
    at Object.<anonymous> (/app/node_modules/@azure/core-amqp/dist/index.js:1227:39)
    at Generator.next (<anonymous>)
    at /app/node_modules/tslib/tslib.js:114:75
    at new Promise (<anonymous>)
    at Object.__awaiter (/app/node_modules/tslib/tslib.js:110:16)
    at Object.retry (/app/node_modules/@azure/core-amqp/dist/index.js:1202:18)
    at ManagementClient.<anonymous> (/app/node_modules/@azure/event-hubs/dist/index.js:732:40)
    at Generator.next (<anonymous>)
    at /app/node_modules/tslib/tslib.js:114:75
    at new Promise (<anonymous>)
    at Object.__awaiter (/app/node_modules/tslib/tslib.js:110:16)
    at ManagementClient._makeManagementRequest (/app/node_modules/@azure/event-hubs/dist/index.js:652:22)
    at ManagementClient.<anonymous> (/app/node_modules/@azure/event-hubs/dist/index.js:497:41)

To Reproduce
This container is running in Azure, so maybe a connection was lost between the container and the EventHub instance, but I can't be sure.


chradek commented Mar 30, 2021

@hilfor
Thanks for logging this issue.

This is an interesting error you're seeing. We use a 'lock' when creating connections/links to prevent edge cases that would result in more connections to the service being created than needed. We configure the lock so there can be up to 10,000 pending tasks. The error you're seeing indicates that there are at least 10,000 pending tasks waiting on a lock.
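For illustration, here is a minimal sketch of how the underlying async-lock package produces this error; the key name and the function wrapper are illustrative, not the SDK's actual code:

const AsyncLock = require("async-lock");

// The lock is described above as allowing up to 10,000 pending tasks.
const lock = new AsyncLock({ maxPending: 10000 });

function initManagementLink(init) {
  // Every caller queues on the same key. If the task that currently holds the
  // lock never completes (for example, a connection attempt that hangs), the
  // queue only grows, and once it exceeds maxPending, acquire() rejects with
  // "Too much pending tasks".
  return lock.acquire("managementLock", init);
}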

This leads me to believe that either locks aren't being released properly in all cases, or there are a lot of operations happening in parallel.

Can you provide a sample of how you're receiving messages? I can use that to see if anything immediately sticks out. Also, how many partitions are you trying to read from? (I wouldn't expect a problem to be due to the number of partitions unless maybe you had close to 10,000 partitions!)

Do you see your processError handlers firing during your run at all? I'd be interested in what sort of errors you see there. I'm really curious if we're seeing the client disconnect over time.

If possible, would you be able to turn on logging by setting the DEBUG environment variable to azure:*? This may result in a lot of logs if it takes days for this issue to manifest. It'd be helpful if we could get the tail end of the logs near where the issue happens if possible.
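For reference, a small sketch of enabling verbose client logging programmatically through @azure/logger (which @azure/event-hubs uses); this is an alternative to the DEBUG environment variable, though the exact log coverage may differ:

const { setLogLevel } = require("@azure/logger");

// Roughly comparable to setting DEBUG=azure:* before starting the process.
setLogLevel("verbose");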

I can run some simple test cases to verify that we don't have code paths that would prevent locks from being released. Any of the above info you can provide would be greatly appreciated!


hilfor commented Apr 4, 2021

Hi @chradek , sorry for the late response.

We are creating an EventHub listener as follows:

// Imports added for completeness:
const { ContainerClient } = require("@azure/storage-blob");
const { BlobCheckpointStore } = require("@azure/eventhubs-checkpointstore-blob");
const { EventHubConsumerClient } = require("@azure/event-hubs");

const containerClient = new ContainerClient(storageConnectionString, storageContainerName);
const checkpointStore = new BlobCheckpointStore(containerClient);
const consumerClient = new EventHubConsumerClient(eventHubConsumerGroup, eventHubConnectionString, eventHubName, checkpointStore);

consumerClient.subscribe(
  {
    processEvents: async (events, context) => {},
    processError: async (err, context) => {} // --> the errors are printed in here
  },
  {
    maxBatchSize: 20,
    maxWaitTimeInSeconds: 20
  }
)

The number of EventHub partitions is 8 :) so it is nowhere near 10,000.

I will check if we can turn on logging with DEBUG, but it might take some time for the issue to manifest :|


hilfor commented Apr 4, 2021

@chradek , I've also noticed that just before these errors start showing up, we can see the following error message:

Error : MessagingError: The request with message_id "5550f132-8297-4827-b6e3-5472457aa17c" timed out. Please try again later.


chradek commented Apr 5, 2021

@hilfor
Can you tell approximately how long you see MessagingError before you start seeing Too much pending tasks?

I believe I'm starting to see the "pending tasks" on management operations (e.g. getPartitionIds) increase very slowly over time when I prevent the service from being able to respond. This increase in pending tasks is surprising, but doesn't appear to be happening if the connection to the service is healthy.

Any logs you have around when you start seeing MessagingError indicating time outs might help us determine what's going on. If you can keep track of the timestamps where you were able to receive messages and when you stopped receiving messages, that can be helpful to take to the service team for investigation as well.
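For illustration, a hedged sketch of handlers that record timestamps for this purpose; consumerClient and the handler shape are assumed from the subscribe() sample shared earlier in this thread:

consumerClient.subscribe({
  processEvents: async (events, context) => {
    if (events.length > 0) {
      // Record when events were last received from each partition.
      console.log(new Date().toISOString(), `partition ${context.partitionId}: received ${events.length} events`);
    }
  },
  processError: async (err, context) => {
    // Record when MessagingError / "Too much pending tasks" first appear.
    console.error(new Date().toISOString(), `partition ${context.partitionId}:`, err.name, err.message);
  }
});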


chradek commented Apr 6, 2021

Just wanting to dump my findings thus far here.

So as I mentioned earlier, I am able to see the "Too much pending tasks" error occur when I prevent the service from responding to the client. Note that the client still attempts to connect to the service, but creating the connection times out.

Here's what I'm seeing happen:

  1. EventHubConsumerClient.subscribe() called.
  2. The EventProcessor._runLoopWithLoadBalancing method is invoked, which starts a loop that continues until the subscription or EventHubConsumerClient is closed.
  3. At the start of each loop iteration, we attempt to get the partitionIds.
  4. If there isn't already a $management request/response link open on the connection, we wait to acquire a lock to initialize the request/response link:
    return defaultLock.acquire(this.managementLock, () => {
      return this._init();
    });
  5. Once a lock is acquired, if the request/response link is still not created, attempt to create it.
  6. One of these steps is to negotiate a claim (this is for auth):
    await this._context.readyToOpenLink();
    await this._negotiateClaim();
  7. At this point, we'll wait to acquire another lock to create the CBS session (for auth):
    await defaultLock.acquire(this._context.cbsSession.cbsLock, () => {
      return this._context.cbsSession.init();
    });

Now, we have a timeout around acquiring the managementLock (used in step 4), and if we time out, we will try again until our retries are exhausted. However, our timeout wraps the defaultLock.acquire call, so it doesn't actually release the lock.
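For illustration, a minimal sketch of that pattern with assumed helper and variable names (not the SDK's actual source):

async function getManagementLink(defaultLock, init) {
  // The acquire() call joins the lock's pending queue immediately...
  const linkReady = defaultLock.acquire("managementLock", init);
  const timeout = new Promise((_, reject) =>
    setTimeout(() => reject(new Error("OperationTimeoutError")), 60000)
  );
  // ...but if the timeout wins this race, only the caller gives up; the queued
  // task stays pending inside async-lock, so the pending count never shrinks,
  // and every retry adds another entry to the queue.
  return Promise.race([linkReady, timeout]);
}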

Under normal "transient" conditions, we'd expect the negotiateClaim call to time out when trying to create a connection to the service, the managementLock to get released, and, assuming the network conditions have returned to normal, the next attempt to initialize the management request/response link to succeed. However, in the scenario we're testing, we're unable to recreate the connection for an extended period of time. What's interesting is that we're not seeing ECONNREFUSED errors as one might expect. Instead it appears from the client side that it's attempting to connect to a black hole that never responds.

So, another wrinkle is that once the getPartitionIds call eventually throws a timeout error, the EventProcessor loop will sleep for some duration and then move on to the next iteration, starting with calling getPartitionIds again. This leads to more attempts to acquire the management lock, which means over time we eventually hit the limit.
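A simplified sketch of that loop behavior (the names and the sleep interval are illustrative, not the SDK's actual source):

async function runLoopWithLoadBalancing(consumerClient, isClosed) {
  while (!isClosed()) {
    try {
      // Each iteration asks for the partition IDs. While the $management link
      // is not open and the connection is black-holed, this queues yet another
      // defaultLock.acquire("managementLock", ...) that may never resolve.
      await consumerClient.getPartitionIds();
    } catch (err) {
      // Timeout errors end up here and are surfaced via processError.
    }
    // Sleep, then try again; over days the pending queue exceeds the
    // 10,000-task limit and "Too much pending tasks" is thrown instead.
    await new Promise((resolve) => setTimeout(resolve, 30000));
  }
}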

Possible solutions?

AsyncLock accepts a timeout when attempting to acquire a lock, so we could set that and handle any timeout errors as needed. This would prevent us from reaching the conditions that cause the "Too much pending tasks" error.
If we do this, we should also pass the abortSignal through the _init() calls so that the original method invocation that acquired a lock doesn't take too much time as well.
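A hedged sketch of that proposal, using async-lock's per-acquire timeout option (not necessarily how the eventual fix was implemented):

async function acquireManagementLock(defaultLock, init) {
  try {
    // Reject this waiter if the lock cannot be acquired within 60 seconds,
    // instead of leaving it queued indefinitely.
    return await defaultLock.acquire("managementLock", init, { timeout: 60000 });
  } catch (err) {
    // Translate the lock timeout into a retryable error for the caller.
    throw new Error(`Could not acquire the management lock: ${err.message}`);
  }
}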

I'm not sure that this would do anything to help with the MessagingError OperationTimeouts also being seen. On the one hand, being able to cancel in-flight requests to create a connection so the next _init() call doesn't think we're already connecting could help. However, I can't say that definitively without better understanding how the client is getting into this state. Logs with timestamps would help so we can ask the service what they see on their side.


hilfor commented Apr 6, 2021

hi @chradek

I've checked the logs and we're seeing these timeout errors for about a week before Too much pending tasks errors start showing up.
This timeout message appears every 5 minutes.
This message is printed inside the processError callback.
Unfortunately, we're not seeing any other errors around or near the timeouts.

We can also see that once Too much pending tasks starts showing up, the timeout errors stop.
Please note the timestamps on the graphs in the screenshots below.
[two screenshots: graphs showing when the timeout errors and the Too much pending tasks errors occurred]


chradek commented Apr 6, 2021

@hilfor
Thanks for the info! While you're seeing these MessagingError errors, are you seeing any incoming events? Not just that the processEvents handler is called, but that you're seeing actual events? In my testing I was seeing an 'all or nothing' type of scenario where if I was getting the MessagingError, I also was not receiving events, so I'm curious if you are seeing the same.

@hilfor
Copy link
Author

hilfor commented Apr 6, 2021

@chradek
I've checked the logs and we have the exact same scenario.
I just want to point out that these screenshots were of a service that has 5 other consumers (connected to different EventHubs) working without any issues.


chradek commented Apr 6, 2021

@hilfor
Thanks for confirming. I'm working on some changes that will prevent the too much pending tasks error from occurring, but I'm still unsure how much they will affect the MessagingErrors you're seeing. I'll keep this issue posted on those changes.

Would you be able to create a support case so we can work with the service team as well to determine why your connection to Event Hubs from your container seems to have entered an unrecoverable state? If you can narrow down the time range between when you were able to receive events and when you stopped receiving events, that would be useful info for the case.


chradek commented Apr 30, 2021

#14844 implements a fix for this issue and has been released as part of @azure/event-hubs version 5.5.1. Please feel free to reach out if you encounter any other issue!

chradek closed this as completed Apr 30, 2021

amit12cool commented May 27, 2021

Hi @chradek, I'm facing the same issue with MessagingError and Too much pending tasks. I have gathered the logs on my end and noted the times these errors arrived. I see a gap of one week between the timeout and the too much pending tasks error messages. Check the logs below.

The first timeout error appeared on 13th May:

2021-05-13 05:03:10    
{"level":"error", "time":"2021-05-12T23:33:10.991Z","msg":"Error in listening on eventHub: iot-prd01-evh and partitionId : MessagingError: The request with message_id \"5db3cdec-6b8b-46f0-b333-f7cf3f00c7aa\" timed out. Please try again later.}"}

The first Too much pending tasks error appeared on 20th May:

2021-05-20 04:20:53   
{"level":"error", "time":"2021-05-25T22:50:53.372Z","msg":"Error in listening on eventHub: iot-prd01-evh and partitionId : Error: Too much pending tasks}"}

@chradek I just found this error log from the same day we got the timeout error. It seems it's trying to connect to the hub and getting an error:

{"level":"error", "time":"2021-05-20T16:09:48.407Z","errno":-3001,"code":"EAI_AGAIN","syscall":"getaddrinfo","hostname":".azure-devices.net","stack":"Error: getaddrinfo EAI_AGAIN .azure-devices.net\n at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:67:26)","type":"Error","msg":"getaddrinfo EAI_AGAIN sil-xms-prd01-iothub.azure-devices.net"}

Another point is that we lost some events as soon as we got the timeout error and then the too much pending tasks error. We have checkpointing in place, which should have prevented events from being lost, but I wonder whether the timeout has something to do with events being missed intermittently.
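For context, a minimal sketch of checkpointing inside processEvents (assumed, not the poster's actual code; consumerClient is an EventHubConsumerClient configured with a BlobCheckpointStore, as in the earlier sample):

consumerClient.subscribe({
  processEvents: async (events, context) => {
    for (const event of events) {
      // ... handle the event ...
    }
    if (events.length > 0) {
      // Persist the position of the last processed event so a restarted
      // consumer resumes from the checkpoint instead of skipping events.
      await context.updateCheckpoint(events[events.length - 1]);
    }
  },
  processError: async (err, context) => {
    console.error(new Date().toISOString(), err);
  }
});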
