Eventhub stops getting data when under load #273

MartinKosicky · 2022-09-15T10:06:14Z

When we run the eventhub Go listener, after some time it gets stuck at https://github.com/Azure/azure-event-hubs-go/blob/master/receiver.go#L291. It seems that the session gets broken. However, I know the connection is OK: I watched it in Wireshark, and when I call GetPartitionInfo (which uses the same connection, so the socket is not dead) I see that I am not at the end of the partition.

I would like to ask whether there should be some kind of timeout in case the session gets broken, as I saw such code in the C# variant of this library. However, I don't see anything like that in this code. Maybe a call to Recover if no data arrives after some time?

MartinKosicky · 2022-09-15T10:27:50Z

I just changed that line to:

	// If no data arrives within 30 seconds, the timeout triggers a Recover on the session.
	newContext, cancel := context.WithTimeout(ctx, 30*time.Second)
	msg, err := r.listenForMessage(newContext)
	cancel() // release the timer once the call returns

and it works now. Should I make a PR, or can we discuss it if you have some other idea?

richardpark-msft · 2022-09-15T17:47:59Z

The issue with this is that it'll force recovery every 'n' seconds (in your case 30) if there's no activity. So really we need to fix the core bug here, which appears to be that the Receiver is no longer "live" and so it's not responding to messages. There are a few reasons this could happen.

Can you give me a better idea of how you reproduce this? How long is "after some time"? Are we talking 3 days, 4 days, that kind of thing? Also, do you see this after longer idle periods, or does it happen even while there is activity?

MartinKosicky · 2022-09-15T18:49:11Z

I totally agree that we should fix the core issue, although I'm worried about resilience here. I can reproduce this by reading from the start of an eventhub with a prefetch of 2000. After a while (a few minutes at most) the reading stops. We have logic that, if I get no data for 30 seconds, checks whether I am at the end of the partition (over the same connection). If I'm not, we let the microservice crash and restart from a checkpoint. This happens only when there is a lot of activity. It could also be a server issue.
