Eventhub stops getting data when under load #273

Open
MartinKosicky opened this issue Sep 15, 2022 · 3 comments

@MartinKosicky

When we run the Event Hubs Go listener, after some time we get stuck on https://github.com/Azure/azure-event-hubs-go/blob/master/receiver.go#L291. It seems that the session gets broken. However, I know that the connection is OK: I watched it in Wireshark, and when I call GetPartitionInfo (it uses the same connection, so the socket is not dead) I see that I am not at the end of the partition.

I would like to ask whether there should be some kind of timeout in case the session gets broken somehow, as I saw such code in the C# variant of this library. I don't see anything like that in this code. Maybe a call to Recover if no data arrives for some time?

@MartinKosicky
Author

MartinKosicky commented Sep 15, 2022

I just changed that line to:

	newContext, cancel := context.WithTimeout(ctx, 30*time.Second)
	msg, err := r.listenForMessage(newContext) // triggers a Recover on the session if no data arrives in 30 seconds
	cancel() // release the timeout's resources once the call returns

and it works now. Should I make a PR, or can we discuss it in case you have some other idea?

@richardpark-msft
Member

The issue with this is that it'll force a recovery every 'n' seconds (30 in your case) if there's no activity. So really we need to fix the core bug here, which appears to be that the Receiver is no longer "live" and so it's not responding to messages. There are a few reasons this could happen.

Can you give me a better idea of how you reproduce this? How long is "after some time"? Are we talking 3 days, 4 days, that kind of thing? Also, do you see this after longer idle periods, or does it happen even while there is activity?

@MartinKosicky
Author

MartinKosicky commented Sep 15, 2022

I totally agree that the core issue should be fixed, although I'm worried about resilience here. I can reproduce this by reading from the start of an Event Hub with prefetch set to 2000. After a while (a few minutes at most) the reading stops. We have logic that, if I get no data for 30 seconds, checks whether I am at the end of the partition (over the same connection). If I'm not, we let the microservice crash and get restarted from a checkpoint. This happens only when there is a lot of activity. It could also be a server-side issue.
