[azeventhubs] Look deeper into "stolen partition" robustness #19062

Closed
richardpark-msft opened this issue Sep 7, 2022 · 1 comment
Labels
Client This issue points to a problem in the data-plane of the library. Event Hubs

richardpark-msft commented Sep 7, 2022

In the Processor there are two sources of ownership:

  • The checkpoint store, which holds one ownership blob per partition. Each blob records a single owner (a Processor instance).
  • The AMQP link level, where ownership comes from holding a live link. When you use an epoch, only one link can be alive at a time (last one in wins), and any other existing links are automatically detached.
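
The link-level rule can be modelled in a few lines. This is only a sketch of the "last one in" behaviour described above, not code from azeventhubs or the service: `partitionLinks`, `link`, and `open` are hypothetical names, and the model ignores everything about AMQP except which link is live.

```go
package main

// link is a hypothetical stand-in for an AMQP receiver link.
type link struct {
	owner    string
	detached bool
}

// partitionLinks is a hypothetical model of the rule described above:
// with an epoch, at most one link per partition is live at a time.
type partitionLinks struct {
	live map[string]*link // partition ID -> currently live link
}

func newPartitionLinks() *partitionLinks {
	return &partitionLinks{live: map[string]*link{}}
}

// open attaches a new link for owner and detaches whichever link was
// live on that partition before ("last one in" wins).
func (p *partitionLinks) open(partitionID, owner string) *link {
	if prev, ok := p.live[partitionID]; ok {
		prev.detached = true
	}
	l := &link{owner: owner}
	p.live[partitionID] = l
	return l
}
```

Opening a second link on the same partition leaves the first one with `detached == true`, which is the signal the older processor eventually sees.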

During partition assignment, one processor can end up owning more partitions than it should, depending on when it came up relative to the other processors. When this happens, newer processors have to steal partitions: they take ownership by updating the ownership blob and opening a new link. This detaches the older processor's link and also prevents that processor from updating ownership from then on, since its etag no longer matches what's stored.
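
The etag part of a steal can be sketched as a compare-and-swap on the ownership record. This is an in-memory illustration under assumptions, not the real checkpoint store: `ownershipStore`, `ownership`, `claim`, and `errETagMismatch` are hypothetical names, and the etag is just a counter.

```go
package main

import (
	"errors"
	"strconv"
)

// errETagMismatch is a hypothetical error for a failed conditional write.
var errETagMismatch = errors.New("ownership changed: etag mismatch")

// ownership is a hypothetical stand-in for one ownership blob.
type ownership struct {
	OwnerID string
	ETag    string
}

// ownershipStore is an in-memory model of a store with one blob per partition.
type ownershipStore struct {
	blobs   map[string]ownership
	version int
}

func newOwnershipStore() *ownershipStore {
	return &ownershipStore{blobs: map[string]ownership{}}
}

// claim makes ownerID the owner of partitionID only if expectedETag matches
// the stored etag. An empty expectedETag means "the blob must not exist yet".
// Every successful write produces a new etag, which invalidates older ones.
func (s *ownershipStore) claim(partitionID, ownerID, expectedETag string) (ownership, error) {
	cur, exists := s.blobs[partitionID]
	if (exists && cur.ETag != expectedETag) || (!exists && expectedETag != "") {
		return ownership{}, errETagMismatch
	}
	s.version++
	next := ownership{OwnerID: ownerID, ETag: strconv.Itoa(s.version)}
	s.blobs[partitionID] = next
	return next, nil
}
```

If processor A claims a partition and processor B then steals it using the etag it just read, A's next `claim` with its old etag fails with `errETagMismatch`, which matches the "etag will no longer match" behaviour described above.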

There are some cases where I think we can tighten up the handling of this to reduce the window of time in which two partition processors both believe they should process a partition. As it is now, two processors can process the same events. That isn't breaking, since "at least once" delivery is a property of Event Hubs and needs to be handled by customers.

We can improve this by doing a few things:

  1. We can, in the ownership polling loop, close any partition processors that are on partitions we no longer own. This is something that would happen when the customer calls ReceiveEvents(), but we can do it earlier.
  2. Change UpdateCheckpoint() to handle the case where we've lost link-level ownership. This'll probably involve either a check of the link itself (to see if it's detached - a function that doesn't exist today) or at least knowing when we've been locally closed and relying on the code we write in item 1.
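
A rough sketch of both items, assuming we go the "locally closed" route for item 2. None of these names come from azeventhubs: `partitionClient`, `reconcile`, and `errOwnershipLost` are hypothetical, and the real polling loop would get `owned` from the checkpoint store rather than from a map passed in.

```go
package main

import (
	"errors"
	"sort"
	"sync"
)

// errOwnershipLost is a hypothetical error for checkpointing after a steal.
var errOwnershipLost = errors.New("partition ownership lost")

// partitionClient is a hypothetical stand-in for a per-partition processor.
type partitionClient struct {
	mu         sync.Mutex
	closed     bool
	checkpoint int64
}

// close marks the client as locally closed.
func (c *partitionClient) close() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.closed = true
}

// updateCheckpoint refuses to write once the client has been locally
// closed (item 2), instead of waiting for the store to reject the write.
func (c *partitionClient) updateCheckpoint(sequenceNumber int64) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closed {
		return errOwnershipLost
	}
	c.checkpoint = sequenceNumber
	return nil
}

// reconcile is what one pass of the ownership polling loop could do
// (item 1): close and drop every active client whose partition is not in
// owned. It returns the IDs it closed, sorted for stable output.
func reconcile(active map[string]*partitionClient, owned map[string]bool) []string {
	var closed []string
	for id, c := range active {
		if !owned[id] {
			c.close()
			delete(active, id) // deleting during range is safe in Go
			closed = append(closed, id)
		}
	}
	sort.Strings(closed)
	return closed
}
```

With this shape, a client whose partition was stolen gets closed on the next poll, and any later `updateCheckpoint` call on it returns `errOwnershipLost` without touching the store.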

For live testing we'll probably want better visibility into the internals, so these will be live tests that don't rely only on the public interface.

@ghost ghost added the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Sep 7, 2022
@richardpark-msft richardpark-msft added Event Hubs Client This issue points to a problem in the data-plane of the library. labels Sep 7, 2022
@richardpark-msft richardpark-msft self-assigned this Sep 7, 2022
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Sep 7, 2022
@richardpark-msft richardpark-msft added this to the 2022-10 milestone Sep 7, 2022
@RickWinter RickWinter modified the milestones: 2022-10, 2023-04 Jan 10, 2023
@RickWinter RickWinter removed this from the 2023-04 milestone Aug 29, 2023
@richardpark-msft
Member Author

This was done a long time ago as part of our general validation and the design of the client. We also have several stress tests for this.

@github-actions github-actions bot locked and limited conversation to collaborators Aug 30, 2024
Projects
Status: Done