[azeventhubs] Look deeper into "stolen partition" robustness #19062

richardpark-msft · 2022-09-07T19:25:38Z

In the Processor there are two sources of ownership:

1. The checkpoint store, which has a blob per partition, each with a distinct owner (a Processor instance).
2. The AMQP link level, by virtue of owning a live link. When you use an epoch, only one link can be alive at a time (last one in wins), and any other existing links are automatically detached.

When partitions are being assigned, it's possible for one processor to own more partitions than it should, depending on when it came up relative to other processors. When this happens, newer processors have to steal partitions: take ownership by updating the ownership blob and opening a new link. This causes the older processor's link to detach and also prevents the older processor from updating ownership from that point on, since its etag no longer matches what's stored.

There are some cases where I think we can tighten up the handling of this to reduce the window of time where two partition processors believe they should process a partition. As it is now, it's possible for two processors to process the same events, which isn't breaking, as "at least once" delivery is a property of Event Hubs and needs to be handled by customers.

We can improve this by doing a few things:

1. In the ownership polling loop, close any partition processors that are on partitions we no longer own. This would happen anyway when the customer calls ReceiveEvents(), but we can do it earlier.
2. Change UpdateCheckpoint() to handle the case where we've lost link-level ownership. This will probably involve either checking the link itself (to see if it's detached, using a function that doesn't exist today) or at least knowing when we've been locally closed and relying on the code we write in item 1.

For live testing we'll probably want better inspection over the internals, so these will be live tests that don't rely only on the public interface.


richardpark-msft · 2024-06-01T03:01:09Z

This was done a long time ago as part of our general validation and the design of the client. We also have several stress tests for this.

ghost added the needs-triage label Sep 7, 2022

richardpark-msft added the Event Hubs and Client labels Sep 7, 2022

richardpark-msft self-assigned this Sep 7, 2022

ghost removed the needs-triage label Sep 7, 2022

richardpark-msft added this to the 2022-10 milestone Sep 7, 2022

RickWinter modified the milestones: 2022-10, 2023-04 Jan 10, 2023

RickWinter removed this from the 2023-04 milestone Aug 29, 2023

richardpark-msft closed this as completed Jun 1, 2024

github-actions bot locked and limited conversation to collaborators Aug 30, 2024
