In the Processor there are two sources of ownership:
1. The checkpoint store, which has a blob per partition, each with a distinct owner (a Processor instance).
2. The AMQP link level, by virtue of owning a live link. When you use an epoch, only one link can be alive at a time (last one in), and any other existing links are automatically detached.
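To make the second source concrete, here's a rough sketch of opening a partition receiver with an owner level (epoch) through the azeventhubs ConsumerClient. Treat the exact names used here (NewPartitionClient, PartitionClientOptions.OwnerLevel, to.Ptr) as assumptions for illustration rather than a statement of what the Processor does internally:

```go
package example

import (
	"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azeventhubs"
)

// openPartitionWithEpoch opens a receiver for a single partition using an owner
// level (the AMQP epoch). The service detaches any existing link for the same
// partition and consumer group with a lower (or absent) owner level, so the
// last processor in wins the live link.
func openPartitionWithEpoch(consumer *azeventhubs.ConsumerClient, partitionID string, ownerLevel int64) (*azeventhubs.PartitionClient, error) {
	return consumer.NewPartitionClient(partitionID, &azeventhubs.PartitionClientOptions{
		// OwnerLevel is the epoch: the highest value preempts any older link.
		OwnerLevel: to.Ptr(ownerLevel),
	})
}
```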
When partitions are being assigned, it's possible for one processor to own more partitions than it should, due to timing and when that processor came up relative to the others. When this happens, newer processors have to steal partitions: they take ownership by updating the ownership blob and opening a new link. This causes the older processor's link to detach, and also prevents it from updating ownership at that point, since its etag will no longer match what's stored.
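As a rough illustration of the blob side of stealing, here's a minimal sketch of an etag-conditioned ownership claim. The ownership struct and ownershipStore interface are hypothetical stand-ins, not the real CheckpointStore API:

```go
package example

import (
	"context"
	"errors"
)

// ownership is a hypothetical, simplified view of the per-partition ownership blob.
type ownership struct {
	PartitionID string
	OwnerID     string // the Processor instance that believes it owns this partition
	ETag        string // etag of the blob when we last read or wrote it
}

// errETagMismatch stands in for the store's precondition-failed error.
var errETagMismatch = errors.New("etag mismatch: ownership was claimed by another processor")

// ownershipStore is a hypothetical interface over the blob-backed checkpoint store.
type ownershipStore interface {
	// claim writes the ownership blob only if its current etag matches o.ETag.
	// On success it returns the new etag; on mismatch it returns errETagMismatch.
	claim(ctx context.Context, o ownership) (newETag string, err error)
}

// stealPartition attempts to take ownership of a partition from another
// processor. If someone else updated the blob since we last read it, the
// conditional write fails and we simply don't own the partition.
func stealPartition(ctx context.Context, store ownershipStore, current ownership, myID string) (ownership, bool) {
	current.OwnerID = myID

	newETag, err := store.claim(ctx, current)
	if errors.Is(err, errETagMismatch) {
		return ownership{}, false // lost the race; another processor owns it now
	} else if err != nil {
		return ownership{}, false
	}

	current.ETag = newETag
	return current, true // we own it; next step is opening an epoch link for this partition
}
```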
There are some cases where I think we can tighten up the handling here to reduce the window of time where two partition processors both believe they should process the same partition. As it is now, it's possible for two processors to process the same events, which isn't breaking: "at least once" delivery is a property of Event Hubs and is something customers already need to handle.
We can improve this by doing a few things:
1. In the ownership polling loop, close any partition processors for partitions we no longer own. This currently happens when the customer calls ReceiveEvents(), but we can do it earlier.
2. Change UpdateCheckpoint() to handle the case where we've lost link-level ownership. This will probably involve either checking the link itself (to see if it's detached - a function that doesn't exist today) or at least knowing when we've been locally closed, relying on the code we write in item 1. (A rough sketch of both items is below.)
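Here's a sketch of how both items could hang together, assuming a hypothetical trimmed-down processor type that tracks its open partition clients; none of these names match the actual implementation:

```go
package example

import (
	"context"
	"errors"
	"sync"
)

var errOwnershipLost = errors.New("partition ownership was lost; checkpoint not written")

// partitionClient is a hypothetical stand-in for the per-partition receiver.
type partitionClient interface {
	Close(ctx context.Context) error
}

// processor is a hypothetical, trimmed-down Processor that remembers which
// partitions it believes it owns and the open clients for them.
type processor struct {
	mu      sync.Mutex
	clients map[string]partitionClient // partition ID -> open client
}

// reconcileOwnership is item 1: called from the ownership polling loop with the
// set of partitions we currently own, it closes clients for partitions we lost
// instead of waiting for the customer's next ReceiveEvents() call to surface it.
func (p *processor) reconcileOwnership(ctx context.Context, owned map[string]bool) {
	p.mu.Lock()
	defer p.mu.Unlock()

	for partitionID, client := range p.clients {
		if !owned[partitionID] {
			_ = client.Close(ctx)          // detaches our link for the partition
			delete(p.clients, partitionID) // marks the partition as no longer ours locally
		}
	}
}

// updateCheckpoint is item 2: refuse to write a checkpoint for a partition we
// no longer own locally, so a stolen partition can't be checkpointed by the old
// owner. (A link-level "am I detached?" check could tighten this further.)
func (p *processor) updateCheckpoint(ctx context.Context, partitionID string, writeCheckpoint func(context.Context) error) error {
	p.mu.Lock()
	_, stillOwned := p.clients[partitionID]
	p.mu.Unlock()

	if !stillOwned {
		return errOwnershipLost
	}

	return writeCheckpoint(ctx)
}
```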
For live testing we'll probably want better visibility into the internals, so these will be live tests that don't rely only on the public interface.