Not all partitions are assigned to consumer group if partitions % instances != 0 #13546

josago · 2020-09-03T14:52:08Z

Package Name: azure-eventhub
Package Version:
azure-eventhub==5.1.0
azure-eventhub-checkpointstoreblob==1.1.0
Operating System: Linux
Python Version: 3.6

Describe the bug
Having 6 instances of a process consuming records from an EventHub topic with 32 partitions (using the same consumer group) results in every instance being assigned to 5 partitions, with 32 - 6 * 5 = 2 partitions staying unassigned to any consumer, even hours after the processes have been launched. No partitions remain unassigned if either 4 or 8 instances are used instead (we hypothesize the reason is that 32 % 8 == 0, 32 % 4 == 0, but 32 % 6 != 0).

To Reproduce
Steps to reproduce the behavior:

1. Fill in the EVENT_HUB_MONITOR_* constants within the given Python code with proper values.
2. Execute 6 instances of the given Python code in parallel.
3. Put the output of every instance together to figure out how many partitions remain unassigned.
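
The last step above can be done by hand, or with a small helper like the sketch below. It assumes each instance's output has been reduced to the list of partition IDs it owns, and that partition IDs are the strings "0" through "31". The function name `find_unassigned` and the sample data are hypothetical and only illustrate the bookkeeping; they are not taken from the attached program or the screenshots.

```python
from typing import Dict, Iterable, Set


def find_unassigned(total_partitions: int,
                    owned_by_instance: Dict[str, Iterable[str]]) -> Set[str]:
    """Return the partition IDs that no instance reports owning."""
    all_ids = {str(p) for p in range(total_partitions)}
    claimed: Set[str] = set()
    for partition_ids in owned_by_instance.values():
        claimed.update(str(p) for p in partition_ids)
    return all_ids - claimed


# Made-up sample: 6 instances that each report 5 partitions (0-4, 5-9, ...).
observed = {
    f"instance-{i}": [str(p) for p in range(i * 5, i * 5 + 5)]
    for i in range(6)
}

missing = find_unassigned(32, observed)
print(sorted(missing, key=int))  # ['30', '31']
```

With real data, `observed` would be filled from whatever each instance logs about the partitions it is processing.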

The source code of the program follows:

EventHubMonitor.txt

Expected behavior
The expected behavior would be that 4 instances are assigned to 5 partitions, and 2 instances are assigned to 6 partitions, for a total of 4 * 5 + 2 * 6 = 32 assigned partitions.
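
The arithmetic behind this expectation can be sketched as follows. This is only an illustration of the even split described above, not the SDK's load balancing code, and `expected_distribution` is a hypothetical name.

```python
from typing import List


def expected_distribution(partitions: int, instances: int) -> List[int]:
    """Spread partitions as evenly as possible across instances."""
    base, extra = divmod(partitions, instances)
    # `extra` instances take one partition more than the rest.
    return [base + 1] * extra + [base] * (instances - extra)


print(expected_distribution(32, 6))  # [6, 6, 5, 5, 5, 5]
print(expected_distribution(32, 8))  # [4, 4, 4, 4, 4, 4, 4, 4]

# If every instance stops at the floor share, the remainder stays unowned,
# which matches the 2 unassigned partitions reported above.
print(32 - 6 * (32 // 6))  # 2
```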

Screenshots
I include screenshots that show the output of each one of the 6 instances after 5 minutes have passed since they were launched.

Additional context
While we are using the Azure Blob check-pointing mechanism to balance the consumers, we do not use it to commit reading offsets. In practice, to run several instances of this process in parallel we use a ReplicationController within a Kubernetes cluster.

KieranBrantnerMagee · 2020-09-05T06:29:23Z

Thanks for reaching out, josago, and for the level of detail you provided.

While we investigate this, I'm tempted to suggest trying the 5.2.0b1 release (--pre --upgrade when pip installing) as it contained some changes to the load balancing algorithm that may intersect with this beneficially.

saadansarithefirst · 2020-10-13T12:04:09Z

Hello, are any updates expected on this bug? It seems to be a major issue.

KieranBrantnerMagee · 2020-10-13T23:33:05Z

Hey @saadansarithefirst,
Per the earlier comment, 5.2.0 (then in preview, since released) underwent a significant overhaul of the load balancing algorithm that may have incidentally addressed this issue. If you've seen the same symptoms on this latest release, then yes, this is a major issue that we'll prioritize accordingly. No pressure if it won't be easy for you to try, as I do intend to repro this in the next few days as time allows; the steps seem clear enough.

Thanks for your patience/any info that you have.

KieranBrantnerMagee · 2020-12-14T17:43:11Z

Following up on this (thanks all for your patience): the good news is that our assumptions appear to have been on point, and the new load balancing seems to have addressed this. See the tests (added in #15786).

I would mention that the tests assume use of the Greedy checkpoint acquisition strategy, which deals most effectively with the problem you're describing by quickly ensuring all partitions are claimed. I am closing this for now under the hope and assumption that this approach addresses the aforementioned repro as tested, but if I've misunderstood or this issue still seems to recur, do not hesitate to loop back on this and give us a shout.
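
As rough intuition for why a greedy strategy claims everything quickly, here is a toy model. It is not the SDK's algorithm: it assumes each consumer knows its fair share up front, and that a non-greedy consumer claims at most one partition per load balancing cycle. The name `cycles_to_claim_all` and both assumptions are illustrative only.

```python
from typing import List, Tuple


def cycles_to_claim_all(partitions: int, consumers: int,
                        greedy: bool) -> Tuple[int, List[int]]:
    """Toy model: count cycles until every partition has an owner."""
    base, extra = divmod(partitions, consumers)
    # Fair share per consumer; the first `extra` consumers get one more.
    targets = [base + (1 if i < extra else 0) for i in range(consumers)]
    owned = [0] * consumers
    unclaimed = partitions
    cycles = 0
    while unclaimed > 0:
        cycles += 1
        for i in range(consumers):
            want = targets[i] - owned[i]
            if not greedy:
                want = min(want, 1)  # assumed: one claim per cycle
            take = min(want, unclaimed)
            owned[i] += take
            unclaimed -= take
    return cycles, owned


print(cycles_to_claim_all(32, 6, greedy=True))   # (1, [6, 6, 5, 5, 5, 5])
print(cycles_to_claim_all(32, 6, greedy=False))  # (6, [6, 6, 5, 5, 5, 5])
```

Both modes end with all 32 partitions owned in this model; the difference is only how many cycles it takes to get there.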

xiangyan99 assigned KieranBrantnerMagee Sep 3, 2020

kaerm added the Client and Event Hubs labels Sep 3, 2020

ghost removed the needs-triage label Sep 3, 2020

yunhaoling added the needs-author-feedback label Dec 9, 2020

KieranBrantnerMagee added a commit to KieranBrantnerMagee/azure-sdk-for-python that referenced this issue Dec 14, 2020

Add unit test for 32 partition greedy ownership to attempt and validate issue Azure#13546 having been fixed. (it seems to have been.)

5f3eb90

KieranBrantnerMagee mentioned this issue Dec 14, 2020

[EventHubs] Add unit test for 32 partition greedy ownership to attempt and validate issue #13546 having been fixed #15786

Merged

yunhaoling assigned yunhaoling and unassigned KieranBrantnerMagee Dec 17, 2020

yunhaoling closed this as completed in #15786 Dec 21, 2020

yunhaoling added a commit that referenced this issue Dec 21, 2020

Add unit test for 32 partition greedy ownership to attempt and validate issue #13546 having been fixed. (it seems to have been.) (#15786) Co-authored-by: Yunhao Ling <adam_ling@outlook.com>

5b1777e

github-actions bot locked and limited conversation to collaborators Apr 12, 2023
