Not all partitions are assigned to consumer group if partitions % instances != 0 #13546

Closed
josago opened this issue Sep 3, 2020 · 4 comments · Fixed by #15786
Labels
  • Client: This issue points to a problem in the data-plane of the library.
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • Event Hubs
  • needs-author-feedback: Workflow: More information is needed from author to address the issue.
  • question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that

Comments

@josago

josago commented Sep 3, 2020

  • Package Name: azure-eventhub
  • Package Version:
    azure-eventhub==5.1.0
    azure-eventhub-checkpointstoreblob==1.1.0
  • Operating System: Linux
  • Python Version: 3.6

Describe the bug
Having 6 instances of a process consuming records from an EventHub topic with 32 partitions (using the same consumer group) results in every instance being assigned to 5 partitions, with 32 - 6 * 5 = 2 partitions staying unassigned to any consumer, even hours after the processes have been launched. No partitions remain unassigned if either 4 or 8 instances are used instead (we hypothesize the reason is that 32 % 8 == 0, 32 % 4 == 0, but 32 % 6 != 0).

To Reproduce
Steps to reproduce the behavior:

  1. Fill in the EVENT_HUB_MONITOR_* constants within the given Python code with proper values.
  2. Execute 6 instances of the given Python code in parallel.
  3. Put the output of every instance together to figure out how many partitions remain unassigned.
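Step 3 can be done by hand, but a small helper makes the check less error-prone. The following is a minimal sketch, not part of the attached program: it assumes you have already collected, per instance, the partition IDs that instance reported as claimed. The function name `unassigned_partitions` and the sample data are hypothetical.

```python
from typing import Dict, Iterable, List


def unassigned_partitions(
    num_partitions: int, claimed_by_instance: Dict[str, Iterable[str]]
) -> List[str]:
    """Return the partition IDs that no instance reported as claimed."""
    # Event Hubs partition IDs are strings: "0", "1", ..., str(num_partitions - 1).
    all_ids = {str(i) for i in range(num_partitions)}
    claimed = set()
    for partition_ids in claimed_by_instance.values():
        claimed.update(partition_ids)
    # Sort numerically so that "10" comes after "9".
    return sorted(all_ids - claimed, key=int)


# Illustrative placeholder data, NOT the output shown in the screenshots.
example = {
    "instance-a": ["0", "1", "2"],
    "instance-b": ["3", "4"],
    "instance-c": ["5"],
}
print(unassigned_partitions(8, example))  # -> ['6', '7']
```

For the scenario in this report, you would call it with `num_partitions=32` and the six per-instance lists.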

The source code of the program follows:

EventHubMonitor.txt

Expected behavior
The expected behavior would be that 4 instances are assigned to 5 partitions, and 2 instances are assigned to 6 partitions, for a total of 4 * 5 + 2 * 6 = 32 assigned partitions.
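The arithmetic behind this expectation can be sketched as follows. This is only an illustration of an even split, not the library's load balancing code; `expected_partition_counts` is a hypothetical name.

```python
from typing import List


def expected_partition_counts(num_partitions: int, num_instances: int) -> List[int]:
    """Split partitions as evenly as possible across instances."""
    # Every instance gets `base` partitions; `extra` instances get one more.
    base, extra = divmod(num_partitions, num_instances)
    return [base + 1] * extra + [base] * (num_instances - extra)


counts = expected_partition_counts(32, 6)
print(counts, sum(counts))  # -> [6, 6, 5, 5, 5, 5] 32
```

With 4 or 8 instances the remainder is zero, so every instance gets the same count, which matches the cases where no partitions were left unassigned.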

Screenshots
I include screenshots that show the output of each one of the 6 instances after 5 minutes have passed since they were launched.

Screenshot_20200903_164612
Screenshot_20200903_164635
Screenshot_20200903_164647
Screenshot_20200903_164655
Screenshot_20200903_164704
Screenshot_20200903_164711

Additional context
While we are using the Azure Blob check-pointing mechanism to balance the consumers, we do not use it to commit reading offsets. In practice, to run several instances of this process in parallel we use a ReplicationController within a Kubernetes cluster.
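For readers unfamiliar with this setup, here is a rough sketch of what "balance via the blob store, but never commit offsets" can look like. It is not the attached EventHubMonitor program. The function `run`, its parameter names, and the `events_seen` counter are hypothetical, and starting from the beginning of each partition (`"-1"`) is an assumption made for the example.

```python
from collections import Counter

# Per-partition event counts, kept only in memory for this sketch.
events_seen = Counter()


def on_event(partition_context, event):
    # Process the event, but deliberately never call
    # partition_context.update_checkpoint(event): no offsets are committed.
    events_seen[partition_context.partition_id] += 1


def run(eventhub_conn_str, eventhub_name, consumer_group,
        storage_conn_str, container_name):
    # Imported lazily so the callback above can be used without the SDK installed.
    from azure.eventhub import EventHubConsumerClient
    from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

    # The blob container stores partition ownership records, which is
    # what lets several instances balance partitions among themselves.
    store = BlobCheckpointStore.from_connection_string(
        storage_conn_str, container_name
    )
    client = EventHubConsumerClient.from_connection_string(
        eventhub_conn_str,
        consumer_group=consumer_group,
        eventhub_name=eventhub_name,
        checkpoint_store=store,
    )
    with client:
        # Blocks and receives from whichever partitions this instance owns.
        client.receive(on_event=on_event, starting_position="-1")
```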

@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Sep 3, 2020
@kaerm kaerm added Client This issue points to a problem in the data-plane of the library. Event Hubs labels Sep 3, 2020
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Sep 3, 2020
@KieranBrantnerMagee
Member

Thanks for reaching out, josago, and for the level of detail you provided.

While we investigate this, I'd suggest trying the 5.2.0b1 release (pass --pre --upgrade when pip installing), as it contains some changes to the load balancing algorithm that may help here.

@saadansarithefirst

Hello - are there any updates expected on this bug? It seems to be a major issue.

@KieranBrantnerMagee
Member

Hey @saadansarithefirst ,
Per the earlier comment, 5.2.0 (then in preview, since released) underwent a significant overhaul of the load balancing algorithms that may have incidentally addressed this issue. If you've seen the same symptoms on this latest release, then yes, this is a major issue that we'll prioritize accordingly. No pressure if it won't be easy for you to try, as I do intend to repro this in the next few days as time allows; the steps seem clear enough.

Thanks for your patience/any info that you have.

@yunhaoling yunhaoling added the needs-author-feedback Workflow: More information is needed from author to address the issue. label Dec 9, 2020
KieranBrantnerMagee added a commit to KieranBrantnerMagee/azure-sdk-for-python that referenced this issue Dec 14, 2020
…te issue Azure#13546 having been fixed. (it seems to have been.)
@KieranBrantnerMagee
Member

Following up on this (thanks all for your patience): the good news is that our assumptions appear to have been on point, and the new load balancing seems to have addressed this. See the tests here

I would mention that the tests assume use of the Greedy load balancing strategy, which most effectively deals with the problem you're describing (quickly ensuring all partitions are claimed). I am closing this for now in the hope, and on the assumption, that this approach addresses the aforementioned repro as tested, but if I've misunderstood or the issue recurs, do not hesitate to loop back on this and give us a shout.
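As a rough illustration of opting into the Greedy strategy, the sketch below passes `load_balancing_strategy` when constructing the consumer. It assumes azure-eventhub 5.2.0 or later, where to my understanding this keyword accepts either the string "greedy" or `LoadBalancingStrategy.GREEDY`. The helper names `greedy_consumer_kwargs` and `build_greedy_consumer` are hypothetical and are not taken from the tests referenced above.

```python
from typing import Any, Dict


def greedy_consumer_kwargs(
    consumer_group: str, eventhub_name: str, checkpoint_store: Any
) -> Dict[str, Any]:
    """Keyword arguments for a consumer that claims partitions greedily."""
    return {
        "consumer_group": consumer_group,
        "eventhub_name": eventhub_name,
        # A shared checkpoint store is what enables load balancing at all.
        "checkpoint_store": checkpoint_store,
        # Assumption: requires azure-eventhub >= 5.2.0.
        "load_balancing_strategy": "greedy",
    }


def build_greedy_consumer(conn_str, consumer_group, eventhub_name, checkpoint_store):
    # Imported lazily so the helper above works without the SDK installed.
    from azure.eventhub import EventHubConsumerClient

    return EventHubConsumerClient.from_connection_string(
        conn_str,
        **greedy_consumer_kwargs(consumer_group, eventhub_name, checkpoint_store),
    )
```

Keeping the keyword arguments in a separate function makes it easy to switch the strategy string to "balanced" when comparing behaviour across the two strategies.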

yunhaoling added a commit that referenced this issue Dec 21, 2020
…te issue #13546 having been fixed. (it seems to have been.) (#15786)

Co-authored-by: Yunhao Ling <adam_ling@outlook.com>
rakshith91 pushed a commit to rakshith91/azure-sdk-for-python that referenced this issue Jan 8, 2021
…te issue Azure#13546 having been fixed. (it seems to have been.) (Azure#15786)

Co-authored-by: Yunhao Ling <adam_ling@outlook.com>
@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023