
Allow Apache Kafka scaler to scale using sum of lag for all topics within a consumer group #2409

Conversation

PaulLiang1
Contributor

@PaulLiang1 PaulLiang1 commented Dec 18, 2021

Allow the Kafka scaler to use the sum of lag across all topic partitions when no topic is supplied.
This is useful when the consumer is subscribed to multiple topics.
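For illustration, a ScaledObject trigger relying on this behaviour might look like the sketch below. The broker address, consumer group name, and threshold value are hypothetical; only the field names follow the KEDA Kafka scaler metadata:

```yaml
triggers:
  - type: kafka
    metadata:
      bootstrapServers: my-kafka:9092     # hypothetical broker address
      consumerGroup: my-consumer-group    # hypothetical group name
      # topic intentionally omitted: lag is summed across all topics
      # for which this consumer group has committed offsets
      lagThreshold: "50"
```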

Checklist

@JorTurFer
Member

JorTurFer commented Dec 20, 2021

/run-e2e kafka.test*
Update: You can check the progress here

Member

@JorTurFer JorTurFer left a comment


I'm not an expert in Kafka but LGTM.
Could you add a unit test checking the new metric name, please?
You can add the test case here

Review thread on CHANGELOG.md (resolved)
@JorTurFer
Member

hi @PaulLiang1
The e2e test is failing; could you review it? You can find the logs here

@PaulLiang1 PaulLiang1 force-pushed the allow-kafka-scaler-to-scale-without-a-specific-topic branch from 68650ca to d745707 Compare December 20, 2021 10:41
@JorTurFer
Member

JorTurFer commented Dec 20, 2021

/run-e2e kafka.test*
Update: You can check the progress here

@PaulLiang1
Contributor Author

> /run-e2e kafka.test* Update: You can check the progress here

Hi @JorTurFer, thanks for your help.
The test failed again. I will debug it further and update the ticket once I have more info.

@JorTurFer
Member

> The test failed again. I will debug it further and update the ticket once I have more info.

Thanks @PaulLiang1,
No rush at all; whenever you have some time

@PaulLiang1
Contributor Author

> The test failed again. I will debug it further and update the ticket once I have more info.
>
> Thanks @PaulLiang1, no rush at all; whenever you have some time

Hi @JorTurFer, it turns out I previously had some misunderstanding of lagThreshold.
I've updated the e2e test. Would you mind triggering another run for me? Thanks

@JorTurFer
Member

JorTurFer commented Dec 20, 2021

/run-e2e kafka.test*
Update: You can check the progress here

@PaulLiang1
Contributor Author

> /run-e2e kafka.test* Update: You can check the progress here

Hi @JorTurFer, the tests passed. Would you mind taking another look at the PR? Thanks

@PaulLiang1 PaulLiang1 requested a review from JorTurFer December 20, 2021 22:55
@JorTurFer JorTurFer requested a review from a team December 22, 2021 07:44
Member

@JorTurFer JorTurFer left a comment


LGTM!
Just one small suggestion

Review thread on pkg/scalers/kafka_scaler.go (resolved)
@JorTurFer JorTurFer requested a review from a team December 22, 2021 07:56
@PaulLiang1 PaulLiang1 requested review from JorTurFer and removed request for a team December 22, 2021 12:14
@JorTurFer
Member

JorTurFer commented Dec 22, 2021

/run-e2e kafka.test*
Update: You can check the progress here

Member

@JorTurFer JorTurFer left a comment


LGTM!
Thanks for this contribution ❤️
(in any case, let's wait until another pair of eyes takes a look at this)

@JorTurFer JorTurFer requested a review from a team December 22, 2021 12:55
@bpinske
Contributor

bpinske commented Dec 23, 2021

Any particular reason you chose to implement this, rather than setting up different kafka triggers for each topic individually, all within the same scaledObject?

I have no idea what your applications do, but I'd imagine the different Kafka topics produce different messages that get consumed in very different ways. Is it really valuable to have a mixed measure that doesn't discriminate based on what the messages really are?

I do have consumers that consume from multiple topics, and I've had good success with tracking each topic separately with different thresholds set to account for different volumes/computational intensity to process the different message types.

@PaulLiang1
Contributor Author

> Any particular reason you chose to implement this, rather than setting up different kafka triggers for each topic individually, all within the same scaledObject?
>
> I have no idea what your applications do, but I'd imagine the different Kafka topics produce different messages that get consumed in very different ways. Is it really valuable to have a mixed measure that doesn't discriminate based on what the messages really are?
>
> I do have consumers that consume from multiple topics, and I've had good success with tracking each topic separately with different thresholds set to account for different volumes/computational intensity to process the different message types.


Thanks for the discussion.

Context:

  • We have a homogeneous topic setup: a 1:1 relationship between topic and event type;
  • The specific app we are talking about here needs to consume ~60 event types, which from the business's perspective are of similar "importance";
  • These events are also ingested from multiple clusters in different geographic locations and processed in a central aggregation cluster via MirrorMaker;
  • MirrorMaker by default prefixes the mirrored/destination topic with the source cluster name, so with ~60 topics of different event types across 8 regions, the total number of topics could reach 480;
  • In the actual deployment we do bulkhead the application into different consumer groups / deployments for DR reasons, but from the business's perspective they are equally important;
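As a concrete illustration of the topic explosion described above, here is a small Go sketch of how mirrored topic counts multiply. The cluster and topic names are hypothetical; MirrorMaker 2's default replication policy names mirrored topics `<sourceCluster>.<topic>`:

```go
package main

import "fmt"

// mirroredTopics builds the mirrored topic names MirrorMaker 2's default
// replication policy would produce: "<sourceCluster>.<topic>".
func mirroredTopics(regions, topics []string) []string {
	var out []string
	for _, r := range regions {
		for _, t := range topics {
			out = append(out, fmt.Sprintf("%s.%s", r, t))
		}
	}
	return out
}

func main() {
	// Hypothetical names; the real setup has 8 regions and ~60 event topics,
	// which yields 8 * 60 = 480 mirrored topics.
	regions := []string{"us-east", "us-west", "eu-central"}
	topics := []string{"order-created", "order-shipped"}
	m := mirroredTopics(regions, topics)
	fmt.Println(len(m), m[0]) // 6 us-east.order-created
}
```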

> ...setting up different kafka triggers for each topic individually...

  • Due to the number of topics it needs to consume, it became an operational burden to list all the topics in the trigger, especially since the consumer could be using wildcards to "auto discover" the topics.

> ...Is it really valuable to be having a mixed measure that doesn't discriminate based on what the messages really are...

  • It would be great if that decision could be made by the user, instead of being enforced by the scaler;
  • Though this could be a good indicator that the docs should be updated to highlight the differences between these approaches.

> ...I've had good success with tracking each topic separately with different thresholds set to account for different volumes...

  • This change does not forbid that; it provides an alternative that people can adapt to their scenario.

@bpinske
Contributor

bpinske commented Dec 24, 2021


I'll admit I never considered the possibility of somebody consuming from 480 topics at once :)

It would definitely be good for the docs to emphasize the autodiscovery behaviour that leaving the topic string empty enables. I've had problems before in almost the inverse situation, where I was accidentally scaling based off the total lag of all consumer groups on a topic.

One limitation I can think of is that the autodiscovery is limited to a single Kafka cluster: all topics you discover must be present within the same cluster. You could, of course, just supply multiple Kafka scaler triggers, each autodiscovering a different cluster if necessary. Just something else to maybe highlight in the docs.

I suppose the other thing to be aware of is that with 480 topics, any significant number of partitions per topic will quickly explode to a very large total number of partitions that must be queried. There is currently another active PR to ensure querying brokers is concurrent. By the time this PR gets included in the next public release, concurrency should alleviate any performance concerns there.

The PR seems sane to me.

@PaulLiang1
Contributor Author

> It would definitely be good for the docs to emphasize the autodiscovery behaviour that leaving the topic string empty enables. I've had problems before in almost the inverse situation, where I was accidentally scaling based off the total lag of all consumer groups on a topic.

PR for doc: kedacore/keda-docs#613
But it does not yet highlight the difference between autodiscovery and multi-trigger setups;

> ...limited to a single Kafka cluster...

In the example provided above, topics from other geo locations are mirrored into a single aggregation cluster, and the consumer only consumes from that single cluster.

> ...just supply multiple kafka scaler triggers each autodiscovering a different cluster if necessary...

Correct. This change does not forbid that.

@zroubalik
Member

> But it does not yet highlight the difference between autodiscovery and multi-trigger setups;

@PaulLiang1 do you think you can add a few words about this to the docs, so users are aware of the consequences?

Member

@zroubalik zroubalik left a comment


@PaulLiang1 Could you please rebase your PR? There are conflicts because of #2409

Thanks!

@PaulLiang1
Contributor Author

Sure, sorry, I'm just back from holidays. I will work on it in the next few days.

Signed-off-by: Jinli Liang <paul.liang@rokt.com>
@PaulLiang1 PaulLiang1 force-pushed the allow-kafka-scaler-to-scale-without-a-specific-topic branch from 36a3ae4 to f8b6735 Compare January 11, 2022 00:27
@PaulLiang1
Contributor Author

> But it does not yet highlight the difference between autodiscovery and multi-trigger setups;
>
> @PaulLiang1 do you think you can add a few words about this to the docs, so users are aware of the consequences?

Hi @zroubalik

  • The changes have been rebased; would you mind kicking off another run of the integration tests for me? Thanks;
  • The failing GitHub Action does not seem related to my change;
    it complains that The unauthenticated git protocol on port 9418 is no longer supported.
    What would be the best course of action to address this?
  • The doc has been updated with extra info; would you mind taking a look?

@zroubalik
Member

zroubalik commented Jan 11, 2022

/run-e2e kafka.test*
Update: You can check the progress here

Member

@zroubalik zroubalik left a comment


LGTM

Once the Doc PR is fixed, we can merge this one. Great job @PaulLiang1

@zroubalik zroubalik merged commit 92c75bc into kedacore:main Jan 12, 2022
@shaswa

shaswa commented Apr 27, 2022

@PaulLiang1 Thank you for this! What would be the maximum replicas? Is it the sum of partitions of all the topics in a consumer group?

For example: a consumer group has Topic 1 with 10 partitions and Topic 2 with 6 partitions. Will it scale up to 16 pods or 10 pods?

Edit: Oops, just found the answer to my question as soon as I posted this!

@PaulLiang1
Contributor Author

@shaswa

  • when allowIdleConsumers=true, it scales to ceil(totalCGLag / lagThreshold), with no cap based on partition count;
  • when allowIdleConsumers=false, the replica count is capped at the total number of partitions in the consumer group, i.e. min(ceil(totalCGLag / lagThreshold), totalPartitionsInCG); using the example outlined, the maximum would be 16
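The two modes above can be sketched in Go as follows. This is an illustrative simplification, not the actual KEDA source; the function and parameter names are hypothetical:

```go
package main

import "fmt"

// targetReplicas sketches how the replica count is derived from the summed
// consumer-group lag. With allowIdleConsumers=false the result is capped at
// the partition count, so no consumer sits idle.
func targetReplicas(totalLag, lagThreshold, totalPartitions int64, allowIdleConsumers bool) int64 {
	replicas := (totalLag + lagThreshold - 1) / lagThreshold // ceil(totalLag / lagThreshold)
	if !allowIdleConsumers && replicas > totalPartitions {
		replicas = totalPartitions // cap at one consumer per partition
	}
	return replicas
}

func main() {
	// 10 + 6 = 16 partitions, lag of 5000, threshold 50.
	fmt.Println(targetReplicas(5000, 50, 16, false)) // 16 (capped at partitions)
	fmt.Println(targetReplicas(5000, 50, 16, true))  // 100 (no cap)
}
```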

ref:
