Don't always send update metadata requests to the same broker #395
We have been trying to use KafkaEx with an app that generates high produce rates.
At first we tried to use one KafkaEx worker (the default one) but found that a few moments after starting the app, produce requests started to time out. One worker wasn't able to keep up with the rate of incoming produce requests, so its mailbox started to fill up.
Next we tried a pool of workers, one per topic and partition, as Brod does. The app was now stable, but we noticed something odd with the brokers: one of them always had significantly higher system load and network traffic (bytes out) than the others. After investigating, we found that the extra load came from the periodic metadata update requests made by all the workers.
For requests like fetching metadata and api_versions, KafkaEx iterates through every broker it knows about until it gets a successful response. It normally tries the brokers in the same sequence every time, and the first one usually succeeds, so the first broker in the list receives a disproportionate share of the load.
In this PR we randomize the broker list before sending any requests in order to spread the load of update metadata requests evenly across all brokers.
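To illustrate the idea, here is a minimal sketch of the "try each broker until one succeeds" pattern, with and without shuffling. It is not KafkaEx's actual code: the module name, function names, and the shape of the `send_request` callback are all hypothetical and chosen only for this example.

```elixir
defmodule BrokerSelectionSketch do
  # Hypothetical helper (not KafkaEx's implementation): tries each broker
  # in list order and returns the first successful response.
  def first_response(brokers, send_request) do
    Enum.find_value(brokers, {:error, :no_broker_available}, fn broker ->
      case send_request.(broker) do
        {:ok, response} -> {:ok, broker, response}
        _error -> nil
      end
    end)
  end

  # Same as above, but shuffles the list first so that each call starts
  # from a randomly chosen broker.
  def first_response_shuffled(brokers, send_request) do
    brokers
    |> Enum.shuffle()
    |> first_response(send_request)
  end
end

# Assumed stand-ins for real brokers and a real request function.
brokers = [:broker_a, :broker_b, :broker_c]
always_ok = fn broker -> {:ok, {:metadata_from, broker}} end

# Without shuffling, a healthy first broker answers every time.
{:ok, :broker_a, _} = BrokerSelectionSketch.first_response(brokers, always_ok)

# With shuffling, the answering broker varies from call to call.
counts =
  1..3000
  |> Enum.map(fn _ ->
    {:ok, broker, _} = BrokerSelectionSketch.first_response_shuffled(brokers, always_ok)
    broker
  end)
  |> Enum.frequencies()

IO.inspect(counts, label: "requests per broker")
```

With many workers each refreshing metadata periodically, the shuffled variant should spread requests roughly evenly across the brokers, while the failover behaviour (moving on to the next broker on error) stays the same.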
Testing:
All tests passed locally.
I manually tested the behaviour by logging the broker list in first_broker_response().