Support timeout when fetching metadata #1359
Merged
Versions
Sarama Version: v1.22.0-6-g2a49b70 (2a49b70)
Kafka Version: 1.0.2
Go Version: 1.12
Configuration
What configuration values are you using for Sarama and Kafka?
Default configuration, see unit tests.
Problem Description
Fetching metadata from an unreachable cluster of 2 brokers using the default configuration (30s DialTimeout, 3 retries, 250ms backoff) can take more than 4 minutes before failing with ErrOutOfBrokers.
The following logs can be seen when running the provided unit test against master with unresponsive seed brokers (with a TCP read timeout on 127.0.0.1:63168 and 127.0.0.1:63169 to simulate an unreachable cluster with a TCP dial timeout):
This time grows exponentially with the number of brokers because of the following formula:
Solution
The proposed solution is to add a new configuration option Metadata.Timeout to fail faster.
Changes
Add a Metadata.Timeout configuration option to fail faster (disabled by default for backward compatibility)
Testing done
See the provided unit tests, which show ErrOutOfBrokers being returned much faster.
We have been using this approach successfully in production (based on a different Sarama release) to fail over a producer from one Kafka cluster to another.
When one cluster becomes unreachable over the WAN, we can fail over to another region in less than 30 seconds using the following configuration: