-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry with backoff on cluster connection failures #2358
Conversation
No behavior changes, just a refactoring. Changes: * Replaces recursion with a for loop * Extract redirection handling into its own method * Extract connection-failed handling into its own method Note that `tryWithRandomNode` is gone, it was never `true` so it and its code didn't survive the refactoring.
Inspired by redis#1334 where this went real easy :). Would have made redis#2355 shorter. Free public updates for JDK 7 ended in 2015: <https://en.wikipedia.org/wiki/Java_version_history> For JDK 8, free public support is available from non-Orace vendors until at least 2026 according to the same table. And JDK 8 is what Jedis is being tested on anyway: <https://github.com/redis/jedis/blob/ac0969315655180c09b8139c16bded09c068d498/.circleci/config.yml#L67-L74>
✅ 👀 Ready for review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR breaks backward compatibility. Breaking backward compatibility means it won't be released until next major release. As of this moment, next major release for Jedis is 4.0.0 which, you can imagine, is a long away.
Try to find a backward compatible solution. Don't make the code too ugly for that purpose though :)
Thank you for your short turnaround time in reviewing, I really appreciate that @sazzad16! |
Another constructor is needed either way (I think). But if #2364 would get merged before this PR, that constructor could be made |
/** | ||
* Default timeout in milliseconds. | ||
*/ | ||
public static final int DEFAULT_TIMEOUT = 2000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public
makes these reachable from JedisClusterCommand.java
for its default timeout.
* consider connection exceptions and disregard random nodes * reset redirection
I disagree. Firstly, it doesn't suit there. Secondly, when we'd try to improve this (targeting Jedis 4.0.0), this would mess the config interface and/or could be bottlenecked by it.
Exactly. This is one of our ultimately goal. But #2377 and mp911de's comment, makes me think that a good enough solution is likely to be a breaking change and thus be targeting 4.0.0. This PR is at least bringing (somewhat not-customizable) sleep time in 3.x. Considering we don't have any sort of sleep, something should be better than nothing.
It's just that those commands were implemented & merged after this PR is crafted and simple git merge doesn't add those. We'll always have time to add those. |
@@ -85,7 +100,10 @@ public T runWithAnyNode() { | |||
} | |||
|
|||
private T runWithRetries(final int slot) { | |||
Instant deadline = Instant.now().plus(maxTotalRetriesDuration); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking the the time on each successful call seems like a waste and might impact the performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gkorland According to https://www.alibabacloud.com/blog/performance-issues-related-to-localdatetime-and-instant-during-serialization-operations_595605
Throughput of Instant.now+atZone+format+DateTimeFormatter.ofPattern is 6816922.578 ops/sec.
Without any formatting, throughput of Instant.now+plus should be much higher. Shouldn't it be enough?
@@ -69,6 +81,7 @@ public BinaryJedisCluster(Set<HostAndPort> jedisClusterNode, int connectionTimeo | |||
this.connectionHandler = new JedisSlotBasedConnectionHandler(jedisClusterNode, poolConfig, | |||
connectionTimeout, soTimeout, user, password, clientName); | |||
this.maxAttempts = maxAttempts; | |||
this.maxTotalRetriesDuration = Duration.ofMillis(soTimeout); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we connecting soTimeout with maxTotalRetriesDuration?
I think we should have a separate argument for the maxTotalRetriesDuration and perhaps for enabling backoff at all to avoid backward issues (at least in 3.6)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we connecting soTimeout with maxTotalRetriesDuration?
Because max duration for one single try is soTimeout.
It should be multiplied by maxAttempts, though.
@sazzad16 Okay, If we have an improvement plan, then I agree to continue, but I still think the default value of
This is the responsibility of this PR, and maxTotalRetriesDuration should be added to the new command before merged. |
agreed
We can do this after the PR is approved. |
@gkorland @yangbodong22011 Please check #2490. Hopefully that PR addresses your concerns. |
Conflicts: src/main/java/redis/clients/jedis/BinaryJedisCluster.java src/main/java/redis/clients/jedis/JedisCluster.java
🥳 |
Before this change, if there were connection failures to the cluster, we did all our retries without any backoff.
With this change in place:
maxAttempts
(see theshouldBackOff()
method)getBackoffSleepMillis()
methodAdditionally, this change adds unit tests for the retries / backoff logic.
This change is based on the changes in #2355 (approved, not yet merged, currently waiting for more reviewers).