
Retry with backoff on cluster connection failures #2358

Merged: 46 commits merged into redis:master from walles:j/backoff on Mar 31, 2021

Conversation

@walles (Contributor) commented Jan 29, 2021

Before this change, if there were connection failures to the cluster, we did all our retries without any backoff.

With this change in place:

  • For the first third of our maxAttempts, we keep the previous no-backoff tactic (see the shouldBackOff() method)
  • After that, we start backing off as determined by the getBackoffSleepMillis() method (a rough sketch of this policy follows below)

Additionally, this change adds unit tests for the retries / backoff logic.
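
A minimal sketch of the described policy, for illustration only: the method names shouldBackOff() and getBackoffSleepMillis() come from the PR description, but the surrounding class, fields, and method bodies here are assumptions, not the code merged in this PR.

import java.time.Duration;
import java.time.Instant;

// Illustrative sketch only; mirrors the behavior described above, not the
// exact implementation in JedisClusterCommand.
class ClusterRetryBackoffSketch {
  private final int maxAttempts;

  ClusterRetryBackoffSketch(int maxAttempts) {
    this.maxAttempts = maxAttempts;
  }

  boolean shouldBackOff(int attemptsLeft) {
    // Keep the old no-sleep behavior for roughly the first third of the
    // attempts, then start backing off.
    int attemptsUsed = maxAttempts - attemptsLeft;
    return attemptsUsed >= maxAttempts / 3;
  }

  long getBackoffSleepMillis(int attemptsLeft, Instant deadline) {
    // Spread the remaining time budget over the remaining attempts so the
    // sleeps cannot push the operation past the retry deadline.
    if (attemptsLeft <= 0) {
      return 0;
    }
    long millisLeft = Duration.between(Instant.now(), deadline).toMillis();
    if (millisLeft <= 0) {
      return 0;
    }
    return millisLeft / (attemptsLeft * (attemptsLeft + 1L));
  }
}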

This change is based on the changes in #2355 (approved, not yet merged, currently waiting for more reviewers).

Johan Walles and others added 5 commits January 25, 2021 09:11
No behavior changes, just a refactoring.

Changes:
* Replace recursion with a for loop
* Extract redirection handling into its own method
* Extract connection-failed handling into its own method

Note that `tryWithRandomNode` is gone; it was never `true`, so it and its
code didn't survive the refactoring.
Inspired by redis#1334 where this went real easy :).

Would have made redis#2355 shorter.

Free public updates for JDK 7 ended in 2015:
<https://en.wikipedia.org/wiki/Java_version_history>

For JDK 8, free public support is available from non-Oracle vendors until
at least 2026, according to the same table.

And JDK 8 is what Jedis is being tested on anyway:
<https://github.com/redis/jedis/blob/ac0969315655180c09b8139c16bded09c068d498/.circleci/config.yml#L67-L74>
walles marked this pull request as draft on January 29, 2021 07:51
walles marked this pull request as ready for review on February 1, 2021 12:58
@walles (Contributor, Author) commented Feb 1, 2021

✅ 👀 Ready for review!

@sazzad16 (Collaborator) left a comment

This PR breaks backward compatibility. Breaking backward compatibility means it won't be released until the next major release. As of this moment, the next major release for Jedis is 4.0.0 which, as you can imagine, is a long way off.

Try to find a backward compatible solution. Don't make the code too ugly for that purpose though :)

src/main/java/redis/clients/jedis/JedisClusterCommand.java (review thread outdated, resolved)
src/main/java/redis/clients/jedis/BinaryJedisCluster.java (review thread outdated, resolved)
@walles (Contributor, Author) commented Feb 2, 2021

Thank you for the quick turnaround on the review; I really appreciate it, @sazzad16!

@walles (Contributor, Author) commented Feb 2, 2021

Try to find a backward compatible solution. Don't make the code too ugly for that purpose though :)

Another constructor is needed either way (I think).

But if #2364 were merged before this PR, that constructor could be made private and wouldn't clutter the public API.

/**
* Default timeout in milliseconds.
*/
public static final int DEFAULT_TIMEOUT = 2000;
@walles (Contributor, Author) commented:

`public` makes these reachable from JedisClusterCommand.java for its default timeout.

* consider connection exceptions and disregard random nodes

* reset redirection
@sazzad16 (Collaborator) commented, replying to @yangbodong22011:

maxTotalRetriesDuration should be an independent configuration in JedisClientConfig

I disagree. Firstly, it doesn't fit there. Secondly, when we try to improve this (targeting Jedis 4.0.0), it would mess up the config interface and/or could be bottlenecked by it.

users may set maxTotalRetriesDuration > timeout

Exactly. This is one of our ultimate goals. But #2377 and mp911de's comment make me think that a good enough solution is likely to be a breaking change and thus targeting 4.0.0. This PR at least brings (somewhat non-customizable) sleep time to 3.x. Considering we don't have any sort of sleep, something is better than nothing.

Some JedisCluster commands, such as copy, getDel, getEx, do not have the configuration of maxTotalRetriesDuration

It's just that those commands were implemented and merged after this PR was crafted, and a simple git merge doesn't add them. We'll always have time to add those.

@@ -85,7 +100,10 @@ public T runWithAnyNode() {
}

private T runWithRetries(final int slot) {
Instant deadline = Instant.now().plus(maxTotalRetriesDuration);
@gkorland (Contributor) commented:

Taking the time on each successful call seems like a waste and might impact performance.

A Collaborator replied:

@gkorland According to https://www.alibabacloud.com/blog/performance-issues-related-to-localdatetime-and-instant-during-serialization-operations_595605

Throughput of Instant.now+atZone+format+DateTimeFormatter.ofPattern is 6816922.578 ops/sec.
Without any formatting, throughput of Instant.now+plus should be much higher. Shouldn't it be enough?
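
To make the concern concrete, here is a simplified sketch of the deadline pattern under discussion: the clock is read once when the command starts, and again only when a retry is actually needed. The class, field, and method names here are illustrative assumptions, not the merged JedisClusterCommand code.

import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Hypothetical sketch of the deadline pattern; not the actual Jedis code.
class DeadlineRetrySketch {
  private final Duration maxTotalRetriesDuration;
  private final int maxAttempts;

  DeadlineRetrySketch(Duration maxTotalRetriesDuration, int maxAttempts) {
    this.maxTotalRetriesDuration = maxTotalRetriesDuration;
    this.maxAttempts = maxAttempts;
  }

  <T> T runWithRetries(Supplier<T> command) {
    // One Instant.now() per command invocation on the happy path.
    Instant deadline = Instant.now().plus(maxTotalRetriesDuration);
    RuntimeException lastFailure = null;
    for (int attemptsLeft = maxAttempts; attemptsLeft > 0; attemptsLeft--) {
      try {
        return command.get();
      } catch (RuntimeException connectionFailure) {
        lastFailure = connectionFailure;
        // The clock is consulted again only after a failure, to see whether
        // any retry budget remains before the deadline.
        if (!Instant.now().isBefore(deadline)) {
          break;
        }
      }
    }
    if (lastFailure == null) {
      throw new IllegalStateException("maxAttempts must be positive");
    }
    throw lastFailure;
  }
}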

@@ -69,6 +81,7 @@ public BinaryJedisCluster(Set<HostAndPort> jedisClusterNode, int connectionTimeo
this.connectionHandler = new JedisSlotBasedConnectionHandler(jedisClusterNode, poolConfig,
connectionTimeout, soTimeout, user, password, clientName);
this.maxAttempts = maxAttempts;
this.maxTotalRetriesDuration = Duration.ofMillis(soTimeout);
@gkorland (Contributor) commented:

Why are we connecting soTimeout with maxTotalRetriesDuration?
I think we should have a separate argument for maxTotalRetriesDuration, and perhaps one for enabling backoff at all, to avoid backward-compatibility issues (at least in 3.6).

@sazzad16 (Collaborator) commented Mar 28, 2021:

@gkorland

Why are we connecting soTimeout with maxTotalRetriesDuration?

Because the max duration for one single try is soTimeout.

It should be multiplied by maxAttempts, though.
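
A tiny sketch of the default being argued for here, assuming soTimeout is given in milliseconds; the class and method names are hypothetical and only illustrate the arithmetic.

import java.time.Duration;

// Hypothetical helper: one try may take up to soTimeout, so the overall
// retry budget defaults to maxAttempts tries' worth of it.
final class RetryBudgetDefaults {
  static Duration defaultMaxTotalRetriesDuration(int maxAttempts, int soTimeoutMillis) {
    return Duration.ofMillis((long) maxAttempts * soTimeoutMillis);
  }
}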

@yangbodong22011 (Collaborator) commented:

I disagree. Firstly, it doesn't fit there. Secondly, when we try to improve this (targeting Jedis 4.0.0), it would mess up the config interface and/or could be bottlenecked by it.

@sazzad16 Okay, if we have an improvement plan then I agree to continue, but I still think the default value of maxTotalRetriesDuration should be maxAttempts * soTimeout, not equal to soTimeout.

It's just that those commands were implemented and merged after this PR was crafted, and a simple git merge doesn't add them. We'll always have time to add those.

This is the responsibility of this PR, and maxTotalRetriesDuration should be added to the new commands before merging.

@sazzad16 (Collaborator) commented, replying to @yangbodong22011:

the default value of maxTotalRetriesDuration should be maxAttempts * soTimeout

agreed

maxTotalRetriesDuration should be added to the new command before merged

We can do this after the PR is approved.

@sazzad16 (Collaborator) commented:

@gkorland @yangbodong22011 Please check #2490. Hopefully that PR addresses your concerns.

Merge commit resolving conflicts in:
	src/main/java/redis/clients/jedis/BinaryJedisCluster.java
	src/main/java/redis/clients/jedis/JedisCluster.java
sazzad16 merged commit 270bb71 into redis:master on Mar 31, 2021
walles deleted the j/backoff branch on March 31, 2021 07:47
@walles (Contributor, Author) commented Mar 31, 2021

🥳
