
Retry with backoff on cluster connection failures #2358

Merged: 46 commits merged into redis:master from walles:j/backoff on Mar 31, 2021

Conversation

@walles (Contributor) commented Jan 29, 2021

Before this change, if there were connection failures to the cluster, we did all our retries without any backoff.

With this change in place:

  • For the first third of our maxAttempts, we keep the previous no-backoff tactic (see the shouldBackOff() method)
  • After that, we start backing off as determined by the getBackoffSleepMillis() method (a rough sketch of this policy follows below)

Additionally, this change adds unit tests for the retries / backoff logic.
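
A minimal sketch of the described policy, for illustration only: the method names shouldBackOff() and getBackoffSleepMillis() come from the PR description, but the surrounding class, fields, and method bodies here are assumptions, not the code merged in this PR.

import java.time.Duration;
import java.time.Instant;

// Illustrative sketch only; mirrors the behavior described above, not the
// exact implementation in JedisClusterCommand.
class ClusterRetryBackoffSketch {
  private final int maxAttempts;

  ClusterRetryBackoffSketch(int maxAttempts) {
    this.maxAttempts = maxAttempts;
  }

  boolean shouldBackOff(int attemptsLeft) {
    // Keep the old no-sleep behavior for roughly the first third of the
    // attempts, then start backing off.
    int attemptsUsed = maxAttempts - attemptsLeft;
    return attemptsUsed >= maxAttempts / 3;
  }

  long getBackoffSleepMillis(int attemptsLeft, Instant deadline) {
    // Spread the remaining time budget over the remaining attempts so the
    // sleeps cannot push the operation past the retry deadline.
    if (attemptsLeft <= 0) {
      return 0;
    }
    long millisLeft = Duration.between(Instant.now(), deadline).toMillis();
    if (millisLeft <= 0) {
      return 0;
    }
    return millisLeft / (attemptsLeft * (attemptsLeft + 1L));
  }
}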

This change is based on the changes in #2355 (approved, not yet merged, currently waiting for more reviewers).

Johan Walles and others added 5 commits January 25, 2021 09:11
No behavior changes, just a refactoring.

Changes:
* Replace recursion with a for loop
* Extract redirection handling into its own method
* Extract connection-failed handling into its own method

Note that `tryWithRandomNode` is gone; it was never `true`, so it and its
code didn't survive the refactoring.
Inspired by redis#1334 where this went real easy :).

Would have made redis#2355 shorter.

Free public updates for JDK 7 ended in 2015:
<https://en.wikipedia.org/wiki/Java_version_history>

For JDK 8, free public support is available from non-Oracle vendors until
at least 2026, according to the same table.

And JDK 8 is what Jedis is being tested on anyway:
<https://github.com/redis/jedis/blob/ac0969315655180c09b8139c16bded09c068d498/.circleci/config.yml#L67-L74>
walles marked this pull request as draft on January 29, 2021 07:51
walles marked this pull request as ready for review on February 1, 2021 12:58
@walles (Contributor, Author) commented Feb 1, 2021

✅ 👀 Ready for review!

@sazzad16 (Collaborator) left a comment

This PR breaks backward compatibility. Breaking backward compatibility means it won't be released until the next major release. As of this moment, the next major release for Jedis is 4.0.0 which, as you can imagine, is a long way off.

Try to find a backward compatible solution. Don't make the code too ugly for that purpose though :)

src/main/java/redis/clients/jedis/JedisClusterCommand.java (review thread outdated, resolved)
src/main/java/redis/clients/jedis/BinaryJedisCluster.java (review thread outdated, resolved)
@walles (Contributor, Author) commented Feb 2, 2021

Thank you for the quick turnaround on the review; I really appreciate it, @sazzad16!

@walles (Contributor, Author) commented Feb 2, 2021

Try to find a backward compatible solution. Don't make the code too ugly for that purpose though :)

Another constructor is needed either way (I think).

But if #2364 were merged before this PR, that constructor could be made private and wouldn't clutter the public API.

/**
* Default timeout in milliseconds.
*/
public static final int DEFAULT_TIMEOUT = 2000;
@walles (Contributor, Author) commented:

`public` makes these reachable from JedisClusterCommand.java for its default timeout.

* consider connection exceptions and disregard random nodes

* reset redirection
@sazzad16 (Collaborator) commented, replying to @yangbodong22011:

maxTotalRetriesDuration should be an independent configuration in JedisClientConfig

I disagree. Firstly, it doesn't fit there. Secondly, when we try to improve this (targeting Jedis 4.0.0), it would mess up the config interface and/or could be bottlenecked by it.

users may set maxTotalRetriesDuration > timeout

Exactly. This is one of our ultimate goals. But #2377 and mp911de's comment make me think that a good enough solution is likely to be a breaking change and thus targeting 4.0.0. This PR at least brings (somewhat non-customizable) sleep time to 3.x. Considering we don't have any sort of sleep, something is better than nothing.

Some JedisCluster commands, such as copy, getDel, getEx, do not have the configuration of maxTotalRetriesDuration

It's just that those commands were implemented and merged after this PR was crafted, and a simple git merge doesn't add them. We'll always have time to add those.

@@ -85,7 +100,10 @@ public T runWithAnyNode() {
}

private T runWithRetries(final int slot) {
Instant deadline = Instant.now().plus(maxTotalRetriesDuration);
@gkorland (Contributor) commented:

Taking the time on each successful call seems like a waste and might impact performance.

A Collaborator replied:

@gkorland According to https://www.alibabacloud.com/blog/performance-issues-related-to-localdatetime-and-instant-during-serialization-operations_595605

Throughput of Instant.now+atZone+format+DateTimeFormatter.ofPattern is 6816922.578 ops/sec.
Without any formatting, throughput of Instant.now+plus should be much higher. Shouldn't it be enough?
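
To make the concern concrete, here is a simplified sketch of the deadline pattern under discussion: the clock is read once when the command starts, and again only when a retry is actually needed. The class, field, and method names here are illustrative assumptions, not the merged JedisClusterCommand code.

import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Hypothetical sketch of the deadline pattern; not the actual Jedis code.
class DeadlineRetrySketch {
  private final Duration maxTotalRetriesDuration;
  private final int maxAttempts;

  DeadlineRetrySketch(Duration maxTotalRetriesDuration, int maxAttempts) {
    this.maxTotalRetriesDuration = maxTotalRetriesDuration;
    this.maxAttempts = maxAttempts;
  }

  <T> T runWithRetries(Supplier<T> command) {
    // One Instant.now() per command invocation on the happy path.
    Instant deadline = Instant.now().plus(maxTotalRetriesDuration);
    RuntimeException lastFailure = null;
    for (int attemptsLeft = maxAttempts; attemptsLeft > 0; attemptsLeft--) {
      try {
        return command.get();
      } catch (RuntimeException connectionFailure) {
        lastFailure = connectionFailure;
        // The clock is consulted again only after a failure, to see whether
        // any retry budget remains before the deadline.
        if (!Instant.now().isBefore(deadline)) {
          break;
        }
      }
    }
    if (lastFailure == null) {
      throw new IllegalStateException("maxAttempts must be positive");
    }
    throw lastFailure;
  }
}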

@@ -69,6 +81,7 @@ public BinaryJedisCluster(Set<HostAndPort> jedisClusterNode, int connectionTimeo
this.connectionHandler = new JedisSlotBasedConnectionHandler(jedisClusterNode, poolConfig,
connectionTimeout, soTimeout, user, password, clientName);
this.maxAttempts = maxAttempts;
this.maxTotalRetriesDuration = Duration.ofMillis(soTimeout);
@gkorland (Contributor) commented:

Why are we connecting soTimeout with maxTotalRetriesDuration?
I think we should have a separate argument for maxTotalRetriesDuration, and perhaps one for enabling backoff at all, to avoid backward-compatibility issues (at least in 3.6).

@sazzad16 (Collaborator) commented Mar 28, 2021:

@gkorland

Why are we connecting soTimeout with maxTotalRetriesDuration?

Because the max duration for one single try is soTimeout.

It should be multiplied by maxAttempts, though.
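
A tiny sketch of the default being argued for here, assuming soTimeout is given in milliseconds; the class and method names are hypothetical and only illustrate the arithmetic.

import java.time.Duration;

// Hypothetical helper: one try may take up to soTimeout, so the overall
// retry budget defaults to maxAttempts tries' worth of it.
final class RetryBudgetDefaults {
  static Duration defaultMaxTotalRetriesDuration(int maxAttempts, int soTimeoutMillis) {
    return Duration.ofMillis((long) maxAttempts * soTimeoutMillis);
  }
}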

@yangbodong22011 (Collaborator) commented:

I disagree. Firstly, it doesn't fit there. Secondly, when we try to improve this (targeting Jedis 4.0.0), it would mess up the config interface and/or could be bottlenecked by it.

@sazzad16 Okay, if we have an improvement plan then I agree to continue, but I still think the default value of maxTotalRetriesDuration should be maxAttempts * soTimeout, not equal to soTimeout.

It's just that those commands were implemented and merged after this PR was crafted, and a simple git merge doesn't add them. We'll always have time to add those.

This is the responsibility of this PR, and maxTotalRetriesDuration should be added to the new commands before merging.

@sazzad16 (Collaborator) commented, replying to @yangbodong22011:

the default value of maxTotalRetriesDuration should be maxAttempts * soTimeout

agreed

maxTotalRetriesDuration should be added to the new command before merged

We can do this after the PR is approved.

@sazzad16 (Collaborator) commented:

@gkorland @yangbodong22011 Please check #2490. Hopefully that PR addresses your concerns.

Merge commit resolving conflicts in:
	src/main/java/redis/clients/jedis/BinaryJedisCluster.java
	src/main/java/redis/clients/jedis/JedisCluster.java
sazzad16 merged commit 270bb71 into redis:master on Mar 31, 2021
walles deleted the j/backoff branch on March 31, 2021 07:47
@walles (Contributor, Author) commented Mar 31, 2021

🥳
