Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry with backoff on cluster connection failures #2358

Merged
merged 46 commits into from
Mar 31, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
46 commits
Select commit Hold shift + click to select a range
6e2f312
Split JedisClusterCommand into multiple methods
Jan 25, 2021
5a4fdbd
Drop redundant null check
Jan 25, 2021
fc3728a
Bump JDK version to 1.8
Jan 25, 2021
cdf56b2
Replace ConnectionGetters with lambdas
Jan 25, 2021
d99ef7b
Retrigger CI
Jan 25, 2021
8978ca5
Add backoff to Redis connections
Jan 28, 2021
85fa21c
Add unit tests for backoff logic
Jan 22, 2021
f8d09c2
Add retries logging
Jan 29, 2021
9c7ef1d
Always use the user requested timeout
Jan 28, 2021
9bce8eb
Remedy review feedback
Feb 2, 2021
67a062a
Consider connection exceptions and disregard random nodes
sazzad16 Feb 11, 2021
7e4abac
Revert "Consider connection exceptions and disregard random nodes"
Feb 11, 2021
eabf10b
Add another backoff test case
Feb 11, 2021
3223404
consider connection exceptions and disregard random nodes
sazzad16 Feb 10, 2021
fd17343
reset redirection
sazzad16 Feb 11, 2021
bf56639
Fix test failure
walles Feb 12, 2021
c7ae6b5
Merge pull request #3 from sazzad16/backoff-walles
Feb 12, 2021
569afe7
Merge branch 'master' into j/backoff
walles Feb 12, 2021
1638603
Apply suggestions from code review
Feb 12, 2021
68e8fdc
update documentation
sazzad16 Feb 12, 2021
c665dc1
Improve a comment
Feb 14, 2021
d9f2596
Update src/main/java/redis/clients/jedis/DefaultJedisSocketFactory.java
Feb 17, 2021
a23d602
Add change from another branch
Feb 25, 2021
6dd86db
Merge branch 'master' into j/backoff
walles Feb 25, 2021
fed69b0
Merge remote-tracking branch 'origin/master' into j/backoff
Feb 26, 2021
791180c
Move JedisClusterCommandTest out of commands package
sazzad16 Feb 26, 2021
4bf345b
Use JedisClusterOperationException
sazzad16 Feb 26, 2021
62a619e
Reduce sleep time, especially when few attempts left
sazzad16 Feb 26, 2021
f1c307d
Update src/main/java/redis/clients/jedis/JedisClusterCommand.java
sazzad16 Feb 26, 2021
8a9e0a8
Merge remote-tracking branch 'origin/master' into j/backoff
Mar 3, 2021
25c63a4
Merge branch 'master' into j/backoff
sazzad16 Mar 10, 2021
9b6242f
merge fix
sazzad16 Mar 10, 2021
73d74d3
Merge branch 'master' into j/backoff
gkorland Mar 18, 2021
d14174d
merge fix
sazzad16 Mar 20, 2021
af5d1f7
Merge remote-tracking branch 'redis/master' into j/backoff
sazzad16 Mar 20, 2021
ddd4038
Merge branch 'master' into j/backoff
sazzad16 Mar 25, 2021
7aa0b74
Merge branch 'master' into j/backoff
sazzad16 Mar 29, 2021
0ef36d3
Use maxAttempts
sazzad16 Mar 29, 2021
25303b7
format import
sazzad16 Mar 29, 2021
9e3fbcc
Re-add missing codes due to merge
sazzad16 Mar 29, 2021
882dd49
avoid NPE while zero max attempts
sazzad16 Mar 29, 2021
7430b9b
Remove zero attempts test
sazzad16 Mar 29, 2021
9eb8d58
More cluster constructors and customizability
sazzad16 Mar 29, 2021
27bce50
Use maxTotalRetriesDuration everywhere
sazzad16 Mar 29, 2021
b900a87
Merge remote-tracking branch 'redis/master' into j/backoff
sazzad16 Mar 31, 2021
4501b0d
more missing maxTotalRetriesDuration after merge
sazzad16 Mar 31, 2021
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -92,6 +92,12 @@
<version>2.3.2</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-core</artifactId>
<version>3.7.7</version>
<scope>test</scope>
</dependency>
</dependencies>

<distributionManagement>
Expand Down
5 changes: 5 additions & 0 deletions src/main/java/redis/clients/jedis/BinaryJedis.java
Original file line number Diff line number Diff line change
Expand Up @@ -292,6 +292,11 @@ public BinaryJedis(final JedisSocketFactory jedisSocketFactory, final JedisClien
initializeFromClientConfig(clientConfig);
}

@Override
public String toString() {
return "BinaryJedis{" + client + '}';
}

public boolean isConnected() {
return client.isConnected();
}
Expand Down
557 changes: 298 additions & 259 deletions src/main/java/redis/clients/jedis/BinaryJedisCluster.java

Large diffs are not rendered by default.

5 changes: 5 additions & 0 deletions src/main/java/redis/clients/jedis/Connection.java
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,11 @@ public Connection(final JedisSocketFactory jedisSocketFactory) {
this.soTimeout = jedisSocketFactory.getSoTimeout();
}

@Override
public String toString() {
return "Connection{" + socketFactory + "}";
}

public Socket getSocket() {
return socket;
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -220,4 +220,9 @@ public HostAndPortMapper getHostAndPortMapper() {
public void setHostAndPortMapper(HostAndPortMapper hostAndPortMapper) {
this.hostAndPortMapper = hostAndPortMapper;
}

@Override
public String toString() {
return "DefaultJedisSocketFactory{" + hostAndPort.toString() + "}";
}
}
539 changes: 282 additions & 257 deletions src/main/java/redis/clients/jedis/JedisCluster.java

Large diffs are not rendered by default.

91 changes: 82 additions & 9 deletions src/main/java/redis/clients/jedis/JedisClusterCommand.java
Original file line number Diff line number Diff line change
@@ -1,5 +1,8 @@
package redis.clients.jedis;

import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

Expand All @@ -18,10 +21,22 @@ public abstract class JedisClusterCommand<T> {

private final JedisClusterConnectionHandler connectionHandler;
private final int maxAttempts;
private final Duration maxTotalRetriesDuration;

public JedisClusterCommand(JedisClusterConnectionHandler connectionHandler, int maxAttempts) {
this(connectionHandler, maxAttempts, Duration.ofMillis((long) BinaryJedisCluster.DEFAULT_TIMEOUT * maxAttempts));
}

/**
* @param connectionHandler
* @param maxAttempts
* @param maxTotalRetriesDuration No more attempts after we have been trying for this long.
*/
public JedisClusterCommand(JedisClusterConnectionHandler connectionHandler, int maxAttempts,
Duration maxTotalRetriesDuration) {
this.connectionHandler = connectionHandler;
this.maxAttempts = maxAttempts;
this.maxTotalRetriesDuration = maxTotalRetriesDuration;
}

public abstract T execute(Jedis connection);
Expand Down Expand Up @@ -85,7 +100,10 @@ public T runWithAnyNode() {
}

private T runWithRetries(final int slot) {
Instant deadline = Instant.now().plus(maxTotalRetriesDuration);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking the the time on each successful call seems like a waste and might impact the performance.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gkorland According to https://www.alibabacloud.com/blog/performance-issues-related-to-localdatetime-and-instant-during-serialization-operations_595605

Throughput of Instant.now+atZone+format+DateTimeFormatter.ofPattern is 6816922.578 ops/sec.
Without any formatting, throughput of Instant.now+plus should be much higher. Shouldn't it be enough?


JedisRedirectionException redirect = null;
int consecutiveConnectionFailures = 0;
Exception lastException = null;
for (int attemptsLeft = this.maxAttempts; attemptsLeft > 0; attemptsLeft--) {
Jedis connection = null;
Expand All @@ -106,15 +124,21 @@ private T runWithRetries(final int slot) {
throw jnrcne;
} catch (JedisConnectionException jce) {
lastException = jce;
++consecutiveConnectionFailures;
LOG.debug("Failed connecting to Redis: {}", connection, jce);
// "- 1" because we just did one, but the attemptsLeft counter hasn't been decremented yet
handleConnectionProblem(attemptsLeft - 1);
boolean reset = handleConnectionProblem(attemptsLeft - 1, consecutiveConnectionFailures, deadline);
if (reset) {
consecutiveConnectionFailures = 0;
redirect = null;
}
} catch (JedisRedirectionException jre) {
// avoid updating lastException if it is a connection exception
if (lastException == null || lastException instanceof JedisRedirectionException) {
lastException = jre;
}
LOG.debug("Redirected by server to {}", jre.getTargetNode());
consecutiveConnectionFailures = 0;
redirect = jre;
// if MOVED redirection occurred,
if (jre instanceof JedisMovedDataException) {
Expand All @@ -124,6 +148,9 @@ private T runWithRetries(final int slot) {
} finally {
releaseConnection(connection);
}
if (Instant.now().isAfter(deadline)) {
throw new JedisClusterOperationException("Cluster retry deadline exceeded.");
}
}

JedisClusterMaxAttemptsException maxAttemptsException
Expand All @@ -132,14 +159,60 @@ private T runWithRetries(final int slot) {
throw maxAttemptsException;
}

private void handleConnectionProblem(int attemptsLeft) {
if (attemptsLeft <= 1) {
// We need this because if node is not reachable anymore - we need to finally initiate slots
// renewing, or we can stuck with cluster state without one node in opposite case.
// But now if maxAttempts = [1 or 2] we will do it too often.
// TODO make tracking of successful/unsuccessful operations for node - do renewing only
// if there were no successful responses from this node last few seconds
this.connectionHandler.renewSlotCache();
/**
* Related values should be reset if <code>TRUE</code> is returned.
*
* @param attemptsLeft
* @param consecutiveConnectionFailures
* @param doneDeadline
* @return true - if some actions are taken
* <br /> false - if no actions are taken
*/
private boolean handleConnectionProblem(int attemptsLeft, int consecutiveConnectionFailures, Instant doneDeadline) {
if (this.maxAttempts < 3) {
// Since we only renew the slots cache after two consecutive connection
// failures (see consecutiveConnectionFailures above), we need to special
// case the situation where we max out after two or fewer attempts.
// Otherwise, on two or fewer max attempts, the slots cache would never be
// renewed.
if (attemptsLeft == 0) {
this.connectionHandler.renewSlotCache();
return true;
walles marked this conversation as resolved.
Show resolved Hide resolved
}
return false;
}

if (consecutiveConnectionFailures < 2) {
return false;
}

sleep(getBackoffSleepMillis(attemptsLeft, doneDeadline));
//We need this because if node is not reachable anymore - we need to finally initiate slots
//renewing, or we can stuck with cluster state without one node in opposite case.
//TODO make tracking of successful/unsuccessful operations for node - do renewing only
//if there were no successful responses from this node last few seconds
this.connectionHandler.renewSlotCache();
return true;
}

private static long getBackoffSleepMillis(int attemptsLeft, Instant deadline) {
if (attemptsLeft <= 0) {
return 0;
}

long millisLeft = Duration.between(Instant.now(), deadline).toMillis();
if (millisLeft < 0) {
throw new JedisClusterOperationException("Cluster retry deadline exceeded.");
}

return millisLeft / (attemptsLeft * (attemptsLeft + 1));
}

protected void sleep(long sleepMillis) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be private?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, protected enables overriding in unit tests.

try {
TimeUnit.MILLISECONDS.sleep(sleepMillis);
} catch (InterruptedException e) {
throw new JedisClusterOperationException(e);
}
}

Expand Down
Loading