JedisCluster uses `maxRedirects` retries when cluster master is down #1238

roblg · 2016-03-11T06:44:05Z

This is sort of related to #1236 that I filed earlier today, and might be related to #1120.

We had a machine that is running one of our redis cluster masters go down hard today, and while it was down we saw some interesting behavior from the jedis pool. We got a lot of errors like this:

redis.clients.jedis.exceptions.JedisClusterMaxRedirectionsException: Too many Cluster redirections?
    at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:97)
    at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:131)
    at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:152)
    at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:131)
    at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:152)
    at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:131)
    at redis.clients.jedis.JedisClusterCommand.runBinary(JedisClusterCommand.java:59)
    at redis.clients.jedis.BinaryJedisCluster.get(BinaryJedisCluster.java:98)

It looks like redis.clients.jedis.Connection.connect() is wrapping the java.net.ConnectException from the failed socket up in a JedisConnectionException, which gets wrapped again by redis.clients.util.Pool.getResource(), and because it's a JedisConnectionException, it triggers a retry.

I think this issue is maybe a little less obviously a bug than #1236, because I can see some uses where someone might want to retry immediately if a socket threw a ConnectException, but I think one could also make an argument that if a ConnectException is being thrown something is seriously wrong, and failing immediately seems reasonable. (If users were experiencing the ConnectException for nodes that weren't down, they could alway increase the connect timeout.) As it is, when a node goes down, users could potentially end up waiting a total of (maxRedirects * connectTimeout) ms, and still get an error, which is kind of unfortunate, especially when a regular cache hit is over in just a few ms. :)

Thoughts?

Redis / Jedis Configuration

Jedis version:

2.8.0

Redis version:

3.0.6

Java version:

JDK8

The text was updated successfully, but these errors were encountered:

marcosnils · 2016-03-11T17:36:04Z

@roblg. If you want to fail immediately you can set the JedisCluster retries to 0 and in that case JedisCluster won't try to contact other nodes upon failure. I'm in mobile now I'll explain the rationale for this functionality later

roblg · 2016-03-11T18:31:32Z

@marcosnils Thanks for the response. Yeah, I originally considered lowering retries to 0, but I think I actually do want retries for "MOVED" responses, so that we can add/remove nodes from the cluster and pick up those changes. I think maybe ideally there would be two kinds of retries -- error retries and redirect retries, but they seem to be the same right now.

In all honesty, it would probably work well enough to just set retries to 1, so that if we add a new node and rebalance the slots, Jedis would be able to correct for that, but in the case of a socket connect timeout, we don't pay a huge penalty.

roblg · 2016-03-11T22:39:02Z

Hmm... it also looks like if there's a ConnectException when trying to connect to a particular node, the request will be retried, but is retried with tryRandomNode=true, so the follow-up request is unlikely to land on a node that has the data. It'll get a MOVED response, and be forced to try again. Am I reading that right?

HeartSaVioR · 2016-03-24T14:04:29Z

@roblg
Yes, right. That implementation was port of https://github.com/antirez/redis-rb-cluster.
When we receive JedisConnectionException, we can't determine that Redis instance is down, or just in case of timeout. (We may could distinguish but let's abstract for now.)
If issue is former, reconnecting same instance is meaningless since Redis Cluster would try to failover and that information is not pushed to Jedis. We just have to request to random instance and let that forward me to valid instance.

HeartSaVioR · 2016-03-24T14:20:47Z

Hmm... I think the logic is up to how fast Redis Cluster completes failover.

If Redis Cluster cannot complete failover fast enough, there's likely to just half of max retry count recurrences of connecting random node -> failed node and finally throw JedisConnectionException anyway.

marcosnils · 2016-03-25T01:41:40Z

@HeartSaVioR I believe this shouldn't be much trouble because even though Jedis might return MaxRedirects very soon depending on how much time RedisCluster completes its failover, whenever the Jedis user tries to issue any command using the same JedisCluster instance, Jedis will try to reconnect again and try to dispatch the command to the appropriate node.

In this scenario, Jedis users can catch the MaxRedirectException and implement some sort of exponential backoff or something before sending the next command to the cluster.

HeartSaVioR · 2016-03-25T08:49:49Z

@marcosnils
Yes, agreed. It's just who is responsible to do retry.
If we want to reduce max retry count to fail fast, retry count should be more than 2 (random node -> moved).

walles · 2021-02-01T13:04:28Z

PR for backing off while retrying on connection failures: #2358

github-actions · 2024-02-15T00:09:59Z

This issue is marked stale. It will be closed in 30 days if it is not updated.

zouyx mentioned this issue May 25, 2016

JedisCluster throws Too many Cluster redirections when a master node crash #1120

Closed

github-actions bot added the stale label Feb 15, 2024

github-actions bot closed this as completed Mar 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JedisCluster uses `maxRedirects` retries when cluster master is down #1238

JedisCluster uses `maxRedirects` retries when cluster master is down #1238

roblg commented Mar 11, 2016

marcosnils commented Mar 11, 2016

roblg commented Mar 11, 2016

roblg commented Mar 11, 2016

HeartSaVioR commented Mar 24, 2016

HeartSaVioR commented Mar 24, 2016

marcosnils commented Mar 25, 2016

HeartSaVioR commented Mar 25, 2016

walles commented Feb 1, 2021

github-actions bot commented Feb 15, 2024

JedisCluster uses maxRedirects retries when cluster master is down #1238

JedisCluster uses maxRedirects retries when cluster master is down #1238

Comments

roblg commented Mar 11, 2016

Redis / Jedis Configuration

Jedis version:

Redis version:

Java version:

marcosnils commented Mar 11, 2016

roblg commented Mar 11, 2016

roblg commented Mar 11, 2016

HeartSaVioR commented Mar 24, 2016

HeartSaVioR commented Mar 24, 2016

marcosnils commented Mar 25, 2016

HeartSaVioR commented Mar 25, 2016

walles commented Feb 1, 2021

github-actions bot commented Feb 15, 2024

JedisCluster uses `maxRedirects` retries when cluster master is down #1238

JedisCluster uses `maxRedirects` retries when cluster master is down #1238