
JedisCluster uses maxRedirects retries when cluster master is down #1238

Closed
roblg opened this issue Mar 11, 2016 · 9 comments

@roblg
Contributor

roblg commented Mar 11, 2016

This is sort of related to #1236 that I filed earlier today, and might be related to #1120.

We had a machine running one of our Redis Cluster masters go down hard today, and while it was down we saw some interesting behavior from the Jedis pool. We got a lot of errors like this:

redis.clients.jedis.exceptions.JedisClusterMaxRedirectionsException: Too many Cluster redirections?
    at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:97)
    at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:131)
    at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:152)
    at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:131)
    at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:152)
    at redis.clients.jedis.JedisClusterCommand.runWithRetries(JedisClusterCommand.java:131)
    at redis.clients.jedis.JedisClusterCommand.runBinary(JedisClusterCommand.java:59)
    at redis.clients.jedis.BinaryJedisCluster.get(BinaryJedisCluster.java:98)

It looks like redis.clients.jedis.Connection.connect() is wrapping the java.net.ConnectException from the failed socket up in a JedisConnectionException, which gets wrapped again by redis.clients.util.Pool.getResource(), and because it's a JedisConnectionException, it triggers a retry.
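
Roughly, my paraphrase of that connect path (simplified from my reading of the 2.8.0 sources, not the exact code):

    // Simplified paraphrase of redis.clients.jedis.Connection.connect() in Jedis 2.8.0
    // (not the exact source); it shows why a dead host surfaces as a JedisConnectionException
    // and therefore takes the retry path.
    public void connect() {
      if (!isConnected()) {
        try {
          socket = new Socket();
          // A node that is down hard typically fails here with java.net.ConnectException,
          // after waiting up to connectionTimeout.
          socket.connect(new InetSocketAddress(host, port), connectionTimeout);
          socket.setSoTimeout(soTimeout);
          outputStream = new RedisOutputStream(socket.getOutputStream());
          inputStream = new RedisInputStream(socket.getInputStream());
        } catch (IOException ex) {
          broken = true;
          // The IOException (including ConnectException) is rethrown wrapped in a
          // JedisConnectionException, which the cluster command loop treats as retryable.
          throw new JedisConnectionException(ex);
        }
      }
    }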

I think this issue is maybe a little less obviously a bug than #1236, because I can see some cases where someone might want to retry immediately if a socket threw a ConnectException. But one could also argue that if a ConnectException is being thrown, something is seriously wrong, and failing immediately seems reasonable. (If users were seeing ConnectExceptions for nodes that weren't down, they could always increase the connect timeout.) As it is, when a node goes down, users could potentially end up waiting a total of (maxRedirects * connectTimeout) ms and still get an error, which is kind of unfortunate, especially when a regular cache hit is over in just a few ms. :)
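
(For a concrete, purely illustrative calculation: with maxRedirects = 5 and connectTimeout = 2000 ms, the worst case is roughly 5 * 2000 ms = 10 s of waiting before the caller finally gets the exception.)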

Thoughts?

Redis / Jedis Configuration

Jedis version:

2.8.0

Redis version:

3.0.6

Java version:

JDK8

@marcosnils
Contributor

@roblg If you want to fail immediately, you can set the JedisCluster retries to 0; in that case JedisCluster won't try to contact other nodes upon failure. I'm on mobile now; I'll explain the rationale for this behavior later.
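
For reference, a minimal sketch of what that would look like (assuming the 2.8.x constructor that takes nodes, timeout, and maxRedirections; the host, port, and timeout values below are placeholders):

    import java.util.HashSet;
    import java.util.Set;

    import redis.clients.jedis.HostAndPort;
    import redis.clients.jedis.JedisCluster;

    public class FailFastClusterExample {
      public static void main(String[] args) {
        Set<HostAndPort> nodes = new HashSet<HostAndPort>();
        nodes.add(new HostAndPort("127.0.0.1", 7000)); // placeholder seed node

        // timeout = 2000 ms; maxRedirections = 0, per the suggestion above:
        // don't retry against other nodes on failure.
        JedisCluster cluster = new JedisCluster(nodes, 2000, 0);

        System.out.println(cluster.get("some-key"));
      }
    }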

@roblg
Contributor Author

roblg commented Mar 11, 2016

@marcosnils Thanks for the response. Yeah, I originally considered lowering retries to 0, but I think I actually do want retries for "MOVED" responses, so that we can add/remove nodes from the cluster and pick up those changes. I think maybe ideally there would be two kinds of retries -- error retries and redirect retries, but they seem to be the same right now.

In all honesty, it would probably work well enough to just set retries to 1, so that if we add a new node and rebalance the slots, Jedis would be able to correct for that, but in the case of a socket connect timeout, we don't pay a huge penalty.

@roblg
Contributor Author

roblg commented Mar 11, 2016

Hmm... it also looks like if there's a ConnectException when trying to connect to a particular node, the request will be retried, but with tryRandomNode=true, so the follow-up request is unlikely to land on the node that has the data. It'll get a MOVED response and be forced to try again. Am I reading that right?
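
For anyone following along, here's my rough paraphrase of that retry loop as I read it in 2.8.0 (simplified, ASK handling omitted, not the exact source):

    // Rough paraphrase of JedisClusterCommand.runWithRetries in Jedis 2.8.0; simplified
    // to show the control flow only, not the actual source.
    private T runWithRetries(byte[] key, int redirections, boolean tryRandomNode, boolean asking) {
      if (redirections <= 0) {
        throw new JedisClusterMaxRedirectionsException("Too many Cluster redirections?");
      }
      Jedis connection = null;
      try {
        connection = tryRandomNode
            ? connectionHandler.getConnection()                 // any known node
            : connectionHandler.getConnectionFromSlot(JedisClusterCRC16.getSlot(key)); // slot owner
        return execute(connection);
      } catch (JedisConnectionException jce) {
        // Connection failure (e.g. a wrapped ConnectException): retry against a *random*
        // node, which will usually answer with MOVED and burn another redirection.
        releaseConnection(connection);
        connection = null;
        return runWithRetries(key, redirections - 1, true, asking);
      } catch (JedisRedirectionException jre) {
        // MOVED/ASK: refresh the slot cache on MOVED, then retry against the slot owner.
        if (jre instanceof JedisMovedDataException) {
          connectionHandler.renewSlotCache(connection);
        }
        releaseConnection(connection);
        connection = null;
        return runWithRetries(key, redirections - 1, false, asking);
      } finally {
        releaseConnection(connection);
      }
    }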

@HeartSaVioR
Contributor

@roblg
Yes, that's right. That implementation was a port of https://github.com/antirez/redis-rb-cluster.
When we receive a JedisConnectionException, we can't tell whether the Redis instance is down or the request just timed out. (We might be able to distinguish the two, but let's keep it abstract for now.)
If it's the former, reconnecting to the same instance is pointless, since Redis Cluster will try to fail over and that information isn't pushed to Jedis. We just have to send the request to a random instance and let it redirect us to a valid instance.
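
(For example, a GET sent to a node that doesn't own the key's slot comes back with a redirect error along the lines of "MOVED 1649 127.0.0.1:7002", where the slot number and address are just placeholders, and Jedis then retries against the node named in that reply.)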

@HeartSaVioR
Contributor

Hmm... I think how well that logic works comes down to how fast Redis Cluster completes the failover.

If Redis Cluster cannot complete the failover fast enough, we'll likely just spend about half of the max retry count bouncing between a random node and the failed node, and finally throw a JedisConnectionException anyway.
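
(With made-up numbers: at maxRedirects = 5 the sequence is roughly failed node (connection error) -> random node (MOVED back to the still-registered failed master) -> failed node (connection error) -> random node (MOVED) -> failed node, so the budget is exhausted after only a couple of real attempts.)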

@marcosnils
Contributor

@HeartSaVioR I believe this shouldn't be much trouble: even though Jedis might return a max-redirections error very quickly, depending on how long Redis Cluster takes to complete its failover, whenever the Jedis user issues another command on the same JedisCluster instance, Jedis will try to reconnect and dispatch the command to the appropriate node.

In this scenario, Jedis users can catch the JedisClusterMaxRedirectionsException and implement some sort of exponential backoff before sending the next command to the cluster.
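
Something along these lines on the caller side (just a rough sketch; the attempt count and backoff values are arbitrary):

    import redis.clients.jedis.JedisCluster;
    import redis.clients.jedis.exceptions.JedisClusterMaxRedirectionsException;

    // Rough caller-side backoff sketch; attempt count and sleep times are arbitrary.
    public class BackoffExample {
      public static String getWithBackoff(JedisCluster cluster, String key) throws InterruptedException {
        long backoffMs = 100;
        for (int attempt = 0; attempt < 4; attempt++) {
          try {
            return cluster.get(key);
          } catch (JedisClusterMaxRedirectionsException e) {
            // The cluster is probably mid-failover; wait and let Jedis rebuild its
            // view of the slots on the next call.
            Thread.sleep(backoffMs);
            backoffMs *= 2; // exponential backoff
          }
        }
        return cluster.get(key); // final attempt; let any exception propagate
      }
    }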

@HeartSaVioR
Contributor

@marcosnils
Yes, agreed. It's just a question of who is responsible for retrying.
If we want to reduce the max retry count to fail fast, the retry count should be more than 2 (random node -> MOVED).

@walles
Contributor

walles commented Feb 1, 2021

PR for backing off while retrying on connection failures: #2358


This issue is marked stale. It will be closed in 30 days if it is not updated.
