JedisCluster didn't reconnect with new IP when domain changes #2907

Closed
mrpre opened this issue Feb 11, 2022 · 32 comments

@mrpre

mrpre commented Feb 11, 2022

Expected behavior

I use the domain name rather than the IP of the Redis server as the parameter for JedisCluster:

Set<HostAndPort> testjedisClusterNode = new HashSet<HostAndPort>();
testjedisClusterNode.add(new HostAndPort("mydomain.com", 6379));
new JedisCluster(testjedisClusterNode,...)

But when Redis is rebuilt and its IP changes, JedisCluster doesn't seem to be reconfigured with the new IP.

Actual behavior

The user must restart the client manually even after the server has recovered.
Increasingly, servers run on Kubernetes as Pods, and a Pod's IP can change easily.

Steps to reproduce:

Redis / Jedis Configuration

Jedis version:

Latest

Redis version:

6.0

Java version:

8

@mrpre
Author

mrpre commented Feb 11, 2022

Well... after searching the PRs for a while, I found that @yangbodong22011 has already created PR #2643. I'm not clear on the discussion under that PR, but is there any smooth way for client users to recover their clients other than restarting? @sazzad16

@sazzad16
Collaborator

@mrpre Is it possible to recreate JedisCluster when you're rebuilding Redis?

#2643 does address your issue, but it would break many/most existing applications. We're trying to find a solution that doesn't break anything, but this is taking some time.

We'd welcome any PR regarding this.

PS: An exact version number helps more than "Latest".

@yangbodong22011
Collaborator

@sazzad16 Hi, I think JedisCluster should stick to this principle: as long as the user-configured startNodes (physical IP, VIP, or domain) are available, JedisCluster should keep working.

Therefore, in renewClusterSlots, we should prefer the user-configured "rock-solid" startNodes over getShuffledNodesPool. This is a bug fix, not a breaking change.

#2643 does address your issue but it would break many/most existing applications.

I think this is background behavior that won't break users' existing programs. Can you be specific about your concerns?

@sazzad16
Collaborator

@yangbodong22011 Let's consider a cluster with master nodes M1, M2, M3 and their respective replicas R1, R2, R3. We provide [M1] as startNodes in JedisCluster. Now, if M1 goes down, the entire JedisCluster would be unusable, because it would try to connect only through startNodes, which contains only one node, M1, which is unavailable.

The getShuffledNodesPool() approach would not fail in this scenario. I hope this clarifies things.

@yangbodong22011
Collaborator

yangbodong22011 commented Feb 15, 2022

@sazzad16 Thanks for your explanation.
I realize the problems with this change. In our production environment, the vast majority of services use a domain name or VIP instead of a physical IP. What do you think of the following steps?

  1. First use startNodes to call Cluster Slots
  2. If it fails, fallback to getShuffledNodesPool()

These two steps cannot be reversed; otherwise the previous problem would still exist. A rough sketch of this order follows below.
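
For illustration, a rough sketch of that refresh order in Java. The class and helper names here (SlotCacheRefresher, tryDiscover, knownNodes) are hypothetical, not actual Jedis internals; only Jedis, HostAndPort, clusterSlots(), and JedisConnectionException are real Jedis types and calls.

import java.util.List;
import java.util.Set;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.exceptions.JedisConnectionException;

// Hypothetical sketch of the "start nodes first, then known nodes" refresh order.
class SlotCacheRefresher {
    private final Set<HostAndPort> startNodes;   // user-configured seeds (domain/VIP), follow DNS changes
    private final List<HostAndPort> knownNodes;  // nodes previously discovered from the cluster

    SlotCacheRefresher(Set<HostAndPort> startNodes, List<HostAndPort> knownNodes) {
        this.startNodes = startNodes;
        this.knownNodes = knownNodes;
    }

    void renewSlotCache() {
        // Step 1: prefer the user-configured start nodes.
        for (HostAndPort node : startNodes) {
            if (tryDiscover(node)) return;
        }
        // Step 2: fall back to the currently known nodes (the shuffled-pool behavior).
        for (HostAndPort node : knownNodes) {
            if (tryDiscover(node)) return;
        }
    }

    private boolean tryDiscover(HostAndPort node) {
        try (Jedis jedis = new Jedis(node.getHost(), node.getPort())) {
            jedis.clusterSlots(); // fetch the current topology from this node
            // ... rebuild the slot-to-node mapping from the reply ...
            return true;
        } catch (JedisConnectionException e) {
            return false; // try the next candidate
        }
    }
}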

@mrpre
Author

mrpre commented Feb 15, 2022

Users don't know when they should recreate the JedisCluster (when "connection refused" happens?), and it's too costly for users to code that themselves. It would be preferable for such failures to be handled inside JedisCluster.
I also agree that changes should not break any existing application.

@sazzad16
Collaborator

@yangbodong22011 I was thinking about having modes for the users to choose from (see the enum sketch after this list). Something like:

  1. CLUSTER_NODES (default)
  2. START_NODES (your 1st choice)
  3. CLUSTER_NODES_THEN_START_NODES
  4. START_NODES_THEN_CLUSTER_NODES (your 2nd choice)
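
For illustration only, these modes could be captured by a configuration enum along the lines of the hypothetical sketch below; this is not an actual Jedis API.

// Hypothetical enum mirroring the proposed modes (not part of Jedis).
public enum TopologyRefreshSource {
    CLUSTER_NODES,                   // refresh only from the currently known cluster nodes (default)
    START_NODES,                     // refresh only from the user-configured start nodes
    CLUSTER_NODES_THEN_START_NODES,  // known nodes first, fall back to start nodes
    START_NODES_THEN_CLUSTER_NODES   // start nodes first, fall back to known nodes
}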

@yangbodong22011
Collaborator

@yangbodong22011 I was thinking about having modes for the users to choose from. Something like:

  1. CLUSTER_NODES (default)
  2. START_NODES (your 1st choice)
  3. CLUSTER_NODES_THEN_START_NODES
  4. START_NODES_THEN_CLUSTER_NODES (your 2nd choice)

@sazzad16 I think this background behavior can be done correctly without burdening the user with choices. Let's keep it simple.
P.S. You mean CLUSTER SLOTS, right? Not CLUSTER NODES.

@sazzad16
Collaborator

@yangbodong22011 I meant CLUSTER_NODES (with an underscore) as a mode.

@yangbodong22011
Collaborator

yangbodong22011 commented Feb 16, 2022

@yangbodong22011 I meant CLUSTER_NODES (with an underscore) as a mode.

Do you mean adding CLUSTER NODES as the command used to get the topology? (Currently it is CLUSTER SLOTS.)

EDIT: I think CLUSTER_NODES just means the previous getShuffledNodesPool, right?

@sazzad16
Collaborator

I think CLUSTER_NODES just means the previous getShuffledNodesPool, right?

@yangbodong22011 Yes.

@yangbodong22011
Collaborator

@sazzad16 OK, back to the topic.
I hope we can agree on START_NODES_THEN_CLUSTER_NODES.
I don't want us to expose too many concepts for users to choose from and understand.

@sazzad16
Collaborator

@yangbodong22011 Or, simply, we can start with the first two. We'll consider the others later.

@yangbodong22011
Collaborator

@yangbodong22011 Or, simply, we can start with the first two. We'll consider the others later.

I insist that this strategy should not be exposed to users; we just need to do it silently. This strategy doesn't make sense to users and would only confuse them.

Also, no Redis client exposes such a policy. Lettuce lets the user choose when to refresh the cluster topology view: either call CLUSTER NODES at intervals, or call it when a MOVED response is encountered; Jedis implements the second.

@sazzad16
Collaborator

I insist that this strategy should not be exposed to users; we just need to do it silently. This strategy doesn't make sense to users and would only confuse them.

@yangbodong22011 Sorry, I'm missing what you mean by "this".

Also, if I'm not mistaken, Lettuce provides many options.

@yangbodong22011
Collaborator

Sorry, I'm missing what you mean by "this".

"this" means multiple modes. I think we should only support one correct mode, Currently I think it is: START_NODES_THEN_CLUSTER_NODES

yes, lettuce provides many options, but it doesn't give the user the option to choose START_NODES OR CLUSTER_NODES, the user controls is when they care to perform a cluster topology refresh.

@sazzad16
Collaborator

@dengliming thoughts?

@dengliming
Collaborator

Hey, sorry for the late reply. IMO I agree with @yangbodong22011. Providing more modes will increase the cost of user understanding. We can just do one mode for now and see later if any users need multiple modes for more scenarios. BTW, would it be better if START_NODES => ORIGINAL_NODES?

@sazzad16
Collaborator

Would it be better if START_NODES => ORIGINAL_NODES?

@dengliming Which order do you prefer?

I think doing ORIGINAL_NODES first doesn't make that much sense for the case of @yangbodong22011 and @mrpre.

@dengliming
Collaborator

ORIGINAL_NODES

Just a naming opinion, never mind :)

I agree with START_NODES_THEN_CLUSTER_NODES.

@sazzad16
Collaborator

There won't be any naming as there won't be multiple modes 😃

@sazzad16
Collaborator

@yangbodong22011 Would you like to work on this?

@yangbodong22011
Collaborator

@yangbodong22011 Would you like to work on this?

Of course, I will continue to modify #2643.

@sazzad16
Collaborator

Resolved by #2643

sazzad16 added this to the 4.2.0 milestone Feb 17, 2022
@eldarj

eldarj commented Apr 5, 2023

@sazzad16 @yangbodong22011 Seeing that this PR didn't require a lot of changes but fixes a huge problem, is there any chance of backporting it to a previous Jedis version that doesn't require Spring 3.x?

@sazzad16
Collaborator

sazzad16 commented Apr 5, 2023

Seeing that this PR didn't require a lot of changes but fixes a huge problem, is there any chance of backporting it to a previous Jedis version that doesn't require Spring 3.x?

@eldarj We'll consider it if we see user demand.

@eldarj

eldarj commented Apr 6, 2023

@sazzad16 I'm sure there will be; going forward this will be an issue for most teams, especially considering k8s usage and that upgrading to Spring Boot 3.x isn't straightforward.

Can we be proactive here and do it straight away? I'd volunteer if needed.

@eldarj

eldarj commented Apr 6, 2023

@sazzad16 @yangbodong22011 Actually, I just tried this out with Spring Boot 3.0.1 and Jedis 4.3.2, and it seems that the issue is still there.

Maybe I misunderstood this issue and the PR in the first place, but what I'm trying to accomplish is to have Jedis re-discover Redis nodes after those nodes' IPs change.

What I'm doing and seeing is the following:

  1. Run redis-sentinel in k8s.
  2. Run a sample Java service with Jedis, configuring it with the k8s Redis service host name, i.e. rfs-redis-sentinel-cluster, and master mymaster.
  3. On startup Jedis connects to Redis and resolves all nodes, i.e. IPs.
  4. I delete all pods in redis-sentinel in order to acquire new IPs.
  5. Jedis throws:
redis.clients.jedis.JedisSentinelPool    : Lost connection to Sentinel at rfs-redis-sentinel-cluster:26379. Sleeping 5000ms and retrying.
redis.clients.jedis.exceptions.JedisConnectionException: Unexpected end of stream.
...
[ns-pool-evictor] redis.clients.jedis.JedisFactory         : Error while validating pooled Jedis object.
...
  6. After a while I see Jedis log the following, which seems weird:
[-cluster:26379]] redis.clients.jedis.JedisSentinelPool    : Created JedisSentinelPool to master at 127.0.0.1:6379
  7. Any following operation with e.g. redisTemplate fails with:
org.springframework.data.redis.RedisConnectionFailureException: Cannot get Jedis connection] with root cause redis.clients.jedis.exceptions.JedisConnectionException: Failed to connect to any host resolved for DNS name.

Did I miss something? Was this supposed to be solved, or was it some other problem?
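
For reference, a minimal sketch of the Sentinel setup described above, assuming Jedis is used directly rather than through redisTemplate. The service name rfs-redis-sentinel-cluster and the master name mymaster are taken from the steps; everything else is an illustrative assumption.

import java.util.Collections;
import java.util.Set;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisSentinelPool;

public class SentinelReproSketch {
    public static void main(String[] args) {
        // Sentinels are reached through the Kubernetes service name, resolved via DNS.
        Set<String> sentinels = Collections.singleton("rfs-redis-sentinel-cluster:26379");
        try (JedisSentinelPool pool = new JedisSentinelPool("mymaster", sentinels);
             Jedis jedis = pool.getResource()) {
            jedis.set("key", "value"); // starts failing once all Sentinel pods are recreated with new IPs
        }
    }
}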

@sazzad16
Collaborator

sazzad16 commented Apr 6, 2023

Did I miss something? Was this supposed to be solved, or was it some other problem?

@eldarj This issue is related to Cluster but yours is related to Sentinel.

@eldarj

eldarj commented Apr 7, 2023

Okay, I'll have to check why Cluster would be any different from Sentinel.

If you can point me in the right direction, that would be great. Why are these two different? @sazzad16

Also, if the same needs to be implemented for Sentinel, do you guys accept PR contributions, or should I create a separate issue then?

@sazzad16
Collaborator

sazzad16 commented Apr 9, 2023

@eldarj I couldn't find a direct comparison on the Redis site. Here are two separate links from there:

Coming to the comparison, there are resources available elsewhere online. For example:

Hope these help.


if the same needs to be implemented for Sentinel, do you guys accept PR contributions

Yes, of course.

should I create a separate issue then?

That helps attract more eyes and may lead to resolving the issue faster and/or more efficiently.

@yangbodong22011
Collaborator

@eldarj This issue is in a closed state; you can open a new issue to continue with the problem. In the new issue, please note:

  1. Code: what does your initialization code look like?
  2. Architecture: how many Sentinel nodes are configured behind rfs-redis-sentinel-cluster? 1, 3, or more?
