Do cluster slot renewal in a background thread #1347
Instead of doing it as soon as a MOVED was received from redis.
Comments
Hmm, I'm wondering if this makes sense. The only scenario where I believe this could be useful is if you have little traffic to the cache or a specific node and the background renewal happens before the user gets a MOVED response. I still believe this scenario is pretty rare and I don't really see the benefit of doing this. IMHO slot migration (because of failures or new node additions) shouldn't be something that happens often, so I don't think it's bad to lock if two queries get a MOVED at the same time and wait until the cluster heals. In addition to this, some questions came to my mind and I'm wondering if you considered them:
A hybrid idea that might work is to perform slot renewal in a separate thread only when receiving a MOVED response. @HeartSaVioR @antirez WDYT?
I think you got it wrong. It doesn't need to happen in the background before a MOVED.
Ok, I got it wrong then. This is something similar to what I thought about 😄
Please correct me if I'm wrong, but then the slot table would be a shared variable between user threads and the background thread. If we don't take a lock, operations can fail while the background thread reconstructs the slots. We should make operations wait while the slot table is being reconstructed anyway.
@HeartSaVioR how so? Threads that don't see the change will still get the MOVED response. This way, you won't make any transactions wait until slot renewal finishes.
We are "moving" the slot information, which means we "delete" the slot and "add" the slot. It can't be atomic, which means there's a race condition between the main thread and the background thread. But I like the idea @xetorthio suggested: the main thread doesn't need to wait for the cache to be updated. If we handle the case I'm describing properly, I'm OK with changing this.
Yeah, I think I made a mistake again. Slots are stored in a concurrent hashmap. I'm OK with this. Great idea.
I was about to say this :D
Also, the current implementation clears all slots before filling them. We could avoid doing that and just update the slots without the cleanup. This way we'll be making better use of the concurrent map.
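To make the in-place update concrete, here is a minimal sketch assuming the slot table is a plain `ConcurrentHashMap` keyed by slot number; `SlotCache` and its method names are hypothetical, not Jedis internals:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical slot cache: readers never block, and a renewal overwrites
// entries in place instead of clearing the whole map first.
class SlotCache {
    private final Map<Integer, String> slots = new ConcurrentHashMap<>(); // slot -> "host:port"

    // Called by user threads on every command.
    String nodeForSlot(int slot) {
        return slots.get(slot);
    }

    // Called when a MOVED response re-pins a single slot to a new node.
    void assign(int slot, String node) {
        slots.put(slot, node);
    }

    // Called by the background renewal with a freshly fetched topology.
    // putAll updates in place, so readers never observe an empty table.
    void renewAll(Map<Integer, String> fresh) {
        slots.putAll(fresh);
    }
}
```

A thread that reads a stale entry simply gets another MOVED and retries, which is the behaviour discussed above.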
We could use a single-threaded ExecutorService.
AFAIK, we can just create a new task containing the slot information from the MOVED response and submit it to that ExecutorService whenever we get a MOVED, and the background thread runs every new task.
Ok. That means we never ask for slots again, right?
@HeartSaVioR but won't tasks queue up if we do this? The idea is that there can only be one task in the queue; it wouldn't make sense to queue up multiple slot renewal tasks.
I'm not sure I understood your comment @marcosnils
@HeartSaVioR suggests adding a new task to the ExecutorService each time a MOVED response is returned. The SingleThreadExecutor will execute one task at a time, but every time you add a task it will be queued for later execution. What I'm saying is that there shouldn't be any queuing in this process: only one task can be processed at a time, and that's it.
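One way to read that, sketched under the assumption that a single full renewal covers all concurrent MOVED responses; `BackgroundRenewer` and `requestRenewal` are illustrative names, not Jedis code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical: collapse concurrent MOVED responses into at most one
// pending renewal instead of queuing one task per response.
class BackgroundRenewer {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final AtomicBoolean renewalPending = new AtomicBoolean(false);

    // Called by any user thread that receives a MOVED response.
    void requestRenewal(Runnable renewSlotCache) {
        if (renewalPending.compareAndSet(false, true)) {    // first MOVED wins
            executor.submit(() -> {
                try {
                    renewSlotCache.run();                    // one CLUSTER SLOTS fetch
                } finally {
                    renewalPending.set(false);               // allow the next renewal
                }
            });
        }
        // Threads that lose the race just retry against the node named in MOVED.
    }
}
```

In this reading, two threads that get a MOVED at the same time both call `requestRenewal`, but only one task is actually submitted; the other just follows its redirect.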
But how does your suggestion work? Imagine you get a MOVED at the same time from 2 different Jedis instances. Both need to do something. What do they do? With what @HeartSaVioR suggests, both queue a task in the executor, and eventually the background thread will refresh the cache. What would your suggestion do?
@xetorthio @marcosnils
What we're currently doing is in fact the slowest but clearest way to resolve this scenario: there is no race condition because it's guarded with a RW lock. One sketched idea is to compare the fetched timestamp (the moment the thread queried the cache) with the slot-updated timestamp (in the cache), and only update the slots when the fetched timestamp is greater than the slot-updated timestamp (which means the thread queried Redis Cluster with the latest slot information and Redis still responded MOVED). If we don't want to maintain a RW lock but still want to update whole slots, the idea could be extended to keep a timestamp per slot (it might be crazy to maintain a timestamp for each slot). It's just a sketched idea, so we could have better alternatives or this idea could be improved. tl;dr: if a borrowed thread gets a MOVED response, check that it was referring to the latest information for that slot. If it was, we need to update the slot(s). If not, we don't.
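For illustration only, a rough sketch of that timestamp comparison, assuming the cache records when it was last rebuilt; none of these names exist in Jedis:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical guard: only renew if the thread that got the MOVED reply
// had already read the latest version of the slot table.
class RenewalGuard {
    private final AtomicLong lastRenewedNanos = new AtomicLong(System.nanoTime());

    // Record when the calling thread looked the slot up, before sending the command.
    long markLookup() {
        return System.nanoTime();
    }

    // Called after a MOVED reply: true only if the lookup happened after the
    // last renewal, i.e. the cache really is stale for this thread.
    boolean shouldRenew(long lookupNanos) {
        return lookupNanos > lastRenewedNanos.get();
    }

    // Called by whoever performs the renewal, once it finishes.
    void renewed() {
        lastRenewedNanos.set(System.nanoTime());
    }
}
```

The per-slot variant mentioned above would replace the single `AtomicLong` with an array of 16384 timestamps indexed by slot.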
I think that updating only one slot has its benefits, but it could also be … Now this doesn't need to be too strict: if the cache is not up to date, the worst case is another MOVED. So probably the best solution is the one that is simplest to understand. A background thread that does the update sounds good. What triggers the update?
IMHO updating slot-wise on each MOVED response … The adaptive topology refresh listens to persistent disconnects. Another point to consider is rate-limiting: busy applications can run into thousands of events per second, and you don't want to trigger that many updates. Integrating threads into a framework always pulls in more stuff than expected. You don't want to launch a …
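As a sketch of the rate-limiting point only (not how Lettuce or Jedis actually implement it), refresh triggers could be gated so that at most one refresh runs per interval:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical limiter: thousands of MOVED/disconnect events per second
// collapse into at most one topology refresh per window.
class RefreshRateLimiter {
    private final long minIntervalNanos;
    private final AtomicLong lastRefreshNanos;

    RefreshRateLimiter(long minIntervalMillis) {
        this.minIntervalNanos = minIntervalMillis * 1_000_000L;
        // Start "expired" so the very first trigger is allowed through.
        this.lastRefreshNanos = new AtomicLong(System.nanoTime() - minIntervalNanos);
    }

    // Returns true for at most one caller per interval; all other triggers are dropped.
    boolean tryAcquire() {
        long now = System.nanoTime();
        long last = lastRefreshNanos.get();
        return now - last >= minIntervalNanos
                && lastRefreshNanos.compareAndSet(last, now);
    }
}
```

A caller would then only submit a renewal when `tryAcquire()` returns true.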
Thanks for your input @mp911de
IMHO when the client receives a MOVED response, we should not treat it as an exception (JedisMovedDataException); it's quite normal during migration. We can just continue to serve the MOVED redirection and update the slot cache in a background thread. Since Redis is not a CP model in CAP, we can sacrifice consistency for liveness.
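A hedged sketch of that flow, written against stand-in types rather than real Jedis classes (the `SlotAwareClient` and `Reply` types, and the reuse of the `BackgroundRenewer` sketched earlier, are all assumptions for illustration):

```java
// Hypothetical stand-ins, not Jedis types.
interface SlotAwareClient {
    Reply get(String key);                  // routed via the local slot cache
    Reply getFrom(String node, String key); // sent directly to a named node
    void renewSlotCache();                  // full CLUSTER SLOTS refresh
}

record Reply(boolean moved, String movedTarget, String value) {}

class RedirectFollowingGet {
    // A MOVED reply is treated as a normal outcome, not as an exception:
    // answer the caller from the new node now, refresh the cache off the hot path.
    static String get(String key, SlotAwareClient client, BackgroundRenewer renewer) {
        Reply reply = client.get(key);
        if (reply.moved()) {
            renewer.requestRenewal(client::renewSlotCache);   // background, non-blocking
            reply = client.getFrom(reply.movedTarget(), key); // serve the redirect immediately
        }
        return reply.value();
    }
}
```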
This thread is very old, but I don't think the problem has been solved, or at least not solved consistently. I'm not overly familiar with all of the different cluster, connection pool, and configuration options, but I do have a requirement to hold a consistent connection for a particular hash slot for a series of operations. Here is the basic code I worked out, after ruling out …:

```java
try (Connection conn = connectionProvider.getConnectionFromSlot(JedisClusterCRC16.getSlot(aStringKey))) {
    final Jedis client = new Jedis(conn);
    final String result = client.get(aStringKey); // ?
}
```

If the client sees a `MOVED` response here, nothing refreshes the slot cache. One solution is something like this, similar to Lettuce's adaptive refresh:

```java
try (Connection conn = connectionProvider.getConnectionFromSlot(JedisClusterCRC16.getSlot(aStringKey))) {
    try {
        final Jedis client = new Jedis(conn);
        final String result = client.get(aStringKey); // ?
    } catch (JedisConnectionException ex) { // Only catches connection errors, will likely miss `MOVED` or others
        connectionProvider.renewSlotCache();
        // enter a retry loop here
    }
}
```

Another solution, or a companion solution in case you need both, is to do the periodic refresh that was suggested above. Lettuce has this option and it has become absolutely necessary for stability in production.

```java
// Somewhere in your application setup...
final ScheduledExecutorService scheduledExecutorService = Executors.newSingleThreadScheduledExecutor();
final ClusterConnectionProvider connectionProvider = new ClusterConnectionProvider(hosts, jedisClientConfig);
scheduledExecutorService.scheduleAtFixedRate(connectionProvider::renewSlotCache, 0, 15, TimeUnit.SECONDS);
```
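For completeness, here is one shape the "retry loop" hinted at in the comment above might take if it also treats redirections as a refresh trigger. Catching `JedisMovedDataException` and the bounded retry budget are my additions, not a vetted pattern, and the snippet assumes the same imports and `connectionProvider` as the code above:

```java
// Assumes the same setup as the snippets above; the redirection catch and
// the bounded retry loop are additions for illustration.
String getWithRenewal(String aStringKey) {
    for (int attempt = 0; attempt < 3; attempt++) {
        try (Connection conn = connectionProvider.getConnectionFromSlot(
                JedisClusterCRC16.getSlot(aStringKey))) {
            return new Jedis(conn).get(aStringKey);
        } catch (JedisMovedDataException | JedisConnectionException ex) {
            // Stale slot table or lost node: rebuild the cache, then retry.
            connectionProvider.renewSlotCache();
        }
    }
    throw new IllegalStateException("slot cache still stale after retries");
}
```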
I was primarily testing cluster destruction and failovers where a primary is lost. In that case you see socket exceptions, wrapped under a JedisConnectionException.
I did test this with … I was migrating the test from Lettuce, where the equivalent of …
This issue is marked stale. It will be closed in 30 days if it is not updated.