Client connections load balancing among nats cluster nodes #1556
Comments
We have been aware of this issue and have been exploring ideas here for some time, but nothing has come up that we really love yet as a solution. It is on our list though for sure.
Ok, thanks. Btw, thank you for adding a feature to the nats server that immediately finishes a request when no replier is listening. Waiting for the 2.2 release!
@derekcollison has any progress or solution been found towards this? We are struggling with the same situation and I don't want to do hacky things on top of the nats client to solve this 😂. Any suggestions?
Post 2.2 release we will be focusing on some client upgrades, this being one of them. But we need to get 2.2 out the door first.
Bump. Is there any movement on this? We're facing the exact same problem. So far we have only thought of building our NATS clients in a way that they reconnect after some time, to give them a chance to pick another NATS server and improve the distribution of client connections.
Apologies for delays here, it has been a very interesting few months for us in a positive way. So prior to any intelligent client work, I took a look at this today during a break. I wanted to see if there might be a way to balance a cluster after an upgrade that is possible today. I believe there is, and I will test this myself at some point in the next few days, but here is the thought.

For this experiment let's assume 10 connections on 3 servers A, B, and C. Also assume accurate randomness etc., which we know is not the case, but it will help illustrate what I think is possible. Servers all have a setting for max connections.

So, assuming accurate and true randomness of the next server picked: set a limit of 10 on A, then a limit of 10 on B (keep A's limit in place, but set none on C), so that displaced clients land on C. Then release the limits on A and B.

Again, this was me doodling during a lunch break, but I did verify it code-wise, so I think this would work today.
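The limits in the experiment above correspond to the NATS server's `max_connections` setting. A minimal sketch of a server config fragment (the value 10 matches the example's assumed connection count; in practice it would be set temporarily and then raised again):

```
# Temporary cap for rebalancing (sketch): while this limit is in place,
# new client connections are refused and clients retry other cluster nodes.
max_connections: 10
```

On supported platforms the change can be applied without a restart via a configuration reload (`nats-server --signal reload=<pid>`).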
@derekcollison interesting thought. In our case we don't know the number of client connections upfront, so we cannot use a fixed number, as we would drop any client beyond the configured max_connections. I think it could work by having a nanny job that continuously monitors the cluster's connection distribution and adjusts the max_connections values accordingly. WDYT?
After a full sweep upgrade, count all the current connections and divide by the number of servers; that will give you your target per-server balance number, which can guide the temporary max_connections settings. Right now the client (the Go client at least) will see the error from the server and terminate rather than do a normal reconnect. @tbeets and @wallyqs will look into this and get it fixed. It's non-fatal, and if the client knows it has other server options to reconnect to, that should happen.
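The arithmetic described above can be sketched as a small helper. The function names and sample numbers here are hypothetical, purely to illustrate the computation:

```python
import math

def per_server_target(connection_counts):
    """Balanced per-server target: total connections divided by the number
    of servers, rounded up so the caps never sum to less than the total."""
    total = sum(connection_counts)
    return math.ceil(total / len(connection_counts))

def servers_to_cap(connection_counts):
    """Servers currently above the target; candidates for a temporary
    max_connections limit during rebalancing."""
    target = per_server_target(connection_counts)
    return [i for i, c in enumerate(connection_counts) if c > target]

# Example: a 3-node cluster skewed by a rolling restart.
counts = [6000, 3000, 0]
print(per_server_target(counts))  # -> 3000
print(servers_to_cap(counts))     # -> [0]
```

A nanny job could recompute these targets periodically from the servers' monitoring data and apply the caps only while a node sits above its target.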
PR for the client to auto-reconnect in this scenario is in the hopper:
Hi, any updates on this thread? We are having the same issue. Client connections will not redistribute to restarted failed servers in the cluster.
The work here is twofold. The server side will be manual and can be done today with tooling and scripts etc.; we may automate some of this in our service offering, NGS. The other option would be hints sent from the servers that client v2 could respond to and do the right thing. That work looks to happen possibly in 2023. /cc @ColinSullivan1
Hello! Currently the mechanism for load balancing client connections across a nats cluster is based on random node selection. The problem (already discussed in other issues, like #1359) arises when we restart the nats cluster. To keep serving clients we restart the cluster one node after another, and this leads to approximately the following distribution (step-by-step example for a 3-node cluster):
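The rolling-restart scenario can be sketched with a toy simulation. This assumes (my assumption, for illustration) that each displaced client reconnects uniformly at random to one of the remaining live nodes:

```python
import random

def rolling_restart(n_clients=9000, n_nodes=3, seed=42):
    """Simulate restarting each node in turn; returns connections per node."""
    random.seed(seed)
    conns = [n_clients // n_nodes] * n_nodes  # start uniform: 1/3 each
    for node in range(n_nodes):               # restart nodes one by one
        displaced, conns[node] = conns[node], 0
        live = [i for i in range(n_nodes) if i != node]
        for _ in range(displaced):            # clients pick a live node at random
            conns[random.choice(live)] += 1
    return conns

print(rolling_restart())  # heavily skewed; last-restarted node ends at 0
```

Under this model the last node restarted ends up with no connections at all, and the first one carries well over its fair share, matching the skew described below.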
As you can see, the original uniform distribution 1/3 | 1/3 | 1/3 tends to 2/3 | 1/3 | 0. I suggest adding some modifications that would allow developers to work around this problem (for example, we run our nats cluster in 24/7 fashion and never stop it completely, but sometimes need to upgrade the server version or our hardware). There can be several solutions: