Azure Kubernetes publick8s suffers from SNAT port exhaustion: network slowness #3908
Comments
…nnections (#579) Ref. jenkins-infra/helpdesk#3908 This PR tunes the network outbound method used for both `publick8s` and `privatek8s` (both using load balancers) with:
- TCP idle timeout decreased from 30 min (default) to 4 min, to recycle sockets far more often
- Static allocation of `3200` (and `1600`) SNAT ports on the public outbound IPs, as per the Azure metrics (these values are the upper bounds of each cluster's SNAT connection diagrams).
- Note this disables dynamic allocation: it would only be a problem with more than 50 nodes per cluster, which is not the case for these two.

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
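For reference, a minimal Terraform sketch of the tuning described above, assuming the azurerm provider's `azurerm_kubernetes_cluster` resource (names, region, and node pool values are hypothetical placeholders, not the actual jenkins-infra/azure code):

```hcl
# Hypothetical sketch only: illustrates the load balancer outbound tuning,
# not the real publick8s definition.
resource "azurerm_kubernetes_cluster" "publick8s" {
  name                = "publick8s"    # placeholder values
  location            = "eastus2"
  resource_group_name = "publick8s-rg"
  dns_prefix          = "publick8s"

  default_node_pool {
    name       = "default"
    node_count = 3
    vm_size    = "Standard_D4s_v3"
  }

  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin = "azure"
    outbound_type  = "loadBalancer"

    load_balancer_profile {
      # Recycle idle TCP connections after 4 minutes instead of the
      # 30-minute default, so SNAT ports are released sooner.
      idle_timeout_in_minutes = 4

      # Statically allocate 3200 SNAT ports per node (must be a multiple
      # of 8), matching the upper bound seen in the Azure metrics.
      # This disables dynamic allocation.
      outbound_ports_allocated = 3200
    }
  }
}
```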
Tracking the ci.jenkins.io migration in #3913
While working on #3837 (comment), we saw the problem re-appear due to additional nodes. It sounds like we should add more public IPs to increase the threshold. If that does not suffice, we'll have to plan a cluster re-creation during the Kubernetes 1.27 upgrade, with a new subnet (and an associated NAT gateway).
…NAT exhaustion (#587) Related to jenkins-infra/helpdesk#3908 This PR increases the number of public IPs used for outbound connections in `publick8s`, in an attempt to raise the SNAT exhaustion threshold. Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
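With static allocation, the node ceiling is (number of outbound IPs × 64,000) / ports allocated per node, so adding IPs is the cheapest way to raise the threshold. A hedged sketch of that change (only the `network_profile` block from the earlier sketch is shown; the IP count of 3 is illustrative, not the value from #587):

```hcl
network_profile {
  network_plugin = "azure"
  outbound_type  = "loadBalancer"

  load_balancer_profile {
    idle_timeout_in_minutes  = 4
    outbound_ports_allocated = 3200

    # Each Standard public IP provides 64,000 SNAT ports. At a static
    # allocation of 3,200 ports per node, each IP covers
    # 64000 / 3200 = 20 nodes, so 3 IPs raise the ceiling to 60 nodes.
    managed_outbound_ip_count = 3
  }
}
```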
Let's check how the weekend goes.
Alas, we still see SNAT port problems. Let's go with the "add a NAT gateway, but not explicitly" approach described in https://www.danielstechblog.io/preventing-snat-port-exhaustion-on-azure-kubernetes-service-with-virtual-network-nat/ (and other posts).
Update: opened jenkins-infra/azure-net#198; let's prepare this PR and check the SNAT metrics before and after deploying, to confirm the SNAT exhaustion disappears. If it does, then we'll decrease the LB outbound IPs to pay less.
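A minimal Terraform sketch of that approach, assuming the azurerm provider (resource names and references such as `azurerm_subnet.publick8s` are hypothetical, not the actual jenkins-infra/azure-net code):

```hcl
# Hypothetical sketch of the "NAT gateway on the cluster subnet" approach.
resource "azurerm_public_ip" "nat_outbound" {
  name                = "publick8s-nat-outbound"   # placeholder names
  location            = azurerm_resource_group.publick8s.location
  resource_group_name = azurerm_resource_group.publick8s.name
  allocation_method   = "Static"
  sku                 = "Standard"
}

resource "azurerm_nat_gateway" "publick8s" {
  name                    = "publick8s-nat-gateway"
  location                = azurerm_resource_group.publick8s.location
  resource_group_name     = azurerm_resource_group.publick8s.name
  sku_name                = "Standard"
  # A short idle timeout here too, so the gateway recycles ports quickly.
  idle_timeout_in_minutes = 4
}

resource "azurerm_nat_gateway_public_ip_association" "publick8s" {
  nat_gateway_id       = azurerm_nat_gateway.publick8s.id
  public_ip_address_id = azurerm_public_ip.nat_outbound.id
}

# Once associated with the AKS node subnet, the gateway takes over all
# outbound traffic from that subnet even with the cluster's outbound_type
# left at "loadBalancer" -- the "not explicit" trick from the blog post.
resource "azurerm_subnet_nat_gateway_association" "publick8s" {
  subnet_id      = azurerm_subnet.publick8s.id
  nat_gateway_id = azurerm_nat_gateway.publick8s.id
}
```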
Ref. jenkins-infra/helpdesk#3908 This PR adds the NAT gateway public IP to the allow list for both `publick8s` and `privatek8s`, to ensure all requests originating from inside the clusters (autoscaler, node healthchecks, API commands for `kubectl logs/exec`, etc.) are allowed to reach the control plane. Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
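A hedged sketch of that allow-list change, assuming the azurerm provider's `api_server_access_profile` block (the admin range and cross-resource references are hypothetical placeholders):

```hcl
# Hypothetical sketch: allow-listing the NAT gateway egress IP on the AKS
# control plane, so in-cluster callers that now egress through the gateway
# can still reach the API server.
resource "azurerm_kubernetes_cluster" "publick8s" {
  # ... same cluster definition as in the earlier sketch ...

  api_server_access_profile {
    authorized_ip_ranges = [
      "203.0.113.0/24",                                  # hypothetical admin range
      "${azurerm_public_ip.nat_outbound.ip_address}/32", # NAT gateway egress IP
    ]
  }
}
```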
Update: we'll delay the switch to a NAT gateway until after the 2.426.3 LTS release
Let's go! Ref. jenkins-infra/azure-net#201
Service(s)
Azure
Summary
The AKS cluster publick8s has been suffering from SNAT port exhaustion for around a month (example below covers the last 24 hours). It causes the following problems:
Reproduction steps
No response