Azure Kubernetes publick8s suffers from SNAT port exhaustion: network slowness #3908
Comments
…nnections (#579) Ref. jenkins-infra/helpdesk#3908 This PR tunes the network outbound method used for both `publick8s` and `privatek8s` (both using load balancers) with:
- TCP idle timeout decreased from 30 min (default) to 4 min, to recycle sockets far more often
- Static allocation of `3200` (and `1600`) SNAT ports on the public outbound IPs, as per the Azure metrics (these values are the upper bounds of each cluster's SNAT connection diagrams).
- Note this disables dynamic allocation: it would only be a problem with more than 50 nodes per cluster, which is not the case for these two.

Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
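For reference, a minimal Terraform sketch of the tuning described above, assuming the azurerm provider's `azurerm_kubernetes_cluster` resource (names, region, and node pool values are hypothetical placeholders, not the actual jenkins-infra/azure code):

```hcl
# Hypothetical sketch only: illustrates the load balancer outbound tuning,
# not the real publick8s definition.
resource "azurerm_kubernetes_cluster" "publick8s" {
  name                = "publick8s"    # placeholder values
  location            = "eastus2"
  resource_group_name = "publick8s-rg"
  dns_prefix          = "publick8s"

  default_node_pool {
    name       = "default"
    node_count = 3
    vm_size    = "Standard_D4s_v3"
  }

  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin = "azure"
    outbound_type  = "loadBalancer"

    load_balancer_profile {
      # Recycle idle TCP connections after 4 minutes instead of the
      # 30-minute default, so SNAT ports are released sooner.
      idle_timeout_in_minutes = 4

      # Statically allocate 3200 SNAT ports per node (must be a multiple
      # of 8), matching the upper bound seen in the Azure metrics.
      # This disables dynamic allocation.
      outbound_ports_allocated = 3200
    }
  }
}
```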
Tracking the ci.jenkins.io migration in #3913
While working on #3837 (comment), we saw the problem re-appear due to additional nodes. It sounds like we should add more public IPs to increase the threshold. If that does not suffice, we'll have to plan a cluster re-creation during the Kubernetes 1.27 upgrade, with a new subnet (and an associated NAT gateway).
…NAT exhaustion (#587) Related to jenkins-infra/helpdesk#3908 This PR increases the number of public IPs used for outbound connections in `publick8s`, in an attempt to raise the SNAT exhaustion threshold. Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
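With static allocation, the node ceiling is (number of outbound IPs × 64,000) / ports allocated per node, so adding IPs is the cheapest way to raise the threshold. A hedged sketch of that change (only the `network_profile` block from the earlier sketch is shown; the IP count of 3 is illustrative, not the value from #587):

```hcl
network_profile {
  network_plugin = "azure"
  outbound_type  = "loadBalancer"

  load_balancer_profile {
    idle_timeout_in_minutes  = 4
    outbound_ports_allocated = 3200

    # Each Standard public IP provides 64,000 SNAT ports. At a static
    # allocation of 3,200 ports per node, each IP covers
    # 64000 / 3200 = 20 nodes, so 3 IPs raise the ceiling to 60 nodes.
    managed_outbound_ip_count = 3
  }
}
```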
Let's check how the weekend goes.
Alas, we still see SNAT port problems. Let's go with the "add a NAT gateway, but not explicitly" approach described in https://www.danielstechblog.io/preventing-snat-port-exhaustion-on-azure-kubernetes-service-with-virtual-network-nat/ (and other posts).
Update: opened jenkins-infra/azure-net#198; let's prepare this PR and check the SNAT metrics before and after deploying, to confirm the SNAT exhaustion disappears. If it does, then we'll decrease the LB outbound IPs to pay less.
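A minimal Terraform sketch of that approach, assuming the azurerm provider (resource names and references such as `azurerm_subnet.publick8s` are hypothetical, not the actual jenkins-infra/azure-net code):

```hcl
# Hypothetical sketch of the "NAT gateway on the cluster subnet" approach.
resource "azurerm_public_ip" "nat_outbound" {
  name                = "publick8s-nat-outbound"   # placeholder names
  location            = azurerm_resource_group.publick8s.location
  resource_group_name = azurerm_resource_group.publick8s.name
  allocation_method   = "Static"
  sku                 = "Standard"
}

resource "azurerm_nat_gateway" "publick8s" {
  name                    = "publick8s-nat-gateway"
  location                = azurerm_resource_group.publick8s.location
  resource_group_name     = azurerm_resource_group.publick8s.name
  sku_name                = "Standard"
  # A short idle timeout here too, so the gateway recycles ports quickly.
  idle_timeout_in_minutes = 4
}

resource "azurerm_nat_gateway_public_ip_association" "publick8s" {
  nat_gateway_id       = azurerm_nat_gateway.publick8s.id
  public_ip_address_id = azurerm_public_ip.nat_outbound.id
}

# Once associated with the AKS node subnet, the gateway takes over all
# outbound traffic from that subnet even with the cluster's outbound_type
# left at "loadBalancer" -- the "not explicit" trick from the blog post.
resource "azurerm_subnet_nat_gateway_association" "publick8s" {
  subnet_id      = azurerm_subnet.publick8s.id
  nat_gateway_id = azurerm_nat_gateway.publick8s.id
}
```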
Ref. jenkins-infra/helpdesk#3908 This PR adds the NAT gateway public IP to the allow list for both `publick8s` and `privatek8s`, to ensure all requests originating from inside the clusters (autoscaler, node healthchecks, API commands for `kubectl logs/exec`, etc.) are allowed to reach the control plane. Signed-off-by: Damien Duportal <damien.duportal@gmail.com>
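A hedged sketch of that allow-list change, assuming the azurerm provider's `api_server_access_profile` block (the admin range and cross-resource references are hypothetical placeholders):

```hcl
# Hypothetical sketch: allow-listing the NAT gateway egress IP on the AKS
# control plane, so in-cluster callers that now egress through the gateway
# can still reach the API server.
resource "azurerm_kubernetes_cluster" "publick8s" {
  # ... same cluster definition as in the earlier sketch ...

  api_server_access_profile {
    authorized_ip_ranges = [
      "203.0.113.0/24",                                  # hypothetical admin range
      "${azurerm_public_ip.nat_outbound.ip_address}/32", # NAT gateway egress IP
    ]
  }
}
```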
Update: we'll delay the switch to a NAT gateway until after the 2.426.3 LTS release
Let's go! Ref. jenkins-infra/azure-net#201
Service(s)
Azure
Summary
The AKS cluster publick8s has been suffering from SNAT port exhaustion for around a month (example below covers the last 24 hours). It causes the following problems:
Reproduction steps
No response