
[BUG] Failure while allocating multiple IPs for a single pod is not handled, which can exhaust the whole IP pool #4210

Closed
jcshare opened this issue Jun 22, 2024 · 16 comments · Fixed by #4238
Labels
bug Something isn't working

Comments


jcshare commented Jun 22, 2024

Kube-OVN Version

v1.12.17 and master

Kubernetes Version

v1.27

Operation-system/Kernel Version

"Ubuntu 20.04.6 LTS" / 5.4.0-186-generic

Description

It looks like there is no handling for IP allocation failures while creating the VPC GW pod, and the whole external IP pool gets exhausted as a result.

After doing some research, the root cause appears to be the following:

pod.go
// do the same thing as add pod
func (c *Controller) reconcileAllocateSubnets(cachedPod, pod *v1.Pod, needAllocatePodNets []*kubeovnNet) (*v1.Pod, error) {
	namespace := pod.Namespace
	name := pod.Name
	klog.Infof("sync pod %s/%s allocated", namespace, name)

	isVMPod, vmName := isVMPod(pod)
	podType := getPodType(pod)
	podName := c.getNameByPod(pod)
	// todo: isVmPod, getPodType, getNameByPod has duplicated logic

	// Avoid create lsp for already running pod in ovn-nb when controller restart
	for _, podNet := range needAllocatePodNets {
		// the subnet may changed when alloc static ip from the latter subnet after ns supports multi subnets
		v4IP, v6IP, mac, subnet, err := c.acquireAddress(pod, podNet)
		if err != nil {
			c.recorder.Eventf(pod, v1.EventTypeWarning, "AcquireAddressFailed", err.Error())
			klog.Error(err)
			return nil, err // <<<<<<<< here: the IP addresses already allocated in previous loop iterations need to be released
		}
		...
	}
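
For illustration, here is a minimal self-contained sketch of the rollback behaviour I would expect (the fakeIPAM/acquire/release/allocateAll names are made up for this example and are not the actual kube-ovn API): when one allocation in the loop fails, every address acquired in the earlier iterations is released again, so repeated retries cannot drain the pool.

// Hypothetical, simplified illustration of rollback on allocation failure.
// None of these types or helpers exist in kube-ovn; they only model the pattern.
package main

import (
	"errors"
	"fmt"
)

type fakeIPAM struct{ used map[string]bool }

func (m *fakeIPAM) acquire(subnet string) (string, error) {
	if subnet == "net1-vpc-1" {
		// simulate the subnet whose static IP allocation fails
		return "", errors.New("NoAvailableAddress")
	}
	ip := fmt.Sprintf("192.168.1.%d", 10+len(m.used))
	m.used[ip] = true
	return ip, nil
}

func (m *fakeIPAM) release(ip string) { delete(m.used, ip) }

// allocateAll acquires one address per subnet and rolls back on the first failure.
func allocateAll(ipam *fakeIPAM, subnets []string) ([]string, error) {
	var acquired []string
	for _, s := range subnets {
		ip, err := ipam.acquire(s)
		if err != nil {
			// release everything acquired so far so the pool is not leaked
			for _, prev := range acquired {
				ipam.release(prev)
			}
			return nil, fmt.Errorf("subnet %s: %w", s, err)
		}
		acquired = append(acquired, ip)
	}
	return acquired, nil
}

func main() {
	ipam := &fakeIPAM{used: map[string]bool{}}
	_, err := allocateAll(ipam, []string{"ovn-vpc-external-network", "net1-vpc-1"})
	fmt.Println("err:", err, "leaked addresses:", len(ipam.used)) // leaked addresses: 0
}

With this pattern a failed sync leaves zero leaked addresses, whereas the current behaviour leaves one external IP in use per retry.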

A VPC GW pod's info:

root@master:~# kubectl describe pod vpc-nat-gw-tanant1-pro1-vpc1-net1-gw-0 -n kube-system
Name:             vpc-nat-gw-tanant1-pro1-vpc1-net1-gw-0
Namespace:        kube-system
Priority:         0
Service Account:  default
Node:             worker2/192.168.1.118
Start Time:       Fri, 21 Jun 2024 11:13:32 +0000
Labels:           app=vpc-nat-gw-tanant1-pro1-vpc1-net1-gw
                  controller-revision-hash=vpc-nat-gw-tanant1-pro1-vpc1-net1-gw-658dfcff4
                  ovn.kubernetes.io/vpc-nat-gw=true
                  statefulset.kubernetes.io/pod-name=vpc-nat-gw-tanant1-pro1-vpc1-net1-gw-0
Annotations:      k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "kube-ovn",
                        "interface": "eth0",
                        "ips": [
                            "172.22.1.254"
                        ],
                        "mac": "00:00:00:16:07:58",
                        "default": true,
                        "dns": {},
                        "gateway": [
                            "172.22.1.1"
                        ]
                    },{
                        "name": "kube-system/ovn-vpc-external-network",
                        "interface": "net1",
                        "ips": [
                            "192.168.1.19"
                        ],
                        "mac": "02:26:74:5d:03:a3",
                        "dns": {}
                    }]

Steps To Reproduce

Create and delete the VPC NAT gateway multiple times.

Current Behavior

The external IP CIDR gets exhausted by this problem.

Expected Behavior

Graceful handling of IP allocation/release to avoid such a problem.

@jcshare jcshare added the bug Something isn't working label Jun 22, 2024

jcshare commented Jun 22, 2024

Could some expert help fix this issue, as I have no authority to do it?
Many thanks.


bobz965 commented Jun 22, 2024

Please attach the error log from the kube-ovn-controller pod about the NAT GW pod IP allocation.


jcshare commented Jun 22, 2024

701 I0619 12:07:04.058569 6 ipam.go:60] allocate v4 192.168.1.10, v6 , mac for kube-system/vpc-nat-gw-gw1-vpc-1-0 from subnet ovn-vpc-external-network
702 I0619 12:07:04.071551 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet net1-vpc-1
703 E0619 12:07:04.072121 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet net1-vpc-1, err NoAvailableAddress
704 I0619 12:07:04.072830 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet ovn-default
705 E0619 12:07:04.073320 6 ipam.go:89] failed to allocate static ip 10.0.1.254 for kube-system/vpc-nat-gw-gw1-vpc-1-0
706 E0619 12:07:04.073525 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet ovn-default, err AddressOutOfRange
707 E0619 12:07:04.073788 6 pod.go:620] AddressOutOfRange
708 E0619 12:07:04.074250 6 pod.go:405] error syncing 'kube-system/vpc-nat-gw-gw1-vpc-1-0': AddressOutOfRange, requeuing
709 I0619 12:07:04.074177 6 event.go:298] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"vpc-nat-gw-gw1-vpc-1-0", UID:"d61e58b5-c8f8-4f79-86cc-4e6d8724f475", AP IVersion:"v1", ResourceVersion:"2632", FieldPath:""}): type: 'Warning' reason: 'AcquireAddressFailed' AddressOutOfRange
710 I0619 12:07:04.080417 6 pod.go:550] handle add/update pod kube-system/vpc-nat-gw-gw1-vpc-1-0
711 I0619 12:07:04.083914 6 pod.go:346] enqueue update pod kube-system/vpc-nat-gw-gw1-vpc-1-0
712 I0619 12:07:04.086506 6 pod.go:607] sync pod kube-system/vpc-nat-gw-gw1-vpc-1-0 allocated
713 I0619 12:07:04.087707 6 ipam.go:60] allocate v4 192.168.1.11, v6 , mac for kube-system/vpc-nat-gw-gw1-vpc-1-0 from subnet ovn-vpc-external-network
714 I0619 12:07:04.097194 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet net1-vpc-1
715 E0619 12:07:04.097443 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet net1-vpc-1, err NoAvailableAddress
716 I0619 12:07:04.097556 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet ovn-default
717 E0619 12:07:04.097627 6 ipam.go:89] failed to allocate static ip 10.0.1.254 for kube-system/vpc-nat-gw-gw1-vpc-1-0
718 E0619 12:07:04.097700 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet ovn-default, err AddressOutOfRange
719 E0619 12:07:04.097851 6 pod.go:620] AddressOutOfRange
720 E0619 12:07:04.097949 6 pod.go:405] error syncing 'kube-system/vpc-nat-gw-gw1-vpc-1-0': AddressOutOfRange, requeuing
721 I0619 12:07:04.098040 6 pod.go:550] handle add/update pod kube-system/vpc-nat-gw-gw1-vpc-1-0
722 I0619 12:07:04.098105 6 event.go:298] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"vpc-nat-gw-gw1-vpc-1-0", UID:"d61e58b5-c8f8-4f79-86cc-4e6d8724f475", AP IVersion:"v1", ResourceVersion:"2633", FieldPath:""}): type: 'Warning' reason: 'AcquireAddressFailed' AddressOutOfRange
723 I0619 12:07:04.101424 6 pod.go:607] sync pod kube-system/vpc-nat-gw-gw1-vpc-1-0 allocated
724 I0619 12:07:04.101533 6 ipam.go:60] allocate v4 192.168.1.12, v6 , mac for kube-system/vpc-nat-gw-gw1-vpc-1-0 from subnet ovn-vpc-external-network
725 I0619 12:07:04.107169 6 pod.go:346] enqueue update pod kube-system/vpc-nat-gw-gw1-vpc-1-0
726 I0619 12:07:04.107456 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet net1-vpc-1
727 E0619 12:07:04.107559 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet net1-vpc-1, err NoAvailableAddress
728 I0619 12:07:04.107574 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet ovn-default
729 E0619 12:07:04.107581 6 ipam.go:89] failed to allocate static ip 10.0.1.254 for kube-system/vpc-nat-gw-gw1-vpc-1-0
730 E0619 12:07:04.107585 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet ovn-default, err AddressOutOfRange
731 E0619 12:07:04.107645 6 pod.go:620] AddressOutOfRange
732 E0619 12:07:04.108065 6 pod.go:405] error syncing 'kube-system/vpc-nat-gw-gw1-vpc-1-0': AddressOutOfRange, requeuing
733 I0619 12:07:04.108083 6 pod.go:550] handle add/update pod kube-system/vpc-nat-gw-gw1-vpc-1-0
734 I0619 12:07:04.107846 6 event.go:298] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"vpc-nat-gw-gw1-vpc-1-0", UID:"d61e58b5-c8f8-4f79-86cc-4e6d8724f475", AP IVersion:"v1", ResourceVersion:"2642", FieldPath:""}): type: 'Warning' reason: 'AcquireAddressFailed' AddressOutOfRange
735 I0619 12:07:04.110791 6 pod.go:607] sync pod kube-system/vpc-nat-gw-gw1-vpc-1-0 allocated
736 I0619 12:07:04.110947 6 ipam.go:60] allocate v4 192.168.1.13, v6 , mac for kube-system/vpc-nat-gw-gw1-vpc-1-0 from subnet ovn-vpc-external-network
737 I0619 12:07:04.116781 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet net1-vpc-1
738 E0619 12:07:04.116801 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet net1-vpc-1, err NoAvailableAddress
739 I0619 12:07:04.116809 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet ovn-default
740 E0619 12:07:04.116901 6 ipam.go:89] failed to allocate static ip 10.0.1.254 for kube-system/vpc-nat-gw-gw1-vpc-1-0
741 E0619 12:07:04.116916 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet ovn-default, err AddressOutOfRange
742 E0619 12:07:04.117040 6 pod.go:620] AddressOutOfRange


bobz965 commented Jun 23, 2024

err: failed to get static ip 10.0.1.254, mac , subnet ovn-default, err AddressOutOfRange
719 E0619 12:07:04.097851 6 pod.go:620] AddressOutOfRange

Please attach the kubectl get subnet details:


jcshare commented Jun 24, 2024

The root cause should be as identified above; we need to handle the failure gracefully.
I have rebuilt my setup, so I'm pasting the subnet and VPC definitions below:

kind: Vpc
apiVersion: kubeovn.io/v1
metadata:
  name: vpc-1
spec:
  staticRoutes:
    - cidr: 0.0.0.0/0
      nextHopIP: 10.0.1.254
      policy: policyDst
  namespaces:
    - ns1
---
kind: Subnet
apiVersion: kubeovn.io/v1
metadata:
  name: net1-vpc-1
spec:
  vpc: vpc-1
  cidrBlock: 10.0.1.0/24
  protocol: IPv4
  excludeIps:
    - 10.0.1.254
  namespaces:
    - ns1
---
kind: VpcNatGateway
apiVersion: kubeovn.io/v1
metadata:
  name: gw-vpc-1
spec:
  vpc: vpc-1
  subnet: net1-vpc-1
  lanIp: 10.0.1.254
  selector:
    - "kubernetes.io/hostname: worker2"
    - "kubernetes.io/os: linux"
  externalSubnets:
    - ovn-vpc-external-network

ubuntu@master:~/project/debug/1.12.7/test$ kubectl get subnet
NAME                       PROVIDER                               VPC                 PROTOCOL   CIDR             PRIVATE   NAT     DEFAULT   GATEWAYTYPE   V4USED   V4AVAILABLE   V6USED   V6AVAILABLE   EXCLUDEIPS                                                   U2OINTERCONNECTIONIP
join                       ovn                                    ovn-cluster         IPv4       100.64.0.0/16    false     false   false     distributed   3        65530         0        0             ["100.64.0.1"]
ovn-default                ovn                                    ovn-cluster         IPv4       10.16.0.0/16     false     true    true      distributed   5        65528         0        0             ["10.16.0.1"]
ovn-vpc-external-network   ovn-vpc-external-network.kube-system                       IPv4       192.168.1.0/24   false     false   false     distributed   3        7             0        0             ["192.168.1.1..192.168.1.9","192.168.1.20..192.168.1.255"]
ubuntu@master:~/project/debug/1.12.7/test$ 


jcshare commented Jun 24, 2024

Per the log above, there looks to be another problem (as you mentioned): the controller shouldn't try to allocate 10.0.1.254 from the ovn-default subnet.
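
For illustration, a minimal self-contained check of the kind that seems to be missing (standard library only; inCIDR is a hypothetical helper, not kube-ovn code): a requested static IP that falls outside a subnet's CIDR should be rejected before any allocation is even attempted on that subnet.

package main

import (
	"fmt"
	"net/netip"
)

// inCIDR reports whether ip falls inside cidr; a controller could use such a
// check to skip subnets whose CIDR cannot possibly contain the requested static IP.
func inCIDR(cidr, ip string) (bool, error) {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return false, err
	}
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return false, err
	}
	return prefix.Contains(addr), nil
}

func main() {
	ok, _ := inCIDR("10.16.0.0/16", "10.0.1.254") // ovn-default CIDR vs. the requested lanIp
	fmt.Println(ok)                               // false -> ovn-default should not even be tried
}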


bobz965 commented Jun 24, 2024

where is your 10.0.1.0/24 subnet ???


jcshare commented Jun 24, 2024

where is your 10.0.1.0/24 subnet ???

Could you help take a deeper look at the problem? It should be easy to reproduce with my configuration above.
My testbed got broken by this problem and I have rebuilt it, so you cannot see that subnet in my current setup.

many thanks


jcshare commented Jun 24, 2024

Anyway, I will reproduce it and upload all the log files later, thanks.


jcshare commented Jun 24, 2024

I have reproduced it with a new VPC named "vpc-3"; the related log files are attached.
Could you help take a look?
Many thanks.
1.12.7-IP-Allocation-bug.zip


bobz965 commented Jun 28, 2024

where is your 10.0.1.0/24 subnet ???

Sorry, I think when you run kubectl get subnet it shows all the subnets, but I do not find the 10.0.1.0/24 subnet.

image


jcshare commented Jun 29, 2024

where is your 10.0.1.0/24 subnet ???

sorry, I think, when you get subnet: it shows all the subnets, but, i do not find the 10.0.1.0/24 subnet

image

Could you refer to my reply above: #4210 (comment)?


jcshare commented Jun 29, 2024

It looks like the problem is obvious; could you help fix it if possible?
Many thanks.


bobz965 commented Jun 29, 2024

You do not have the VPC subnet 10.0.1.0/24; if you use 10.0.1.254, you should create it.
If you use the vpc3 subnet, I think you should use 10.0.3.254.

image


jcshare commented Jun 29, 2024

"if you use vpc3 subnet, I think you should use 10.0.3.254."

yes, I'm using 10.0.3.254 for vpc3, please refer to vpc3(rather than vpc1) related configuration/debug info in the tar ball
the information you mentioned was the stale configuration of vpc1(should be another issue that need to be handled)

thanks


bobz965 commented Jun 29, 2024

image

image

Hi @zhangzujian, it seems IPAM has a problem?
