
[BUG] Failure while allocating multiple IPs for a single pod is not handled, which can exhaust the whole IP pool #4210

Closed
jcshare opened this issue Jun 22, 2024 · 16 comments · Fixed by #4238
Labels
bug Something isn't working

Comments


jcshare commented Jun 22, 2024

Kube-OVN Version

v1.12.17 and master

Kubernetes Version

v1.27

Operation-system/Kernel Version

"Ubuntu 20.04.6 LTS" / 5.4.0-186-generic

Description

It looks like there is no handling for IP allocation failures while creating the VPC GW pod, and the whole external IP pool gets exhausted as a result.

After doing some research, the root cause appears to be the following:

pod.go
// do the same thing as add pod
func (c *Controller) reconcileAllocateSubnets(cachedPod, pod *v1.Pod, needAllocatePodNets []*kubeovnNet) (*v1.Pod, error) {
	namespace := pod.Namespace
	name := pod.Name
	klog.Infof("sync pod %s/%s allocated", namespace, name)

	isVMPod, vmName := isVMPod(pod)
	podType := getPodType(pod)
	podName := c.getNameByPod(pod)
	// todo: isVmPod, getPodType, getNameByPod has duplicated logic

	// Avoid create lsp for already running pod in ovn-nb when controller restart
	for _, podNet := range needAllocatePodNets {
		// the subnet may changed when alloc static ip from the latter subnet after ns supports multi subnets
		v4IP, v6IP, mac, subnet, err := c.acquireAddress(pod, podNet)
		if err != nil {
			c.recorder.Eventf(pod, v1.EventTypeWarning, "AcquireAddressFailed", err.Error())
			klog.Error(err)
			return nil, err // <<<<<<<< here: the IP addresses already allocated in previous loop iterations need to be released
		}
		...
	}
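
For illustration, here is a minimal self-contained sketch of the rollback behaviour I would expect (the fakeIPAM/acquire/release/allocateAll names are made up for this example and are not the actual kube-ovn API): when one allocation in the loop fails, every address acquired in the earlier iterations is released again, so repeated retries cannot drain the pool.

// Hypothetical, simplified illustration of rollback on allocation failure.
// None of these types or helpers exist in kube-ovn; they only model the pattern.
package main

import (
	"errors"
	"fmt"
)

type fakeIPAM struct{ used map[string]bool }

func (m *fakeIPAM) acquire(subnet string) (string, error) {
	if subnet == "net1-vpc-1" {
		// simulate the subnet whose static IP allocation fails
		return "", errors.New("NoAvailableAddress")
	}
	ip := fmt.Sprintf("192.168.1.%d", 10+len(m.used))
	m.used[ip] = true
	return ip, nil
}

func (m *fakeIPAM) release(ip string) { delete(m.used, ip) }

// allocateAll acquires one address per subnet and rolls back on the first failure.
func allocateAll(ipam *fakeIPAM, subnets []string) ([]string, error) {
	var acquired []string
	for _, s := range subnets {
		ip, err := ipam.acquire(s)
		if err != nil {
			// release everything acquired so far so the pool is not leaked
			for _, prev := range acquired {
				ipam.release(prev)
			}
			return nil, fmt.Errorf("subnet %s: %w", s, err)
		}
		acquired = append(acquired, ip)
	}
	return acquired, nil
}

func main() {
	ipam := &fakeIPAM{used: map[string]bool{}}
	_, err := allocateAll(ipam, []string{"ovn-vpc-external-network", "net1-vpc-1"})
	fmt.Println("err:", err, "leaked addresses:", len(ipam.used)) // leaked addresses: 0
}

With this pattern a failed sync leaves zero leaked addresses, whereas the current behaviour leaves one external IP in use per retry.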

A VPC GW pod's info:

root@master:~# kubectl describe pod vpc-nat-gw-tanant1-pro1-vpc1-net1-gw-0 -n kube-system
Name:             vpc-nat-gw-tanant1-pro1-vpc1-net1-gw-0
Namespace:        kube-system
Priority:         0
Service Account:  default
Node:             worker2/192.168.1.118
Start Time:       Fri, 21 Jun 2024 11:13:32 +0000
Labels:           app=vpc-nat-gw-tanant1-pro1-vpc1-net1-gw
                  controller-revision-hash=vpc-nat-gw-tanant1-pro1-vpc1-net1-gw-658dfcff4
                  ovn.kubernetes.io/vpc-nat-gw=true
                  statefulset.kubernetes.io/pod-name=vpc-nat-gw-tanant1-pro1-vpc1-net1-gw-0
Annotations:      k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "kube-ovn",
                        "interface": "eth0",
                        "ips": [
                            "172.22.1.254"
                        ],
                        "mac": "00:00:00:16:07:58",
                        "default": true,
                        "dns": {},
                        "gateway": [
                            "172.22.1.1"
                        ]
                    },{
                        "name": "kube-system/ovn-vpc-external-network",
                        "interface": "net1",
                        "ips": [
                            "192.168.1.19"
                        ],
                        "mac": "02:26:74:5d:03:a3",
                        "dns": {}
                    }]

Steps To Reproduce

Create and delete the VPC NAT gateway multiple times.

Current Behavior

The external IP CIDR gets exhausted by this problem.

Expected Behavior

Graceful handling of IP allocation/release to avoid such a problem.

@jcshare jcshare added the bug Something isn't working label Jun 22, 2024

jcshare commented Jun 22, 2024

Could some expert help fix this issue, as I have no authority to do it?
Many thanks.


bobz965 commented Jun 22, 2024

Please attach the error log from the kube-ovn-controller pod about the NAT GW pod IP allocation.


jcshare commented Jun 22, 2024

701 I0619 12:07:04.058569 6 ipam.go:60] allocate v4 192.168.1.10, v6 , mac for kube-system/vpc-nat-gw-gw1-vpc-1-0 from subnet ovn-vpc-external-network
702 I0619 12:07:04.071551 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet net1-vpc-1
703 E0619 12:07:04.072121 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet net1-vpc-1, err NoAvailableAddress
704 I0619 12:07:04.072830 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet ovn-default
705 E0619 12:07:04.073320 6 ipam.go:89] failed to allocate static ip 10.0.1.254 for kube-system/vpc-nat-gw-gw1-vpc-1-0
706 E0619 12:07:04.073525 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet ovn-default, err AddressOutOfRange
707 E0619 12:07:04.073788 6 pod.go:620] AddressOutOfRange
708 E0619 12:07:04.074250 6 pod.go:405] error syncing 'kube-system/vpc-nat-gw-gw1-vpc-1-0': AddressOutOfRange, requeuing
709 I0619 12:07:04.074177 6 event.go:298] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"vpc-nat-gw-gw1-vpc-1-0", UID:"d61e58b5-c8f8-4f79-86cc-4e6d8724f475", AP IVersion:"v1", ResourceVersion:"2632", FieldPath:""}): type: 'Warning' reason: 'AcquireAddressFailed' AddressOutOfRange
710 I0619 12:07:04.080417 6 pod.go:550] handle add/update pod kube-system/vpc-nat-gw-gw1-vpc-1-0
711 I0619 12:07:04.083914 6 pod.go:346] enqueue update pod kube-system/vpc-nat-gw-gw1-vpc-1-0
712 I0619 12:07:04.086506 6 pod.go:607] sync pod kube-system/vpc-nat-gw-gw1-vpc-1-0 allocated
713 I0619 12:07:04.087707 6 ipam.go:60] allocate v4 192.168.1.11, v6 , mac for kube-system/vpc-nat-gw-gw1-vpc-1-0 from subnet ovn-vpc-external-network
714 I0619 12:07:04.097194 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet net1-vpc-1
715 E0619 12:07:04.097443 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet net1-vpc-1, err NoAvailableAddress
716 I0619 12:07:04.097556 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet ovn-default
717 E0619 12:07:04.097627 6 ipam.go:89] failed to allocate static ip 10.0.1.254 for kube-system/vpc-nat-gw-gw1-vpc-1-0
718 E0619 12:07:04.097700 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet ovn-default, err AddressOutOfRange
719 E0619 12:07:04.097851 6 pod.go:620] AddressOutOfRange
720 E0619 12:07:04.097949 6 pod.go:405] error syncing 'kube-system/vpc-nat-gw-gw1-vpc-1-0': AddressOutOfRange, requeuing
721 I0619 12:07:04.098040 6 pod.go:550] handle add/update pod kube-system/vpc-nat-gw-gw1-vpc-1-0
722 I0619 12:07:04.098105 6 event.go:298] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"vpc-nat-gw-gw1-vpc-1-0", UID:"d61e58b5-c8f8-4f79-86cc-4e6d8724f475", AP IVersion:"v1", ResourceVersion:"2633", FieldPath:""}): type: 'Warning' reason: 'AcquireAddressFailed' AddressOutOfRange
723 I0619 12:07:04.101424 6 pod.go:607] sync pod kube-system/vpc-nat-gw-gw1-vpc-1-0 allocated
724 I0619 12:07:04.101533 6 ipam.go:60] allocate v4 192.168.1.12, v6 , mac for kube-system/vpc-nat-gw-gw1-vpc-1-0 from subnet ovn-vpc-external-network
725 I0619 12:07:04.107169 6 pod.go:346] enqueue update pod kube-system/vpc-nat-gw-gw1-vpc-1-0
726 I0619 12:07:04.107456 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet net1-vpc-1
727 E0619 12:07:04.107559 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet net1-vpc-1, err NoAvailableAddress
728 I0619 12:07:04.107574 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet ovn-default
729 E0619 12:07:04.107581 6 ipam.go:89] failed to allocate static ip 10.0.1.254 for kube-system/vpc-nat-gw-gw1-vpc-1-0
730 E0619 12:07:04.107585 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet ovn-default, err AddressOutOfRange
731 E0619 12:07:04.107645 6 pod.go:620] AddressOutOfRange
732 E0619 12:07:04.108065 6 pod.go:405] error syncing 'kube-system/vpc-nat-gw-gw1-vpc-1-0': AddressOutOfRange, requeuing
733 I0619 12:07:04.108083 6 pod.go:550] handle add/update pod kube-system/vpc-nat-gw-gw1-vpc-1-0
734 I0619 12:07:04.107846 6 event.go:298] Event(v1.ObjectReference{Kind:"Pod", Namespace:"kube-system", Name:"vpc-nat-gw-gw1-vpc-1-0", UID:"d61e58b5-c8f8-4f79-86cc-4e6d8724f475", AP IVersion:"v1", ResourceVersion:"2642", FieldPath:""}): type: 'Warning' reason: 'AcquireAddressFailed' AddressOutOfRange
735 I0619 12:07:04.110791 6 pod.go:607] sync pod kube-system/vpc-nat-gw-gw1-vpc-1-0 allocated
736 I0619 12:07:04.110947 6 ipam.go:60] allocate v4 192.168.1.13, v6 , mac for kube-system/vpc-nat-gw-gw1-vpc-1-0 from subnet ovn-vpc-external-network
737 I0619 12:07:04.116781 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet net1-vpc-1
738 E0619 12:07:04.116801 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet net1-vpc-1, err NoAvailableAddress
739 I0619 12:07:04.116809 6 ipam.go:72] allocating static ip 10.0.1.254 from subnet ovn-default
740 E0619 12:07:04.116901 6 ipam.go:89] failed to allocate static ip 10.0.1.254 for kube-system/vpc-nat-gw-gw1-vpc-1-0
741 E0619 12:07:04.116916 6 pod.go:1762] failed to get static ip 10.0.1.254, mac , subnet ovn-default, err AddressOutOfRange
742 E0619 12:07:04.117040 6 pod.go:620] AddressOutOfRange


bobz965 commented Jun 23, 2024

err: failed to get static ip 10.0.1.254, mac , subnet ovn-default, err AddressOutOfRange
719 E0619 12:07:04.097851 6 pod.go:620] AddressOutOfRange

Please attach the kubectl get subnet details:


jcshare commented Jun 24, 2024

The root cause should be as identified above; we need to handle the failure gracefully.
I have rebuilt my setup, so I'm pasting the subnet and VPC definitions below:

kind: Vpc
apiVersion: kubeovn.io/v1
metadata:
  name: vpc-1
spec:
  staticRoutes:
    - cidr: 0.0.0.0/0
      nextHopIP: 10.0.1.254
      policy: policyDst
  namespaces:
    - ns1
---
kind: Subnet
apiVersion: kubeovn.io/v1
metadata:
  name: net1-vpc-1
spec:
  vpc: vpc-1
  cidrBlock: 10.0.1.0/24
  protocol: IPv4
  excludeIps:
    - 10.0.1.254
  namespaces:
    - ns1
---
kind: VpcNatGateway
apiVersion: kubeovn.io/v1
metadata:
  name: gw-vpc-1
spec:
  vpc: vpc-1
  subnet: net1-vpc-1
  lanIp: 10.0.1.254
  selector:
    - "kubernetes.io/hostname: worker2"
    - "kubernetes.io/os: linux"
  externalSubnets:
    - ovn-vpc-external-network

ubuntu@master:~/project/debug/1.12.7/test$ kubectl get subnet
NAME                       PROVIDER                               VPC                 PROTOCOL   CIDR             PRIVATE   NAT     DEFAULT   GATEWAYTYPE   V4USED   V4AVAILABLE   V6USED   V6AVAILABLE   EXCLUDEIPS                                                   U2OINTERCONNECTIONIP
join                       ovn                                    ovn-cluster         IPv4       100.64.0.0/16    false     false   false     distributed   3        65530         0        0             ["100.64.0.1"]
ovn-default                ovn                                    ovn-cluster         IPv4       10.16.0.0/16     false     true    true      distributed   5        65528         0        0             ["10.16.0.1"]
ovn-vpc-external-network   ovn-vpc-external-network.kube-system                       IPv4       192.168.1.0/24   false     false   false     distributed   3        7             0        0             ["192.168.1.1..192.168.1.9","192.168.1.20..192.168.1.255"]
ubuntu@master:~/project/debug/1.12.7/test$ 


jcshare commented Jun 24, 2024

Per the log above, there looks to be another problem (as you mentioned): the controller shouldn't try to allocate 10.0.1.254 from the ovn-default subnet.
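
For illustration, a minimal self-contained check of the kind that seems to be missing (standard library only; inCIDR is a hypothetical helper, not kube-ovn code): a requested static IP that falls outside a subnet's CIDR should be rejected before any allocation is even attempted on that subnet.

package main

import (
	"fmt"
	"net/netip"
)

// inCIDR reports whether ip falls inside cidr; a controller could use such a
// check to skip subnets whose CIDR cannot possibly contain the requested static IP.
func inCIDR(cidr, ip string) (bool, error) {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return false, err
	}
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return false, err
	}
	return prefix.Contains(addr), nil
}

func main() {
	ok, _ := inCIDR("10.16.0.0/16", "10.0.1.254") // ovn-default CIDR vs. the requested lanIp
	fmt.Println(ok)                               // false -> ovn-default should not even be tried
}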


bobz965 commented Jun 24, 2024

where is your 10.0.1.0/24 subnet ???


jcshare commented Jun 24, 2024

where is your 10.0.1.0/24 subnet ???

Could you help take a deeper look at the problem? It should be easy to reproduce with my configuration above.
My testbed got broken by this problem and I have rebuilt it, so you cannot see that subnet in my current setup.

many thanks


jcshare commented Jun 24, 2024

Anyway, I will reproduce it and upload all the log files later, thanks.


jcshare commented Jun 24, 2024

I have reproduced it with a new VPC named "vpc-3"; the related log files are attached.
Could you help take a look?
Many thanks.
1.12.7-IP-Allocation-bug.zip


bobz965 commented Jun 28, 2024

where is your 10.0.1.0/24 subnet ???

Sorry, I think when you run kubectl get subnet it shows all the subnets, but I do not find the 10.0.1.0/24 subnet.

image


jcshare commented Jun 29, 2024

where is your 10.0.1.0/24 subnet ???

sorry, I think, when you get subnet: it shows all the subnets, but, i do not find the 10.0.1.0/24 subnet

image

Could you refer to my reply above: #4210 (comment)?


jcshare commented Jun 29, 2024

It looks like the problem is obvious; could you help fix it if possible?
Many thanks.


bobz965 commented Jun 29, 2024

You do not have the VPC subnet 10.0.1.0/24; if you use 10.0.1.254, you should create it.
If you use the vpc3 subnet, I think you should use 10.0.3.254.

image


jcshare commented Jun 29, 2024

"if you use vpc3 subnet, I think you should use 10.0.3.254."

yes, I'm using 10.0.3.254 for vpc3, please refer to vpc3(rather than vpc1) related configuration/debug info in the tar ball
the information you mentioned was the stale configuration of vpc1(should be another issue that need to be handled)

thanks


bobz965 commented Jun 29, 2024

image

image

Hi @zhangzujian, it seems IPAM has a problem?
