IPPool is empty error #5599

Closed
yuraxdrumz opened this issue Apr 27, 2022 · 12 comments · Fixed by networkservicemesh/cmd-exclude-prefixes-k8s#146

@yuraxdrumz

yuraxdrumz commented Apr 27, 2022

Hello.
During my tests with NSM, things work fine until, at some point, I start getting an "IPPool is empty" error and can't run even a single nsc.
I run a lot of tests, deleting and adding NSCs and NSEs all the time.
My guess is that something is not being released back to the IP pool.
I tried a different CIDR, recreating the NetworkService, creating a new NetworkService, and restarting all NSM control-plane and data-plane components. None of this helped.

I am running on EKS with 4 m5.large nodes, the Cilium CNI, and VPP as the forwarder.

nsc error log - cmd-nsc-init-err.txt
Passthrough example - passthrough.zip

I would love to help on this if needed, thanks!

@denis-tingaikin
Member

denis-tingaikin commented Apr 27, 2022

Hello @yuraxdrumz!

As far as I can see, your apps are based on the Feb 26 changes (see commit 8ddd0be6e5d1 in the SDK).

The problem may already be resolved in the main version of NSM.

Could you check this with the latest version?
https://github.com/networkservicemesh/deployments-k8s/tree/main/apps

@yuraxdrumz
Author

yuraxdrumz commented Apr 28, 2022

Hello @denis-tingaikin,
I updated all the components to the latest versions and the problem persists.

alpine-composition.txt
forwarder-vpp-t5bcw.txt
nse-passthrough-3-56f8f8c6cd-l8v5z.txt
nse-passthrough-2-75878b74dd-pq2dh.txt
nse-passthrough-1-667ffd8b65-7n2t6.txt
nsmgr-vd9tq.txt
nsmgr-9sd52.txt
nsmgr-gvqlt.txt

Strangely enough, I don't see any logs in the passthroughs or the NSE, and on running the nsc for the first time I hit the request timeout with the "IPPool is empty" error.
alpine-composition.txt

Because it does work once in a while, I ruled out a security-group issue.

Another thing I tried was putting all components on a single node. When I restart the manager, I see the nsc trying to reach an old instance of the NSE, which should already have been deleted from the registry.

nsmgr-7qqtq.txt

The one line that stood out was

Apr 28 09:14:14.495 [TRAC] [type:registry] (14.1)                recv-response={"network_service_endpoint":{"name":"nse-kernel-69dccd667b-5prpc","network_service_names":["composition"],"network_service_labels":{"composition":{"labels":{"app":"destination-app","nodeName":"ip-192-168-189-67.eu-west-1.compute.internal"}}},"url":"tcp://192.168.189.67:5001","expiration_time":{"seconds":1651072641,"nanos":98046844},"initial_registration_time":{"seconds":1651072144,"nanos":997370982}}} span=40199ab3a2a865e8

After checking the expiration time of the above entry (Wed Apr 27 2022 15:17:21 GMT+0000), I noticed it is around the time I started having issues.
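
For anyone repeating this check, here is a minimal sketch of how the `expiration_time` in the log line above can be decoded and compared with the time the response was logged. The JSON is a trimmed copy of the quoted response. Treating the `Apr 28 09:14:14` log prefix as UTC in 2022 is an assumption, and `to_utc` is a helper name made up for this illustration.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the registry response quoted above (only the fields used here).
raw = (
    '{"network_service_endpoint":{"name":"nse-kernel-69dccd667b-5prpc",'
    '"url":"tcp://192.168.189.67:5001",'
    '"expiration_time":{"seconds":1651072641,"nanos":98046844},'
    '"initial_registration_time":{"seconds":1651072144,"nanos":997370982}}}'
)


def to_utc(ts):
    """Convert a {"seconds": ..., "nanos": ...} timestamp to an aware UTC datetime.

    The nanosecond part is ignored; second precision is enough for this check.
    """
    return datetime.fromtimestamp(ts["seconds"], tz=timezone.utc)


nse = json.loads(raw)["network_service_endpoint"]
expires = to_utc(nse["expiration_time"])
registered = to_utc(nse["initial_registration_time"])

# Assumption: the "Apr 28 09:14:14" log prefix is UTC and the year is 2022.
seen_at = datetime(2022, 4, 28, 9, 14, 14, tzinfo=timezone.utc)
is_stale = seen_at > expires

print(nse["name"], "expires", expires.isoformat(), "stale:", is_stale)
```

Under those assumptions, the registry returned this endpoint well after its expiration time had passed, which fits the suspicion that a dead NSE is still being offered.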

EDIT: my guess is that the old NSE gets selected and, since it no longer exists, the request times out; for some reason I then see the IP pool error.
EDIT2: I saw networkservicemesh/sdk#1007 and networkservicemesh/sdk#1020, which might be related to my issue.
However, I still don't fully understand how to tackle it.

@denis-tingaikin
Member

denis-tingaikin commented Apr 28, 2022

The problems networkservicemesh/sdk#1020 and networkservicemesh/sdk#1007 are no longer relevant. The errors you are seeing are related to the first incorrect request from the client (nsc).

As far as I can see, for some reason the forwarder is selecting nse-kernel-544dd8fd4b-5qfwv on the first call.

@yuraxdrumz Could you please attach the describe output for your networkservices and networkserviceendpoints (just call kubectl describe networkserviceendpoints -A), and also the NSM annotation that you're using in alpine-composition?

It would also be nice to get logs from nse-kernel and from cmd-nsc-init.

denis-tingaikin added this to the v1.4.0 milestone on Apr 28, 2022

@yuraxdrumz
Author

I tried deleting and switching the networkservices several times.
The latest one looks like this:

apiVersion: networkservicemesh.io/v1
kind: NetworkService
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: >
      {"apiVersion":"networkservicemesh.io/v1","kind":"NetworkService","metadata":{"annotations":{},"name":"composition","namespace":"nsm-system"},"spec":{"matches":[{"routes":[{"destination_selector":{"app":"destination-app"}}],"source_selector":{"app":"passthrough-1"}},{"routes":[{"destination_selector":{"app":"passthrough-1"}}]}],"payload":"ETHERNET"}}
  creationTimestamp: '2022-04-28T09:06:10Z'
  generation: 1
  managedFields:
    - apiVersion: networkservicemesh.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:annotations:
            .: {}
            f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:spec: {}
      manager: kubectl-client-side-apply
      operation: Update
      time: '2022-04-28T09:06:10Z'
  name: composition
  namespace: nsm-system
  resourceVersion: '5735744'
  uid: e4935cd7-4339-42e0-976b-5685dba72d1b
  selfLink: >-
    /apis/networkservicemesh.io/v1/namespaces/nsm-system/networkservices/composition
spec:
  matches:
    - routes:
        - destination_selector:
            app: destination-app
      source_selector:
        app: passthrough-1
    - routes:
        - destination_selector:
            app: passthrough-1
  payload: ETHERNET
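
As a reading aid for the `matches` section above, here is a simplified model of how the selectors are expected to route a request. It rests on the assumption that matches are evaluated in order and that a match without a `source_selector` accepts any client. It is not the NSM SDK implementation, and `select_destination` is a name invented for this sketch.

```python
# The "matches" section of the NetworkService above, as plain Python data.
MATCHES = [
    {
        "source_selector": {"app": "passthrough-1"},
        "routes": [{"destination_selector": {"app": "destination-app"}}],
    },
    {
        # No source_selector: assumed to match any client.
        "routes": [{"destination_selector": {"app": "passthrough-1"}}],
    },
]


def select_destination(matches, source_labels):
    """Return the destination selector of the first match whose source_selector
    is satisfied by source_labels, or None if nothing matches."""
    for match in matches:
        selector = match.get("source_selector", {})
        if all(source_labels.get(key) == value for key, value in selector.items()):
            return match["routes"][0]["destination_selector"]
    return None


# An unlabeled client (the nsc) should be sent to passthrough-1 first.
print(select_destination(MATCHES, {}))
# A request coming from passthrough-1 should go on to the destination app.
print(select_destination(MATCHES, {"app": "passthrough-1"}))
```

In this model the first request from the nsc should land on passthrough-1, so a first call that selects an nse-kernel endpoint would point at something other than this spec.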

Regarding the networkserviceendpoints, I still have entries even for those I already deleted.

NAMESPACE    NAME                                 AGE
nsm-system   forwarder-vpp-5sfjz                  30h
nsm-system   forwarder-vpp-5t68m                  30h
nsm-system   forwarder-vpp-dpjhs                  12h
nsm-system   forwarder-vpp-g97wd                  32h
nsm-system   forwarder-vpp-hchd8                  30h
nsm-system   forwarder-vpp-lpzrq                  30h
nsm-system   forwarder-vpp-mp7vf                  32h
nsm-system   forwarder-vpp-rxhwk                  12h
nsm-system   forwarder-vpp-sqp2t                  12h
nsm-system   forwarder-vpp-t5bcw                  12h
nsm-system   nse-kernel-69dccd667b-5prpc          29h
nsm-system   nse-kernel-c7b75c75-dqpsv            12h
nsm-system   nse-passthrough-1-655fcb6f89-p2zg6   29h
nsm-system   nse-passthrough-1-6fbbdbbfdc-5mhls   12h
nsm-system   nse-passthrough-2-6d5bdd5854-nr8gs   29h
nsm-system   nse-passthrough-2-7cdb55c65f-lknwj   12h
nsm-system   nse-passthrough-3-6c659766d6-q992h   29h
nsm-system   nse-passthrough-3-6f6d598df-jfl6v    12h

all networkserviceendpoints - nse.txt
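
A rough way to separate the leftover entries in the listing above is to group them by age. The sketch below treats the youngest generation as the live one, which is only a heuristic: it assumes every running component was redeployed at the same time and that ages are single-unit values such as `12h`. Comparing the names against the pods that are actually running would be the reliable check. The helper names are made up for this example.

```python
LISTING = """\
NAMESPACE    NAME                                 AGE
nsm-system   forwarder-vpp-5sfjz                  30h
nsm-system   forwarder-vpp-5t68m                  30h
nsm-system   forwarder-vpp-dpjhs                  12h
nsm-system   forwarder-vpp-g97wd                  32h
nsm-system   forwarder-vpp-hchd8                  30h
nsm-system   forwarder-vpp-lpzrq                  30h
nsm-system   forwarder-vpp-mp7vf                  32h
nsm-system   forwarder-vpp-rxhwk                  12h
nsm-system   forwarder-vpp-sqp2t                  12h
nsm-system   forwarder-vpp-t5bcw                  12h
nsm-system   nse-kernel-69dccd667b-5prpc          29h
nsm-system   nse-kernel-c7b75c75-dqpsv            12h
nsm-system   nse-passthrough-1-655fcb6f89-p2zg6   29h
nsm-system   nse-passthrough-1-6fbbdbbfdc-5mhls   12h
nsm-system   nse-passthrough-2-6d5bdd5854-nr8gs   29h
nsm-system   nse-passthrough-2-7cdb55c65f-lknwj   12h
nsm-system   nse-passthrough-3-6c659766d6-q992h   29h
nsm-system   nse-passthrough-3-6f6d598df-jfl6v    12h
"""

# Hours per unit for single-unit ages such as "45m", "12h" or "2d".
UNIT_HOURS = {"m": 1 / 60, "h": 1, "d": 24}


def age_hours(age):
    """Convert a single-unit age string to hours."""
    return int(age[:-1]) * UNIT_HOURS[age[-1]]


def split_generations(listing):
    """Return (youngest, older) endpoint names, each sorted alphabetically."""
    rows = [line.split() for line in listing.strip().splitlines()[1:]]
    ages = {name: age_hours(age) for _namespace, name, age in rows}
    newest = min(ages.values())
    youngest = sorted(name for name, hours in ages.items() if hours == newest)
    older = sorted(name for name, hours in ages.items() if hours > newest)
    return youngest, older


youngest, older = split_generations(LISTING)
print("likely live:", len(youngest), "likely stale:", len(older))
```

The youngest generation contains four forwarders, which lines up with the 4-node cluster described at the top of the issue, plus one entry per NSE deployment.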

The nsm annotation used is

  annotations:
    networkservicemesh.io: kernel://composition/nsm-1

Full nsc config:

---
apiVersion: v1
kind: Pod
metadata:
  name: alpine-composition
  labels:
    app: alpine-composition
    "spiffe.io/spiffe-id": "true"
  annotations:
    networkservicemesh.io: kernel://composition/nsm-1
spec:
  nodeName: ip-192-168-189-67.eu-west-1.compute.internal
  containers:
    - name: alpine
      image: alpine:3.15.0
      imagePullPolicy: IfNotPresent
      stdin: true
      tty: true

Logs from cmd-nsc-init: alpine-composition.txt

@denis-tingaikin
Member

denis-tingaikin commented Apr 28, 2022

@yuraxdrumz Thanks! Looking.

Question: Did you try to run this example without changes? https://github.com/networkservicemesh/deployments-k8s/tree/main/examples/features/nse-composition

UPD: I'd recommend deleting all non-forwarder networkserviceendpoints via kubectl delete ... before re-testing. (Actually, you could delete all NSEs; the forwarders should re-register immediately.)

@denis-tingaikin
Member

denis-tingaikin commented Apr 29, 2022

@yuraxdrumz I'm not sure, but you probably ran into something like this:

  1. You tried to run the composition (NSE entries were created in etcd).
  2. Something went wrong and you deleted all components. NSEs are only removed by timeout while the registry is alive, so we got a leak. This is definitely a problem in the current registry, which cannot check all existing NSEs in etcd on startup. I'll create an issue for that.
  3. Now all your new attempts go wrong because the dead NSEs are adding noise.

Could you try this:

Steps:

  1. Remove all NSEs via kubectl delete networkserviceendpoints --all -A
  2. Remove all services via kubectl delete networkservices --all -A
  3. Start https://github.com/networkservicemesh/deployments-k8s/tree/main/examples/features/nse-composition

Actual: TODO
Expected: NSM is working
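
The first two steps can be scripted along these lines. This is a sketch, not an official NSM tool: the function names are invented, and it defaults to a dry run that only prints the commands. The commands include `--all`, since `kubectl delete` expects a resource name, a label selector, or `--all`.

```python
import subprocess


def cleanup_commands():
    """kubectl invocations for steps 1 and 2: drop every registered NSE, then every service."""
    return [
        ["kubectl", "delete", "networkserviceendpoints", "--all", "--all-namespaces"],
        ["kubectl", "delete", "networkservices", "--all", "--all-namespaces"],
    ]


def run_cleanup(dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for command in cleanup_commands():
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first failing kubectl call.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_cleanup(dry_run=True)
```

Calling `run_cleanup(dry_run=False)` runs the deletions against whatever cluster the current kubectl context points at, so it is worth checking the context first.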

@denis-tingaikin
Member

denis-tingaikin commented Apr 30, 2022

@yuraxdrumz We have AWS integration testing, but we use a 2-node setup. I'll try to reproduce your problem on Monday with your AWS setup. Let me know if you get any updates with the steps above ☝️

@yuraxdrumz
Author

Hey @denis-tingaikin.
Thanks for replying fast!
Steps I took:

  • Remove all network service endpoints
  • Forwarders got reregistered
  • Remove all network services
  • Run nse-composition example you provided

Same error as before. Adding the cmd-nsc-init log - alpine.txt
Registry response - registry-k8s-78f5768d97-8trrz.txt

Expected: NSM is working
Actual: IP Pool is empty error

@yuraxdrumz
Author

Anything new?

@glazychev-art
Contributor

Related to: networkservicemesh/sdk#1297

@denis-tingaikin
Member

@glazychev-art This is a really good catch.

I poked around in the code a bit and found the real root cause: networkservicemesh/cmd-exclude-prefixes-k8s#144

@denis-tingaikin
Member

@yuraxdrumz This should be fixed in the latest master. Could you re-test?
Feel free to reopen the issue if the problem persists.
