
Sometimes vl3 network interface is recreated #8351

Closed
d-uzlov opened this issue Dec 21, 2022 · 9 comments
Assignees
Labels
bug Something isn't working help wanted Extra attention is needed

Comments

@d-uzlov
Contributor

d-uzlov commented Dec 21, 2022

Expected Behavior

The NSM network interface is created once and keeps working until it is explicitly closed, usually when the pod is deleted.

Current Behavior

I somehow got my cluster into a state where one of the nsm interfaces was regularly deleted and recreated, making it impossible to use.

Failure Information (for bugs)

I noticed the issue when running tcpdump:

$ k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-2-nsm.pcap
tcpdump: listening on nsm-1, link-type RAW (Raw IP), snapshot length 262144 bytes
tcpdump: pcap_loop: The interface disappeared
8 packets captured
8 packets received by filter
0 packets dropped by kernel
command terminated with exit code 1

I tried to capture tcpdump output on the istiod pod.
The interface was stable at first.
Then I deployed a second pod, with istio-proxy, that tried to connect to istiod via the nsm interface.
Right after this second pod was created, the nsm interface was recreated (tcpdump said "The interface disappeared", but ip a still showed the nsm-1 interface in the list).
When I tried to run tcpdump again, I got the same error after a few seconds.
The interface became stable again after I deleted the pod with istio-proxy.

I tried this a few more times, and each time tcpdump complained about the interface disappearing while the istio-proxy pod existed.
A few times I even got an error that the interface doesn't exist, so apparently there is some delay between the interface deletion and recreation events.
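One way to confirm such delete/recreate cycles from inside the client container, without relying on tcpdump exiting, is to watch kernel link events. This is a minimal sketch, not part of the original report; the pod and interface names are the ones used in this setup:

```shell
# Watch kernel link events inside the cmd-nsc container and print only
# those mentioning the nsm-1 interface. Each "Deleted ... nsm-1" line
# marks one teardown; a new link line with the same name right after it
# marks the recreation.
kubectl exec -n istio-system deployments/istiod -c cmd-nsc -- \
  sh -c "ip monitor link | grep --line-buffered 'nsm-1'"
```

A Deleted event followed shortly by a fresh link line for the same name would distinguish a real delete/recreate cycle from the interface merely going down and up.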

Steps to Reproduce

Unfortunately, I was unable to reproduce this issue on demand.
I was experimenting with istio+nsm integration, and one time I randomly got this behavior.
But I took the cluster-info dump logs.

Here are rough instructions for the setup I was testing:
https://github.com/d-uzlov/deployments-k8s/blob/deployment-issue-do-not-delete/examples/interdomain/nsm_istio_vl3/clean/
There are also uncompressed cluster-info dump logs in this branch.

I'm not sure if these instructions will be helpful, because usually they don't produce the error described in this issue.

Context

  • Kubernetes Version: v1.25
  • NSM Version: v1.7.0-rc.2

Failure Logs

Log files from kubectl cluster-info dump:
dump-1.zip

@denis-tingaikin denis-tingaikin added the bug Something isn't working label Dec 21, 2022
@d-uzlov
Contributor Author

d-uzlov commented Dec 22, 2022

I got this issue again.

What I did:

I actually just refreshed my cluster, so there is not much.
I took a capture of my console command history:

 1401* ../delete-cluster.sh; ../create-cluster.sh; ../init-multicluster-2.sh
 1402  kubectl --kubeconfig=$KUBECONFIG1 apply -k ./vl3-dns
       sleep 0.5
       kubectl --kubeconfig=$KUBECONFIG1 -n ns-dns-vl3 wait --for=condition=ready --timeout=1m pod -l app=nse-vl3-vpp
       kubectl --kubeconfig=$KUBECONFIG1 -n ns-dns-vl3 wait --for=condition=ready --timeout=1m pod -l app=vl3-ipam
 1403  kubectl --kubeconfig=$KUBECONFIG1 apply -f istio-namespace.yaml
       istioctl install --set profile=minimal -y --kubeconfig=$KUBECONFIG1 --set meshConfig.accessLogFile=/dev/stdout
       k1 exec -n istio-system deployments/istiod -c cmd-nsc -- apk add tcpdump
 1404  WORK_DIR="$(git rev-parse --show-toplevel)/examples/interdomain/nsm_istio_vl3/clean/istio-vm-configs"
       VM_APP="vm-app"
       VM_NAMESPACE="vm-ns"
       SERVICE_ACCOUNT="serviceaccountvm"
       CLUSTER_NETWORK=""
       VM_NETWORK=""
       CLUSTER="Kubernetes"
 1405  kubectl --kubeconfig=$KUBECONFIG1 create namespace "${VM_NAMESPACE}"
       kubectl --kubeconfig=$KUBECONFIG1 create serviceaccount "${SERVICE_ACCOUNT}" -n "${VM_NAMESPACE}"
 1406  k1 exec -n istio-system deployments/istiod -c cmd-nsc -- ip a
 1407  time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >1-istio-standard.pcap &
       sleep 1
       k1 apply -k ubuntu-standard
       sleep 0.5
       k1 -n vl3-test wait --for=condition=ready --timeout=1m pod -l app=ubuntu
       sleep 1
       kill -2 $!
       sleep 1
       k1 delete -k ubuntu-standard
       tshark -r 1-istio-standard.pcap
 1408  istioctl x workload entry configure -f workloadgroup.yaml -o "${WORK_DIR}" --clusterID "${CLUSTER}" --kubeconfig=$KUBECONFIG1 --ingressIP=172.16.0.2
       # sed -i '' 's/15012/15010/' "${WORK_DIR}/mesh.yaml"
       rm -rf ubuntu-standard/istio-vm-configs
       rm -rf ubuntu-hosts/istio-vm-configs
       rm -rf ubuntu-hosts-2/istio-vm-configs
       rm -rf ubuntu-hosts-2-vmlike/istio-vm-configs
       cp -r "${WORK_DIR}" ubuntu-standard/istio-vm-configs
       cp -r "${WORK_DIR}" ubuntu-hosts/istio-vm-configs
       cp -r "${WORK_DIR}" ubuntu-hosts-2/istio-vm-configs
       cp -r "${WORK_DIR}" ubuntu-hosts-2-vmlike/istio-vm-configs
 1409  time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-1-nsm.pcap &
       sleep 1
       k1 apply -k ubuntu-hosts
       sleep 0.5
       k1 -n vl3-test wait --for=condition=ready --timeout=1m pod -l app=ubuntu
       k1 -n vl3-test get pod
       # k1 -n vl3-test exec deployments/ubuntu-deployment -c istio-proxy -- curl 172.16.0.2:8080/ready -vs
       sleep 1
       kill -2 $!
       time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-2-nsm.pcap
       sleep 1
       k1 delete -k ubuntu-hosts
       tshark -r 4-istio-tcpdump-1-nsm.pcap
 1410  k1 delete -k ubuntu-hosts
 1411  time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-1-nsm.pcap &
       sleep 1
       k1 apply -k ubuntu-hosts
       sleep 0.5
       k1 -n vl3-test wait --for=condition=ready --timeout=1m pod -l app=ubuntu
       k1 -n vl3-test get pod
       k1 -n vl3-test exec deployments/ubuntu-deployment -c istio-proxy -- curl 172.16.0.2:8080/ready -vs
       sleep 1
       kill -2 $!
       # time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-2-nsm.pcap
       sleep 1
       k1 delete -k ubuntu-hosts
       tshark -r 4-istio-tcpdump-1-nsm.pcap
       # tshark -r 4-istio-tcpdump-2-nsm.pcap
 1412  jobs
 1413  time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-1-nsm.pcap &
       sleep 1
       k1 apply -k ubuntu-hosts
       sleep 0.5
       k1 -n vl3-test wait --for=condition=ready --timeout=1m pod -l app=ubuntu
       k1 -n vl3-test get pod
       k1 -n vl3-test exec deployments/ubuntu-deployment -c istio-proxy -- curl 172.16.0.2:8080/ready -vs
       sleep 1
       kill -2 $!
       # time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-2-nsm.pcap
       sleep 1
       k1 delete -k ubuntu-hosts
 1414  k1 cluster-info dump --output yaml --output-directory dump-2 --all-namespaces
 1415  cp ~/.bash_history .
 1416  cp ~/.zsh_history .
 1417  history 0

So, basically:
1401 — delete previous cluster, create a new one, initialize it with interdomain setup (2 clusters with metallb+dns+spire+nsm)
-- I waited some time after running this command, maybe about 0.5h
1402-1406 — deploy vl3 nse, deploy istiod client
1407 — test deployment that doesn't use nsm interface
1408 — more istio initialization
1409 — start tcpdump, deploy vl3 client with istio, wait for readiness, cleanup (kill tcpdump, delete deployment)
-- I canceled the command with ctrl+C (cleanup part didn't run)
1410 — delete deployment. I didn't stop tcpdump. Not sure if this impacted the issue.
1411 — repeat the previous command
-- after command 1411 I saw some tcpdump errors from the background job, but I think those were different errors
-- the tcpdump file was corrupted due to simultaneous writes from 2 different instances
1413 — run the same command again (1412 was just jobs)

After 1413 I immediately saw the tcpdump error that the interface had disappeared, and ran cluster-info dump before doing anything else.

I'm not sure whether I or anyone else would be able to replicate this by running the commands described above. I haven't tried replicating it myself yet.
I suspect this issue has a stochastic nature, so it will be hard to reproduce reliably.
Still, maybe this list of commands will be helpful.

In the NSM logs I see a repeated close, request, close, request, close, request, ... pattern. Both istiod and the test client (istio-proxy) have this pattern in the cmd-nsc sidecar.
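The alternating pattern can be counted straight from the sidecar logs. A minimal sketch (the deployment and container names are the ones used in this setup; the exact log wording may differ):

```shell
# Count lines mentioning request/close in the cmd-nsc sidecar logs.
# A large, steadily growing count suggests the connection is being
# re-established in a loop rather than set up once.
kubectl logs -n istio-system deployments/istiod -c cmd-nsc \
  | grep -ciE 'request|close'
```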

Logs

The logs were taken right after I saw the issue for the first time with this cluster, so they should be much cleaner than the previous ones.
dump-2.zip

All related information can be found here:

https://github.com/d-uzlov/deployments-k8s/tree/deployment-issue-do-not-delete-2/examples/interdomain/nsm_istio_vl3/clean

@d-uzlov
Contributor Author

d-uzlov commented Jan 10, 2023

I checked what happens if I disable the NSC liveness check (I modified the webhook to disable it on all clients).
It turns out that after we re-deploy a client in the second cluster, the new client gets IP 172.16.0.2.
This IP was previously assigned to istiod.

I checked pings between vl3 clients:
The client in the second cluster (.0.2) can't ping anyone (except itself).
The NSE (.0.1) can ping the sample server (.0.3).
The sample server (.0.3) can ping the NSE (.0.1) and .0.2 (which is probably istiod).
Istiod (.0.2) can ping the sample server (.0.3), but pings to the NSE (.0.1) fail.
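A ping matrix like the one above can be collected with a small helper. This is a hedged sketch: check_peers is a hypothetical helper, and the namespace/pod names are illustrative guesses based on this setup, not the exact commands that were run:

```shell
# Hypothetical helper: probe reachability from one pod to each vl3
# address and report the result. Namespace and pod names below are
# assumptions; substitute the actual ones from the deployment.
check_peers() {
  ns="$1"; pod="$2"; shift 2
  for ip in "$@"; do
    if kubectl exec -n "$ns" "$pod" -- ping -c 1 -W 1 "$ip" >/dev/null 2>&1; then
      echo "$pod -> $ip: ok"
    else
      echo "$pod -> $ip: FAIL"
    fi
  done
}

# Example invocation from the NSE pod against the three vl3 addresses.
check_peers ns-dns-vl3 deployments/nse-vl3-vpp 172.16.0.1 172.16.0.2 172.16.0.3
```

Running this from each client in turn would reproduce the full matrix and make an asymmetric failure (A can ping B, but B cannot ping A) easy to spot.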

Here are logs from both clusters in this state:
dump-3.zip

@d-uzlov
Contributor Author

d-uzlov commented Jan 10, 2023

I noticed that the vl3 NSE had restarted in the setup above.
I captured logs from its previous run.

00-1-nse-prev.log

It looks like it crashed due to a SIGSEGV in memif.
And apparently, after it restarted, something went wrong.

@d-uzlov
Copy link
Contributor Author

d-uzlov commented Jan 10, 2023

And apparently after it has restarted something went wrong.

@glazychev-art hinted to me that in the current version of NSM this case isn't handled properly yet; here's the discussion:

@wazsone
Contributor

wazsone commented Jan 17, 2023

Stack trace from restarted nse container from @d-uzlov

Jan 10 06:44:46.612 [INFO] [cmd:vpp] vpp[14]: memif_plugin: memif2868466484/0: memif_msg_receive_disconnect: disconnect received
Jan 10 06:44:46.612 [INFO] [cmd:vpp] vpp[14]: received signal SIGSEGV, PC 0x7fc2e6959299, faulting address 0x4
Jan 10 06:44:46.612 [INFO] [cmd:vpp] vpp[14]: #0  0x00007fc2e672edf0 0x7fc2e672edf0
Jan 10 06:44:46.612 [INFO] [cmd:vpp] vpp[14]: #1  0x00007fc2e66563c0 0x7fc2e66563c0
Jan 10 06:44:46.612 [INFO] [cmd:vpp] vpp[14]: #2  0x00007fc2e6959299 vnet_hw_if_generate_rxq_int_poll_vector + 0x69
Jan 10 06:44:46.612 [INFO] [cmd:vpp] vpp[14]: #3  0x00007fc2a366b2ea memif_input_node_fn_hsw + 0x3869a
Jan 10 06:44:46.612 [INFO] [cmd:vpp] vpp[14]: #4  0x00007fc2e66cc355 0x7fc2e66cc355
Jan 10 06:44:46.612 [INFO] [cmd:vpp] vpp[14]: #5  0x00007fc2e66cb285 0x7fc2e66cb285
Jan 10 06:44:46.612 [INFO] [cmd:vpp] vpp[14]: #6  0x00007fc2e672e796 0x7fc2e672e796
Jan 10 06:44:46.612 [INFO] [cmd:vpp] vpp[14]: #7  0x00007fc2e65e8494 0x7fc2e65e8494

Stack traces from restarted nse and forwarder-vpp containers from my attempt to reproduce the issue

nse
Jan 17 04:03:32.570 [INFO] [cmd:vpp] vpp[17]: memif_plugin: memif2557108433/0: memif_msg_receive_disconnect: disconnect received
Jan 17 04:03:32.570 [INFO] [cmd:vpp] vpp[17]: received signal SIGSEGV, PC 0x7f81b70dc299, faulting address 0x4
Jan 17 04:03:32.570 [INFO] [cmd:vpp] vpp[17]: #0  0x00007f81b6eb1df0 0x7f81b6eb1df0
Jan 17 04:03:32.570 [INFO] [cmd:vpp] vpp[17]: #1  0x00007f81b6dd93c0 0x7f81b6dd93c0
Jan 17 04:03:32.570 [INFO] [cmd:vpp] vpp[17]: #2  0x00007f81b70dc299 vnet_hw_if_generate_rxq_int_poll_vector + 0x69
Jan 17 04:03:32.570 [INFO] [cmd:vpp] vpp[17]: #3  0x00007f8173eaa28c 0x7f8173eaa28c
Jan 17 04:03:32.570 [INFO] [cmd:vpp] vpp[17]: #4  0x00007f81b6e4f355 0x7f81b6e4f355
Jan 17 04:03:32.570 [INFO] [cmd:vpp] vpp[17]: #5  0x00007f81b6e4e285 0x7f81b6e4e285
Jan 17 04:03:32.570 [INFO] [cmd:vpp] vpp[17]: #6  0x00007f81b6eb1796 0x7f81b6eb1796
Jan 17 04:03:32.570 [INFO] [cmd:vpp] vpp[17]: #7  0x00007f81b6d6b494 0x7f81b6d6b494
forwarder-vpp
Jan 17 03:57:17.459 [INFO] [cmd:vpp] vpp[3376]: capo_sw_interface_add_del:103: unconfiguring policies for if 13 deleted
Jan 17 03:57:17.459 [INFO] [cmd:vpp] vpp[3376]: capo_sw_interface_add_del:110: error deleting caiop (output): -2
Jan 17 03:57:17.459 [INFO] [cmd:vpp] vpp[3376]: capo_sw_interface_add_del:116: error deleting caiop (input): -2
Jan 17 03:57:17.459 [INFO] [cmd:vpp] vpp[3376]: received signal SIGSEGV, PC 0x7fcff461c2a9, faulting address 0x4
time="2023-01-17T04:20:31Z" level=info msg="No subscription found for the notification message." msg_id=710 msg_size=19
Jan 17 04:20:31.613 [DEBU] [id:aef89aec-b5e4-4e54-abfe-7708080f368d] [duration:6.025641ms] [swIfIndex:13] [type:networkService] [vppapi:MemifDelete] (58.1) completed
Jan 17 03:57:17.459 [INFO] [cmd:vpp] vpp[3376]: #0  0x00007fcff43f1df0 0x7fcff43f1df0
Jan 17 03:57:17.459 [INFO] [cmd:vpp] vpp[3376]: #1  0x00007fcff4319420 0x7fcff4319420
Jan 17 03:57:17.459 [INFO] [cmd:vpp] vpp[3376]: #2  0x00007fcff461c2a9 vnet_hw_if_generate_rxq_int_poll_vector + 0x69
Jan 17 03:57:17.459 [INFO] [cmd:vpp] vpp[3376]: #3  0x00007fcff4bf6fd7 virtio_input_node_fn + 0x22a97
Jan 17 03:57:17.459 [INFO] [cmd:vpp] vpp[3376]: #4  0x00007fcff438f355 0x7fcff438f355
Jan 17 03:57:17.459 [INFO] [cmd:vpp] vpp[3376]: #5  0x00007fcff438e285 0x7fcff438e285
Jan 17 03:57:17.459 [INFO] [cmd:vpp] vpp[3376]: #6  0x00007fcff43f1796 0x7fcff43f1796
Jan 17 03:57:17.459 [INFO] [cmd:vpp] vpp[3376]: #7  0x00007fcff42ab494 0x7fcff42ab494

@denis-tingaikin
Member

@edwarnicke Looks like the issue is on the vpp/govpp side. Could you have a look at the stack trace? ☝️

@denis-tingaikin denis-tingaikin added the help wanted Extra attention is needed label Jan 23, 2023
@wazsone
Contributor

wazsone commented Jan 23, 2023

@wazsone
Contributor

wazsone commented Feb 6, 2023

The issue is no longer reproducible with the updated VPP. It seems that updating VPP has fixed the issue.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Release v1.8.0 Feb 6, 2023
@denis-tingaikin
Member

Was fixed in networkservicemesh/cmd-forwarder-vpp#802

Projects
Status: Done
Development

No branches or pull requests

3 participants