Sometimes vl3 network interface is recreated #8351
I got this issue again. What I did: I actually just refreshed my cluster, so there is not much.

```
 1401* ../delete-cluster.sh; ../create-cluster.sh; ../init-multicluster-2.sh
 1402  kubectl --kubeconfig=$KUBECONFIG1 apply -k ./vl3-dns
       sleep 0.5
       kubectl --kubeconfig=$KUBECONFIG1 -n ns-dns-vl3 wait --for=condition=ready --timeout=1m pod -l app=nse-vl3-vpp
       kubectl --kubeconfig=$KUBECONFIG1 -n ns-dns-vl3 wait --for=condition=ready --timeout=1m pod -l app=vl3-ipam
 1403  kubectl --kubeconfig=$KUBECONFIG1 apply -f istio-namespace.yaml
       istioctl install --set profile=minimal -y --kubeconfig=$KUBECONFIG1 --set meshConfig.accessLogFile=/dev/stdout
       k1 exec -n istio-system deployments/istiod -c cmd-nsc -- apk add tcpdump
 1404  WORK_DIR="$(git rev-parse --show-toplevel)/examples/interdomain/nsm_istio_vl3/clean/istio-vm-configs"
       VM_APP="vm-app"
       VM_NAMESPACE="vm-ns"
       SERVICE_ACCOUNT="serviceaccountvm"
       CLUSTER_NETWORK=""
       VM_NETWORK=""
       CLUSTER="Kubernetes"
 1405  kubectl --kubeconfig=$KUBECONFIG1 create namespace "${VM_NAMESPACE}"
       kubectl --kubeconfig=$KUBECONFIG1 create serviceaccount "${SERVICE_ACCOUNT}" -n "${VM_NAMESPACE}"
 1406  k1 exec -n istio-system deployments/istiod -c cmd-nsc -- ip a
 1407  time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >1-istio-standard.pcap &
       sleep 1
       k1 apply -k ubuntu-standard
       sleep 0.5
       k1 -n vl3-test wait --for=condition=ready --timeout=1m pod -l app=ubuntu
       sleep 1
       kill -2 $!
       sleep 1
       k1 delete -k ubuntu-standard
       tshark -r 1-istio-standard.pcap
 1408  istioctl x workload entry configure -f workloadgroup.yaml -o "${WORK_DIR}" --clusterID "${CLUSTER}" --kubeconfig=$KUBECONFIG1 --ingressIP=172.16.0.2
       # sed -i '' 's/15012/15010/' "${WORK_DIR}/mesh.yaml"
       rm -rf ubuntu-standard/istio-vm-configs
       rm -rf ubuntu-hosts/istio-vm-configs
       rm -rf ubuntu-hosts-2/istio-vm-configs
       rm -rf ubuntu-hosts-2-vmlike/istio-vm-configs
       cp -r "${WORK_DIR}" ubuntu-standard/istio-vm-configs
       cp -r "${WORK_DIR}" ubuntu-hosts/istio-vm-configs
       cp -r "${WORK_DIR}" ubuntu-hosts-2/istio-vm-configs
       cp -r "${WORK_DIR}" ubuntu-hosts-2-vmlike/istio-vm-configs
 1409  time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-1-nsm.pcap &
       sleep 1
       k1 apply -k ubuntu-hosts
       sleep 0.5
       k1 -n vl3-test wait --for=condition=ready --timeout=1m pod -l app=ubuntu
       k1 -n vl3-test get pod
       # k1 -n vl3-test exec deployments/ubuntu-deployment -c istio-proxy -- curl 172.16.0.2:8080/ready -vs
       sleep 1
       kill -2 $!
       time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-2-nsm.pcap
       sleep 1
       k1 delete -k ubuntu-hosts
       tshark -r 4-istio-tcpdump-1-nsm.pcap
 1410  k1 delete -k ubuntu-hosts
 1411  time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-1-nsm.pcap &
       sleep 1
       k1 apply -k ubuntu-hosts
       sleep 0.5
       k1 -n vl3-test wait --for=condition=ready --timeout=1m pod -l app=ubuntu
       k1 -n vl3-test get pod
       k1 -n vl3-test exec deployments/ubuntu-deployment -c istio-proxy -- curl 172.16.0.2:8080/ready -vs
       sleep 1
       kill -2 $!
       # time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-2-nsm.pcap
       sleep 1
       k1 delete -k ubuntu-hosts
       tshark -r 4-istio-tcpdump-1-nsm.pcap
       # tshark -r 4-istio-tcpdump-2-nsm.pcap
 1412  jobs
 1413  time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-1-nsm.pcap &
       sleep 1
       k1 apply -k ubuntu-hosts
       sleep 0.5
       k1 -n vl3-test wait --for=condition=ready --timeout=1m pod -l app=ubuntu
       k1 -n vl3-test get pod
       k1 -n vl3-test exec deployments/ubuntu-deployment -c istio-proxy -- curl 172.16.0.2:8080/ready -vs
       sleep 1
       kill -2 $!
       # time k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - >4-istio-tcpdump-2-nsm.pcap
       sleep 1
       k1 delete -k ubuntu-hosts
 1414  k1 cluster-info dump --output yaml --output-directory dump-2 --all-namespaces
 1415  cp ~/.bash_history .
 1416  cp ~/.zsh_history .
 1417  history 0
```

So, basically: after 1412 I immediately saw the error from tcpdump that the interface had disappeared, and ran

I'm not sure if I or someone else would be able to replicate this by running the commands as I described above. I myself haven't tried replicating this yet.

In NSM logs I see repeated

Logs

The logs are taken right after I saw the issue for the first time with this cluster, so they must be much cleaner than the previous ones. All related information can be found here:
I checked what happens if I disable the NSC liveness check (I modified the webhook to disable it on all clients). I also checked pings between vl3 clients. Here are logs from both clusters in this state:
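For reference, the ping check was along these lines (a sketch using the names from the history above; it assumes the ubuntu test image has `ping` installed and that 172.16.0.2 is still the istiod vl3 address from this setup):

```sh
# ping the istiod vl3 address from the test client pod
# (deployment and namespace names taken from the reproduction history above)
k1 -n vl3-test exec deployments/ubuntu-deployment -- ping -c 3 172.16.0.2

# list the nsm interfaces and addresses on the istiod side for the reverse check
k1 exec -n istio-system deployments/istiod -c cmd-nsc -- ip a
```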
I noticed that the vl3 NSE restarted in the setup above. It looks like it crashed due to a SIGSEGV in memif.
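To confirm that kind of crash, checking the restart count and the previous container log is usually enough (a sketch; the namespace and label come from the vl3-dns example used above):

```sh
# restart count of the vl3 NSE pod
kubectl --kubeconfig=$KUBECONFIG1 -n ns-dns-vl3 get pods -l app=nse-vl3-vpp

# log (including the SIGSEGV stack trace) from the previous, crashed container instance
kubectl --kubeconfig=$KUBECONFIG1 -n ns-dns-vl3 logs -l app=nse-vl3-vpp --previous
```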
@glazychev-art hinted to me that in the current version of NSM this case isn't handled properly yet; here's the discussion:
Stack trace from restarted nse container from @d-uzlov
Stack traces from restarted nse and forwarder-vpp containers from my attempt to reproduce the issue
@edwarnicke Looks like the issue is on the vpp/govpp side. Could you have a look at the stack trace? ☝️
Right now we can see that this line leads to the crash: https://github.com/networkservicemesh/sdk-vpp/blob/main/pkg/networkservice/mechanisms/memif/common.go#L173
The issue is not reproduced with the updated VPP. It seems like updating VPP has fixed the issue.
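To verify which forwarder build (and therefore which VPP) a cluster is actually running, checking the forwarder image tag is the quickest way (a sketch; the `nsm-system` namespace and the `forwarder-vpp` daemonset name are assumptions about the deployment layout):

```sh
# show the image (and thus the VPP build) the forwarder-vpp daemonset is running
kubectl --kubeconfig=$KUBECONFIG1 -n nsm-system get daemonset forwarder-vpp \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
```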
This was fixed in networkservicemesh/cmd-forwarder-vpp#802.
Expected Behavior
The NSM network interface is created once and keeps working until it is explicitly closed, usually when the pod is deleted.
Current Behavior
I somehow got my cluster into a state where one of the nsm interfaces was regularly deleted and recreated, making it impossible to use.
Failure Information (for bugs)
I noticed the issue when running tcpdump:
I tried to capture tcpdump on istiod pod.
The interface was stable at first.
Then I deployed a second pod, with istio-proxy, that tried to connect to istiod via the nsm interface.
Right after this second pod was created, the nsm interface got recreated (tcpdump said "The interface disappeared", but `ip a` still showed the `nsm-1` interface in the list). When I tried to run `tcpdump` again, I got the same error after a few seconds. The interface became stable again after I deleted the pod with istio-proxy.
I tried this a few more times, and each time tcpdump complained about the interface disappearing while the istio-proxy pod existed.
A few times I even got an error that the interface doesn't exist. Apparently, there is some time between the interface deletion and recreation events.
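For context, the observation boiled down to commands like these (`k1` is the same kubectl alias used in the shell history earlier in this issue):

```sh
# capture traffic on the nsm interface inside the istiod cmd-nsc container;
# this is where tcpdump eventually reports "The interface disappeared"
k1 exec -n istio-system deployments/istiod -c cmd-nsc -- tcpdump -i nsm-1 -U -w - > istio.pcap

# even after tcpdump fails, the interface is still listed here
k1 exec -n istio-system deployments/istiod -c cmd-nsc -- ip a
```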
Steps to Reproduce
Unfortunately, I was unable to reproduce this issue.
I was experimenting with istio+nsm integration, and one time I randomly got this behavior.
But I took the `cluster-info dump` logs. Here are rough instructions for the setup I was testing:
https://github.com/d-uzlov/deployments-k8s/blob/deployment-issue-do-not-delete/examples/interdomain/nsm_istio_vl3/clean/
There are also uncompressed `cluster-info dump` logs in this branch. I'm not sure if these instructions will be helpful, because usually they don't produce the error described in this issue.
Context
v1.7.0-rc.2
Failure Logs
Log file from `kubectl cluster-info dump`: dump-1.zip
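For anyone digging through the archive: `cluster-info dump --output-directory` writes one directory per namespace with per-pod `logs.txt` files, so the NSM-related logs can be located with something like this (a sketch; the exact layout may vary slightly between kubectl versions):

```sh
# list pods whose dumped logs mention memif (run from the unpacked dump directory)
grep -rl --include=logs.txt memif dump-1
```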