Issue running nse-composition example #2381
@Mixaster995 Could you have a look?
@damiankopyto thanks for your report.
Retested this case on kind 1.21.
Hi @Mixaster995
Thanks for looking into this.
After some tests, it seems like it's related to a problem with spire restarts: #2393
@Mixaster995 Sure, will do. Do you know the rough timeline for a fix?
Hi @Mixaster995, I tested NSM using the nse-composition example and the latest code changes, although I'm not sure whether all the fixes you mentioned to @damiankopyto are already merged. However, I'm still seeing this OOMKilled problem occur for most of the pods, the same as described by @damiankopyto. My hardware specification, OS and environment configuration are the same as @damiankopyto's, except that the Kubernetes version is 1.21.
With increased memory limits on the pods (up to 1Gi), the problems within the NSM environment changed slightly.
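(For reference: the memory-limit bump described above can also be sketched with plain kubectl instead of editing the example patch files. This is only an illustrative sketch; it assumes the workload is a Deployment, and the namespace and deployment names are placeholders.)
# Sketch: raise the memory limit of one of the example workloads to 1Gi
kubectl -n ${NAMESPACE} set resources deployment/nse-passthrough-1 --limits=memory=1Gi
# Confirm that the new limit took effect
kubectl -n ${NAMESPACE} get deployment nse-passthrough-1 -o jsonpath='{.spec.template.spec.containers[0].resources}'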
Once the sockets are restored, I can ping between NSE and NSC for about 10 minutes again.
Then one of the apps restarts again. When I check the liveness of nsmgr manually with the command from its yaml file, it looks fine, and the pod does not restart any more.
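(For reference: one way to re-run the configured liveness probe by hand, as mentioned above. A minimal sketch only; the label selector, container index and pod name are assumptions/placeholders.)
# Read the liveness probe configured for nsmgr (label selector is an assumption)
kubectl -n nsm-system get pod -l app=nsmgr -o jsonpath='{.items[0].spec.containers[0].livenessProbe}'
# Then run that probe command manually inside the pod, e.g.:
# kubectl -n nsm-system exec <nsmgr-pod> -- <probe command taken from the output above>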
To me it looks like the problem now is with the passthrough apps. At the beginning they restart with termination reason "Completed", but after leaving them overnight, OOMKilled started appearing again despite the memory limits set.
I never noticed any memory spikes myself with the "kubectl top" command. My questions are:
I'm attaching the nse-passthrough app logs. Please let me know if you need more logs or any other info. Your help would be much appreciated. passthrough1_0.log Here I'm also adding the latest logs from the nse-passthrough-2 app that was terminated due to OOMKilled:
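(For reference: memory usage over time can be sampled like this to catch spikes that a single "kubectl top" call would miss. A sketch only; it assumes metrics-server is installed, and the interval and file name are arbitrary.)
# One-off view, refreshed every 30 seconds
watch -n 30 kubectl top pods -n nsm-system
# Or log samples over time for later inspection
while true; do date; kubectl top pods -n nsm-system; sleep 60; done >> nsm-top.log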
Hi @edwarnicke, @Mixaster995, have you made any progress on this issue? Are you going to provide a fix to eliminate the restart errors from the nsm-system pods and the example pods? If yes, when can I expect it, roughly?
Hi @jkossak. Thank you for the very detailed description, it is really helpful. I've just returned from PTO and have continued investigating this issue; I will provide answers as soon as possible.
Hi @Mixaster995, thank you for getting back to me. Great to hear you are working on it. Just to add some more info about this issue, here is the output of dmesg: Today I tried the example again after pulling the latest changes. All worked for about 10 minutes (despite a few restarts of nsc-kernel and nsmgr at the beginning). Then nse-passthrough-1 and nse-passthrough-2 got into the OOMKilled error and can't get back to life, staying most of the time in CrashLoopBackOff. They start only for a few seconds before being killed again. One question I have: I don't see the pods with vpp using hugepages. Aren't hugepages required to run VPP operations?
Hi @jkossak, hugepages are not required for VPP to work - it works fine without them. AFAIK, they might be needed for some specific features, but we are not using those.
@jkossak VPP can use hugepages. If VPP can't get hugepages, it will log a very nasty message and keep on working to the best of its ability. |
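(For reference: a quick way to check whether hugepages are configured at all, relevant to the question above. A sketch only; on a single-node setup like this one it can be run directly on the node.)
# Check hugepage configuration on the node
grep -i HugePages /proc/meminfo
# HugePages_Total: 0 simply means VPP falls back to regular pages, as described above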
Thank you @edwarnicke and @Mixaster995 for answering my question. So the message I am seeing is actually expected and normal:
Hi @edwarnicke, @Mixaster995, I increased the value of the NSM_REQUEST_TIMEOUT environment variable to 160s, but it was already set to 75s by default in examples/features/nse-composition/patch-nsc.yaml. I still observe the instability of the cross-connection, which breaks after 2-3 minutes and resumes from time to time for another couple of minutes. The pods after a few minutes:
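(For reference: the NSM_REQUEST_TIMEOUT change mentioned above can also be sketched with kubectl instead of editing patch-nsc.yaml. Illustrative only; it assumes the client is deployed as a Deployment, and the names are placeholders.)
# Sketch: override the request timeout on the client
kubectl -n ${NAMESPACE} set env deployment/nsc-kernel NSM_REQUEST_TIMEOUT=160s
# Verify the environment actually set on the pod spec (pod name is a placeholder)
kubectl -n ${NAMESPACE} get pod ${NSC} -o jsonpath='{.spec.containers[0].env}'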
Hi, @jkossak. We found a possible source of the problem - with some logic turned off, the wrong behaviour doesn't occur anymore. Now we are trying to find a solution for these parts.
Thank you @Mixaster995 for letting me know.
Hi, @jkossak. It would be great if you could let me know whether everything is OK or if there is still instability or errors. Thank you.
Thanks @Mixaster995, I will try the example with the new images today and let you know how it goes.
Hi @Mixaster995, I still see the pods crashing, except the nse-passthrough pods.
The cross connection breaks even faster:
$ kubectl exec ${NSC} -n ${NAMESPACE} -- ping 172.16.1.100
PING 172.16.1.100 (172.16.1.100): 56 data bytes
64 bytes from 172.16.1.100: seq=0 ttl=64 time=93.116 ms
64 bytes from 172.16.1.100: seq=1 ttl=64 time=93.013 ms
64 bytes from 172.16.1.100: seq=2 ttl=64 time=88.954 ms
64 bytes from 172.16.1.100: seq=3 ttl=64 time=120.871 ms
64 bytes from 172.16.1.100: seq=4 ttl=64 time=112.742 ms
64 bytes from 172.16.1.100: seq=5 ttl=64 time=120.646 ms
64 bytes from 172.16.1.100: seq=6 ttl=64 time=120.537 ms
64 bytes from 172.16.1.100: seq=7 ttl=64 time=88.467 ms
64 bytes from 172.16.1.100: seq=8 ttl=64 time=104.360 ms
64 bytes from 172.16.1.100: seq=9 ttl=64 time=100.245 ms
64 bytes from 172.16.1.100: seq=10 ttl=64 time=112.144 ms
64 bytes from 172.16.1.100: seq=11 ttl=64 time=120.108 ms
64 bytes from 172.16.1.100: seq=12 ttl=64 time=115.971 ms
64 bytes from 172.16.1.100: seq=13 ttl=64 time=95.897 ms
64 bytes from 172.16.1.100: seq=14 ttl=64 time=103.794 ms
64 bytes from 172.16.1.100: seq=15 ttl=64 time=235.731 ms
64 bytes from 172.16.1.100: seq=16 ttl=64 time=91.633 ms
64 bytes from 172.16.1.100: seq=17 ttl=64 time=107.529 ms
64 bytes from 172.16.1.100: seq=18 ttl=64 time=131.353 ms
64 bytes from 172.16.1.100: seq=19 ttl=64 time=147.232 ms
64 bytes from 172.16.1.100: seq=20 ttl=64 time=111.130 ms
64 bytes from 172.16.1.100: seq=21 ttl=64 time=143.064 ms
64 bytes from 172.16.1.100: seq=22 ttl=64 time=114.847 ms
command terminated with exit code 137
I see nsmgr gets terminated by the OOM killer, and the nsc-kernel logs show this error:
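(For reference: a way to confirm the OOM kill reported above, both from the Kubernetes side and the kernel side. A sketch only; the pod name is a placeholder and the container index is an assumption.)
# Last termination reason recorded by Kubernetes (should show OOMKilled)
kubectl -n nsm-system get pod <nsmgr-pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Cross-check against the kernel log on the node
dmesg | grep -i -E 'oom|killed process'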
@Mixaster995 Could you provide images that simply disable directmemif for the moment for @jkossak to try?
@jkossak That's interesting, I did not expect nsmgr restarts. Could you share the nsmgr logs?
@Mixaster995, @denis-tingaikin, I increased the nsmgr memory limit to 100Mi (from 60Mi) and everything worked for 20 minutes, but then it started crashing again:
I noticed that nsmgr's used memory keeps increasing, starting from ~40 up to:
Cannot find a reason for the nsmgr restart.
@denis-tingaikin, I took the logs again with -p. I wonder why the pod's memory always increases up to its limit - then the kernel kills the process, as shown by dmesg. I guess that is the reason for its restarts.
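(For reference: the "-p" mentioned above refers to previous-container logs; a sketch of collecting them together with the recorded termination state. The pod name is a placeholder.)
# Logs from the previous (killed) nsmgr container instance
kubectl -n nsm-system logs <nsmgr-pod> -p > nsmgr-previous.log
# Termination state recorded by Kubernetes for that restart
kubectl -n nsm-system describe pod <nsmgr-pod> | grep -A 5 'Last State'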
@jkossak We are almost sure what our memory limits should be, but we probably need to check it one more time. Can you please check out https://github.com/Bolodya1997/deployments-k8s/tree/test-composition again (it is a little bit updated), leave it running for a day or two (maybe for the weekend if that is OK with you), and then share
@Bolodya1997 today I noticed a few restarts of nsmgr happened in the same deployment (without a memory limit). Nevertheless, the cross connection seems to be working fine at the moment.
Here is the output from "kubectl describe pod nsmgr-njs28 -n nsm-system". The output does not mention OOMKilled as the termination reason. I will provide the logs soon.
@jkossak Can you please retry with https://github.com/Bolodya1997/deployments-k8s/tree/test-composition? I have already disabled the liveness check there.
@Bolodya1997 sure, I will retry with https://github.com/Bolodya1997/deployments-k8s/tree/test-composition, leave it running for the weekend, and let you know the results on Monday.
@Bolodya1997, here is the output from the "kubectl top" command after two days of running the deployment:
There are no more restarts of nsmgr:
But the cross connection seems to be down:
After these failed ping requests I noticed a forwarder-vpp restart:
@jkossak
@jkossak
Hi @jkossak,
Hi @Bolodya1997, apologies for the delay. I have been experiencing some issues with the deployment which I haven't seen before. I keep getting errors trying to bring up the example pods. The nsm-system pods look OK, but none of the ns-* namespace pods get created.
Though I have the replicasets:
However, each replicaset from ns ns-4xbnk shows the same error in its description:
Here is the whole description of the nse-firewall-vpp-787cdd5468 replicaset: I wonder if you can tell what the root cause is and how it can be fixed.
Hi @jkossak,
Missing [4] sometimes leads to the error case when there is an old ... Please correct me if you have some other scenario and performing steps [3-5] on your current NSM setup doesn't fix the issue.
@Bolodya1997, mutatingwebhookconfiguration is being deleted during the basic cleanup. After cleaning, I have no mutatingwebhookconfiguration left.
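(For reference: the check described above, sketched with kubectl; the configuration name is a placeholder.)
# List any mutating webhook configurations left after cleanup
kubectl get mutatingwebhookconfiguration
# Inspect one if anything is still present
kubectl get mutatingwebhookconfiguration <name> -o yaml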
@jkossak, can you please share the logs and describe output for the admission webhook pod?
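(For reference: a sketch of how those can be collected; pod names are placeholders and the grep pattern is only a guess at how the admission webhook pod is named.)
# Locate the admission webhook pod
kubectl -n nsm-system get pods | grep admission
# Collect its logs and description
kubectl -n nsm-system logs <admission-webhook-pod> > admission-webhook.log
kubectl -n nsm-system describe pod <admission-webhook-pod> > admission-webhook-describe.txt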
@Bolodya1997, sure, here are the logs:
@jkossak, have you tried to rerun the test again? I am not able to reproduce this issue locally; it mostly looks like some invalid behavior of the web server used in the admission webhook. If this is the first and only test run on the new branch, can you please retry the test? If not, please tell me and we will continue trying to figure out the cause.
@Bolodya1997, yes, I have tried to rerun the test many times now, from different branches, including https://github.com/Bolodya1997/deployments-k8s/tree/test-composition and test-composition-new - always with the same result.
The initial problem with pings after NSM has been running for a long time is solved. We also found a few issues that we'll consider separately.
@jkossak Many thanks for testing this. It was a super useful contribution for us. Feel free to open new issues if you face something unexpected :)
Hello, I am trying NSM with the nse-composition example, but the NSM environment seems to be unstable. The client (nsc-kernel) is stuck in CrashLoopBackOff (with the error "cannot support any of the requested mechanism") most of the time, and is therefore not able to ping the endpoint. Other components are also restarting (sometimes with OOMKilled status).
Environment:
Ubuntu 20.04 and RHEL 8.4 (tried with both)
K8s v1.21.0
NSM - main branch - 557750d1d6e7469bf1deb10c9ec46be68b725cd7
This is a single node deployment (not sure if it could make a difference)
Steps to reproduce:
Pod status:
The NSE and NSC pods do not receive the nsm interface:
As can be seen from the output above and the attached logs, the client pod keeps failing to connect to the manager. Do you have any thoughts on what could be causing this issue?
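(For reference: one way to check whether the nsm interface appears in the pods. A sketch only; ${NSE} is a placeholder for the endpoint pod name, and it assumes the iproute2 tools are present in the container images.)
# Check whether the nsm interface was injected into the client and endpoint pods
kubectl -n ${NAMESPACE} exec ${NSC} -- ip addr
kubectl -n ${NAMESPACE} exec ${NSE} -- ip addr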
forwarder-vpp.LOG
nsc-kernel.LOG
nsmgr.LOG