Issue running nse-composition example #2381

Closed
damiankopyto opened this issue Jul 30, 2021 · 55 comments
Labels: performance (The problem related to system effectivity)

@damiankopyto

Hello, I am trying NSM with the nse-composition example, but the NSM environment seems to be unstable. The client (nsc-kernel) is stuck in CrashLoopBackOff (with the error "cannot support any of the requested mechanism") most of the time, and therefore is not able to ping the endpoint. Other components are also restarting (sometimes with OOMKilled status).

Environment:
Ubuntu 20.04 and RHEL 8.4 (tried with both)
K8s v1.21.0
NSM: main branch (557750d1d6e7469bf1deb10c9ec46be68b725cd7)
This is a single-node deployment (not sure if it could make a difference)

Steps to reproduce:

  1. Set up spire as per: examples/spire
  2. Set up the basic example as per: examples/basic
  3. Set up the nse-composition as per: features/nse-composition
  4. Try to ping endpoint from client and check the status/logs of the pods.
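
Roughly, those steps amount to applying the corresponding kustomize directories from deployments-k8s. This is a sketch only (the generated ns-* namespace handling is omitted, and the linked READMEs remain authoritative):

# sketch of the reproduction, assuming the standard deployments-k8s layout
kubectl apply -k examples/spire
kubectl create ns nsm-system && kubectl apply -k examples/basic
kubectl apply -k examples/features/nse-composition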

Pod status:

# kubectl get pods -n ns-m4gf5
NAME                                 READY   STATUS             RESTARTS   AGE  
nsc-kernel-7ff64d979b-lj8ss          0/1     CrashLoopBackOff   10         37m  
nse-firewall-vpp-75c664447-f5bmt     1/1     Running            0          37m  
nse-kernel-67b499bc4d-257ts          2/2     Running            4          37m
nse-passthrough-1-79dcc5ff4-p44lc    1/1     Running            3          37m
nse-passthrough-2-5db868d789-vczv7   1/1     Running            6          37m
nse-passthrough-3-6794587c84-5jpdc   1/1     Running            5          37m
#  kubectl get pods -n nsm-system
NAME                                     READY   STATUS    RESTARTS   AGE
admission-webhook-k8s-698974c59b-xdlrz   1/1     Running   0          40m
forwarder-vpp-srlbd                      1/1     Running   8          40m
nsmgr-mf8th                              1/1     Running   12         40m
registry-k8s-758587b875-md2w6            1/1     Running   1          40m
# kubectl get networkservices -n nsm-system
NAME              AGE
nse-composition   38m

The NSE and NSC pods do not receive the nsm interface:

# kubectl exec ${NSE} -n ${NAMESPACE} -- ip addr
Defaulted container "nse" out of: nse, nginx
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
3: eth0@if37: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue state UP
    link/ether da:18:f7:11:0a:f9 brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.28/24 brd 10.244.0.255 scope global eth0
       valid_lft forever preferred_lft forever

kubectl exec ${NSC} -n ${NAMESPACE} -- ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
3: eth0@if38: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue state UP
    link/ether 5e:d4:29:cb:ef:3a brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.29/24 brd 10.244.0.255 scope global eth0
       valid_lft forever preferred_lft forever

As can be seen from the output above and the attached logs, the client pod keeps failing to connect to the manager. Do you have any thoughts on what could be causing this issue?
forwarder-vpp.LOG
nsc-kernel.LOG
nsmgr.LOG

@edwarnicke
Member

@Mixaster995 Could you have a look?

@Mixaster995
Contributor

@damiankopyto thanks for your report.
Could you please provide additional information:

  • are you testing this locally on kind or somewhere else?
  • what is your system configuration (memory, CPU, etc.)?
  • could you please provide logs for all components of this example (nsc, nse, passthrough-1/2/3, firewall, forwarder, nsmgr) and also the previous logs (the result of
    kubectl logs --previous)?
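
A small loop like the following can collect both the current and the previous logs for every pod in a namespace (a sketch; the namespace is a placeholder):

NS=ns-m4gf5   # placeholder, repeat for nsm-system as well
for pod in $(kubectl get pods -n "$NS" -o name); do
  name=$(basename "$pod")
  kubectl logs -n "$NS" "$name" --all-containers > "${name}.log" 2>&1
  kubectl logs -n "$NS" "$name" --all-containers --previous > "${name}.previous.log" 2>&1
done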

Mixaster995 self-assigned this Aug 10, 2021
@Mixaster995
Contributor

Retested this case on kind 1.21.
Tested with different resources: 4 CPUs and 2, 4, 6, and 8 GB of memory.
The reported CrashLoopBackOff errors never occurred. With 2 GB of memory or less the cluster does not work (which is expected behaviour).
@damiankopyto, please add some additional info so we can recreate this issue.

@damiankopyto
Author

damiankopyto commented Aug 10, 2021

Hi @Mixaster995
Sorry for the late response, I needed to recreate my environment.

  • I am testing locally, on a bare-metal server.
  • Ubuntu 20.04, 500GB+ RAM, 128 CPUs (HT on)
  • Attaching the current and previous logs for all components (some of them did not have previous logs) in the zip file.
  • I accidentally deployed on K8s v1.20.0 this time but I see the same issue; let me know if I should redo it with v1.21.0.
  • Also, I have noticed that 2M hugepages are enabled by default by the OS. Could this cause an issue?
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal
--------------------------------------------------------------
$ cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.4.0-65-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro
--------------------------------------------------------------
$ cat /proc/meminfo
MemTotal:       **527675592 kB**
MemFree:        493493840 kB
MemAvailable:   516295508 kB
Buffers:         1225912 kB
Cached:         23325044 kB
SwapCached:            0 kB
Active:          9788248 kB
Inactive:       18300960 kB
Active(anon):    2786740 kB
Inactive(anon):  1365248 kB
Active(file):    7001508 kB
Inactive(file): 16935712 kB
Unevictable:       36840 kB
Mlocked:           36840 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:             17992 kB
Writeback:             0 kB
AnonPages:       3573956 kB
Mapped:          2968988 kB
Shmem:           1387808 kB
KReclaimable:    2284920 kB
Slab:            4741744 kB
SReclaimable:    2284920 kB
SUnreclaim:      2456824 kB
KernelStack:       91008 kB
PageTables:        59208 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    263706724 kB
Committed_AS:   40672192 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      374996 kB
VmallocChunk:          0 kB
Percpu:           350720 kB
HardwareCorrupted:     0 kB
AnonHugePages:     90112 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:     **128**
HugePages_Free:      128
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:          262144 kB
DirectMap4k:     1818560 kB
DirectMap2M:    28948480 kB
DirectMap1G:    507510784 kB
-------------------------------------------------------------------
# lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 57 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              2
Core(s) per socket:              32
-------------------------------------------------------------------

nsm-logs.zip

Thanks for looking into this

@Mixaster995
Contributor

After some tests, it seems to be related to a problem with spire restarts: #2393
@damiankopyto, could you please retest this once that problem is fixed?

@damiankopyto
Author

@Mixaster995 Sure, will do. Do you know the rough timeline for a fix?
On the topic of the spire restarts: I do not see my spire pods restarting. Is the spire restart something internal to the spire pod that does not affect the lifetime of the pod?

$ kubectl get pod -n spire
NAME                    READY   STATUS    RESTARTS   AGE
pod/spire-agent-sqh64   1/1     Running   0          23h
pod/spire-server-0      1/1     Running   0          23h

@jkossak

jkossak commented Aug 26, 2021

Hi @Mixaster995, I tested NSM using the nse-composition example and the latest code changes, although I am not sure whether all the fixes you mentioned to @damiankopyto are already merged. I'm still seeing OOMKilled occur for most of the pods, the same as described by @damiankopyto.

My hardware specification, OS and environment configuration are the same as @damiankopyto's, except that the Kubernetes version is 1.21.

$ cat /proc/meminfo
MemTotal:       527675632 kB
MemFree:        481398216 kB
MemAvailable:   498714252 kB
Buffers:         1261368 kB
Cached:         33259716 kB
SwapCached:            0 kB
Active:          8836344 kB
Inactive:       29708548 kB
Active(anon):    3673744 kB
Inactive(anon): 16289472 kB
Active(file):    5162600 kB
Inactive(file): 13419076 kB
Unevictable:       19300 kB
Mlocked:           19300 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:             39576 kB
Writeback:             0 kB
AnonPages:       4000260 kB
Mapped:          2319164 kB
Shmem:          16590664 kB
KReclaimable:    2154692 kB
Slab:            5705996 kB
SReclaimable:    2154692 kB
SUnreclaim:      3551304 kB
KernelStack:       86688 kB
PageTables:        53768 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    263706744 kB
Committed_AS:   46804100 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      363836 kB
VmallocChunk:          0 kB
Percpu:          1056256 kB
HardwareCorrupted:     0 kB
AnonHugePages:     57344 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:     128
HugePages_Free:      128
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:          262144 kB
DirectMap4k:     4028356 kB
DirectMap2M:    41418752 kB
DirectMap1G:    492830720 kB
----------------------------------
$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 57 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              2
Core(s) per socket:              32
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           106
Model name:                      Intel(R) Xeon(R) Gold 6338N CPU @ 2.20GHz
Stepping:                        6
CPU MHz:                         800.202
BogoMIPS:                        4400.00
Virtualization:                  VT-x

With increased memory limits on the pods (up to 1Gi), the problems within the NSM environment changed slightly.
It helped to stabilize the pods and eliminate the OOMKilled error (for the pods: nsc-kernel, nse-firewall, nse-kernel, forwarder-vpp, registry-k8s). With that I got the example working for about 3 minutes: the NSC-to-NSE ping works until nsmgr and nse-passthrough-3 restart with termination reason "Completed". Then "kubectl describe" for the nsmgr pod shows "Liveness and readiness probe failed". I disabled the liveness and readiness checks and got the example working for about 10 minutes, until one of the passthrough applications starts restarting with termination reason "Completed". From my observation, after a restart the passthrough app usually does not restore its memif sockets for a long time, and ping between NSC and NSE does not work.
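
For anyone reproducing this, one quick way to bump the limits without editing the manifests is kubectl set resources. A sketch, assuming the deployment and daemonset names from this example and using the 1Gi value mentioned above:

NS=ns-52p6p   # the example namespace in this run
for d in nsc-kernel nse-kernel nse-firewall-vpp nse-passthrough-1 nse-passthrough-2 nse-passthrough-3; do
  kubectl -n "$NS" set resources deployment "$d" --limits=memory=1Gi
done
kubectl -n nsm-system set resources daemonset forwarder-vpp --limits=memory=1Gi
kubectl -n nsm-system set resources deployment registry-k8s --limits=memory=1Gi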

$ kubectl exec -t nse-passthrough-2-57556d99b6-ds5fs -n ns-52p6p -- ls /tmp
nse-passthrough-2-57556d99b6-ds5fs677134252

Once the sockets are restored I can ping between NSE and NSC for about 10 minutes again.

$ kubectl exec -t nse-passthrough-2-57556d99b6-ds5fs -n ns-52p6p -- ls /tmp
memif
memifproxy
nse-passthrough-2-57556d99b6-ds5fs677134252

Then again one of the apps restarts.

When I check the liveness of nsmgr manually with the command in its yaml file, it looks fine. The pod does not restart any more.

$ kubectl exec -t nsmgr-ctb8v -n nsm-system -- /bin/grpc-health-probe -spiffe -addr=:5001
status: SERVING

To me it looks like the problem now is with the passthrough apps. At the beginning they restart with termination reason "Completed", but I left them running overnight and OOMKilled started appearing again, despite memory limits set to 3Gi.

$ kubectl get pods -n ns-52p6p
NAME                                                                      READY     STATUS        RESTARTS       AGE
ns-52p6p      nsc-kernel-6967b5f8b6-jv6fq                                  1/1     Running            0          40h
ns-52p6p      nse-firewall-vpp-8645f84674-8c9vr                            1/1     Running            0          40h
ns-52p6p      nse-kernel-c85b94d69-qwtz9                                   2/2     Running            0          40h
ns-52p6p      nse-passthrough-1-59648c5855-jwqj7                           0/1     CrashLoopBackOff   26         16h
ns-52p6p      nse-passthrough-2-57556d99b6-ds5fs                           0/1     CrashLoopBackOff   135        16h
ns-52p6p      nse-passthrough-3-5cb5b498df-btkff                           1/1     Running            15         16h

$ kubectl get pods -n nsm-system
NAME                                                                      READY     STATUS        RESTARTS       AGE
nsm-system    admission-webhook-k8s-669c9cf54d-fqwws                       1/1     Running            0          44h
nsm-system    forwarder-vpp-l9fjd                                          1/1     Running            0          44h
nsm-system    nsmgr-ctb8v                                                  1/1     Running            0          44h
nsm-system    registry-k8s-5958d8bbb9-77m59                                1/1     Running            0          44h

I never noticed any memory spikes myself with the "kubectl top" command.

My questions are:

  • are the nse-passthrough apps expected to terminate after some time? Why are they restarting?
  • should they restore quickly?

I'm attaching the nse-passthrough app logs. Please let me know if you need more logs or any other info. Your help would be much appreciated.

passthrough1_0.log
passthrough1_1.log
passthrough2_2.log
passthrough2_3.log
passthrough3_4.log
passthrough3_5.log

Here I'm adding the latest logs from the nse-passthrough-2 app, which was terminated due to OOMKilled:

passthrough2_202.log

$ kubectl get pods -n ns-52p6p
NAME                                 READY   STATUS             RESTARTS   AGE
nsc-kernel-6967b5f8b6-jv6fq          1/1     Running            0          46h
nse-firewall-vpp-8645f84674-8c9vr    1/1     Running            0          46h
nse-kernel-c85b94d69-qwtz9           2/2     Running            0          46h
nse-passthrough-1-59648c5855-jwqj7   0/1     CrashLoopBackOff   92         22h
nse-passthrough-2-57556d99b6-ds5fs   0/1     OOMKilled          202        22h
nse-passthrough-3-5cb5b498df-btkff   1/1     Running            15         22h

@jkossak

jkossak commented Sep 1, 2021

Hi @edwarnicke @Mixaster995, have you made any progress on this issue? Are you going to provide a fix to eliminate the restart errors from the nsm-system pods and the example pods? If yes, when can I expect it, roughly?

@Mixaster995
Contributor

Hi @jkossak. Thank you for the very detailed description, it is really helpful. I've just returned from PTO, have continued investigating this issue, and will provide answers as soon as possible.

@jkossak

jkossak commented Sep 6, 2021

Hi @Mixaster995 , thank you for getting back to me. Great to hear you are working on it.

Just to add some more info about this issue, here is the output of dmesg:

dmesg.txt

Today I tried the example again after pulling the latest changes. Everything worked for about 10 minutes (despite a few restarts of nsc-kernel and nsmgr at the beginning). Then nse-passthrough-1 and nse-passthrough-2 hit the OOMKilled error and can't get back to life, staying most of the time in CrashLoopBackOff. They start only for a few seconds before being killed again.

One question I have: I don't see the VPP pods using hugepages. Aren't hugepages required for VPP operations?

denis-tingaikin added the performance label Sep 7, 2021
@Mixaster995
Contributor

Hi @jkossak, hugepages are not required for VPP to work; it works fine without them. AFAIK, they might be needed for some specific features, but we are not using those.

@edwarnicke
Member

@jkossak VPP can use hugepages. If VPP can't get hugepages, it will log a very nasty message and keep on working to the best of its ability.

@jkossak

jkossak commented Sep 9, 2021

Thank you @edwarnicke and @Mixaster995 for answering my question. So the message I am seeing is actually expected and normal:

vpp[21]: buffer: numa[0] falling back to non-hugepage backed buffer pool (vlib_physmem_shared_map_create: pmalloc_map_pages: Unable to lock pages: Cannot allocate memory)

@jkossak

jkossak commented Sep 13, 2021

Hi @edwarnicke, @Mixaster995, I increased the value of the NSM_REQUEST_TIMEOUT environment variable to 160s, but it was already set to 75s by default in examples/features/nse-composition/patch-nsc.yaml.
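
For anyone trying the same thing, the variable can also be overridden in place without editing patch-nsc.yaml. A sketch, assuming the nsc-kernel deployment name and the namespace from this run:

kubectl -n ns-49p5k set env deployment/nsc-kernel NSM_REQUEST_TIMEOUT=160s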

I still observe instability of the cross-connection: it breaks after 2-3 minutes and resumes from time to time for another couple of minutes.

The pods after a few minutes:

ns-49p5k      nsc-kernel-6d8db46cc9-h4j74                                 1/1     Running     1          19m
ns-49p5k      nse-firewall-vpp-66646d5c97-2tsqt                           1/1     Running     0          19m
ns-49p5k      nse-kernel-7db68697d5-w6r6r                                 2/2     Running     1          19m
ns-49p5k      nse-passthrough-1-679b46b87f-bqxpx                          1/1     Running     0          19m
ns-49p5k      nse-passthrough-2-6f59968558-tqtcr                          1/1     Running     1          19m
ns-49p5k      nse-passthrough-3-6c89477d58-zt5f4                          1/1     Running     0          19m
nsm-system    admission-webhook-k8s-55799cd7f9-swnbr                      1/1     Running     0          21m
nsm-system    forwarder-vpp-hq7wb                                         1/1     Running     0          21m
nsm-system    nsmgr-4rg8h                                                 1/1     Running     0          21m
nsm-system    registry-k8s-6f7f46c897-v7f5p                               1/1     Running     0          21m

@Mixaster995
Contributor

Hi, @jkossak. We found a possible source of the problem: with some logic turned off, the wrong behaviour no longer occurs. We are now trying to find a solution for those parts.

@jkossak

jkossak commented Sep 14, 2021

Thank you @Mixaster995 for letting me know.

@Mixaster995
Contributor

Hi, @jkossak.
Could you please try the nse-composition example with the images I've prepared:
https://hub.docker.com/r/mixaster95/forwarder for forwarder-vpp
https://hub.docker.com/r/mixaster95/firewall for nse-firewall-vpp, passthrough-1, passthrough-2, passthrough-3

It would be great if you could let me know whether everything is OK or if there is still instability or errors. Thank you.

@jkossak

jkossak commented Sep 16, 2021

Thanks @Mixaster995, I will try the example with the new images today and let you know how it goes.

@jkossak

jkossak commented Sep 16, 2021

Hi @Mixaster995, I still see the pods crashing, except for the nse-passthrough pods.

ns-xnz7h      nsc-kernel-596bdf89b4-rdt8j                                 0/1     Error       3          4m40s
ns-xnz7h      nse-firewall-vpp-6fc568b55b-n76tx                           1/1     Running     0          4m40s
ns-xnz7h      nse-kernel-647564b8c6-d4txr                                 2/2     Running     2          4m40s
ns-xnz7h      nse-passthrough-1-6dbbdbfd57-qqbsq                          1/1     Running     0          4m40s
ns-xnz7h      nse-passthrough-2-85cd6f4cdb-9xf98                          1/1     Running     0          4m40s
ns-xnz7h      nse-passthrough-3-65cdd5cf47-tgjqz                          1/1     Running     0          4m40s
nsm-system    admission-webhook-k8s-5c4f9c9d6d-sm4gs                      1/1     Running     0          6m20s
nsm-system    forwarder-vpp-7bm64                                         1/1     Running     0          6m20s
nsm-system    nsmgr-dx9dg                                                 0/1     Running     4          6m20s
nsm-system    registry-k8s-5446d9b464-6mlxw                               1/1     Running     0          6m20s

The cross connection breaks even faster:

$ kubectl exec ${NSC} -n ${NAMESPACE} -- ping  172.16.1.100
PING 172.16.1.100 (172.16.1.100): 56 data bytes
64 bytes from 172.16.1.100: seq=0 ttl=64 time=93.116 ms
64 bytes from 172.16.1.100: seq=1 ttl=64 time=93.013 ms
64 bytes from 172.16.1.100: seq=2 ttl=64 time=88.954 ms
64 bytes from 172.16.1.100: seq=3 ttl=64 time=120.871 ms
64 bytes from 172.16.1.100: seq=4 ttl=64 time=112.742 ms
64 bytes from 172.16.1.100: seq=5 ttl=64 time=120.646 ms
64 bytes from 172.16.1.100: seq=6 ttl=64 time=120.537 ms
64 bytes from 172.16.1.100: seq=7 ttl=64 time=88.467 ms
64 bytes from 172.16.1.100: seq=8 ttl=64 time=104.360 ms
64 bytes from 172.16.1.100: seq=9 ttl=64 time=100.245 ms
64 bytes from 172.16.1.100: seq=10 ttl=64 time=112.144 ms
64 bytes from 172.16.1.100: seq=11 ttl=64 time=120.108 ms
64 bytes from 172.16.1.100: seq=12 ttl=64 time=115.971 ms
64 bytes from 172.16.1.100: seq=13 ttl=64 time=95.897 ms
64 bytes from 172.16.1.100: seq=14 ttl=64 time=103.794 ms
64 bytes from 172.16.1.100: seq=15 ttl=64 time=235.731 ms
64 bytes from 172.16.1.100: seq=16 ttl=64 time=91.633 ms
64 bytes from 172.16.1.100: seq=17 ttl=64 time=107.529 ms
64 bytes from 172.16.1.100: seq=18 ttl=64 time=131.353 ms
64 bytes from 172.16.1.100: seq=19 ttl=64 time=147.232 ms
64 bytes from 172.16.1.100: seq=20 ttl=64 time=111.130 ms
64 bytes from 172.16.1.100: seq=21 ttl=64 time=143.064 ms
64 bytes from 172.16.1.100: seq=22 ttl=64 time=114.847 ms
command terminated with exit code 137

I see nsmgr getting terminated by the OOM killer, and the nsc-kernel log shows this error:

Sep 16 14:29:13.587 [INFO] [cmd:[/bin/app]] sVID: "spiffe://example.org/ns/ns-xnz7h/sa/default"
Sep 16 14:29:13.880 [INFO] [cmd:[/bin/app]] NSC: Connecting to Network Service Manager unix:///var/lib/networkservicemesh/nsm.io.sock
2021/09/16 14:30:28 Reporting span 445f1d7835133ce0:445f1d7835133ce0:0000000000000000:1
Sep 16 14:30:28.882 [FATA] [cmd:[/bin/app]] rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = "transport: Error while dialing dial unix /var/lib/networkservicemesh/nsm.io.sock: connect: connection refused"
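
When that "connection refused" error appears, a quick check is whether the nsmgr unix socket is actually present inside the client pod (a sketch, using the path from the log above):

kubectl exec ${NSC} -n ${NAMESPACE} -- ls -l /var/lib/networkservicemesh/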

@edwarnicke
Member

@Mixaster995 Could you provide images that simply disable directmemif for the moment for @jkossak to try?

@denis-tingaikin
Member

denis-tingaikin commented Sep 16, 2021

@jkossak That's interesting, I did not expect nsmgr restarts. Could you share the nsmgr logs?

@jkossak

jkossak commented Sep 16, 2021

@Mixaster995, @denis-tingaikin, I increased the nsmgr memory limit to 100Mi (from 60Mi) and everything worked for 20 minutes, but then things started crashing again:

ns-776vj      nsc-kernel-596bdf89b4-dw6qp                                 1/1     Running     0          21m
ns-776vj      nse-firewall-vpp-6fc568b55b-84snl                           1/1     Running     0          21m
ns-776vj      nse-kernel-647564b8c6-nt9v2                                 2/2     Running     1          21m
ns-776vj      nse-passthrough-1-6dbbdbfd57-467gl                          1/1     Running     0          21m
ns-776vj      nse-passthrough-2-85cd6f4cdb-v4g8r                          1/1     Running     0          21m
ns-776vj      nse-passthrough-3-65cdd5cf47-xspdc                          1/1     Running     0          21m
nsm-system    admission-webhook-k8s-5c4f9c9d6d-xjmtk                      1/1     Running     0          22m
nsm-system    forwarder-vpp-frhnc                                         1/1     Running     0          22m
nsm-system    nsmgr-7jlhh                                                 1/1     Running     0          22m
nsm-system    registry-k8s-5446d9b464-gk8hr                               1/1     Running     0          22m
ns-776vj      nsc-kernel-596bdf89b4-dw6qp                                 1/1     Running     5          33m
ns-776vj      nse-firewall-vpp-6fc568b55b-84snl                           1/1     Running     1          33m
ns-776vj      nse-kernel-647564b8c6-nt9v2                                 2/2     Running     3          33m
ns-776vj      nse-passthrough-1-6dbbdbfd57-467gl                          1/1     Running     0          33m
ns-776vj      nse-passthrough-2-85cd6f4cdb-v4g8r                          1/1     Running     0          33m
ns-776vj      nse-passthrough-3-65cdd5cf47-xspdc                          1/1     Running     0          33m
nsm-system    admission-webhook-k8s-5c4f9c9d6d-xjmtk                      1/1     Running     0          35m
nsm-system    forwarder-vpp-frhnc                                         1/1     Running     0          35m
nsm-system    nsmgr-7jlhh                                                 1/1     Running     3          35m

I noticed nsmgr's memory usage keeps increasing, starting from ~40Mi up to:

NAME                                     CPU(cores)   MEMORY(bytes)
admission-webhook-k8s-5c4f9c9d6d-xjmtk   2m           17Mi
forwarder-vpp-frhnc                      207m         431Mi
nsmgr-7jlhh                              54m          93Mi
registry-k8s-5446d9b464-gk8hr            8m           24Mi

nsc-kernel-596bdf89b4-dw6qp          10m          11Mi
nse-firewall-vpp-6fc568b55b-84snl    215m         395Mi
nse-kernel-647564b8c6-nt9v2          1m           3Mi
nse-passthrough-1-6dbbdbfd57-467gl   224m         384Mi
nse-passthrough-2-85cd6f4cdb-v4g8r   213m         386Mi
nse-passthrough-3-65cdd5cf47-xspdc   233m         385Mi

Here are the nsmgr logs:
2.log
3.log

@denis-tingaikin
Member

I cannot find a reason for the nsmgr restart.
Did you use the -p option when getting the logs?

@jkossak

jkossak commented Sep 16, 2021

@denis-tingaikin, I took the logs again with -p: kubectl logs -f -p nsmgr-7jlhh -n nsm-system

I wonder why the pod's memory always increases up to its limit, at which point the kernel kills the process, as shown by dmesg. I guess that is the reason for its restarts.
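
A simple sampling loop can show how fast that memory grows between restarts (a sketch; pod name as above):

# append a timestamped memory sample for nsmgr once a minute
while true; do
  echo "$(date -Is) $(kubectl top pod nsmgr-7jlhh -n nsm-system --no-headers)" >> nsmgr-mem.log
  sleep 60
done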

dmesg.txt
nsmgrlogs.log

@Bolodya1997

@jkossak
Are there any restarts without memory limits?

We are almost sure what our memory limits should be, but we probably need to check one more time. Can you please check out https://github.com/Bolodya1997/deployments-k8s/tree/test-composition again (it is slightly updated), leave it running for a day or two (maybe for the weekend if that is OK with you), and then share the kubectl top pod results again?

@jkossak

jkossak commented Oct 8, 2021

@Bolodya1997 today I noticed a few restarts of nsmgr happened in the same deployment (without a memory limit). Nevertheless, the cross connection seems to be working fine at the moment.

ns-htq7l      nsc-kernel-c68b874d6-cwznk                                  1/1     Running     0          23h
ns-htq7l      nse-firewall-vpp-787cdd5468-lzdkz                           1/1     Running     0          23h
ns-htq7l      nse-kernel-86f6754c7f-ldt5x                                 2/2     Running     0          23h
ns-htq7l      nse-passthrough-1-c7989fbdc-drrr4                           1/1     Running     0          23h
ns-htq7l      nse-passthrough-2-59f8cdc96b-l2td6                          1/1     Running     0          23h
ns-htq7l      nse-passthrough-3-6844c9998f-9hxcv                          1/1     Running     0          23h
nsm-system    admission-webhook-k8s-799d79797b-mzg82                      1/1     Running     0          23h
nsm-system    forwarder-vpp-cbdlf                                         1/1     Running     0          23h
nsm-system    nsmgr-njs28                                                 1/1     Running     4          23h
nsm-system    registry-k8s-749fd78b8b-mk8n6                               1/1     Running     0          23h

Here is the output from: kubectl describe pod nsmgr-njs28 -n nsm-system
describe_nsmgr_pod.txt

The output does not mention OOMKilled as the termination reason.

I will provide the logs soon.

@Bolodya1997

Bolodya1997 commented Oct 8, 2021

@jkossak
Looks like the reason for the restarts is a liveness probe failure; we already have an issue for this in the backlog (#2766). So currently we want to measure the memory limits on your setup to be sure that we cover them in the deployments :)

Can you please retry with https://github.com/Bolodya1997/deployments-k8s/tree/test-composition? I have already disabled the liveness check there.
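
For reference, disabling that probe on an already-deployed nsmgr amounts to something like the following (a sketch; it assumes nsmgr is the nsm-system DaemonSet and the probe sits on its first container; the test branch above already has this change):

kubectl -n nsm-system patch daemonset nsmgr --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'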

@jkossak

jkossak commented Oct 8, 2021

@Bolodya1997 sure, I will retry with https://github.com/Bolodya1997/deployments-k8s/tree/test-composition, leave it running for the weekend, and let you know the results on Monday.

@jkossak

jkossak commented Oct 11, 2021

@Bolodya1997, here is the output from the "kubectl top" command after two days of running the deployment:

# kubectl top pods --all-namespaces | grep -v kube-system
NAMESPACE     NAME                                                        CPU(cores)   MEMORY(bytes)
ns-pzdnm      nsc-kernel-5d7bf66655-zfg6h                                 4m           83Mi
ns-pzdnm      nse-firewall-vpp-787cdd5468-d5xqd                           216m         453Mi
ns-pzdnm      nse-kernel-7cb6586d98-n7th6                                 6m           56Mi
ns-pzdnm      nse-passthrough-1-c7989fbdc-r5l5w                           214m         418Mi
ns-pzdnm      nse-passthrough-2-59f8cdc96b-fvbwh                          215m         420Mi
ns-pzdnm      nse-passthrough-3-6844c9998f-m2skj                          216m         419Mi
nsm-system    admission-webhook-k8s-799d79797b-cmgtg                      2m           24Mi
nsm-system    forwarder-vpp-9cp8s                                         302m         436Mi
nsm-system    nsmgr-wfvt4                                                 78m          104Mi
nsm-system    registry-k8s-749fd78b8b-8dqg2                               11m          35Mi
spire         spire-agent-5rsdq                                           2m           119Mi
spire         spire-server-0                                              11m          115Mi

There are no more restarts of nsmgr:

# kubectl get pods --all-namespaces | grep -v kube-system
NAMESPACE     NAME                                                        READY   STATUS      RESTARTS   AGE

ns-pzdnm      nsc-kernel-5d7bf66655-zfg6h                                 1/1     Running     0          2d19h
ns-pzdnm      nse-firewall-vpp-787cdd5468-d5xqd                           1/1     Running     0          2d19h
ns-pzdnm      nse-kernel-7cb6586d98-n7th6                                 2/2     Running     0          2d19h
ns-pzdnm      nse-passthrough-1-c7989fbdc-r5l5w                           1/1     Running     0          2d19h
ns-pzdnm      nse-passthrough-2-59f8cdc96b-fvbwh                          1/1     Running     0          2d19h
ns-pzdnm      nse-passthrough-3-6844c9998f-m2skj                          1/1     Running     0          2d19h
nsm-system    admission-webhook-k8s-799d79797b-cmgtg                      1/1     Running     0          2d20h
nsm-system    forwarder-vpp-9cp8s                                         1/1     Running     0          2d20h
nsm-system    nsmgr-wfvt4                                                 1/1     Running     0          2d20h
nsm-system    registry-k8s-749fd78b8b-8dqg2                               1/1     Running     0          2d20h
openness      nfd-release-node-feature-discovery-master-cc9d7b8f4-hvn9n   1/1     Running     0          34d
openness      nfd-release-node-feature-discovery-worker-s65vp             1/1     Running     0          34d
spire         spire-agent-5rsdq                                           1/1     Running     0          2d20h
spire         spire-server-0                                              1/1     Running     0          2d20h

But the cross connection seems to be down:

# kubectl exec ${NSC} -n ${NAMESPACE} -- ping -c 4 172.16.1.100
PING 172.16.1.100 (172.16.1.100): 56 data bytes

--- 172.16.1.100 ping statistics ---
4 packets transmitted, 0 packets received, 100% packet loss
command terminated with exit code 1
# kubectl exec ${NSC} -n ${NAMESPACE} -- wget -O /dev/null --timeout 5 "172.16.1.100:8080"
Connecting to 172.16.1.100:8080 (172.16.1.100:8080)
wget: can't connect to remote host (172.16.1.100): Host is unreachable
command terminated with exit code 1
# kubectl exec ${NSE} -n ${NAMESPACE} -- ping -c 4 172.16.1.101
Defaulted container "nse" out of: nse, nginx
PING 172.16.1.101 (172.16.1.101): 56 data bytes

--- 172.16.1.101 ping statistics ---
4 packets transmitted, 0 packets received, 100% packet loss
command terminated with exit code 1

After these failed ping requests I noticed a forwarder-vpp restart:

# kubectl get pods --all-namespaces | grep -v kube-system
NAMESPACE     NAME                                                        READY   STATUS      RESTARTS   AGE
ns-pzdnm      nsc-kernel-5d7bf66655-zfg6h                                 1/1     Running     0          2d21h
ns-pzdnm      nse-firewall-vpp-787cdd5468-d5xqd                           1/1     Running     0          2d21h
ns-pzdnm      nse-kernel-7cb6586d98-n7th6                                 2/2     Running     0          2d21h
ns-pzdnm      nse-passthrough-1-c7989fbdc-r5l5w                           1/1     Running     0          2d21h
ns-pzdnm      nse-passthrough-2-59f8cdc96b-fvbwh                          1/1     Running     0          2d21h
ns-pzdnm      nse-passthrough-3-6844c9998f-m2skj                          1/1     Running     0          2d21h
nsm-system    admission-webhook-k8s-799d79797b-cmgtg                      1/1     Running     0          2d21h
nsm-system    forwarder-vpp-9cp8s                                         1/1     Running     1          2d21h
nsm-system    nsmgr-wfvt4                                                 1/1     Running     0          2d21h
nsm-system    registry-k8s-749fd78b8b-8dqg2                               1/1     Running     0          2d21h
spire         spire-agent-5rsdq                                           1/1     Running     0          2d21h
spire         spire-server-0                                              1/1     Running     0          2d21h

@Bolodya1997

@jkossak
Can you please share the following:

kubectl -n nsm-system logs nsmgr-wfvt4 > nsmgr.log
kubectl -n nsm-system logs -p forwarder-vpp-9cp8s > forwarder_old.log
kubectl -n nsm-system logs forwarder-vpp-9cp8s > forwarder_new.log
kubectl -n ns-pzdnm logs nsc-kernel-5d7bf66655-zfg6h > nsc.log
kubectl -n ns-pzdnm logs nse-kernel-7cb6586d98-n7th6 > nse.log
kubectl -n nsm-system describe pod forwarder-vpp-9cp8s > forwarder_describe.txt

@jkossak

jkossak commented Oct 11, 2021

@Bolodya1997, here are all the logs; some of the log files are quite big at this stage :)

kubectl -n nsm-system logs forwarder-vpp-9cp8s > forwarder_new.log

forwarder_new.zip

kubectl -n nsm-system describe pod forwarder-vpp-9cp8s > forwarder_describe.txt

forwarder_describe.txt

kubectl -n ns-pzdnm logs nsc-kernel-5d7bf66655-zfg6h > nsc.log

nsc_1.zip
nsc_2.zip

kubectl -n ns-pzdnm logs nse-kernel-7cb6586d98-n7th6 > nse.log

nse_1.zip
nse_2.zip

kubectl -n nsm-system logs -p forwarder-vpp-9cp8s > forwarder_old.log

forwarder_old_8.zip
forwarder_old_7.zip
forwarder_old_6.zip
forwarder_old_5.zip
forwarder_old_4.zip
forwarder_old_3.zip
forwarder_old_2.zip
forwarder_old_1.zip

kubectl -n nsm-system logs nsmgr-wfvt4 > nsmgr.log

nsmgr_32.zip
nsmgr_31.zip
nsmgr_30.zip
nsmgr_29.zip
nsmgr_28.zip
nsmgr_27.zip
nsmgr_26.zip
nsmgr_25.zip
nsmgr_24.zip
nsmgr_23.zip
nsmgr_22.zip
nsmgr_21.zip
nsmgr_20.zip
nsmgr_19.zip
nsmgr_18.zip
nsmgr_17.zip
nsmgr_16.zip
nsmgr_15.zip
nsmgr_14.zip
nsmgr_13.zip
nsmgr_12.zip
nsmgr_11.zip
nsmgr_10.zip
nsmgr_9.zip
nsmgr_8.zip
nsmgr_7.zip
nsmgr_6.zip
nsmgr_5.zip
nsmgr_4.zip
nsmgr_3.zip
nsmgr_2.zip
nsmgr_1.zip

@Bolodya1997

@jkossak
Your logs are very helpful, thank you!
High memory usage in the client is unexpected and, unfortunately, I haven't added pprof to the client deployment I provided. But it looks like I can reproduce this on my local single-node setup, so this is currently in progress.

@Bolodya1997

Hi @jkossak,
There were some stability issues in NSM over the last few days, but currently our tests are passing. Can you please run the NSE Composition example one more time on your setup?

  1. New branch for testing is https://github.com/Bolodya1997/deployments-k8s/tree/test-composition-new.
  2. Please run it for a day or two and share kubectl top pods --all-namespaces | grep -v kube-system.

@jkossak

jkossak commented Oct 27, 2021

Hi @Bolodya1997, apologies for the delay. I have been experiencing some issues with the deployment which I haven't seen before. I keep getting errors trying to bring up the example pods. The nsm-system pods look OK, but none of the ns-* namespace pods get created.

# kubectl get pods -n nsm-system
NAME                                     READY   STATUS    RESTARTS   AGE
admission-webhook-k8s-799d79797b-825tv   1/1     Running   0          51m
forwarder-vpp-dm89d                      1/1     Running   0          51m
nsmgr-4tf85                              1/1     Running   0          51m
registry-k8s-749fd78b8b-42cpb            1/1     Running   0          51m
# kubectl get pods -n ns-4xbnk
No resources found in ns-4xbnk namespace.

Though I have the replicasets:

# kubectl get replicasets -n ns-4xbnk
NAME                           DESIRED   CURRENT   READY   AGE
nsc-kernel-88f5d8bc5           1         0         0       47m
nse-firewall-vpp-787cdd5468    1         0         0       47m
nse-kernel-c9797c574           1         0         0       47m
nse-passthrough-1-c7989fbdc    1         0         0       47m
nse-passthrough-2-59f8cdc96b   1         0         0       47m
nse-passthrough-3-6844c9998f   1         0         0       47m

However, each replicaset in the ns-4xbnk namespace shows the same error in its description:

Conditions:
  Type             Status  Reason
  ----             ------  ------
  ReplicaFailure   True    FailedCreate
Events:
  Type     Reason        Age                 From                   Message
  ----     ------        ----                ----                   -------
  Warning  FailedCreate  11m (x20 over 49m)  replicaset-controller  Error creating: Internal error occurred: failed calling webhook "admission-webhook-k8s-799d79797b-825tv.networkservicemesh.io": Post "https://admission-webhook-svc.nsm-system.svc:443/mutate?timeout=10s": Service Unavailable

Here is the whole description of nse-firewall-vpp-787cdd5468 replicaset:
replicaset_desc.txt

I wonder if you can tell what the root cause is and how it can be fixed.
I can see that the admission-webhook service is running, and with netcat and the pod's IP I am able to connect to port 443, on which admission-webhook-k8s is listening. Maybe increasing the timeout would help?
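
Beyond netcat, it may also be worth checking that the webhook Service resolves to a ready endpoint and that the webhook itself logs nothing suspicious while the replicaset retries (a sketch; the service name is taken from the error message above):

kubectl -n nsm-system get svc admission-webhook-svc
kubectl -n nsm-system get endpoints admission-webhook-svc
kubectl -n nsm-system logs deploy/admission-webhook-k8s --tail=100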

@Bolodya1997

Hi @jkossak,
It is a very common error when the kubectl delete mutatingwebhookconfiguration --all step is missed during the basic cleanup. The right steps for NSM testing with a basic restart are the following:

  1. Use setup from basic: kubectl create ns nsm-system && kubectl apply -k .
  2. Do something with NSM.
  3. Remove nsm-system : kubectl delete ns nsm-system.
  4. Remove mutatingwebhookconfiguration : kubectl delete mutatingwebhookconfiguration --all.
    • If you have a complex k8s system, you can find the needed mutatingwebhookconfiguration and manually remove it with:
      1. kubectl get mutatingwebhookconfiguration.
      2. kubectl delete mutatingwebhookconfiguration <needed name from the list>.
  5. Now you can start again with [1].

Missing step [4] sometimes leads to the error case where an old mutatingwebhookconfiguration is left on the cluster and causes replicasets to fail to deploy pods.

Please correct me if you have a different scenario or if performing steps [3]-[5] on your current NSM setup doesn't fix the issue.
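
Put together, the restart cycle from steps [1], [3] and [4] is roughly (a sketch; run from the basic example directory):

kubectl delete ns nsm-system
kubectl delete mutatingwebhookconfiguration --all
kubectl create ns nsm-system && kubectl apply -k .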

@jkossak

jkossak commented Oct 28, 2021

@Bolodya1997, the mutatingwebhookconfiguration is being deleted during the basic cleanup. After cleaning, I have no mutatingwebhookconfiguration left:

# kubectl get mutatingwebhookconfiguration --all-namespaces
No resources found

@Bolodya1997

Bolodya1997 commented Oct 29, 2021

@jkossak, can you please share the logs and the describe output for the admission webhook pod?

kubectl -n nsm-system logs admission-webhook-k8s-799d79797b-825tv > admission-webhook.log
kubectl -n nsm-system describe pod admission-webhook-k8s-799d79797b-825tv > admission-webhook-describe.txt

@jkossak

jkossak commented Oct 29, 2021

@Bolodya1997, sure, here are the logs:
admission-webhook.log
Thanks for looking into it!

@Bolodya1997

Bolodya1997 commented Nov 1, 2021

@jkossak, have you tried rerunning the test?

I am not able to reproduce this issue locally; mostly it looks like some invalid behavior of the web server used in the admission webhook. If this was the first and only test run on the new branch, can you please retry the test? If not, please tell me and we will keep trying to figure out the cause.

@jkossak

jkossak commented Nov 2, 2021

@Bolodya1997, yes, I have tried rerunning the test many times now, from different branches, including https://github.com/Bolodya1997/deployments-k8s/tree/test-composition and test-composition-new; always the same result.

denis-tingaikin moved this to In Progress in Release 1.1.0 Nov 17, 2021
denis-tingaikin moved this from In Progress to Todo in Release 1.1.0 Nov 18, 2021
@denis-tingaikin
Member

denis-tingaikin commented Nov 22, 2021

The initial problem with pings after running NSM for a long time is solved.

We also found a few issues that we'll consider separately.

  1. Ticket for cmd-exclude-prefixes-k8s: "Fix found issues for cmd-exclude-prefixes-k8s after testing on CI" cmd-exclude-prefixes-k8s#64
  2. Ticket for probes: "Spire failure causes NSMgr/Forwarder liveness probe failures" #2766
  3. Ticket for nsc/nse restarts: "cmd-nsc-kernel and cmd-nse-kernel can be restarted after a long NSM running" #3771

@jkossak Many thanks for testing this. It was a super useful contribution for us.

Feel free to open new issues if you face something unexpected :)
