
NSM datapath monitoring on proxy and TAPA NSC #522

Open · wants to merge 2 commits into base: master
Conversation

@LionelJouin LionelJouin commented Apr 25, 2024

Description

Introduce optional NSM datapath monitoring/healing (a liveness check) that verifies the connection is alive by pinging the endpoint from the client. For now this option is available only between the proxy and the LBs, and it has to be enabled via new environment variables introduced in the proxy.

Introduce optional NSM datapath monitoring/healing (liveness check) between the TAPA and the proxy as well. It also has to be enabled via new environment variables introduced in the proxy.

https://networkservicemesh.io/docs/concepts/specs/datapath_healing/

Issue link

Checklist

  • Purpose
    • Bug fix
    • New functionality
    • Documentation
    • Refactoring
    • CI
  • Test
    • Unit test
    • E2E Test
    • Tested manually
  • Introduce a breaking change
    • Yes (description required)
    • No

@LionelJouin LionelJouin force-pushed the nsm-monitoring branch 4 times, most recently from 9e745ba to 36acf2c on April 30, 2024 14:05
@LionelJouin LionelJouin changed the title [WIP] NSM datapath monitoring NSM datapath monitoring on proxy NSC Apr 30, 2024
@LionelJouin LionelJouin marked this pull request as ready for review April 30, 2024 14:05
@LionelJouin LionelJouin force-pushed the nsm-monitoring branch 2 times, most recently from 8798185 to 556a281 on April 30, 2024 14:28
@LionelJouin LionelJouin changed the title NSM datapath monitoring on proxy NSC NSM datapath monitoring on proxy and TAPA NSC May 2, 2024
@LionelJouin LionelJouin requested a review from zolug May 2, 2024 07:46

zolug commented May 8, 2024

The src and dst IPs of the IPContext can get updated by the proxy; this is especially common for TAPAs.
Yet, the DATAPATH_* keys are not updated, which can make datapath monitoring fail permanently after the IP update.


zolug commented Sep 2, 2024

Posting my observations with datapath healing:

NSMgr restart issues with datapath healing enabled
==================================================

https://github.com/Nordix/Meridio/pull/522
livenessCheckTimeout=1s
livenessCheckInterval=2s
number of pings per check=4

Recent Meridio running with NSM v1.11.2 SDKs and the above PR.
NSM: 1.13.0
Kind cluster with 4 nodes (k8s 1.26.6)

During the tests I either deleted 1 or 2 nsmgr PODs at the same time while running ctraffic for 200s.

Generic note:
When the NSM connection between a proxy and an LB breaks so that the associated NSM interface is removed on the LB side, Targets registered as reachable via the now inaccessible proxy are withdrawn from the nfqlb distributor. This changes the hash table and renders the affected Targets unreachable until the connection is fixed. Any incoming packet belonging to a previously established session might end up on a different Target than before (remember, it's a stateless LB). For example, for established TCP connections, incoming packets whose connection endpoint is the Target just removed from the distribution pool should expect an RST when they end up on a different Target. This is one reason why enabling datapath healing sounds tempting.

On the other hand, we might consider changing the Target verification mechanism to not withdraw a Target (in the LB) without interface/route right away, but instead allow some time for NSM to recover the links first (e.g. 1s). Delaying how fast the LB acts upon link failure can result in longer traffic disturbance, but it could also prevent "misrouting" in case of short, temporary link-down events invoking NSM heal.


NSE timeout:
------------
Where: Proxy and LB (but only LB NSE timeout causes issues in Proxy)

  +----------------------------------+
  |             NODE                 |
  +----------------------------------+
  | +-----+   Register    +-------+  |
  | | NSE |  -----X---->  | NSMgr |  |
  | +-----+               +-------+  |
  +----------------------------------+

Note: Independent from datapath healing.

Background:
NSE registration lifetime is set to 1 minute by a custom nVIP chain.
In NSM it was changed some time ago to follow the MaxTokenLifetime value (defaults to 10 minutes).
Yet, in nVIP we reverted to the old 1-minute lifetime value due to k8s-registry problems (seen at SC/UDM last year) where LB NSE disappearances, e.g. due to POD delete, were often not reported via the NSM-based Find feature to proxies (only when the related custom resource entry timed out in etcd). Thus, I thought keeping the lower lifetime value for LB NSEs could somehow mitigate the effects of not being informed in time.
I don't remember all the details unfortunately, so I can't tell if any registry improvements have been made tackling the original problem.

The 1-minute NSE timeout can be problematic if an nsmgr POD collocated with an LB-FE is deleted. If the new nsmgr POD instance's startup is delayed a bit longer, the LB-FE cannot refresh its NSE registration, so the respective custom resource in etcd expires. Proxies on other nodes should notice the timeout and would thus try to clean up the NSM connection towards the particular LB-FE (causing interface removal).
The Close from the proxy side might reach the LB with considerable delay due to the unavailability of NSMgr.
Moreover, the LB NSE might re-register in the meantime, prompting the proxy to re-connect. There can be a race between the Request for the new connection (conn IDs will differ) and the Close of the old connection, meaning the LB might run into "Interface replaced during interface create event"; that is, the Request might overtake the Close by creating an additional interface on the LB side for the same proxy. Either way, the LB will unconfigure associated Targets (temporarily) based on verifyTargets().

The main question is: how does NSM's NetworkServiceEndpointRegistryClient work? :) (And whether it would be better to replace it with some other implementation to subscribe to NSE events.)
If it's not really reliable, then datapath healing could really contribute to detecting lost NSEs. But for datapath monitoring it would be nice if NSEs did not disappear that fast due to temporary unavailability of nsmgrs, which means the NSE lifetime would have to be increased.


NSC heal local Request timeout 1
--------------------------------
Where: TAPA (but Proxy is also susceptible to the same timeout where the default Request timeout is 15 seconds)

  +----------------------------------+
  |             NODE                 |
  +----------------------------------+
  | +-----+    Request    +-------+  |
  | | NSC |  -----X---->  | NSMgr |  |
  | +-----+               +-------+  |
  +----------------------------------+

In the nVIP baseline the Request runs with a hardcoded 10-second timeout. NSM's begin.Client in the NSC uses that timeout to send out a Request upon a Control Plane event if datapath connectivity has been verified, which leaves a rather short window for local nsmgr recovery.
One might wonder why this is a problem if the Request fails, because seemingly without the reconnect option, Close isn't called by begin.eventFactoryClient in case of heal. Apparently, if there's little time remaining (before the timeout) when a Request attempt reaches the local nsmgr and forwarder, then in the forwarder, depending on where/when the context times out, a Close can be issued that tears down the related interface, affecting the traffic.

I locally modified the TAPA code to use the env variable Timeout instead (whose default value is 15 seconds). Although this change improved the success rate in my tests, there's no telling how long nsmgr recovery could take.

NOTE that increasing the Request timeout has to be aligned with the IP Release Delay in proxies to avoid unwanted release of NSM connection IPs still in use.
The latter (i.e. IP Release Delay) MUST be longer (IMHO at least by 5 seconds). (https://github.com/Nordix/Meridio/pull/524)

Remote Request:
When the NSC resides on a different node than the restarted nsmgr, NSM behaves differently. The connect timeout between the local nsmgr and the remote nsmgr is governed by another NSM timeout value, whose default AFAIK recently got increased to 1 sec. This means the Request times out after 1s in the local nsmgr in such cases, returning an error right away and thus forcing a new heal Request. I haven't observed any interface removal events related to this in my tests.
(NSM dial timeout changes: https://github.com/networkservicemesh/deployments-k8s/issues/11372)


NSC heal local Request timeout 2:
---------------------------------
Where: TAPA and Proxy

  +----------------------------------+
  |             NODE                 |
  +----------------------------------+
  | +-----+    Request   +-------+   |
  | | NSC |  -----X----> | NSMgr |   |
  | +-----+              +-------+   |
  +----------------------------------+

Connected to "NSC heal local Request timeout 1". Besides the length of the Request timeout, gRPC's Max Backoff Delay also determines when a heal-issued Request can reach the new local NSMgr.
Even if the new nsmgr is up and running (with SVID and everything), gRPC might significantly delay the re-send of the Request.

nVIP has support for setting MaxGRPCBackoff, but it is not applied when creating the apiClient, which is used for example to create the NSCs in the proxy and TAPA.
I locally modified the nVIP code to consider the env variable in those cases as well. Yet, the env variable's default value is 5s, while 1s would be much safer.


Obtaining SVID takes longer:
----------------------------
Where: NSMgr (at least that's where it matters if doing nsmgr delete tests)

Connected to the above ("NSC heal local Request timeout 1") problem. It can further delay the startup of nsmgr, thus increasing the probability of an unsuccessful first heal Request.

In Meridio, we do test with Spire 1.2.2. Using Spire 1.9.4 seemed to eliminate this delay in my tests (https://github.com/Nordix/Meridio/pull/519).
I have seen delays in Infra logs from Pal as well (spire image version 1.14.0-30). What Spire version does that translate to?


Out of the blue datapath monitoring error reports:
--------------------------------------------------
Where: Mostly Proxy but observed in case of TAPA as well

With datapath healing enabled, I occasionally noticed "No packets received" log printouts leading to "Data plane is down" and thus reconnect with reselect. This Closes the NSM connection and removes interfaces, which might affect traffic.
This "No packets..." printout indicates that in the particular datapath verification round no replies were received within the configured liveness timeout for some IP. According to my observations (dualstack), mostly only 1 of the 2 IPs failed to get verified.

It could be that a slightly longer default LivenessCheckTimeout than 1 second would be better. Although a longer timeout could delay detection and thus delay the recovery.

Not sure if the following plays any role in provoking the problem, but during my nsmgr delete tests ctraffic was run for 200s. And IIRC the ping problem appeared (even multiple times) mostly after nsmgr recovered and after NSM heal succeeded in securing the old connection/interfaces.


IPContext update not supported by the custom kernelheal in the current Meridio PR:
----------------------------------------------------------------------------------
Where: TAPA, Proxy NSC

IPs might get updated (especially in case of TAPA) upon a Request. Yet, the newly introduced DATAPATH_* keys are not updated accordingly in such cases, which will break datapath monitoring for such connections, forcing them into a reconnect cycle. (The PR should be updated.)
https://github.com/Nordix/Meridio/pull/522#issuecomment-2100136033

(Deleting and re-adding the Conduit while keeping the Targets running seems like a good approach for reproduction.)
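A fix would have to rewrite the DATAPATH_* entries whenever the IPContext changes. The sketch below shows the idea on a plain map; the DATAPATH_ prefix comes from the PR, but the exact key suffixes (SRC_n/DST_n) and the map shape are hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// refreshDatapathKeys drops all stale DATAPATH_* entries and rewrites them
// from the current IPContext src/dst IPs, so an address removed from the
// IPContext is not pinged (and failed) forever. Key suffixes are illustrative.
func refreshDatapathKeys(extraContext map[string]string, srcIPs, dstIPs []string) {
	for k := range extraContext {
		if strings.HasPrefix(k, "DATAPATH_") {
			delete(extraContext, k)
		}
	}
	for i, ip := range srcIPs {
		extraContext[fmt.Sprintf("DATAPATH_SRC_%d", i)] = ip
	}
	for i, ip := range dstIPs {
		extraContext[fmt.Sprintf("DATAPATH_DST_%d", i)] = ip
	}
}

func main() {
	// Stale entry left over from before the proxy updated the IPContext:
	ctx := map[string]string{"DATAPATH_DST_0": "172.16.0.1/32"}
	refreshDatapathKeys(ctx, []string{"172.16.0.2/32"}, []string{"172.16.0.3/32"})
	fmt.Println(ctx)
}
```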


Outdated ARP/neighbor cache:
----------------------------

Where: LB


  +------+  NSM conn   +-------+   NSM conn   +------+
  | TAPA |  <---X--->  | Proxy |  <---------> |  LB  |
  +------+             +-------+              +------+


After an NSMgr restart, if datapath healing succeeds in the proxy towards the LB, then the interfaces between the two are preserved. However, if the TAPA and proxy interfaces had to be recreated as part of heal (either due to disabled datapath monitoring in the TAPA, or an "NSC heal local Request timeout" with a Close), then in the LB the ARP/neighbor cache might contain an outdated L2 address for the Target IP. This can lead to blackholing of incoming packets arriving from outside the cluster, causing a considerable outage, i.e. a more severe traffic outage than would take place without any datapath healing. (Impact depends on timing, on traffic mix, and on sysctl settings affecting ARP/neighbor cache operation.)

Gratuitous ARP (and its IPv6 counterpart, unsolicited Neighbor Advertisement) could help to avoid this problem. (Should be part of NSM.)

Btw, the very same problem happens if datapath healing is only enabled in the Proxy (towards LBs) but not in the TAPA. (Pal managed to run into it internally.)


Datapath verification not cancelled in time upon/after connection Close:
------------------------------------------------------------------------

I have only seen it once. After an LB NSE timeout (refer to "NSE timeout"), a proxy ordered the closing of the NSM connection towards said LB, which seemed to succeed. But around the same time, for the same connection ID, datapath verification reported ping errors ("No packets received"), and I also got "Data plane is down" and "Reconnect with reselect" log printouts. Not sure if the context cancellation imposed by Close propagated slowly to heal's eventLoop, but that's how it looked.
Also, monitorDataPlane in heal won't cancel any ongoing heal.livenessCheck either. However, seemingly the verdict should be discarded anyway if heal.eventLoop's waitForEvents had received the context cancellation...
