East/west connectivity monitoring tool #5514

zm1990s · 2023-09-20T15:00:35Z

Description
Antrea currently monitors only Controller/Agent status, and a healthy Controller/Agent does not mean that East-West connectivity is good. The metrics provided by Antrea do not reflect Pod-to-Pod connectivity either.
From an application perspective, we need a tool that can detect and report Pod-to-Pod connectivity issues.

Core feature required
A tool (maybe a DaemonSet) that periodically generates East/West traffic and checks whether E/W connectivity is good. If any of the checks fail, alerts or logs should be sent to external monitoring tools.

The detection interval should be adjustable, as with traditional load balancer health checks: for example, send a probe every second and raise an alert when 3 consecutive probes fail.

Other related features
Since this is an E/W monitoring tool, other related Antrea features could be monitored too. For example:

Pod to Service of type ClusterIP
Pod to Service of type NodePort
Pod to external network via SNAT

tnqn · 2023-09-20T15:43:56Z

@zm1990s Thanks for the proposal.
Monitoring EW connectivity should be feasible. But most users may not want extra long-running Pods, especially a DaemonSet, deployed in their cluster for this purpose. We can probably leverage the state of the memberlist run by antrea-agent as the source of connectivity status. As for alerts, K8s events associated with the unreachable Node may be an option. However, we need to ensure that duplicate events do not flood the event system. Using consistent hashing to select one "reporter" Node among the available Nodes may be feasible.
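To illustrate the "single reporter" idea, here is a minimal Go sketch. It uses rendezvous (highest-random-weight) hashing, a close relative of consistent hashing chosen here for brevity; it is not necessarily what Antrea would use. The function names and Node names are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// score hashes the (unreachable Node, candidate reporter) pair.
func score(target, candidate string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(target))
	h.Write([]byte{0}) // separator so ("ab", "c") and ("a", "bc") hash differently
	h.Write([]byte(candidate))
	return h.Sum64()
}

// pickReporter returns the one Node, among those still alive, that should
// report target as unreachable. Every Node that evaluates this with the same
// membership view picks the same reporter, without any coordination.
func pickReporter(target string, alive []string) (string, bool) {
	best, bestScore, found := "", uint64(0), false
	for _, n := range alive {
		if n == target {
			continue // the unreachable Node cannot report itself
		}
		s := score(target, n)
		// Ties are broken by name so the result does not depend on slice order.
		if !found || s > bestScore || (s == bestScore && n < best) {
			best, bestScore, found = n, s, true
		}
	}
	return best, found
}

func main() {
	alive := []string{"node-a", "node-b", "node-c"}
	if reporter, ok := pickReporter("node-d", alive); ok {
		fmt.Printf("%s creates the event for node-d\n", reporter)
	}
}
```

The scheme only avoids duplicates while the alive Nodes agree on membership; during a partition, two Nodes with different views could both report.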

Regarding Pod-to-Service and Pod-to-External monitoring, I'm not sure that proactively generating traffic would be really helpful or practical. It's not easy to even know whether a particular access is supposed to succeed, given different policy, firewall, and network topology configurations, not to mention that many users may not want traffic generated on behalf of, or towards, their applications. I think in practice most such tools are implemented as scripts/playbooks executed out-of-band, using the user's own application, according to what they want to monitor.

cc @jianjuns @antoninbas @salv-orlando

tnqn · 2023-09-20T15:47:47Z

However, it seems that monitoring EW connectivity via memberlist would just be a faster way to get notified of a Node becoming unreachable, compared with K8s's native Node status. If users just want this status to be reported faster, they can also simply update node-monitor-grace-period, so I'm still wondering what value this would really add.

tnqn · 2023-09-20T15:53:22Z

A tool (like an antctl subcommand) for smoke testing may be the most practical way in the end.

antoninbas · 2023-09-20T16:50:41Z

> However, it seems that monitoring EW connectivity via memberlist would just be a faster way to get notified of a Node becoming unreachable, compared with K8s's native Node status.

I think that from an Antrea perspective, it would be good to monitor the health of the overlay network (in encap mode) by running some ping-mesh across all gateways. Being able to report latency across Nodes would also be quite nice, but I don't think we can do that with memberlist (IIRC, we discussed that in the past). With latency data available, we could even display a heat map in Antrea UI and update it in real-time.

jianjuns · 2023-09-21T00:28:38Z

Agree with what @antoninbas said.

tnqn · 2023-09-21T03:20:38Z

If latency data is not needed, I think the health of the overlay network matches the health status of memberlist in practice. Barring a misconfiguration where the memberlist port is whitelisted but the overlay port is not, which could only occur when deploying a cluster and not during routine operation, I can't think of a situation where memberlist reports a Node as healthy but its overlay doesn't work. But if we want to add latency data, I agree memberlist may not be able to provide it (however, I don't quite remember us discussing this; could you share a link if there is one?).

antoninbas · 2023-09-21T04:30:32Z

> If latency data is not needed, I think the health of the overlay network matches the health status of memberlist in practice. Barring a misconfiguration where the memberlist port is whitelisted but the overlay port is not, which could only occur when deploying a cluster and not during routine operation, I can't think of a situation where memberlist reports a Node as healthy but its overlay doesn't work.

I think overlay (ping between gateways) is a bit more "end-to-end". In addition to port whitelisting, we could potentially detect issues like a missing route on the host (granted, that has not happened in a while, but we used to have such issues). I was thinking that with the right "probe" (e.g. a TCP data exchange), the health check would also fail in case of a checksum issue (basically any issue with the NIC configuration that is specific to double encapsulation).

> But if we want to add latency data, I agree memberlist may not be able to provide it (however, I don't quite remember us discussing this; could you share a link if there is one?).

The latency heat map is something that has been on my mind for a while. I remember someone telling me that Weave had something like this, but I can't find a reference to it.
I brought this up very superficially when we added memberlist as a dependency: #2128 (comment)

zm1990s · 2023-09-22T03:03:45Z

@tnqn I think this tool should be decoupled from Antrea Controller/Agent, just like nsx-interworking. Users can decide whether they need to use it or not.

antoninbas · 2023-09-28T18:41:08Z

Assigning to @tushartathgur who said he would look into this.
cc @yuntanghsu as well.

github-actions · 2023-12-28T00:03:33Z

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

antoninbas · 2024-01-23T18:46:48Z

We have submitted this issue as a project idea for the LFX mentorship program: cncf/mentoring#1129. So no-one should ideally work on this issue until we know if the proposal is accepted and if we can match a mentee to work on it.
See https://docs.linuxfoundation.org/lfx/mentorship for more information on the program.

prakrit55 · 2024-01-24T04:18:54Z

@antoninbas, I am very interested in the project. How do I reach you on Slack, or are there other options? I have an intermediate knowledge of k8s and golang, and I am curious to know how much frontend work is involved here. Thank you.

antoninbas · 2024-01-24T05:17:41Z

@prakrit55 you can reach out to us on Slack (we have the #antrea channel in the K8s slack)

but for this specific issue, please see comment above (#5514 (comment)). If you are interested, you could consider applying for the LFX mentorship program.

prakrit55 · 2024-01-24T05:21:19Z

> @prakrit55 you can reach out to us on Slack (we have the #antrea channel in the K8s slack)
>
> but for this specific issue, please see comment above (#5514 (comment)). If you are interested, you could consider applying for the LFX mentorship program.

Hey, thank you @antoninbas, I found your channel. I would really like to apply for the LFX mentorship for this project, in the March-May term.

btwshivam · 2024-01-24T06:00:20Z

@antoninbas The prospect of working collectively on a comprehensive project like this is truly exciting, and I am keen on contributing my skills and enthusiasm to its success. The outlined sub-projects align perfectly with my interests, and they present a great opportunity for learning, growth, and industry exposure.
I look forward to contributing to the project and learning from the experience.

nate-double-u · 2024-01-31T22:50:52Z

Hello, everyone. I'm pleased to see how many folks are interested in participating in the LFX Mentorship Program.

Upstream issues like this are an excellent place to discuss specific technical topics or provide ideas about how you may tackle a problem; however, please post any questions about the LFX program and how to apply on the mentorship discussion forums (and indeed, some of these questions may have already been answered there, or on the Program Guidelines page).

antoninbas · 2024-02-09T19:23:24Z

For all the folks who have applied or are considering applying to one of the Antrea projects for the LFX mentorship program, we have published instructions to complete test tasks: #5976. We will review your submissions for these tasks alongside other material (resume, cover letter) when selecting mentees. The deadline for submitting is February 20th 5PM PST.


antoninbas · 2024-03-05T00:09:13Z

@IRONICBo will work on this as part of the LFX mentorship program

IRONICBo · 2024-03-29T04:47:32Z

Monitoring tool API design proposal

Monitoring tool needs a uniform config

Users and administrators need a way to measure and monitor network performance, specifically the latency between nodes, to ensure optimal cluster performance and troubleshoot potential issues.

Watch a singleton CRD

The proposed solution is to introduce a new Custom Resource Definition (CRD) called PingMonitoringToolConfig in Antrea. This CRD will allow users to enable and configure a ping monitoring tool that measures the latency between nodes. The configuration will include parameters such as the ping interval, timeout, and concurrency limit.

The Antrea agents will listen for changes to this CRD and adjust their monitoring behavior accordingly. When the monitoring feature is enabled through its feature gate and the configuration, each agent will watch for creation, update, and deletion events on this resource and start, stop, or reconfigure the monitoring tool in real time.

Additionally, a singleton pattern will be enforced using a validation webhook to ensure that only one instance of the CRD exists in the cluster.

Use Feature Gate & Config & CRD to start monitoring tool

The solution introduces a new user-facing feature that allows users to enable and configure the ping monitoring tool via a YAML config file. Users can apply this YAML file using kubectl to create or update the PingMonitoringToolConfig resource.

The changes will be automatically picked up by the Antrea agents, and the monitoring behavior will be updated accordingly. This feature provides users with a structured and easy-to-consume API for enabling and configuring the ping mesh feature.

Main design/architecture

The main design involves the following components:

CRD Definition: A new CRD, PingMonitoringToolConfig, will be defined with fields for the ping interval, timeout, and concurrency limit. The tool is enabled by creating the resource.
Here is an example of the corresponding type definitions in Go:

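    import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

    // PingMonitoringToolConfigSpec holds the tunable parameters of the ping monitoring tool.
    type PingMonitoringToolConfigSpec struct {
        // PingInterval is the ping interval, as a duration string such as "10s".
        PingInterval string `json:"pingInterval,omitempty"`
        // PingTimeout is the ping timeout, as a duration string such as "5s".
        PingTimeout string `json:"pingTimeout,omitempty"`
        // PingConcurrentLimit is the concurrency limit for pings.
        PingConcurrentLimit int `json:"pingConcurrentLimit,omitempty"`
    }

    // PingMonitoringToolConfig is the singleton configuration resource.
    type PingMonitoringToolConfig struct {
        metav1.TypeMeta   `json:",inline"`
        metav1.ObjectMeta `json:"metadata,omitempty"`

        Spec PingMonitoringToolConfigSpec `json:"spec,omitempty"`
    }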

Singleton Pattern Enforcement: A validation webhook will be implemented to ensure that only one instance of the PingMonitoringToolConfig resource can exist in the cluster. This webhook will reject the creation of additional instances if one already exists.

Agent Behavior: Antrea agents will listen for changes to the PingMonitoringToolConfig resource and update their monitoring behavior based on the configuration. The agents will use a Kubernetes client to watch for changes to the resource and adjust their ping interval, timeout, and concurrency limit accordingly.

Monitoring Logic: The ping monitoring tool will measure the latency between nodes and provide metrics that can be used for monitoring and troubleshooting. The tool will use ICMP ping requests to measure the latency between nodes.

Here is an example of a YAML configuration file for the PingMonitoringToolConfig resource:

    apiVersion: networking.antrea.io/v1alpha1
    kind: PingMonitoringToolConfig
    metadata:
      name: default
    spec:
      pingInterval: "10s"
      pingTimeout: "5s"
      pingConcurrentLimit: 10

In this example, the ping monitoring tool is enabled with a ping interval of 10 seconds, a ping timeout of 5 seconds, and a concurrency limit of 10.

Alternative solutions

Using a ConfigMap: Instead of a CRD, a ConfigMap could be used to configure the ping monitoring tool. However, this approach lacks the structure and validation capabilities provided by CRDs.
Using the antctl API server: This would require registering an API server and using antctl to update the parameters of the monitoring tool. We would also need to consider how to keep the configuration parameters uniform and observable across the cluster.

This proposal aims to provide a flexible and user-friendly way to monitor node-to-node latency in a Kubernetes cluster, enhancing the observability and manageability of the network performance in Antrea-managed clusters.

Dyanngg · 2024-04-01T17:56:20Z

A validation webhook won't be necessary if we simply add an OpenAPI validation rule that constrains the name of the CRD object created. See https://github.com/kubernetes-sigs/network-policy-api/blob/main/apis/v1alpha1/baselineadminnetworkpolicy_types.go#L29 as an example.
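For illustration, the effect of such a name constraint can be expressed in plain Go as below. The comment shows what a kubebuilder-style marker might look like, modeled on the linked file; the exact marker syntax should be treated as an assumption and checked against the controller-gen documentation. `validateSingletonName` is a hypothetical helper that only mirrors the rule, since in a real CRD the API server would enforce it.

```go
package main

import "fmt"

// Marker form (assumed syntax, modeled on the linked network-policy-api file):
//
//	// +kubebuilder:validation:XValidation:rule="self.metadata.name == 'default'",message="only one instance named \"default\" is allowed"
//	type PingMonitoringToolConfig struct { ... }
//
// Because object names are unique within their scope, allowing a single
// name means at most one instance can exist, with no webhook involved.

const singletonName = "default"

// validateSingletonName mirrors the rule above in plain Go.
func validateSingletonName(name string) error {
	if name != singletonName {
		return fmt.Errorf("invalid name %q: the only allowed name is %q", name, singletonName)
	}
	return nil
}

func main() {
	for _, name := range []string{"default", "my-config"} {
		if err := validateSingletonName(name); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", name)
	}
}
```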

We introduce a new feature to measure inter-Node latency in a K8s cluster running Antrea. The feature is currently Alpha and is gated by the NodeLatencyMonitor FeatureGate. In addition to the FeatureGate, enablement of the feature is controlled by a new CRD, called NodeLatencyMonitor. This CRD supports at most one CR instance, which must be named "default". When the CR exists, Antrea Agents will start "pinging" each other to take latency measurements. Each Agent only stores the latest measured value (at least at the moment); we do not store time series data. We support both IPv4 and IPv6. When an overlay is used by Antrea, the ping is sent over the tunnel (by using the gateway IP as the destination).

This change does not add any functionality besides collecting latency data at each Agent. A follow-up change will take care of reporting the latency data to the Antrea Controller, so it can be consumed via an APIService.

For #5514

Signed-off-by: IRONICBo <boironic@gmail.com>
Signed-off-by: Asklv <boironic@gmail.com>


github-actions · 2024-07-01T00:04:55Z

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

Implement REST server for NodeLatencyStats in v1alpha1.stats.antrea.io

With this change the feature is now usable. `kubectl get nodelatencystats` will display the latest latency information.

For #5514

Signed-off-by: Asklv <boironic@gmail.com>
Signed-off-by: Antonin Bas <antonin.bas@broadcom.com>
Co-authored-by: Antonin Bas <antonin.bas@broadcom.com>

antoninbas · 2024-08-06T22:24:51Z

With the addition of the NodeLatencyMonitor feature in Antrea v2.1 (thanks @IRONICBo!), I will now close this issue. Additional capabilities for this feature can be added over time. An issue has been created for the addition of a latency visualization dashboard in the Antrea UI: antrea-io/antrea-ui#455

zm1990s added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 20, 2023

antoninbas assigned tushartathgur Sep 28, 2023

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 28, 2023

antoninbas removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 3, 2024

antoninbas unassigned tushartathgur Jan 23, 2024

antoninbas added the lfx-mentorship Issues which have been proposed for the LFX Mentorship program label Jan 23, 2024

ImMdsahil added a commit to ImMdsahil/antrea that referenced this issue Feb 12, 2024

Task: LFX task-1 for antrea-io#5514

6b03612

Signed-off-by: Md Sahil <contact.mdsahil@gmail.com>

IRONICBo mentioned this issue Mar 19, 2024

Support simple ping mesh in agent. #6120

Merged

antoninbas pushed a commit that referenced this issue Jun 18, 2024

Define API for Agents to report Node latency stats (#6392)

25899c3

Follow up to #6120 See #5514 Signed-off-by: Asklv <boironic@gmail.com>

github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 1, 2024

antoninbas removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 22, 2024

This was referenced Jul 25, 2024

Add e2e tests for the NodeLatencyMonitor feature #6549

Closed

Add documentation for NodeLatencyMonitor #6551

Closed

antoninbas closed this as completed Aug 6, 2024
