Support NetworkPolicy statistics #985
Comments
Question - for controller API to expose aggregated policy stats, do we need to follow some format to facilitate Prometheus consumption too? |
I don't know whether Prometheus is suitable for collecting metrics for dynamically created/destroyed resources, and I didn't see the Kubernetes metrics API take Prometheus into consideration: https://github.com/kubernetes/metrics#apis. @ksamoray do you know of a similar case among Prometheus-integrated apps, or whether it supports metrics for frequently created/destroyed resources? If we really want to expose the data to Prometheus, I assume we need to expose the data via the "/metrics" API instead, and follow the Prometheus format. In that case, do you think we should have two APIs, or let users and other monitoring solutions access the "/metrics" API? |
Will these statistics support denied/not allowed sessions, packets, and bytes too? |
@tnqn I haven't really run into such a use case, where Prometheus manages metrics with a short lifespan. I can look around and conduct a couple of experiments. I believe that Prometheus will store these metrics even after they're no longer reported at /metrics. |
For Antrea-specific NetworkPolicies that have a DROP action, the sessions, packets, and bytes will be for denied traffic, though sessions and packets should be the same in this case. |
Thanks @ksamoray for your quick answer. The cleanup of stale items is also my concern. Good to know about this parser approach, which allows Antrea to maintain a single API even if users want the data in their Prometheus server. Is it an official Prometheus mechanism or a 3rd-party tool? |
Regarding NetworkPolicy metrics on the Prometheus server: should deleted NetworkPolicies really be considered stale? There may be a use case where a user wants to see NetworkPolicy metrics from, say, 12 hours ago for a duration of one hour. In this scenario, data for a deleted NetworkPolicy would still be useful. I presume the Prometheus server has a configurable time period for keeping metrics data in its DB. This is true for non-deleted metrics resources as well, right? Thanks. |
Any metric exposed on the Agent's /metrics endpoint can be defined as a Prometheus metric and will be scraped by the Prometheus server. The Prometheus server stores scraped metrics in its tsdb for a predefined retention period (15 days by default), even if the metric has been unregistered on the agent. |
@tnqn @jianjuns a couple questions:
Thx, Su |
Yes, per endpoint stats is useful in some cases, but maybe it can be next step, after we have cluster level stats? |
@suwang48404 the current plan is to have an internal API for agents to push metrics to the controller, and a public API exposed by the controller with aggregated data (see |
@jianjuns @antoninbas thanks for answering the questions. @suwang48404 it targets for the coming release 0.10. |
@antoninbas @jianjuns @antoninbas thx all for replying. |
Hi, |
@srikartati , thank you, that was very informative. |
I think antctl is not implemented with this feature. Could you explain why? @tnqn |
The API follows K8s style and the data can be retrieved via |
I think you were referring to aggregated NetworkPolicy stats for the Antrea cluster. Implementing something like |
Yes, I didn't plan node level stats, but it sounds like a good idea to me, maybe helpful for troubleshooting. |
Thank you for your prompt reply. |
Describe what you are trying to solve
This proposal is to collect and expose the statistical data of NetworkPolicy.
`antrea-controller` collects NetworkPolicy metrics from `antrea-agent`, aggregates the data, and exposes it through the Antrea metrics API.
Monitoring solutions and users can access the data via the metrics API. It can also be accessed with `antctl get metrics networkpolicy`, which makes it easier to view.
The metrics data includes the total number of `packets`, `bytes`, and `sessions` for a given NetworkPolicy. The metrics are collected asynchronously and periodically, so the data obtained from the metrics API is not real-time and may lag by up to the collection interval (configurable, 1 min by default).
Describe the solution you have in mind
Scalability consideration
Assuming we want to support 100,000 NetworkPolicies and 1,000 Nodes, 1,000 NetworkPolicies apply to each Node, and the metrics data is collected every minute, this means:
For collection:
There should be no performance issue when collecting and aggregating the data at the above scale.
For storage:
Kubernetes apiserver persists resources, including CRDs, in the etcd KV store. Persisting the data to Kubernetes would mean 100,000 / 60 ≈ 1,666 API writes per second on average (and still about 166 API writes per second when persisting only every 10 minutes), which may put considerable load on the apiserver and the storage. On the other hand, in-memory metrics data is only lost when the controller itself restarts, and monitoring solutions can persist the data themselves, so storing it in memory should be reasonable.
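The write-rate estimate above can be reproduced with a short sketch. The function name `writesPerSecond` is made up for illustration, and the calculation assumes one API write per NetworkPolicy per persistence interval, rounded down.

```go
package main

import "fmt"

// writesPerSecond returns the average number of API writes per second
// (rounded down) needed to persist one object per NetworkPolicy once
// every intervalSeconds. Hypothetical helper, for illustration only.
func writesPerSecond(policies, intervalSeconds int) int {
	return policies / intervalSeconds
}

func main() {
	const policies = 100000 // scale target assumed in this proposal

	// Persisting every minute.
	fmt.Println(writesPerSecond(policies, 60))
	// Persisting every 10 minutes.
	fmt.Println(writesPerSecond(policies, 600))
}
```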
Metrics collection by antrea-agent
`antrea-agent` is responsible for collecting metrics from OpenFlow stats.
In the NetworkPolicy implementation, each NetworkPolicy rule gets a unique conjunction ID, and `n_packets` and `n_bytes` are the packets and bytes hit by this rule.
However, the current flow stats only count the first packet of each session, because the conjunction match flows only match packets in one direction and the following packets are matched by a flow that allows all packets of established sessions.
One possible solution is to use `ct_nw_src`, `ct_nw_dst`, and `ct_tp_dst`, which match the source address, destination address, and destination port of the conntrack original-direction tuple.
But we also see a few drawbacks in this solution:
`ct_nw_src` requires Open vSwitch > 2.8 (some distros don't have it yet?).
@wenyingd proposed to persist the conjunction ID to the conntrack label, and to have dedicated metrics collection flows that match the conntrack label. The flows would be:
In the above example, `16202` is the sessions count, `16202 + 181551` is the packets count, and `1198948 + 27050226` is the bytes count.
Communication between antrea-agent and antrea-controller
Although antrea-agent now has its own API (for the Prometheus API and support bundle), it is difficult to enable server authentication for the agent API, as that would require certificate generation and distribution for each agent.
In this proposal, instead of antrea-controller pulling data from antrea-agents, antrea-agents push metrics data to an internal metrics collection API exposed by antrea-controller. In this way, the same authentication and authorization mechanism, and even the TCP connection for the internal NetworkPolicy API can be reused.
Each antrea-agent could restart, and its OpenFlow stats could be reset after that. If antrea-agent sent its cumulative stats to antrea-controller, aggregating them would not be easy, given that each agent could reset its portion individually.
In this proposal, antrea-agent is responsible for calculating the incremental stats and reporting them to antrea-controller. antrea-controller can then simply sum up the data from all agents.
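A minimal sketch of the incremental reporting idea follows. It is not the actual Antrea implementation: the `Stats` and `Aggregator` types and the `fromFlows` and `delta` helpers are hypothetical names. It assumes that each rule has two collection flows (one hit by the first packet of each session, one hit by the remaining packets), matching the counters quoted in the flow example earlier, and that a counter going backwards means the agent's flows were reset.

```go
package main

import "fmt"

// Stats holds the counters tracked per NetworkPolicy (hypothetical type).
type Stats struct {
	Sessions, Packets, Bytes uint64
}

// fromFlows derives totals from two assumed collection flows: one matching
// the first packet of each session, one matching all subsequent packets.
func fromFlows(firstPkts, firstBytes, restPkts, restBytes uint64) Stats {
	return Stats{
		Sessions: firstPkts,
		Packets:  firstPkts + restPkts,
		Bytes:    firstBytes + restBytes,
	}
}

// delta returns the increment since the previous reading. If any counter
// went backwards, the flows are assumed to have been reset (for example
// after an agent restart) and the current reading is reported as a whole.
func delta(prev, cur Stats) Stats {
	if cur.Sessions < prev.Sessions || cur.Packets < prev.Packets || cur.Bytes < prev.Bytes {
		return cur
	}
	return Stats{
		Sessions: cur.Sessions - prev.Sessions,
		Packets:  cur.Packets - prev.Packets,
		Bytes:    cur.Bytes - prev.Bytes,
	}
}

// Aggregator is the controller side: it only sums reported increments.
type Aggregator struct {
	totals map[string]Stats
}

func NewAggregator() *Aggregator {
	return &Aggregator{totals: make(map[string]Stats)}
}

// Add accumulates one increment reported by an agent for a policy.
func (a *Aggregator) Add(policy string, d Stats) {
	t := a.totals[policy]
	t.Sessions += d.Sessions
	t.Packets += d.Packets
	t.Bytes += d.Bytes
	a.totals[policy] = t
}

// Total returns the aggregated stats for a policy.
func (a *Aggregator) Total(policy string) Stats {
	return a.totals[policy]
}

func main() {
	agg := NewAggregator()

	// First reading on an agent, using the counters from the flow example.
	first := fromFlows(16202, 1198948, 181551, 27050226)
	agg.Add("policy-a", delta(Stats{}, first))

	// The agent restarts and its counters start again from small values.
	afterRestart := Stats{Sessions: 10, Packets: 25, Bytes: 3000}
	agg.Add("policy-a", delta(first, afterRestart))

	fmt.Printf("%+v\n", agg.Total("policy-a"))
}
```

Because the controller only ever adds increments, an agent restart does not cause the aggregated totals to drop; the cost is that traffic counted between the last report and the restart is lost.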
APIs
Collection API (internal)
The data struct representing the collected metrics is as below.
The API endpoints: `/stats/networkpolicy`.
Metrics API (public)
The metrics API must follow the K8s convention so that it can be registered as an APIService and accessed by `antctl get metrics` and `kubectl get`.
The API group is `metrics.antrea.tanzu.vmware.com` and the endpoint is `/apis/metrics.antrea.tanzu.vmware.com/v1alpha1/networkpolicies`.
Open questions:
Should metrics of all types (K8sNetworkPolicy, ClusterNetworkPolicy, AntreaNetworkPolicy) be exposed via a single endpoint or separate ones?
Describe how your solution impacts user flows
Users can access the NetworkPolicy metrics via the metrics API and `antctl get metrics networkpolicy`.
Describe the main design/architecture of your solution
Alternative solutions that you considered
Test plan
TBD
Additional context
PRs for this feature: