Support NetworkPolicy statistics #1172

tnqn · 2020-08-28T02:49:53Z

This PR supports collecting and querying the NetworkPolicy statistics for
K8s NetworkPolicy and Antrea policies. It does the following:

- Introduce a feature gate called "NetworkPolicyStats" to manage its enablement.
- Introduce a structure called (stats.)Collector that collects stats from the Openflow client, calculates the delta compared with the last reported stats, and reports it to the antrea-controller via the controlplane NodeStatsSummary API.
- Introduce a structure called (stats.)Aggregator that collects the stats from the antrea-agents, aggregates them, caches the result, and provides interfaces for the Stats API handlers to query them.
- Aggregate the Stats API group to the Kubernetes API.
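
The delta-and-aggregate flow described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `TrafficStats` type, the `Delta`, `Collect` and `Get` method names, and the use of plain maps keyed by policy name are all assumptions made for the example, and the real Collector and Aggregator go through the Openflow client and the NodeStatsSummary API rather than direct function calls.

```go
package main

import "fmt"

// TrafficStats is a hypothetical stand-in for the per-policy counters.
type TrafficStats struct {
	Sessions, Packets, Bytes int64
}

// Collector (agent side) remembers the counters it last reported.
type Collector struct {
	lastReported map[string]TrafficStats
}

func NewCollector() *Collector {
	return &Collector{lastReported: map[string]TrafficStats{}}
}

// Delta returns, per policy, the increase since the previous call and
// records a copy of the current counters as the new baseline.
func (c *Collector) Delta(current map[string]TrafficStats) map[string]TrafficStats {
	out := map[string]TrafficStats{}
	baseline := make(map[string]TrafficStats, len(current))
	for policy, cur := range current {
		last := c.lastReported[policy] // zero value if never reported
		d := TrafficStats{
			Sessions: cur.Sessions - last.Sessions,
			Packets:  cur.Packets - last.Packets,
			Bytes:    cur.Bytes - last.Bytes,
		}
		if d != (TrafficStats{}) {
			out[policy] = d
		}
		baseline[policy] = cur
	}
	c.lastReported = baseline
	return out
}

// Aggregator (controller side) sums the deltas received from all agents.
type Aggregator struct {
	totals map[string]TrafficStats
}

func NewAggregator() *Aggregator {
	return &Aggregator{totals: map[string]TrafficStats{}}
}

// Collect adds one agent's reported deltas to the cached totals.
func (a *Aggregator) Collect(summary map[string]TrafficStats) {
	for policy, d := range summary {
		t := a.totals[policy]
		t.Sessions += d.Sessions
		t.Packets += d.Packets
		t.Bytes += d.Bytes
		a.totals[policy] = t
	}
}

// Get returns the cached total for a policy, as an API handler might.
func (a *Aggregator) Get(policy string) TrafficStats {
	return a.totals[policy]
}

func main() {
	collector := NewCollector()
	aggregator := NewAggregator()
	const policy = "default/test-network-policy"

	// First poll: everything counted so far is new.
	aggregator.Collect(collector.Delta(map[string]TrafficStats{
		policy: {Sessions: 1, Packets: 10, Bytes: 1500},
	}))
	// Second poll: only the increase since the first poll is reported.
	aggregator.Collect(collector.Delta(map[string]TrafficStats{
		policy: {Sessions: 3, Packets: 36, Bytes: 5199},
	}))

	fmt.Printf("%+v\n", aggregator.Get(policy)) // {Sessions:3 Packets:36 Bytes:5199}
}
```

Because only deltas travel from agent to controller in this sketch, the controller-side total is correct only as long as neither side loses its in-memory state.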

The stats can be queried via kubectl get networkpolicystats and kubectl get clusternetworkpolicystats, for example:

# kubectl get networkpolicystats -A
NAMESPACE     NAME                  SESSIONS   PACKETS   BYTES   CREATED AT
default       test-network-policy   3          36        5199    2020-09-07T13:19:38Z
kube-system   test-netpol-1         0          0         0       2020-09-07T13:22:42Z

# kubectl get clusternetworkpolicystats -A
NAME       SESSIONS   PACKETS   BYTES   CREATED AT
test-cnp   3          3         222     2020-09-07T11:38:40Z

Closes #985

Depends on #1140 and #1221

antrea-bot · 2020-08-28T02:50:10Z

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

/test-e2e: to trigger e2e tests.
/skip-e2e: to skip e2e tests.
/test-conformance: to trigger conformance tests.
/skip-conformance: to skip conformance tests.
/test-whole-conformance: to trigger all conformance tests on linux.
/skip-whole-conformance: to skip all conformance tests on linux.
/test-networkpolicy: to trigger networkpolicy tests.
/skip-networkpolicy: to skip networkpolicy tests.
/test-windows-conformance: to trigger windows conformance tests.
/skip-windows-conformance: to skip windows conformance tests.
/test-windows-networkpolicy: to trigger windows networkpolicy tests.
/skip-windows-networkpolicy: to skip windows networkpolicy tests.
/test-hw-offload: to trigger ovs hardware offload test.
/skip-hw-offload: to skip ovs hardware offload test.
/test-all: to trigger all tests (except whole conformance).
/skip-all: to skip all tests (except whole conformance).

codecov-commenter · 2020-08-31T12:42:03Z

Codecov Report

Merging #1172 into master will decrease coverage by 0.02%.
The diff coverage is 56.34%.

@@            Coverage Diff             @@
##           master    #1172      +/-   ##
==========================================
- Coverage   54.40%   54.37%   -0.03%     
==========================================
  Files         115      119       +4     
  Lines       10821    11213     +392     
==========================================
+ Hits         5887     6097     +210     
- Misses       4363     4527     +164     
- Partials      571      589      +18

Flag	Coverage Δ
#integration-tests	`44.91% <50.00%> (-0.08%)`	⬇️
#unit-tests	`41.97% <56.42%> (+0.50%)`	⬆️

Impacted Files	Coverage Δ
pkg/apis/controlplane/register.go	`0.00% <0.00%> (ø)`
pkg/apis/stats/register.go	`0.00% <0.00%> (ø)`
pkg/apiserver/certificate/cacert_controller.go	`12.76% <ø> (ø)`
pkg/features/antrea_features.go	`100.00% <ø> (ø)`
...piserver/registry/stats/networkpolicystats/rest.go	`32.55% <32.55%> (ø)`
...stry/stats/antreaclusternetworkpolicystats/rest.go	`37.20% <37.20%> (ø)`
...er/registry/stats/antreanetworkpolicystats/rest.go	`38.29% <38.29%> (ø)`
pkg/agent/stats/collector.go	`51.72% <51.72%> (ø)`
pkg/controller/stats/aggregator.go	`74.43% <74.43%> (ø)`
pkg/apis/controlplane/v1beta1/register.go	`92.85% <100.00%> (+0.54%)`	⬆️
... and 10 more

tnqn · 2020-09-02T14:57:38Z

/test-all

tnqn · 2020-09-24T11:48:05Z

/test-all

tnqn · 2020-09-24T13:33:50Z

@antoninbas @jianjuns @abhiraut Sorry for changing the API name from metrics to stats at the last minute. I changed it because most similar functions, like ethtool --statistics, conntrack -S, ip -stats link and the NSX DFW stats API, call it stats/statistics rather than metrics, and "metrics" might be confused with Prometheus metrics and resource usage like CPU and memory. Please let me know if you have any concerns about the renaming. I have kept the original patch that uses "metrics", so it is no problem to change it back.

lzhecheng · 2020-09-24T13:46:05Z

/test-e2e

lzhecheng · 2020-09-24T13:52:02Z

/test-networkpolicy
/test-conformance
/test-windows-conformance
/test-windows-networkpolicy

tnqn · 2020-09-24T15:15:16Z

/test-e2e

antoninbas

LGTM

antoninbas · 2020-09-24T16:55:01Z

pkg/agent/stats/collector.go

+	// TODO: The following process is not atomic, there's a chance that the ofID is released and reused by another
+	//  NetworkPolicy rule in-between, leading to incorrect metrics. We should return relevant NetworkPolicy references
+	//  along with metrics to avoid it.
+	ruleStatsMap := m.ofClient.NetworkPolicyMetrics()


should this be renamed from metrics to stats for consistency?

I think we should, but since it doesn't affect the user-facing API and the method was not added by this PR, I plan to change it along with addressing the TODOs and ensuring its efficiency. Does that make sense?

Sounds good

jianjuns · 2020-09-24T17:16:32Z

metrics -> stats sounds good to me.

For controller/agent restart, I feel we should at least handle controller restarts, as that will lose all previous stats, which can be discovered from agents.
For agent restart, is it possible for the controller to keep a copy of agent stats, so it can compute the diff with the new stats from the agent? Another option is to let the agent report all-0 diffs; it might be better than reporting the current counters, assuming the agent does not stop for long?

But probably let us consider how to handle these in the next release.

tnqn · 2020-09-24T17:34:32Z

> For controller/agent restart, I feel we should at least handle controller restarts, as that will lose all previous stats, which can be discovered from agents.
> For agent restart, is it possible for the controller to keep a copy of agent stats, so it can compute the diff with the new stats from the agent? Another option is to let the agent report all-0 diffs; it might be better than reporting the current counters, assuming the agent does not stop for long?

Even if controller keeps a copy of agent stats, it cannot handle a scenario like agent restart -> controller restart. For example:
Time 1: agent A stats: packets=100, agent B stats: packets=200, controller kept both of them, and had the sum packets=300.
Time 2: agent A restarts, its stats became: packets=10, agent B stats: packets=220, controller knew agent A had restarted, so added its previous stats it kept in-memory: now A.packets=110, B.packets=220, sum.packets=330
Time 3: controller restarts, agent A stats: packets=20, agent B stats: packets=240, now controller lost A's previous stats and the sum will decrease to 260.

Unless controller or agent persists the stats somewhere, the stats might be far from the actual value.
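
The three-step scenario above can be replayed with a small simulation. This is only a sketch under assumptions: the `Controller` type and its `Report` and `Sum` methods are hypothetical, the agents are assumed to report absolute packet counters, and the controller is assumed to detect an agent restart by seeing a counter go backwards (the comment above only says the controller "knew" about the restart, not how).

```go
package main

import "fmt"

// Controller is a hypothetical aggregator that caches each agent's last
// absolute packet counter, plus an offset accumulated across agent restarts.
type Controller struct {
	last   map[string]int64 // last absolute counter reported by each agent
	offset map[string]int64 // packets counted before each agent's restarts
}

func NewController() *Controller {
	return &Controller{last: map[string]int64{}, offset: map[string]int64{}}
}

// Report records an absolute counter from an agent. A counter lower than the
// cached one is treated as an agent restart (an assumption of this sketch),
// and the cached value is folded into the agent's offset.
func (c *Controller) Report(agent string, packets int64) {
	if packets < c.last[agent] {
		c.offset[agent] += c.last[agent]
	}
	c.last[agent] = packets
}

// Sum returns the total across agents, including pre-restart packets.
func (c *Controller) Sum() int64 {
	var sum int64
	for agent, packets := range c.last {
		sum += c.offset[agent] + packets
	}
	return sum
}

func main() {
	c := NewController()

	// Time 1: both agents report, the controller caches both.
	c.Report("A", 100)
	c.Report("B", 200)
	fmt.Println("time 1:", c.Sum()) // 300

	// Time 2: agent A restarted, its counter dropped to 10.
	c.Report("A", 10)
	c.Report("B", 220)
	fmt.Println("time 2:", c.Sum()) // 330

	// Time 3: the controller restarts and loses its in-memory cache,
	// including the 100 packets agent A counted before its restart.
	c = NewController()
	c.Report("A", 20)
	c.Report("B", 240)
	fmt.Println("time 3:", c.Sum()) // 260
}
```

The sum drops from 330 to 260 at time 3 because agent A's pre-restart packets lived only in the controller's memory.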

> But probably let us consider how to handle these in the next release.

Sure, one solution I have thought of is to persist the agent's stats in the local run dir periodically and before it exits on a kill signal, then reload them on start.

jianjuns · 2020-09-24T17:40:03Z

For Agent restart (not OVS restart), Agent should report the current counters, and Controller should know about the Agent restart to compute the diff based on the cached counters.
For Controller restart, again Agent should report the current counters (not diff), and Controller can rebuild the cache.

Anything wrong in my assumptions?

tnqn · 2020-09-24T17:46:29Z

@jianjuns in Antrea's case, the agent and OVS are always restarted together unless one of them is killed by a liveness probe. Even if only the agent restarts and OVS doesn't, the agent will flush all flows on restart and even on reconnection to agent. Do you mean reading stats from stale flows before flushing them?

jianjuns · 2020-09-24T17:54:44Z

Ok. I got what you mean.
First, can we conclude that we can handle Controller restart (provided the Agent reports the current counters rather than the diff)?

Agent and OVS do not always restart together. But if we assume Agent will always flush all flows, then sure we can make it simpler, and just report the current counters as diff.

pkg/agent/stats/collector.go

abhiraut

LGTM. Maybe resolve nits in a follow-up, so we don't have merge conflicts on this PR.

antoninbas

documentation comments

docs/feature-gates.md

tnqn · 2020-09-25T01:43:14Z

@antoninbas @jianjuns @abhiraut thanks for review. I have addressed all comments.

/test-all

antoninbas

LGTM

I think the points raised by @jianjuns can be discussed after the release and addressed in a follow-up PR if needed.

tnqn · 2020-09-25T01:50:05Z

@antoninbas sorry, corrected another word in feature-gate, metrics->statistics, could you re-approve?

tnqn · 2020-09-25T01:50:28Z

/test-all

vmwclabot added the cla-not-required label Aug 28, 2020

tnqn marked this pull request as draft August 28, 2020 02:50

tnqn force-pushed the policy-metrics branch 2 times, most recently from 8bbb806 to b2023b0 August 31, 2020 12:38

tnqn force-pushed the policy-metrics branch 20 times, most recently from 9cb99bd to 2647d81 September 2, 2020 14:38

tnqn marked this pull request as ready for review September 2, 2020 14:47

tnqn changed the title ~~WIP: Support NetworkPolicy metrics~~ Support NetworkPolicy metrics Sep 2, 2020

tnqn mentioned this pull request Sep 2, 2020

Support NetworkPolicy statistics #985

Closed

3 tasks

antoninbas previously approved these changes Sep 24, 2020

antoninbas reviewed Sep 24, 2020

abhiraut previously approved these changes Sep 24, 2020

antoninbas reviewed Sep 24, 2020

tnqn dismissed stale reviews from abhiraut and antoninbas via 7172387 September 25, 2020 01:40

tnqn force-pushed the policy-metrics branch from 5bc1874 to 7172387 September 25, 2020 01:40

antoninbas previously approved these changes Sep 25, 2020

tnqn dismissed antoninbas’s stale review via fa5eada September 25, 2020 01:48

tnqn force-pushed the policy-metrics branch from 7172387 to fa5eada September 25, 2020 01:48

antoninbas approved these changes Sep 25, 2020

abhiraut approved these changes Sep 25, 2020

tnqn merged commit f40454c into antrea-io:master Sep 25, 2020

tnqn mentioned this pull request Sep 28, 2020

e2e tests on Kind are frequently failing because of TestNetworkPolicyStats #1309

Closed

tnqn mentioned this pull request Oct 20, 2020

E2E test namespaces are failing to be deleted after upgrading to v0.10 #1316

Closed

ceclinux mentioned this pull request Feb 2, 2021

Rule based networkpolicystats #1780

Merged
