Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support NetworkPolicy statistics #1172

Merged
merged 1 commit into from
Sep 25, 2020
Merged

Conversation

tnqn
Copy link
Member

@tnqn tnqn commented Aug 28, 2020

This PR supports collecting and querying the NetworkPolicy statistics for
K8s NetworkPolicy and Antrea policies. It does the following:

  • Introduce a feature gate called "NetworkPolicyStats" to manage its enablement.

  • Introduce a structure called (stats.)Collector that collects stats from the Openflow client, calculates the delta compared with the last reported stats, and reports it to the antrea-controller via the controlplane NodeStatsSummary API.

  • Introduce a structure called (stats.)Aggregator that collects the stats from the antrea-agents, aggregates them, caches the result, and provides interfaces for the Stats API handlers to query them.

  • Aggregate the Stats API group to the Kubernetes API.

The stats can be queried via kubectl get networkpolicystats and kubectl get clusternetworkpolicystats, for example:

# kubectl get networkpolicystats -A
NAMESPACE     NAME                  SESSIONS   PACKETS   BYTES   CREATED AT
default       test-network-policy   3          36        5199    2020-09-07T13:19:38Z
kube-system   test-netpol-1         0          0         0       2020-09-07T13:22:42Z

# kubectl get clusternetworkpolicystats -A
NAME       SESSIONS   PACKETS   BYTES   CREATED AT
test-cnp   3          3         222     2020-09-07T11:38:40Z

Closes #985

Depends on #1140 and #1221

@antrea-bot
Copy link
Collaborator

Thanks for your PR.
Unit tests and code linters are run automatically every time the PR is updated.
E2e, conformance and network policy tests can only be triggered by a member of the vmware-tanzu organization. Regular contributors to the project should join the org.

The following commands are available:

  • /test-e2e: to trigger e2e tests.
  • /skip-e2e: to skip e2e tests.
  • /test-conformance: to trigger conformance tests.
  • /skip-conformance: to skip conformance tests.
  • /test-whole-conformance: to trigger all conformance tests on linux.
  • /skip-whole-conformance: to skip all conformance tests on linux.
  • /test-networkpolicy: to trigger networkpolicy tests.
  • /skip-networkpolicy: to skip networkpolicy tests.
  • /test-windows-conformance: to trigger windows conformance tests.
  • /skip-windows-conformance: to skip windows conformance tests.
  • /test-windows-networkpolicy: to trigger windows networkpolicy tests.
  • /skip-windows-networkpolicy: to skip windows networkpolicy tests.
  • /test-hw-offload: to trigger ovs hardware offload test.
  • /skip-hw-offload: to skip ovs hardware offload test.
  • /test-all: to trigger all tests (except whole conformance).
  • /skip-all: to skip all tests (except whole conformance).

@tnqn tnqn marked this pull request as draft August 28, 2020 02:50
@tnqn tnqn force-pushed the policy-metrics branch 2 times, most recently from 8bbb806 to b2023b0 Compare August 31, 2020 12:38
@codecov-commenter
Copy link

codecov-commenter commented Aug 31, 2020

Codecov Report

Merging #1172 into master will decrease coverage by 0.02%.
The diff coverage is 56.34%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1172      +/-   ##
==========================================
- Coverage   54.40%   54.37%   -0.03%     
==========================================
  Files         115      119       +4     
  Lines       10821    11213     +392     
==========================================
+ Hits         5887     6097     +210     
- Misses       4363     4527     +164     
- Partials      571      589      +18     
Flag Coverage Δ
#integration-tests 44.91% <50.00%> (-0.08%) ⬇️
#unit-tests 41.97% <56.42%> (+0.50%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/apis/controlplane/register.go 0.00% <0.00%> (ø)
pkg/apis/stats/register.go 0.00% <0.00%> (ø)
pkg/apiserver/certificate/cacert_controller.go 12.76% <ø> (ø)
pkg/features/antrea_features.go 100.00% <ø> (ø)
...piserver/registry/stats/networkpolicystats/rest.go 32.55% <32.55%> (ø)
...stry/stats/antreaclusternetworkpolicystats/rest.go 37.20% <37.20%> (ø)
...er/registry/stats/antreanetworkpolicystats/rest.go 38.29% <38.29%> (ø)
pkg/agent/stats/collector.go 51.72% <51.72%> (ø)
pkg/controller/stats/aggregator.go 74.43% <74.43%> (ø)
pkg/apis/controlplane/v1beta1/register.go 92.85% <100.00%> (+0.54%) ⬆️
... and 10 more

@tnqn tnqn force-pushed the policy-metrics branch 20 times, most recently from 9cb99bd to 2647d81 Compare September 2, 2020 14:38
@tnqn tnqn marked this pull request as ready for review September 2, 2020 14:47
@tnqn tnqn changed the title WIP: Support NetworkPolicy metrics Support NetworkPolicy metrics Sep 2, 2020
@tnqn
Copy link
Member Author

tnqn commented Sep 2, 2020

/test-all

@tnqn tnqn mentioned this pull request Sep 2, 2020
3 tasks
@tnqn
Copy link
Member Author

tnqn commented Sep 24, 2020

/test-all

@tnqn
Copy link
Member Author

tnqn commented Sep 24, 2020

@antoninbas @jianjuns @abhiraut Sorry for changing the API name from metrics to stats in last minute. I changed it because I see most similar functions like ethtool --statistics, conntrack -S, ip -stats link and NSX DFW stats API often call it stats/statistics instead of metrics, and metrics might be confused with prometheus metrics and resource usage like CPU and memory. Please let me know if you have concern on the renaming, I have kept the original patch that uses "metrics" so no problem to change it back.

@lzhecheng
Copy link
Contributor

/test-e2e

@lzhecheng
Copy link
Contributor

/test-networkpolicy
/test-conformance
/test-windows-conformance
/test-windows-networkpolicy

@tnqn
Copy link
Member Author

tnqn commented Sep 24, 2020

/test-e2e

antoninbas
antoninbas previously approved these changes Sep 24, 2020
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

// TODO: The following process is not atomic, there's a chance that the ofID is released and reused by another
// NetworkPolicy rule in-between, leading to incorrect metrics. We should return relevant NetworkPolicy references
// along with metrics to avoid it.
ruleStatsMap := m.ofClient.NetworkPolicyMetrics()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be renamed from metrics to stats for consistency?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should, but since it doesn't affect user facing API and the method is not added by this PR, I plan to change it along with addressing the TODOs and ensuring its efficiency. Does it make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good

@jianjuns
Copy link
Contributor

metrics -> stats sounds good to me.

For controller/agent restart, I feel we should at least handle controller restarts, as that will lose all previous stats, which can be discovered from agents.
For agent restart, is it possible for controller to keep a copy of agent stats, so it can compute the diff with the new stats from agent? Another option is to let agent report all 0 diffs - it might be better than report the current counters assuming agent does not stop for longt?

But probably let us consider how to handle these in the next release.

@tnqn
Copy link
Member Author

tnqn commented Sep 24, 2020

For controller/agent restart, I feel we should at least handle controller restarts, as that will lose all previous stats, which can be discovered from agents.
For agent restart, is it possible for controller to keep a copy of agent stats, so it can compute the diff with the new stats from agent? Another option is to let agent report all 0 diffs - it might be better than report the current counters assuming agent does not stop for longt?

Even if controller keeps a copy of agent stats, it cannot handle a scenario like agent restart -> controller restart. For example:
Time 1: agent A stats: packets=100, agent B stats: packets=200, controller kept both of them, and had the sum packets=300.
Time 2: agent A restarts, its stats became: packets=10, agent B stats: packets=220, controller knew agent A had restarted, so added its previous stats it kept in-memory: now A.packets=110, B.packets=220, sum.packets=330
Time 3: controller restarts, agent A stats: packets=20, agent B stats: packets=240, now controller lost A's previous stats and the sum will decrease to 260.

Unless controller or agent persists the stats somewhere, the stats might be far from the actual value.

But probably let us consider how to handle these in the next release.

Sure, one solution I have thought is to persist agent's stats in local run dir periodically and before it receives a kill signal, then reload it on start.

@jianjuns
Copy link
Contributor

For Agent restart (not OVS restart), Agent should report the current counters, and Controller should know about the Agent restart to compute the diff based on the cached counters.
For Controller restart, again Agent should report the current counters (not diff), and Controller can rebuild the cache.

Anything wrong in my assumptions?

@tnqn
Copy link
Member Author

tnqn commented Sep 24, 2020

@jianjuns in antrea case, agent and ovs are always restarted together unless one of them is killed by liveness probe. Even only agent restarts and ovs doesn't, agent will flush all flows on restart and even on reconnection to agent. Do you mean reading stats from stale flows before flushing them?

@jianjuns
Copy link
Contributor

Ok. I got what you mean.
First can we conclude we can handle Controller restart (but Agent should report the current counters but not diff)?

Agent and OVS do not always restart together. But if we assume Agent will always flush all flows, then sure we can make it simpler, and just report the current counters as diff.

abhiraut
abhiraut previously approved these changes Sep 24, 2020
Copy link
Contributor

@abhiraut abhiraut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.. maybe resolve nits in a follow up .. so we don't have merge conflicts on this PR

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

documentation comments

docs/feature-gates.md Outdated Show resolved Hide resolved
docs/feature-gates.md Outdated Show resolved Hide resolved
docs/feature-gates.md Show resolved Hide resolved
docs/feature-gates.md Outdated Show resolved Hide resolved
docs/feature-gates.md Outdated Show resolved Hide resolved
@tnqn tnqn dismissed stale reviews from abhiraut and antoninbas via 7172387 September 25, 2020 01:40
@tnqn
Copy link
Member Author

tnqn commented Sep 25, 2020

@antoninbas @jianjuns @abhiraut thanks for review. I have addressed all comments.

/test-all

antoninbas
antoninbas previously approved these changes Sep 25, 2020
Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

I think the points raised by @jianjuns can de discussed after the release and addressed in a follow-up PR if needed.

This PR supports collecting and querying the NetworkPolicy statistics for
K8s NetworkPolicy and Antrea policies. It does the following:

- Introduce a feature gate called "NetworkPolicyStats" to manage its
  enablement.

- Introduce a structure called (stats.)Collector that collects stats
  from the Openflow client, calculates the delta compared with the last
  reported stats, and reports it to the antrea-controller via the
  controlplane NodeStatsSummary API.

- Introduce a structure called (stats.)Aggregator that collects the
  stats from the antrea-agents, aggregates them, caches the result,
  and provides interfaces for the Stats API handlers to query them.

- Aggregate the Stats API group to the Kubernetes API.
@tnqn
Copy link
Member Author

tnqn commented Sep 25, 2020

@antoninbas sorry, corrected another word in feature-gate, metrics->statistics, could you re-approve?

@tnqn
Copy link
Member Author

tnqn commented Sep 25, 2020

/test-all

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support NetworkPolicy statistics