Document the limitations of Audit Logging for policy rules #6225

Merged
72 changes: 70 additions & 2 deletions docs/antrea-network-policy.md
@@ -53,6 +53,8 @@
- [<em>kubectl</em> commands for Group](#kubectl-commands-for-group)
- [RBAC](#rbac)
- [Notes and constraints](#notes-and-constraints)
- [Limitations of Antrea policy logging](#limitations-of-antrea-policy-logging)
- [Logging prior to Antrea v1.13](#logging-prior-to-antrea-v113)
<!-- /toc -->

## Summary
@@ -767,13 +769,19 @@ be enforced in the order in which they are written.

**enableLogging** and **logLabel**: Antrea-native policy ingress or egress rules
can be audited by setting its logging fields. When the `enableLogging` field is set
to `true`, the first packet of any connection that matches this rule will be
to `true`, the first packet of any traffic flow that matches this rule will be
logged to a file (`/var/log/antrea/networkpolicy/np.log`) on the Node on which the
rule is enforced. The log files can then be used for further analysis. If `logLabel`
is provided, the label will be added in the log. For example, in the
[ACNP with log settings](#acnp-with-log-settings), traffic that hits the
"AllowFromFrontend" rule will be logged with log label "frontend-allowed".

The logging feature is best-effort, and as such there is no guarantee that all
the flows which match the policy rule will be logged. Additionally, we do not
recommend enabling policy logging for older Antrea versions (all versions prior
to v1.12, as well as v1.12.0 and v1.12.1). See this [section](#limitations-of-antrea-policy-logging)
for more information.

For drop and reject rules, deduplication is applied to reduce duplicated
log messages, and the duplication buffer length is set to 1 second. When a rule
does not have a name, an identifiable name will be generated for the rule and
@@ -797,7 +805,7 @@ The rules are logged in the following format:
Kubernetes NetworkPolicies can also be audited using Antrea logging to the same file
(`/var/log/antrea/networkpolicy/np.log`). Add Annotation
`networkpolicy.antrea.io/enable-logging: "true"` on a Namespace to enable logging
for all NetworkPolicies in the Namespace. Packets of any connection that match
for all NetworkPolicies in the Namespace. Packets of any network flow that match
a NetworkPolicy rule will be logged with a reference to the NetworkPolicy name,
but packets dropped by the implicit "default drop" (not allowed by any NetworkPolicy)
will only be logged with consistent name `K8sNetworkPolicy` for reference. When
@@ -1843,3 +1851,63 @@ Similar RBAC is applied to the ClusterGroup resource.
This is due to kube-proxy performing SNAT, which conceals the original source IP from
Antrea. Consequently, NetworkPolicies are unable to differentiate between hairpin
Service traffic and external traffic in this scenario.

### Limitations of Antrea policy logging

Antrea policy logging is enabled by setting `enableLogging` to true for specific
policy rules (or by using the `networkpolicy.antrea.io/enable-logging: "true"`
annotation for K8s NetworkPolicies). Starting with Antrea v1.13, logging is
"best-effort": if too much traffic needs to be logged, we will skip logging
rather than drop packets or risk overwhelming the Antrea Agent, which could
impact cluster health or other workloads. This behavior
cannot be changed, and the logging feature is therefore not meant to be used for
compliance purposes. By default, the Antrea datapath will send up to 500 packets
per second (with a burst size of 1000 packets) to the Agent for logging. This
rate applies to all the traffic that needs to be logged, and is enforced at the
level of each Node. A rate of 500 packets per second roughly translates to 500
new TCP connections per second, or 500 UDP requests per second. While it is
possible to adjust the rate and burst size by modifying the `packetInRate`
parameter in the antrea-agent configuration, we do not recommend doing so. The
default value was set to 500 after careful consideration.
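
As a hedged illustration of the two settings mentioned above, the first fragment below annotates a Namespace so that its K8s NetworkPolicies are logged, and the second shows the `packetInRate` parameter left at the default of 500 described in this section. The Namespace name `demo-ns` is hypothetical, and the placement of `packetInRate` as a top-level key in the antrea-agent configuration is an assumption that should be checked against the configuration shipped with your Antrea version.

```yaml
# Sketch: enable logging for all K8s NetworkPolicies in one Namespace.
# "demo-ns" is a hypothetical name used only for illustration.
apiVersion: v1
kind: Namespace
metadata:
  name: demo-ns
  annotations:
    networkpolicy.antrea.io/enable-logging: "true"
```

```yaml
# Sketch of an antrea-agent configuration fragment (key placement is assumed).
# 500 packets per second is the default stated above; changing it is not recommended.
packetInRate: 500
```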

#### Logging prior to Antrea v1.13

Prior to Antrea v1.13, policy logging was not best-effort. While we did have a
rate limit for the number of packets that could be sent to the Agent for
logging, the datapath behavior was to drop all packets that exceeded the rate
limit, as opposed to skipping the logging and applying the specified policy rule
action. This meant that the logging feature was better suited to audit and
compliance applications. However, we ultimately decided that the behavior was
too aggressive and that it was too easy to disrupt application workloads by
enabling logging. The rate limit was also lower than the default one we use
today (100 packets per second instead of 500). For example, the following policy,
which allows ingress DNS traffic for CoreDNS Pods and has logging enabled,
would drastically restrict the number of possible DNS requests in the cluster,
which in turn would cause many errors in applications that rely on DNS:

```yaml
apiVersion: crd.antrea.io/v1beta1
kind: ClusterNetworkPolicy
metadata:
  name: allow-core-dns-access
spec:
  priority: 5
  tier: securityops
  appliedTo:
    - podSelector: {}
  ingress:
    - name: allow-dns
      enableLogging: true
      action: Allow
      ports:
        - protocol: TCP
          port: 53
        - protocol: UDP
          port: 53
```

Contributor

Would it be more helpful to add a link to ACNP with log settings example here for comparison? And maybe change that previous example to a more suitable example to log (Drop), as opposed to this example?

Contributor Author

Sorry I am having a hard time understanding what you are suggesting. What difference are we trying to highlight by comparing with the acnp-with-log-setting example? And which example would you want to update to use the Drop action (and why)?

Contributor

I assumed that acnp-with-log-setting example does not suffer much from disrupting application workloads, rather than this allow-dns example, so I was referring to this difference.

I was wondering if we need to change acnp-with-log-setting to Drop as mentioned below especially when the policy rule uses the Allow action. (Quick question is it no longer the case that only the first packet of an Allow connection will be logged?)

Contributor Author

> I assumed that acnp-with-log-setting example does not suffer much from disrupting application workloads, rather than this allow-dns example, so I was referring to this difference.

This policy applies to traffic from the application frontend to the DB layer. If it is a large scale application with a high number of connections, it could suffer from the same issue. In practice, for this specific case, the number of connections between the frontend and the DB is likely to stay "small", as the application is likely to use a connection pool instead of creating a new connection for each user session. But that really depends on the type of application.

The difference between these policies is more about the workloads to which they apply. I don't think there is a difference in how they are implemented. So they can both suffer from this issue in the same way.

In practice, users are more likely to experience this issue with CoreDNS because: 1) it seems common to enable logging for DNS requests, 2) even medium clusters usually have a large volume of DNS requests.

So I may still be missing your point.

> I was wondering if we need to change acnp-with-log-setting to Drop as mentioned below especially when the policy rule uses the Allow action. (Quick question is it no longer the case that only the first packet of an Allow connection will be logged?)

With recent versions of Antrea (since v1.13), logging is fine with the Allow action, so why should we change the example? "especially when the policy rule uses the Allow action" is because the side effect of enabling logging with older Antrea versions is that flows are dropped after a certain rate limit. If the action is Allow, this is clearly a bigger deal than if the action is Drop anyway.

We only log the first packet of an Allow connection. That has not changed.

Contributor

Thanks for the detailed explanations! Now I understand, no need for difference, and "If the action is Allow, this is clearly a bigger deal than if the action is Drop anyway" makes total sense.

**For this reason, we do NOT recommend enabling logging for Antrea versions
prior to v1.13**, especially when the policy rule uses the `Allow` action.

Note that v1.12 patch versions starting with v1.12.2 also do not suffer from
this issue, as we backported the fix to the v1.12 release.

Contributor

Just a personal feeling, do we usually include this backport detail in a readme?

Contributor Author

I think it is helpful as users are more likely to find this information here, rather than look at the changelogs. I would be comfortable removing it after a few releases, but some users are running Antrea minor versions for a while, even after we stop maintaining them here.

Contributor

Understood, thanks!