Unify TCP and UDP DNS interception flows #5392
Conversation
Force-pushed from 48051c3 to e7ff6aa
Realized I shouldn't completely remove the tcp_flags match, because we don't want SYN and FIN packets also being sent to the agent. Will add flags that look like:
I wouldn't assume that SYN / FIN packets cannot carry data. I believe that the server is allowed to set the FIN flag on the last data segment. And this RFC mentions TCP Fast Open for DNS, so I think the SYN + ACK from the server could also theoretically carry data.
Makes sense. Then I will keep it for review.
I think we should be adding an e2e test for a multi-packet TCP DNS response. It doesn't have to be as part of this PR, but we should add one before the 1.14 release.
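To make the suggested test concrete, here is a minimal sketch of the core check such an e2e test could perform, using the miekg/dns client to force the query over TCP so that a large answer has to span multiple segments. This is not Antrea's e2e framework code; the queried name and the kube-dns ClusterIP address are placeholder assumptions.

package main

import (
	"fmt"
	"log"

	"github.com/miekg/dns"
)

func main() {
	// Force DNS over TCP so that a large response (e.g. many TXT records)
	// has to be split across several TCP segments.
	c := &dns.Client{Net: "tcp"}
	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn("example.com"), dns.TypeTXT)

	// "10.96.0.10:53" is an assumed kube-dns ClusterIP, not a value from this PR.
	r, _, err := c.Exchange(m, "10.96.0.10:53")
	if err != nil {
		log.Fatalf("TCP DNS query failed: %v", err)
	}
	fmt.Printf("rcode=%d answers=%d size=%d bytes\n", r.Rcode, len(r.Answer), r.Len())
}

The assertion in a real test would presumably be that the query still succeeds, and that the FQDN rule sees the complete response even though it arrives in multiple packets.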
I will add @wenyingd for review as well
Force-pushed from 4c16e5e to 65cf562
Force-pushed from 65cf562 to 2e9a0e8
LGTM, but other people should review as well
// Meter Entry Rate. It is represented as number of events per second.
// Packets which exceed the rate will be dropped.
PacketInMeterRateNP = 100
PacketInMeterRateTF = 100
Any plan to make the meter rate configurable? A concern is that DNS response packets may be dropped by the meter if a burst of Pod traffic happens.
We should update #5358 when this is merged.
Note that based on some recent experiments I did, CPU usage increases very quickly when the rate is increased. There are potentially some improvements that can be made in the agent to process packets more efficiently.
Force-pushed from 88f8609 to a321c66
/test-all
/test-ipv6-e2e
LGTM overall
/test-all
PacketInMeterRateNP = 100
PacketInMeterRateTF = 100
PacketInMeterRateDNS = 100
I'm concerned that this could affect normal traffic. A node could run 110 Pods; if all of them use FQDN rules, each Pod is only allowed to get less than 1 DNS response per second. But even for in-cluster DNS UDP queries, it usually takes 3~5 rounds to get a valid result due to the search domains. If it's a DNS TCP query, it could also involve several responses. With the rate-limiting, I think it's very easy to hit the limit and cause applications to fail to resolve domains. Even the networkpolicy e2e test could generate tens of DNS responses per second according to my experience.
I wonder which part is more expensive: the packet-in itself, or the userspace code that processes the packet. If it's the latter, I wonder if we should use the previous approach that uses a channel to rate-limit the packets, and resume the transmission of the ones that fail to be queued.
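As an illustration of the channel-based approach described above, here is a minimal sketch under assumptions: the packetIn and interceptor types, the function names, and the queue capacity are hypothetical, and "resuming" the packet is only simulated with a log line rather than real datapath handling.

package main

import "fmt"

// packetIn is a stand-in for a packet punted to userspace (hypothetical type).
type packetIn struct {
	data []byte
}

// interceptor holds a bounded queue feeding the userspace DNS handler.
type interceptor struct {
	queue chan packetIn
}

// submit tries to hand the packet to the userspace handler without blocking.
// If the queue is full, the packet's transmission would be resumed in the
// datapath instead of being dropped (simulated here with a log line).
func (i *interceptor) submit(p packetIn) {
	select {
	case i.queue <- p:
		// Queued for FQDN policy processing.
	default:
		fmt.Println("queue full: resume packet transmission instead of dropping")
	}
}

func main() {
	// Capacity of 200 mirrors the queue size discussed later in this thread.
	i := &interceptor{queue: make(chan packetIn, 200)}
	for n := 0; n < 300; n++ {
		i.submit(packetIn{data: []byte{byte(n)}})
	}
	fmt.Printf("queued=%d\n", len(i.queue))
}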
Sorry, I didn't fully understand the channel idea. I know we already have a rate-limited channel for packet-in. What channel do you propose?
I have a similar concern.
I feel like we need more investigation and experimenting. Just getting the packets to the antrea-agent handling code is quite expensive in terms of CPU time, so I am hoping we can improve that.
We could use a larger rate. However, I am pretty sure that this patch has no effect on the maximum rate of DNS packets we can handle. The reason is that we also have a rate-limited queue in the antrea-agent process, and each PacketIn category is rate-limited to 100 pps:
antrea/pkg/agent/openflow/packetin.go, lines 113 to 118 in e04c95c:

func newFeatureStartPacketIn(category uint8, stopCh <-chan struct{}) *featureStartPacketIn {
	featurePacketIn := featureStartPacketIn{category: category, stopCh: stopCh}
	featurePacketIn.packetInQueue = openflow.NewPacketInQueue(PacketInQueueSize, rate.Limit(PacketInQueueRate))
	return &featurePacketIn
}
DNS responses use the PacketInCategoryDNS category.
The OVS meter should prevent high CPU usage in ovs-vswitchd and antrea-agent, but there should be no change in the effective rate of DNS responses we can process.
I suggest that after merging it, we revisit these parameters. We could have a higher OVS rate for DNS packets, or at least a higher tolerance for bursts. At the moment I think the OVS meter has a burst size of 200 packets, and the software queue also has capacity 200.
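For reference, here is a minimal, self-contained sketch of how a token bucket with a rate of 100 pps and a burst of 200 behaves when a burst of packet-ins arrives. This is not Antrea's actual PacketInQueue implementation (which queues packets rather than simply rejecting them); it only illustrates the rate and burst parameters discussed here.

package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	// 100 events per second, with a burst capacity of 200 tokens.
	limiter := rate.NewLimiter(rate.Limit(100), 200)

	accepted, rejected := 0, 0
	// Simulate 500 DNS response packet-ins arriving at effectively the same time.
	for i := 0; i < 500; i++ {
		if limiter.Allow() {
			accepted++
		} else {
			rejected++
		}
	}
	// Roughly the first 200 (the burst) are admitted immediately; the rest
	// would have to wait for tokens to refill at 100 per second.
	fmt.Printf("accepted=%d rejected=%d\n", accepted, rejected)
}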
It seems 100 or 200 pps is far from enough. I gathered some statistics when analyzing a pcap captured from a production cluster; even a single Pod's DNS responses could exceed 100 per second:
# tcpdump -r dns.pcap -n src port 53 | awk -F "." '{print $1}' | uniq -c |sort -r | head
reading from file dns.pcap, link-type EN10MB (Ethernet)
101 15:38:43
100 15:38:56
100 15:38:40
100 15:34:01
99 15:38:44
99 15:34:03
99 15:33:58
98 15:38:58
98 15:38:04
98 15:37:48
The above pcap has only UDP DNS traffic; the maximum value hovering around 100 makes me wonder whether it was already capped by the userspace rate limiting.
Given that the userspace rate-limiting was always there, let's merge this PR and work on increasing the limit as a follow-up. @GraysonWu could you also work on this? I think setting it to a default of 500pps for DNS packets and making the rate configurable would be a good place to start. I think you should also run some experiments on a cluster to see how much CPU is consumed at 500pps. Based on some previous experiments I did, 500pps can consume one full vCPU (ovs-vswitchd + antrea-agent). Maybe there is some room to improve the implementation. Of course, in the long term, we could always consider a different implementation than packet-in (e.g., similar to what we do with Suricata for IDS).
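A sketch of what the proposed follow-up could look like; the config field name, YAML tag, and default value below are assumptions for illustration only, not Antrea's actual configuration:

package config

// AgentConfig is a pared-down, hypothetical stand-in for the agent
// configuration; only the field relevant to this discussion is shown.
type AgentConfig struct {
	// DNSPacketInRate is the maximum number of DNS response packets per
	// second punted to the agent for FQDN policy processing. Zero means
	// "use the default".
	DNSPacketInRate int `yaml:"dnsPacketInRate,omitempty"`
}

// Assumed default, matching the 500 pps suggested above.
const defaultDNSPacketInRate = 500

// EffectiveDNSPacketInRate resolves the configured value to the rate that
// would be programmed for DNS packet-in handling.
func EffectiveDNSPacketInRate(c *AgentConfig) int {
	if c.DNSPacketInRate <= 0 {
		return defaultDNSPacketInRate
	}
	return c.DNSPacketInRate
}

The resolved value would presumably be applied to both the OVS meter rate and the userspace rate-limited queue, so that the two limits stay consistent.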
Sure, I will work on that and open a PR ASAP.
I did an experiment with 500 pps; it takes around 70% of 1 vCPU in my env.
LGTM
LGTM, but I want to wait and see if @tnqn has any more comments regarding #5392 (comment)
@GraysonWu can you fix the conflict?
1. Change to use ct_state to match the DNS responses that need further interception.
2. Add OF meter for DNS interception.

Signed-off-by: graysonwu <wgrayson@vmware.com>
Force-pushed from 2d131de to 9e846db
/test-all
@GraysonWu please backport to 1.13
Fixes #5387