-
Notifications
You must be signed in to change notification settings - Fork 744
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support to toggle TCP early demux #1212
Conversation
a6f71c6
to
1ba8b81
Compare
scripts/init.sh
Outdated
else | ||
sysctl -w "net.ipv4.tcp_early_demux=0" | ||
fi | ||
|
||
cat "/proc/sys/net/ipv4/conf/$PRIMARY_IF/rp_filter" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this cat rp_filter
line should go above with the other rp_filter
code, otherwise the output is going to be quite confusing :P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also want to add env: [ {name: "ENABLE_TCP_EARLY_DEMUX", value: "false"} ]
, on line 229 of config/master/manifests.jsonnet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah true .. missed it, will update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also want to add env: [ {name: "ENABLE_TCP_EARLY_DEMUX", value: "false"} ], on line 229 of config/master/manifests.jsonnet
Re ^this: Are we going to disable early_demux for everyone? Only for sg-pp users? Ask users to disable the option explicitly as part of the instructions for enabling sg-pp?
Proposed logic disables it for everyone by default. That's probably ok, but we're removing a kernel optimisation and it would be nice to at least measure the impact before doing that unilaterally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed about this, currently the plan is to do the performance testing between kubelet to other pods and then release it as part of 1.7.3. Couple of blog posts called out that actually disabling this would be beneficial (https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/#tuning-adjusting-ip-protocol-early-demux and http://www.newfreesoft.com/linux/linux_kernel_socket_protocol_stack_routing_lookup_cache_mechanism_1567/, which describes that caching happens at two places one through conntrack as well)
Making it disabled will also remove one more flag that needs to be turned on to enable sg-pp. We can also update the default value to true, if its causing performance issues on some of the clusters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also adding to what @SaranBalaji90 commented -
These are the function's in kernel which are executed ->
skb_sk_is_empty -> 1 : sk is null and 0 : sk is not null
vlan990070403b2 | 192.168.78.231 -> 192.168.94.234 |ip_rcv_finish -> 0 -> 0 [skb_sk_is_empty 1 ]
vlan990070403b2 | 192.168.78.231 -> 192.168.94.234 |ip_route_input_rcu -> 0 -> 0 [skb_sk_is_empty 0 ]
vlan990070403b2 | 192.168.78.231 -> 192.168.94.234 |e__ip_route_input_rcu:0.0.0.0/0 is FIB HIT [RTN_UNICAST(Gateway or direct route)]
vlan990070403b2 | 192.168.78.231 -> 192.168.94.234 | e__ip_route_input_rcu -> 0 -> 0 [skb_sk_is_empty 0 ]
vlan990070403b2 | 192.168.78.231 -> 192.168.94.234 |ip_forward -> 0 -> 0 [skb_sk_is_empty 0 ]
vlan990070403b2 | 192.168.78.231 -> 192.168.94.234 |e__ip_forward -> 1 -> 0 [skb_sk_is_empty 0 ]
vlan990070403b2 | 192.168.78.231 -> 192.168.94.234 |e__ip_rcv_finish -> 1 -> 0 [skb_sk_is_empty 0 ]
For the branch eni case it was going into both if conditions in ip_rcv_finish
-
if (net->ipv4.sysctl_ip_early_demux &&
!skb_dst(skb) &&
!skb->sk &&
!ip_is_fragment(iph)) {
const struct net_protocol *ipprot;
int protocol = iph->protocol;
ipprot = rcu_dereference(inet_protos[protocol]);
if (ipprot && (edemux = READ_ONCE(ipprot->early_demux))) {
err = edemux(skb);
if (unlikely(err))
goto drop_error;
/* must reload iph, skb->head might have changed */
iph = ip_hdr(skb);
}
}
/*
* Initialise the virtual path cache for the packet. It describes
* how the packet travels inside Linux networking.
*/
if (!skb_valid_dst(skb)) {
err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
iph->tos, dev);
if (unlikely(err))
goto drop_error;
}
Because the first if does - !skb_dst(skb) && !skb->sk
so that means sk is null
and dst is null
, so now early demux filled dst_entry. But then skb_valid_dst
is failing so now it also does a route lookup which in turn calls ip_forward
and because skb->sk
is set then packet is dropped. Basically the early demux optimization purpose of using the cached value rather than routing lookup wasn't working here.
Also the PR description doesn't mention sg-pp at all. In 10 years time, we're going to wonder why this sysctl was set in the way it was, and this PR description is going to be our only record of the motivation for this change. It needs to explain the whole story for that team member archaeologist from the future. |
I agree. Will update the description @anguslees |
1ba8b81
to
0f306c3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Testing looks good, but this PR needs a rebase.
0f306c3
to
70d7990
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since under load DISABLE_TCP_EARLY_DEMUX=true
add some overhead, and it's only needed for kubelet to pod-eni communication, it makes sense to have it off by default.
70d7990
to
f5b056f
Compare
Another option if Host - branch ENI pod
Host - regular pod
Since this was the difference between regular and branch eni pod, we will have to do the below steps to avoid disabling tcp demux ->
|
For completeness, lets include node port and service IP data path logs and explain the reason for tcp_early_demux not getting invoked (i.e., dnat / dport translation leading to socket hash not being same for tcp syn and tcp syn+ack ) |
Curl to Service IP ->
|
Curl to branch eni pod on remote instance ->
|
What type of PR is this?
bug
Which issue does this PR fix:
What does this PR do / Why do we need it:
This PR adds support to toggle tcp_early_demux. This change is required to support liveliness/readiness check on pods using its own security groups. Reason being, unlike regular pods, for pods using security groups kubelet traffic doesn't directly reach the host veth of the pod. For pods using security groups, traffic goes out of the eth0 and then comes back in through Trunk ENI. At this point response packets from the pod (SYN-ACK or ACK) are dropped because of tcp_early_demux setting.
Two scenario where the packets were dropped are
tcp_early_demux is enabled by default. Disabling tcp_early_demux adds 10-15 ms to overall packet throughput.
If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
Performing curl from host ns to branch eni pod will fail without the change.
Testing done on this change:
After updating cni init container, instance has tcp early demux disabled on the instance and curl started working.
Automation added to e2e:
NA
Will this break upgrades or downgrades. Has updating a running cluster been tested?:
No
Does this change require updates to the CNI daemonset config files to work?:
Yes
Does this PR introduce any user-facing change?:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.