
NodePort not working properly for pods on secondary ENIs #75

Closed
lbernail opened this issue May 20, 2018 · 11 comments · Fixed by #130
@lbernail
Contributor

When using NodePort services (with or without a load balancer), if iptables redirects traffic to a local pod with an IP on a secondary interface, the traffic is dropped by the reverse path filter.

Everything seems to work because the client retries when the first SYN is dropped (although queries load-balanced to the local pod take much longer), and the retry will probably be sent to another host (or to a pod on the primary interface).

This can be seen by logging martian packets: when traffic is sent to a local pod on a secondary interface, it is dropped.

The reason is the following:

  • traffic arrives on interface eth0 (for instance) with destination NodeIP:NodePort and is NATted in PREROUTING to PodIP:PodPort
  • the reverse path for traffic from PodIP is through the secondary ENI, not eth0
  • rp_filter drops the packet
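The drops described in the steps above can be observed directly by enabling martian-packet logging on the node (a diagnostic sketch; eth0 and the standard sysctl paths are assumed):

```shell
# Log packets that fail the reverse-path check ("martians") to the kernel log
sysctl -w net.ipv4.conf.all.log_martians=1
sysctl -w net.ipv4.conf.eth0.log_martians=1

# Check the current rp_filter mode (0 = off, 1 = strict, 2 = loose)
sysctl net.ipv4.conf.eth0.rp_filter

# Watch for "martian source" lines while hitting the NodePort
dmesg -w | grep -i martian
```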

To trigger the issue consistently:

  • add externalTrafficPolicy: Local to the service definition so traffic arriving on the NodePort will only use local pods
  • make sure your pod gets an IP from a secondary ENI (by scheduling enough pods to exhaust the addresses on the primary ENI)
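A reproduction along those lines might look like this (a sketch; the Service name, selector, and ports are placeholders, not taken from the original report):

```shell
# Hypothetical Service manifest for reproducing the drop
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: nodeport-repro
spec:
  type: NodePort
  externalTrafficPolicy: Local   # only use pods local to the receiving node
  selector:
    app: nodeport-repro
  ports:
  - port: 80
    targetPort: 80
EOF

# From another host in the VPC, hit NodeIP:NodePort and watch for SYN retries,
# e.g.: curl --connect-timeout 2 http://<node-ip>:<node-port>/
```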
@lbernail
Contributor Author

A solution to this issue would be to mark NodePort traffic and use conntrack to force the reverse path through the primary ENI:

iptables -t mangle -A PREROUTING -i eth0 -p tcp --dport 30000:32767 -j CONNMARK --set-mark 42
iptables -t mangle -A PREROUTING -i eni+ -j CONNMARK --restore-mark
ip rule add fwmark 42 lookup main pref 1024
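To check that a workaround like this is in effect (a sketch; assumes conntrack-tools is installed, and the client/pod addresses and eni1 interface are placeholders):

```shell
# Confirm the policy-routing rule exists (42 == 0x2a)
ip rule show | grep "fwmark 0x2a"

# List conntrack entries carrying the mark (requires conntrack-tools)
conntrack -L --mark 42 2>/dev/null | head

# Ask the kernel which route reply traffic from the pod IP would take
ip route get <client-ip> from <pod-ip> iif eni1
```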

@liwenwu-amazon
Contributor

@lbernail can I reproduce this by:

  • create a cluster which contains only 1 node
  • create enough pods to exhaust all addresses on the primary ENI
  • create a service and its back-end pods, so the back-end pods get IP addresses from secondary ENIs
  • then pods on the primary ENI will NOT be able to communicate with the service backed by the pods on secondary ENIs

right? thanks

@lbernail
Contributor Author

@liwenwu-amazon yes, that would work, but you need to access the service using nodeip:nodeport from outside the node: if I recall correctly, accessing a NodePort from a pod redirects you to the standard service endpoints and bypasses the NodePort iptables rule

@liwenwu-amazon
Contributor

@lbernail, it seems we can only reproduce this problem when we disable external SNAT (#120). Can you confirm this is the same case for you? With external SNAT, all NodePort traffic is sent and received on eth0, so it works. When external SNAT (#120) is enabled, incoming traffic arrives on eth0 but outgoing traffic leaves via eth1, which breaks Linux connection tracking ...

@lbernail
Contributor Author

@liwenwu-amazon if the traffic comes from outside the VPC CIDR it will be SNATed, so yes, that solves the issue. However, if the request comes from within the VPC CIDR, the reply will be routed using the additional ENI (if the target pod is not on the primary ENI) and you should see the reverse-path issue
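For context, the SNAT behaviour being discussed amounts to a POSTROUTING rule along these lines (an illustrative sketch, not the CNI's exact rule; the VPC CIDR and primary-ENI address are made-up placeholders):

```shell
# Traffic to destinations outside the VPC CIDR is SNATed to the primary ENI
# address, so replies naturally return via eth0
iptables -t nat -A POSTROUTING ! -d 10.0.0.0/16 \
  -m comment --comment "illustrative external SNAT" \
  -j SNAT --to-source 10.0.1.5

# Traffic to destinations inside the VPC CIDR is NOT SNATed, so replies from
# a pod on a secondary ENI leave via that ENI and hit the rp_filter issue
```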

@fasaxc
Contributor

fasaxc commented Jul 9, 2018

I think this issue also impacts anyone who tries to use NodePorts from within the same VPC, which seems like a fairly mainstream use case. Since this is impacting my team, I've started working on a fix based on the above suggestion to use the connmark, with a couple of changes:

  • only use one connmark bit so that it doesn't clash with kube-proxy or Calico's use of the mark
  • use an interface match to avoid needing to match on the port number.

Here's my current set of rules:

iptables -t mangle -A PREROUTING -i eth0 -m addrtype --dst-type LOCAL --limit-iface-in -j CONNMARK --set-mark 0x1/0x1
iptables -t mangle -A PREROUTING -i eni+ -j CONNMARK --restore-mark --mask=0x1
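For the restored mark to actually change the return path, these rules would be paired with a policy-routing rule keyed on the marked bit, mirroring the earlier suggestion (my assumption of the companion rule, not quoted from the fix; the preference value is arbitrary):

```shell
# Route traffic whose restored conntrack mark has bit 0x1 set via the main
# routing table (i.e. back out eth0) instead of the per-ENI table
ip rule add fwmark 0x1/0x1 lookup main pref 1024
```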

@lbernail
Contributor Author

lbernail commented Jul 9, 2018

@fasaxc: you do it this way to improve performance?

@fasaxc
Contributor

fasaxc commented Jul 10, 2018

@lbernail The change to use the addrtype? That does a couple of things:

  • It syncs up with kube-proxy's rule:
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS

which also has the addrtype clause; this makes sure that we don't match traffic that isn't heading to an IP assigned to the host (i.e. traffic that is going directly to a pod but happens to be going to a port in the node port range)

  • It avoids needing to hard code or configure the NodePort range, which, at least in theory, is configurable. Also, I think you can create a NodePort manually outside that range.

When you combine the two rules, you get <connection seen on eth0 heading to a local IP> AND <connection seen leaving a veth>. Putting those two together, I think it implies that the packet was heading to a NodePort.
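The `/0x1` mask is what keeps this from clobbering marks owned by kube-proxy or Calico: `CONNMARK --set-mark value/mask` only rewrites the bits covered by the mask. A quick check of the bit arithmetic (the 0x4000 value is just an example of a bit another component might already own):

```shell
# CONNMARK --set-mark value/mask computes:
#   new_mark = (old_mark & ~mask) | (value & mask)
old_mark=$(( 0x4000 ))   # example: a bit some other component already set
value=$(( 0x1 ))
mask=$(( 0x1 ))
new_mark=$(( (old_mark & ~mask) | (value & mask) ))
printf '0x%x\n' "$new_mark"   # 0x4001: the pre-existing bit is preserved
```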

fasaxc added a commit to fasaxc/amazon-vpc-cni-k8s that referenced this issue Jul 10, 2018
Add iptables and routing rules that

- connmark traffic that arrives at the host over eth0
- restore the mark when the traffic leaves a pod veth
- force marked traffic to use the main routing table so that it
  exits via eth0.

Fixes aws#75
@liwenwu-amazon
Contributor

Today, aws-vpc-cni assigns a native VPC IP address to a Pod, so it works directly with AWS ALB and NLB without any port mapping. In other words, a Pod IP can be added to ALB and NLB target groups. This improves network performance, operation, and debugging, and removes the need to manipulate iptables or IPVS tables.

There is a work-in-progress PR, "Support routing directly to pods", in aws-alb-ingress-controller.

We are also actively working on an NLB controller, so that traffic towards a service VIP backed by NLB can be sent directly to Pod IPs.

@lbernail
Contributor Author

👍
This is very good news!
We were actually thinking about working on a similar PR for the ALB ingress

fasaxc added a commit to fasaxc/amazon-vpc-cni-k8s that referenced this issue Jul 31, 2018
Add iptables and routing rules that

- connmark traffic that arrives at the host over eth0
- restore the mark when the traffic leaves a pod veth
- force marked traffic to use the main routing table so that it
  exits via eth0.

Configure eth0 RPF check for "loose" filtering to prevent
NodePort traffic from being blocked due to incorrect reverse
path lookup in the kernel.  (The kernel is unable to undo the
NAT as part of its RPF check so it calculates the incorrect
reverse route.)

Fixes aws#75
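The "loose" filtering mentioned in the commit message corresponds to rp_filter mode 2, which only requires that the source address be reachable via some interface, not necessarily the ingress one (a sketch of the relevant sysctls; eth0 is assumed to be the primary ENI):

```shell
# 0 = no check, 1 = strict (reverse route must use the ingress interface),
# 2 = loose (source only needs to be reachable via some interface)
sysctl -w net.ipv4.conf.eth0.rp_filter=2

# The effective mode is the max of conf.all and conf.<interface>, so "all"
# must not be stricter than intended
sysctl net.ipv4.conf.all.rp_filter
```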
fasaxc added a commit to fasaxc/amazon-vpc-cni-k8s that referenced this issue Aug 1, 2018
Add iptables and routing rules that

- connmark traffic that arrives at the host over eth0
- restore the mark when the traffic leaves a pod veth
- force marked traffic to use the main routing table so that it
  exits via eth0.

Configure eth0 RPF check for "loose" filtering to prevent
NodePort traffic from being blocked due to incorrect reverse
path lookup in the kernel.  (The kernel is unable to undo the
NAT as part of its RPF check so it calculates the incorrect
reverse route.)

Add diagnostics for env var configuration and sysctls.

Fixes aws#75
fasaxc added a commit to fasaxc/amazon-vpc-cni-k8s that referenced this issue Aug 2, 2018
@liwenwu-amazon liwenwu-amazon added this to the v1.2 milestone Aug 14, 2018
@liwenwu-amazon liwenwu-amazon self-assigned this Aug 14, 2018
seantsb pushed a commit to HypeHub/amazon-vpc-cni-k8s that referenced this issue Sep 12, 2018
@ikatson
Contributor

ikatson commented Nov 30, 2018

Looks like this issue is still present if Calico is used for network policy, as Calico's own rules trigger before the ones that solve the issue; see my comment in #231 (comment)
