
NodePort not working properly for pods on secondary ENIs #75

Closed
lbernail opened this issue May 20, 2018 · 11 comments · Fixed by #130
@lbernail
Contributor

When using NodePort services (with or without a load balancer), if iptables redirects traffic to a local pod with an IP on a secondary interface, the traffic is dropped by the reverse path filter.

Everything seems to work because the client retries when the first SYN is dropped (although queries load-balanced to the local pod take much longer), and the retry will probably be sent to another host (or to a pod on the primary interface).

This can be seen by logging martian packets: when traffic is sent to a local pod on a secondary interface, it is dropped.

The reason is the following:

  • traffic arrives on interface eth0 (for instance) with destination NodeIP:NodePort and is NATted in PREROUTING to PodIP:PodPort
  • the reverse path for traffic from PodIP is through the secondary ENI, not eth0
  • rp_filter drops the packet
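The drops described in the steps above can be observed directly by enabling martian-packet logging on the node (a diagnostic sketch; eth0 and the standard sysctl paths are assumed):

```shell
# Log packets that fail the reverse-path check ("martians") to the kernel log
sysctl -w net.ipv4.conf.all.log_martians=1
sysctl -w net.ipv4.conf.eth0.log_martians=1

# Check the current rp_filter mode (0 = off, 1 = strict, 2 = loose)
sysctl net.ipv4.conf.eth0.rp_filter

# Watch for "martian source" lines while hitting the NodePort
dmesg -w | grep -i martian
```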

To trigger the issue consistently:

  • add externalTrafficPolicy: Local to the service definition so traffic arriving on the NodePort will only use local pods
  • make sure your pod gets an IP from a secondary ENI (by scheduling enough pods to exhaust the addresses on the primary ENI)
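A reproduction along those lines might look like this (a sketch; the Service name, selector, and ports are placeholders, not taken from the original report):

```shell
# Hypothetical Service manifest for reproducing the drop
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: nodeport-repro
spec:
  type: NodePort
  externalTrafficPolicy: Local   # only use pods local to the receiving node
  selector:
    app: nodeport-repro
  ports:
  - port: 80
    targetPort: 80
EOF

# From another host in the VPC, hit NodeIP:NodePort and watch for SYN retries,
# e.g.: curl --connect-timeout 2 http://<node-ip>:<node-port>/
```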
@lbernail
Contributor Author

A solution to this issue would be to mark NodePort traffic and use conntrack to force the reverse path through the primary ENI:

iptables -t mangle -A PREROUTING -i eth0 -p tcp --dport 30000:32767 -j CONNMARK --set-mark 42
iptables -t mangle -A PREROUTING -i eni+ -j CONNMARK --restore-mark
ip rule add fwmark 42 lookup main pref 1024
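To check that a workaround like this is in effect (a sketch; assumes conntrack-tools is installed, and the client/pod addresses and eni1 interface are placeholders):

```shell
# Confirm the policy-routing rule exists (42 == 0x2a)
ip rule show | grep "fwmark 0x2a"

# List conntrack entries carrying the mark (requires conntrack-tools)
conntrack -L --mark 42 2>/dev/null | head

# Ask the kernel which route reply traffic from the pod IP would take
ip route get <client-ip> from <pod-ip> iif eni1
```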

@liwenwu-amazon
Contributor

@lbernail can I reproduce this by:

  • create a cluster which contains only 1 node
  • create enough pods to exhaust all addresses on the primary ENI
  • create a service and its back-end pods, so the back-end pods get IP addresses from secondary ENIs
  • then pods on the primary ENI will NOT be able to communicate with the service backed by the pods on secondary ENIs

right? thanks

@lbernail
Contributor Author

@liwenwu-amazon yes, that would work, but you need to access the service using nodeip:nodeport from outside the node: if I recall correctly, accessing a NodePort from a pod redirects you to the standard service endpoints and bypasses the NodePort iptables rule

@liwenwu-amazon
Contributor

@lbernail, it seems we can only reproduce this problem when we disable external SNAT (#120). Can you confirm this is the same case for you? With external SNAT, all NodePort traffic is sent and received on eth0, so it works. When external SNAT (#120) is enabled, incoming traffic arrives on eth0 but outgoing traffic leaves via eth1, which breaks Linux connection tracking ...

@lbernail
Contributor Author

@liwenwu-amazon if the traffic comes from outside the VPC CIDR it will be SNATed, so yes, that solves the issue. However, if the request comes from within the VPC CIDR, the reply will be routed using the additional ENI (if the target pod is not on the primary ENI) and you should see the reverse-path issue
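For context, the SNAT behaviour being discussed amounts to a POSTROUTING rule along these lines (an illustrative sketch, not the CNI's exact rule; the VPC CIDR and primary-ENI address are made-up placeholders):

```shell
# Traffic to destinations outside the VPC CIDR is SNATed to the primary ENI
# address, so replies naturally return via eth0
iptables -t nat -A POSTROUTING ! -d 10.0.0.0/16 \
  -m comment --comment "illustrative external SNAT" \
  -j SNAT --to-source 10.0.1.5

# Traffic to destinations inside the VPC CIDR is NOT SNATed, so replies from
# a pod on a secondary ENI leave via that ENI and hit the rp_filter issue
```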

@fasaxc
Contributor

fasaxc commented Jul 9, 2018

I think this issue also impacts anyone who tries to use NodePorts from within the same VPC, which seems like a fairly mainstream use case. Since this is impacting my team, I've started working on a fix based on the above suggestion to use the connmark, with a couple of changes:

  • only use one connmark bit so that it doesn't clash with kube-proxy or Calico's use of the mark
  • use an interface match to avoid needing to match on the port number.

Here's my current set of rules:

iptables -t mangle -A PREROUTING -i eth0 -m addrtype --dst-type LOCAL --limit-iface-in -j CONNMARK --set-mark 0x1/0x1
iptables -t mangle -A PREROUTING -i eni+ -j CONNMARK --restore-mark --mask=0x1
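For the restored mark to actually change the return path, these rules would be paired with a policy-routing rule keyed on the marked bit, mirroring the earlier suggestion (my assumption of the companion rule, not quoted from the fix; the preference value is arbitrary):

```shell
# Route traffic whose restored conntrack mark has bit 0x1 set via the main
# routing table (i.e. back out eth0) instead of the per-ENI table
ip rule add fwmark 0x1/0x1 lookup main pref 1024
```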

@lbernail
Contributor Author

lbernail commented Jul 9, 2018

@fasaxc: you do it this way to improve performance?

@fasaxc
Contributor

fasaxc commented Jul 10, 2018

@lbernail The change to use the addrtype? That does a couple of things:

  • It syncs up with kube-proxy's rule:
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS

which also has the addrtype clause; this makes sure that we don't match traffic that isn't heading to an IP assigned to the host (i.e. traffic that is going directly to a pod but happens to be going to a port in the node port range)

  • It avoids needing to hard code or configure the NodePort range, which, at least in theory, is configurable. Also, I think you can create a NodePort manually outside that range.

When you combine the two rules, you get <connection seen on eth0 heading to a local IP> AND <connection seen leaving a veth>. Putting those two together, I think it implies that the packet was heading to a NodePort.
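The `/0x1` mask is what keeps this from clobbering marks owned by kube-proxy or Calico: `CONNMARK --set-mark value/mask` only rewrites the bits covered by the mask. A quick check of the bit arithmetic (the 0x4000 value is just an example of a bit another component might already own):

```shell
# CONNMARK --set-mark value/mask computes:
#   new_mark = (old_mark & ~mask) | (value & mask)
old_mark=$(( 0x4000 ))   # example: a bit some other component already set
value=$(( 0x1 ))
mask=$(( 0x1 ))
new_mark=$(( (old_mark & ~mask) | (value & mask) ))
printf '0x%x\n' "$new_mark"   # 0x4001: the pre-existing bit is preserved
```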

fasaxc added a commit to fasaxc/amazon-vpc-cni-k8s that referenced this issue Jul 10, 2018
Add iptables and routing rules that

- connmark traffic that arrives at the host over eth0
- restore the mark when the traffic leaves a pod veth
- force marked traffic to use the main routing table so that it
  exits via eth0.

Fixes aws#75
@liwenwu-amazon
Contributor

Today, aws-vpc-cni assigns a native VPC IP address to a Pod, so it works directly with AWS ALB and NLB without any port mapping. In other words, a Pod IP can be added to ALB and NLB target groups. This improves network performance, operation, and debugging, and removes the need to manipulate iptables or IPVS tables.

There is a work-in-progress PR, "Support routing directly to pods", in aws-alb-ingress-controller.

We are also actively working on an NLB controller, so that traffic towards a service VIP backed by NLB can be sent directly to Pod IPs.

@lbernail
Contributor Author

👍
This is very good news!
We were actually thinking about working on a similar PR for the ALB ingress

fasaxc added a commit to fasaxc/amazon-vpc-cni-k8s that referenced this issue Jul 31, 2018
Add iptables and routing rules that

- connmark traffic that arrives at the host over eth0
- restore the mark when the traffic leaves a pod veth
- force marked traffic to use the main routing table so that it
  exits via eth0.

Configure eth0 RPF check for "loose" filtering to prevent
NodePort traffic from being blocked due to incorrect reverse
path lookup in the kernel.  (The kernel is unable to undo the
NAT as part of its RPF check so it calculates the incorrect
reverse route.)

Fixes aws#75
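The "loose" filtering mentioned in the commit message corresponds to rp_filter mode 2, which only requires that the source address be reachable via some interface, not necessarily the ingress one (a sketch of the relevant sysctls; eth0 is assumed to be the primary ENI):

```shell
# 0 = no check, 1 = strict (reverse route must use the ingress interface),
# 2 = loose (source only needs to be reachable via some interface)
sysctl -w net.ipv4.conf.eth0.rp_filter=2

# The effective mode is the max of conf.all and conf.<interface>, so "all"
# must not be stricter than intended
sysctl net.ipv4.conf.all.rp_filter
```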
fasaxc added a commit to fasaxc/amazon-vpc-cni-k8s that referenced this issue Aug 1, 2018
Add iptables and routing rules that

- connmark traffic that arrives at the host over eth0
- restore the mark when the traffic leaves a pod veth
- force marked traffic to use the main routing table so that it
  exits via eth0.

Configure eth0 RPF check for "loose" filtering to prevent
NodePort traffic from being blocked due to incorrect reverse
path lookup in the kernel.  (The kernel is unable to undo the
NAT as part of its RPF check so it calculates the incorrect
reverse route.)

Add diagnostics for env var configuration and sysctls.

Fixes aws#75
fasaxc added a commit to fasaxc/amazon-vpc-cni-k8s that referenced this issue Aug 2, 2018
@liwenwu-amazon liwenwu-amazon added this to the v1.2 milestone Aug 14, 2018
@liwenwu-amazon liwenwu-amazon self-assigned this Aug 14, 2018
seantsb pushed a commit to HypeHub/amazon-vpc-cni-k8s that referenced this issue Sep 12, 2018
@ikatson
Contributor

ikatson commented Nov 30, 2018

Looks like this issue is still present if Calico is used for network policy, as Calico's own rules trigger before the ones that solve the issue; see my comment in #231 (comment)
