
Race condition causes quickly opened connections to fail #186

Closed
dave-powell-kensho opened this issue Jan 22, 2024 · 12 comments
Labels: bug, strict mode

Comments


dave-powell-kensho commented Jan 22, 2024

What happened:

After enabling network policy support, we observed that applications which opened connections early in their lifecycle would hang. It appears that the process had established a connection successfully and was then stuck in a read syscall indefinitely.

```
# ss -nitp
State   Recv-Q   Send-Q   Local Address:Port    Peer Address:Port   Process
ESTAB   0        0        xx.xx.xx.xx:44670     xx.xx.xx.xx:443     xx
        cubic wscale:13,7 rto:204 rtt:2.217/0.927 ato:40 mss:1388 pmtu:9001 rcvmss:1448 advmss:8949 cwnd:10 bytes_sent:783 bytes_acked:784 bytes_received:4735 segs_out:6 segs_in:7 data_segs_out:3 data_segs_in:4 send 50085701bps lastsnd:904904 lastrcv:904904 lastack:904900 pacing_rate 100160104bps delivery_rate 11131824bps delivered:4 app_limited busy:8ms rcv_space:56587 rcv_ssthresh:56587 minrtt:1.339 snd_wnd:65536
```

The process is stuck in a read syscall. (In /proc/<pid>/syscall the first field is the syscall number; 0 is read on x86_64, and the first argument, 0x3, is the file descriptor the process is blocked on.)

```
# cat /proc/8/syscall 
0 0x3 0x56505f69dcc3 0x5 0x0 0x0 0x0 0x7ffe65dd4f38 0x7f94ae42b07d
```

This occurred across multiple disparate deployments whose common feature was making outbound connections early. When debugging the affected pods, we found we were able to open new outbound connections without issue. Our theory is that the application opens connections early in the pod lifecycle, before the network policy agent has initialized, and once the agent applies its policies, the already-established connection is affected. In these cases we had no egress-filtering network policies applied to the pods, but we did have ingress filters.

Attach logs

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):
Create a pod that immediately attempts to download a file large enough for the transfer to last several seconds (a minimal pod sketch is shown below). The request ends up hanging, but executing the same request on the same pod after some period of initialization succeeds.
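
For reference, a minimal pod along these lines should exhibit the behavior; the image and URL are placeholders, not taken from this report:

```yaml
# Repro sketch: start downloading immediately at container start, before the
# network policy agent has reconciled policies for the new pod.
# The image and URL below are placeholders, not from the original report.
apiVersion: v1
kind: Pod
metadata:
  name: early-download-repro                    # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: downloader
      image: curlimages/curl:8.5.0              # any image with curl works
      args:                                     # passed to the curl entrypoint
        - "-sS"
        - "-o"
        - "/dev/null"
        - "--max-time"
        - "120"                                 # fail instead of hanging forever
        - "https://example.com/large-file"      # placeholder: any multi-second download
```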

Anything else we need to know?:
Possibly related to #144 ?

Environment:

  • Kubernetes version (use kubectl version): 1.26
  • CNI Version: 1.15.3
  • Network Policy Agent Version: 1.0.5 and 1.0.8rc1
  • OS (e.g: cat /etc/os-release): AL2
  • Kernel (e.g. uname -a): 5.10.192-183.736.amzn2.x86_64 #1 SMP Wed Sep 6 21:15:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
dave-powell-kensho added the bug label Jan 22, 2024

ndrafahl commented Feb 5, 2024

Hey @dave-powell-kensho - did you end up finding a resolution to your problem?

We are seeing somewhat similar results in our environment after enabling network policy enforcement via the AWS CNI addon, which we were not seeing prior to enforcing network policies. I don't think, though, that it's because the connections are occurring too quickly after the pod comes up.

We are seeing that connections appear to be successful initially, but are timing out due to a "Read Timeout" on the client's side. We know the initial connection is successful because there are actions that are being taken on the server side, and a retry of the same action basically gives us a response of "you already did this".

In other cases, we're seeing that connections that have no timeout enforced basically stay open indefinitely, and we have to forcefully reboot the pods (namely, connections to an RDS instance).

@dave-powell-kensho (Author)

@ndrafahl We were not able to find a resolution that left the netpol agent running, so we rolled back/unwound the netpol agent changes. We have not experienced these issues since the rollback ~2 weeks ago.


ndrafahl commented Feb 5, 2024

Did you basically go through these steps for your rollback? (A config sketch follows the list.)

  • Deleted all of your ingress network policies in the cluster
  • Set enable-network-policy-controller to false in ConfigMap amazon-vpc-cni (kube-system NS).
  • Set the 'enableNetworkPolicy' parameter to false. This will disable the agents on the nodes.
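
The ConfigMap step boils down to configuration like the following; the ConfigMap name, namespace, and key come from the list above, while the addon-level comment is an assumption about how the VPC CNI is managed in a given cluster:

```yaml
# Rollback sketch: disable the network policy controller via the amazon-vpc-cni
# ConfigMap. Merge this key into the existing ConfigMap rather than replacing it.
apiVersion: v1
kind: ConfigMap
metadata:
  name: amazon-vpc-cni
  namespace: kube-system
data:
  enable-network-policy-controller: "false"
# The final step above corresponds to setting enableNetworkPolicy to "false"
# in the addon/Helm configuration, which disables the node agents.
```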

Out of curiosity, did you also try updating the addon to 1.16.x? That's been one suggestion that has been made to us, but we haven't yet taken that path. Right now we're trying to figure out which direction to take.

Sorry - one additional question: did you do the same thing in other environments without seeing any issues there?

@dave-powell-kensho (Author)

Yes, those are the steps we took, though we also removed the netpol agent container from the daemonset at the end.

I'm not aware of that version - I had seen similar issues where the developers asked reporters to try 1.0.8rc1, which we did (upgrading from 1.0.5) with the same results.

We were able to replicate this issue in multiple environments. We have left this addon enabled in our lowest environment so that we're able to test any potential fixes quickly.

@dave-powell-kensho (Author)

cc @jayanthvn We've been sitting on this issue for a couple weeks now and would really appreciate some eyes from the maintainers.


ndrafahl commented Feb 5, 2024

Did you find that you also needed to remove the node agent from the daemonset, after those steps, to get your issue resolved?

I tested the steps in a lower environment, and sure enough that container is still running on the pod even though the addon is set to not enforce network policies any longer.

@jayanthvn (Contributor)

@dave-powell-kensho - Sorry, I somehow lost track of this. This is expected behavior if the connection is established prior to policy reconciliation against the new pod. Please see this - #189 (comment)

@dave-powell-kensho (Author)

@ndrafahl We removed the node agent from the pod's container list, yes, though we self-manage the aws-node deployment config, so I cannot advise on helm charts and the like.

@jayanthvn Thank you for the update, we'll be looking forward to the release of the strict mode feature. Is there any issue or other location we can track to know when it is released?


ndrafahl commented Feb 6, 2024

@dave-powell-kensho Thanks for the info, appreciate you responding. 👍

jdn5126 (Contributor) commented Feb 16, 2024

@dave-powell-kensho you can track the progress of #209 and its release

jdn5126 added the strict mode label Feb 16, 2024

ariary commented Apr 4, 2024

@jdn5126 @jayanthvn I'm not sure the Strict Mode solved the issue.

In fact, the issue specifically describes a bug in the standard mode (as opposed to strict), which is still blocking some traffic.

Latest tests with v1.17.1-eksbuild.1 in standard mode: short-lived connections are still blocked (even though they are explicitly allowed by network policies, and after some time the pod is able to perform the same connection without any issue).

@Pavani-Panakanti (Contributor)

In standard mode, we do a default allow at pod startup, so all traffic is allowed before policies are reconciled. It takes 1-2 seconds for the policies to be reconciled on the new pod. Once network policy reconciliation happens, we start tracking flows in the conntrack table. For return traffic, we check whether an entry is present in the conntrack table and allow it accordingly. For traffic which exited the pod before network policies were applied, with return traffic arriving after they were applied, the return traffic will be denied because the flow is not tracked in the conntrack table.

As a mitigation, a 2-5 second delay can be added at pod startup using an init container (see the sketch below). As a result, traffic will start going out of the pod only after network policies have been applied, and there will be no denies on the return traffic.
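
For anyone applying this workaround before the fix lands, a sketch of that init-container delay could look like the following; the pod name, container names, and images are illustrative, not prescribed in this thread:

```yaml
# Mitigation sketch: delay app startup by a few seconds so that policy
# reconciliation on the new pod finishes before the first outbound connection.
apiVersion: v1
kind: Pod
metadata:
  name: app-with-startup-delay          # hypothetical name
spec:
  initContainers:
    - name: wait-for-netpol
      image: public.ecr.aws/docker/library/busybox:1.36   # any image with sleep
      command: ["sleep", "5"]            # 2-5 seconds, per the comment above
  containers:
    - name: app
      image: my-app:latest               # your application image
```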

We are actively working on fixing this issue so that customers can use standard mode without needing to add a sleep at pod startup. The fix can be tracked here: #345

Closing this issue; please follow the open issue above for the fix.
