Race condition causes quickly opened connections to fail #186
Comments
Hey @dave-powell-kensho - did you end up finding a resolution to your problem? We are seeing somewhat similar results in our environment after enabling network policy enforcement via the AWS CNI addon, which we were not seeing before enforcing the network policies. I don't think, though, that it's because the connections are occurring too quickly after the pod comes up. We are seeing that connections appear to be successful initially, but then time out with a "Read Timeout" on the client's side. We know the initial connection is successful because actions are being taken on the server side, and retrying the same action basically gives us a response of "you already did this". In other cases, connections that have no timeout enforced basically stay open indefinitely, and we have to forcefully reboot the pods (namely, connections to an RDS instance).
@ndrafahl We were not able to find a resolution that left the netpol agent running, and rolled back/unwound the netpol agent changes. We have not experienced these issues since the rollback ~2 weeks ago.
Did you basically go through these steps for your rollback?
Out of curiosity, did you also try updating the addon to 1.16.x? That's one suggestion that has been made to us, but we haven't yet taken that path. Right now we're trying to figure out which direction to take. Sorry, one additional question: did you do the same thing in other environments without seeing any sort of issue there?
Yes, those are the steps we took, though we also removed the netpol agent container from the daemonset at the end. I'm not aware of that version - I had seen similar issues with requests from the developers to try 1.0.8rc1, which we did upgrade to (from 1.0.5), with the same results. We were able to replicate this issue in multiple environments. We have left this addon enabled in our lowest environment so that we're able to test any potential fixes quickly.
cc @jayanthvn We've been sitting on this issue for a couple of weeks now and would really appreciate some eyes from the maintainers.
Did you find that you also needed to remove the node agent from the daemonset after those steps to get your issue resolved? I tested the steps in a lower environment, and sure enough that container is still running in the pod even though the addon is set to no longer enforce network policies.
@dave-powell-kensho - Sorry, I somehow lost track of this. This is expected behavior if the connection is established prior to policy reconciliation against the new pod. Please see this - #189 (comment)
@ndrafahl We removed the node agent from the pod's container list, yes, though we self-manage the aws-node deployment config, so I cannot advise on helm charts and the like. @jayanthvn Thank you for the update; we'll be looking forward to the release of the strict mode feature. Is there any issue or other location we can track to know when it is released?
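For anyone else self-managing the aws-node config, the change described above amounts to deleting the network policy agent container from the DaemonSet's pod spec. The fragment below is only a sketch: it assumes the agent container is named aws-eks-nodeagent and uses placeholder images, so verify the names against your own manifest before editing.

```yaml
# Sketch of the relevant part of a self-managed aws-node DaemonSet with the
# network policy agent container removed. The container name
# aws-eks-nodeagent is an assumption; check your own manifest.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: aws-node
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: aws-node                 # the VPC CNI container stays
          image: <aws-vpc-cni image>
        # - name: aws-eks-nodeagent      # network policy agent, removed here
        #   image: <aws-network-policy-agent image>
```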
@dave-powell-kensho Thanks for the info, appreciate you responding. 👍
@dave-powell-kensho you can track the progress of #209 and its release |
@jdn5126 @jayanthvn I'm not sure Strict Mode solved the issue. In fact, the issue describes a bug specifically in the standard portion of the Strict Mode, which is (still) blocking some traffic. Last tests with
In standard mode, we do a default allow at pod startup, and all traffic is allowed before policies are reconciled. It takes 1-2 seconds for the policies to be reconciled on the new pod. Once network policy reconciliation happens, we start tracking the flows in the conntrack table. For return traffic, we check whether an entry is present in the conntrack table and allow it accordingly. For traffic which exited the pod before network policies were applied, where the return traffic arrives after the policies were applied, the return traffic will be denied because the entry is not tracked in the conntrack table.

As a mitigation, a 2-5 second delay can be added at pod startup using an init container. As a result, traffic will start going out of the pod only after network policies have been applied, and there will be no denies on the return traffic.

We are actively working on fixing this issue so that customers can use standard mode without the need to add a sleep at pod startup. The fix can be tracked here: #345. Closing this issue; please follow the above open issue for the fix.
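To make the suggested mitigation concrete, here is a minimal sketch of a pod spec that delays the application's startup with a sleep in an init container. The images, names, and the 5-second value are illustrative assumptions; the only requirement is that the init container finishes after the agent has had a couple of seconds to reconcile policies for the new pod.

```yaml
# Minimal sketch: delay the main container's startup by a few seconds
# so the network policy agent can reconcile policies first.
# Images, names, and the 5-second value are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  initContainers:
    - name: wait-for-netpol
      image: public.ecr.aws/docker/library/busybox:1.36
      command: ["sh", "-c", "sleep 5"]
  containers:
    - name: app
      image: example.com/my-app:latest   # hypothetical application image
```

The sleep only narrows the race window; it is a stopgap until the fix tracked in #345 lands.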
What happened:
After enabling network policy support, we observed that applications which opened connections early in their lifecycle would hang. It appears that the process had established a connection successfully and was then stuck in a read syscall indefinitely.
The process becomes stuck while in a read syscall
This occurred across multiple disparate deployments, with the common feature being early outbound connections. When debugging the affected pods, we found that we were able to open outbound connections without issue. Our theory is that the application opens connections early in the pod lifecycle, before the agent gets going, and once the network policy agent does its work, the connection is affected. In these cases we had no egress-filtering network policies applied to the pods, but did have ingress filters.
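An ingress-only policy of the kind described above looks roughly like the sketch below; the namespace, labels, and port are illustrative assumptions, not the actual policy from the affected clusters.

```yaml
# Illustrative ingress-only NetworkPolicy of the kind described above.
# Namespace, labels, and port are placeholder assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-from-frontend
  namespace: example
spec:
  podSelector:
    matchLabels:
      app: example-app
  policyTypes:
    - Ingress          # no Egress entry, so outbound traffic is not filtered
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```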
Attach logs
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Create a pod that immediately attempts to download a file large enough to take several seconds. The request ends up hanging, but executing the same request on the same pod after some period of initialization succeeds.
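A minimal sketch of such a repro pod is shown below. The image and download URL are placeholders rather than the reproduction case from the report; any sufficiently large file fetched immediately at container startup should exercise the same path.

```yaml
# Illustrative repro: fetch a large file immediately at container startup,
# before the network policy agent has reconciled policies for the pod.
# Image and URL are placeholder assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: netpol-race-repro
spec:
  restartPolicy: Never
  containers:
    - name: downloader
      image: public.ecr.aws/docker/library/alpine:3.19
      command:
        - sh
        - -c
        - "wget -O /dev/null https://example.com/large-file.bin && echo done"
```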
Anything else we need to know?:
Possibly related to #144?
Environment:
- Kubernetes version (use `kubectl version`): 1.26
- OS (e.g: `cat /etc/os-release`): AL2
- Kernel (e.g. `uname -a`): 5.10.192-183.736.amzn2.x86_64 #1 SMP Wed Sep 6 21:15:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux