On-the-fly agent replacement #202
Comments
Thanks for reporting the problems you had. The agent command has a flag that will have it insert an iptables rule to intercept metadata traffic, and by default it also removes that rule when the process terminates (which can leave a window where the rule isn't in place). One option is to change your node configuration to insert the iptables rule as part of the system initialisation process; I'd also happily take a PR for an additional agent flag that leaves the rule in place on shutdown.

As for pods failing for longer, that feels like an issue we've seen in the past with poor application behaviour: the standard AWS SDKs appear to raise exceptions/errors as soon as they're unable to access the metadata API (or the kiam agent), even if they hold credentials that remain valid. Not all applications have failure handling in the right places (a call to send an SQS message could fail because the credentials are no longer valid, but the application won't recreate an authenticated session, for example). Could you confirm whether this is a plausible explanation for what you experienced?

Based on what you describe, it sounds like setting the iptables rule on node boot and preventing kiam from deleting it would help, although it probably won't guarantee you never see the problem again. Node replacement will guarantee you won't see it (and doesn't require application changes); otherwise I think it's a case of reviewing the failure handling within the applications.
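Below is a minimal sketch of what installing the interception rule at node initialisation could look like, using the coreos/go-iptables library. The rule spec is an assumption: the `cali+` pod interface pattern, port 8181, and the `10.0.0.10` host address are placeholders and must match the flags your agent actually runs with.

```go
// ensure-metadata-intercept: a sketch of installing the interception rule
// once at node boot, independent of the agent pod's lifecycle.
package main

import (
	"log"

	"github.com/coreos/go-iptables/iptables"
)

func main() {
	ipt, err := iptables.New()
	if err != nil {
		log.Fatalf("iptables init: %v", err)
	}

	// Redirect pod traffic destined for the EC2 metadata API to the port the
	// kiam agent listens on. The interface pattern, port, and host IP below
	// are placeholders and must match your agent's configuration.
	rule := []string{
		"-d", "169.254.169.254/32",
		"-i", "cali+",
		"-p", "tcp", "--dport", "80",
		"-j", "DNAT", "--to-destination", "10.0.0.10:8181",
	}

	// AppendUnique is idempotent, so this is safe to run on every boot.
	if err := ipt.AppendUnique("nat", "PREROUTING", rule...); err != nil {
		log.Fatalf("failed to install interception rule: %v", err)
	}
}
```

Run something like this from a systemd unit or boot hook before workloads start, so the window without interception never opens.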
@pingles ++ on system-level rule installation. Will implement that. This is not an app issue for sure: we'd been running those apps for half a year on kiam v2 with zero issues. I ran into this once I upgraded kiam to v3-rc1 and replaced the agent on a running node. Steps to reproduce are as described in the issue report: replace the agent on a node with pods already running under annotated roles.
A system-level rule should close all the gaps anyway. Worth mentioning in the kiam docs?
My question was why those application pods weren't obtaining new roles once the rule was re-added: the iptables update should apply immediately to all traffic. Yes, restarting the pods fixed it (because they'd restart and presumably recreate any AWS sessions within their client libraries), but I'm not sure it's because the iptables interception wasn't taking effect. As you say, a system-level rule should address it though. Yep, it'd be good to mention it in the docs somewhere; maybe it's worth starting something around production deployments in a PRODUCTION_OPS doc.
I will add more details once I have them. I'd love to see PRODUCTION_OPS; I can contribute too.
We had some issues on kiam agent restart: it appears that if a credentials request goes through to the real AWS metadata service while the iptables rule is absent, the pod will get invalid credentials with an expiry time of six hours and so will not re-request via kiam (we're using the AWS Go client). +1 on moving the rule out of pod startup to the system level, and +10 on documenting this.
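A sketch of that caching behaviour with the AWS SDK for Go (v1): the SDK holds on to whatever credentials the metadata endpoint handed back and won't go back to it until they near expiry, so re-adding the iptables rule doesn't help a pod that has already fetched the node's credentials. This is a sketch of the assumed SDK behaviour, not kiam code; credentials come from the default chain.

```go
// Sketch: why a pod can keep using credentials fetched while the
// interception rule was absent.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws/session"
)

func main() {
	// One session created at startup is the usual pattern; its credential
	// provider caches the first set of credentials it obtains.
	sess := session.Must(session.NewSession())

	// If this first fetch happened while the iptables rule was absent, these
	// are the node's role credentials, not the pod's annotated role.
	creds, err := sess.Config.Credentials.Get()
	if err != nil {
		log.Fatalf("fetching credentials: %v", err)
	}
	fmt.Println("access key in use:", creds.AccessKeyID)

	// The SDK will not contact the metadata API again until the cached
	// credentials are close to expiry...
	fmt.Println("expired?", sess.Config.Credentials.IsExpired())

	// ...unless the application forces a refresh (or the pod restarts).
	sess.Config.Credentials.Expire()
}
```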
Could you maybe open a separate issue for just this? I.e. not removing the iptables rule when the agent shuts down.
I (or one of my colleagues) will probably do a PR. I'm unclear why you think it is a separate issue - the window during which requests are not intercepted seems to me to be the main issue.
You're totally right, ignore me. Please get a PR up and we can close this 😄
I'm in favor of the flag that leaves the iptables rule in place when the agent exits.
Plus I would also like to see documentation (not just a link to Go code) on how to set up the iptables rule at the system level.
Thankfully @theatrus contributed this in #253. Agreed on documentation for configuring this on nodes. We'd be super happy to have more docs/production operations notes contributed.
Well, that prevents clients from retrieving the wrong credentials, which is already better than before. But updating the agents can still cause a short downtime and errors for some apps; I wish there were a good way to update with zero downtime. Does anyone have any ideas regarding this?
The apps should be caching the credentials rather than requesting them on every call, and the credentials should be good for at least an hour. Credential caching is supported by all the AWS SDKs as far as I know, so the solution is to fix the apps. I know that isn't necessarily a practical solution, but the fact that credential caching is expected behavior means it isn't worth trying to create a setup where the new agent comes up on the same node as the old one and takes over the local metadata network connection before the old agent shuts down. That is not something Kubernetes supports.
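For reference, a minimal sketch of that pattern with the AWS SDK for Go (v1): one session and one client for the lifetime of the process, with the SDK caching credentials and refreshing them through the metadata path (i.e. via kiam) only as they near expiry. The region and queue URL below are placeholders.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sqs"
)

var (
	// Created once at startup; credentials are cached inside the session and
	// only re-fetched via the metadata API when they expire.
	sess  = session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
	queue = sqs.New(sess)
)

func send(body string) error {
	_, err := queue.SendMessage(&sqs.SendMessageInput{
		QueueUrl:    aws.String("https://sqs.eu-west-1.amazonaws.com/123456789012/example"), // placeholder
		MessageBody: aws.String(body),
	})
	return err
}

func main() {
	if err := send("hello"); err != nil {
		log.Fatalf("send failed: %v", err)
	}
}
```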
Environment: AWS, k8s 1.10, kops, calico v2.6.7, kiam helm chart from this PR helm/charts#9262, quay.io/uswitch/kiam:v3.0-rc1.
We ran into an issue when doing a kiam upgrade:
The server part went smoothly.
After the agent part, the new agent pod became invisible to all existing pods: their metadata calls were not intercepted by the kiam agent and they were getting the bare node role instead of the one set in the annotation. This condition persisted for at least 30 minutes; after that I re-created all affected pods and they started working normally, seeing the annotated role.
In other words, the new kiam agent pod only works for pods created after it.
Right now we are doing node replacement to rotate an agent. Am I missing something in the docs? Is this expected, and can it be avoided?
This also means that a crashed kiam agent pod means a crashed node, so we have to do custom node healthchecks and so on to make sure we don't get into this condition.
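To check which role a pod is actually resolving, here is a small diagnostic sketch (not kiam code) that can be run from inside an affected pod; with interception working it should print the annotated role, otherwise the node role:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 2 * time.Second}

	// Listing this metadata path returns the name of the role the caller
	// will be issued credentials for.
	resp, err := client.Get("http://169.254.169.254/latest/meta-data/iam/security-credentials/")
	if err != nil {
		log.Fatalf("metadata request failed: %v", err)
	}
	defer resp.Body.Close()

	role, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatalf("reading response: %v", err)
	}
	fmt.Printf("role visible to this pod: %s\n", role)
}
```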