
On-the-fly agent replacement #202

max-lobur opened this issue Dec 19, 2018 · 13 comments


@max-lobur
Contributor

max-lobur commented Dec 19, 2018

Environment: AWS, k8s 1.10, kops, calico v2.6.7, kiam helm chart from this PR helm/charts#9262, quay.io/uswitch/kiam:v3.0-rc1.

We ran into an issue when doing a kiam upgrade:

  • helm upgrade
  • rotate kiam server pods one at a time
  • rotate kiam agent pods one at a time.

The server part went smoothly.
After the agent part, the new agent pod was invisible to all the existing pods: their metadata calls were not intercepted by the kiam agent and they were getting the bare node role instead of the one set in the annotation. This condition persisted for at least 30 minutes; after that I re-created all affected pods and they started to work normally, picking up the annotated role.

In other words, the new kiam agent pod works only for pods created after it.

Right now we are doing node replacement to rotate an agent. Am I missing something in the docs? Is this expected, and can it be avoided?

This also means that a crashed kiam agent pod effectively means a crashed node; we would have to add custom node health checks and so on to make sure we do not get into this condition.

@pingles
Contributor

pingles commented Dec 20, 2018

Thanks for reporting the problems you had. The agent command has a flag that has it insert an iptables rule to intercept metadata traffic, and by default it also removes that rule when the process terminates (which can leave a window where the iptables rule isn't in place).

One option is to change your node configuration to insert the iptables rule as part of the system initialisation process. I'd also happily take a PR for an additional agent flag, something like --no-iptables-removal-on-termination, to prevent Kiam from dropping the rule on termination.
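For reference, a rough sketch of what such a boot-time rule could look like. This is not necessarily the exact rule kiam installs; the node address is a placeholder, 8181 is assumed to be the agent's listen port, and cali+ matches the Calico pod interfaces from the environment above, so all three must be adjusted to match the agent's flags:

```sh
# NODE_IP: the address the kiam agent listens on (it runs with hostNetwork).
NODE_IP="10.0.0.10"   # placeholder, replace with this node's address

# Redirect pod traffic for the EC2 metadata API (169.254.169.254:80) to the agent.
# Matching only the pod interfaces (-i cali+) leaves the host's own metadata access intact.
iptables -t nat -A PREROUTING -d 169.254.169.254/32 -i 'cali+' \
  -p tcp --dport 80 -j DNAT --to-destination "${NODE_IP}:8181"
```

With the rule owned by the node rather than the pod, deleting or replacing the agent pod never reopens the interception window.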

As for pods failing for longer, that feels like an issue we've seen in the past with poor application behaviour: the standard AWS SDKs appear to cause exceptions/errors as soon as they're unable to access the metadata API (or the kiam agent) even if they have credentials that remain valid. Not all applications have failure handling in the right places (so a call to send an SQS message could fail because the credentials are no longer valid but the application won't recreate an authenticated session, for example). Could you confirm whether this is a plausible explanation for what you experienced?

Based on what you describe, it sounds like setting the iptables rules on node boot (and preventing kiam from deleting them) would help, although it probably won't guarantee you never see the problem again. Node replacement will guarantee you won't see it (and doesn't require application changes); alternatively, I think it's a case of reviewing the failure handling within the applications.

@max-lobur
Contributor Author

max-lobur commented Dec 20, 2018

@pingles ++ on system-level rule installation. Will implement that.

This is definitely not an app issue. We have been using these apps for half a year on kiam v2 with zero issues. I only hit this after upgrading kiam to v3-rc1 and replacing the agent on a running node.

Steps to reproduce:

  • create the kiam agent pod (v3-rc1) with the iptables flag.
  • create an app pod with a role annotation.
  • check that the app pod works (in our case this is a Kinesis consumer, so the role is used continuously).
  • delete the kiam agent pod (so that the DaemonSet creates a new one, same version, v3-rc1).
  • the app pod stops getting the role, no longer intercepted by an agent (expected).
  • the agent starts, the iptables rule is added back.
  • the app pod still does not work and is still not intercepted (not expected; see the check sketched below the list). This persists for at least 30 minutes; I didn't check longer.
  • kill the app pod (the deployment creates a new one on the same node).
  • the new pod is now intercepted and works as expected.
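For anyone reproducing this, a quick way to see which role an app pod is actually being handed (the annotated role via the agent versus the bare node role) is to query the metadata path from inside the pod. A sketch, assuming the pod image ships curl; the pod name and namespace are placeholders:

```sh
# Prints the role name the pod's metadata calls resolve to: the role from the pod
# annotation when kiam interception works, the node's instance role when it doesn't.
kubectl exec -n default APP_POD -- \
  curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/
```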

A system-level rule should close all the gaps anyway.

Worth mentioning in the kiam docs?

@pingles
Contributor

pingles commented Dec 20, 2018

My question was why those application pods aren't obtaining new roles once the rule was re-added: the iptables update should apply immediately to all traffic. Yes, restarting the pods fixed it (because they'd restart and presumably recreate any AWS sessions within their client libs), but I'm not sure it's because the iptables interception wasn't taking effect.

As you say, a system-level rule should address it though.

Yep, it'd be good to mention it in the docs somewhere; maybe it's worth starting something around production deployments in a ./docs/PRODUCTION_OPS.md (pulling together the material on setting up IAM, why we run servers and agents on different nodes, inserting iptables rules on the host, etc.). @uswitch/cloud any other stuff worth mentioning?

@max-lobur
Contributor Author

max-lobur commented Dec 20, 2018

I will add more details once I have them.
The app keeps trying to re-authenticate but keeps getting 403, because I can see it is doing so with the node role. This is an interception issue: initially it was the app role, then it changed to the node role quickly (when I killed the agent), and it never changed back to the app role.

I'd love to see PRODUCTION_OPS, I can contribute too.

@bboreham

We had some issues on kiam agent restart: it appears that if a credentials request goes through to the real AWS metadata service while the iptables rule is absent, the pod will get invalid credentials with an expiry time of 6 hours, so it will not re-request via kiam (using the AWS Go client).

+1 on moving the rule out of pod startup to system level, and +10 on documenting this.

@pingles
Contributor

pingles commented Mar 20, 2019

> +1 on moving the rule out of pod startup to system level, and +10 on documenting this.

Could you maybe open a separate issue for just this? I.e. not removing the --iptables flag, but maybe creating a shell script that sets up the iptables rule and referencing it in the README? (Also totally open to better suggestions.) It would be great to have something in.
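Building on the rule sketched earlier in the thread, the kind of script meant here might look something like the following. It is only a sketch; the address, port, and interface are assumptions that would need to match the agent's flags:

```sh
#!/bin/sh
# Idempotently install the metadata-interception rule at node boot, so there is
# no window without interception while the kiam agent pod is being replaced.
set -eu

NODE_IP="${1:?usage: $0 <node-ip>}"   # address the kiam agent listens on
AGENT_PORT=8181                       # assumed agent listen port
POD_IF='cali+'                        # pod network interfaces (Calico here)

RULE="-d 169.254.169.254/32 -i ${POD_IF} -p tcp --dport 80 -j DNAT --to-destination ${NODE_IP}:${AGENT_PORT}"

# Add the rule only if it is not already present.
iptables -t nat -C PREROUTING ${RULE} 2>/dev/null \
  || iptables -t nat -A PREROUTING ${RULE}
```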

@bboreham

I (or one of my colleagues) will probably do a PR.

I'm unclear why you think it is a separate issue - the window during which requests are not intercepted seems to me to be the main issue.

@pingles
Contributor

pingles commented Mar 20, 2019

You're totally right, ignore me. Please get a PR up and we can close this 😄

@Nuru

Nuru commented Mar 26, 2019

I'm in favor of the --no-iptables-removal-on-termination option, even as a default, because it is failsafe from a security perspective as well as a possible solution to this problem. The promise of kiam is that it prevents the pod from assuming the instance role. If it fails, I would rather have no credentials available than be suddenly given access to the instance's credentials.

@Nuru

Nuru commented Mar 26, 2019

Plus, I would also like to see documentation (not just a link to Go code) on how to set up the iptables rules on the nodes outside of the kiam agent, particularly in a way that still allows the instance itself to access its instance role.

@pingles
Contributor

pingles commented Nov 6, 2019

> I'm in favor of the --no-iptables-removal-on-termination option, even as a default, because it is failsafe from a security perspective as well as a possible solution to this problem. The promise of kiam is that it prevents the pod from assuming the instance role. If it fails, I would rather have no credentials available than be suddenly given access to the instance's credentials.

Thankfully @theatrus contributed this in #253.

Agreed on documentation for configuring on nodes. We'd be super happy to have more docs/production operations notes contributed.

@project0

> I'm in favor of the --no-iptables-removal-on-termination option, even as a default, because it is failsafe from a security perspective as well as a possible solution to this problem. The promise of kiam is that it prevents the pod from assuming the instance role. If it fails, I would rather have no credentials available than be suddenly given access to the instance's credentials.

Well, that prevents clients from retrieving the wrong credentials, which is already better than before. But updating the agents can still cause short downtime and errors for some apps; I wish there were a good solution for updating with zero downtime. Does anyone have ideas about this?
Currently we have disabled rolling updates and replace all nodes to force the update, so they pick up the new DaemonSet config.
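For what it's worth, "disabling rolling updates" can be expressed directly on the DaemonSet with the OnDelete update strategy, so a new agent pod is only created once the old one is deleted (for example, when the node is replaced). A sketch, assuming the chart installs the agent as a DaemonSet named kiam-agent in kube-system:

```sh
# With OnDelete, a new spec is only rolled out to a node when its existing agent
# pod is deleted (or the node itself is replaced).
kubectl -n kube-system patch daemonset kiam-agent \
  --type merge -p '{"spec":{"updateStrategy":{"type":"OnDelete"}}}'
```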

@Nuru

Nuru commented Jul 29, 2020

> I'm in favor of the --no-iptables-removal-on-termination option, even as a default, because it is failsafe from a security perspective as well as a possible solution to this problem. The promise of kiam is that it prevents the pod from assuming the instance role. If it fails, I would rather have no credentials available than be suddenly given access to the instance's credentials.

> Well, that prevents clients from retrieving the wrong credentials, which is already better than before. But updating the agents can still cause short downtime and errors for some apps; I wish there were a good solution for updating with zero downtime. Does anyone have ideas about this?

The apps should be caching the credentials rather than requesting them on every call, and the credentials should be good for at least an hour. Credential caching is supported by all the AWS SDKs as far as I know, so the solution is to fix the apps. I know that is not necessarily practical, but I think the fact that credential caching is the expected behaviour means it is not worth trying to build a setup where the new agent comes up on the same instance as the old one and takes over the local metadata network connection before the old agent shuts down. That is not something Kubernetes supports.
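As a side note, it is easy to confirm how long the credentials a pod currently holds remain valid by reading the Expiration field from the metadata path. A diagnostic sketch, assuming the pod image ships curl and jq:

```sh
# ROLE is whatever name the first call returns (the annotated role when kiam
# interception is working); the second call shows when those credentials expire.
ROLE=$(curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/)
curl -s "http://169.254.169.254/latest/meta-data/iam/security-credentials/${ROLE}" \
  | jq -r .Expiration
```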
