Machine ID: Problems with the token join method #26885

Closed
strideynet opened this issue May 25, 2023 · 2 comments
Labels
c-ip Internal Customer Reference feature-request Used for new features in Teleport, improvements to current should be #enhancements machine-id

strideynet commented May 25, 2023

⚠️ This issue is a long read 🦦
Skip straight to solutions if you already have an awareness of tbot mechanics.

The current token join method, which is mostly used by customers whose environments are not served by delegated joining, has a number of problems that make it awkward to use. These problems are significantly worse in ephemeral environments like CI/CD and, combined, mean customers often implement Machine ID in less than ideal ways.

Background

Before a bot can interact with the Teleport cluster, it must perform an initial authentication. We refer to this initial authentication as "joining" or "registering". The way this initial authentication is completed is known as the "join method". There are two main classes of join method today for bots:

token join method

The first method introduced was token, where a ProvisionToken resource is created with spec.join_method == "token". In this case, the only thing the bot needs to join is the name of the ProvisionToken resource - this makes the name of this resource a secret.

This presented a significant security risk: if this secret were leaked, a bad actor could connect their own instance of tbot anywhere and would have access to all the resources the bot had access to. To mitigate this risk, when the token join method is used, the ProvisionToken is consumed and cannot be used again.
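
The single-use behaviour can be illustrated with a minimal sketch. This is not Teleport's implementation: `TokenStore`, `JoinError` and the in-memory dictionary are hypothetical stand-ins, and the only property taken from the description above is that a successful join consumes the token.

```python
from dataclasses import dataclass, field


class JoinError(Exception):
    """Raised when a join attempt is rejected."""


@dataclass
class TokenStore:
    # Hypothetical in-memory stand-in for the ProvisionToken backend:
    # maps the (secret) token name to the bot user it grants access to.
    tokens: dict[str, str] = field(default_factory=dict)

    def join(self, token_name: str) -> str:
        # pop() removes the token as it is looked up, so a second join
        # with the same name finds nothing and is rejected.
        bot_user = self.tokens.pop(token_name, None)
        if bot_user is None:
            raise JoinError("unknown or already consumed token")
        return bot_user
```

A leaked token name is therefore only useful to an attacker if they manage to join before the legitimate bot does.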

As the bot is issued short-lived certificates, this presents a challenge in how the bot is able to renew its identity. To work around this, certificates generated when joining with the token join method are marked as renewable. This allows that certificate to be used in a process known as renewal to request a certificate with an expiry date further in the future. In an attempt to mitigate the danger of a stolen certificate, we use a "generation counter". This stores a generation count on the bot user and within the issued certificates, which is incremented on each renewal. If a mismatch is detected when the bot attempts to renew, it's possible that the certificate has been stolen and a bad actor is trying to renew it, so the bot is locked out.
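
A rough sketch of the generation counter check follows. The `AuthServer`, `Cert` and `BotLocked` names are hypothetical and the real server-side logic will differ; the sketch only assumes what is described above: the counter lives on the bot user and in the certificate, it is incremented on each renewal, and a mismatch locks the bot.

```python
from dataclasses import dataclass


class BotLocked(Exception):
    """Raised when a renewal is refused because the bot is locked."""


@dataclass(frozen=True)
class Cert:
    bot_user: str
    generation: int  # copy of the bot user's counter at issue time


class AuthServer:
    def __init__(self) -> None:
        self.generation: dict[str, int] = {}  # bot user -> current generation
        self.locked: set[str] = set()

    def join(self, bot_user: str) -> Cert:
        # The first certificate starts the count at 1.
        self.generation[bot_user] = 1
        return Cert(bot_user, 1)

    def renew(self, cert: Cert) -> Cert:
        if cert.bot_user in self.locked:
            raise BotLocked(cert.bot_user)
        if cert.generation != self.generation[cert.bot_user]:
            # Someone already renewed with this certificate's generation,
            # so a copy may have been stolen: lock the bot.
            self.locked.add(cert.bot_user)
            raise BotLocked(cert.bot_user)
        self.generation[cert.bot_user] += 1
        return Cert(cert.bot_user, self.generation[cert.bot_user])
```

Note that the server cannot tell which party holds the stolen copy, so whoever presents the stale generation triggers a lock that affects the legitimate bot as well.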

Delegated join methods

Later, delegated join methods were introduced. Delegated joining makes use of an identity issued to the workload by another identity provider (e.g. the platform it is running on) in order to join. Delegated joining (e.g. IAM, GitHub) currently offers the following benefits:

  • Richer access control based on attributes of the third party issued workload identity (e.g only allow CI runs in this repository to authenticate)
  • Richer auditing by including attributes of third party issued workload identities.
  • Reusability. Since ProvisionTokens for Delegated Joining are not secret, we can allow them to be reused. This means it can be used for renewing an existing bot's identity without the need for "generation counters" (and hence persisted state) and it also means multiple concurrent instances of tbot can use the same ProvisionToken to join and be linked to the same bot user. This makes them ideal for ephemeral environments like CI/CD.
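
To make the attribute-based access control concrete, here is a minimal sketch of matching a workload identity against allow rules. The `is_allowed` function, the rule shape and the claim names are assumptions for illustration only, not the actual ProvisionToken schema.

```python
def is_allowed(claims: dict[str, str], allow_rules: list[dict[str, str]]) -> bool:
    """Return True if the claims satisfy at least one allow rule.

    A rule matches when every attribute it names is present in the
    claims with the same value. Empty rules are ignored so that they
    cannot accidentally match every workload.
    """
    return any(
        bool(rule) and all(claims.get(key) == value for key, value in rule.items())
        for rule in allow_rules
    )
```

For example, a rule such as `{"repository": "example/app"}` (a hypothetical claim name) would admit any CI run from that repository, and because the rule contains no secret it can be reused by many concurrent instances.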

Unfortunately, many workloads run on platforms where delegated joining is not possible (this is especially true in on-premises environments) - or on platforms whose third-party identity provider we do not yet support.

Problems with token join method

The current implementation of the token join method, as described in the background, presents a number of problems.

Single bot instance per bot user

Some customers have many bot instances that need the same privileges as one another. Currently, they need to create a bot user and join token for each of these instances. If they need to grant access to another role to their bots, they have to do this for each bot user. At scale, this becomes difficult to manage.

This becomes even more painful in ephemeral environments as once an instance spins down, the bot user created for that instance will not be deleted.

Workarounds

In some cases like this, customers have been running a single bot instance which pushes the output credentials into a single secret store that all consumers of a Teleport identity can pull from. This creates a single point of failure (SPOF) on the bot instance as well as on the secret store. It also discourages users from using more tightly scoped certificates.

Some more advanced customers may find themselves building an automation using the Teleport API to manage the creation of bot users and join tokens, and then providing this join token to the bot instance on creation.

Single use join tokens

Currently, each join token can only be used once. This is inconvenient in environments where many bot instances need to be stood up or where the bot instances are stood up in response to some trigger and are ephemeral (e.g CI/CD).

Workarounds

The same workarounds presented in "Single bot instance per bot user" can be applied here.

Potential solutions

Allow multiple bot instances to join to one bot user

Upon joining, the certificate would also be encoded with a bot instance ID extension as well as the generation counter extension. This generation counter would be specific to that bot instance ID. This ID could be a randomly generated UUID upon joining.

The generation attribute on the bot user would be replaced with a map of bot-instance-id to generation count, or, we would introduce an entirely new resource, the "bot instance" to hold this generation counter. This could be extended in future to include other information about that bot instance (e.g the time it last renewed, acting as a form of a heartbeat).
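
A sketch of the map-based variant is shown below. `BotUser`, `GenerationMismatch` and the method names are hypothetical; the sketch only assumes what is proposed above: a random UUID per instance assigned at join time, and a generation counter tracked per bot instance ID rather than per bot user.

```python
import uuid


class GenerationMismatch(Exception):
    """Raised when a renewal presents a stale or unknown generation."""


class BotUser:
    def __init__(self, name: str) -> None:
        self.name = name
        # bot instance ID -> current generation for that instance
        self.instance_generations: dict[str, int] = {}

    def join(self) -> tuple[str, int]:
        # Each join creates a new instance with its own counter.
        instance_id = str(uuid.uuid4())
        self.instance_generations[instance_id] = 1
        return instance_id, 1

    def renew(self, instance_id: str, generation: int) -> tuple[str, int]:
        # Only the counter of the renewing instance is checked and bumped.
        if self.instance_generations.get(instance_id) != generation:
            raise GenerationMismatch(instance_id)
        self.instance_generations[instance_id] = generation + 1
        return instance_id, generation + 1
```

Because each instance advances its own counter, one instance renewing does not invalidate the certificates held by its siblings.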

We could also choose to remove, or make optional, the behaviour of the ProvisionToken being consumed by the join. My gut feeling is that for backwards compatibility reasons we'd want to make this optional, perhaps with an additional field called allowMultipleBotJoins.

Key characteristics:

  • Still has the "generation counter" mechanism's potential bugginess
  • Allows multiple bot instances per bot user
  • Deleting the ProvisionToken would not affect already joined bots - this is not consistent with how delegated joining bots behave. This means we can continue to support the immediate disposal of these long-lived secrets upon use.

Additional complexities:

  • Locking affects all instances associated with a user unless we introduce per bot-instance locking. Is this desirable?
  • If a user configures their provision token to not be consumed, any attacker with enough access to the host to steal the bot certs will most likely have enough access to extract the provision token secret from the bot - unless the user has entered this secret once and then removed traces of it from the system once the bot has joined for the first time. Where the secret remains on the host, the generation counter is effectively purposeless.

Introduce non-renewable re-usable token join method

Introduce a new field to control this behaviour, or a new join method name. When set:

  • The ProvisionToken is not consumed on use
  • The certificates issued to the bot are not marked renewable. To renew, the bot must use the ProvisionToken to join again, in a similar fashion to how delegated joining works.
  • As the certificates are not renewable, the generation counter mechanism is not used (as with delegated joining)
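
The behaviour in the list above can be sketched as follows. `ReusableTokenAuth`, `Cert` and `JoinError` are hypothetical names, and the in-memory dictionary stands in for the ProvisionToken backend; nothing here reflects an existing Teleport API.

```python
from dataclasses import dataclass


class JoinError(Exception):
    """Raised when a join attempt is rejected."""


@dataclass(frozen=True)
class Cert:
    bot_user: str
    renewable: bool


class ReusableTokenAuth:
    def __init__(self, tokens: dict[str, str]) -> None:
        # token name -> bot user; copied so callers keep their own dict
        self.tokens = dict(tokens)

    def join(self, token_name: str) -> Cert:
        # The token is looked up but not removed, so it can be used again.
        bot_user = self.tokens.get(token_name)
        if bot_user is None:
            raise JoinError("unknown token")
        # Certificates are never renewable: "renewing" is another join.
        return Cert(bot_user, renewable=False)

    def delete_token(self, token_name: str) -> None:
        self.tokens.pop(token_name, None)
```

In this model there is no per-bot state to track, and deleting the token is the only way to stop bots from obtaining fresh certificates.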

Key characteristics:

  • Avoids the "generation counter" mechanism's potential bugginess
  • Allows multiple bot instances per bot user
  • Deleting the ProvisionToken would lead to bots being unable to renew - this is consistent with how delegated joining bots behave.
  • Encourages the existence of the dreaded long-lived secret.

strideynet commented

So in terms of my personal thoughts on this, I'm attracted to the first option "Allow multiple bot instances to join to one bot user".

This is for a number of reasons:

  • It avoids introducing what is essentially a new join method. This additional join method would be extremely similar to the traditional token joining, but different enough to cause confusion.
  • It extends the existing behaviour on the server side, and would not require any modifications to the client.
  • It lays the path for a "BotInstance" resource on the server which could provide better insight.
  • It doesn't create a long-lived secret that when rotated breaks all existing bots, which discourages rotation.
  • Whilst the generation counter mechanism does currently cause some instability, there are a few things we can do to mitigate that.

I do have a few doubts on it though:

  • Is the idea of multiple bot instances connected to one bot user confusing, or intuitive?
  • Are we doubling down on the generation counter when other methods of preventing private key theft may be more effective (e.g. TPM, HSM)? Do all customers' environments /need/ the extra security provided by the generation counter? It may be that the customers who need that extra security would be better served by TPM/HSM integration, and that the customers who do not need it would not miss the generation counter.
  • Are users just going to ignore best practice and keep a long-lived join token around anyway? To what extent does this make the generation counter redundant? And to what extent is it our responsibility if users decide to ignore the best practice? Perhaps we'd need to enforce token expiry for these tokens.

strideynet commented

Closing this in favour of the following tickets which have planned an explicit solution for this:
