-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFD 0162: Machine ID Fleet Management, BotInstance
resource and other token
joining improvements
#36510
base: master
Are you sure you want to change the base?
RFD 0162: Machine ID Fleet Management, BotInstance
resource and other token
joining improvements
#36510
Conversation
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
token
joining improvementsBotInstance
resource and other token
joining improvements
The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with |
@timothyb89 I did some thinking over the weekend and I stumbled across a fairly big decision I think we need to make. Should we either:
Thoughts on maintaining existing behaviour:
Thoughts on switching to renewal mechanism:
For me, I think the key point to this is the UX around deleting a token: is it better for the bots to then fail to renew or for them to continue to function? There's no requirement that we necessarily make this decision now. We could pursue a strategy of maintaining the existing behaviour and changing this behaviour at some point in the future, but I think from the perspective of designing the bot instances, it would be good to consider this and at least explicitly decide we will defer changing the behaviour. |
### Locking of Individual Bot Instances | ||
|
||
Currently, it's only possible to lock out an entire Bot user. This means that | ||
when managing a large fleet, it would not be able to lock out a specific host | ||
that had been compromised. This is likely to be a major friction point for those | ||
deploying a large number of Bot instances. | ||
|
||
It also increases the significance of the fragility of the generation counter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The downside here is that the bot remains aware of the secret token it used to join, and that token is meant to be reused to join multiple bots. If we just lock one bot instance, an attacker could purge the bot data and join as a new instance to circumvent the lock.
I think to lock individual instances, we'd need some way to definitively associate a bot with a host. This category of (token based joining) issue always seems to come back to TPMs 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I thought about here is an option to "Lock all bot instances that have authenticated with X token". Seems fairly feasible if we store the hash of the token used on the first join or similar.
I think this is deffo something for next quarter though.
As we now have a way to track the generation for a specific Bot instance, we | ||
can allow multiple Bot instances to be associated with a single Bot. This | ||
also means that the token no longer needs to be consumed on a join. | ||
|
||
Eventually, we may wish to add a way to specify a number of joins which can | ||
occur with a token. This provides a way to easily control the lifetime of a | ||
token when deploying to a fleet of a pre-known size. | ||
|
||
The renewal logic will need to be adjusted to read and update the generation | ||
counter from the BotInstance rather than the Bot user. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit conflicted about launching multi-use tokens without also adding options to limit joins. While they're valid, tokens are much more security sensitive than certificates, and I'm not totally convinced that the improved visibility from BotInstances
makes up for the possibility of malicious token reuse.
The only easy limits I see are TTLs and a use counter, and only TTLs exist today.
I assume we'll at least keep the default 1h token TTL, but we don't meaningfully restrict the TTL beyond that - you can make a token with a 9999d
TTL today. I think users will be incentivized to raise this TTL, possibly indefinitely. I can imagine these rough classes of use case:
- On day 0, they'll be deploying a potentially large fleet of bots which may take more than an hour to initialize.
- On day 1+, they'll be continuing to deploy new bots (e.g. as they dynamically scale their infrastructure), or repairing existing bots that went offline beyond the renewal period.
Today, there isn't much incentive for users to select a long token TTL because they're still just one-time-use; you'll need to make a new token (or many new tokens) regardless, so you're forced to automate token creation if you want to deploy bots at scale.
With reusable tokens, though, long TTLs provide a tangible benefit. They give you time to deploy your initial bot fleet ... and let you scale without any extra work, and let you easily fix broken bots...
Anyway, I think we need to try and balance out the incentives here:
- We should include a join count limit from the start, and arguably default it to 1 so users are making an explicit decision to allow multiple bots to join. This default would also avoid a potentially surprising change in behavior for existing users.
- We should explicitly address that "day 1+" case and simplify joining new instances to / creating new tokens for existing bots, to facilitate adding new bots later or repairing broken ones. Maybe something like the now ancient "bot refresh" feature request and let multiple tokens be issued to one bot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit conflicted about launching multi-use tokens without also adding options to limit joins.
I agree some sort of limiting factor on a number of joins is likely going to be necessary out of the box.
We should include a join count limit from the start, and arguably default it to 1 so users are making an explicit decision to allow multiple bots to join. This default would also avoid a potentially surprising change in behavior for existing users.
I like this idea - especially using a default of 1 to avoid a big change in behaviour.
One use-case I think we can't neglect is CI platforms we do not support and can not support due to their limitations. In this case, users may well still want a token that is "quite" long-lived and supports a large number of joins.
We'd still want them to rotate though... one thing I was thinking about to make this easier would be to produce a script or command which can be executed on a cron job that generates a new token for a named bot and then calls a user-supplied script which should connect to their CI platform/secret store and updates the token. This may be a nice answer for those who are backed into a corner by a rudimentary CI platform.
This fits well into the "bot refresh" idea, some sort of easyish CLI command where you specify an existing bot and can just generate a new token.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One use-case I think we can't neglect is CI platforms we do not support and can not support due to their limitations. In this case, users may well still want a token that is "quite" long-lived and supports a large number of joins.
I agree. I almost wonder if there's a middle ground here where we might be able to force users to make a security decision:
- The
token
join method sets a sane cap on the TTL of a join token to proactively encourage some variety of automation, say 7 days or something. I don't think we need to cap the number of uses as long as there's a time bound.- It might be useful to write up a cron job example, maybe a guide on using a GHA or k8s cronjob to rotate a Jenkins joining secret or something.
- We then (eventually) introduce a new
token-insecure
that otherwise behaves identically but has no restrictions at all on TTL or number of joins (basically this other thread down the page). This method might also opt out of the generation counter.
I think we could argue enforcing a maximum token TTL is an acceptable breaking change given how narrowly useful long TTL'd tokens are today, and then introduce token-insecure
at some point in the future? I can't imagine any users today are making practical use of extremely long TTL bot tokens that would have their use cases broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little cautious about setting a cap on the TTL, if we were totally greenfield I'd agree but:
- Today, there's no cap. I agree with you and doubt anyone is relying on this behaviour - so perhaps we can discount this.
- We'd need to have logic that ignores the cap if it's for Agent joining
- We'd need to have logic that ignores the cap if the join method is not
token
.
I'd perhaps be less cautious about the TTL cap if we were approaching this at a later date when we add token-insecure
, imho, the whole token type could do with a V3 and that might be a good opportunity to do so. As an aside, I'd love to make it so token name is not the actual secret value, e.g a token
join method token has a UUID-esque name or a user provided name, and then a field which specifies the actual secret value. This would make audit logging etc much easier because we wouldn't need to "hide" the name.
I'm not overly decided on this and I think I need to ponder on it a little longer to figure out my full thoughts. It's almost out of scope for this RFD - so perhaps we don't need to make a decision on this immediately and we can proceed with the BotInstance work since I don't think this has any impact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my hopefully more in scope recommendation for this RFD is: let's enforce a soft client-side limit in tctl
and/or warn that it's a bad idea and we may restrict token length in the future. I figure that if a user wants to create an extremely long-lived token, they can write up the YAML themselves and tctl create -f
it.
I think warnings or soft errors today help keep the door open for us to lock this down more in the future while making sure use cases that need very long lived tokens are at least possible.
Unrelatedly, I think the tctl bot refresh
case is covered here already: just create a new token for the existing bot with tctl tokens add --type=bot --bot foo
.
This adds a new section on the security requirements for the persistent bot identity, alongside an updated preference for UUID identifiers. This also includes revisions to the "changes to the `token` join method" to include join count limits by default and set sane defaults to encourage short lived tokens. Lastly, various other small changes were applied, like removing the implementation details for bot instance names in favor of an explicitly specified bot name and instance identifier.
// The public key of the Bot instance. | ||
// When authenticating a Bot instance, the full public key must be compared | ||
// rather than just the fingerprint to mitigate pre-image attacks. | ||
bytes public_key = 2; | ||
// The fingerprint of the public key of the Bot instance. | ||
string fingerprint = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've so far kept the public key and fingerprint fields even though we're adding a UUID. Do we want to keep these? I don't think they're strictly required, but may still be useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's useful info but if it's placed at the top level, we don't have a context point for when the value was this. We could move this into BotInstanceStatusAuthentication
and that way we'll properly handle the key changing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, will move it into that message.
I've done a pass to apply some of the discussions we've more or less found a consensus on, mainly switching to UUID identifiers and lightly touching on how we'll verify them. I've also suggested some tweaks to the |
// The public key of the Bot instance. | ||
// When authenticating a Bot instance, the full public key must be compared | ||
// rather than just the fingerprint to mitigate pre-image attacks. | ||
bytes public_key = 2; | ||
// The fingerprint of the public key of the Bot instance. | ||
string fingerprint = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's useful info but if it's placed at the top level, we don't have a context point for when the value was this. We could move this into BotInstanceStatusAuthentication
and that way we'll properly handle the key changing.
then the second-oldest entry will be removed. This prevents growth without | ||
bounds but also ensures that the original record is retained. | ||
|
||
In addition, the TTL of the BotInstance resource will be extended. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition, the TTL of the BotInstance resource will be extended. | |
In addition, the TTL of the BotInstance resource will be extended to cover the validity period of the issued certificate, plus a short additional time to allow for some imprecision. |
…-token-join-method-bot-instance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I raised this PR initially but Tim has taken it over - so I can't Approve it. Please take this as my "approval".
|
||
If the heartbeat fails, then `tbot` should retry on an exponential backoff. | ||
|
||
##### Alternative: Submit Heartbeat Data on Join/Renew |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before merging this, it would be helpful if each Alternative
that we present but ultimately do not select lists why it was not chosen.
long TTLs will need to be explicitly specified. | ||
|
||
We may additionally want to put hurdles in the way of things like extremely long | ||
token TTLs. There are legitimate low-security use cases for these, but we could |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious what Sasha thinks about this.
Extremely long join token TTLs basically become a shared secret like an API key, and we built Machine ID to avoid long-lived shared secrets..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, and would ideally like to only support this by way of an explicitly insecure token-insecure
join method at some point in the future (see previous thread: #36510 (comment)). As we discussed in that thread, adding a TTL limit would technically be a breaking change, and there are use cases we may still want to support.
For context on why, today's status quo is problematic because it seems like we properly support token joining, but it's kind of an illusion. It works great in a Getting Started guide on your dev box, and then breaks as soon as you try it in production because one of your nodes went offline for more than an hour. Production use really requires full automation (e.g. ansible or terraform detecting dead bots and reissuing tokens), and I'm not aware of any customers who've actually put in the effort to do it.
The changes proposed here (multiple uses per token, multiple tokens per bot) simplify the automation requirements a lot, but it's still a heavy lift, and I'm not sure it'll ever be worthwhile for a lot of use cases. There may be value in giving customers in this niche a way to still make use of Teleport if they (explicitly) accept the security implications of a long lived token; they still benefit from our RBAC, audit logging, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Michael is working on the display and creation of join tokens in the web UI, and we actually just had a conversation with Sasha about token creation during the Design and Product Office Hours yesterday. In the meeting, we agreed to show a danger state for any token
join method with an expiration >24 hours. We also agreed to make it impossible to create long-lived tokens through the web UI.
(What we discussed doesn't affect the underlying mechanics of the token
join method, but it does put some guard rails for their creation in the web UI and affects how they will be displayed, regardless of how they were created.)
A PostHog event should be emitted for each BotInstance heartbeat. This will | ||
allow us to track active bots in a similar way to how we track active agents. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we included it in the aggregated tp.resource.heartbeat
event rather than creating a new event for each heartbeat?
when managing a large fleet, it would not be able to lock out a specific host | ||
that had been compromised. This is likely to be a major friction point for those |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can leave this out of scope but we should explore it in the future.
We can already lock any heartbeating agent based on its Host ID today, so it feels like this should extend to bots without too much trouble.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we should add something here, but it's trivial to circumvent when the join token can just be reused, @strideynet had some ideas here: https://github.com/gravitational/teleport/pull/36510/files/61e17365cac7aaec059de86d470a57a6b612eebd#r1602421886
Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>
Traditionally, bots using non-`token` join methods "renew" their certs by rejoining, taking advantage of their repeatable join methods to avoid relying on `token` rejoin logic. However, as part of [RFD 162], we need a way for rejoining bots to assert an existing identity (if any) even when rejoining "from scratch". When paired with #43732, this change allows bots to use an authenticated auth client for all "initial" joins. If present, an existing `BotInstanceID` is extracted from the client identity and used to associate the bot with its previous identity. As certain join methods (TPM, Azure) are initially handled by the `JoinService` using gRPC, this additionally adds a new `BotInstanceID` field to `RegisterUsingTokenRequest` so the `JoinService` can verify the client identity and extract the `BotInstanceID` if available. Like `RemoteAddr`, this value is only trusted when set by the proxy and should be ignored in all other cases. [RFD 162]: #36510
* Allow bots to provide client cert authentication when rejoining Traditionally, bots using non-`token` join methods "renew" their certs by rejoining, taking advantage of their repeatable join methods to avoid relying on `token` rejoin logic. However, as part of [RFD 162], we need a way for rejoining bots to assert an existing identity (if any) even when rejoining "from scratch". When paired with #43732, this change allows bots to use an authenticated auth client for all "initial" joins. If present, an existing `BotInstanceID` is extracted from the client identity and used to associate the bot with its previous identity. As certain join methods (TPM, Azure) are initially handled by the `JoinService` using gRPC, this additionally adds a new `BotInstanceID` field to `RegisterUsingTokenRequest` so the `JoinService` can verify the client identity and extract the `BotInstanceID` if available. Like `RemoteAddr`, this value is only trusted when set by the proxy and should be ignored in all other cases. [RFD 162]: #36510 * Add tests for authenticated bot rejoining * Create a new bot instance if the requested instance is not found Also, includes a new check in for this behavior in `TestRegisterBotWithInvalidInstanceID`. * Fix imports * Update lib/auth/auth_with_roles.go Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com> * Fix `cmp` import --------- Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>
* Allow bots to provide client cert authentication when rejoining Traditionally, bots using non-`token` join methods "renew" their certs by rejoining, taking advantage of their repeatable join methods to avoid relying on `token` rejoin logic. However, as part of [RFD 162], we need a way for rejoining bots to assert an existing identity (if any) even when rejoining "from scratch". When paired with #43732, this change allows bots to use an authenticated auth client for all "initial" joins. If present, an existing `BotInstanceID` is extracted from the client identity and used to associate the bot with its previous identity. As certain join methods (TPM, Azure) are initially handled by the `JoinService` using gRPC, this additionally adds a new `BotInstanceID` field to `RegisterUsingTokenRequest` so the `JoinService` can verify the client identity and extract the `BotInstanceID` if available. Like `RemoteAddr`, this value is only trusted when set by the proxy and should be ignored in all other cases. [RFD 162]: #36510 * Add tests for authenticated bot rejoining * Create a new bot instance if the requested instance is not found Also, includes a new check in for this behavior in `TestRegisterBotWithInvalidInstanceID`. * Fix imports * Update lib/auth/auth_with_roles.go Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com> * Fix `cmp` import --------- Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>
* Allow bots to provide client cert authentication when rejoining Traditionally, bots using non-`token` join methods "renew" their certs by rejoining, taking advantage of their repeatable join methods to avoid relying on `token` rejoin logic. However, as part of [RFD 162], we need a way for rejoining bots to assert an existing identity (if any) even when rejoining "from scratch". When paired with #43732, this change allows bots to use an authenticated auth client for all "initial" joins. If present, an existing `BotInstanceID` is extracted from the client identity and used to associate the bot with its previous identity. As certain join methods (TPM, Azure) are initially handled by the `JoinService` using gRPC, this additionally adds a new `BotInstanceID` field to `RegisterUsingTokenRequest` so the `JoinService` can verify the client identity and extract the `BotInstanceID` if available. Like `RemoteAddr`, this value is only trusted when set by the proxy and should be ignored in all other cases. [RFD 162]: #36510 * Add tests for authenticated bot rejoining * Create a new bot instance if the requested instance is not found Also, includes a new check in for this behavior in `TestRegisterBotWithInvalidInstanceID`. * Fix imports * Update lib/auth/auth_with_roles.go Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com> * Fix `cmp` import --------- Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>
* Add basic CRUD for BotInstanceService (#42899) * Add basic CRUD for BotInstanceService This adds an early backend for the bot instance service, with resource definitions and basic CRUD operations. * Update api/proto/teleport/machineid/v1/bot_instance.proto Co-authored-by: Noah Stride <noah.stride@goteleport.com> * Further work on BotInstance boilerplate Among other things, this adds event definitions, RBAC and presets, and client definitions. This also includes a few proto changes, namely moving bot name and instance ID from .Status to .Spec so metadata can be uniformly generated. This also adds tests for the backend service showing functional CRUD. * gRPC tests; fix filtering; fix import lints; misc cleanup * Consistently use `instance_id` in gRPC fields * Add tctl support for bot_instance resource * protos update * Various typo fixes * Fix more comment typos * More doc comments and minor test tweaks --------- Co-authored-by: Noah Stride <noah.stride@goteleport.com> * Create Bot Instances during initial bot join (#43577) * Create Bot Instances during initial bot join This creates new instances for bots when they initially join the cluster, and persists instance IDs in new certificate fields on join and during renewal. Note that this does not yet handle instance reuse for non-token join methods. Additionally, bot instance creation is locked behind a `BOT_INSTANCE_EXPERIMENT` flag; it must be set to `1` to enable creation. * Proto cleanup, and update bot auth records on cert renewal This makes various (admittedly breaking) protobuf changes, including removing the TTL field (calculating resource expiry based on cert requests), removing public key fingerprints, and changing the data type of the generation counter to match the preexisting internal datatype. These changes _should_ be safe as no consumers of the proto API currently exist. Additionally, this also updates bot authentications on renewal. * Fix proto lints * Fix misleading doc comment in the bot instance experiment * Create bot instances for old bots on join; other fixes This now creates bot instances for bots whose certs are missing the BotInstanceID field. Additionally, it fixes two related bugs: expiration dates are extended on renewal, the generated UUID is properly appended to certs on initial join, and instances are only created or updated when the experiment is enabled. * Add a minimal test for bot instance creation on initial join * Validate bot instance state in generation counter checks * Remove outdated TODO comment and fix test lints * Add an expiration change check to the generation test * Machine ID: Provide existing identity when rejoining as part of renewals (#43732) * Allow existing bot identity to be used for token based renewals * Add test for register with auth client * Fix tests * Fix test * Fix partial assignment to interface * Improve loggging * Fix spelling * Fix missing import * Fix built gRPC * Machine ID: Bot Instance heartbeating (#43463) * Add basic CRUD for BotInstanceService This adds an early backend for the bot instance service, with resource definitions and basic CRUD operations. * Update api/proto/teleport/machineid/v1/bot_instance.proto Co-authored-by: Noah Stride <noah.stride@goteleport.com> * Further work on BotInstance boilerplate Among other things, this adds event definitions, RBAC and presets, and client definitions. This also includes a few proto changes, namely moving bot name and instance ID from .Status to .Spec so metadata can be uniformly generated. This also adds tests for the backend service showing functional CRUD. * gRPC tests; fix filtering; fix import lints; misc cleanup * Consistently use `instance_id` in gRPC fields * Add tctl support for bot_instance resource * protos update * Various typo fixes * Fix more comment typos * More doc comments and minor test tweaks * Add SubmitHeartbeat RPC implementation * Fix log keys * Update to use proper BotInstanceID field in identiy * Add rough outline of heartbeat service --------- Co-authored-by: Tim Buckley <tim@goteleport.com> * Allow bots to provide client cert authentication when rejoining (#44257) * Allow bots to provide client cert authentication when rejoining Traditionally, bots using non-`token` join methods "renew" their certs by rejoining, taking advantage of their repeatable join methods to avoid relying on `token` rejoin logic. However, as part of [RFD 162], we need a way for rejoining bots to assert an existing identity (if any) even when rejoining "from scratch". When paired with #43732, this change allows bots to use an authenticated auth client for all "initial" joins. If present, an existing `BotInstanceID` is extracted from the client identity and used to associate the bot with its previous identity. As certain join methods (TPM, Azure) are initially handled by the `JoinService` using gRPC, this additionally adds a new `BotInstanceID` field to `RegisterUsingTokenRequest` so the `JoinService` can verify the client identity and extract the `BotInstanceID` if available. Like `RemoteAddr`, this value is only trusted when set by the proxy and should be ignored in all other cases. [RFD 162]: #36510 * Add tests for authenticated bot rejoining * Create a new bot instance if the requested instance is not found Also, includes a new check in for this behavior in `TestRegisterBotWithInvalidInstanceID`. * Fix imports * Update lib/auth/auth_with_roles.go Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com> * Fix `cmp` import --------- Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com> * Machine ID: Include Bot Instance ID in audit logs (#43786) * Create Bot Instances during initial bot join This creates new instances for bots when they initially join the cluster, and persists instance IDs in new certificate fields on join and during renewal. Note that this does not yet handle instance reuse for non-token join methods. Additionally, bot instance creation is locked behind a `BOT_INSTANCE_EXPERIMENT` flag; it must be set to `1` to enable creation. * Proto cleanup, and update bot auth records on cert renewal This makes various (admittedly breaking) protobuf changes, including removing the TTL field (calculating resource expiry based on cert requests), removing public key fingerprints, and changing the data type of the generation counter to match the preexisting internal datatype. These changes _should_ be safe as no consumers of the proto API currently exist. Additionally, this also updates bot authentications on renewal. * Fix proto lints * Fix misleading doc comment in the bot instance experiment * Create bot instances for old bots on join; other fixes This now creates bot instances for bots whose certs are missing the BotInstanceID field. Additionally, it fixes two related bugs: expiration dates are extended on renewal, the generated UUID is properly appended to certs on initial join, and instances are only created or updated when the experiment is enabled. * Add a minimal test for bot instance creation on initial join * Validate bot instance state in generation counter checks * Remove outdated TODO comment and fix test lints * Add an expiration change check to the generation test * Add BotInstanceID to audit events * Fix borked conflict resolution * Fix further borked conflict resolution * Add test for cert create/join --------- Co-authored-by: Tim Buckley <tim@goteleport.com> * Show bot instance ID in tbot log output (#44578) This tweaks the "fetched new bot identity" message to show the bot name and instance ID as embedded in the bot's certificate. Example: ``` 2024-07-23T15:51:20-06:00 INFO [TBOT:IDEN] Fetched new bot identity identity:tpm-test, id=5a2865d3-d3dc-4eaa-853c-5377a1fe83f6 | valid: after=2024-07-23T21:50:20Z, before=2024-07-23T21:56:19Z, duration=5m59s | kind=tls, renewable=false, disallow-reissue=false, roles=[bot-tpm-test], principals=[-teleport-internal-join], generation=0 tbot/service_bot_identity.go:223 ``` * Machine ID: Validate generation counter using bot instances (#44583) * Show bot instance ID in tbot log output This tweaks the "fetched new bot identity" message to show the bot name and instance ID as embedded in the bot's certificate. Example: ``` 2024-07-23T15:51:20-06:00 INFO [TBOT:IDEN] Fetched new bot identity identity:tpm-test, id=5a2865d3-d3dc-4eaa-853c-5377a1fe83f6 | valid: after=2024-07-23T21:50:20Z, before=2024-07-23T21:56:19Z, duration=5m59s | kind=tls, renewable=false, disallow-reissue=false, roles=[bot-tpm-test], principals=[-teleport-internal-join], generation=0 tbot/service_bot_identity.go:223 ``` * Machine ID: Validate generation counter using bot instances This change moves the generation counter from bot user labels to a field in `BotInstanceStatusAuthentication`. This allows for tracking of individual generations regardless of join method, allows for multiple `token`-type joins per bot, and opens the door for multi-use `token`-type tokens in the future. For now, the new behavior remains behind the `BOT_INSTANCE_EXPERIMENT` feature flag, and generation counter handling is largely unchanged when it is unset. * Update legacy generation counter for downgrade compatibility This updates the legacy user label generation counter to support some downgrade compatibility if a user either disables the bot instance experiment or downgrades to a Teleport version predating it. Also includes some logger cleanup, and adds a warning when a generation counter mismatch is detected but not enforced. * Set initial generation value for legacy mode `validateGenerationLabel` expects a nonzero generation counter value for renewable certs to populate the counter, so this provides it again. * Add new generation counter tests This adds various new tests to validate new per-instance generation counter behavior. Also includes backcompat fixes to import old generation counters, and a small fix to ensure downgrades don't leave bots in a bad state. * Fix imports * Tweak TODO message * Fix HostCAs not being returned during bot renewal (#44832) * Machine ID: Remove `BOT_INSTANCE_EXPERIMENT` feature flag (#44904) This removes the `BOT_INSTANCE_EXPERIMENT` feature flag to enable the feature universally. It also adjusts some comments to add deprecation notices. * Machine ID: Only set legacy generation counter for renewable joins (#44993) We unconditionally committed the generation counter to the legacy user label to allow for downgrade compatibility. This is not valid for non-token join methods as previous Teleport versions will attempt to check the generation counter whenever the label is nonzero, and because these older versions don't support authenticated rejoining, will never be able to compare to the client's actual generation value. (Plus, it was hard-coded to zero.) This disables the legacy generation counter for non-renewable join methods. This should not create a later upgrade problem as bots will lose their instance ID upon their first rejoin after downgrading, and so will be issued a fresh ID and generation counter if and when the cluster is upgraded again. * Remove unused function sshPublicKeyToPKIXPEM Left-over function from backport * Always use PEM PKIX DER representation of public keys for BotInstances (#43845) * Always use PEM PKIX DER representation of public keys for BotInstances * Prefer using `keys.MarshalPublicKey` * Remove mistakenly included TestVerifyALPNUpgradedConn The recent merge included an unrelated unit test. This removes it. --------- Co-authored-by: Noah Stride <noah.stride@goteleport.com> Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>
I've updated this to be marked implemented, @zmb3 / @timothyb89 PTAL so we can merge this RFD down. |
Proposes solutions for #26885
RFD to improve the UX of
token
joining and managements of large fleets oftbot
. Introduces a BotInstance resource to allow these to be managed and monitored in the UI and CLI - similar to how we handle heartbeats for agents today.