RFD 0162: Machine ID Fleet Management, `BotInstance` resource and other `token` joining improvements #36510

strideynet · 2024-01-10T14:05:17Z

Proposes solutions for #26885

RFD to improve the UX of token joining and managements of large fleets of tbot. Introduces a BotInstance resource to allow these to be managed and monitored in the UI and CLI - similar to how we handle heartbeats for agents today.

github-actions · 2024-01-12T18:27:06Z

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

github-actions · 2024-01-12T18:27:36Z

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

rfd/0162-machine-id-token-join-method-bot-instance.md

strideynet · 2024-05-07T12:27:46Z

@timothyb89 I did some thinking over the weekend and I stumbled across a fairly big decision I think we need to make.

Should we either:

Maintain the existing behaviour of delegated joining today, where we essentially "rejoin" for each renewal and the generation counter mechanism is not used.
Switch to using the renewal/generation counter mechanism for all join methods

Thoughts on maintaining existing behaviour:

Each BotInstance will have multiple joins associated with it. This could make how we display information a little more complicated (e.g if we wanted to show what GitHub action a BotInstance is associated with). Whilst we expect each of these joins/delegated identities to be similar/identical, we can't prevent someone switching join method for an already joined bot.
Avoids relying more on the currently flakey generation counter.
If a user deletes the join token, already joined bots will no longer be able to renew.
Avoids a change from the behaviour users are used to today.

Thoughts on switching to renewal mechanism:

Each BotInstance has a single initial join associated with it. This makes attributing actions to a specific delegated identity simpler.
If a user deletes the join token, already joined bots can continue to renew and function as expected. We could add an option when deleting a token that then "locks out" any botinstances that joined using that token to essentially mimic the previous behaviour. This is more similar to how join tokens work for agent joining, as they do not currently need to renew, so deleting their join token does not impact them.
Generation counter is fairly flakey at the moment, we'd need to make this more stable.
Makes behaviour consistent across join methods and bots vs agents. Today, the token join method is "special" as the only renewable join method. This means it's inconsistent with the other join methods and how they work, and also means that the other join methods behave differently when used for a bot and when used for an agent.

For me, I think the key point to this is the UX around deleting a token: is it better for the bots to then fail to renew or for them to continue to function?

There's no requirement that we necessarily make this decision now. We could pursue a strategy of maintaining the existing behaviour and changing this behaviour at some point in the future, but I think from the perspective of designing the bot instances, it would be good to consider this and at least explicitly decide we will defer changing the behaviour.

rfd/0162-machine-id-token-join-method-bot-instance.md

timothyb89 · 2024-05-16T01:14:41Z

rfd/0162-machine-id-token-join-method-bot-instance.md

+### Locking of Individual Bot Instances
+
+Currently, it's only possible to lock out an entire Bot user. This means that
+when managing a large fleet, it would not be able to lock out a specific host
+that had been compromised. This is likely to be a major friction point for those
+deploying a large number of Bot instances.
+
+It also increases the significance of the fragility of the generation counter.


The downside here is that the bot remains aware of the secret token it used to join, and that token is meant to be reused to join multiple bots. If we just lock one bot instance, an attacker could purge the bot data and join as a new instance to circumvent the lock.

I think to lock individual instances, we'd need some way to definitively associate a bot with a host. This category of (token based joining) issue always seems to come back to TPMs 😅

One thing I thought about here is an option to "Lock all bot instances that have authenticated with X token". Seems fairly feasible if we store the hash of the token used on the first join or similar.

I think this is deffo something for next quarter though.

rfd/0162-machine-id-token-join-method-bot-instance.md

timothyb89 · 2024-05-16T03:56:47Z

rfd/0162-machine-id-token-join-method-bot-instance.md

+As we now have a way to track the generation for a specific Bot instance, we
+can allow multiple Bot instances to be associated with a single Bot. This
+also means that the token no longer needs to be consumed on a join.
+
+Eventually, we may wish to add a way to specify a number of joins which can
+occur with a token. This provides a way to easily control the lifetime of a 
+token when deploying to a fleet of a pre-known size.
+
+The renewal logic will need to be adjusted to read and update the generation
+counter from the BotInstance rather than the Bot user.


I'm a bit conflicted about launching multi-use tokens without also adding options to limit joins. While they're valid, tokens are much more security sensitive than certificates, and I'm not totally convinced that the improved visibility from BotInstances makes up for the possibility of malicious token reuse.

The only easy limits I see are TTLs and a use counter, and only TTLs exist today.

I assume we'll at least keep the default 1h token TTL, but we don't meaningfully restrict the TTL beyond that - you can make a token with a 9999d TTL today. I think users will be incentivized to raise this TTL, possibly indefinitely. I can imagine these rough classes of use case:

On day 0, they'll be deploying a potentially large fleet of bots which may take more than an hour to initialize.

On day 1+, they'll be continuing to deploy new bots (e.g. as they dynamically scale their infrastructure), or repairing existing bots that went offline beyond the renewal period.

Today, there isn't much incentive for users to select a long token TTL because they're still just one-time-use; you'll need to make a new token (or many new tokens) regardless, so you're forced to automate token creation if you want to deploy bots at scale.

With reusable tokens, though, long TTLs provide a tangible benefit. They give you time to deploy your initial bot fleet ... and let you scale without any extra work, and let you easily fix broken bots...

Anyway, I think we need to try and balance out the incentives here:

We should include a join count limit from the start, and arguably default it to 1 so users are making an explicit decision to allow multiple bots to join. This default would also avoid a potentially surprising change in behavior for existing users.

We should explicitly address that "day 1+" case and simplify joining new instances to / creating new tokens for existing bots, to facilitate adding new bots later or repairing broken ones. Maybe something like the now ancient "bot refresh" feature request and let multiple tokens be issued to one bot?

I'm a bit conflicted about launching multi-use tokens without also adding options to limit joins.

I agree some sort of limiting factor on a number of joins is likely going to be necessary out of the box.

We should include a join count limit from the start, and arguably default it to 1 so users are making an explicit decision to allow multiple bots to join. This default would also avoid a potentially surprising change in behavior for existing users.

I like this idea - especially using a default of 1 to avoid a big change in behaviour.

One use-case I think we can't neglect is CI platforms we do not support and can not support due to their limitations. In this case, users may well still want a token that is "quite" long-lived and supports a large number of joins.

We'd still want them to rotate though... one thing I was thinking about to make this easier would be to produce a script or command which can be executed on a cron job that generates a new token for a named bot and then calls a user-supplied script which should connect to their CI platform/secret store and updates the token. This may be a nice answer for those who are backed into a corner by a rudimentary CI platform.

This fits well into the "bot refresh" idea, some sort of easyish CLI command where you specify an existing bot and can just generate a new token.

One use-case I think we can't neglect is CI platforms we do not support and can not support due to their limitations. In this case, users may well still want a token that is "quite" long-lived and supports a large number of joins.

I agree. I almost wonder if there's a middle ground here where we might be able to force users to make a security decision:

The token join method sets a sane cap on the TTL of a join token to proactively encourage some variety of automation, say 7 days or something. I don't think we need to cap the number of uses as long as there's a time bound.

It might be useful to write up a cron job example, maybe a guide on using a GHA or k8s cronjob to rotate a Jenkins joining secret or something.

We then (eventually) introduce a new token-insecure that otherwise behaves identically but has no restrictions at all on TTL or number of joins (basically this other thread down the page). This method might also opt out of the generation counter.

I think we could argue enforcing a maximum token TTL is an acceptable breaking change given how narrowly useful long TTL'd tokens are today, and then introduce token-insecure at some point in the future? I can't imagine any users today are making practical use of extremely long TTL bot tokens that would have their use cases broken.

I'm a little cautious about setting a cap on the TTL, if we were totally greenfield I'd agree but:

Today, there's no cap. I agree with you and doubt anyone is relying on this behaviour - so perhaps we can discount this.

We'd need to have logic that ignores the cap if it's for Agent joining

We'd need to have logic that ignores the cap if the join method is not token.

I'd perhaps be less cautious about the TTL cap if we were approaching this at a later date when we add token-insecure, imho, the whole token type could do with a V3 and that might be a good opportunity to do so. As an aside, I'd love to make it so token name is not the actual secret value, e.g a token join method token has a UUID-esque name or a user provided name, and then a field which specifies the actual secret value. This would make audit logging etc much easier because we wouldn't need to "hide" the name.

I'm not overly decided on this and I think I need to ponder on it a little longer to figure out my full thoughts. It's almost out of scope for this RFD - so perhaps we don't need to make a decision on this immediately and we can proceed with the BotInstance work since I don't think this has any impact.

I think my hopefully more in scope recommendation for this RFD is: let's enforce a soft client-side limit in tctl and/or warn that it's a bad idea and we may restrict token length in the future. I figure that if a user wants to create an extremely long-lived token, they can write up the YAML themselves and tctl create -f it.

I think warnings or soft errors today help keep the door open for us to lock this down more in the future while making sure use cases that need very long lived tokens are at least possible.

Unrelatedly, I think the tctl bot refresh case is covered here already: just create a new token for the existing bot with tctl tokens add --type=bot --bot foo.

rfd/0162-machine-id-token-join-method-bot-instance.md

This adds a new section on the security requirements for the persistent bot identity, alongside an updated preference for UUID identifiers. This also includes revisions to the "changes to the `token` join method" to include join count limits by default and set sane defaults to encourage short lived tokens. Lastly, various other small changes were applied, like removing the implementation details for bot instance names in favor of an explicitly specified bot name and instance identifier.

timothyb89 · 2024-05-25T00:34:25Z

rfd/0162-machine-id-token-join-method-bot-instance.md

+  // The public key of the Bot instance.
+  // When authenticating a Bot instance, the full public key must be compared
+  // rather than just the fingerprint to mitigate pre-image attacks.
+  bytes public_key = 2;
+  // The fingerprint of the public key of the Bot instance.
+  string fingerprint = 3;


I've so far kept the public key and fingerprint fields even though we're adding a UUID. Do we want to keep these? I don't think they're strictly required, but may still be useful.

I think it's useful info but if it's placed at the top level, we don't have a context point for when the value was this. We could move this into BotInstanceStatusAuthentication and that way we'll properly handle the key changing.

Good call, will move it into that message.

timothyb89 · 2024-05-25T00:38:11Z

I've done a pass to apply some of the discussions we've more or less found a consensus on, mainly switching to UUID identifiers and lightly touching on how we'll verify them. I've also suggested some tweaks to the token join method to include join counter limits by default and discourage (but still ultimately allow) very long TTLs.

rfd/0162-machine-id-token-join-method-bot-instance.md

strideynet · 2024-06-06T13:16:52Z

rfd/0162-machine-id-token-join-method-bot-instance.md

+  // The public key of the Bot instance.
+  // When authenticating a Bot instance, the full public key must be compared
+  // rather than just the fingerprint to mitigate pre-image attacks.
+  bytes public_key = 2;
+  // The fingerprint of the public key of the Bot instance.
+  string fingerprint = 3;


I think it's useful info but if it's placed at the top level, we don't have a context point for when the value was this. We could move this into BotInstanceStatusAuthentication and that way we'll properly handle the key changing.

rfd/0162-machine-id-token-join-method-bot-instance.md

strideynet · 2024-06-06T13:19:38Z

rfd/0162-machine-id-token-join-method-bot-instance.md

+then the second-oldest entry will be removed. This prevents growth without
+bounds but also ensures that the original record is retained.
+
+In addition, the TTL of the BotInstance resource will be extended.


Suggested change

In addition, the TTL of the BotInstance resource will be extended.

In addition, the TTL of the BotInstance resource will be extended to cover the validity period of the issued certificate, plus a short additional time to allow for some imprecision.

rfd/0162-machine-id-token-join-method-bot-instance.md

…-token-join-method-bot-instance

strideynet

I raised this PR initially but Tim has taken it over - so I can't Approve it. Please take this as my "approval".

rfd/0162-machine-id-token-join-method-bot-instance.md

zmb3 · 2024-06-18T23:19:15Z

rfd/0162-machine-id-token-join-method-bot-instance.md

+
+If the heartbeat fails, then `tbot` should retry on an exponential backoff.
+
+##### Alternative: Submit Heartbeat Data on Join/Renew


Before merging this, it would be helpful if each Alternative that we present but ultimately do not select lists why it was not chosen.

zmb3 · 2024-06-18T23:21:44Z

rfd/0162-machine-id-token-join-method-bot-instance.md

+long TTLs will need to be explicitly specified.
+
+We may additionally want to put hurdles in the way of things like extremely long
+token TTLs. There are legitimate low-security use cases for these, but we could


I'm curious what Sasha thinks about this.

Extremely long join token TTLs basically become a shared secret like an API key, and we built Machine ID to avoid long-lived shared secrets..

I agree, and would ideally like to only support this by way of an explicitly insecure token-insecure join method at some point in the future (see previous thread: #36510 (comment)). As we discussed in that thread, adding a TTL limit would technically be a breaking change, and there are use cases we may still want to support.

For context on why, today's status quo is problematic because it seems like we properly support token joining, but it's kind of an illusion. It works great in a Getting Started guide on your dev box, and then breaks as soon as you try it in production because one of your nodes went offline for more than an hour. Production use really requires full automation (e.g. ansible or terraform detecting dead bots and reissuing tokens), and I'm not aware of any customers who've actually put in the effort to do it.

The changes proposed here (multiple uses per token, multiple tokens per bot) simplify the automation requirements a lot, but it's still a heavy lift, and I'm not sure it'll ever be worthwhile for a lot of use cases. There may be value in giving customers in this niche a way to still make use of Teleport if they (explicitly) accept the security implications of a long lived token; they still benefit from our RBAC, audit logging, etc.

Michael is working on the display and creation of join tokens in the web UI, and we actually just had a conversation with Sasha about token creation during the Design and Product Office Hours yesterday. In the meeting, we agreed to show a danger state for any token join method with an expiration >24 hours. We also agreed to make it impossible to create long-lived tokens through the web UI.

(What we discussed doesn't affect the underlying mechanics of the token join method, but it does put some guard rails for their creation in the web UI and affects how they will be displayed, regardless of how they were created.)

rfd/0162-machine-id-token-join-method-bot-instance.md

zmb3 · 2024-06-18T23:23:45Z

rfd/0162-machine-id-token-join-method-bot-instance.md

+A PostHog event should be emitted for each BotInstance heartbeat. This will
+allow us to track active bots in a similar way to how we track active agents.


Should we included it in the aggregated tp.resource.heartbeat event rather than creating a new event for each heartbeat?

zmb3 · 2024-06-18T23:26:29Z

rfd/0162-machine-id-token-join-method-bot-instance.md

+when managing a large fleet, it would not be able to lock out a specific host
+that had been compromised. This is likely to be a major friction point for those


We can leave this out of scope but we should explore it in the future.

We can already lock any heartbeating agent based on its Host ID today, so it feels like this should extend to bots without too much trouble.

Yeah, we should add something here, but it's trivial to circumvent when the join token can just be reused, @strideynet had some ideas here: https://github.com/gravitational/teleport/pull/36510/files/61e17365cac7aaec059de86d470a57a6b612eebd#r1602421886

Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>

Traditionally, bots using non-`token` join methods "renew" their certs by rejoining, taking advantage of their repeatable join methods to avoid relying on `token` rejoin logic. However, as part of [RFD 162], we need a way for rejoining bots to assert an existing identity (if any) even when rejoining "from scratch". When paired with #43732, this change allows bots to use an authenticated auth client for all "initial" joins. If present, an existing `BotInstanceID` is extracted from the client identity and used to associate the bot with its previous identity. As certain join methods (TPM, Azure) are initially handled by the `JoinService` using gRPC, this additionally adds a new `BotInstanceID` field to `RegisterUsingTokenRequest` so the `JoinService` can verify the client identity and extract the `BotInstanceID` if available. Like `RemoteAddr`, this value is only trusted when set by the proxy and should be ignored in all other cases. [RFD 162]: #36510

* Allow bots to provide client cert authentication when rejoining Traditionally, bots using non-`token` join methods "renew" their certs by rejoining, taking advantage of their repeatable join methods to avoid relying on `token` rejoin logic. However, as part of [RFD 162], we need a way for rejoining bots to assert an existing identity (if any) even when rejoining "from scratch". When paired with #43732, this change allows bots to use an authenticated auth client for all "initial" joins. If present, an existing `BotInstanceID` is extracted from the client identity and used to associate the bot with its previous identity. As certain join methods (TPM, Azure) are initially handled by the `JoinService` using gRPC, this additionally adds a new `BotInstanceID` field to `RegisterUsingTokenRequest` so the `JoinService` can verify the client identity and extract the `BotInstanceID` if available. Like `RemoteAddr`, this value is only trusted when set by the proxy and should be ignored in all other cases. [RFD 162]: #36510 * Add tests for authenticated bot rejoining * Create a new bot instance if the requested instance is not found Also, includes a new check in for this behavior in `TestRegisterBotWithInvalidInstanceID`. * Fix imports * Update lib/auth/auth_with_roles.go Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com> * Fix `cmp` import --------- Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>

* Add basic CRUD for BotInstanceService (#42899) * Add basic CRUD for BotInstanceService This adds an early backend for the bot instance service, with resource definitions and basic CRUD operations. * Update api/proto/teleport/machineid/v1/bot_instance.proto Co-authored-by: Noah Stride <noah.stride@goteleport.com> * Further work on BotInstance boilerplate Among other things, this adds event definitions, RBAC and presets, and client definitions. This also includes a few proto changes, namely moving bot name and instance ID from .Status to .Spec so metadata can be uniformly generated. This also adds tests for the backend service showing functional CRUD. * gRPC tests; fix filtering; fix import lints; misc cleanup * Consistently use `instance_id` in gRPC fields * Add tctl support for bot_instance resource * protos update * Various typo fixes * Fix more comment typos * More doc comments and minor test tweaks --------- Co-authored-by: Noah Stride <noah.stride@goteleport.com> * Create Bot Instances during initial bot join (#43577) * Create Bot Instances during initial bot join This creates new instances for bots when they initially join the cluster, and persists instance IDs in new certificate fields on join and during renewal. Note that this does not yet handle instance reuse for non-token join methods. Additionally, bot instance creation is locked behind a `BOT_INSTANCE_EXPERIMENT` flag; it must be set to `1` to enable creation. * Proto cleanup, and update bot auth records on cert renewal This makes various (admittedly breaking) protobuf changes, including removing the TTL field (calculating resource expiry based on cert requests), removing public key fingerprints, and changing the data type of the generation counter to match the preexisting internal datatype. These changes _should_ be safe as no consumers of the proto API currently exist. Additionally, this also updates bot authentications on renewal. * Fix proto lints * Fix misleading doc comment in the bot instance experiment * Create bot instances for old bots on join; other fixes This now creates bot instances for bots whose certs are missing the BotInstanceID field. Additionally, it fixes two related bugs: expiration dates are extended on renewal, the generated UUID is properly appended to certs on initial join, and instances are only created or updated when the experiment is enabled. * Add a minimal test for bot instance creation on initial join * Validate bot instance state in generation counter checks * Remove outdated TODO comment and fix test lints * Add an expiration change check to the generation test * Machine ID: Provide existing identity when rejoining as part of renewals (#43732) * Allow existing bot identity to be used for token based renewals * Add test for register with auth client * Fix tests * Fix test * Fix partial assignment to interface * Improve loggging * Fix spelling * Fix missing import * Fix built gRPC * Machine ID: Bot Instance heartbeating (#43463) * Add basic CRUD for BotInstanceService This adds an early backend for the bot instance service, with resource definitions and basic CRUD operations. * Update api/proto/teleport/machineid/v1/bot_instance.proto Co-authored-by: Noah Stride <noah.stride@goteleport.com> * Further work on BotInstance boilerplate Among other things, this adds event definitions, RBAC and presets, and client definitions. This also includes a few proto changes, namely moving bot name and instance ID from .Status to .Spec so metadata can be uniformly generated. This also adds tests for the backend service showing functional CRUD. * gRPC tests; fix filtering; fix import lints; misc cleanup * Consistently use `instance_id` in gRPC fields * Add tctl support for bot_instance resource * protos update * Various typo fixes * Fix more comment typos * More doc comments and minor test tweaks * Add SubmitHeartbeat RPC implementation * Fix log keys * Update to use proper BotInstanceID field in identiy * Add rough outline of heartbeat service --------- Co-authored-by: Tim Buckley <tim@goteleport.com> * Allow bots to provide client cert authentication when rejoining (#44257) * Allow bots to provide client cert authentication when rejoining Traditionally, bots using non-`token` join methods "renew" their certs by rejoining, taking advantage of their repeatable join methods to avoid relying on `token` rejoin logic. However, as part of [RFD 162], we need a way for rejoining bots to assert an existing identity (if any) even when rejoining "from scratch". When paired with #43732, this change allows bots to use an authenticated auth client for all "initial" joins. If present, an existing `BotInstanceID` is extracted from the client identity and used to associate the bot with its previous identity. As certain join methods (TPM, Azure) are initially handled by the `JoinService` using gRPC, this additionally adds a new `BotInstanceID` field to `RegisterUsingTokenRequest` so the `JoinService` can verify the client identity and extract the `BotInstanceID` if available. Like `RemoteAddr`, this value is only trusted when set by the proxy and should be ignored in all other cases. [RFD 162]: #36510 * Add tests for authenticated bot rejoining * Create a new bot instance if the requested instance is not found Also, includes a new check in for this behavior in `TestRegisterBotWithInvalidInstanceID`. * Fix imports * Update lib/auth/auth_with_roles.go Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com> * Fix `cmp` import --------- Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com> * Machine ID: Include Bot Instance ID in audit logs (#43786) * Create Bot Instances during initial bot join This creates new instances for bots when they initially join the cluster, and persists instance IDs in new certificate fields on join and during renewal. Note that this does not yet handle instance reuse for non-token join methods. Additionally, bot instance creation is locked behind a `BOT_INSTANCE_EXPERIMENT` flag; it must be set to `1` to enable creation. * Proto cleanup, and update bot auth records on cert renewal This makes various (admittedly breaking) protobuf changes, including removing the TTL field (calculating resource expiry based on cert requests), removing public key fingerprints, and changing the data type of the generation counter to match the preexisting internal datatype. These changes _should_ be safe as no consumers of the proto API currently exist. Additionally, this also updates bot authentications on renewal. * Fix proto lints * Fix misleading doc comment in the bot instance experiment * Create bot instances for old bots on join; other fixes This now creates bot instances for bots whose certs are missing the BotInstanceID field. Additionally, it fixes two related bugs: expiration dates are extended on renewal, the generated UUID is properly appended to certs on initial join, and instances are only created or updated when the experiment is enabled. * Add a minimal test for bot instance creation on initial join * Validate bot instance state in generation counter checks * Remove outdated TODO comment and fix test lints * Add an expiration change check to the generation test * Add BotInstanceID to audit events * Fix borked conflict resolution * Fix further borked conflict resolution * Add test for cert create/join --------- Co-authored-by: Tim Buckley <tim@goteleport.com> * Show bot instance ID in tbot log output (#44578) This tweaks the "fetched new bot identity" message to show the bot name and instance ID as embedded in the bot's certificate. Example: ``` 2024-07-23T15:51:20-06:00 INFO [TBOT:IDEN] Fetched new bot identity identity:tpm-test, id=5a2865d3-d3dc-4eaa-853c-5377a1fe83f6 | valid: after=2024-07-23T21:50:20Z, before=2024-07-23T21:56:19Z, duration=5m59s | kind=tls, renewable=false, disallow-reissue=false, roles=[bot-tpm-test], principals=[-teleport-internal-join], generation=0 tbot/service_bot_identity.go:223 ``` * Machine ID: Validate generation counter using bot instances (#44583) * Show bot instance ID in tbot log output This tweaks the "fetched new bot identity" message to show the bot name and instance ID as embedded in the bot's certificate. Example: ``` 2024-07-23T15:51:20-06:00 INFO [TBOT:IDEN] Fetched new bot identity identity:tpm-test, id=5a2865d3-d3dc-4eaa-853c-5377a1fe83f6 | valid: after=2024-07-23T21:50:20Z, before=2024-07-23T21:56:19Z, duration=5m59s | kind=tls, renewable=false, disallow-reissue=false, roles=[bot-tpm-test], principals=[-teleport-internal-join], generation=0 tbot/service_bot_identity.go:223 ``` * Machine ID: Validate generation counter using bot instances This change moves the generation counter from bot user labels to a field in `BotInstanceStatusAuthentication`. This allows for tracking of individual generations regardless of join method, allows for multiple `token`-type joins per bot, and opens the door for multi-use `token`-type tokens in the future. For now, the new behavior remains behind the `BOT_INSTANCE_EXPERIMENT` feature flag, and generation counter handling is largely unchanged when it is unset. * Update legacy generation counter for downgrade compatibility This updates the legacy user label generation counter to support some downgrade compatibility if a user either disables the bot instance experiment or downgrades to a Teleport version predating it. Also includes some logger cleanup, and adds a warning when a generation counter mismatch is detected but not enforced. * Set initial generation value for legacy mode `validateGenerationLabel` expects a nonzero generation counter value for renewable certs to populate the counter, so this provides it again. * Add new generation counter tests This adds various new tests to validate new per-instance generation counter behavior. Also includes backcompat fixes to import old generation counters, and a small fix to ensure downgrades don't leave bots in a bad state. * Fix imports * Tweak TODO message * Fix HostCAs not being returned during bot renewal (#44832) * Machine ID: Remove `BOT_INSTANCE_EXPERIMENT` feature flag (#44904) This removes the `BOT_INSTANCE_EXPERIMENT` feature flag to enable the feature universally. It also adjusts some comments to add deprecation notices. * Machine ID: Only set legacy generation counter for renewable joins (#44993) We unconditionally committed the generation counter to the legacy user label to allow for downgrade compatibility. This is not valid for non-token join methods as previous Teleport versions will attempt to check the generation counter whenever the label is nonzero, and because these older versions don't support authenticated rejoining, will never be able to compare to the client's actual generation value. (Plus, it was hard-coded to zero.) This disables the legacy generation counter for non-renewable join methods. This should not create a later upgrade problem as bots will lose their instance ID upon their first rejoin after downgrading, and so will be issued a fresh ID and generation counter if and when the cluster is upgraded again. * Remove unused function sshPublicKeyToPKIXPEM Left-over function from backport * Always use PEM PKIX DER representation of public keys for BotInstances (#43845) * Always use PEM PKIX DER representation of public keys for BotInstances * Prefer using `keys.MarshalPublicKey` * Remove mistakenly included TestVerifyALPNUpgradedConn The recent merge included an unrelated unit test. This removes it. --------- Co-authored-by: Noah Stride <noah.stride@goteleport.com> Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>

rfd/0162-machine-id-token-join-method-bot-instance.md

strideynet · 2024-10-01T09:17:23Z

I've updated this to be marked implemented, @zmb3 / @timothyb89 PTAL so we can merge this RFD down.

strideynet added 11 commits January 10, 2024 12:44

skeleton out rfd

80a43a3

Restrucutre to highlight the selected options over alternatives

03df36c

identify edge cases

b83e49d

Explain backend key format

cc4ee99

Add spec for heartbeat rpc

6f59677

Add API for BotInstance resource

09d7351

Add notes on analytics

a1d06e0

Add a brief summary of out of scope improvements

4dd6dd9

Add output/services to heartbeat

42c2e30

Remove services/outputs

6d86217

Expand on tctl command changes

f8f7896

strideynet marked this pull request as ready for review January 12, 2024 18:26

github-actions bot added rfd Request for Discussion size/md labels Jan 12, 2024

github-actions bot requested review from bl-nero and ryanclark January 12, 2024 18:27

strideynet changed the title ~~WIP - RFD 0162: Machine ID Fleet Management and token joining improvements~~ RFD 0162: Machine ID Fleet Management, BotInstance resource and other token joining improvements Jan 12, 2024

strideynet requested review from timothyb89 and zmb3 and removed request for ryanclark and bl-nero January 12, 2024 18:27

strideynet added the no-changelog Indicates that a PR does not require a changelog entry label Jan 12, 2024

strideynet commented Jan 12, 2024

View reviewed changes

rfd/0162-machine-id-token-join-method-bot-instance.md Outdated Show resolved Hide resolved

rfd/0162-machine-id-token-join-method-bot-instance.md Outdated Show resolved Hide resolved

thedevelopnik reviewed May 10, 2024

View reviewed changes

rfd/0162-machine-id-token-join-method-bot-instance.md Outdated Show resolved Hide resolved

timothyb89 mentioned this pull request May 14, 2024

Machine ID: Bot Instance epic #41456

Closed

10 tasks

strideynet marked this pull request as draft May 14, 2024 16:38

timothyb89 reviewed May 16, 2024

View reviewed changes

timothyb89 reviewed May 18, 2024

View reviewed changes

rfd/0162-machine-id-token-join-method-bot-instance.md Show resolved Hide resolved

timothyb89 reviewed May 25, 2024

View reviewed changes

strideynet commented Jun 6, 2024

View reviewed changes

Apply code review feedback

b4d7a4f

timothyb89 marked this pull request as ready for review June 7, 2024 00:49

Merge remote-tracking branch 'origin/master' into rfd/0162-machine-id…

13b83db

…-token-join-method-bot-instance

github-actions bot added the size/lg label Jun 7, 2024

github-actions bot requested review from ibeckermayer and jakule June 7, 2024 00:50

strideynet commented Jun 7, 2024

View reviewed changes

Use consistent field names for authentications and heartbeats

61e1736

zmb3 reviewed Jun 18, 2024

View reviewed changes

timothyb89 and others added 2 commits June 18, 2024 18:54

Apply suggestions from code review

f4ffebf

Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>

Address review feedback: mention TPM joining, alternatives, audit

97b4d3a

strideynet mentioned this pull request Jun 24, 2024

Machine ID: Allow token join method Join Token to be used multiple times to join Bots #43402

Open

timothyb89 mentioned this pull request Jul 16, 2024

Allow bots to provide client cert authentication when rejoining #44257

Merged

strideynet commented Oct 1, 2024

View reviewed changes

rfd/0162-machine-id-token-join-method-bot-instance.md Outdated Show resolved Hide resolved

Update rfd/0162-machine-id-token-join-method-bot-instance.md

5fa1817

strideynet requested review from zmb3 and timothyb89 October 1, 2024 09:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFD 0162: Machine ID Fleet Management, `BotInstance` resource and other `token` joining improvements #36510

RFD 0162: Machine ID Fleet Management, `BotInstance` resource and other `token` joining improvements #36510

strideynet commented Jan 10, 2024 •

edited

Loading

github-actions bot commented Jan 12, 2024

github-actions bot commented Jan 12, 2024

strideynet commented May 7, 2024 •

edited

Loading

timothyb89 May 16, 2024

strideynet May 16, 2024

timothyb89 May 16, 2024

strideynet May 16, 2024

timothyb89 May 16, 2024

strideynet May 17, 2024

timothyb89 May 25, 2024 •

edited

Loading

timothyb89 May 25, 2024

strideynet Jun 6, 2024

timothyb89 Jun 7, 2024

timothyb89 commented May 25, 2024

strideynet Jun 6, 2024

strideynet Jun 6, 2024

strideynet left a comment

zmb3 Jun 18, 2024

zmb3 Jun 18, 2024

timothyb89 Jun 19, 2024

roraback Jun 26, 2024

zmb3 Jun 18, 2024

zmb3 Jun 18, 2024

timothyb89 Jun 19, 2024

strideynet commented Oct 1, 2024

	In addition, the TTL of the BotInstance resource will be extended.
	In addition, the TTL of the BotInstance resource will be extended to cover the validity period of the issued certificate, plus a short additional time to allow for some imprecision.


		If the heartbeat fails, then `tbot` should retry on an exponential backoff.

		##### Alternative: Submit Heartbeat Data on Join/Renew

		A PostHog event should be emitted for each BotInstance heartbeat. This will
		allow us to track active bots in a similar way to how we track active agents.

		when managing a large fleet, it would not be able to lock out a specific host
		that had been compromised. This is likely to be a major friction point for those

RFD 0162: Machine ID Fleet Management, BotInstance resource and other token joining improvements #36510

Are you sure you want to change the base?

RFD 0162: Machine ID Fleet Management, BotInstance resource and other token joining improvements #36510

Conversation

strideynet commented Jan 10, 2024 • edited Loading

github-actions bot commented Jan 12, 2024

github-actions bot commented Jan 12, 2024

strideynet commented May 7, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

timothyb89 May 25, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

timothyb89 commented May 25, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

strideynet left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

strideynet commented Oct 1, 2024

RFD 0162: Machine ID Fleet Management, `BotInstance` resource and other `token` joining improvements #36510

RFD 0162: Machine ID Fleet Management, `BotInstance` resource and other `token` joining improvements #36510

strideynet commented Jan 10, 2024 •

edited

Loading

strideynet commented May 7, 2024 •

edited

Loading

timothyb89 May 25, 2024 •

edited

Loading