Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFD 0162: Machine ID Fleet Management, BotInstance resource and other token joining improvements #36510

Open
wants to merge 18 commits into
base: master
Choose a base branch
from

Conversation

strideynet
Copy link
Contributor

@strideynet strideynet commented Jan 10, 2024

Proposes solutions for #26885

RFD to improve the UX of token joining and managements of large fleets of tbot. Introduces a BotInstance resource to allow these to be managed and monitored in the UI and CLI - similar to how we handle heartbeats for agents today.

@strideynet strideynet marked this pull request as ready for review January 12, 2024 18:26
Copy link

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

@github-actions github-actions bot added rfd Request for Discussion size/md labels Jan 12, 2024
@strideynet strideynet changed the title WIP - RFD 0162: Machine ID Fleet Management and token joining improvements RFD 0162: Machine ID Fleet Management, BotInstance resource and other token joining improvements Jan 12, 2024
@strideynet strideynet requested review from timothyb89 and zmb3 and removed request for ryanclark and bl-nero January 12, 2024 18:27
Copy link

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

@strideynet strideynet added the no-changelog Indicates that a PR does not require a changelog entry label Jan 12, 2024
@strideynet
Copy link
Contributor Author

strideynet commented May 7, 2024

@timothyb89 I did some thinking over the weekend and I stumbled across a fairly big decision I think we need to make.

Should we either:

  • Maintain the existing behaviour of delegated joining today, where we essentially "rejoin" for each renewal and the generation counter mechanism is not used.
  • Switch to using the renewal/generation counter mechanism for all join methods

Thoughts on maintaining existing behaviour:

  • Each BotInstance will have multiple joins associated with it. This could make how we display information a little more complicated (e.g if we wanted to show what GitHub action a BotInstance is associated with). Whilst we expect each of these joins/delegated identities to be similar/identical, we can't prevent someone switching join method for an already joined bot.
  • Avoids relying more on the currently flakey generation counter.
  • If a user deletes the join token, already joined bots will no longer be able to renew.
  • Avoids a change from the behaviour users are used to today.

Thoughts on switching to renewal mechanism:

  • Each BotInstance has a single initial join associated with it. This makes attributing actions to a specific delegated identity simpler.
  • If a user deletes the join token, already joined bots can continue to renew and function as expected. We could add an option when deleting a token that then "locks out" any botinstances that joined using that token to essentially mimic the previous behaviour. This is more similar to how join tokens work for agent joining, as they do not currently need to renew, so deleting their join token does not impact them.
  • Generation counter is fairly flakey at the moment, we'd need to make this more stable.
  • Makes behaviour consistent across join methods and bots vs agents. Today, the token join method is "special" as the only renewable join method. This means it's inconsistent with the other join methods and how they work, and also means that the other join methods behave differently when used for a bot and when used for an agent.

For me, I think the key point to this is the UX around deleting a token: is it better for the bots to then fail to renew or for them to continue to function?

There's no requirement that we necessarily make this decision now. We could pursue a strategy of maintaining the existing behaviour and changing this behaviour at some point in the future, but I think from the perspective of designing the bot instances, it would be good to consider this and at least explicitly decide we will defer changing the behaviour.

@timothyb89 timothyb89 mentioned this pull request May 14, 2024
10 tasks
@strideynet strideynet marked this pull request as draft May 14, 2024 16:38
rfd/0162-machine-id-token-join-method-bot-instance.md Outdated Show resolved Hide resolved
Comment on lines +553 to +560
### Locking of Individual Bot Instances

Currently, it's only possible to lock out an entire Bot user. This means that
when managing a large fleet, it would not be able to lock out a specific host
that had been compromised. This is likely to be a major friction point for those
deploying a large number of Bot instances.

It also increases the significance of the fragility of the generation counter.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downside here is that the bot remains aware of the secret token it used to join, and that token is meant to be reused to join multiple bots. If we just lock one bot instance, an attacker could purge the bot data and join as a new instance to circumvent the lock.

I think to lock individual instances, we'd need some way to definitively associate a bot with a host. This category of (token based joining) issue always seems to come back to TPMs 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I thought about here is an option to "Lock all bot instances that have authenticated with X token". Seems fairly feasible if we store the hash of the token used on the first join or similar.

I think this is deffo something for next quarter though.

rfd/0162-machine-id-token-join-method-bot-instance.md Outdated Show resolved Hide resolved
rfd/0162-machine-id-token-join-method-bot-instance.md Outdated Show resolved Hide resolved
Comment on lines 402 to 411
As we now have a way to track the generation for a specific Bot instance, we
can allow multiple Bot instances to be associated with a single Bot. This
also means that the token no longer needs to be consumed on a join.

Eventually, we may wish to add a way to specify a number of joins which can
occur with a token. This provides a way to easily control the lifetime of a
token when deploying to a fleet of a pre-known size.

The renewal logic will need to be adjusted to read and update the generation
counter from the BotInstance rather than the Bot user.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit conflicted about launching multi-use tokens without also adding options to limit joins. While they're valid, tokens are much more security sensitive than certificates, and I'm not totally convinced that the improved visibility from BotInstances makes up for the possibility of malicious token reuse.

The only easy limits I see are TTLs and a use counter, and only TTLs exist today.

I assume we'll at least keep the default 1h token TTL, but we don't meaningfully restrict the TTL beyond that - you can make a token with a 9999d TTL today. I think users will be incentivized to raise this TTL, possibly indefinitely. I can imagine these rough classes of use case:

  • On day 0, they'll be deploying a potentially large fleet of bots which may take more than an hour to initialize.
  • On day 1+, they'll be continuing to deploy new bots (e.g. as they dynamically scale their infrastructure), or repairing existing bots that went offline beyond the renewal period.

Today, there isn't much incentive for users to select a long token TTL because they're still just one-time-use; you'll need to make a new token (or many new tokens) regardless, so you're forced to automate token creation if you want to deploy bots at scale.

With reusable tokens, though, long TTLs provide a tangible benefit. They give you time to deploy your initial bot fleet ... and let you scale without any extra work, and let you easily fix broken bots...

Anyway, I think we need to try and balance out the incentives here:

  • We should include a join count limit from the start, and arguably default it to 1 so users are making an explicit decision to allow multiple bots to join. This default would also avoid a potentially surprising change in behavior for existing users.
  • We should explicitly address that "day 1+" case and simplify joining new instances to / creating new tokens for existing bots, to facilitate adding new bots later or repairing broken ones. Maybe something like the now ancient "bot refresh" feature request and let multiple tokens be issued to one bot?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit conflicted about launching multi-use tokens without also adding options to limit joins.

I agree some sort of limiting factor on a number of joins is likely going to be necessary out of the box.

We should include a join count limit from the start, and arguably default it to 1 so users are making an explicit decision to allow multiple bots to join. This default would also avoid a potentially surprising change in behavior for existing users.

I like this idea - especially using a default of 1 to avoid a big change in behaviour.


One use-case I think we can't neglect is CI platforms we do not support and can not support due to their limitations. In this case, users may well still want a token that is "quite" long-lived and supports a large number of joins.

We'd still want them to rotate though... one thing I was thinking about to make this easier would be to produce a script or command which can be executed on a cron job that generates a new token for a named bot and then calls a user-supplied script which should connect to their CI platform/secret store and updates the token. This may be a nice answer for those who are backed into a corner by a rudimentary CI platform.

This fits well into the "bot refresh" idea, some sort of easyish CLI command where you specify an existing bot and can just generate a new token.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One use-case I think we can't neglect is CI platforms we do not support and can not support due to their limitations. In this case, users may well still want a token that is "quite" long-lived and supports a large number of joins.

I agree. I almost wonder if there's a middle ground here where we might be able to force users to make a security decision:

  • The token join method sets a sane cap on the TTL of a join token to proactively encourage some variety of automation, say 7 days or something. I don't think we need to cap the number of uses as long as there's a time bound.
    • It might be useful to write up a cron job example, maybe a guide on using a GHA or k8s cronjob to rotate a Jenkins joining secret or something.
  • We then (eventually) introduce a new token-insecure that otherwise behaves identically but has no restrictions at all on TTL or number of joins (basically this other thread down the page). This method might also opt out of the generation counter.

I think we could argue enforcing a maximum token TTL is an acceptable breaking change given how narrowly useful long TTL'd tokens are today, and then introduce token-insecure at some point in the future? I can't imagine any users today are making practical use of extremely long TTL bot tokens that would have their use cases broken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little cautious about setting a cap on the TTL, if we were totally greenfield I'd agree but:

  • Today, there's no cap. I agree with you and doubt anyone is relying on this behaviour - so perhaps we can discount this.
  • We'd need to have logic that ignores the cap if it's for Agent joining
  • We'd need to have logic that ignores the cap if the join method is not token.

I'd perhaps be less cautious about the TTL cap if we were approaching this at a later date when we add token-insecure, imho, the whole token type could do with a V3 and that might be a good opportunity to do so. As an aside, I'd love to make it so token name is not the actual secret value, e.g a token join method token has a UUID-esque name or a user provided name, and then a field which specifies the actual secret value. This would make audit logging etc much easier because we wouldn't need to "hide" the name.

I'm not overly decided on this and I think I need to ponder on it a little longer to figure out my full thoughts. It's almost out of scope for this RFD - so perhaps we don't need to make a decision on this immediately and we can proceed with the BotInstance work since I don't think this has any impact.

Copy link
Contributor

@timothyb89 timothyb89 May 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my hopefully more in scope recommendation for this RFD is: let's enforce a soft client-side limit in tctl and/or warn that it's a bad idea and we may restrict token length in the future. I figure that if a user wants to create an extremely long-lived token, they can write up the YAML themselves and tctl create -f it.

I think warnings or soft errors today help keep the door open for us to lock this down more in the future while making sure use cases that need very long lived tokens are at least possible.


Unrelatedly, I think the tctl bot refresh case is covered here already: just create a new token for the existing bot with tctl tokens add --type=bot --bot foo.

This adds a new section on the security requirements for the
persistent bot identity, alongside an updated preference for UUID
identifiers.

This also includes revisions to the "changes to the `token` join
method" to include join count limits by default and set sane
defaults to encourage short lived tokens.

Lastly, various other small changes were applied, like removing
the implementation details for bot instance names in favor of an
explicitly specified bot name and instance identifier.
Comment on lines 309 to 314
// The public key of the Bot instance.
// When authenticating a Bot instance, the full public key must be compared
// rather than just the fingerprint to mitigate pre-image attacks.
bytes public_key = 2;
// The fingerprint of the public key of the Bot instance.
string fingerprint = 3;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've so far kept the public key and fingerprint fields even though we're adding a UUID. Do we want to keep these? I don't think they're strictly required, but may still be useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's useful info but if it's placed at the top level, we don't have a context point for when the value was this. We could move this into BotInstanceStatusAuthentication and that way we'll properly handle the key changing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, will move it into that message.

@timothyb89
Copy link
Contributor

I've done a pass to apply some of the discussions we've more or less found a consensus on, mainly switching to UUID identifiers and lightly touching on how we'll verify them. I've also suggested some tweaks to the token join method to include join counter limits by default and discourage (but still ultimately allow) very long TTLs.

Comment on lines 309 to 314
// The public key of the Bot instance.
// When authenticating a Bot instance, the full public key must be compared
// rather than just the fingerprint to mitigate pre-image attacks.
bytes public_key = 2;
// The fingerprint of the public key of the Bot instance.
string fingerprint = 3;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's useful info but if it's placed at the top level, we don't have a context point for when the value was this. We could move this into BotInstanceStatusAuthentication and that way we'll properly handle the key changing.

rfd/0162-machine-id-token-join-method-bot-instance.md Outdated Show resolved Hide resolved
rfd/0162-machine-id-token-join-method-bot-instance.md Outdated Show resolved Hide resolved
then the second-oldest entry will be removed. This prevents growth without
bounds but also ensures that the original record is retained.

In addition, the TTL of the BotInstance resource will be extended.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In addition, the TTL of the BotInstance resource will be extended.
In addition, the TTL of the BotInstance resource will be extended to cover the validity period of the issued certificate, plus a short additional time to allow for some imprecision.

rfd/0162-machine-id-token-join-method-bot-instance.md Outdated Show resolved Hide resolved
rfd/0162-machine-id-token-join-method-bot-instance.md Outdated Show resolved Hide resolved
@timothyb89 timothyb89 marked this pull request as ready for review June 7, 2024 00:49
Copy link
Contributor Author

@strideynet strideynet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I raised this PR initially but Tim has taken it over - so I can't Approve it. Please take this as my "approval".

rfd/0162-machine-id-token-join-method-bot-instance.md Outdated Show resolved Hide resolved
rfd/0162-machine-id-token-join-method-bot-instance.md Outdated Show resolved Hide resolved
rfd/0162-machine-id-token-join-method-bot-instance.md Outdated Show resolved Hide resolved

If the heartbeat fails, then `tbot` should retry on an exponential backoff.

##### Alternative: Submit Heartbeat Data on Join/Renew
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before merging this, it would be helpful if each Alternative that we present but ultimately do not select lists why it was not chosen.

long TTLs will need to be explicitly specified.

We may additionally want to put hurdles in the way of things like extremely long
token TTLs. There are legitimate low-security use cases for these, but we could
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious what Sasha thinks about this.

Extremely long join token TTLs basically become a shared secret like an API key, and we built Machine ID to avoid long-lived shared secrets..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, and would ideally like to only support this by way of an explicitly insecure token-insecure join method at some point in the future (see previous thread: #36510 (comment)). As we discussed in that thread, adding a TTL limit would technically be a breaking change, and there are use cases we may still want to support.

For context on why, today's status quo is problematic because it seems like we properly support token joining, but it's kind of an illusion. It works great in a Getting Started guide on your dev box, and then breaks as soon as you try it in production because one of your nodes went offline for more than an hour. Production use really requires full automation (e.g. ansible or terraform detecting dead bots and reissuing tokens), and I'm not aware of any customers who've actually put in the effort to do it.

The changes proposed here (multiple uses per token, multiple tokens per bot) simplify the automation requirements a lot, but it's still a heavy lift, and I'm not sure it'll ever be worthwhile for a lot of use cases. There may be value in giving customers in this niche a way to still make use of Teleport if they (explicitly) accept the security implications of a long lived token; they still benefit from our RBAC, audit logging, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Michael is working on the display and creation of join tokens in the web UI, and we actually just had a conversation with Sasha about token creation during the Design and Product Office Hours yesterday. In the meeting, we agreed to show a danger state for any token join method with an expiration >24 hours. We also agreed to make it impossible to create long-lived tokens through the web UI.

(What we discussed doesn't affect the underlying mechanics of the token join method, but it does put some guard rails for their creation in the web UI and affects how they will be displayed, regardless of how they were created.)

rfd/0162-machine-id-token-join-method-bot-instance.md Outdated Show resolved Hide resolved
Comment on lines +537 to +538
A PostHog event should be emitted for each BotInstance heartbeat. This will
allow us to track active bots in a similar way to how we track active agents.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we included it in the aggregated tp.resource.heartbeat event rather than creating a new event for each heartbeat?

Comment on lines +644 to +645
when managing a large fleet, it would not be able to lock out a specific host
that had been compromised. This is likely to be a major friction point for those
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can leave this out of scope but we should explore it in the future.

We can already lock any heartbeating agent based on its Host ID today, so it feels like this should extend to bots without too much trouble.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we should add something here, but it's trivial to circumvent when the join token can just be reused, @strideynet had some ideas here: https://github.com/gravitational/teleport/pull/36510/files/61e17365cac7aaec059de86d470a57a6b612eebd#r1602421886

timothyb89 added a commit that referenced this pull request Jul 16, 2024
Traditionally, bots using non-`token` join methods "renew" their
certs by rejoining, taking advantage of their repeatable join methods
to avoid relying on `token` rejoin logic. However, as part of
[RFD 162], we need a way for rejoining bots to assert an existing
identity (if any) even when rejoining "from scratch".

When paired with #43732, this change allows bots to use an
authenticated auth client for all "initial" joins. If present, an
existing `BotInstanceID` is extracted from the client identity and
used to associate the bot with its previous identity.

As certain join methods (TPM, Azure) are initially handled by the
`JoinService` using gRPC, this additionally adds a new
`BotInstanceID` field to `RegisterUsingTokenRequest` so the
`JoinService` can verify the client identity and extract the
`BotInstanceID` if available. Like `RemoteAddr`, this value is only
trusted when set by the proxy and should be ignored in all other
cases.

[RFD 162]: #36510
github-merge-queue bot pushed a commit that referenced this pull request Jul 23, 2024
* Allow bots to provide client cert authentication when rejoining

Traditionally, bots using non-`token` join methods "renew" their
certs by rejoining, taking advantage of their repeatable join methods
to avoid relying on `token` rejoin logic. However, as part of
[RFD 162], we need a way for rejoining bots to assert an existing
identity (if any) even when rejoining "from scratch".

When paired with #43732, this change allows bots to use an
authenticated auth client for all "initial" joins. If present, an
existing `BotInstanceID` is extracted from the client identity and
used to associate the bot with its previous identity.

As certain join methods (TPM, Azure) are initially handled by the
`JoinService` using gRPC, this additionally adds a new
`BotInstanceID` field to `RegisterUsingTokenRequest` so the
`JoinService` can verify the client identity and extract the
`BotInstanceID` if available. Like `RemoteAddr`, this value is only
trusted when set by the proxy and should be ignored in all other
cases.

[RFD 162]: #36510

* Add tests for authenticated bot rejoining

* Create a new bot instance if the requested instance is not found

Also, includes a new check in for this behavior in
`TestRegisterBotWithInvalidInstanceID`.

* Fix imports

* Update lib/auth/auth_with_roles.go

Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>

* Fix `cmp` import

---------

Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>
timothyb89 added a commit that referenced this pull request Aug 6, 2024
* Allow bots to provide client cert authentication when rejoining

Traditionally, bots using non-`token` join methods "renew" their
certs by rejoining, taking advantage of their repeatable join methods
to avoid relying on `token` rejoin logic. However, as part of
[RFD 162], we need a way for rejoining bots to assert an existing
identity (if any) even when rejoining "from scratch".

When paired with #43732, this change allows bots to use an
authenticated auth client for all "initial" joins. If present, an
existing `BotInstanceID` is extracted from the client identity and
used to associate the bot with its previous identity.

As certain join methods (TPM, Azure) are initially handled by the
`JoinService` using gRPC, this additionally adds a new
`BotInstanceID` field to `RegisterUsingTokenRequest` so the
`JoinService` can verify the client identity and extract the
`BotInstanceID` if available. Like `RemoteAddr`, this value is only
trusted when set by the proxy and should be ignored in all other
cases.

[RFD 162]: #36510

* Add tests for authenticated bot rejoining

* Create a new bot instance if the requested instance is not found

Also, includes a new check in for this behavior in
`TestRegisterBotWithInvalidInstanceID`.

* Fix imports

* Update lib/auth/auth_with_roles.go

Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>

* Fix `cmp` import

---------

Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>
timothyb89 added a commit that referenced this pull request Aug 8, 2024
* Allow bots to provide client cert authentication when rejoining

Traditionally, bots using non-`token` join methods "renew" their
certs by rejoining, taking advantage of their repeatable join methods
to avoid relying on `token` rejoin logic. However, as part of
[RFD 162], we need a way for rejoining bots to assert an existing
identity (if any) even when rejoining "from scratch".

When paired with #43732, this change allows bots to use an
authenticated auth client for all "initial" joins. If present, an
existing `BotInstanceID` is extracted from the client identity and
used to associate the bot with its previous identity.

As certain join methods (TPM, Azure) are initially handled by the
`JoinService` using gRPC, this additionally adds a new
`BotInstanceID` field to `RegisterUsingTokenRequest` so the
`JoinService` can verify the client identity and extract the
`BotInstanceID` if available. Like `RemoteAddr`, this value is only
trusted when set by the proxy and should be ignored in all other
cases.

[RFD 162]: #36510

* Add tests for authenticated bot rejoining

* Create a new bot instance if the requested instance is not found

Also, includes a new check in for this behavior in
`TestRegisterBotWithInvalidInstanceID`.

* Fix imports

* Update lib/auth/auth_with_roles.go

Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>

* Fix `cmp` import

---------

Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>
github-merge-queue bot pushed a commit that referenced this pull request Aug 26, 2024
* Add basic CRUD for BotInstanceService (#42899)

* Add basic CRUD for BotInstanceService

This adds an early backend for the bot instance service, with resource
definitions and basic CRUD operations.

* Update api/proto/teleport/machineid/v1/bot_instance.proto

Co-authored-by: Noah Stride <noah.stride@goteleport.com>

* Further work on BotInstance boilerplate

Among other things, this adds event definitions, RBAC and presets,
and client definitions. This also includes a few proto changes,
namely moving bot name and instance ID from .Status to .Spec so
metadata can be uniformly generated.

This also adds tests for the backend service showing functional CRUD.

* gRPC tests; fix filtering; fix import lints; misc cleanup

* Consistently use `instance_id` in gRPC fields

* Add tctl support for bot_instance resource

* protos update

* Various typo fixes

* Fix more comment typos

* More doc comments and minor test tweaks

---------

Co-authored-by: Noah Stride <noah.stride@goteleport.com>

* Create Bot Instances during initial bot join (#43577)

* Create Bot Instances during initial bot join

This creates new instances for bots when they initially join the
cluster, and persists instance IDs in new certificate fields on join
and during renewal.

Note that this does not yet handle instance reuse for non-token join
methods.

Additionally, bot instance creation is locked behind a
`BOT_INSTANCE_EXPERIMENT` flag; it must be set to `1` to enable
creation.

* Proto cleanup, and update bot auth records on cert renewal

This makes various (admittedly breaking) protobuf changes, including
removing the TTL field (calculating resource expiry based on cert
requests), removing public key fingerprints, and changing the data
type of the generation counter to match the preexisting internal
datatype. These changes _should_ be safe as no consumers of the proto
API currently exist.

Additionally, this also updates bot authentications on renewal.

* Fix proto lints

* Fix misleading doc comment in the bot instance experiment

* Create bot instances for old bots on join; other fixes

This now creates bot instances for bots whose certs are missing the
BotInstanceID field. Additionally, it fixes two related bugs:
expiration dates are extended on renewal, the generated UUID
is properly appended to certs on initial join, and instances are
only created or updated when the experiment is enabled.

* Add a minimal test for bot instance creation on initial join

* Validate bot instance state in generation counter checks

* Remove outdated TODO comment and fix test lints

* Add an expiration change check to the generation test

* Machine ID: Provide existing identity when rejoining as part of renewals (#43732)

* Allow existing bot identity to be used for token based renewals

* Add test for register with auth client

* Fix tests

* Fix test

* Fix partial assignment to interface

* Improve loggging

* Fix spelling

* Fix missing import

* Fix built gRPC

* Machine ID: Bot Instance heartbeating (#43463)

* Add basic CRUD for BotInstanceService

This adds an early backend for the bot instance service, with resource
definitions and basic CRUD operations.

* Update api/proto/teleport/machineid/v1/bot_instance.proto

Co-authored-by: Noah Stride <noah.stride@goteleport.com>

* Further work on BotInstance boilerplate

Among other things, this adds event definitions, RBAC and presets,
and client definitions. This also includes a few proto changes,
namely moving bot name and instance ID from .Status to .Spec so
metadata can be uniformly generated.

This also adds tests for the backend service showing functional CRUD.

* gRPC tests; fix filtering; fix import lints; misc cleanup

* Consistently use `instance_id` in gRPC fields

* Add tctl support for bot_instance resource

* protos update

* Various typo fixes

* Fix more comment typos

* More doc comments and minor test tweaks

* Add SubmitHeartbeat RPC implementation

* Fix log keys

* Update to use proper BotInstanceID field in identiy

* Add rough outline of heartbeat service

---------

Co-authored-by: Tim Buckley <tim@goteleport.com>

* Allow bots to provide client cert authentication when rejoining (#44257)

* Allow bots to provide client cert authentication when rejoining

Traditionally, bots using non-`token` join methods "renew" their
certs by rejoining, taking advantage of their repeatable join methods
to avoid relying on `token` rejoin logic. However, as part of
[RFD 162], we need a way for rejoining bots to assert an existing
identity (if any) even when rejoining "from scratch".

When paired with #43732, this change allows bots to use an
authenticated auth client for all "initial" joins. If present, an
existing `BotInstanceID` is extracted from the client identity and
used to associate the bot with its previous identity.

As certain join methods (TPM, Azure) are initially handled by the
`JoinService` using gRPC, this additionally adds a new
`BotInstanceID` field to `RegisterUsingTokenRequest` so the
`JoinService` can verify the client identity and extract the
`BotInstanceID` if available. Like `RemoteAddr`, this value is only
trusted when set by the proxy and should be ignored in all other
cases.

[RFD 162]: #36510

* Add tests for authenticated bot rejoining

* Create a new bot instance if the requested instance is not found

Also, includes a new check in for this behavior in
`TestRegisterBotWithInvalidInstanceID`.

* Fix imports

* Update lib/auth/auth_with_roles.go

Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>

* Fix `cmp` import

---------

Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>

* Machine ID: Include Bot Instance ID in audit logs (#43786)

* Create Bot Instances during initial bot join

This creates new instances for bots when they initially join the
cluster, and persists instance IDs in new certificate fields on join
and during renewal.

Note that this does not yet handle instance reuse for non-token join
methods.

Additionally, bot instance creation is locked behind a
`BOT_INSTANCE_EXPERIMENT` flag; it must be set to `1` to enable
creation.

* Proto cleanup, and update bot auth records on cert renewal

This makes various (admittedly breaking) protobuf changes, including
removing the TTL field (calculating resource expiry based on cert
requests), removing public key fingerprints, and changing the data
type of the generation counter to match the preexisting internal
datatype. These changes _should_ be safe as no consumers of the proto
API currently exist.

Additionally, this also updates bot authentications on renewal.

* Fix proto lints

* Fix misleading doc comment in the bot instance experiment

* Create bot instances for old bots on join; other fixes

This now creates bot instances for bots whose certs are missing the
BotInstanceID field. Additionally, it fixes two related bugs:
expiration dates are extended on renewal, the generated UUID
is properly appended to certs on initial join, and instances are
only created or updated when the experiment is enabled.

* Add a minimal test for bot instance creation on initial join

* Validate bot instance state in generation counter checks

* Remove outdated TODO comment and fix test lints

* Add an expiration change check to the generation test

* Add BotInstanceID to audit events

* Fix borked conflict resolution

* Fix further borked conflict resolution

* Add test for cert create/join

---------

Co-authored-by: Tim Buckley <tim@goteleport.com>

* Show bot instance ID in tbot log output (#44578)

This tweaks the "fetched new bot identity" message to show the bot
name and instance ID as embedded in the bot's certificate.

Example:

```
2024-07-23T15:51:20-06:00 INFO [TBOT:IDEN] Fetched new bot identity identity:tpm-test, id=5a2865d3-d3dc-4eaa-853c-5377a1fe83f6 | valid: after=2024-07-23T21:50:20Z, before=2024-07-23T21:56:19Z, duration=5m59s | kind=tls, renewable=false, disallow-reissue=false, roles=[bot-tpm-test], principals=[-teleport-internal-join], generation=0 tbot/service_bot_identity.go:223
```

* Machine ID: Validate generation counter using bot instances (#44583)

* Show bot instance ID in tbot log output

This tweaks the "fetched new bot identity" message to show the bot
name and instance ID as embedded in the bot's certificate.

Example:

```
2024-07-23T15:51:20-06:00 INFO [TBOT:IDEN] Fetched new bot identity identity:tpm-test, id=5a2865d3-d3dc-4eaa-853c-5377a1fe83f6 | valid: after=2024-07-23T21:50:20Z, before=2024-07-23T21:56:19Z, duration=5m59s | kind=tls, renewable=false, disallow-reissue=false, roles=[bot-tpm-test], principals=[-teleport-internal-join], generation=0 tbot/service_bot_identity.go:223
```

* Machine ID: Validate generation counter using bot instances

This change moves the generation counter from bot user labels to a
field in `BotInstanceStatusAuthentication`. This allows for tracking
of individual generations regardless of join method, allows for
multiple `token`-type joins per bot, and opens the door for multi-use
`token`-type tokens in the future.

For now, the new behavior remains behind the `BOT_INSTANCE_EXPERIMENT`
feature flag, and generation counter handling is largely unchanged
when it is unset.

* Update legacy generation counter for downgrade compatibility

This updates the legacy user label generation counter to support some
downgrade compatibility if a user either disables the bot instance
experiment or downgrades to a Teleport version predating it.

Also includes some logger cleanup, and adds a warning when a
generation counter mismatch is detected but not enforced.

* Set initial generation value for legacy mode

`validateGenerationLabel` expects a nonzero generation counter value
for renewable certs to populate the counter, so this provides it
again.

* Add new generation counter tests

This adds various new tests to validate new per-instance generation
counter behavior.

Also includes backcompat fixes to import old generation counters, and
a small fix to ensure downgrades don't leave bots in a bad state.

* Fix imports

* Tweak TODO message

* Fix HostCAs not being returned during bot renewal (#44832)

* Machine ID: Remove `BOT_INSTANCE_EXPERIMENT` feature flag (#44904)

This removes the `BOT_INSTANCE_EXPERIMENT` feature flag to enable the
feature universally. It also adjusts some comments to add deprecation
notices.

* Machine ID: Only set legacy generation counter for renewable joins (#44993)

We unconditionally committed the generation counter to the legacy
user label to allow for downgrade compatibility. This is not valid
for non-token join methods as previous Teleport versions will attempt
to check the generation counter whenever the label is nonzero, and
because these older versions don't support authenticated rejoining,
will never be able to compare to the client's actual generation
value. (Plus, it was hard-coded to zero.)

This disables the legacy generation counter for non-renewable join
methods. This should not create a later upgrade problem as bots will
lose their instance ID upon their first rejoin after downgrading, and
so will be issued a fresh ID and generation counter if and when the
cluster is upgraded again.

* Remove unused function sshPublicKeyToPKIXPEM

Left-over function from backport

* Always use PEM PKIX DER representation of public keys for BotInstances (#43845)

* Always use PEM PKIX DER representation of public keys for BotInstances

* Prefer using `keys.MarshalPublicKey`

* Remove mistakenly included TestVerifyALPNUpgradedConn

The recent merge included an unrelated unit test. This removes it.

---------

Co-authored-by: Noah Stride <noah.stride@goteleport.com>
Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>
@strideynet
Copy link
Contributor Author

I've updated this to be marked implemented, @zmb3 / @timothyb89 PTAL so we can merge this RFD down.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-changelog Indicates that a PR does not require a changelog entry rfd Request for Discussion size/lg size/md
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants