From 80a43a3b7698ad7844f14fb4492d11d5d2f70fe6 Mon Sep 17 00:00:00 2001 From: Noah Stride Date: Wed, 10 Jan 2024 12:44:25 +0000 Subject: [PATCH 01/17] skeleton out rfd --- ...chine-id-token-join-method-bot-instance.md | 264 ++++++++++++++++++ 1 file changed, 264 insertions(+) create mode 100644 rfd/0162-machine-id-token-join-method-bot-instance.md diff --git a/rfd/0162-machine-id-token-join-method-bot-instance.md b/rfd/0162-machine-id-token-join-method-bot-instance.md new file mode 100644 index 000000000000..4aad3f03b3a6 --- /dev/null +++ b/rfd/0162-machine-id-token-join-method-bot-instance.md @@ -0,0 +1,264 @@ +--- +authors: Noah Stride (noah.stride@goteleport.com) +state: draft +--- + +# RFD 00162 - Improving the on-prem fleet Bot management experience + +## Required approvers + +* Engineering: @zmb3 +* Product: @klizhentas || @xinding33 +* Security: @reedloden || @jentfoo + +## What + +This RFD proposes improvements to the management of fleets of Machine ID Bots. +These improvements are mostly targeted at on-prem deployments, where the +delegated join methods are not available. + +The improvements are two-fold: + +- Allowing a single join token to be used to join a number of hosts. +- Providing a way to track individual bot instances. + +Terminology: + +- Bot: An identity within Teleport intended for use by machines as opposed to + humans. Many individual machines may act as this shared identity. +- `tbot`: The Teleport binary that acts as a Bot and generates credentials for + consumption by client applications. +- Bot instance: A single instance of `tbot` running on a host. + +## Why + +Whilst deploying a large fleet of Bots is fairly trivial when using the +delegated join methods, the experience when managing a fleet of bot hosts +on-prem is more challenging. + +The following burdens currently exist: + +- When using the `token` join method, a Bot must be created for each host. 
This + means that the privileges of many distinct Bots need to be synchronised where + those hosts are performing the same function. +- When using the `token` join method, a token can only be used once. This means + creating hundreds of join tokens and managing securely distributing these to + hosts. +- When managing a large fleet of `tbot` deployments, there is no way to + track these within Teleport. This makes it more difficult to identify hosts + which may need updating. + +As we look to onboard more Enterprise customers to Machine ID, the pains of +the on-prem experience have become more apparent. Enterprise customers are +more likely to have on-prem deployments and these are likely to be larger in +scale. + +## Details + +### Current State + +Currently, the `token` join method introduces a generation counter as a label +on the Bot user. This counter is contained within the Bot certificate and on +each renewal, this counter is incremented. When the counter within the certificate +de-synchronises with the counter on the user, the Bot is locked out as a security +measure. + +The fact that this counter is stored within a label on the Bot user creates a +one-to-one binding between a single instance of `tbot` and a single Bot user. +This is not the case when using the delegated join methods. + +### Persistent Bot Instance Identity + +Today, there is no persistent identifier for an individual instance of a Bot. +Instead, all `tbot` instances are effectively identified solely by their Bot +identity. There is no easy way to distinguish them. On each renewal, `tbot` +regenerates the private-public key pair that is used within its certificate and +there is no other form of unique ID that is persisted across renewals. + +This poses a few challenges: + +- For the purposes of auditing, it is not possible to trace actions to a + specific instance of a Bot. 
+- For improving the `token` join method to support multiple Bot instances + associated with a single Bot, there is no identifier to correlate with the + generation counter. +- For analytics purposes, it's difficult for us to track the number of + individual Bot instances in use. We cannot easily determine if it's a single + very active Bot instance, or many less active Bot instances. + +To rectify this, a unique identifier should be established for an instance of +a Bot. + +#### A) UUID Certificate Attribute + +On the initial join of a Bot instance, we could generate a UUID to identify that +Bot instance and encode this within the certificate. Upon renewals, the UUID +would be copied from the current certificate and into the new one. + +Whilst this is fairly easy to implement for the `token` join method, one +challenge for the delegated join methods is that rather than renewing, the +`tbot` instance merely re-joins. As the join RPCs are unauthenticated, the +previous certificate of the Bot instance is not readily available. We can +either: + +- Accept this limitation and treat each renewal of a delegated Bot instance as a + new Bot instance. +- Add support for calling the join RPCs with a client certificate. + +TODO: There was a recent investigation about certificate hierarchies. Integrating +with this would be ideal and would mean this integrates with security reports. + +#### B) Public Key Fingerprint + +Another option is to modify the behaviour of `tbot` to persist and reuse the +keypair across renewals. We could then use a fingerprint of the public key as a +unique identifier of the Bot instance. + +This feels like a natural identifier. It avoids introducing a new attribute to +certificates as the public key is already encoded within certificates. The +nature of public-key cryptography also means that this provides the Bot instance +a way to identify itself without needing an issued certificate. + +However: + +- Does switching to key reuse reduce security? 
+- Is a fingerprint a user understandable identifier? +- Key rotation resets the identity of a Bot instance. + +#### Decision + +TODO + +### Bot Instance Data (a.k.a Heartbeats??) + +With a persistent identifier for a Bot instance established, we can now track +information about a specific Bot server-side. In addition to providing a way +to store a generation counter per instance, this could yield other benefits: + +- Allow Bot instances to be viewed within the UI and CLI. +- Allowing Bot instances to submit basic self-reported information about itself + and its host, e.g: + - `tbot` version + - Hostname, OS and OS version + - The configuration of `tbot` + - Health status +- Record metadata from delegated joins to enrich information about the Bot. + E.g show the linked repository / CI run number +- Billing based on Bot instances rather than Bots. + +Some of this information is known and verified by the server - for example, the +certificate generation or the join metadata. Some of this information is +self-reported and should not be trusted. The information from these two sources +should be segregated to avoid confusion. + +```protobuf +syntax = "proto3"; + +import "teleport/header/v1/metadata.proto"; + +// A BotInstance +message BotInstance { + // The kind of resource represented. + string kind = 1; + // Differentiates variations of the same kind. All resources should + // contain one, even if it is never populated. + string sub_kind = 2; + // The version of the resource being represented. + string version = 3; + // Common metadata that all resources share. + teleport.header.v1.Metadata metadata = 4; + // The configured properties of a BotInstance. + BotInstanceSpec spec = 5; + // Fields that are set by the server as results of operations. These should + // not be modified by users. + BotInstanceStatus status = 6; +} + +message BotInstanceSpec { + // Empty as is not user configurable. + // Eventually this could be leveraged for simple command and control? 
+} + +// BotInstanceStatusHeartbeat contains information self-reported by an instance +// of a Bot. This information is not verified by the server and should not be +// trusted. +message BotInstanceStatusHeartbeat { + google.protobuf.Timestamp timestamp = 1; + bool one_shot = 2; + string version = 3; + string hostname = 4; + // In future iterations, additional information can be submitted here. +} + +message BotInstanceStatusAuthentication { + google.protobuf.Timestamp timestamp = 1; + google.protobuf.Struct metadata = 2; +} + +message BotInstanceStatus { + string join_method = 1; + string generation = 2; + repeated BotInstanceStatusAuthentication authentications = 3; + repeated BotInstanceStatusHeartbeat heartbeats = 4; +} +``` + +#### Submitting Heartbeat Data + +This additional information from the Bot could be submitted in two ways. + +##### A) Specific Heartbeat RPC + +Pros: + +- Avoids making significant changes to the existing join/renew RPCs. +- Allows for Heartbeats to be submitted at a different frequency to renewals. + +Cons: + +- Information about the Bot instance would be incomplete immediately after + joining. +- Some information can only be updated during the join/renew + e.g generation counter and last join metadata. So we'd still need to update + the join/renew RPCs to support this. However, no changes would need to be + made to the RPC message. +- Information within the Heartbeat could come from different instances in time. + +##### B) Submit Heartbeat data on Join/Renew + +Pros: + +- Avoids introducing a new RPC and ensures that all data within the Heartbeat + comes from the same instance in time. + +Cons: + +- Heartbeats are limited to the interval of renewal. + +##### Decision + +### Improving the `token` join method + +### Implementation + +1. a +2. b +3. 
c + +## Security Considerations + +### Audit Events + +TODO + +## Alternatives + +### Skip the Heartbeats and just improve the `token` join method + +TODO + +One challenge would be contention over the user resource if a large number of +Bot instances are trying to renew their certificates at the same time. Our +Backend lacks support for transactional consistency and this increases the risk +of two Bot instances renewing simultaneously and producing an inconsistent state +that locks one of them out. \ No newline at end of file From 03df36c1c9b8c9eb60c9146558be6868233b8771 Mon Sep 17 00:00:00 2001 From: Noah Stride Date: Wed, 10 Jan 2024 13:16:06 +0000 Subject: [PATCH 02/17] Restrucutre to highlight the selected options over alternatives --- ...chine-id-token-join-method-bot-instance.md | 105 ++++++++++++------ 1 file changed, 70 insertions(+), 35 deletions(-) diff --git a/rfd/0162-machine-id-token-join-method-bot-instance.md b/rfd/0162-machine-id-token-join-method-bot-instance.md index 4aad3f03b3a6..a34b4b858aeb 100644 --- a/rfd/0162-machine-id-token-join-method-bot-instance.md +++ b/rfd/0162-machine-id-token-join-method-bot-instance.md @@ -89,7 +89,24 @@ This poses a few challenges: To rectify this, a unique identifier should be established for an instance of a Bot. -#### A) UUID Certificate Attribute +#### Public Key Fingerprint + +One option is to modify the behaviour of `tbot` to persist and reuse the +keypair across renewals. We could then use a fingerprint of the public key as a +unique identifier of the Bot instance. + +This feels like a natural identifier. It avoids introducing a new attribute to +certificates as the public key is already encoded within certificates. The +nature of public-key cryptography also means that this provides the Bot instance +a way to identify itself without needing an issued certificate. + +However: + +- Does switching to key reuse reduce security? +- Is a fingerprint a user understandable identifier? 
+- Key rotation resets the identity of a Bot instance. + +##### Alternative: UUID Certificate Attribute On the initial join of a Bot instance, we could generate a UUID to identify that Bot instance and encode this within the certificate. Upon renewals, the UUID @@ -108,28 +125,7 @@ either: TODO: There was a recent investigation about certificate hierarchies. Integrating with this would be ideal and would mean this integrates with security reports. -#### B) Public Key Fingerprint - -Another option is to modify the behaviour of `tbot` to persist and reuse the -keypair across renewals. We could then use a fingerprint of the public key as a -unique identifier of the Bot instance. - -This feels like a natural identifier. It avoids introducing a new attribute to -certificates as the public key is already encoded within certificates. The -nature of public-key cryptography also means that this provides the Bot instance -a way to identify itself without needing an issued certificate. - -However: - -- Does switching to key reuse reduce security? -- Is a fingerprint a user understandable identifier? -- Key rotation resets the identity of a Bot instance. - -#### Decision - -TODO - -### Bot Instance Data (a.k.a Heartbeats??) +### BotInstance Resource With a persistent identifier for a Bot instance established, we can now track information about a specific Bot server-side. In addition to providing a way @@ -151,6 +147,8 @@ certificate generation or the join metadata. Some of this information is self-reported and should not be trusted. The information from these two sources should be segregated to avoid confusion. +BotInstance will be a new resource type introduced to track this information. + ```protobuf syntax = "proto3"; @@ -184,7 +182,7 @@ message BotInstanceSpec { // trusted. 
message BotInstanceStatusHeartbeat { google.protobuf.Timestamp timestamp = 1; - bool one_shot = 2; + bool is_startup = 2; string version = 3; string hostname = 4; // In future iterations, additional information can be submitted here. @@ -192,22 +190,42 @@ message BotInstanceStatusHeartbeat { message BotInstanceStatusAuthentication { google.protobuf.Timestamp timestamp = 1; - google.protobuf.Struct metadata = 2; + string join_method = 2; + google.protobuf.Struct metadata = 3; + // On each renewal, this generation is incremented. For delegated join + // methods, this counter is not checked during renewal. For the `token` join + // method, this counter is checked during renewal and the Bot is locked out if + // the counter in the certificate does not match the counter of the last + // authentication. + int32 generation = 4; } message BotInstanceStatus { - string join_method = 1; - string generation = 2; - repeated BotInstanceStatusAuthentication authentications = 3; - repeated BotInstanceStatusHeartbeat heartbeats = 4; + string bot_name = 1; + // Last X records kept, with the second oldest being removed once the limit + // is reached. This avoids the indefinite growth of the resource but also + // ensures the initial record is retained. + repeated BotInstanceStatusAuthentication authentications = 2; + // Last X records kept, with the second oldest being removed once the limit + // is reached. This avoids the indefinite growth of the resource but also + // ensures the initial record is retained. + repeated BotInstanceStatusHeartbeat heartbeats = 3; } ``` -#### Submitting Heartbeat Data +Specific edge-cases to handle: + +- BotInstance is deleted but renewal/heartbeat is received + - Reject renewals/heartbeats, trigger `tbot` to exit and suggest reset, OR + - Create a BotInstance and continue as normal. Warn/Error log. +- Join method/token changes: + - Reject renewals/heartbeats, trigger `tbot` to exit and suggest reset, OR + - Emit warning and continue. 
+ - Consider case where linked Bot changes -This additional information from the Bot could be submitted in two ways. +#### Recording Authentication Data -##### A) Specific Heartbeat RPC +#### Recording Heartbeat Data Pros: @@ -224,20 +242,37 @@ Cons: made to the RPC message. - Information within the Heartbeat could come from different instances in time. -##### B) Submit Heartbeat data on Join/Renew +##### Alternative: Submit Heartbeat data on Join/Renew Pros: - Avoids introducing a new RPC and ensures that all data within the Heartbeat comes from the same instance in time. +- Allows self-reported information to be used as part of renewal decision. + This is not a strong defence as it is self-reported and cannot be trusted. Cons: - Heartbeats are limited to the interval of renewal. -##### Decision +#### API + +### Changes to the `token` Join Method + +No longer consumed on join. + +### CLI Changes + +#### `tbot` + +`tbot reset` + +#### `tctl` + +`tctl bot instances list` +`tctl bot instances list --bot ` -### Improving the `token` join method +Additionally, `tctl rm`/`tctl get` should be able to operate on BotInstance. ### Implementation From b83e49dd31a24150d116719d584fd99b6db14b61 Mon Sep 17 00:00:00 2001 From: Noah Stride Date: Wed, 10 Jan 2024 16:09:49 +0000 Subject: [PATCH 03/17] identify edge cases --- ...chine-id-token-join-method-bot-instance.md | 46 +++++++++++-------- 1 file changed, 28 insertions(+), 18 deletions(-) diff --git a/rfd/0162-machine-id-token-join-method-bot-instance.md b/rfd/0162-machine-id-token-join-method-bot-instance.md index a34b4b858aeb..740c9015566e 100644 --- a/rfd/0162-machine-id-token-join-method-bot-instance.md +++ b/rfd/0162-machine-id-token-join-method-bot-instance.md @@ -13,15 +13,6 @@ state: draft ## What -This RFD proposes improvements to the management of fleets of Machine ID Bots. -These improvements are mostly targeted at on-prem deployments, where the -delegated join methods are not available. 
- -The improvements are two-fold: - -- Allowing a single join token to be used to join a number of hosts. -- Providing a way to track individual bot instances. - Terminology: - Bot: An identity within Teleport intended for use by machines as opposed to @@ -30,6 +21,18 @@ Terminology: consumption by client applications. - Bot instance: A single instance of `tbot` running on a host. +This RFD proposes improvements to the management of fleets of Machine ID Bots. +These improvements are mostly targeted at on-prem deployments, where the +delegated join methods are not available. + +The improvements will focus on three points: + +- To allow multiple Bot instances to be associated with a single Bot when using + the `token` join method. +- To allow multiple Bot instances to be joined using a single join token when + using the `token` join method. +- To provide a way to track and monitor Bot instances. + ## Why Whilst deploying a large fleet of Bots is fairly trivial when using the @@ -128,10 +131,10 @@ with this would be ideal and would mean this integrates with security reports. ### BotInstance Resource With a persistent identifier for a Bot instance established, we can now track -information about a specific Bot server-side. In addition to providing a way -to store a generation counter per instance, this could yield other benefits: +information about a specific Bot instance server-side. In addition to providing +a way to store a generation counter per instance, this yields other benefits: -- Allow Bot instances to be viewed within the UI and CLI. +- Allows Bot instances to be viewed within the UI and CLI. 
- Allowing Bot instances to submit basic self-reported information about itself and its host, e.g: - `tbot` version @@ -213,18 +216,18 @@ message BotInstanceStatus { } ``` +#### Recording Authentication Data + Specific edge-cases to handle: -- BotInstance is deleted but renewal/heartbeat is received - - Reject renewals/heartbeats, trigger `tbot` to exit and suggest reset, OR +- BotInstance does not exist but renewal is received + - Reject renewals, trigger `tbot` to exit and suggest reset, OR - Create a BotInstance and continue as normal. Warn/Error log. - Join method/token changes: - - Reject renewals/heartbeats, trigger `tbot` to exit and suggest reset, OR + - Reject renewals, trigger `tbot` to exit and suggest reset, OR - Emit warning and continue. - Consider case where linked Bot changes -#### Recording Authentication Data - #### Recording Heartbeat Data Pros: @@ -240,7 +243,11 @@ Cons: e.g generation counter and last join metadata. So we'd still need to update the join/renew RPCs to support this. However, no changes would need to be made to the RPC message. -- Information within the Heartbeat could come from different instances in time. +- Information within the Heartbeat could come from different instances in time. + +Specific edge-cases to handle: + +- BotInstance does not exist but heartbeat is received ##### Alternative: Submit Heartbeat data on Join/Renew @@ -274,6 +281,9 @@ No longer consumed on join. Additionally, `tctl rm`/`tctl get` should be able to operate on BotInstance. +There is no requirement for it to be possible to create or update a BotInstance +with `tctl`. + ### Implementation 1. 
a From cc4ee9936997b5ff61fabbf58627ad7cb3b5b6df Mon Sep 17 00:00:00 2001 From: Noah Stride Date: Thu, 11 Jan 2024 16:30:20 +0000 Subject: [PATCH 04/17] Explain backend key format --- ...chine-id-token-join-method-bot-instance.md | 37 +++++++++++++++++-- 1 file changed, 34 insertions(+), 3 deletions(-) diff --git a/rfd/0162-machine-id-token-join-method-bot-instance.md b/rfd/0162-machine-id-token-join-method-bot-instance.md index 740c9015566e..1bb354d0acef 100644 --- a/rfd/0162-machine-id-token-join-method-bot-instance.md +++ b/rfd/0162-machine-id-token-join-method-bot-instance.md @@ -109,6 +109,10 @@ However: - Is a fingerprint a user understandable identifier? - Key rotation resets the identity of a Bot instance. +To mitigate the risk of pre-image attacks, SHA256 will be used to determine the +fingerprint of the public key. In addition, the full public key should be +recorded and verified against when authenticating a Bot instance action. + ##### Alternative: UUID Certificate Attribute On the initial join of a Bot instance, we could generate a UUID to identify that @@ -188,6 +192,7 @@ message BotInstanceStatusHeartbeat { bool is_startup = 2; string version = 3; string hostname = 4; + google.protobuf.Duration uptime = 5; // In future iterations, additional information can be submitted here. } @@ -204,20 +209,37 @@ message BotInstanceStatusAuthentication { } message BotInstanceStatus { - string bot_name = 1; + // The public key of the Bot instance. + bytes public_key = 1; + // The name of the Bot that this instance is associated with. + string bot_name = 2; // Last X records kept, with the second oldest being removed once the limit // is reached. This avoids the indefinite growth of the resource but also // ensures the initial record is retained. - repeated BotInstanceStatusAuthentication authentications = 2; + repeated BotInstanceStatusAuthentication authentications = 3; // Last X records kept, with the second oldest being removed once the limit // is reached. 
This avoids the indefinite growth of the resource but also // ensures the initial record is retained. - repeated BotInstanceStatusHeartbeat heartbeats = 3; + repeated BotInstanceStatusHeartbeat heartbeats = 4; } ``` +The name used for a BotInstance will be a concatenation of the Bot name and the +SHA256 fingerprint of the instance's public key +e.g `my-robot/2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae`. + +When storing the BotInstance in the backend, the key will be: +`bot_instances/{bot_name}/{fingerprint}`. + #### Recording Authentication Data +Upon each join and renewal, the BotInstance record will be updated with an +additional entry in the `status.authentications` field. If there are X entries, +then the second-oldest entry will be removed. This prevents growth without +bounds but also ensures that the original record is retained. + +If a BotInstance does not exist, then one will be created. + Specific edge-cases to handle: - BotInstance does not exist but renewal is received @@ -230,6 +252,15 @@ Specific edge-cases to handle: #### Recording Heartbeat Data +A new special endpoint will be added for submitting Heartbeat data. + +```protobuf + +``` + +This endpoint will be called by `tbot` immediately after it has initially +authenticated and then every hour after. + Pros: - Avoids making significant changes to the existing join/renew RPCs. 
From 6f596774a1ab48f2db23ca23e838dca8b0fa921a Mon Sep 17 00:00:00 2001 From: Noah Stride Date: Thu, 11 Jan 2024 16:52:24 +0000 Subject: [PATCH 05/17] Add spec for heartbeat rpc --- ...chine-id-token-join-method-bot-instance.md | 35 +++++++++++++++++-- 1 file changed, 33 insertions(+), 2 deletions(-) diff --git a/rfd/0162-machine-id-token-join-method-bot-instance.md b/rfd/0162-machine-id-token-join-method-bot-instance.md index 1bb354d0acef..f433363e26c7 100644 --- a/rfd/0162-machine-id-token-join-method-bot-instance.md +++ b/rfd/0162-machine-id-token-join-method-bot-instance.md @@ -252,14 +252,43 @@ Specific edge-cases to handle: #### Recording Heartbeat Data -A new special endpoint will be added for submitting Heartbeat data. +A new RPC will be added for submitting Heartbeat data. ```protobuf +syntax = "proto3"; + +package teleport.machineid.v1; + +service BotInstanceService { + // SubmitHeartbeat submits a heartbeat for a BotInstance. + rpc SubmitHeartbeat(SubmitHeartbeatRequest) returns (SubmitHeartbeatResponse); +} + +// The request for SubmitHeartbeat. +message SubmitHeartbeatRequest { + // The heartbeat data to submit. + BotInstanceStatusHeartbeat heartbeat = 1; +} +// The response for SubmitHeartbeat. +message SubmitHeartbeatResponse { + // Empty +} ``` +The endpoint will have a special authentication check. RBAC will not be used and +instead the endpoint will check: + +- The presented client certificate is for the Bot linked to the instance. +- The presented client certificate's public key matches the public key recorded + for the BotInstance. + This endpoint will be called by `tbot` immediately after it has initially -authenticated and then every hour after. +authenticated. After a heartbeat has successfully completed, another should be +scheduled for an hour after. A small amount of jitter should be added to the +heartbeat period to avoid a thundering herd of heartbeats. + +If the heartbeat fails, then `tbot` should retry on an exponential backoff. 
Pros: @@ -282,6 +311,8 @@ Specific edge-cases to handle: ##### Alternative: Submit Heartbeat data on Join/Renew +Alternatively, we could add a Heartbeat field to the join/renew RPCs. + Pros: - Avoids introducing a new RPC and ensures that all data within the Heartbeat From 09d735129928f8b88eef9f4873476db05023763a Mon Sep 17 00:00:00 2001 From: Noah Stride Date: Fri, 12 Jan 2024 12:44:07 +0000 Subject: [PATCH 06/17] Add API for BotInstance resource --- ...chine-id-token-join-method-bot-instance.md | 196 +++++++++++++----- 1 file changed, 139 insertions(+), 57 deletions(-) diff --git a/rfd/0162-machine-id-token-join-method-bot-instance.md b/rfd/0162-machine-id-token-join-method-bot-instance.md index f433363e26c7..1fa49bdf4a8b 100644 --- a/rfd/0162-machine-id-token-join-method-bot-instance.md +++ b/rfd/0162-machine-id-token-join-method-bot-instance.md @@ -103,16 +103,14 @@ certificates as the public key is already encoded within certificates. The nature of public-key cryptography also means that this provides the Bot instance a way to identify itself without needing an issued certificate. -However: - -- Does switching to key reuse reduce security? -- Is a fingerprint a user understandable identifier? -- Key rotation resets the identity of a Bot instance. - To mitigate the risk of pre-image attacks, SHA256 will be used to determine the fingerprint of the public key. In addition, the full public key should be recorded and verified against when authenticating a Bot instance action. +It should be noted that with this technique, rotating the keypair of a `tbot` +instance would reset the identity of that instance. Rotation of this keypair +would be unusual and this side effect seems expected. + ##### Alternative: UUID Certificate Attribute On the initial join of a Bot instance, we could generate a UUID to identify that @@ -126,14 +124,20 @@ previous certificate of the Bot instance is not readily available. 
We can either: - Accept this limitation and treat each renewal of a delegated Bot instance as a - new Bot instance. + new Bot instance. This is likely unacceptable and would limit any advantages + of this work to the `token` join method. - Add support for calling the join RPCs with a client certificate. -TODO: There was a recent investigation about certificate hierarchies. Integrating -with this would be ideal and would mean this integrates with security reports. +This technique could reuse a recently proposed LoginID attribute. This would +allow features such as security reports and automated anomaly detection to work +seamlessly across humans and machines. ### BotInstance Resource +The +[Resource Guidelines RFD](https://github.com/gravitational/teleport/blob/master/rfd/0153-resource-guidelines.md) +will be followed. + With a persistent identifier for a Bot instance established, we can now track information about a specific Bot instance server-side. In addition to providing a way to store a generation counter per instance, this yields other benefits: @@ -188,17 +192,33 @@ message BotInstanceSpec { // of a Bot. This information is not verified by the server and should not be // trusted. message BotInstanceStatusHeartbeat { - google.protobuf.Timestamp timestamp = 1; + // The timestamp that the heartbeat was recorded by the Auth Server. Any + // value submitted by `tbot` for this field will be ignored. + google.protobuf.Timestamp recorded_at = 1; + // Indicates whether this is the heartbeat submitted by `tbot` on startup. bool is_startup = 2; + // The version of `tbot` that submitted this heartbeat. string version = 3; + // The hostname of the host that `tbot` is running on. string hostname = 4; + // The duration that `tbot` has been running for when it submitted this + // heartbeat. google.protobuf.Duration uptime = 5; + // In future iterations, additional information can be submitted here. 
+ // For example, the configuration of `tbot` or the health of individual + // outputs. } +// BotInstanceStatusAuthentication contains information about a join or renewal. +// This information is entirely sourced by the Auth Server and can be trusted. message BotInstanceStatusAuthentication { - google.protobuf.Timestamp timestamp = 1; + // The timestamp that the join or renewal was authenticated by the Auth + // Server. + google.protobuf.Timestamp authenticated_at = 1; + // The join method used for this join or renewal. string join_method = 2; + // The metadata sourced from the join method. google.protobuf.Struct metadata = 3; // On each renewal, this generation is incremented. For delegated join // methods, this counter is not checked during renewal. For the `token` join @@ -208,19 +228,24 @@ message BotInstanceStatusAuthentication { int32 generation = 4; } +// BotInstanceStatus holds the status of a BotInstance. message BotInstanceStatus { // The public key of the Bot instance. + // When authenticating a Bot instance, the full public key must be compared + // rather than just the fingerprint to mitigate pre-image attacks. bytes public_key = 1; + // The fingerprint of the public key of the Bot instance. + string fingerprint = 2; // The name of the Bot that this instance is associated with. - string bot_name = 2; + string bot_name = 3; // Last X records kept, with the second oldest being removed once the limit // is reached. This avoids the indefinite growth of the resource but also // ensures the initial record is retained. - repeated BotInstanceStatusAuthentication authentications = 3; + repeated BotInstanceStatusAuthentication authentications = 4; // Last X records kept, with the second oldest being removed once the limit // is reached. This avoids the indefinite growth of the resource but also // ensures the initial record is retained. 
- repeated BotInstanceStatusHeartbeat heartbeats = 4; + repeated BotInstanceStatusHeartbeat heartbeats = 5; } ``` @@ -229,7 +254,11 @@ SHA256 fingerprint of the instance's public key e.g `my-robot/2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae`. When storing the BotInstance in the backend, the key will be: -`bot_instances/{bot_name}/{fingerprint}`. +`bot_instances/{bot_name}/{fingerprint}`. This will allow for efficient listing +of BotInstances for a given Bot. + +Like agent heartbeats, the BotInstance will expire after a period of inactivity. +This avoids the accumulation of ephemeral BotInstances. #### Recording Authentication Data @@ -238,21 +267,16 @@ additional entry in the `status.authentications` field. If there are X entries, then the second-oldest entry will be removed. This prevents growth without bounds but also ensures that the original record is retained. -If a BotInstance does not exist, then one will be created. +In addition, the TTL of the BotInstance resource will be extended. -Specific edge-cases to handle: - -- BotInstance does not exist but renewal is received - - Reject renewals, trigger `tbot` to exit and suggest reset, OR - - Create a BotInstance and continue as normal. Warn/Error log. -- Join method/token changes: - - Reject renewals, trigger `tbot` to exit and suggest reset, OR - - Emit warning and continue. - - Consider case where linked Bot changes +If a BotInstance does not exist, then one will be created. In the case that +this occurs for a bot using the `token` join method and this is a renewal, +a warning will be emitted and the initial generation of the BotInstance will +be sourced from the certificate's current generation counter. #### Recording Heartbeat Data -A new RPC will be added for submitting Heartbeat data. +A new RPC will be added for submitting heartbeat data: ```protobuf syntax = "proto3"; @@ -276,7 +300,7 @@ message SubmitHeartbeatResponse { } ``` -The endpoint will have a special authentication check. 
RBAC will not be used and +The endpoint will have a special auth/authz check. RBAC will not be used and instead the endpoint will check: - The presented client certificate is for the Bot linked to the instance. @@ -290,25 +314,6 @@ heartbeat period to avoid a thundering herd of heartbeats. If the heartbeat fails, then `tbot` should retry on a exponential backoff. -Pros: - -- Avoids making significant changes to the existing join/renew RPCs. -- Allows for Heartbeats to be submitted at a different frequency to renewals. - -Cons: - -- Information about the Bot instance would be incomplete immediately after - joining. -- Some information can only be updated during the join/renew - e.g generation counter and last join metadata. So we'd still need to update - the join/renew RPCs to support this. However, no changes would need to be - made to the RPC message. -- Information within the Heartbeat could come from different instances in time. - -Specific edge-cases to handle: - -- BotInstance does not exist but heartbeat is received - ##### Alternative: Submit Heartbeat data on Join/Renew Alternatively, we could add a Heartbeat field to the join/renew RPCs. @@ -319,33 +324,107 @@ Pros: comes from the same instance in time. - Allows self-reported information to be used as part of renewal decision. This is not a strong defence as it is self-reported and cannot be trusted. +- Avoids a state where the BotInstance is incomplete immediately after joining + and before it has called SubmitHeartbeat. Cons: +- Adds Bot specific behaviour to RPCs that are also used for Node joining. - Heartbeats are limited to the interval of renewal. #### API +Additional RPCs will be added to the BotInstance service to allow these to +be listed and deleted: + +```protobuf +syntax = "proto3"; + +package teleport.machineid.v1; + +service BotInstanceService { + // GetBotInstance returns the specified BotInstance resource. 
+ rpc GetBotInstance(GetBotInstanceRequest) returns (BotInstance); + // ListBotInstances returns a page of BotInstance resources. + rpc ListBotInstances(ListBotInstancesRequest) returns (ListBotInstancesResponse); + // DeleteBotInstance hard deletes the specified BotInstance resource. + rpc DeleteBotInstance(DeleteBotInstanceRequest) returns (google.protobuf.Empty); +} + +// Request for GetBotInstance. +message GetBotInstanceRequest { + // The name of the BotInstance to retrieve. + string name = 1; +} + +// Request for ListFoos. +// +// Follows the pagination semantics of +// https://cloud.google.com/apis/design/standard_methods#list +message ListBotInstancesRequest { + // The name of the Bot to list BotInstances for. If empty, all BotInstances + // will be listed. + string filter_bot_name = 1; + // The maximum number of items to return. + // The server may impose a different page size at its discretion. + int32 page_size = 2; + // The page_token value returned from a previous ListBotInstances request, if + // any. + string page_token = 3; +} + +// Response for ListBotInstances. +message ListBotInstancesResponse { + // BotInstances that matched the search. + repeated BotInstance bot_instances = 1; + // Token to retrieve the next page of results, or empty if no more + // results exist. + string next_page_token = 2; +} + +// Request for DeleteBotInstance. +message DeleteBotInstanceRequest { + // The name of the BotInstance to delete. + string name = 1; +} +``` + ### Changes to the `token` Join Method -No longer consumed on join. +As we now have a way to track the generation for a specific Bot instance, we +can allow multiple Bot instances to be associated with a single Bot. This +also means that the token no longer needs to be consumed on a join. + +Eventually, we may wish to add a way to specify a number of joins which can +occur with a token. This provides a way to easily control the lifetime of a +token when deploying to a fleet of a pre-known size.
+ +The renewal logic will need to be adjusted to read the generation counter from +the BotInstance rather than the Bot user. ### CLI Changes #### `tbot` -`tbot reset` +`tbot reset` will be added to allow a `tbot` instance to be reset. This will +simply clear out any artifacts within the `tbot` storage directory. #### `tctl` -`tctl bot instances list` -`tctl bot instances list --bot ` +`tctl bots instances list` +`tctl bots instances list --bot ` +`tctl tokens add --type=bot --bot ` Additionally, `tctl rm`/`tctl get` should be able to operate on BotInstance. There is no requirement for it to be possible to create or update a BotInstance with `tctl`. +### Analytics + +A PostHog event should be emitted for each BotInstance heartbeat. This will +allow us to track active bots in a similar way to how we track active agents. + ### Implementation 1. a @@ -356,16 +435,19 @@ with `tctl`. ### Audit Events -TODO +Deletion of a BotInstance should be audited. ## Alternatives -### Skip the Heartbeats and just improve the `token` join method +### Defer the Heartbeats work and solely improve the `token` join method -TODO +We could modify the way we record the generation counter for the `token` join +method without introducing the BotInstance resource. Instead of storing a +single counter within the Bot User labels, we could store a JSON encoded map +of counters. -One challenge would be contention over the user resource if a large number of +One challenge would be contention over the User resource if a large number of Bot instances are trying to renew their certificates at the same time. Our -Backend lacks support for transactional consistency and this increases the risk -of two Bot instances renewing simultaneously and producing an inconsistent state -that locks one of them out. 
\ No newline at end of file +Backend has limited support for transactional consistency and this increases the +risk of two Bot instances renewing simultaneously and producing an inconsistent +state that locks one of them out. From a1d06e008513a5922111e6abf16adf3813c9243b Mon Sep 17 00:00:00 2001 From: Noah Stride Date: Fri, 12 Jan 2024 12:56:54 +0000 Subject: [PATCH 07/17] Add notes on analytics --- ...chine-id-token-join-method-bot-instance.md | 22 ++++++++++++++++++- 1 file changed, 21 insertions(+), 1 deletion(-) diff --git a/rfd/0162-machine-id-token-join-method-bot-instance.md b/rfd/0162-machine-id-token-join-method-bot-instance.md index 1fa49bdf4a8b..da95543361a9 100644 --- a/rfd/0162-machine-id-token-join-method-bot-instance.md +++ b/rfd/0162-machine-id-token-join-method-bot-instance.md @@ -204,6 +204,10 @@ message BotInstanceStatusHeartbeat { // The duration that `tbot` has been running for when it submitted this // heartbeat. google.protobuf.Duration uptime = 5; + // The currently configured join_method. + string join_method = 6; + // Indicates whether `tbot` is running in one-shot mode. + bool one_shot = 7; // In future iterations, additional information can be submitted here. // For example, the configuration of `tbot` or the health of individual @@ -314,7 +318,7 @@ heartbeat period to avoid a thundering herd of heartbeats. If the heartbeat fails, then `tbot` should retry on a exponential backoff. -##### Alternative: Submit Heartbeat data on Join/Renew +##### Alternative: Submit Heartbeat Data on Join/Renew Alternatively, we could add a Heartbeat field to the join/renew RPCs. @@ -425,6 +429,22 @@ with `tctl`. A PostHog event should be emitted for each BotInstance heartbeat. This will allow us to track active bots in a similar way to how we track active agents. 
+```protobuf +// a heartbeat for a Bot Instance +// +// PostHog event: tp.bot.instance.hb +message BotInstanceHeartbeatEvent { + // anonymized name of the instance, 32 bytes (HMAC-SHA-256); + bytes bot_instance_name = 1; + // the version of tbot + string version = 2; + // indicates whether or not tbot is running in one-shot mode + bool one_shot = 3; + // indicates the configured join method of `tbot`. + string join_method = 4; +} +``` + ### Implementation 1. a From 4dd6dd90bb84585d94966a41aef9ca98ffe73e38 Mon Sep 17 00:00:00 2001 From: Noah Stride Date: Fri, 12 Jan 2024 13:15:51 +0000 Subject: [PATCH 08/17] Add a brief summary of out of scope improvements --- ...chine-id-token-join-method-bot-instance.md | 40 +++++++++++++++++-- 1 file changed, 36 insertions(+), 4 deletions(-) diff --git a/rfd/0162-machine-id-token-join-method-bot-instance.md b/rfd/0162-machine-id-token-join-method-bot-instance.md index da95543361a9..383ae94abfbe 100644 --- a/rfd/0162-machine-id-token-join-method-bot-instance.md +++ b/rfd/0162-machine-id-token-join-method-bot-instance.md @@ -361,7 +361,7 @@ message GetBotInstanceRequest { string name = 1; } -// Request for ListFoos. +// Request for ListBotInstances. // // Follows the pagination semantics of // https://cloud.google.com/apis/design/standard_methods#list @@ -403,8 +403,8 @@ Eventually, we may wish to add a way to specify a number of joins which can occur with a token. This provides a way to easily control the lifetime of a token when deploying to a fleet of a pre-known size. -The renewal logic will need to be adjusted to read the generation counter from -the BotInstance rather than the Bot user. +The renewal logic will need to be adjusted to read and update the generation +counter from the BotInstance rather than the Bot user. ### CLI Changes @@ -466,8 +466,40 @@ method without introducing the BotInstance resource. Instead of storing a single counter within the Bot User labels, we could store a JSON encoded map of counters. 
-One challenge would be contention over the User resource if a large number of +One challenge would be contention over the user resource if a large number of Bot instances are trying to renew their certificates at the same time. Our Backend has limited support for transactional consistency and this increases the risk of two Bot instances renewing simultaneously and producing an inconsistent state that locks one of them out. + +### Remove generation counter from the `token` join method + +## Out of Scope + +These tasks are out of scope of this RFD but could be considered natural +follow-on tasks. + +### Multi-phase Commit of Generation Counter + +Currently, the generation counter is fragile as it is incremented server-side +without confirmation that `tbot` has been able to use and persist the new +credentials. If `tbot` does not receive confirmation of the renewal or is +unable to persist the new credentials, it will be locked out on its next +attempt to renew. + +We could introduce a multi-phase commit of the generation counter. This would +provide more robustness to the renewal process. + +### Locking of Individual Bot Instances + +Currently, it's only possible to lock out an entire Bot user. This means that +when managing a large fleet, it would not be possible to lock out a specific host +that had been compromised. This is likely to be a major friction point for those +deploying a large number of Bot instances. + +It also increases the significance of the fragility of the generation counter. + +### Bot Command and Control + +The BotInstance resource could be extended to allow `tbot` to be controlled +remotely.
\ No newline at end of file From 42c2e300cca82b00c0ea7be9f68e2dbe756d9968 Mon Sep 17 00:00:00 2001 From: Noah Stride Date: Fri, 12 Jan 2024 14:07:37 +0000 Subject: [PATCH 09/17] Add output/services to heartbeat --- ...chine-id-token-join-method-bot-instance.md | 91 ++++++++++++++++--- 1 file changed, 78 insertions(+), 13 deletions(-) diff --git a/rfd/0162-machine-id-token-join-method-bot-instance.md b/rfd/0162-machine-id-token-join-method-bot-instance.md index 383ae94abfbe..05648045e332 100644 --- a/rfd/0162-machine-id-token-join-method-bot-instance.md +++ b/rfd/0162-machine-id-token-join-method-bot-instance.md @@ -188,6 +188,14 @@ message BotInstanceSpec { // Eventually this could be leveraged for simple command and control? } +message BotInstanceStatusHeartbeatOutput { + string type = 1; +} + +message BotInstanceStatusHeartbeatService { + string type = 1; +} + // BotInstanceStatusHeartbeat contains information self-reported by an instance // of a Bot. This information is not verified by the server and should not be // trusted. @@ -208,6 +216,10 @@ message BotInstanceStatusHeartbeat { string join_method = 6; // Indicates whether `tbot` is running in one-shot mode. bool one_shot = 7; + // List of currently user configured outputs. + repeated outputs BotInstanceStatusHeartbeatOutput = 8; + // List of currently user configured services. + repeated services BotInstanceStatusHeartbeatService = 9; // In future iterations, additional information can be submitted here. // For example, the configuration of `tbot` or the health of individual @@ -222,14 +234,17 @@ message BotInstanceStatusAuthentication { google.protobuf.Timestamp authenticated_at = 1; // The join method used for this join or renewal. string join_method = 2; + // The join token used for this join or renewal. This is only populated for + // delegated join methods as the value for `token` join methods is sensitive. + string join_token = 3; // The metadata sourced from the join method. 
- google.protobuf.Struct metadata = 3; + google.protobuf.Struct metadata = 4; // On each renewal, this generation is incremented. For delegated join // methods, this counter is not checked during renewal. For the `token` join // method, this counter is checked during renewal and the Bot is locked out if // the counter in the certificate does not match the counter of the last // authentication. - int32 generation = 4; + int32 generation = 5; } // BotInstanceStatus holds the status of a BotInstance. @@ -276,7 +291,9 @@ In addition, the TTL of the BotInstance resource will be extended. If a BotInstance does not exist, then one will be created. In the case that this occurs for a bot using the `token` join method and this is a renewal, a warning will be emitted and the initial generation of the BotInstance will -be sourced from the certificates current generation counter. +be sourced from the certificate's current generation counter. This behaviour +will support the migration of existing `tbot` instances to the new BotInstance +behaviour. #### Recording Heartbeat Data @@ -312,11 +329,11 @@ instead the endpoint will check: for the BotInstance. This endpoint will be called by `tbot` immediately after it has initially -authenticated. After a heartbeat has succesfully completed, another should be -scheduled for an hour after. A small amount of jitter should be added to the -heartbeat period to avoid a thundering herd of heartbeats. +authenticated. After a heartbeat has successfully completed, another should be +scheduled for half an hour later. A small amount of jitter should be added to +the heartbeat period to avoid a thundering herd of heartbeats. -If the heartbeat fails, then `tbot` should retry on a exponential backoff. +If the heartbeat fails, then `tbot` should retry on an exponential backoff. ##### Alternative: Submit Heartbeat Data on Join/Renew @@ -413,6 +430,14 @@ counter from the BotInstance rather than the Bot user. #### `tbot`
`tbot reset` will be added to allow a `tbot` instance to be reset. This will simply clear out any artifacts within the `tbot` storage directory. +In addition, if the bot detects a change in join token or join method, it should +automatically rotate its keypair. This will ensure it presents as a fresh +BotInstance to the Auth Server. + +A log message should be output that identifies the linked BotInstance at +startup and on each heartbeat. This will allow users to easily correlate the +`tbot` installation with a BotInstance. + #### `tctl` `tctl bots instances list` @@ -445,17 +470,44 @@ message BotInstanceHeartbeatEvent { } ``` -### Implementation +Existing analytics for join, renewal and certificate generation should be +extended to include the anonymized BotInstance ID. This will allow them to be +linked together. + +### Migration/Compatibility + +The "create if not exists" behaviour of the BotInstance resource will mean that +existing Bot instances will have a BotInstance resource created on their first +renewal after this feature is released. Their existing generation counter will +be trusted on this first renewal. This allows for a seamless migration to the +new system. -1. a -2. b -3. c +Older `tbot` instances will not submit heartbeats. This means that their +BotInstance will only contain authentication data. Any CLI or GUI that shows +BotInstances should show a gracefully degraded state in this case that explains +that the `tbot` needs to be upgraded. ## Security Considerations ### Audit Events -Deletion of a BotInstance should be audited. +An audit event should be added for the deletion of a BotInstance. The +name of the BotInstance should be added to the existing join, renewal and +certificate generation audit events. + +### Resistance to collision/pre-image attacks + +We should ensure that a BotInstance cannot be impersonated using a +second pre-image attack. This risk is introduced by using the public key +fingerprint as an identifier.
+ +To mitigate this, we should ensure that the full public key is compared +to the recorded one when authenticating a BotInstance rather than merely +comparing the fingerprint. + +In addition, a more modern hashing algorithm should be used to calculate the +fingerprint. In this case, we have selected SHA256 as this is more resistant +compared to hash functions such as MD5 or SHA1. ## Alternatives @@ -472,7 +524,20 @@ Backend has limited support for transactional consistency and this increases the risk of two Bot instances renewing simultaneously and producing an inconsistent state that locks one of them out. -### Remove generation counter from the `token` join method +### Introduce renewal-less and generation-less `token` join method + +One option is to introduce a new join method that does not produce renewable +certificates. There would be no need for the fragile generation counter +and the join token would be continually re-used to join as is done for the +delegated join methods. This also circumvents the need for a one-to-one binding +between a Bot instance and a Bot. + +This token would be incredibly sensitive and if stolen, there would be no +automated mechanisms to detect this as exists today with the generation counter. + +It likely makes more sense to improve the existing `token` join method rather +than introduce a variant which behaves differently and is less secure. It would +increase the complexity of the codebase and the user experience. 
## Out of Scope From 6d862171ffc8cd49437db0295c4237bc9b054ef8 Mon Sep 17 00:00:00 2001 From: Noah Stride Date: Fri, 12 Jan 2024 14:09:30 +0000 Subject: [PATCH 10/17] Remove services/outputs --- ...162-machine-id-token-join-method-bot-instance.md | 13 ------------- 1 file changed, 13 deletions(-) diff --git a/rfd/0162-machine-id-token-join-method-bot-instance.md b/rfd/0162-machine-id-token-join-method-bot-instance.md index 05648045e332..bc656e217396 100644 --- a/rfd/0162-machine-id-token-join-method-bot-instance.md +++ b/rfd/0162-machine-id-token-join-method-bot-instance.md @@ -188,14 +188,6 @@ message BotInstanceSpec { // Eventually this could be leveraged for simple command and control? } -message BotInstanceStatusHeartbeatOutput { - string type = 1; -} - -message BotInstanceStatusHeartbeatService { - string type = 1; -} - // BotInstanceStatusHeartbeat contains information self-reported by an instance // of a Bot. This information is not verified by the server and should not be // trusted. @@ -216,11 +208,6 @@ message BotInstanceStatusHeartbeat { string join_method = 6; // Indicates whether `tbot` is running in one-shot mode. bool one_shot = 7; - // List of currently user configured outputs. - repeated outputs BotInstanceStatusHeartbeatOutput = 8; - // List of currently user configured services. - repeated services BotInstanceStatusHeartbeatService = 9; - // In future iterations, additional information can be submitted here. // For example, the configuration of `tbot` or the health of individual // outputs. 
From f8f78965395a0d70aca64d564cf5e6a3f7ff6e1b Mon Sep 17 00:00:00 2001 From: Noah Stride Date: Fri, 12 Jan 2024 14:13:22 +0000 Subject: [PATCH 11/17] Expand on `tctl` command changes --- rfd/0162-machine-id-token-join-method-bot-instance.md | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/rfd/0162-machine-id-token-join-method-bot-instance.md b/rfd/0162-machine-id-token-join-method-bot-instance.md index bc656e217396..fe505dc634b5 100644 --- a/rfd/0162-machine-id-token-join-method-bot-instance.md +++ b/rfd/0162-machine-id-token-join-method-bot-instance.md @@ -427,8 +427,16 @@ startup and on each heartbeat. This will allow users to easily correlate the #### `tctl` +Commands to list all BotInstances and the BotInstances for a specific Bot should +be added to the `tctl bots` family: + `tctl bots instances list` `tctl bots instances list --bot ` + +The `tctl tokens add` command should be extended to allow a new token to be +associated with an existing Bot now that multiple Bot instances can be run +against a single Bot: + `tctl tokens add --type=bot --bot ` Additionally, `tctl rm`/`tctl get` should be able to operate on BotInstance. From 399eb4aa38ea49ab9033a2fd25e2a16d1bfcb4ea Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Fri, 24 May 2024 18:29:29 -0600 Subject: [PATCH 12/17] Update RFD with revised persistent identity and other changes This adds a new section on the security requirements for the persistent bot identity, alongside an updated preference for UUID identifiers. This also includes revisions to the "changes to the `token` join method" to include join count limits by default and set sane defaults to encourage short lived tokens. Lastly, various other small changes were applied, like removing the implementation details for bot instance names in favor of an explicitly specified bot name and instance identifier. 
--- ...chine-id-token-join-method-bot-instance.md | 192 +++++++++++++----- 1 file changed, 141 insertions(+), 51 deletions(-) diff --git a/rfd/0162-machine-id-token-join-method-bot-instance.md b/rfd/0162-machine-id-token-join-method-bot-instance.md index fe505dc634b5..2a60efa2f1da 100644 --- a/rfd/0162-machine-id-token-join-method-bot-instance.md +++ b/rfd/0162-machine-id-token-join-method-bot-instance.md @@ -7,9 +7,9 @@ state: draft ## Required approvers -* Engineering: @zmb3 -* Product: @klizhentas || @xinding33 -* Security: @reedloden || @jentfoo +- Engineering: @zmb3 +- Product: @klizhentas || @xinding33 +- Security: @reedloden || @jentfoo ## What @@ -37,7 +37,7 @@ The improvements will focus on three points: Whilst deploying a large fleet of Bots is fairly trivial when using the delegated join methods, the experience when managing a fleet of bot hosts -in-prem is more challenging. +on-prem is more challenging. The following burdens currently exist: @@ -66,10 +66,10 @@ each renewal, this counter is incremented. When the counter within the certifica de-synchronises with the counter on the user, the Bot is locked out as a security measure. -The fact that this counter is stored within a label on the Bot user creates a +The fact that this counter is stored within a label on the Bot user creates a one-to-one binding between a single instance of `tbot` and a single Bot user. This is not the case when using the delegated join methods. - + ### Persistent Bot Instance Identity Today, there is no persistent identifier for an individual instance of a Bot. @@ -83,7 +83,7 @@ This poses a few challenges: - For the purposes of auditing, it is not possible to trace actions to a specific instance of a Bot. - For improving the `token` join method to support multiple Bot instances - associated with a single Bot, there is no identifier to correlate with the + associated with a single Bot, there is no identifier to correlate with the generation counter. 
- For analytics purposes, it's difficult for us to track the number of individual Bot instances in use. We cannot easily determine if it's a single @@ -92,7 +92,85 @@ This poses a few challenges: To rectify this, a unique identifier should be established for an instance of a Bot. -#### Public Key Fingerprint +#### Bot Identity Trust + +We should be mindful that adding a new persistent identifier may increase our +attack surface, particularly if we allow clients to manipulate their persistent +identifier or in any way trust the values communicated to Auth during the join +process. + +For example, trusting this identifier might allow a bot to better masquerade +as a preexisting instance and avoid discovery by an end user who falsely +assumed no unexpected instances had joined their cluster. Additionally, if we +implement join limits (i.e. tokens that only allow N bots to join), malicious +bots could reuse existing identifiers to bypass join limits. + +To mitigate this, we should make certain to cryptographically verify identifiers +during the renewal process. For example, we can embed the identifier as a +certificate field to ensure it cannot be tampered with once issued by the Auth +service, or encrypt the renewed certificates using the previous iteration's +public key to ensure the calling bot owns the private key. Adopting proper mTLS +during the join process should accomplish both of these goals. + +##### Verifying Bot Identities + +We currently see two methods for cryptographically verifying bot identities at +renewal time: + +1. We could expose the existing functionality of the HTTPS-only + `RegisterUsingToken` over gRPC. The existing gRPC `JoinService` can be + accessed with and without authentication, so we could inspect the client + connection to find the existing bot identity, if any, and it would be + implicitly verified. +2.
We could adapt the existing HTTPS implementation of `RegisterUsingToken` to + additionally accept an encoded existing certificate, and return certificates + encrypted with the certificate's public key. We can verify the certificate + was originally signed with our CA, and the client will only be able to + decrypt the returned identity if they actually have the private key for the + previous identity. + +Our preference is option (1): bots always join over gRPC, and provide a client +certificate when re-joining. Bots without a client cert to present are +registered as new instances, while bots with a valid client cert preserve their +identity. + +Note that bots joined with the `token` method are not affected by this, as +renewals take place over a fully mTLS-authenticated tunnel to the auth service. + +Additionally, a downside to certificate verification is that the certificate +validity period becomes a factor. If bots are only run intermittently, like +from a CI workflow, their certificates could expire and prevent them from +being identified as the same instance. This is likely to only impact a small +number of cases, however, as most CI provider joins are stateless and have no +certificates to present anyway. Bots that present expired certificates will +either be rejected and will need to join as a new instance, or we'll need to +discard the expired identity treat them as a new instance. + +#### UUID Certificate Attribute + +On the initial join of a Bot instance, we could generate a UUID to identify that +Bot instance and encode this within the certificate. Upon renewals, the UUID +would be copied from the current certificate and into the new one. + +This method additionally gives us freedom to change various join parameters +while preserving the lineage of a bot identity. Bots could change their keypair +or join method and still be properly associated with their previous iteration. 
+ +Whilst this is fairly easy to implement for the `token` join method, one +challenge for the delegated join methods is that rather than renewing, the +`tbot` instance merely re-joins. As the join RPCs are unauthenticated, the +previous certificate of the Bot instance is not readily available. We can +either: + +- Accept this limitation and treat each renewal of a delegated Bot instance as a + new Bot instance. This is likely unacceptable and would limit any advantages + of this work to the `token` join method. +- Add support for calling the join RPCs with a client certificate. + +Given our desire to ensure this identifier is trustworthy, we should prefer to +support the latter case and verify client certificates at re-joining time. + +#### Alternative: Public Key Fingerprint One option is to modify the behaviour of `tbot` to persist and reuse the keypair across renewals. We could then use a fingerprint of the public key as a @@ -111,26 +189,16 @@ It should be noted that with this technique, rotating the keypair of a `tbot` instance would reset the identity of that instance. Rotation of this keypair would be unusual and this side effect seems expected. -##### Alternative: UUID Certificate Attribute - -On the initial join of a Bot instance, we could generate a UUID to identify that -Bot instance and encode this within the certificate. Upon renewals, the UUID -would be copied from the current certificate and into the new one. +This technique does have some downsides: -Whilst this is fairly easy to implement for the `token` join method, one -challenge for the delegated join methods is that rather than renewing, the -`tbot` instance merely re-joins. As the join RPCs are unauthenticated, the -previous certificate of the Bot instance is not readily available. We can -either: - -- Accept this limitation and treat each renewal of a delegated Bot instance as a - new Bot instance. 
This is likely unacceptable and would limit any advantages - of this work to the `token` join method. -- Add support for calling the join RPCs with a client certificate. - -This technique could reuse a recently proposed LoginID attribute. This would -allow features such as security reports and automated anomaly detection to work -seamlessly across humans and machines. +- It makes it impossible to rotate a bot's private key. However, we do not + currently support this today, and purging a bot's data directory to do so + would simply result in a new `BotInstance`, which is likely an acceptable + workaround. +- Our join process today is unable to cryptographically verify the public key + presented by a joining bot to ensure that particular keypair has been issued + an identity already. Clients can provide any public key they like, including + that of an existing bot. ### BotInstance Resource @@ -236,31 +304,32 @@ message BotInstanceStatusAuthentication { // BotInstanceStatus holds the status of a BotInstance. message BotInstanceStatus { + // The unique identifier for this bot. + string id = 1; // The public key of the Bot instance. // When authenticating a Bot instance, the full public key must be compared // rather than just the fingerprint to mitigate pre-image attacks. - bytes public_key = 1; + bytes public_key = 2; // The fingerprint of the public key of the Bot instance. - string fingerprint = 2; + string fingerprint = 3; // The name of the Bot that this instance is associated with. - string bot_name = 3; - // Last X records kept, with the second oldest being removed once the limit - // is reached. This avoids the indefinite growth of the resource but also - // ensures the initial record is retained. - repeated BotInstanceStatusAuthentication authentications = 4; - // Last X records kept, with the second oldest being removed once the limit - // is reached. This avoids the indefinite growth of the resource but also - // ensures the initial record is retained. 
- repeated BotInstanceStatusHeartbeat heartbeats = 5; + string bot_name = 4; + // The initial authentication status for this bot instance. + BotInstanceStatusAuthentication initial_authentication = 5; + // The N most recent authentication status records for this bot instance. + repeated BotInstanceStatusAuthentication authentications = 6; + // The initial heartbeat status for this bot instance. + BotInstanceStatusHeartbeat initial_heartbeat = 7; + // The N most recent heartbeats for this bot instance. + repeated BotInstanceStatusHeartbeat latest_heartbeats = 8; } ``` -The name used for a BotInstance will be a concatenation of the Bot name and the -SHA256 fingerprint of the instance's public key -e.g `my-robot/2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae`. +The name used for a BotInstance will be a concatenation of the Bot name and its +unique identifier (UUID). When storing the BotInstance in the backend, the key will be: -`bot_instances/{bot_name}/{fingerprint}`. This will allow for efficient listing +`bot_instances/{bot_name}/{uuid}`. This will allow for efficient listing of BotInstances for a given Bot. Like agent heartbeats, the BotInstance will expire after a period of inactivity. @@ -361,8 +430,10 @@ service BotInstanceService { // Request for GetBotInstance. message GetBotInstanceRequest { - // The name of the BotInstance to retrieve. + // The name of the bot associated with the instance. string name = 1; + // The unique identifier of the bot instance to retrieve. + string id = 2; } // Request for ListBotInstances. @@ -394,6 +465,8 @@ message ListBotInstancesResponse { message DeleteBotInstanceRequest { // The name of the BotInstance to delete. string name = 1; + // The unique identifier of the bot instance to delete. + string id = 2; } ``` @@ -403,12 +476,29 @@ As we now have a way to track the generation for a specific Bot instance, we can allow multiple Bot instances to be associated with a single Bot. 
This also means that the token no longer needs to be consumed on a join. -Eventually, we may wish to add a way to specify a number of joins which can -occur with a token. This provides a way to easily control the lifetime of a -token when deploying to a fleet of a pre-known size. - -The renewal logic will need to be adjusted to read and update the generation -counter from the BotInstance rather than the Bot user. +However, this does introduce a change in our security guarantees, and without +additional tooling support and sensible defaults, the change may incentivize end +users to create long-lived join tokens instead of using a more appropriate join +method, or automating issuance of short lived tokens. + +To this end, we should introduce a per-bot-instance join count limit, and +configure that to be 1 join by default. This matches today's behavior, and +will help ensure users do not accidentally create a token that provides more +access than expected: "infinite use" tokens with massive join limits and/or very +long TTLs will need to be explicitly specified. + +We may additionally want to put hurdles in the way of things like extremely long +token TTLs. There are legitimate low-security use cases for these, but we could +introduce a soft limit in `tctl` preventing automatic token creation with TTL +longer than 7 days, forcing users to manually create the token if they really +need it to last longer. + +Additionally, renewal logic will need to be adjusted to read and update the +generation counter from the BotInstance rather than the Bot user. We should also +take care to ensure this counter behaves sensibly even when many bots are +attempting to join the cluster concurrently. The generation counter today +already has issues with concurrent joins, and it's even more important to get +this right when we can expect contention over a single bot instance resource. ### CLI Changes @@ -536,7 +626,7 @@ increase the complexity of the codebase and the user experience. 
## Out of Scope -These tasks are out of scope of this RFD but could be considered natural +These tasks are out of scope of this RFD but could be considered natural follow-on tasks. ### Multi-phase Commit of Generation Counter @@ -562,4 +652,4 @@ It also increases the significance of the fragility of the generation counter. ### Bot Command and Control The BotInstance resource could be extended to allow `tbot` to be controlled -remotely. \ No newline at end of file +remotely. From b4d7a4f2e0791f30d6b6a8b939792dc1afe472c2 Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Thu, 6 Jun 2024 18:49:06 -0600 Subject: [PATCH 13/17] Apply code review feedback --- ...chine-id-token-join-method-bot-instance.md | 44 +++++++++---------- 1 file changed, 21 insertions(+), 23 deletions(-) diff --git a/rfd/0162-machine-id-token-join-method-bot-instance.md b/rfd/0162-machine-id-token-join-method-bot-instance.md index 2a60efa2f1da..80c8d5e05405 100644 --- a/rfd/0162-machine-id-token-join-method-bot-instance.md +++ b/rfd/0162-machine-id-token-join-method-bot-instance.md @@ -143,8 +143,7 @@ from a CI workflow, their certificates could expire and prevent them from being identified as the same instance. This is likely to only impact a small number of cases, however, as most CI provider joins are stateless and have no certificates to present anyway. Bots that present expired certificates will -either be rejected and will need to join as a new instance, or we'll need to -discard the expired identity treat them as a new instance. +be rejected and will need to join as a new instance. #### UUID Certificate Attribute @@ -300,28 +299,28 @@ message BotInstanceStatusAuthentication { // the counter in the certificate does not match the counter of the last // authentication. int32 generation = 5; + // The public key of the Bot instance. + // When authenticating a Bot instance, the full public key must be compared + // rather than just the fingerprint to mitigate pre-image attacks.
+ bytes public_key = 6; + // The fingerprint of the public key of the Bot instance. + string fingerprint = 7; } // BotInstanceStatus holds the status of a BotInstance. message BotInstanceStatus { // The unique identifier for this bot. string id = 1; - // The public key of the Bot instance. - // When authenticating a Bot instance, the full public key must be compared - // rather than just the fingerprint to mitigate pre-image attacks. - bytes public_key = 2; - // The fingerprint of the public key of the Bot instance. - string fingerprint = 3; // The name of the Bot that this instance is associated with. - string bot_name = 4; + string bot_name = 2; // The initial authentication status for this bot instance. - BotInstanceStatusAuthentication initial_authentication = 5; + BotInstanceStatusAuthentication initial_authentication = 3; // The N most recent authentication status records for this bot instance. - repeated BotInstanceStatusAuthentication authentications = 6; + repeated BotInstanceStatusAuthentication authentications = 4; // The initial heartbeat status for this bot instance. - BotInstanceStatusHeartbeat initial_heartbeat = 7; + BotInstanceStatusHeartbeat initial_heartbeat = 5; // The N most recent heartbeats for this bot instance. - repeated BotInstanceStatusHeartbeat latest_heartbeats = 8; + repeated BotInstanceStatusHeartbeat latest_heartbeats = 6; } ``` @@ -339,10 +338,12 @@ This avoids the accumulation of ephemeral BotInstances. Upon each join and renewal, the BotInstance record will be updated with an additional entry in the `status.authentications` field. If there is X entries, -then the second-oldest entry will be removed. This prevents growth without -bounds but also ensures that the original record is retained. +then the oldest entry will be removed. This prevents growth without bounds but +also ensures that the original record is retained. -In addition, the TTL of the BotInstance resource will be extended. 
+In addition, the TTL of the BotInstance resource will be extended to cover the +validity period of the issued certificate, plus a short additional time to allow +for some imprecision. If a BotInstance does not exist, then one will be created. In the case that this occurs for a bot using the `token` join method and this is a renewal, @@ -378,11 +379,8 @@ message SubmitHeartbeatResponse { ``` The endpoint will have a special auth/authz check. RBAC will not be used and -instead the endpoint will check: - -- The presented client certificate is for the Bot linked to the instance. -- The presented client certificate's public key matches the public key recorded - for the BotInstance. +instead the endpoint will ensure that the presented client certificate is for +the Bot linked to the instance. This endpoint will be called by `tbot` immediately after it has initially authenticated. After a heartbeat has successfully completed, another should be @@ -431,7 +429,7 @@ service BotInstanceService { // Request for GetBotInstance. message GetBotInstanceRequest { // The name of the bot associated with the instance. - string name = 1; + string bot_name = 1; // The unique identifier of the bot instance to retrieve. string id = 2; } @@ -464,7 +462,7 @@ message ListBotInstancesResponse { // Request for DeleteBotInstance. message DeleteBotInstanceRequest { // The name of the BotInstance to delete. - string name = 1; + string bot_name = 1; // The unique identifier of the bot instance to delete. 
string id = 2; } From 61e17365cac7aaec059de86d470a57a6b612eebd Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Mon, 10 Jun 2024 20:24:50 -0600 Subject: [PATCH 14/17] Use consistent field names for authentications and heartbeats --- rfd/0162-machine-id-token-join-method-bot-instance.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfd/0162-machine-id-token-join-method-bot-instance.md b/rfd/0162-machine-id-token-join-method-bot-instance.md index 80c8d5e05405..a2aa9984590b 100644 --- a/rfd/0162-machine-id-token-join-method-bot-instance.md +++ b/rfd/0162-machine-id-token-join-method-bot-instance.md @@ -316,7 +316,7 @@ message BotInstanceStatus { // The initial authentication status for this bot instance. BotInstanceStatusAuthentication initial_authentication = 3; // The N most recent authentication status records for this bot instance. - repeated BotInstanceStatusAuthentication authentications = 4; + repeated BotInstanceStatusAuthentication latest_authentications = 4; // The initial heartbeat status for this bot instance. BotInstanceStatusHeartbeat initial_heartbeat = 5; // The N most recent heartbeats for this bot instance. From f4ffebf8c82e6c74580f3910c27555f77449a6b0 Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Tue, 18 Jun 2024 18:54:46 -0600 Subject: [PATCH 15/17] Apply suggestions from code review Co-authored-by: Zac Bergquist --- rfd/0162-machine-id-token-join-method-bot-instance.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/rfd/0162-machine-id-token-join-method-bot-instance.md b/rfd/0162-machine-id-token-join-method-bot-instance.md index a2aa9984590b..8249b33d90e4 100644 --- a/rfd/0162-machine-id-token-join-method-bot-instance.md +++ b/rfd/0162-machine-id-token-join-method-bot-instance.md @@ -17,7 +17,7 @@ Terminology: - Bot: An identity within Teleport intended for use by machines as opposed to humans. Many individual machines may act as this shared identity. 
-- `tbot`: The Teleport binary that acts as aBot and generates credentials for +- `tbot`: The Teleport binary that acts as a Bot and generates credentials for consumption by client applications. - Bot instance: A single instance of `tbot` running on a host. @@ -41,13 +41,13 @@ on-prem is more challenging. The following burdens currently exist: -- When using the `token` join method, a Bot must be created for each host. This +- When using the `token` join method, a Bot must be created for each Bot instance. This means that the privileges of many distinct Bots need to be synchronised where those hosts are performing the same function. - When using the `token` join method, a token can only be used once. This means creating hundreds of join tokens and managing securely distributing these to hosts. -- When managing a large fleet of `tbot` deployments, there is no way to +- When managing a large fleet of Bot instances, there is no way to track these within Teleport. This makes it more difficult to identify hosts which may need updating. @@ -506,7 +506,7 @@ this right when we can expect contention over a single bot instance resource. simply clear out any artifacts within the `tbot` storage directory. In addition, if the bot detects a change in join token or join method, it should -automatically rotate it's keypair. This will ensure it presents as a fresh +automatically rotate its keypair. This will ensure it presents as a fresh BotInstance to the AuthServer. 
A log message should be output that identifies the linked BotInstance at From 97b4d3a03257d3f630643d7d7d6a06308300e002 Mon Sep 17 00:00:00 2001 From: Tim Buckley Date: Tue, 18 Jun 2024 19:24:27 -0600 Subject: [PATCH 16/17] Address review feedback: mention TPM joining, alternatives, audit --- ...2-machine-id-token-join-method-bot-instance.md | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/rfd/0162-machine-id-token-join-method-bot-instance.md b/rfd/0162-machine-id-token-join-method-bot-instance.md index 8249b33d90e4..38962035241a 100644 --- a/rfd/0162-machine-id-token-join-method-bot-instance.md +++ b/rfd/0162-machine-id-token-join-method-bot-instance.md @@ -36,8 +36,8 @@ The improvements will focus on three points: ## Why Whilst deploying a large fleet of Bots is fairly trivial when using the -delegated join methods, the experience when managing a fleet of bot hosts -on-prem is more challenging. +delegated and TPM join methods, the experience when managing a fleet of bot +hosts on-prem is more challenging. The following burdens currently exist: @@ -199,6 +199,11 @@ This technique does have some downsides: an identity already. Clients can provide any public key they like, including that of an existing bot. +Given these downsides, we'll prefer to implement UUID instance identifiers. +Most of the technical challenge lies in adapting the join process to accept +client certificate authentication for re-joins, at which point adding a new +certificate field is trivial. + ### BotInstance Resource The @@ -407,6 +412,8 @@ Cons: - Adds Bot specific behaviour to RPCs that are also used for Node joining. - Heartbeats are limited to the interval of renewal. +Given these cons, we'll opt to introduce the new heartbeat RPC. + #### API Additional RPCs will be added to the BotInstance service to allow these to @@ -578,6 +585,10 @@ An audit event should be added for the deletion of a BotInstance.
The name of the BotInstance should be added to the existing join, renewal and certificate generation audit events. +Additionally, we should ensure bot instance identifiers are present in existing +audit events to ensure actions taken by bots can be traced back to specific +instances. + ### Resistance to collision/pre-image attacks We should be cautious that a BotInstance cannot be impersonated using a From 5fa1817c9f2bfb32c68883785f97ce60be04bf57 Mon Sep 17 00:00:00 2001 From: Noah Stride Date: Tue, 1 Oct 2024 10:16:04 +0100 Subject: [PATCH 17/17] Update rfd/0162-machine-id-token-join-method-bot-instance.md --- rfd/0162-machine-id-token-join-method-bot-instance.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/rfd/0162-machine-id-token-join-method-bot-instance.md b/rfd/0162-machine-id-token-join-method-bot-instance.md index 38962035241a..54969a89e43e 100644 --- a/rfd/0162-machine-id-token-join-method-bot-instance.md +++ b/rfd/0162-machine-id-token-join-method-bot-instance.md @@ -1,6 +1,6 @@ --- authors: Noah Stride (noah.stride@goteleport.com) -state: draft +state: Implemented (16.2.0) --- # RFD 00162 - Improving the on-prem fleet Bot management experience