Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Machine ID: Bot Instance heartbeating #43463

Merged
merged 19 commits into from
Jul 8, 2024

Conversation

strideynet
Copy link
Contributor

@strideynet strideynet commented Jun 25, 2024

Related to #41456

Introduces an implementation of the SubmitHeartbeat RPC and the code necessary within tbot to submit a heartbeat.

No changelog since we'll do one big changelog entry when backporting this entire feature.

kind: bot_instance
metadata:
  expires:
    nanos: 905422000
    seconds: 1720098731
  name: 9abc2c84-7fb3-46bb-aa11-cce389b727ab
  revision: 0298c6a0-7655-4cd6-ab17-5887fb990674
spec:
  bot_name: testbot8383
  instance_id: 9abc2c84-7fb3-46bb-aa11-cce389b727ab
status:
  initial_authentication:
    authenticated_at:
      nanos: 916478000
      seconds: 1720084008
    generation: 1
    join_method: token
    join_token: '************************cfa52fba'
    public_key: LS0tLS1CRUdJTiBSU0EgUFVCTElDIEtFWS0tLS0tCk1JSUJJakFOQmdrcWhraUc5dzBCQVFFRkFBT0NBUThBTUlJQkNnS0NBUUVBMG42RXpNakxkNlYvb2M0MnZtUlEKd2RNMi9TYnl2VzJNV3B6Z1lvMEE0REVhKzVoM3pwcS9WcGtFWXNFQ2VqMmRlMTVyRU9sdHNPR3lXSXVjYlgvUwp5ZFR2TmRwTUdMZy8zOWEwYXlvaktndTUyb1BDS1ZpTlVITHVCaVBVdEoxZHFIcHFmRVFhemZnbTg1eXB3bU5jCk1VWEwzc3FUTFkrS2Z0c2xHcjN6QjBWOUdQU0tsOHNJckZ2cjF6SmdqbDg1QklKRFh3OWZvRm0ybXk2UGxxbGcKRWFoNzZ5YzBKbTlxRWlhMjRRV3hhb1ZLb001TXNkMUUyaXMxM0kzc2RYVnNZNUcvTXlaZVcwOWVBeU5PTEY2dwpVMzlZQlU1dmhrWUpaS1A4T0h4QUhxN1BlM3I1WUFBaWdrdFRMcEhqblhzMUxJdG45dndXVU1Yb3FPNTlZVUZXCmp3SURBUUFCCi0tLS0tRU5EIFJTQSBQVUJMSUMgS0VZLS0tLS0K
  initial_heartbeat:
    architecture: arm64
    hostname: Noahs-MacBook-Pro.local
    is_startup: true
    join_method: token
    one_shot: true
    os: darwin
    recorded_at:
      nanos: 4455000
      seconds: 1720084009
    uptime:
      nanos: 235452750
    version: 17.0.0-dev
  latest_authentications:
  - authenticated_at:
      nanos: 916478000
      seconds: 1720084008
    generation: 1
    join_method: token
    join_token: '************************cfa52fba'
    public_key: LS0tLS1CRUdJTiBSU0EgUFVCTElDIEtFWS0tLS0tCk1JSUJJakFOQmdrcWhraUc5dzBCQVFFRkFBT0NBUThBTUlJQkNnS0NBUUVBMG42RXpNakxkNlYvb2M0MnZtUlEKd2RNMi9TYnl2VzJNV3B6Z1lvMEE0REVhKzVoM3pwcS9WcGtFWXNFQ2VqMmRlMTVyRU9sdHNPR3lXSXVjYlgvUwp5ZFR2TmRwTUdMZy8zOWEwYXlvaktndTUyb1BDS1ZpTlVITHVCaVBVdEoxZHFIcHFmRVFhemZnbTg1eXB3bU5jCk1VWEwzc3FUTFkrS2Z0c2xHcjN6QjBWOUdQU0tsOHNJckZ2cjF6SmdqbDg1QklKRFh3OWZvRm0ybXk2UGxxbGcKRWFoNzZ5YzBKbTlxRWlhMjRRV3hhb1ZLb001TXNkMUUyaXMxM0kzc2RYVnNZNUcvTXlaZVcwOWVBeU5PTEY2dwpVMzlZQlU1dmhrWUpaS1A4T0h4QUhxN1BlM3I1WUFBaWdrdFRMcEhqblhzMUxJdG45dndXVU1Yb3FPNTlZVUZXCmp3SURBUUFCCi0tLS0tRU5EIFJTQSBQVUJMSUMgS0VZLS0tLS0K
  - authenticated_at:
      nanos: 905980000
      seconds: 1720084031
    generation: 2
    join_method: token
    public_key: c3NoLXJzYSBBQUFBQjNOemFDMXljMkVBQUFBREFRQUJBQUFCQVFEU2ZvVE15TXQzcFgraHpqYStaRkRCMHpiOUp2SzliWXhhbk9CaWpRRGdNUnI3bUhmT21yOVdtUVJpd1FKNlBaMTdYbXNRNlcydzRiSllpNXh0ZjlMSjFPODEya3dZdUQvZjFyUnJLaU1xQzduYWc4SXBXSTFRY3U0R0k5UzBuVjJvZW1wOFJCck4rQ2J6bktuQ1kxd3hSY3ZleXBNdGo0cCsyeVVhdmZNSFJYMFk5SXFYeXdpc1crdlhNbUNPWHprRWdrTmZEMStnV2JhYkxvK1dxV0FScUh2ckp6UW1iMm9TSnJiaEJiRnFoVXFnemt5eDNVVGFLelhjamV4MWRXeGprYjh6Smw1YlQxNERJMDRzWHJCVGYxZ0ZUbStHUmdsa28vdzRmRUFlcnM5N2V2bGdBQ0tDUzFNdWtlT2RlelVzaTJmMi9CWlF4ZWlvN24xaFFWYVAK
  latest_heartbeats:
  - architecture: arm64
    hostname: Noahs-MacBook-Pro.local
    is_startup: true
    join_method: token
    one_shot: true
    os: darwin
    recorded_at:
      nanos: 4455000
      seconds: 1720084009
    uptime:
      nanos: 235452750
    version: 17.0.0-dev
  - architecture: arm64
    hostname: Noahs-MacBook-Pro.local
    is_startup: true
    join_method: token
    one_shot: true
    os: darwin
    recorded_at:
      nanos: 957177000
      seconds: 1720084031
    uptime:
      nanos: 143367459
    version: 17.0.0-dev
version: v1

timothyb89 and others added 12 commits June 12, 2024 19:06
This adds an early backend for the bot instance service, with resource
definitions and basic CRUD operations.
Co-authored-by: Noah Stride <noah.stride@goteleport.com>
Among other things, this adds event definitions, RBAC and presets,
and client definitions. This also includes a few proto changes,
namely moving bot name and instance ID from .Status to .Spec so
metadata can be uniformly generated.

This also adds tests for the backend service showing functional CRUD.
@strideynet strideynet added the no-changelog Indicates that a PR does not require a changelog entry label Jun 25, 2024
lib/tlsca/ca.go Outdated Show resolved Hide resolved
@strideynet
Copy link
Contributor Author

DNM until other PRs this relies on are merged and this is rebased.

Base automatically changed from timothyb89/bot-instance-service to master June 25, 2024 22:10
@strideynet strideynet marked this pull request as ready for review June 26, 2024 08:39
@strideynet strideynet removed do-not-merge no-changelog Indicates that a PR does not require a changelog entry labels Jul 4, 2024
Copy link

github-actions bot commented Jul 4, 2024

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

@strideynet strideynet changed the title Add SubmitHeartbeat RPC implementation Machine ID: Bot Instance heartbeating Jul 4, 2024
@strideynet strideynet added the no-changelog Indicates that a PR does not require a changelog entry label Jul 4, 2024
@strideynet
Copy link
Contributor Author

Ready for review - I've merged the bot instance heartbeating in tbot into this PR so it's a more cohesive chunk

Copy link
Contributor

@timothyb89 timothyb89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working great on my test cluster!

@strideynet strideynet enabled auto-merge July 8, 2024 09:34
@strideynet strideynet added this pull request to the merge queue Jul 8, 2024
Merged via the queue into master with commit 1399056 Jul 8, 2024
38 checks passed
@strideynet strideynet deleted the strideynet/bot-instance-service-heartbeating branch July 8, 2024 10:11
@strideynet strideynet mentioned this pull request Jul 9, 2024
10 tasks
timothyb89 added a commit that referenced this pull request Aug 6, 2024
* Add basic CRUD for BotInstanceService

This adds an early backend for the bot instance service, with resource
definitions and basic CRUD operations.

* Update api/proto/teleport/machineid/v1/bot_instance.proto

Co-authored-by: Noah Stride <noah.stride@goteleport.com>

* Further work on BotInstance boilerplate

Among other things, this adds event definitions, RBAC and presets,
and client definitions. This also includes a few proto changes,
namely moving bot name and instance ID from .Status to .Spec so
metadata can be uniformly generated.

This also adds tests for the backend service showing functional CRUD.

* gRPC tests; fix filtering; fix import lints; misc cleanup

* Consistently use `instance_id` in gRPC fields

* Add tctl support for bot_instance resource

* protos update

* Various typo fixes

* Fix more comment typos

* More doc comments and minor test tweaks

* Add SubmitHeartbeat RPC implementation

* Fix log keys

* Update to use proper BotInstanceID field in identiy

* Add rough outline of heartbeat service

---------

Co-authored-by: Tim Buckley <tim@goteleport.com>
timothyb89 added a commit that referenced this pull request Aug 8, 2024
* Add basic CRUD for BotInstanceService

This adds an early backend for the bot instance service, with resource
definitions and basic CRUD operations.

* Update api/proto/teleport/machineid/v1/bot_instance.proto

Co-authored-by: Noah Stride <noah.stride@goteleport.com>

* Further work on BotInstance boilerplate

Among other things, this adds event definitions, RBAC and presets,
and client definitions. This also includes a few proto changes,
namely moving bot name and instance ID from .Status to .Spec so
metadata can be uniformly generated.

This also adds tests for the backend service showing functional CRUD.

* gRPC tests; fix filtering; fix import lints; misc cleanup

* Consistently use `instance_id` in gRPC fields

* Add tctl support for bot_instance resource

* protos update

* Various typo fixes

* Fix more comment typos

* More doc comments and minor test tweaks

* Add SubmitHeartbeat RPC implementation

* Fix log keys

* Update to use proper BotInstanceID field in identiy

* Add rough outline of heartbeat service

---------

Co-authored-by: Tim Buckley <tim@goteleport.com>
github-merge-queue bot pushed a commit that referenced this pull request Aug 26, 2024
* Add basic CRUD for BotInstanceService (#42899)

* Add basic CRUD for BotInstanceService

This adds an early backend for the bot instance service, with resource
definitions and basic CRUD operations.

* Update api/proto/teleport/machineid/v1/bot_instance.proto

Co-authored-by: Noah Stride <noah.stride@goteleport.com>

* Further work on BotInstance boilerplate

Among other things, this adds event definitions, RBAC and presets,
and client definitions. This also includes a few proto changes,
namely moving bot name and instance ID from .Status to .Spec so
metadata can be uniformly generated.

This also adds tests for the backend service showing functional CRUD.

* gRPC tests; fix filtering; fix import lints; misc cleanup

* Consistently use `instance_id` in gRPC fields

* Add tctl support for bot_instance resource

* protos update

* Various typo fixes

* Fix more comment typos

* More doc comments and minor test tweaks

---------

Co-authored-by: Noah Stride <noah.stride@goteleport.com>

* Create Bot Instances during initial bot join (#43577)

* Create Bot Instances during initial bot join

This creates new instances for bots when they initially join the
cluster, and persists instance IDs in new certificate fields on join
and during renewal.

Note that this does not yet handle instance reuse for non-token join
methods.

Additionally, bot instance creation is locked behind a
`BOT_INSTANCE_EXPERIMENT` flag; it must be set to `1` to enable
creation.

* Proto cleanup, and update bot auth records on cert renewal

This makes various (admittedly breaking) protobuf changes, including
removing the TTL field (calculating resource expiry based on cert
requests), removing public key fingerprints, and changing the data
type of the generation counter to match the preexisting internal
datatype. These changes _should_ be safe as no consumers of the proto
API currently exist.

Additionally, this also updates bot authentications on renewal.

* Fix proto lints

* Fix misleading doc comment in the bot instance experiment

* Create bot instances for old bots on join; other fixes

This now creates bot instances for bots whose certs are missing the
BotInstanceID field. Additionally, it fixes two related bugs:
expiration dates are extended on renewal, the generated UUID
is properly appended to certs on initial join, and instances are
only created or updated when the experiment is enabled.

* Add a minimal test for bot instance creation on initial join

* Validate bot instance state in generation counter checks

* Remove outdated TODO comment and fix test lints

* Add an expiration change check to the generation test

* Machine ID: Provide existing identity when rejoining as part of renewals (#43732)

* Allow existing bot identity to be used for token based renewals

* Add test for register with auth client

* Fix tests

* Fix test

* Fix partial assignment to interface

* Improve loggging

* Fix spelling

* Fix missing import

* Fix built gRPC

* Machine ID: Bot Instance heartbeating (#43463)

* Add basic CRUD for BotInstanceService

This adds an early backend for the bot instance service, with resource
definitions and basic CRUD operations.

* Update api/proto/teleport/machineid/v1/bot_instance.proto

Co-authored-by: Noah Stride <noah.stride@goteleport.com>

* Further work on BotInstance boilerplate

Among other things, this adds event definitions, RBAC and presets,
and client definitions. This also includes a few proto changes,
namely moving bot name and instance ID from .Status to .Spec so
metadata can be uniformly generated.

This also adds tests for the backend service showing functional CRUD.

* gRPC tests; fix filtering; fix import lints; misc cleanup

* Consistently use `instance_id` in gRPC fields

* Add tctl support for bot_instance resource

* protos update

* Various typo fixes

* Fix more comment typos

* More doc comments and minor test tweaks

* Add SubmitHeartbeat RPC implementation

* Fix log keys

* Update to use proper BotInstanceID field in identiy

* Add rough outline of heartbeat service

---------

Co-authored-by: Tim Buckley <tim@goteleport.com>

* Allow bots to provide client cert authentication when rejoining (#44257)

* Allow bots to provide client cert authentication when rejoining

Traditionally, bots using non-`token` join methods "renew" their
certs by rejoining, taking advantage of their repeatable join methods
to avoid relying on `token` rejoin logic. However, as part of
[RFD 162], we need a way for rejoining bots to assert an existing
identity (if any) even when rejoining "from scratch".

When paired with #43732, this change allows bots to use an
authenticated auth client for all "initial" joins. If present, an
existing `BotInstanceID` is extracted from the client identity and
used to associate the bot with its previous identity.

As certain join methods (TPM, Azure) are initially handled by the
`JoinService` using gRPC, this additionally adds a new
`BotInstanceID` field to `RegisterUsingTokenRequest` so the
`JoinService` can verify the client identity and extract the
`BotInstanceID` if available. Like `RemoteAddr`, this value is only
trusted when set by the proxy and should be ignored in all other
cases.

[RFD 162]: #36510

* Add tests for authenticated bot rejoining

* Create a new bot instance if the requested instance is not found

Also, includes a new check in for this behavior in
`TestRegisterBotWithInvalidInstanceID`.

* Fix imports

* Update lib/auth/auth_with_roles.go

Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>

* Fix `cmp` import

---------

Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>

* Machine ID: Include Bot Instance ID in audit logs (#43786)

* Create Bot Instances during initial bot join

This creates new instances for bots when they initially join the
cluster, and persists instance IDs in new certificate fields on join
and during renewal.

Note that this does not yet handle instance reuse for non-token join
methods.

Additionally, bot instance creation is locked behind a
`BOT_INSTANCE_EXPERIMENT` flag; it must be set to `1` to enable
creation.

* Proto cleanup, and update bot auth records on cert renewal

This makes various (admittedly breaking) protobuf changes, including
removing the TTL field (calculating resource expiry based on cert
requests), removing public key fingerprints, and changing the data
type of the generation counter to match the preexisting internal
datatype. These changes _should_ be safe as no consumers of the proto
API currently exist.

Additionally, this also updates bot authentications on renewal.

* Fix proto lints

* Fix misleading doc comment in the bot instance experiment

* Create bot instances for old bots on join; other fixes

This now creates bot instances for bots whose certs are missing the
BotInstanceID field. Additionally, it fixes two related bugs:
expiration dates are extended on renewal, the generated UUID
is properly appended to certs on initial join, and instances are
only created or updated when the experiment is enabled.

* Add a minimal test for bot instance creation on initial join

* Validate bot instance state in generation counter checks

* Remove outdated TODO comment and fix test lints

* Add an expiration change check to the generation test

* Add BotInstanceID to audit events

* Fix borked conflict resolution

* Fix further borked conflict resolution

* Add test for cert create/join

---------

Co-authored-by: Tim Buckley <tim@goteleport.com>

* Show bot instance ID in tbot log output (#44578)

This tweaks the "fetched new bot identity" message to show the bot
name and instance ID as embedded in the bot's certificate.

Example:

```
2024-07-23T15:51:20-06:00 INFO [TBOT:IDEN] Fetched new bot identity identity:tpm-test, id=5a2865d3-d3dc-4eaa-853c-5377a1fe83f6 | valid: after=2024-07-23T21:50:20Z, before=2024-07-23T21:56:19Z, duration=5m59s | kind=tls, renewable=false, disallow-reissue=false, roles=[bot-tpm-test], principals=[-teleport-internal-join], generation=0 tbot/service_bot_identity.go:223
```

* Machine ID: Validate generation counter using bot instances (#44583)

* Show bot instance ID in tbot log output

This tweaks the "fetched new bot identity" message to show the bot
name and instance ID as embedded in the bot's certificate.

Example:

```
2024-07-23T15:51:20-06:00 INFO [TBOT:IDEN] Fetched new bot identity identity:tpm-test, id=5a2865d3-d3dc-4eaa-853c-5377a1fe83f6 | valid: after=2024-07-23T21:50:20Z, before=2024-07-23T21:56:19Z, duration=5m59s | kind=tls, renewable=false, disallow-reissue=false, roles=[bot-tpm-test], principals=[-teleport-internal-join], generation=0 tbot/service_bot_identity.go:223
```

* Machine ID: Validate generation counter using bot instances

This change moves the generation counter from bot user labels to a
field in `BotInstanceStatusAuthentication`. This allows for tracking
of individual generations regardless of join method, allows for
multiple `token`-type joins per bot, and opens the door for multi-use
`token`-type tokens in the future.

For now, the new behavior remains behind the `BOT_INSTANCE_EXPERIMENT`
feature flag, and generation counter handling is largely unchanged
when it is unset.

* Update legacy generation counter for downgrade compatibility

This updates the legacy user label generation counter to support some
downgrade compatibility if a user either disables the bot instance
experiment or downgrades to a Teleport version predating it.

Also includes some logger cleanup, and adds a warning when a
generation counter mismatch is detected but not enforced.

* Set initial generation value for legacy mode

`validateGenerationLabel` expects a nonzero generation counter value
for renewable certs to populate the counter, so this provides it
again.

* Add new generation counter tests

This adds various new tests to validate new per-instance generation
counter behavior.

Also includes backcompat fixes to import old generation counters, and
a small fix to ensure downgrades don't leave bots in a bad state.

* Fix imports

* Tweak TODO message

* Fix HostCAs not being returned during bot renewal (#44832)

* Machine ID: Remove `BOT_INSTANCE_EXPERIMENT` feature flag (#44904)

This removes the `BOT_INSTANCE_EXPERIMENT` feature flag to enable the
feature universally. It also adjusts some comments to add deprecation
notices.

* Machine ID: Only set legacy generation counter for renewable joins (#44993)

We unconditionally committed the generation counter to the legacy
user label to allow for downgrade compatibility. This is not valid
for non-token join methods as previous Teleport versions will attempt
to check the generation counter whenever the label is nonzero, and
because these older versions don't support authenticated rejoining,
will never be able to compare to the client's actual generation
value. (Plus, it was hard-coded to zero.)

This disables the legacy generation counter for non-renewable join
methods. This should not create a later upgrade problem as bots will
lose their instance ID upon their first rejoin after downgrading, and
so will be issued a fresh ID and generation counter if and when the
cluster is upgraded again.

* Remove unused function sshPublicKeyToPKIXPEM

Left-over function from backport

* Always use PEM PKIX DER representation of public keys for BotInstances (#43845)

* Always use PEM PKIX DER representation of public keys for BotInstances

* Prefer using `keys.MarshalPublicKey`

* Remove mistakenly included TestVerifyALPNUpgradedConn

The recent merge included an unrelated unit test. This removes it.

---------

Co-authored-by: Noah Stride <noah.stride@goteleport.com>
Co-authored-by: Zac Bergquist <zac.bergquist@goteleport.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-changelog Indicates that a PR does not require a changelog entry size/md
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants