Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFD 0169: Automatic Updates for Agents #40190

Closed
wants to merge 84 commits into from

Conversation

sclevine
Copy link
Member

@sclevine sclevine commented Apr 3, 2024

This RFD proposes a new mechanism for Teleport agents to automatically update to a version set by an operator via tctl.

Readable

@github-actions github-actions bot added rfd Request for Discussion size/md labels Apr 3, 2024
Copy link

github-actions bot commented Apr 3, 2024

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

1 similar comment
Copy link

github-actions bot commented Apr 3, 2024

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

Copy link

github-actions bot commented Apr 3, 2024

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

Copy link

github-actions bot commented Apr 3, 2024

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

@michellescripts michellescripts self-requested a review April 4, 2024 18:15
rfd/0169-auto-updates-linux-agents.md Show resolved Hide resolved
rfd/0169-auto-updates-linux-agents.md Outdated Show resolved Hide resolved
rfd/0169-auto-updates-linux-agents.md Outdated Show resolved Hide resolved
Copy link

github-actions bot commented Apr 4, 2024

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

Copy link

github-actions bot commented Apr 4, 2024

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

rfd/0169-auto-updates-linux-agents.md Outdated Show resolved Hide resolved
rfd/0169-auto-updates-linux-agents.md Outdated Show resolved Hide resolved
rfd/0169-auto-updates-linux-agents.md Outdated Show resolved Hide resolved
rfd/0169-auto-updates-linux-agents.md Outdated Show resolved Hide resolved
rfd/0169-auto-updates-linux-agents.md Outdated Show resolved Hide resolved
rfd/0169-auto-updates-linux-agents.md Outdated Show resolved Hide resolved
rfd/0169-auto-updates-linux-agents.md Outdated Show resolved Hide resolved
rfd/0169-auto-updates-linux-agents.md Outdated Show resolved Hide resolved
active_version: 15.1.1
```

### Runtime
Copy link
Contributor

@russjones russjones Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you create another section called ### Installer?

One of the biggest problems I saw during during our push to get customers to switch to automatic updates was the number of different ways we have to install Teleport.

I'd like us to consolidate on a single way to install agents that 90% of our customers use. Ideally anything we build should work for self-hosted and Cloud similar to client tools updates.

The good news I think you have the answer.

$ apt-get install teleport-updater
$ teleport-updater enable --proxy proxy.example.com

This works for Cloud, but would also work for self-hosted customers. We just have to work through the edge cases.

Otherwise, what's left?

  • We need to make sure all of our install scripts are updated to use the new installation method above, for example the discover script and curl | bash on our downloads page. Please enumerate all the scripts/locations that will need to be updated.
  • We need to make sure the teleport-updater package can support OSS, Enterprise, and Enterprise FIPs installations.
  • We need to make sure the teleport-updater is available for all the different architectures we support (I think this is already the case).
  • We update our documentation to reference the new installation instructions. 1 2 3 4
  • Dashboard tenants downloads tab is updated to reference new instructions.

Let's make sure that's in-scope for this RFD.

Copy link
Contributor

@russjones russjones Apr 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to make sure the teleport-updater package can support OSS, Enterprise, and Enterprise FIPs installations.

I think we can put this information into the ping endpoint. Then you no longer have to worry about which binary you need to install (OSS, Enterprise, FIPS Enterprise), you just install the teleport-updater and it downloads and runs this right version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something we should consider: the apt-get install && teleport-updater enable approach works fine if you running the commands on the host (or running something like Ansible), what would we recommend to someone that builds an AMI using this new install method?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've covered everything above:

  • Added new "Installers" section with scripts changes and VM/container image instructions
  • Added new "Documentation" section with links and notes
  • New server_edition field is added to ping, along with associated changes to "Runtime" section logic

Let me know if this looks good.

Copy link
Contributor

@hugoShaka hugoShaka Apr 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make this new installer the default for everything (which I think is a great approach, we must reduce complexity) I would suggest going further by making the new binary the only recommended way of installing teleport, even without Automatic Updates.

We could name it teleport-version-manager or whatever name that doesn't imply it requires AUs and expose the following commands:

  • tvm update as already described in the RFD
  • tvm follow/tvm unfollow to follow updates from a proxy, this is the current RFD enable/disable.
  • tvm install <X.Y.Z|teleport.example.com> would be an additional command not enabling AUs, but installing a static version (given, or from a proxy).

I think this would answer

what would we recommend to someone that builds an AMI using this new install method

We would recommend "install tvm, and run tvm install teleport.example.com (for non-AU setups) or tvm follow teleport.example.com (for AU) on startup".

Copy link
Member Author

@sclevine sclevine Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would tvm use-version vX.Y.Z or tvm use vX.Y.Z be better?

tvm use sounds better to me, but I don't have a strong opinion.

Happy to rename. I'll wait for a few others to chime in, and change it when we have some consensus -- names can be contentious 🙂

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though I really want to refactor our installation process, looking at our capacity for Q2, I don't think we can.

So I am in-favor of keeping it simple and solving the issue for Cloud customers first then coming back in Q3 and fixing our installation problems as a whole.

Here is what I am thinking.

  • We keep the name teleport-updater to not introduce too many changes at once.
  • Everywhere we are currently installing our auto updater, we only update it to install the new one.
  • Scope out self-hosted installation.

What do you guys think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Works for me. Maybe there are two phases here:

  • Ship this RFD, which minimizes user-facing changes
  • Re-design UX, simplify scripts, etc.

Copy link
Contributor

@hugoShaka hugoShaka May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we postpone the simplification we must absolutely ensure we do it the next quarter.

The installation methods are already a mess and we're adding new methods, instructions and scripts. This is not maintainable and self-hosted customers are already complaining about how hard it is to just install teleport. This is not an issue we can solve with docs. We've tried for the last year and the installation docs only became worse and more obscure since I joined Teleport 2 years ago.

I'm against adding new ways without removing the old ones but I fear that we'll do it anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we scope out self-hosted installations?
Is there extra work for those scenarios? Except for docs around the autoupdate_version resource?

Adding more complexity to distinguish between cloud/self-hosted in the installer scripts seems we are shifting the complexity to something which is way hard to test/maintain.

rfd/0169-auto-updates-linux-agents.md Show resolved Hide resolved
Co-authored-by: Russell Jones <russjones@users.noreply.github.com>
Copy link

github-actions bot commented Apr 5, 2024

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

Co-authored-by: Russell Jones <russjones@users.noreply.github.com>
Copy link

github-actions bot commented Apr 5, 2024

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

Co-authored-by: Russell Jones <russjones@users.noreply.github.com>
Copy link

github-actions bot commented Apr 5, 2024

The PR changelog entry failed validation: Changelog entry not found in the PR body. Please add a "no-changelog" label to the PR, or changelog lines starting with changelog: followed by the changelog entries for the PR.

@russjones russjones added the no-changelog Indicates that a PR does not require a changelog entry label Apr 5, 2024
Copy link
Collaborator

@zmb3 zmb3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass. I think I need to read it a few more times to get up to speed, but sometimes feedback from someone seeing something for the first time is helpful, so here you go.

Teleport proxies will be updated to serve the desired agent version and edition from `/v1/webapi/find`.
The version and edition served from that endpoint will be configured using new `autoupdate_version` resource.

Whether the Teleport updater querying the endpoint is instructed to upgrade (via the `agent_autoupdate` field) is dependent on:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly is Teleport updater? (I assume this is a new tool we're building, but this is my first read through of the RFD and the term was introduced without context)

If it is a new program, do we need a separate program? Managing the version of some updater and the version of Teleport simultaneously has proven difficult. Have we considered exposing the updater as a subcommand of Teleport?

If the updater itself was part of Teleport, then the updater could use the agent's own certificates to authenticate itself and we wouldn't need to surface upgrade instructions over an unauthenticated channel.

Copy link
Member Author

@sclevine sclevine Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly is Teleport updater?

Sorry, this was more clear when the Runtime section was earlier in the document. This is a separate binary, in tools/teleport-update. The binary is versioned with Teleport, but may be distributed separately from the other Teleport binaries (similar to tbot or tsh).

do we need a separate program?

The updater is intentionally not part of the teleport binary for a few reasons:

teleport-update is distributed separately from teleport, and without any of the other binaries, to bootstrap an initial agent installation. Including all of Teleport would mean that security scanners would match all Teleport dependencies in the bootstrap/update artifact that could be vulnerable, so customers would have to handle rebuilding artifacts (e.g., AMIs) more often.

teleport-update will also handle long-running tbot updates in the future, and I hesitate to distribute teleport with tbot.

teleport-update's API contract with the cluster must be extremely stable. The first GA version of teleport-update will need to work with Teleport v14+, and should ideally work with all future versions of Teleport. This seems (a bit) easier to manage with a separate binary.

then the updater could use the agent's own certificates to authenticate itself and we wouldn't need to surface upgrade instructions over an unauthenticated channel.

We will not always have an authenticated connection to auth to receive these instructions. E.g., the agent could be a failed state due to a botched upgrade, temporarily stopped, or newly installed. In the future, tbot-only installations may have expired certificates. Keeping this unauthenticated reduces complexity. The information exposed is minimal, and the API has been reviewed by Doyensec.

(Will add this context to the RFD.)


The rollout logic is progressed by instance heartbeat backend writes, as changes can only occur on these events.

The following data related to the rollout are stored in each instance heartbeat:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do all agents already perform instance heartbeats?

rfd/0169-auto-updates-linux-agents.md Show resolved Hide resolved

```shell
$ apt-get install teleport-ent-updater
$ teleport-update enable --proxy example.teleport.sh
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so I'm clear, teleport-update enable would actually download teleport if it's not already installed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default, yes. This ensures the configuration is valid, and that further updates can be applied. We may want to provide a --no-update flag in the future, so e.g., AMIs won't need changes to cloud-init to do the initial installation.

updates.yaml:
```
version: v1
kind: agent_versions
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks an awful lot like a teleport resource, but it seems this is just a configuration file for the updater? If so, I might not use the same YAML structure with a version and kind to avoid people thinking that this is a Teleport resource that they need to tctl create. (Similar to how our access plugins or event handler are configured)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@russjones requested this format explicitly

Copy link
Member Author

@sclevine sclevine Sep 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Original thread: #40190 (comment)

(I feel like it's confusing as well)

active_version: 15.1.1
```

backup.yaml:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this supposed to do? I'm not sure if I missed something above or just haven't got to it yet, but I'm finding it hard to comprehend with just a snippet of YAML and no other commentary.

Additionally, when I see kind: db_backup nothing about that name suggests that this is in any way related to agent auto upgrades. Is that intended?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll make this more explicit in the section below.

It's metadata about the sqlite DB backup, referred to here: https://github.com/gravitational/teleport/pull/40190/files#diff-4fb277bd3d884c7b045e7efad61111ac4e7957c063e0c0c9281f4e74fd72997fR571

(Users don't need to worry about this file -- it's an implementation detail.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing about that name suggests that this is in any way related to agent auto upgrades. Is that intended?

This file is not a Teleport resource (see #40190 (comment)). It's machine-local configuration storage for individual Teleport backups. That said, the metadata is descriptive of any backup of Teleport's sqlite DB, so it could describe non-agent data.

Comment on lines 588 to 598
The `autoupdate_agent_plan` spec is owned by the Teleport cluster administrator.
In Teleport Cloud, this is the cloud operations team. For self-hosted setups this is the user with access to the local
admin socket (tctl on local machine).

> [!NOTE]
> This is currently an anti-pattern as we are trying to remove the use of the local administrator in Teleport.
> However, Teleport does not provide any role/permission that we can use for Teleport Cloud operations and cannot be
> granted to users. To part with local admin rights, we need a way to have cloud or admi-only operations.
> This would also improve Cloud team operations by interacting with Teleport API rather than executing local tctl.
>
> Solving this problem is out of the scope of this RFD.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's defer this note to the Cloud RFD

@@ -479,9 +536,12 @@ However, any changes to `agent_schedules` that occur while a group is active wil
Releasing new agent versions multiple times a week has the potential to starve dependent groups from updates.

Note that the `default` schedule applies to agents that do not specify a group name.
[TODO: It seems we removed the default bool, So we have a mandatory default group? Can we pick the last one instead?]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We got feedback that the boolean was confusing during the feedback sessions. No strong opinion. How about last-as-default, but only if no default group is explicitly defined?

days: [ “Sun”, “Mon”, ... | "*" ]
# start_hour specifies the hour when the group may start upgrading.
# default: 0
start_hour: 0-23
# wait_days specifies how many days to wait after the previous group finished before starting.
# default: 0
wait_days: 0-1
# TODO: is this needed? In which case a customer would need to set a custom jitter?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay to drop the configuration option for now. Thought it might be important for self-hosted customers with many agents. I would like the jitter to continue to be served from the cluster in case we need to adjust it.

@@ -454,23 +512,22 @@ spec:
agent_schedules:
regular:
- name: default
days: ["*"]
days: ["*"] # TODO: restrict to work week? Minus Friday?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was assuming the restriction would be a Cloud-only setting, and self-hosted customers would have full control.

- the installation process is complex and users end up installing the wrong version of Teleport
- the current update process does not provide safeties to protect against broken updates
- many customers are not adopting the existing updater because they want to control when updates happen
- we don't offer a ni
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(missing words)

Comment on lines 196 to 198
I create a new deployment, with a broken version. The version is deployed to the canaries.
The canaries crash, the updater reverts the update, the agents connect back online and
advertise they rolled-back. The maintenance is stuck until the canaries are running the target version.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is zero mention of a canary prior to this and until I scrolled down and realized that canaries were now part of the upgrade plan I had no idea what this text was talking about. Without having a high level understanding for how this system is supposed to work some of this UX section is a bit hard to grok.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This RFD is currently being used for Hugo and I to collaborate. We'll have a new RFD open before Thursday.

@sclevine sclevine marked this pull request as draft October 1, 2024 20:56
@sclevine
Copy link
Member Author

sclevine commented Oct 3, 2024

Continued in #47126

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
no-changelog Indicates that a PR does not require a changelog entry rfd Request for Discussion size/md
Projects
None yet
Development

Successfully merging this pull request may close these issues.