Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[cmd/opampsupervisor]: Implement PackagesAvailable for upgrading agent #35503

Open
wants to merge 35 commits into
base: main
Choose a base branch
from

Conversation

BinaryFissionGames
Copy link
Contributor

Description:
Implements ReportPackageStatuses and taking PackagesAvailable for upgrading the agent.

The agent will only accept a top-level package with an empty name. This agent must be signed using cosign's keyless signing method (this is how the opentelemetry-collector-releases repository signs its releases).

The signature field must be populated with the resultant base64 encoded cert and signature, both being space separated (signature = b64_cert + " " + b64_signature).

This first implementation only allows online verification; In order to verify the certificate, it must reach out to the public rekor transparency log instance. It also fetches certificates from the internet. Some of these things are configurable through environment variables, but I figure we can parse out how offline signature verification works in a follow-up PR. This basic setup should allow for signature verification for agents that have access to the internet.

This PR also does not revert the collector if it is unhealthy. This will need to be done in a follow up PR. I think we should do it after #34907 is merged, as I imagine the logic will overlap here.

Link to tracking Issue: Closes #34734, Partially solves #33947

Testing:

  • Unit testing, e2e testing

@BinaryFissionGames BinaryFissionGames marked this pull request as ready for review October 1, 2024 08:07
@BinaryFissionGames BinaryFissionGames requested a review from a team as a code owner October 1, 2024 08:07
Copy link
Contributor

@evan-bradley evan-bradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on, its no small task. I'm still looking through this, but wanted to give early feedback on a few points.

require.Equal(t, &protobufs.PackageStatuses{
Packages: map[string]*protobufs.PackageStatus{
"": {
// TODO: Should initital version be filled in?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be the version/hash of the Collector binary, right? I think the version can be the one obtained during bootstrapping.

return nil
}

// TODO: Certificate paths? The certificate can be specified via SIGSTORE_ROOT_FILE for now
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we link to an issue here? It would make it easier to track and provides some context.

agentHash := h.Sum(nil)

// Load persisted package state, if it exists
// TODO: use packagesStatusPath method somehow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is okay given that it's the only other usage of this and is a very simple call to filepath.Join.

The other options I can think of aren't worth the additional complexity:

  1. Extract the packagesStatusPath functionality to a function and call the function in the method as well. Too much abstraction for a simple call.
  2. Instantiate packageManager above and call pathagesStatusPath here, then fill in other details on the struct afterward. Splits the initialization too much for fairly minimal gains.

return err
}

type packageState struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to instead use the persistent state file to store this? I see how the purpose of each file is distinct, but having two small-ish YAML files also feels a little odd.

This may also be premature, but I think abstracting state storage through the persistent state struct may keep things cleaner in the long run.

@@ -275,20 +297,20 @@ func (s *Supervisor) loadConfig(configFile string) error {
// an OpAMP extension, obtains the agent description, then
// shuts down the Collector. This only needs to happen
// once per Collector binary.
func (s *Supervisor) getBootstrapInfo() (err error) {
func (s *Supervisor) getBootstrapInfo() (agentVersion string, err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is another place where consolidating our storage to the persistent state may make things cleaner. We have two different data flows for the instance ID vs. the agent version in this function, when I think we should probably store the two next two each other so all of the agent's information is recorded in the same place. What do you think?

}

// overwrite agent process
startAgent, err := p.am.stopAgentProcess(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would substantially complicate things if we limited agent process management to the Supervisor struct and only deal with managing the binary file here? My first impression is that we should keep those responsibilities separated if possible and not have the package manager know anything about the Collector process, even if the current implementation isn't too complicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought was that we need to stop the agent process to write the binary in the first place.

We could write the binary somewhere temporarily, and send a message to the agent goroutine to do the replacement, which could be cleaner (better separation of concerns). The package manager would still need to have some handle to signal that, but it wouldn't necessarily be tied to the collector.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we be able to accomplish this with synchronous, one-way communication? I was thinking maybe we could do something vaguely like this in the Supervisor struct:

oldVersion := 0
newVersion := 1

s.commander.stopAgent()
s.packageManager.SwapInVersion(newVersion)

if err := s.commander.startAgent(); err != nil {
	s.packageManager.SwapInVersion(oldVersion)
	s.commander.startAgent()
	s.reportFailedInstallation()
	return
}

s.packageManager.DeleteVersion(oldVersion)
s.reportSuccessfulInstallation()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we could do something like that. I think this also sets us up nicely for restoring the previous version if the new one fails. I'll look into making those changes.

if err != nil {
s.logger.Error("Failed to sync PackagesAvailable message", zap.Error(err))
}
// TODO: Should we wait for the sync to be done somehow? Should it be in a separate goroutine
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I'm seeing, the Done channel only communicates that the PackageSyncer is done communicating package statuses to the server, and doesn't track whether the packages have been downloaded: https://github.com/open-telemetry/opamp-go/blob/main/client/internal/packagessyncer.go#L51. It may be worth waiting for this before communicating any additional updates about package statuses to the server.

Copy link
Contributor Author

@BinaryFissionGames BinaryFissionGames Oct 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually really confused reading that code. It looks like once Sync is done, that channel is closed, so waiting on Done here wouldn't actually add any extra synchronization.

Good call at looking at that impl, I definitely had a different idea of its purpose in my head.

@evan-bradley
Copy link
Contributor

@tigrannajaryan I'd appreciate you taking a look at this, even if just to review conformance to the spec.

Copy link
Member

@tigrannajaryan tigrannajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started reviewing but got confused by the many moving parts of signatures and verifications.

I think we need a design doc or some other form of documentation that explains all the involved parties (cosign, Collector releases, OpAMP Server, Supervisor) and how they dance together to make sure the downloaded binary is what it pretends to be and is safe to use.

Some sort of a sequence diagram that starts with the Collector build on opentelemetry-collector-releases and ends with Supervisor launching the new executable and shows all the steps in between would be great. If we number the steps we could even refer to them in the code.

}

// TODO: Certificate paths? The certificate can be specified via SIGSTORE_ROOT_FILE for now
type AgentSignature struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document AgentSignature and AgentSignatureIdentity.

return splitSignature[0], splitSignature[1], nil
}

// sig is the decoded signature of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete comment?

}

func parsePackageSignature(signature []byte) (b64Cert, b64Signature []byte, err error) {
splitSignature := bytes.SplitN(signature, []byte(" "), 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this signature format documented somewhere? This imposes a requirement on the OpAMP server to generate signatures in a specific way, right?

@BinaryFissionGames
Copy link
Contributor Author

I think we need a design doc or some other form of documentation that explains all the involved parties (cosign, Collector releases, OpAMP Server, Supervisor) and how they dance together to make sure the downloaded binary is what it pretends to be and is safe to use.

Yeah, good call out. I will work on this and get something in our spec that explains everything here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement package manager in supervisor
4 participants