[cmd/opampsupervisor]: Implement PackagesAvailable for upgrading agent #35503
base: main
Conversation
Thanks for taking this on, it's no small task. I'm still looking through this, but wanted to give early feedback on a few points.
```go
require.Equal(t, &protobufs.PackageStatuses{
	Packages: map[string]*protobufs.PackageStatus{
		"": {
			// TODO: Should initial version be filled in?
```
This should probably be the version/hash of the Collector binary, right? I think the version can be the one obtained during bootstrapping.
```go
return nil
}

// TODO: Certificate paths? The certificate can be specified via SIGSTORE_ROOT_FILE for now
```
Could we link to an issue here? It would make it easier to track and provides some context.
```go
agentHash := h.Sum(nil)

// Load persisted package state, if it exists
// TODO: use packagesStatusPath method somehow
```
I think this is okay given that it's the only other usage of this and is a very simple call to filepath.Join.

The other options I can think of aren't worth the additional complexity:

- Extract the `packagesStatusPath` functionality to a function and call the function in the method as well. Too much abstraction for a simple call.
- Instantiate `packageManager` above and call `packagesStatusPath` here, then fill in other details on the struct afterward. Splits the initialization too much for fairly minimal gains.
```go
return err
}

type packageState struct {
```
Would it make sense to instead use the persistent state file to store this? I see how the purpose of each file is distinct, but having two small-ish YAML files also feels a little odd.
This may also be premature, but I think abstracting state storage through the persistent state struct may keep things cleaner in the long run.
```diff
@@ -275,20 +297,20 @@ func (s *Supervisor) loadConfig(configFile string) error {
 // an OpAMP extension, obtains the agent description, then
 // shuts down the Collector. This only needs to happen
 // once per Collector binary.
-func (s *Supervisor) getBootstrapInfo() (err error) {
+func (s *Supervisor) getBootstrapInfo() (agentVersion string, err error) {
```
I think this is another place where consolidating our storage to the persistent state may make things cleaner. We have two different data flows for the instance ID vs. the agent version in this function, when I think we should probably store the two next to each other so all of the agent's information is recorded in the same place. What do you think?
```go
}

// overwrite agent process
startAgent, err := p.am.stopAgentProcess(ctx)
```
Do you think it would substantially complicate things if we limited agent process management to the Supervisor struct and only deal with managing the binary file here? My first impression is that we should keep those responsibilities separated if possible and not have the package manager know anything about the Collector process, even if the current implementation isn't too complicated.
My thought was that we need to stop the agent process to write the binary in the first place.
We could write the binary somewhere temporarily, and send a message to the agent goroutine to do the replacement, which could be cleaner (better separation of concerns). The package manager would still need to have some handle to signal that, but it wouldn't necessarily be tied to the collector.
Would we be able to accomplish this with synchronous, one-way communication? I was thinking maybe we could do something vaguely like this in the Supervisor struct:
```go
oldVersion := 0
newVersion := 1
s.commander.stopAgent()
s.packageManager.SwapInVersion(newVersion)
if err := s.commander.startAgent(); err != nil {
	s.packageManager.SwapInVersion(oldVersion)
	s.commander.startAgent()
	s.reportFailedInstallation()
	return
}
s.packageManager.DeleteVersion(oldVersion)
s.reportSuccessfulInstallation()
```
Yeah, I think we could do something like that. I think this also sets us up nicely for restoring the previous version if the new one fails. I'll look into making those changes.
```go
if err != nil {
	s.logger.Error("Failed to sync PackagesAvailable message", zap.Error(err))
}
// TODO: Should we wait for the sync to be done somehow? Should it be in a separate goroutine
```
From what I'm seeing, the `Done` channel only communicates that the PackageSyncer is done communicating package statuses to the server, and doesn't track whether the packages have been downloaded: https://github.com/open-telemetry/opamp-go/blob/main/client/internal/packagessyncer.go#L51. It may be worth waiting for this before communicating any additional updates about package statuses to the server.
I'm actually really confused reading that code. It looks like once Sync is done, that channel is closed, so waiting on Done here wouldn't actually add any extra synchronization.
Good call at looking at that impl, I definitely had a different idea of its purpose in my head.
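The point about the closed channel can be illustrated directly: receiving from a closed channel returns immediately, so waiting on a channel that `Sync` closes before returning adds no extra synchronization. A self-contained sketch:

```go
package main

import "fmt"

func main() {
	// Stand-in for the syncer's Done channel, which Sync closes
	// before it returns.
	done := make(chan struct{})
	close(done)

	// Receiving from a closed channel yields the zero value
	// immediately; no blocking, hence no added synchronization.
	<-done
	fmt.Println("no blocking occurred")
}
```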
@tigrannajaryan I'd appreciate you taking a look at this, even if just to review conformance to the spec.
I started reviewing but got confused by the many moving parts of signatures and verifications.
I think we need a design doc or some other form of documentation that explains all the involved parties (cosign, Collector releases, OpAMP Server, Supervisor) and how they dance together to make sure the downloaded binary is what it pretends to be and is safe to use.
Some sort of a sequence diagram that starts with the Collector build on opentelemetry-collector-releases and ends with Supervisor launching the new executable and shows all the steps in between would be great. If we number the steps we could even refer to them in the code.
```go
}

// TODO: Certificate paths? The certificate can be specified via SIGSTORE_ROOT_FILE for now
type AgentSignature struct {
```
Please document AgentSignature and AgentSignatureIdentity.
```go
return splitSignature[0], splitSignature[1], nil
}

// sig is the decoded signature of
```
Incomplete comment?
```go
}

func parsePackageSignature(signature []byte) (b64Cert, b64Signature []byte, err error) {
	splitSignature := bytes.SplitN(signature, []byte(" "), 2)
```
Is this signature format documented somewhere? This imposes a requirement on the OpAMP server to generate signatures in a specific way, right?
Yeah, good call out. I will work on this and get something in our spec that explains everything here.
Description:
Implements ReportPackageStatuses and handles PackagesAvailable for upgrading the agent.
The agent will only accept a top-level package with an empty name. The agent binary must be signed using cosign's keyless signing method (this is how the opentelemetry-collector-releases repository signs its releases). The signature field must be populated with the resulting base64-encoded cert and signature, separated by a single space (signature = b64_cert + " " + b64_signature).
This first implementation only allows online verification: in order to verify the certificate, it must reach out to the public Rekor transparency log instance, and it also fetches certificates from the internet. Some of this is configurable through environment variables, but I figure we can work out offline signature verification in a follow-up PR. This basic setup should allow signature verification for agents that have access to the internet.
This PR also does not revert the Collector if it is unhealthy. That will need to be done in a follow-up PR; I think we should do it after #34907 is merged, as I imagine the logic will overlap here.
Link to tracking Issue: Closes #34734, Partially solves #33947
Testing: