Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

⚠️ RSDK-7903 Cache machine status in robot client #4399

Conversation

maximpertsov
Copy link
Contributor

@maximpertsov maximpertsov commented Sep 27, 2024

⚠️ hold off on reviewing for now - the following issue needs to be addressed first


Update robot client to source and cache resources from the MachineStatus API instead of the ResourceNames API.

This change is mostly a no-op for public APIs:

  • ResourceNames does not change - it will return the exact same resources.
  • MachineStatus changes slightly - it will no longer include RDK internal resources and RemoteAPIs (e.g. the resource that represents the remote itself). This mirrors the behavior of ResourceNames.

This change is a prerequisite for keeping remote resources in ResourceNames when they get disconnected - after making that change, we will want MachineStatus to show that those disconnected resources are in a non-Ready state, which is why the robot client needs to cache more information than just ResourceNames.

See this thread for details on our strategy break this change up.

@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Sep 27, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Sep 27, 2024
@@ -28,6 +28,7 @@ var exemptFromSession = map[string]bool{
"/proto.rpc.webrtc.v1.SignalingService/OptionalWebRTCConfig": true,
"/proto.rpc.v1.AuthService/Authenticate": true,
"/proto.rpc.v1.ExternalAuthService/AuthenticateTo": true,
"/viam.robot.v1.RobotService/GetMachineStatus": true,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are now using MachineStatus for inter-robot communication alongside ResourceNames, we need to exempt it from session safety monitoring like we do for ResourceNames

Comment on lines +107 to +109
ResourceNamesFunc: func() []resource.Name { return injectResources },
MachineStatusFunc: func(ctx context.Context) (robot.MachineStatus, error) {
return rdktestutils.ResourcesToMachineStatus(injectResources), nil
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the bulk of test updates in this PR involve injecting MachineStatus function to test robot clients, since this API replaces ResourceNames as the means of sourcing resources

Comment on lines +376 to +381
if name.API == client.RemoteAPI {
continue
}
if name.API.Type.Namespace == resource.APINamespaceRDKInternal {
continue
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is where I make the slight change to the MachineStatus API that I note in the PR description:

MachineStatus changes slightly - it will no longer include RDK internal resources and RemoteAPIs (e.g. the resource that represents the remote itself). This mirrors the behavior of ResourceNames.

Comment on lines -3412 to -3439
{
Name: resource.Name{
API: resource.APINamespaceRDKInternal.WithServiceType("framesystem"),
Name: "builtin",
},
State: resource.NodeStateReady,
},
{
Name: resource.Name{
API: resource.APINamespaceRDKInternal.WithServiceType("cloud_connection"),
Name: "builtin",
},
State: resource.NodeStateReady,
},
{
Name: resource.Name{
API: resource.APINamespaceRDKInternal.WithServiceType("packagemanager"),
Name: "builtin",
},
State: resource.NodeStateReady,
},
{
Name: resource.Name{
API: resource.APINamespaceRDKInternal.WithServiceType("web"),
Name: "builtin",
},
State: resource.NodeStateReady,
},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cheukt
Copy link
Member

cheukt commented Sep 27, 2024

removing myself since I probably won't have time to review, but generally seems good

Copy link
Member

@benjirewis benjirewis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty much LGTM! Just some questions.

@@ -71,7 +71,7 @@ type RobotClient struct {
dialOptions []rpc.DialOption

mu sync.RWMutex
resourceNames []resource.Name
cachedMachineStatus robot.MachineStatus
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be *robot.MachineStatus to indicate a possible state where we haven't cached anything? Would mean you didn't have to return robot.MachineStatus{} in some of your error cases and could just return nil.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be *robot.MachineStatus to indicate a possible state where we haven't cached anything?

agreed - it would be helpful to at least make the internal cachedMachineStatus field a pointer to better express the case where nothing was cached yet.

however...

Would mean you didn't have to return robot.MachineStatus{} in some of your error cases and could just return nil.

the MachineStatus interface method specifies a struct return value instead of a struct pointer, so that method would still have to return MachineStatus{} values. we can revisit this interface typing decision if folks are strongly against it. I wanted to avoid having to do worry about excessive nil checking on MachineStatus, but depending on how large that structure grows it might be useful to make it a pointer for performance reasons.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make it a pointer for performance reasons

discussed offline with @dgottlieb - the resource.Status objects, whose fields are most likely to expand in the short-term, are in slice and therefore stored on the heap and don't create extra work when copying.

there's no immediate plans to add any fields to the top-level struct - perhaps we can revisit this interface type if/when that changes

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah fair; thanks for the response here! Leaving as-is seems fine to me given your points.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool, as-is for both the internal cachedMachineStatus field as well as the public method?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel too strongly. I guess I noticed everything was robot.MachineStatus and not *robot.MachineStatus, and felt it to be odd. Given your reasoning for keeping the public method as robot.MachineStatus, I think either value or pointer is fine for the cached value.

robot/impl/resource_manager.go Outdated Show resolved Hide resolved
injectRobot.MachineStatusFunc = func(ctx context.Context) (robot.MachineStatus, error) {
mu.Lock()
defer mu.Unlock()
callCountNames++
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious if you could use a different variable here. Is it right to assume that the ResourceNamesFunc above is not going to get called at all anymore?

Copy link
Contributor Author

@maximpertsov maximpertsov Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh - sorry missed this due to large-file folding. yes good call, will make separate counters for each and test accordingly just count machine status and remove the resource names stub altogether

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ended up dropping a lot of useless code - thanks for catching!

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Sep 30, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Sep 30, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Sep 30, 2024
@@ -620,22 +620,29 @@ func (rc *RobotClient) resources(ctx context.Context) ([]resource.Name, []resour
defer cancel()
}

resp, err := rc.client.ResourceNames(ctx, &pb.ResourceNamesRequest{})
resp, err := rc.machineStatus(ctx)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heads up @benjirewis @cheukt

spoke IRL with @dgottlieb - we can't move forward with this change, since it's backwards breaking. specifically, this will break comms between two machines if one is running a version has this this endpoint but the other is running a version that does not.

making this transition will require a different strategy to account for this risk, if we still decide to make it.

@maximpertsov
Copy link
Contributor Author

revert this PR to a draft until this issue is resolved: https://github.com/viamrobotics/rdk/pull/4399/files#r1783017466

@maximpertsov maximpertsov marked this pull request as draft October 1, 2024 16:36
@maximpertsov maximpertsov changed the title RSDK-7903 Cache machine status in robot client ⚠️ RSDK-7903 Cache machine status in robot client Oct 1, 2024
@maximpertsov
Copy link
Contributor Author

superseded by #4421

@maximpertsov maximpertsov deleted the RSDK-7903-client-uses-machine-status-noop branch October 7, 2024 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
safe to test This pull request is marked safe to test from a trusted zone
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants