Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

⚠️ RSDK-7903 Include unhealthy remotes in resource list with status #4273

Draft
wants to merge 41 commits into
base: main
Choose a base branch
from

Conversation

maximpertsov
Copy link
Member

@maximpertsov maximpertsov commented Aug 8, 2024

⚠️ hold off on reviewing for now - the following issue needs to be addressed first


Summary

Keep remote resources in the ResourceNames and MachineStatus APIs when their remote gets disconnected. In the latter API, mark such resources with a state indicate that they are disconnected.

Prerequisites

To make reviews a bit easier, this PR is going to broken out into smaller chunks:

Details

Robot clients now fetch and cache resource statuses via the GetMachineStatus API instead of resource names via ResourceNames API.

Open questions

  • Should we add a new state for disconnected nodes? (e.g. NodeStateDisconnected), or is it sufficient to reuse the unhealthy state with a corresponding error (e.g. NodeStateUnhealthy[reason:disconnected])?
    • FWIW, I slightly prefer a new state given the proposed implementation, since a healthy but disconnected remote resource can appear as "disconnected" or "ready" depending on whether a client calls MachineStatus on the local or remote machine, respectively. Having a specialized state that can only appear for remote resources feels like the less confusing option in this case.
  • Shall we bubble up remote update errors? (see this comment and this comment) - this will make the PR much noisier but might be the correct thing to do.

Depends on

@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Aug 8, 2024
@maximpertsov maximpertsov force-pushed the RSDK-7903-unhealthy-remotes-replace-names branch from aeeba0d to 36dbd45 Compare August 8, 2024 03:39
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Aug 8, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Aug 9, 2024
@maximpertsov maximpertsov force-pushed the RSDK-7903-unhealthy-remotes-replace-names branch from 73cae61 to 4153c0c Compare August 9, 2024 21:08
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Aug 9, 2024
@maximpertsov maximpertsov force-pushed the RSDK-7903-unhealthy-remotes-replace-names branch from fb91e32 to c30937d Compare August 9, 2024 21:17
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Aug 9, 2024
@maximpertsov maximpertsov force-pushed the RSDK-7903-unhealthy-remotes-replace-names branch from c30937d to e2243e1 Compare August 20, 2024 20:12
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Aug 20, 2024
@maximpertsov maximpertsov force-pushed the RSDK-7903-unhealthy-remotes-replace-names branch from bbc5e4c to e2243e1 Compare August 20, 2024 22:07
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Aug 20, 2024
@maximpertsov maximpertsov force-pushed the RSDK-7903-unhealthy-remotes-replace-names branch from e2243e1 to fd0784b Compare August 21, 2024 19:36
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Aug 21, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Sep 24, 2024
Copy link
Member Author

@maximpertsov maximpertsov Sep 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative approach to the changes in this file is to define a new state (e.g. NodeStateDisconnected) - see the "Open questions" section in the PR description for more details and see the revert of this commit for a general idea of how this would look.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the high-level change here is that a robot client now caches the results of MachineStatus rather than just ResourceNames. As a result, multiple tests that result on injecting a mock function for ResourceNames now need to mock MachineStatus.

@maximpertsov maximpertsov marked this pull request as ready for review September 24, 2024 20:11
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Sep 24, 2024
Comment on lines +39 to +41

// NodeStateDisconnected denotes a resource is disconnected.
NodeStateDisconnected
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative to defining a new state would be to mark disconnected resource nodes as "unhealthy" with an appropriate error message. See "Open questions" in the description of this PR for more details and revert this commit to see how this alternative approach would look.

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Sep 24, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Sep 24, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Sep 24, 2024
@dgottlieb
Copy link
Member

dgottlieb commented Sep 25, 2024

I hate to be that person, but is there any chance to split this PR up into 2 (or 3?) commits?

  1. Code change to make resource names work
  2. Code change to make MachineStatus work
  3. Refactor changing client.resourceNames -> client.cachedMachineStatus structures

Or maybe 2 and 3 are the same and its only two commits.

I feel confident reviewing "commit 1", but not useful for the MachineStatus changes. And I know commit 1 really just changes 1/2 files + test files.

I'm also happy to step back if Cheuk/Benji are confident being the stakeholder for (1)

@maximpertsov
Copy link
Member Author

I hate to be that person, but is there any chance to split this PR up into 2 (or 3?) commits?

1. Code change to make resource names work

2. Code change to make MachineStatus work

3. Refactor changing client.resourceNames -> client.cachedMachineStatus structures

Or maybe 2 and 3 are the same and its only two commits.

I feel confident reviewing "commit 1", but not useful for the MachineStatus changes. And I know commit 1 really just changes 1/2 files + test files.

I'm also happy to step back if Cheuk/Benji are confident being the stakeholder for (1)

This is a reasonable request, I was considering making the client cache change as it's own no-op PR anyway.

@maximpertsov
Copy link
Member Author

  1. Refactor changing client.resourceNames -> client.cachedMachineStatus structures

@dgottlieb @benjirewis @cheukt

#4399

@@ -361,6 +364,13 @@ func (w *GraphNode) SetNeedsUpdate() {
w.setNeedsReconfigure(w.Config(), false, w.UnresolvedDependencies())
}

// SetDisconnected is used to mark a remote node as disconnected.
func (w *GraphNode) SetDisconnected() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, in the resource graph we have:

  • A node for each remote robot. Where the underlying resource is the robot client
  • A node for each resource on a robot robot. Where the underlying resource is the component client (e.g: camera client).

Does this SetDisconnected state only apply to nodes representing the remote robot client in the resource graph? Or does it also apply to the remote resource objects in the resource graph?

Second:
And when a remote goes from working -> disconnected -> working, what state transition will that take? Is it ready -> disconnected -> ready? Or do we jump through any of the unconfigured/configuring states?

I think the answers would work well in the documentation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this SetDisconnected state only apply to nodes representing the remote robot client in the resource graph? Or does it also apply to the remote resource objects in the resource graph?

it applies to the remote resource objects. also note that resources representing full remote machine clients are excluded from the MachineStatus endpoint.

And when a remote goes from working -> disconnected -> working, what state transition will that take? Is it ready -> disconnected -> ready? Or do we jump through any of the unconfigured/configuring states?

thanks for drawing attention to this. the flow I want is for disconnected nodes to go directly back into the ready state (unless they are unhealthy for some other reason). however, I don't think the current logic actually transitions disconnected nodes back to ready, so let me address that.

I think the answers would work well in the documentation.

will do

@@ -620,22 +621,29 @@ func (rc *RobotClient) resources(ctx context.Context) ([]resource.Name, []resour
defer cancel()
}

resp, err := rc.client.ResourceNames(ctx, &pb.ResourceNamesRequest{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, are you changing robots to use machine status instead of resource names to populate remote resources?

@maximpertsov maximpertsov marked this pull request as draft October 1, 2024 16:34
@maximpertsov maximpertsov changed the title RSDK-7903 Include unhealthy remotes in resource list with status ⚠️ RSDK-7903 Include unhealthy remotes in resource list with status Oct 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
safe to test This pull request is marked safe to test from a trusted zone
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants