Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RSDK-7903 Keep unhealthy remotes in ResourceNames #4421

Conversation

maximpertsov
Copy link
Contributor

@maximpertsov maximpertsov commented Oct 4, 2024

Changes

  • Keep offline remote resources in ResourceNames
  • Add an internal unreachable field to graph_nodes, so that existing tests have a way to assert on the subset of resources are still online

@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Oct 4, 2024
@@ -182,9 +182,31 @@ func (manager *resourceManager) updateRemoteResourceNames(
rr internalRemoteRobot,
recreateAllClients bool,
) bool {
manager.logger.CDebugw(ctx, "updating remote resource names", "remote", remoteName, "recreateAllClients", recreateAllClients)
logger := manager.logger.WithFields("remote", remoteName)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flyby: we have a lot of logs in this function that all include the remote name as metadata - this local logger formalized this

@@ -194,15 +216,14 @@ func (manager *resourceManager) updateRemoteResourceNames(

for _, resName := range newResources {
remoteResName := resName
resLogger := logger.WithFields("resource", remoteResName)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flyby: similar to the flyby above - all logs local to this loop should include in the resource name in the current loop variable.

}

for resName, isActive := range activeResourceNames {
if isActive {
continue
}
manager.logger.CDebugw(ctx, "attempting to remove remote resource", "name", resName)
resLogger := logger.WithFields("resource", resName)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flyby: exactly the same as this flyby

Comment on lines 392 to 402
func (manager *resourceManager) includeResourceName(k resource.Name) bool {
if k.API == client.RemoteAPI ||
k.API.Type.Namespace == resource.APINamespaceRDKInternal {
return false
}
gNode, ok := manager.resources.Node(k)
if !ok || !gNode.HasResource() {
return false
}
return true
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

factored out to ensure ResourceNames and reachableResourceNames filter for the same resources. another approach instead of having multiple functions is to allow ResourceNames to receive an optional parameter that configures what kinds of resources are filtered (i think @dgottlieb suggested a version of this approach)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I suggested that. The ResourceNamesRequest proto object can take new optional parameters, but the RobotClient.ResourceNames() API cannot without being compile-time backwards breaking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry I got mixed up, you mentioned this for the proto not the method! My mistake - I meant to give credit not incept :)

// want to ensure that once calling `ResourceNames` on the main part returns some remote resource --
// that remote resource will always be returned until it is configured away. When either the remote
// robot removes it from its config. Or when the main part removes the remote.
func TestOfflineRemoteResources(t *testing.T) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

full credit to @dgottlieb for the test - I ported it from here

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Oct 4, 2024
@maximpertsov maximpertsov marked this pull request as ready for review October 4, 2024 20:58
@maximpertsov maximpertsov requested a review from a team October 4, 2024 20:58
Copy link
Member

@dgottlieb dgottlieb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with the logic as-is so long as we document what "include resource names" is all about.

return g.subGraphFrom(node)
}

func (g *Graph) subGraphFrom(node Name) (*Graph, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me, but let's call the function subGraphFromWithMutex. And document the caller must be holding the mu mutex that protects the nodes map.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to confirm, did you mean to suggest renaming the public function to SubGraphFromWithMutex, or renaming this private function to subGraphFromWith*out*Mutex (asterisks for emphasis)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant naming the private one to subGraphWithMutex. Which I suppose sounds funny on the function side, but looks proper when calling it:

func (g *Graph) SubGraphFrom(node Name) (*Graph, error) {
	g.mu.Lock()
	defer g.mu.Unlock()

	return g.subGraphFromWithMutex(node)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, for me that's a bit misleading since I would expect the function name to communicate what it is doing - this name assumes that is being called with a mutex held upstream.

Copy link
Member

@dgottlieb dgottlieb Oct 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given functions (by default) always take the locks they require, the confusion isn't registering on my side. I'm not going to hold the PR up over the naming. But there needs to be documentation that the caller must not be holding the graph mutex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i was going to say i’m flexible on naming also as long as there is a doc-string that states the behavior. will add that before merging

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've warmed up to your version: b380627

}
return names
}

func (manager *resourceManager) includeResourceName(k resource.Name) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see why you factored this out, but the name isn't good. This isn't looking at reachability, so it's not the whole story on whether to "include the resource name". We should document this with exactly what it does and why it does it.

Or because the tests are already using reachableResourceNames under the hood, we can remove this and change the test helper:

func verifyReachableResourceNames(tb testing.TB, r robot.LocalRobot, expected []resource.Name) {
	tb.Helper()

	lRobot, ok := r.(*localRobot)
	test.That(tb, ok, test.ShouldBeTrue)

        reachable := make([]resource.Name, 0)
        for _, resName := range lRobot.ResourceNames() {
          if node, exists := lRobot.manager.graph.Node(resName); exists && node.IsReachable() {
    	    reachable = append(reachable, resName)
          }
        }
        testutils.VerifySameResourceNames(tb, reachable, expected)
}

But that would also require adding the IsReachable API method and fixing mutexes on MarkReachability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My PR initially included an IsReachable method but the result was kind of awkward and deadlock-y. Probably because I needed to do your second suggestion of fixing mutexes.

I'd prefer to just change the name to some better if that's okay - I agree with you that include is crappy and incomplete. So really this is just a utility function that is meant to apply the same filters that ResourceNames applies. So perhaps a lowercase resourceNames with a doc-string specifying what exactly is included/excluded is the way to go?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done 2d2fbba

test.That(tb, ok, test.ShouldBeTrue)

reachable := lRobot.manager.reachableResourceNames()
testutils.VerifySameResourceNames(tb, reachable, expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice compromise to keep existing tests working.

return err
}
for _, node := range subGraph.nodes {
node.unreachable = !reachable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to take the node mutex because the graph is locked? I think that's risky.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that's true - I'll add a private helper to set node reachability with a mutex

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done 324ce54

return false
}
gNode, ok := manager.resources.Node(k)
if !ok || !gNode.HasResource() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was always confused why a node wouldn't have a resource. Is this the unconfigured case? Can we document it here (in addition to the function documentation).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the unconfigured case

I don't think !HasResource() and Unconfigured are exactly 1-1, or at least I haven't been able to prove that to myself yet. I would certainly like them to be. Lemme do a quick review of when this happens exactly and see if I can document this a bit better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah there is some inconsistency. A glaring example is found in the [GraphNode.Close] method - there we nil-ify node resource but do not change the node state. There's a follow-up ticket to clean this and potential other state inconsistencies up, so let me take care of that as a separate change if that's okay.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, definitely don't do anything about it here. Was just hoping to gain clarity on whether that clause is needed and for what reason.

Comment on lines 392 to 402
func (manager *resourceManager) includeResourceName(k resource.Name) bool {
if k.API == client.RemoteAPI ||
k.API.Type.Namespace == resource.APINamespaceRDKInternal {
return false
}
gNode, ok := manager.resources.Node(k)
if !ok || !gNode.HasResource() {
return false
}
return true
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I suggested that. The ResourceNamesRequest proto object can take new optional parameters, but the RobotClient.ResourceNames() API cannot without being compile-time backwards breaking.

@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Oct 7, 2024
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Oct 7, 2024
@maximpertsov maximpertsov force-pushed the RSDK-7903-keep-unhealthy-remotes-minimal branch from ddf426f to b380627 Compare October 7, 2024 17:56
@viambot viambot added safe to test This pull request is marked safe to test from a trusted zone and removed safe to test This pull request is marked safe to test from a trusted zone labels Oct 7, 2024
@dgottlieb
Copy link
Member

Those docstrings are excellent

@maximpertsov maximpertsov merged commit 216563f into viamrobotics:main Oct 7, 2024
29 of 37 checks passed
@maximpertsov maximpertsov deleted the RSDK-7903-keep-unhealthy-remotes-minimal branch October 7, 2024 19:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
safe to test This pull request is marked safe to test from a trusted zone
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants