-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RSDK-7903 Keep unhealthy remotes in ResourceNames #4421
RSDK-7903 Keep unhealthy remotes in ResourceNames #4421
Conversation
@@ -182,9 +182,31 @@ func (manager *resourceManager) updateRemoteResourceNames( | |||
rr internalRemoteRobot, | |||
recreateAllClients bool, | |||
) bool { | |||
manager.logger.CDebugw(ctx, "updating remote resource names", "remote", remoteName, "recreateAllClients", recreateAllClients) | |||
logger := manager.logger.WithFields("remote", remoteName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flyby: we have a lot of logs in this function that all include the remote name as metadata - this local logger formalized this
@@ -194,15 +216,14 @@ func (manager *resourceManager) updateRemoteResourceNames( | |||
|
|||
for _, resName := range newResources { | |||
remoteResName := resName | |||
resLogger := logger.WithFields("resource", remoteResName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flyby: similar to the flyby above - all logs local to this loop should include in the resource name in the current loop variable.
} | ||
|
||
for resName, isActive := range activeResourceNames { | ||
if isActive { | ||
continue | ||
} | ||
manager.logger.CDebugw(ctx, "attempting to remove remote resource", "name", resName) | ||
resLogger := logger.WithFields("resource", resName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flyby: exactly the same as this flyby
robot/impl/resource_manager.go
Outdated
func (manager *resourceManager) includeResourceName(k resource.Name) bool { | ||
if k.API == client.RemoteAPI || | ||
k.API.Type.Namespace == resource.APINamespaceRDKInternal { | ||
return false | ||
} | ||
gNode, ok := manager.resources.Node(k) | ||
if !ok || !gNode.HasResource() { | ||
return false | ||
} | ||
return true | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
factored out to ensure ResourceNames
and reachableResourceNames
filter for the same resources. another approach instead of having multiple functions is to allow ResourceNames
to receive an optional parameter that configures what kinds of resources are filtered (i think @dgottlieb suggested a version of this approach)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I suggested that. The ResourceNamesRequest proto object can take new optional parameters, but the RobotClient.ResourceNames()
API cannot without being compile-time backwards breaking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry I got mixed up, you mentioned this for the proto not the method! My mistake - I meant to give credit not incept :)
// want to ensure that once calling `ResourceNames` on the main part returns some remote resource -- | ||
// that remote resource will always be returned until it is configured away. When either the remote | ||
// robot removes it from its config. Or when the main part removes the remote. | ||
func TestOfflineRemoteResources(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
full credit to @dgottlieb for the test - I ported it from here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm good with the logic as-is so long as we document what "include resource names" is all about.
resource/resource_graph.go
Outdated
return g.subGraphFrom(node) | ||
} | ||
|
||
func (g *Graph) subGraphFrom(node Name) (*Graph, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me, but let's call the function subGraphFromWithMutex
. And document the caller must be holding the mu
mutex that protects the nodes
map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to confirm, did you mean to suggest renaming the public function to SubGraphFromWithMutex
, or renaming this private function to subGraphFromWith*out*Mutex
(asterisks for emphasis)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant naming the private one to subGraphWithMutex
. Which I suppose sounds funny on the function side, but looks proper when calling it:
func (g *Graph) SubGraphFrom(node Name) (*Graph, error) {
g.mu.Lock()
defer g.mu.Unlock()
return g.subGraphFromWithMutex(node)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, for me that's a bit misleading since I would expect the function name to communicate what it is doing - this name assumes that is being called with a mutex held upstream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given functions (by default) always take the locks they require, the confusion isn't registering on my side. I'm not going to hold the PR up over the naming. But there needs to be documentation that the caller must not be holding the graph mutex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i was going to say i’m flexible on naming also as long as there is a doc-string that states the behavior. will add that before merging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've warmed up to your version: b380627
robot/impl/resource_manager.go
Outdated
} | ||
return names | ||
} | ||
|
||
func (manager *resourceManager) includeResourceName(k resource.Name) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see why you factored this out, but the name isn't good. This isn't looking at reachability, so it's not the whole story on whether to "include the resource name". We should document this with exactly what it does and why it does it.
Or because the tests are already using reachableResourceNames
under the hood, we can remove this and change the test helper:
func verifyReachableResourceNames(tb testing.TB, r robot.LocalRobot, expected []resource.Name) {
tb.Helper()
lRobot, ok := r.(*localRobot)
test.That(tb, ok, test.ShouldBeTrue)
reachable := make([]resource.Name, 0)
for _, resName := range lRobot.ResourceNames() {
if node, exists := lRobot.manager.graph.Node(resName); exists && node.IsReachable() {
reachable = append(reachable, resName)
}
}
testutils.VerifySameResourceNames(tb, reachable, expected)
}
But that would also require adding the IsReachable
API method and fixing mutexes on MarkReachability
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My PR initially included an IsReachable
method but the result was kind of awkward and deadlock-y. Probably because I needed to do your second suggestion of fixing mutexes.
I'd prefer to just change the name to some better if that's okay - I agree with you that include
is crappy and incomplete. So really this is just a utility function that is meant to apply the same filters that ResourceNames
applies. So perhaps a lowercase resourceNames
with a doc-string specifying what exactly is included/excluded is the way to go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done 2d2fbba
test.That(tb, ok, test.ShouldBeTrue) | ||
|
||
reachable := lRobot.manager.reachableResourceNames() | ||
testutils.VerifySameResourceNames(tb, reachable, expected) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice compromise to keep existing tests working.
resource/resource_graph.go
Outdated
return err | ||
} | ||
for _, node := range subGraph.nodes { | ||
node.unreachable = !reachable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to take the node mutex because the graph is locked? I think that's risky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, that's true - I'll add a private helper to set node reachability with a mutex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done 324ce54
return false | ||
} | ||
gNode, ok := manager.resources.Node(k) | ||
if !ok || !gNode.HasResource() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was always confused why a node wouldn't have a resource. Is this the unconfigured case? Can we document it here (in addition to the function documentation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the unconfigured case
I don't think !HasResource()
and Unconfigured
are exactly 1-1, or at least I haven't been able to prove that to myself yet. I would certainly like them to be. Lemme do a quick review of when this happens exactly and see if I can document this a bit better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah there is some inconsistency. A glaring example is found in the [GraphNode.Close] method - there we nil
-ify node resource but do not change the node state. There's a follow-up ticket to clean this and potential other state inconsistencies up, so let me take care of that as a separate change if that's okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, definitely don't do anything about it here. Was just hoping to gain clarity on whether that clause is needed and for what reason.
robot/impl/resource_manager.go
Outdated
func (manager *resourceManager) includeResourceName(k resource.Name) bool { | ||
if k.API == client.RemoteAPI || | ||
k.API.Type.Namespace == resource.APINamespaceRDKInternal { | ||
return false | ||
} | ||
gNode, ok := manager.resources.Node(k) | ||
if !ok || !gNode.HasResource() { | ||
return false | ||
} | ||
return true | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I suggested that. The ResourceNamesRequest proto object can take new optional parameters, but the RobotClient.ResourceNames()
API cannot without being compile-time backwards breaking.
ddf426f
to
b380627
Compare
Those docstrings are excellent |
Changes
ResourceNames
unreachable
field to graph_nodes, so that existing tests have a way to assert on the subset of resources are still online