[Internal] [Work in Progress] Fix tests that cause LabVIEW hang #514
Conversation
Files with review comments:
- .../Discovery/test_DiscoveryServiceNotRunning_LaunchDiscoveryService_DiscoveryServiceStarted.vi
- ...erver/Discovery/test_DiscoveryWrapper_StopDiscoveryService_WrapperMethodsWorkWithoutError.vi (outdated)
- Source/Tests/Tests.Runtime/Measurement Server/Helpers/Restart Discovery Service.vi
- ...on/test_SessionManagerWrapper_StopSessionManagementService_WrapperMethodsWorkWithoutError.vi (outdated)
Were you looking for some feedback on this change? It doesn't seem ready for submission in its current state. Are you looking to check in changes to try and diagnose the problem further, or are the test failures understood and is this a proposed fix for the test hangs? If it is the latter, I think I need more details on exactly what the problem was and how this fixes it before I can provide better feedback.
Without any further context, the questions that come to mind are:
- Three of the hanging tests are part of the Discovery Client.lvproj, and each indicates it requires the discovery service to be started ahead of time. If it isn't started, then the tests will hang. How has this ever been working? What ensures the discovery service is actually started in order for these tests to succeed? Is the test execution order deterministic, and did some test that ran before these just happen to start the service? I actually thought these were manual sanity tests to be run after regenerating code from a proto file change to the discovery service. If nothing else, maybe our tests shouldn't be using infinite timeouts?
- It appears we are always using the default cluster ID which will launch the discovery service on the static port 42000. This isn't what we do for our C# tests. For those tests, the testing framework ensures each test uses a unique cluster ID so that each test is isolated from each other. The test framework also ensures the discovery service process is terminated after each test so we don't "leak" processes and bog down the machine over time. Perhaps a similar approach here would resolve the hang and provide better robustness overall for the test suite?
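For illustration, the per-test setup/teardown our C# tests rely on is shaped roughly like the sketch below. The fixture class, exe name, and --cluster-id flag are placeholders here, not the actual test-framework API:

```csharp
using System;
using System.Diagnostics;

// Illustrative only: one discovery service instance per test, on a unique cluster id,
// torn down when the test ends so processes don't leak over time.
public sealed class DiscoveryServiceFixture : IDisposable
{
    public string ClusterId { get; } = Guid.NewGuid().ToString("N");

    private readonly Process _discoveryService;

    public DiscoveryServiceFixture()
    {
        // Launch a dedicated discovery service for this test's cluster id instead of
        // sharing the static port 42000. Exe name and flag are placeholders.
        _discoveryService = Process.Start(new ProcessStartInfo
        {
            FileName = "DiscoveryService.exe",
            Arguments = $"--cluster-id {ClusterId}",
            UseShellExecute = false,
        });
    }

    public void Dispose()
    {
        // Terminate the per-test service so we don't "leak" processes
        // and bog down the machine over time.
        if (_discoveryService != null && !_discoveryService.HasExited)
        {
            _discoveryService.Kill();
            _discoveryService.WaitForExit();
        }
        _discoveryService?.Dispose();
    }
}
```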
@jasonmreding
This is still a work-in-progress PR; I'm testing out a possible fix for the LabVIEW hang.
Three of the hanging tests are part of the Discovery Client.lvproj, and each indicates it requires the discovery service to be started ahead of time. If it isn't started, then the tests will hang. How has this ever been working? What ensures the discovery service is actually started in order for these tests to succeed? Is the test execution order deterministic, and did some test that ran before these just happen to start the service?
Yes! The order of execution is determined by the tests' folder structure. So I believe that previously, in the spawning pool, the Discovery Service exe was already running before the tests ran, which is why we didn't see any errors. We are also starting the Discovery Service before running the tests in this pipeline, so this part is resolved.
This isn't what we do for our C# tests. For those tests, the testing framework ensures each test uses a unique cluster ID so that each test is isolated from each other.
Thanks for this suggestion. The issue we are facing now is that when services are abruptly killed, Resolve Service is unable to restart them, which leads to the hang. Since all tests use the same Discovery Service port, we concluded this is likely what causes LabVIEW to hang. So we tried adding a 5 s delay after killing the services, so that the services are killed completely and the port is available again without issues when they restart. We tested this solution multiple times in the pipeline and it seemed to work.
That said, your suggestion of using a unique cluster id and having the Discovery Service listen on different ports now seems like the better solution.
Here are a few observations on the cluster id approach:
- Currently, cluster id support is only provided for Get Key File Path.vi and not for Start Discovery Service.vi (where it is actually needed) or any of its callers. We would like to understand why this support is not provided. Because of this, it is difficult to pass a cluster id in from the tests.
- So, we are proposing a Named Queue approach to pass the cluster id in from the tests and use it only for tests. We will store the cluster id in a queue before restarting the services (in the test files), and while restarting the services (in Start Discovery Service.vi) we will use the stored cluster id from the queue to start the service. Thus, whenever services are killed in tests, the restarted services will listen on different ports.
Below is a sample of how the test VI and Start Discovery Service.vi will look.
Also, a 50 ms timeout is used for the dequeue so it doesn't wait indefinitely.
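Since the actual change lives in LabVIEW VIs (the screenshots aren't reproduced here), below is a rough C# analogue of the named-queue idea just to show the flow; the class and member names are illustrative only:

```csharp
using System.Collections.Concurrent;

// Illustrative analogue of the proposed LabVIEW named-queue hand-off.
static class ClusterIdHandoff
{
    // Stands in for LabVIEW's named queue: one well-known queue that the test
    // writes to and that Start Discovery Service.vi reads from.
    private static readonly BlockingCollection<string> Queue = new BlockingCollection<string>();

    // Test side: store the cluster id in the queue before restarting the services.
    public static void Publish(string clusterId) => Queue.Add(clusterId);

    // Start Discovery Service side: dequeue with a 50 ms timeout so that production
    // code, which never publishes a cluster id, falls back to the default behavior
    // (static port 42000) instead of waiting forever.
    public static string TryTake()
        => Queue.TryTake(out var clusterId, 50) ? clusterId : null;
}
```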
Please share your thoughts on this approach.
We are also trying to start the Discovery Service first before running tests in this pipeline.
Unless there are performance reasons for doing so, I would rather tests be responsible for setting themselves up so they can succeed rather than relying on some other entity to do it for them. I find this approach generally results in a more maintainable test suite, and developers wanting to run tests locally don't have to know or worry about how to duplicate what the CI machinery is doing for them. However, if this means the test suite takes 10x longer to run, then the extra coupling/brittleness to the test suite might be worth it.
The issue which we are now facing is when services are abruptly killed, we see Resolve Service is unable to restart the services thus leading to hang.
When this occurs, does the ResolveService call itself succeed, or does the call effectively hang because the process for the resolved service never starts? If it succeeds, does it report the same port that was previously being used by the service? I wonder if there is a race condition where the discovery service's event for process termination hasn't completed or hasn't been received from the OS yet, so it thinks the target service is still running on the port that was previously registered by the service.
We would like to understand why this support is not being provided.
There is no good reason. It wasn't functionality that was needed at the time so there was no real focus/attention on its use throughout the API to ensure it could be used in the future if needed.
So, we are proposing a solution of using Named Queue ... Please share your thoughts on this approach.
This approach is effectively the same as creating a global variable for the cluster id, except that you have a little more control over the lifetime of the global. I don't love this idea, but I understand the inclination, as it is the least intrusive solution given the current state of the code. In my mind, the ideal approach would be to have a client class/object rather than just an int. This class would then encapsulate the cluster id setting. If you wanted to use a different cluster in the test, you would create a new client. If you just want to restart the existing client, you would continue using that object, and the internal implementation would continue using the same cluster id. Some ideas/thoughts on different approaches (a rough sketch of the client-object idea follows the list):
- Create a class object around each grpc client session. Since we have a generic client cache in the discovery library (same concept as a channel pool), this would impact the discovery client, session management client, and pin map client. Since each client class would be a unique type, we would have to make the shared FGV implementation a malleable VI or create a unique "pool" implementation for each client type. I think this is doable, but a fair bit of work.
- Rather than create a class object for each client, just bubble the cluster id parameter out through the FGV used for the client pool as well as the VI ref signatures for the Create/Recreate VI references it uses. For the session management and pin map clients, this would just be unused, but they would each have to update their VI signatures to match. Production code would leave the input unwired and get the existing behavior of using the static port. Test code could explicitly create the client first thing in the test in order to create a new service instance on a unique port each time. This solution doesn't provide as much encapsulation as the previous solution, but it has more limited impact on the surrounding code. This isn't really all that much different than the named queue approach since the FGV is effectively already a named queue of sorts. However, I do think it's a little cleaner than what you proposed.
- None of the above deals with cleaning up the services spawned from the unique clusters. It sounds like the test suite was never really doing this anyway in a consistent way to achieve isolation and prevent potential cross talk between tests or possible parallel execution. However, since we always used the same port, that limited the number of processes to one. I don't know if the current deployment mechanism for our tests always starts from a clean slate or not. If so, this would at least isolate the impact of the previous test suite run from the current test suite run. Otherwise, we will likely run into some resource exhaustion issue (memory or port exhaustion) due to launching so many processes that never exit or get killed. The cluster id impacts all services launched by the discovery service as well. So if you resolve the pin map service twice from the discovery service but with a different cluster id each time, you'll end up with two discovery service processes and two pin map service processes. With our C# tests, the test framework makes this easy to clean up in a consistent / reliable way. We don't have that same level of capability here, and I'm not sure how difficult that would be to add.
- The FGV for the client cache could be updated to use a <string, int> map rather than just an int. The cluster id could then effectively be used as the name for the client session. This would allow multiple clients to exist in parallel that communicate with different clusters. I don't know whether we really need this, so it might not be worth thinking about further. This would also require some sort of client factory / singleton at the appropriate scope for our tests. Otherwise, the session management client and the pin map client won't know which discovery cluster to communicate with.
- I don't actually know how many tests are doing end to end testing by spinning up service instances and communicating across grpc. If there are a lot of them, we'll definitely want to keep an eye on the before/after test times to ensure the isolation/robustness benefits of the change are worth whatever the performance cost is.
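Translating very loosely from LabVIEW to C# just to show the shape, the client-object idea above might look something like this; DiscoveryClient and its members are purely illustrative:

```csharp
// Illustrative only: the client object owns the cluster id, so "restart" keeps the
// same cluster while "new client" gets a fresh, isolated one.
public sealed class DiscoveryClient
{
    public string ClusterId { get; }

    public DiscoveryClient(string clusterId = null)
    {
        // Production callers pass nothing and get the default cluster (static port
        // 42000); tests pass a unique id so each test gets its own isolated cluster.
        ClusterId = clusterId ?? "default";
    }

    public void RestartService()
    {
        // Relaunches the discovery service using this client's cluster id, so a
        // restart stays on the same cluster/port as the original launch.
    }
}
```

A test wanting isolation would construct the client with a fresh id (e.g. a GUID), while production code would construct it with no argument and keep the current default-port behavior.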
BTW, it looks like we only have 5 C# tests which kill a service and restart it:
- One uses a 100 ms delay after killing the process.
- One uses no delay and is relatively new. It has never failed so far.
- The other three call Process.Kill followed by Process.WaitForExit.
All of these tests should be running under a unique cluster ID each time. However, the restart should occur on the same cluster ID.
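For reference, the kill/restart pattern those three tests use looks roughly like this; the exe name and flag below are placeholders rather than the real command line:

```csharp
using System.Diagnostics;

static class ServiceRestart
{
    public static Process KillAndRestart(Process service, string clusterId)
    {
        service.Kill();
        service.WaitForExit();   // wait for the OS to finish tearing the process down,
                                 // rather than sleeping for a fixed delay

        // Restart on the SAME cluster id so the test keeps talking to its own cluster.
        return Process.Start("DiscoveryService.exe", $"--cluster-id {clusterId}");
    }
}
```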
Thank you for the suggestions. I'll analyze them and reach out with any queries.
Following the discussions with Jason, I have created a new PR with the updated changes. Please review that PR. Hence, I am closing this one.
What does this Pull Request accomplish?
Fixes the tests that cause LabVIEW to hang when run. Refer to this PR for more details on the issue.
Why should this Pull Request be merged?
What testing has been done?
PR checks