[Internal] [Work in Progress] Fix tests that cause LabVIEW hang #514
Conversation
Files with review comments:
- .../Discovery/test_DiscoveryServiceNotRunning_LaunchDiscoveryService_DiscoveryServiceStarted.vi
- ...erver/Discovery/test_DiscoveryWrapper_StopDiscoveryService_WrapperMethodsWorkWithoutError.vi (outdated)
- Source/Tests/Tests.Runtime/Measurement Server/Helpers/Restart Discovery Service.vi
- ...on/test_SessionManagerWrapper_StopSessionManagementService_WrapperMethodsWorkWithoutError.vi (outdated)
Were you looking for some feedback on this change? It doesn't seem ready for submission in its current state. Are you looking to check in changes to try and diagnose the problem further, or are the test failures understood and is this a proposed fix for the test hangs? If it is the latter, I think I need more details on exactly what the problem was and how this fixes it before I can provide better feedback.
Without any further context, the questions that come to mind are:
- Three of the hanging tests are part of the Discovery Client.lvproj, and each indicates it requires the discovery service to be started ahead of time. If it isn't started, then the tests will hang. How has this ever been working? What ensures the discovery service is actually started in order for these tests to succeed? Is the test execution order deterministic, and did some test that ran before these just happen to start the service? I actually thought these were manual sanity tests to be run after regenerating code from a proto file change to the discovery service. If nothing else, maybe our tests shouldn't be using infinite timeouts?
- It appears we are always using the default cluster ID which will launch the discovery service on the static port 42000. This isn't what we do for our C# tests. For those tests, the testing framework ensures each test uses a unique cluster ID so that each test is isolated from each other. The test framework also ensures the discovery service process is terminated after each test so we don't "leak" processes and bog down the machine over time. Perhaps a similar approach here would resolve the hang and provide better robustness overall for the test suite?
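For illustration, the per-test setup/teardown our C# tests rely on is shaped roughly like the sketch below. The fixture class, exe name, and --cluster-id flag are placeholders here, not the actual test-framework API:

```csharp
using System;
using System.Diagnostics;

// Illustrative only: one discovery service instance per test, on a unique cluster id,
// torn down when the test ends so processes don't leak over time.
public sealed class DiscoveryServiceFixture : IDisposable
{
    public string ClusterId { get; } = Guid.NewGuid().ToString("N");

    private readonly Process _discoveryService;

    public DiscoveryServiceFixture()
    {
        // Launch a dedicated discovery service for this test's cluster id instead of
        // sharing the static port 42000. Exe name and flag are placeholders.
        _discoveryService = Process.Start(new ProcessStartInfo
        {
            FileName = "DiscoveryService.exe",
            Arguments = $"--cluster-id {ClusterId}",
            UseShellExecute = false,
        });
    }

    public void Dispose()
    {
        // Terminate the per-test service so we don't "leak" processes
        // and bog down the machine over time.
        if (_discoveryService != null && !_discoveryService.HasExited)
        {
            _discoveryService.Kill();
            _discoveryService.WaitForExit();
        }
        _discoveryService?.Dispose();
    }
}
```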
@jasonmreding
This is still a work-in-progress PR; I'm testing out a possible fix for the LabVIEW hang.
Three of the hanging tests are part of the Discovery Client.lvproj, and each indicates it requires the discovery service to be started ahead of time. If it isn't started, then the tests will hang. How has this ever been working? What ensures the discovery service is actually started in order for these tests to succeed? Is the test execution order deterministic, and did some test that ran before these just happen to start the service?
Yes! The order of execution is determined by the tests' folder structure. So I believe that previously, in the spawning pool, the Discovery Service exe was already running before the tests ran, which is why we didn't see any errors. We are also starting the Discovery Service before running the tests in this pipeline, so this part is resolved.
This isn't what we do for our C# tests. For those tests, the testing framework ensures each test uses a unique cluster ID so that each test is isolated from each other.
Thanks for this suggestion. The issue we are facing now is that when services are abruptly killed, Resolve Service is unable to restart them, which leads to the hang. Since all tests use the same Discovery Service port, we concluded this is likely what causes LabVIEW to hang. So we tried adding a 5 s delay after killing the services, so that the services are killed completely and the port is available again without issues when they restart. We tested this solution multiple times in the pipeline and it seemed to work.
That said, your suggestion of using a unique cluster id and having the Discovery Service listen on different ports now seems like the better solution.
Here are a few observations on the cluster id approach:
- Currently, cluster id support is only provided for Get Key File Path.vi and not for Start Discovery Service.vi (where it is actually needed) or any of its callers. We would like to understand why this support is not provided. Because of this, it is difficult to pass a cluster id in from the tests.
- So, we are proposing a Named Queue approach to pass the cluster id in from the tests and use it only for tests. We will store the cluster id in a queue before restarting the services (in the test files), and while restarting the services (in Start Discovery Service.vi) we will use the stored cluster id from the queue to start the service. Thus, whenever services are killed in tests, the restarted services will listen on different ports.
Below is a sample of how the test VI and Start Discovery Service.vi will look.
Also, a 50 ms timeout is used for the dequeue so it doesn't wait indefinitely.
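Since the actual change lives in LabVIEW VIs (the screenshots aren't reproduced here), below is a rough C# analogue of the named-queue idea just to show the flow; the class and member names are illustrative only:

```csharp
using System.Collections.Concurrent;

// Illustrative analogue of the proposed LabVIEW named-queue hand-off.
static class ClusterIdHandoff
{
    // Stands in for LabVIEW's named queue: one well-known queue that the test
    // writes to and that Start Discovery Service.vi reads from.
    private static readonly BlockingCollection<string> Queue = new BlockingCollection<string>();

    // Test side: store the cluster id in the queue before restarting the services.
    public static void Publish(string clusterId) => Queue.Add(clusterId);

    // Start Discovery Service side: dequeue with a 50 ms timeout so that production
    // code, which never publishes a cluster id, falls back to the default behavior
    // (static port 42000) instead of waiting forever.
    public static string TryTake()
        => Queue.TryTake(out var clusterId, 50) ? clusterId : null;
}
```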
Please share your thoughts on this approach.
We are also trying to start the Discovery Service first before running tests in this pipeline.
Unless there are performance reasons for doing so, I would rather tests be responsible for setting themselves up so they can succeed rather than relying on some other entity to do it for them. I find this approach generally results in a more maintainable test suite, and developers wanting to run tests locally don't have to know or worry about how to duplicate what the CI machinery is doing for them. However, if this means the test suite takes 10x longer to run, then the extra coupling/brittleness to the test suite might be worth it.
The issue which we are now facing is when services are abruptly killed, we see Resolve Service is unable to restart the services thus leading to hang.
When this occurs, does the ResolveService call itself succeed, or does the call effectively hang because the process for the resolved service never starts? If it succeeds, does it report the same port that was previously being used by the service? I wonder if there is a race condition where the discovery service's event for process termination hasn't completed or hasn't been received from the OS yet, so it thinks the target service is still running on the port that was previously registered by the service.
We would like to understand why this support is not being provided.
There is no good reason. It wasn't functionality that was needed at the time so there was no real focus/attention on its use throughout the API to ensure it could be used in the future if needed.
So, we are proposing a solution of using Named Queue ... Please share your thoughts on this approach.
This approach is effectively the same as creating a global variable for the cluster id, except that you have a little more control over the lifetime of the global. I don't love this idea, but I understand the inclination, as it is the least intrusive solution given the current state of the code. In my mind, the ideal approach would be to have a client class/object rather than just an int. This class would then encapsulate the cluster id setting. If you wanted to use a different cluster in the test, you would create a new client. If you just want to restart the existing client, you would continue using that object, and the internal implementation would continue using the same cluster id. Some ideas/thoughts on different approaches (a rough sketch of the client-object idea follows the list):
- Create a class object around each grpc client session. Since we have a generic client cache in the discovery library (same concept as a channel pool), this would impact the discovery client, session management client, and pin map client. Since each client class would be a unique type, we would have to make the shared FGV implementation a malleable VI or create a unique "pool" implementation for each client type. I think this is doable, but a fair bit of work.
- Rather than create a class object for each client, just bubble the cluster id parameter out through the FGV used for the client pool as well as the VI ref signatures for the Create/Recreate VI references it uses. For the session management and pin map clients, this would just be unused, but they would each have to update their VI signatures to match. Production code would leave the input unwired and get the existing behavior of using the static port. Test code could explicitly create the client first thing in the test in order to create a new service instance on a unique port each time. This solution doesn't provide as much encapsulation as the previous solution, but it has more limited impact on the surrounding code. This isn't really all that much different than the named queue approach since the FGV is effectively already a named queue of sorts. However, I do think it's a little cleaner than what you proposed.
- None of the above deals with cleaning up the services spawned from the unique clusters. It sounds like the test suite was never really doing this anyway in a consistent way to achieve isolation and prevent potential cross talk between tests or possible parallel execution. However, since we always used the same port, that limited the number of processes to one. I don't know if the current deployment mechanism for our tests always starts from a clean slate or not. If so, this would at least isolate the impact of the previous test suite run from the current test suite run. Otherwise, we will likely run into some resource exhaustion issue (memory or port exhaustion) due to launching so many processes that never exit or get killed. The cluster id impacts all services launched by the discovery service as well. So if you resolve the pin map service twice from the discovery service but with a different cluster id each time, you'll end up with two discovery service processes and two pin map service processes. With our C# tests, the test framework makes this easy to clean up in a consistent / reliable way. We don't have that same level of capability here, and I'm not sure how difficult that would be to add.
- The FGV for the client cache could be updated to use a <string, int> map rather than just an int. The cluster id could then effectively be used as the name for the client session. This would allow multiple clients to exist in parallel that communicate with different clusters. I don't know whether we really need this, so it might not be worth thinking about further. This would also require some sort of client factory / singleton at the appropriate scope for our tests. Otherwise, the session management client and the pin map client won't know which discovery cluster to communicate with.
- I don't actually know how many tests are doing end to end testing by spinning up service instances and communicating across grpc. If there are a lot of them, we'll definitely want to keep an eye on the before/after test times to ensure the isolation/robustness benefits of the change are worth whatever the performance cost is.
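Translating very loosely from LabVIEW to C# just to show the shape, the client-object idea above might look something like this; DiscoveryClient and its members are purely illustrative:

```csharp
// Illustrative only: the client object owns the cluster id, so "restart" keeps the
// same cluster while "new client" gets a fresh, isolated one.
public sealed class DiscoveryClient
{
    public string ClusterId { get; }

    public DiscoveryClient(string clusterId = null)
    {
        // Production callers pass nothing and get the default cluster (static port
        // 42000); tests pass a unique id so each test gets its own isolated cluster.
        ClusterId = clusterId ?? "default";
    }

    public void RestartService()
    {
        // Relaunches the discovery service using this client's cluster id, so a
        // restart stays on the same cluster/port as the original launch.
    }
}
```

A test wanting isolation would construct the client with a fresh id (e.g. a GUID), while production code would construct it with no argument and keep the current default-port behavior.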
BTW, it looks like we only have 5 C# tests which kill a service and restart it:
- One uses a 100 ms delay after killing the process.
- One uses no delay and is relatively new. It has never failed so far.
- The other three call Process.Kill followed by Process.WaitForExit.
All of these tests should be running under a unique cluster ID each time. However, the restart should occur on the same cluster ID.
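For reference, the kill/restart pattern those three tests use looks roughly like this; the exe name and flag below are placeholders rather than the real command line:

```csharp
using System.Diagnostics;

static class ServiceRestart
{
    public static Process KillAndRestart(Process service, string clusterId)
    {
        service.Kill();
        service.WaitForExit();   // wait for the OS to finish tearing the process down,
                                 // rather than sleeping for a fixed delay

        // Restart on the SAME cluster id so the test keeps talking to its own cluster.
        return Process.Start("DiscoveryService.exe", $"--cluster-id {clusterId}");
    }
}
```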
Thank you for the suggestions. I'll analyze them and reach out with any queries.
Following the discussions with Jason, I have created a new PR with the updated changes. Please review that PR. Hence, I am closing this one.
What does this Pull Request accomplish?
Fixes the tests that cause LabVIEW to hang when run. Refer to this PR for more details on the issue.
Why should this Pull Request be merged?
What testing has been done?
PR checks