Sparse live test matrix on client and service version #10085

Closed
sima-zhu opened this issue Apr 10, 2020 · 26 comments
Labels
Client This issue points to a problem in the data-plane of the library. EngSys This issue is impacting the engineering system. test-enhancement

Comments

@sima-zhu
Contributor

sima-zhu commented Apr 10, 2020

Problems:

Java live tests currently run only on the Netty HTTP client and the latest service version. However, there are other popular HTTP clients, and more service versions need to be covered by our live tests to make sure customers get a well-tested library.

Proposals:

We currently have 2 HTTP clients (Netty, OkHttp) and several service versions.
Our tests can already be configured to run against a specific HTTP client and service version.

  1. We implement a rolling strategy in which each platform (OS + JDK) runs one combination per day, and the assignments roll to the next setting on the following day, so that every platform gets the chance to run every setting on a weekly basis (a sketch of the selection logic follows the Pros list below).

An example schedule looks like the following:

               
| | | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Java 8 | Linux | Netty | OkHttp | Netty | OkHttp | Netty | OkHttp | Netty |
| | Windows | OkHttp | Netty | OkHttp | Netty | OkHttp | Netty | OkHttp |
| | MacOS | Netty | OkHttp | Netty | OkHttp | Netty | OkHttp | Netty |
| Java 11 | Linux | OkHttp | Netty | OkHttp | Netty | OkHttp | Netty | OkHttp |
| | Windows | Netty | OkHttp | Netty | OkHttp | Netty | OkHttp | Netty |
| | MacOS | OkHttp | Netty | OkHttp | Netty | OkHttp | Netty | OkHttp |

| | | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Java 8 | Linux | Version 1 | Version 2 | Version 3 | Version 4 | Version 1 | Version 2 | Version 3 |
| | Windows | Version 2 | Version 3 | Version 4 | Version 1 | Version 2 | Version 3 | Version 1 |
| | MacOS | Version 3 | Version 4 | Version 1 | Version 2 | Version 3 | Version 1 | Version 2 |
| Java 11 | Linux | Version 4 | Version 1 | Version 2 | Version 3 | Version 4 | Version 1 | Version 2 |
| | Windows | Version 1 | Version 2 | Version 3 | Version 4 | Version 1 | Version 2 | Version 3 |
| | MacOS | Version 2 | Version 3 | Version 4 | Version 1 | Version 2 | Version 3 | Version 4 |

Pros:

  1. In this way, each build does not take too long to finish, as there are usually only 1 or 2 configurations to run for every test.
  2. We have a weekly repeatable pattern.
  3. Display names and log printouts identify the configuration, making it possible to reproduce the configuration a test failed on in the pipeline.
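
As an illustration of the selection logic only (the enum and class below are hypothetical, not azure-core-test's actual API), the day-based rotation in the tables above could be computed roughly like this:

```java
import java.time.DayOfWeek;
import java.time.LocalDate;

// Hypothetical types for illustration; real tests would use the SDK's
// own HTTP client and service version abstractions.
enum HttpClientType { NETTY, OKHTTP }

final class RollingSchedule {
    /**
     * Picks the HTTP client for a given date by rotating through the
     * available clients. The per-platform offset staggers the rotation,
     * as in the table above, so different platforms run different
     * clients on the same day while each platform still covers every
     * client over the week.
     */
    static HttpClientType clientFor(LocalDate date, int platformOffset) {
        HttpClientType[] clients = HttpClientType.values();
        DayOfWeek day = date.getDayOfWeek(); // MONDAY == 1 ... SUNDAY == 7
        return clients[(day.getValue() + platformOffset) % clients.length];
    }
}
```

The service-version rotation in the second table could use the same modular scheme over the four versions.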
@sima-zhu sima-zhu added the EngSys This issue is impacting the engineering system. label Apr 10, 2020
@sima-zhu sima-zhu added Client This issue points to a problem in the data-plane of the library. test-enhancement labels Apr 10, 2020
@weshaggard
Member

Varying what we run based on the day of the week makes builds non-repeatable and will cause problems when diagnosing failures, as well as when tracking test history (assuming the test names are the same). Instead of varying which settings we use based on the day, we should create a test matrix that remains constant; we can then execute subsets of that matrix in different jobs or pipelines, run on whatever schedule makes sense to balance resource consumption.

It is not feasible to test every possible combination and permutation, so we have to do our best to come up with a sparse test matrix that maximizes the value we gain from the test coverage against the cost (both time and money) it takes to run it. I do not believe we should be introducing what appears to be randomization into our test runs to address this problem; we should simply run the set of tests we feel covers the most important permutations.

@kurtzeborn
Member

I agree with Wes's points on this. We should not have a single pipeline that has different behaviors on different days. It will be impossible to debug failures in the pipeline for anyone without the tribal knowledge of what happens on each day of the week.

@sima-zhu
Contributor Author

It is a tradeoff between repeatability and discovery/maintenance.

With static settings on the pipelines, we have to manually add configs every time there is an update to a version or HTTP client. That demands a lot of maintenance work; sometimes we even have to re-release azure-core-test to support new environment variables.

Moreover, Java is still in the issue-discovery phase. Randomization will surface failures that only happen with a specific combination on certain platforms. I don't think the day of the week will affect the test results at the current phase; if it does in the future, we can easily switch back to static settings with the current implementation.

@weshaggard
Member

I don't think discovery/maintenance will be helped by making the pipelines run differently on different days. It will in fact make it more difficult to discover what really caused something to fail, and thus make the tests exhibit reliability issues. Imagine working on a PR over multiple days: on one day the CI checks fail, but the next day when you get back to it they all pass. The likelihood of someone actually investigating the failure from yesterday is almost zero.

I'd like to get to the true issue we are trying to solve, as opposed to jumping to this particular solution. What is the real underlying issue? Is it that we want to run all the test combinations but they take too long? If that is the case, have we tried any parallelization efforts in running the tests? I bet there is work we can do to improve the performance of these tests that would make this a non-issue, so we wouldn't need to worry about over-complicating things by running different variants on different days.

@sima-zhu
Contributor Author

The rolling strategy only applies to live tests. CI will keep running on the same set of parameters (currently the mock client and the latest service version).

The target is to avoid manually changing the configuration file for newly introduced parameters.
Meanwhile, we don't want to run too many test suites in one build; it is better for the build to finish by early morning.

@weshaggard
Member

Even for nightly live test runs, we don't want tests failing one day and not the next; the same thing will start to happen where folks ignore a failure because it started passing the next day.

You still didn't answer the question about the real problem we are trying to solve. If it is about runtime, then we should explore parallelization options.

@weshaggard
Member

> The target is to avoid manually changing the configuration file for newly introduced parameters.

I'm not sure I understand that statement. Wouldn't it simply be a matter of passing the parameter in the build yml file vs. doing it in Java code? I'm not sure how it avoids changing one over the other; some configuration will have to be changed in either case.

@joshfree
Member

Our main goal here is to make a sparse test matrix to reduce the daily test run times and to save resources.

Instead of pivoting on the calendar day, we could also refactor the test runs into multiple legs (each self-consistent) and then decide on the frequency of those runs. @sima-zhu, do you think you could take a second pass at the table above and propose a different factoring, so that each test leg is self-consistent and doesn't use the day of the week as the deciding factor of what gets tested?

@sima-zhu
Contributor Author

sima-zhu commented Apr 14, 2020

For example, suppose we have new parameters, like rpc or a new service version. Then all clients (Storage, KeyVault, and so on) need to update their yml files, and we need to set up many new pipelines so that all clients run on the new parameters. What's more, in order for the tests to pick up the right environment, we need to introduce environment variables in the Java code as well, either in the common azure-core-test or in each client library. If it goes into azure-core-test, then we have to release azure-core-test, which we don't want to release that often.

@weshaggard
Member

> For example, suppose we have new parameters, like rpc or a new service version. Then all clients (Storage, KeyVault, and so on) need to update their yml files.

There is a common test yml template specifically for sharing such test configuration. Adding a new test matrix to that template will include it in all the libraries that share the common test templates, which should be all our newer ones. So what is the issue with adding a new matrix entry to that template to cover the new parameters you want to change?

@sima-zhu
Contributor Author

sima-zhu commented Apr 14, 2020

Another concern: we don't want Windows + Java 8 to always run on (Netty, v1).
Say we have (Netty, OkHttp) and (v1, v2).
We want (Windows + Java 8) to have the chance to run (Netty, v1), (Netty, v2), (OkHttp, v1), and (OkHttp, v2).
If we don't want runs to take too long, we need 4 pipelines for each (OS + JDK).
If we add one more HTTP client, that increases to 6 pipelines per (OS + JDK).

@weshaggard
Member

The nature of the sparse matrix is to identify the key combinations we care about and add them to the test matrix. Unfortunately, it isn't realistic to run all combinations all the time. We should identify the matrix parameters we want to run all the time; if we want to run additional matrix entries only sometimes, we should create additional pipelines that run them on some schedule handled by our DevOps, not vary them in code, as our pipelines would not be deterministic in that case.

@sima-zhu
Contributor Author

@JonathanGiles @srnagar

Could you share your thoughts?

@weshaggard
Member

@sima-zhu has there been any investigation into running all the combinations in parallel in Maven? Also, for reference, do we have any timing data on how long a test run currently takes when it runs both okhttp and netty in one run? I believe that is the current situation.

@sima-zhu
Contributor Author

Here is the KeyVault pipeline for your reference:
https://dev.azure.com/azure-sdk/internal/_build?definitionId=504&_a=summary

@sima-zhu
Contributor Author

sima-zhu commented Apr 14, 2020

KeyVault is running all combinations for now (4 in total: 2 clients × 2 versions). KeyVault's tests are not the most time-consuming; Storage takes even longer, so we cannot extend this to Storage until we figure out the best approach.

@weshaggard
Member

cc @g2vinay
Just taking a quick look at KeyVault, it looks like each test leg takes about 2 hrs 40 mins, and a couple of things jump out at me:

  1. Why are we setting MaxParallel to 2 (https://github.com/Azure/azure-sdk-for-java/blob/master/sdk/keyvault/tests.yml#L7)? That means only 2 legs can run at a time, which extends the full runtime of the entire pipeline.
  2. Looking at a single test that seems to take at least 5 minutes to complete, listDeletedKeys, I see a number of calls to sleep right in the test method. That is generally a bad idea in tests because it guarantees the test takes longer; it is generally recommended to poll instead of blindly sleeping.
  3. As far as I can tell, the tests are not running in parallel. We should look into running them in parallel; see https://maven.apache.org/surefire/maven-surefire-plugin/examples/fork-options-and-parallel-execution.html (a minimal configuration sketch follows below). We should do that for all the tests.
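
For reference, the parallel and fork options from the linked surefire page look roughly like this in a pom.xml; the values below are illustrative starting points rather than tuned recommendations:

```xml
<!-- Sketch of surefire parallel test execution; tune values per library. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- Run test classes concurrently on threads within one JVM. -->
    <parallel>classes</parallel>
    <threadCount>4</threadCount>
    <!-- Additionally fork one JVM per CPU core ("1C"). -->
    <forkCount>1C</forkCount>
    <reuseForks>true</reuseForks>
  </configuration>
</plugin>
```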

@sima-zhu
Contributor Author

From what I learned in the discussion, KeyVault has a risk of throttling if too many tests run in parallel.
Some of the clients' resources share this concern about running in parallel. The test parameters execute sequentially for the same reason (run Netty + v1, then Netty + v2, and so on). The parallel setting depends on the client and cannot be applied to all of them.

@g2vinay
Member

g2vinay commented Apr 14, 2020

@weshaggard
Thanks for your observations.
For #1 and #3: increasing parallelism causes KV to throttle.
For #2: the sleep statements are there to give the action enough time to complete on the service side and to avoid throttling KV.

A simple for loop executing an API call 25-50 times is enough to throttle KV, so looping without any delay tends to do that too. I would like to experiment with changing the retry policy for #2, though, and see whether it reduces the need for the delay.
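
As a sketch of the poll-instead-of-sleep idea from point #2 (the helper below is hypothetical, not an existing azure-core-test utility), a test can check a condition at a short interval and give up only after an overall deadline, rather than always paying a fixed worst-case delay:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.BooleanSupplier;

// Hypothetical helper: poll with a deadline instead of one long sleep.
// The interval still spaces out service calls enough to avoid
// hammering (and throttling) Key Vault.
final class TestPoller {
    static void pollUntil(BooleanSupplier condition, Duration timeout, Duration interval)
            throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (!condition.getAsBoolean()) {
            if (Instant.now().isAfter(deadline)) {
                throw new IllegalStateException("Condition not met within " + timeout);
            }
            Thread.sleep(interval.toMillis());
        }
    }
}
```

A call such as `TestPoller.pollUntil(() -> deletedKeyExists(name), Duration.ofMinutes(2), Duration.ofSeconds(5))` (with `deletedKeyExists` standing in for whatever check the test needs) then completes as soon as the service catches up, instead of sleeping for the worst case every time.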

@srnagar
Member

srnagar commented Apr 14, 2020

> The nature of the sparse matrix is to identify the key combinations we care about and add them to the test matrix. Unfortunately, it isn't realistic to run all combinations all the time.

The motivation for proposing the rolling strategy was to test all combinations of HttpClient and ServiceVersion every day without spending too much time or consuming too many resources.
If this is not an option, then we should at least identify the key combinations to run every day and augment them with another pipeline that runs all the less-important combinations once a week/month.

We should have the necessary alerts to notify the team of any failures.

@weshaggard
Member

@g2vinay have you chatted with the KeyVault owners in the other languages? The .NET tests, for example, usually run in about 30 mins, and I'm pretty sure they run in parallel. I also see the Python KeyVault tests running a large matrix in parallel. I think it is worth further investigation into improving parallelization; it will likely go a long way toward improving the runtime of the tests.

I believe we can still do a better job in general of enabling our tests to run in parallel. However, even with full parallelization we can only get so far, so I think we can define the full matrix in the yml file but only run a subset all the time, and run the full set on a weekly or monthly schedule in another pipeline. That should enable us to add more parameters now while we keep working on improving the tests themselves.

@sima-zhu
Contributor Author

sima-zhu commented Apr 15, 2020

Synced with @srnagar

Here is how we want to move forward: in archetype-sdk-tests.yml, we distribute the current clients (Netty, OkHttp) evenly across the current matrix, running only the latest service version (a yml sketch follows below):

  • Linux + JDK 8: Netty + latest version
  • macOS + JDK 8: OkHttp + latest version
  • Windows + JDK 8: Netty + latest version
  • Linux + JDK 11: OkHttp + latest version
  • macOS + JDK 11: Netty + latest version
  • Windows + JDK 11: OkHttp + latest version

For the non-critical path, we will defer to each client to set up its own matrix; e.g., KeyVault can have an additional pipeline running on Netty + an older service version, or more pipelines if needed.
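
Expressed as an Azure Pipelines matrix, that distribution might look roughly like the sketch below (the variable names are illustrative, not the archetype template's actual ones):

```yaml
# Illustrative matrix sketch; variable names are hypothetical and all
# entries implicitly use the latest service version.
strategy:
  matrix:
    Linux_Java8_Netty:
      OSVmImage: 'ubuntu-latest'
      JavaVersion: '8'
      HttpClientType: 'netty'
    MacOS_Java8_OkHttp:
      OSVmImage: 'macos-latest'
      JavaVersion: '8'
      HttpClientType: 'okhttp'
    Windows_Java8_Netty:
      OSVmImage: 'windows-latest'
      JavaVersion: '8'
      HttpClientType: 'netty'
    Linux_Java11_OkHttp:
      OSVmImage: 'ubuntu-latest'
      JavaVersion: '11'
      HttpClientType: 'okhttp'
    MacOS_Java11_Netty:
      OSVmImage: 'macos-latest'
      JavaVersion: '11'
      HttpClientType: 'netty'
    Windows_Java11_OkHttp:
      OSVmImage: 'windows-latest'
      JavaVersion: '11'
      HttpClientType: 'okhttp'
```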

@weshaggard Let me know if you have any suggestion.

@weshaggard
Member

Yes, that seems in line with what I was thinking. You could even add more matrix entries and only execute them in another pipeline under some condition. See Azure/azure-sdk-for-python#9031, which runs only a subset of the matrix entries for PRs but more for nightly test runs. You could use a similar strategy to run more matrix entries on some other weekly/monthly schedule if needed.
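
The conditional-subset idea can be sketched with a template parameter; the parameter and job names below are made up for illustration:

```yaml
# Hypothetical sketch: extra matrix entries run only when a scheduled
# (weekly/monthly) pipeline sets RunFullMatrix to true; PRs keep the
# small core matrix.
parameters:
- name: RunFullMatrix
  type: boolean
  default: false

jobs:
- job: CoreMatrix            # always runs
  steps:
  - script: mvn test
- ${{ if eq(parameters.RunFullMatrix, true) }}:
  - job: ExtendedMatrix      # extra clients / older service versions
    steps:
    - script: mvn test
```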

@sima-zhu sima-zhu changed the title Rolling strategy on live test Sparse live test matrix on client and service version Apr 16, 2020
@JonathanGiles
Member

@sima-zhu Does this mean that there will be no testing of earlier service versions and that we only ever test the latest? Will there be some pipeline specified that runs against all service versions?

@sima-zhu
Contributor Author

@JonathanGiles
Yes, we only run the latest version in the nightly live builds.
The non-critical path is not set up yet; it depends on each client library.
I have set up a non-critical pipeline for KeyVault that currently runs once a week:
https://dev.azure.com/azure-sdk/internal/_build?definitionId=1673&_a=summary

I need to work with @srnagar and @g2vinay on whether we want to enable the pipeline.

@weshaggard
Member

Fixed with #10312

@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023