Optimise CI - reduce flakyness, unify workflows #4516

0x009922 · 2024-04-26T03:02:37Z

I run into this time and time again: you open a PR, wait 10, 20, or 30 minutes for the checks, and then see the same workflows fail on the same tests. In the meantime I get distracted by something else and forget to restart the checks promptly. Sometimes it takes 5-6 restarts to get them to pass. As a result, getting all green lights can take days, which in my case delays development.

From my observation, the following workflows are flaky:

I2::Dev::Tests > with_coverage, integration, unstable

And these are particular flaky tests:

integration::extra_functional::offline_peers::genesis_block_is_committed_with_some_offline_peers
integration::extra_functional::unstable_network::soft_fork
And maybe some others; I haven't collected much data.

Are these tests worth it?

Another concern of mine is that I don't see the rationale behind having so many workflows:

I2::Dev::Static
1. smart contracts
2. workspace
I2::Tests::UI
1. test with all features
2. test with no default features
I2::Dev::Tests
1. consistency
2. with_coverage
3. integration
4. unstable
5. client-cli-tests

(there are some others too)

These workflows all run Cargo and compile more or less the same code. Yes, the feature presets vary, but Cargo handles that for us: it can reuse compilation artifacts at a fine granularity depending on the context (apart from cases with different rustc flags, I suppose).

So I think it is worth trying to combine all these workflows into a single one, built so that it reports as much useful information as possible in a single run. I wonder how much more or less efficient that would be.

Another useful consequence would be faster feedback on early errors. For example, suppose a change in a PR breaks compilation of Iroha entirely. Currently, all 8+ workflows run and fail on the same error. With a unified CI, there would be less repeated work.

Proposals

Prioritise zero tolerance for flaky tests on the development side.
If flaky tests cannot be easily fixed, possibly move them from PR checks to post-merge checks.
Create a single unified workflow and research its performance impact.
Explore ways to use a sane scripting language for CI instead of shell. That's for a separate issue, maybe.
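
To make the unified-workflow idea a bit more concrete, here is a minimal sketch in Python of a staged runner: the build runs once, later stages declare that they need it, and a stage is skipped rather than re-failed when a prerequisite did not pass. The `Stage` type, the `run_pipeline` function, and the stage list are all hypothetical, and the Cargo commands are illustrative rather than the project's actual CI steps.

```python
import subprocess
from dataclasses import dataclass, field


@dataclass
class Stage:
    name: str
    command: list
    needs: list = field(default_factory=list)  # names of stages that must pass first


def _run_command(command):
    """Default executor: run the command and report whether it exited with 0."""
    return subprocess.run(command).returncode == 0


def run_pipeline(stages, run=_run_command):
    """Run stages in order; skip any stage whose prerequisites did not pass.

    Returns a dict mapping stage name to "passed", "failed", or "skipped".
    """
    results = {}
    for stage in stages:
        if any(results.get(dep) != "passed" for dep in stage.needs):
            results[stage.name] = "skipped"
            continue
        results[stage.name] = "passed" if run(stage.command) else "failed"
    return results


# Hypothetical stage list (assumption: not the real Iroha workflows).
STAGES = [
    Stage("build", ["cargo", "build", "--workspace"]),
    Stage("clippy", ["cargo", "clippy", "--workspace"], needs=["build"]),
    Stage("test", ["cargo", "test", "--workspace"], needs=["build"]),
]
```

Calling `run_pipeline(STAGES)` would report one status per stage in a single run. A failure in `clippy` does not prevent `test` from running, so independent problems still surface together, while a broken build is reported only once.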


DCNick3 · 2024-04-26T07:28:00Z

And these are particular flaky tests:
integration::extra_functional::offline_peers::genesis_block_is_committed_with_some_offline_peers
integration::extra_functional::unstable_network::soft_fork

Agreed, I often see those particular tests fail. That said, a lot of client-cli-tests are also very flaky =(. I noticed test_burn_asset_for_account_in_same_domain and test_register_account_with_invalid_domain in particular, but there are probably more.

The nice thing about multiple workflows is that they run in parallel and fail independently (which is probably why the unstable workflow exists).

AlexStroke · 2024-05-02T16:04:42Z

Prioritise zero-tolerance to flaky tests from development side
If flaky tests couldn't be easily fixed, possibly move them away from PR checks to after-merge checks.

We need to think through and implement a definitive solution that ensures tests are not flaky by fundamentally resolving synchronization issues. For example, in the Python SDK, @SamHSmith and I are working on a 'submit and block' function in the client. This function will only return once a transaction has been fully processed and confirmed on all peers, regardless of the machine and resources the tests are running on. Maybe there should be a more centralized solution, so that it doesn't have to be implemented in every client and SDK.
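
As a rough illustration of the 'submit and block' idea, here is a minimal polling sketch. It is not the iroha-python implementation: `submit` and `is_committed` are hypothetical callables standing in for whatever the real client exposes, and polling with a timeout is an assumption (the actual work may rely on events instead).

```python
import time


def submit_and_block(submit, is_committed, peers, timeout=30.0, poll_interval=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    """Submit a transaction, then wait until every peer reports it committed.

    `submit()` returns a transaction hash and `is_committed(peer, tx_hash)`
    returns a bool; both are hypothetical placeholders, not a real SDK API.
    """
    tx_hash = submit()
    deadline = clock() + timeout
    pending = set(peers)
    while pending:
        # Keep only the peers that have not confirmed the transaction yet.
        pending = {peer for peer in pending if not is_committed(peer, tx_hash)}
        if not pending:
            break
        if clock() >= deadline:
            raise TimeoutError(f"{tx_hash} not confirmed by: {sorted(pending)}")
        sleep(poll_interval)
    return tx_hash
```

A test would then call `submit_and_block(...)` instead of submitting and sleeping for a fixed interval, so it waits exactly as long as the slowest peer needs and fails with a clear error if a peer never confirms.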

AlexStroke · 2024-05-23T08:08:38Z

Prioritise zero-tolerance to flaky tests from development side
If flaky tests couldn't be easily fixed, possibly move them away from PR checks to after-merge checks.

We need to think through and implement a definitive solution that ensures tests are not flaky by fundamentally resolving synchronization issues. For example, in the Python SDK, @SamHSmith and I are working on a 'submit and block' function in the client. This function will only return once a transaction has been fully processed and confirmed on all peers, regardless of the machine and resources the tests are running on. Maybe there should be a more centralized solution, so that it doesn't have to be implemented in every client and SDK.

https://github.com/hyperledger/iroha-python/pull/196/files

0x009922 · 2024-05-31T04:19:36Z

@nxsaken do you think your work on #4500 will address the flakiness issue?

nxsaken · 2024-05-31T05:24:56Z

@0x009922 I'm keeping it in mind, but I'm not sure yet because I haven't looked into why the tests are flaky.

0x009922 added the question, iroha2-dev, and CI labels Apr 26, 2024

0x009922 assigned BAStos525 Apr 26, 2024

0x009922 self-assigned this May 30, 2024

This was referenced May 31, 2024

Fix Flaky tests #2136

Closed

refactor: fix application of the core chain-wide parameters; chores #4697

Merged

0x009922 mentioned this issue Jun 11, 2024

refactor: supervise spawned tasks #4716

Merged

0x009922 removed their assignment Jul 4, 2024

nxsaken added zh-migration and removed zh-migration labels Sep 4, 2024

This was referenced Oct 8, 2024

refactor!: black-box integration tests #5124

Merged

ci: reorganise workflows [EXPERIMENTAL] #5125

Draft

0x009922 linked a pull request Oct 8, 2024 that will close this issue

refactor!: black-box integration tests #5124

Merged

8 tasks

0x009922 self-assigned this Oct 8, 2024

mversic closed this as completed in #5124 Oct 18, 2024
