[Flaky Test] Modifies block rate in VN test to address sealing lagging finalization #4975

jordanschalm · 2023-11-09T02:12:35Z

This PR modifies the block rate in the Verification Node test to address sealing lagging finalization. In the VN epoch integration test only, we enable require-seals-for-approvals, which requires VN participation in sealing. In practice, this results in sealing lagging finalization (see here for some tests) by up to 100 blocks.

This over-ran the default snapshot history limit, preventing the test from validating state
This also causes epoch transition failures, because DKG requires somewhat up-to-date sealing w.r.t. the DKG phase length, which is small in these tests.

To fix the problem, I increased the block rate delay for consensus nodes, from 100ms to 250 ms. This PR also updates several test utilities for observing what is going on while the test runs.

codecov-commenter · 2023-11-09T02:42:26Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (1990e2f) 55.75% compared to head (54bdfc2) 55.75%.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #4975   +/-   ##
=======================================
  Coverage   55.75%   55.75%           
=======================================
  Files         959      959           
  Lines       89270    89270           
=======================================
+ Hits        49773    49774    +1     
+ Misses      35752    35750    -2     
- Partials     3745     3746    +1

Flag	Coverage Δ
unittests	`55.75% <ø> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files	Coverage Δ
engine/access/rpc/backend/backend.go	`75.54% <ø> (ø)`

... and 8 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

gomisha

Thanks for the fix! Ran it locally multiple times and it seems to be very stable, taking about 175 seconds to run.

#4975

* unit tests, integration tests using buildjet-4vcpu runner * listTargetPackages() returns map of CI runners * test for custom runners * generateTestMatrix() stores CI runner * more testing with custom CI runners * CI test - 2 unit tests running buildjet runners * increase engine unit tests to 8 vCPUs * resource manager full load test (8 vCPUs) * network/test tests unquarantined, increased runners * increased to 16 vCPUs * increased engine tests to 16 vCPUs * insecure module uses buildjet 4 vcpu runner * bft framework integration test using buildjet * engine tests split up to 6 jobs with default runners * added back all integration tests * upgraded runners for some integration tests, engine/execution * dummy commit to kick off CI * split up TestScriptExecutionAndGetAccounts into 3 tests * updated runners for Epoch Cohort2, Access Integration Tests * AN integration tests split up into 3 cohorts (skip unit, other integration tests) * docker build using GitHub cache * remove saving Docker image locally * cache v3, saving local tar before caching, 8vcpu * add 11 docker images to docker-images.tar, 16 vcpu * docker save multiline fix; skip localnet-test * docker build increase to 32 vcpu * buildjet cache instead of GitHub cache * lint fix * activate more integration tests with default runners, using cached Docker * docker-build downsized to 16 vcpu * integration test without relic build * put back relic build for integration tests * put back all remaining integration tests using BuildJet runners * integration tests reverted to ubuntu-latest runners * re-activated unit-test, unit-test-modules jobs * added (failing) test for sub-sub packages * sub-sub packages support * sub-sub package test enhanced * updated other tests to support new sub-sub packages * split up engine/execution unit tests into 3 jobs * engine/execution/ingestion upgraded to 4 vcpu * engine/execution/ingestion:buildjet-8vcpu * network/test, network/p2p increased to 8 vcpu; added remaining network tests * BFT Framework, Epoch Cohort 2 - increased to 4 vcpu * upgraded runners for Access Cohort 1, Epoch Cohort 1, 2 * epoch cohort2 upgraded to 16 vcpu * BFT Framework upgrade to 8 vcpu, epoch cohort2 removed flaky test, downgrade to 4 vcpu * quarantined flaky test - TestEpochJoinAndLeaveVN * quarantined flaky test - TestSealingAndVerificationPassThrough * lint fix * network/test split up into network/test/cohort1, network/test/cohort2 * network/p2p split up into 4 subpackages network/p2p/connection network/p2p/scoring network/p2p/p2pnode network/p2p * extracted network/alsp into separate job * extracted module/dkg into separate job * engine/execution/ingestion dowgraded to stock runner * added engine parent package as separate job * added storage parent package as separate job * added state package as separate job * removed localnet-test job * engine/execution/ingestion upgraded to 2 vcpu * network/p2p/p2pnode upgraded to 2 vcpu * module, engine upgraded to 2 vcpu * network/p2p/p2pnode upgraded to 4 vcpu * network/test/cohort2 upgraded to 4 vcpu * engine upgraded to 4 vcpu * engine/execution/ingestion upgraded to 4 vcpu, network/test/cohort1 upgraded to 2 vcpu * network/test/cohort1 upgraded to 4 vcpu * engine/execution/ingestion upgraded to 8 vcpu * network/test/cohort1 upgraded to 8 vcpu * BFT (Protocol) upgraded to 4 vcpu * network/test/cohort1 upgraded to 16 vcpu * BFT (Protocol), Epoch Cohort1 upgraded to 8 vcpu * noop push to kick off CI * Module (integration) upgraded to 4 vcpu * TestUnicastRateLimit_Messages flaky test quarantined * noop push to kick off CI * buildjet/cache@v3 => actions/cache@v3 * test retries increased to 5, timeout increased to 35 mins * module job increased to 4 vcpu * unquarantined epoch flaky test after it was fixed #4975 * unlink flaky test monitor workflow from running when ci changes * clean up * lint fix * add TargetDuration to models, conversion * bump core-contracts version got this error: ``` go: module github.com/onflow/flow-core-contracts@9e8417b found (v0.0.0-20231120143830-9e8417b56122), but does not contain package github.com/onflow/flow-core-contracts/lib/go/templates ``` Maybe because templates did not change since previous version? * update state commitment constants in tests * tidy * update mocks * Update state/protocol/inmem/convert.go Co-authored-by: Alexander Hentschel <alex.hentschel@axiomzen.co> --------- Co-authored-by: Misha <15269764+gomisha@users.noreply.github.com> Co-authored-by: Alexander Hentschel <alex.hentschel@axiomzen.co>

…e`, `TargetDuration` (#5023) * unit tests, integration tests using buildjet-4vcpu runner * listTargetPackages() returns map of CI runners * test for custom runners * generateTestMatrix() stores CI runner * more testing with custom CI runners * CI test - 2 unit tests running buildjet runners * increase engine unit tests to 8 vCPUs * resource manager full load test (8 vCPUs) * network/test tests unquarantined, increased runners * increased to 16 vCPUs * increased engine tests to 16 vCPUs * insecure module uses buildjet 4 vcpu runner * bft framework integration test using buildjet * engine tests split up to 6 jobs with default runners * added back all integration tests * upgraded runners for some integration tests, engine/execution * dummy commit to kick off CI * split up TestScriptExecutionAndGetAccounts into 3 tests * updated runners for Epoch Cohort2, Access Integration Tests * AN integration tests split up into 3 cohorts (skip unit, other integration tests) * docker build using GitHub cache * remove saving Docker image locally * cache v3, saving local tar before caching, 8vcpu * add 11 docker images to docker-images.tar, 16 vcpu * docker save multiline fix; skip localnet-test * docker build increase to 32 vcpu * buildjet cache instead of GitHub cache * lint fix * activate more integration tests with default runners, using cached Docker * docker-build downsized to 16 vcpu * integration test without relic build * put back relic build for integration tests * put back all remaining integration tests using BuildJet runners * integration tests reverted to ubuntu-latest runners * re-activated unit-test, unit-test-modules jobs * added (failing) test for sub-sub packages * sub-sub packages support * sub-sub package test enhanced * updated other tests to support new sub-sub packages * split up engine/execution unit tests into 3 jobs * engine/execution/ingestion upgraded to 4 vcpu * engine/execution/ingestion:buildjet-8vcpu * network/test, network/p2p increased to 8 vcpu; added remaining network tests * BFT Framework, Epoch Cohort 2 - increased to 4 vcpu * upgraded runners for Access Cohort 1, Epoch Cohort 1, 2 * epoch cohort2 upgraded to 16 vcpu * BFT Framework upgrade to 8 vcpu, epoch cohort2 removed flaky test, downgrade to 4 vcpu * quarantined flaky test - TestEpochJoinAndLeaveVN * quarantined flaky test - TestSealingAndVerificationPassThrough * lint fix * network/test split up into network/test/cohort1, network/test/cohort2 * network/p2p split up into 4 subpackages network/p2p/connection network/p2p/scoring network/p2p/p2pnode network/p2p * extracted network/alsp into separate job * extracted module/dkg into separate job * engine/execution/ingestion dowgraded to stock runner * added engine parent package as separate job * added storage parent package as separate job * added state package as separate job * removed localnet-test job * engine/execution/ingestion upgraded to 2 vcpu * network/p2p/p2pnode upgraded to 2 vcpu * module, engine upgraded to 2 vcpu * network/p2p/p2pnode upgraded to 4 vcpu * network/test/cohort2 upgraded to 4 vcpu * engine upgraded to 4 vcpu * engine/execution/ingestion upgraded to 4 vcpu, network/test/cohort1 upgraded to 2 vcpu * network/test/cohort1 upgraded to 4 vcpu * engine/execution/ingestion upgraded to 8 vcpu * network/test/cohort1 upgraded to 8 vcpu * BFT (Protocol) upgraded to 4 vcpu * network/test/cohort1 upgraded to 16 vcpu * BFT (Protocol), Epoch Cohort1 upgraded to 8 vcpu * noop push to kick off CI * Module (integration) upgraded to 4 vcpu * TestUnicastRateLimit_Messages flaky test quarantined * noop push to kick off CI * buildjet/cache@v3 => actions/cache@v3 * test retries increased to 5, timeout increased to 35 mins * module job increased to 4 vcpu * unquarantined epoch flaky test after it was fixed #4975 * unlink flaky test monitor workflow from running when ci changes * clean up * lint fix * add TargetEndTime to epoch models * update service event parsing * fix some typos and godocs * update mocks * lint * lint * pin new version of core-contracts - adds TargetEndTime field - pinned to branch - needs to be updated when release is created * update expected state commitment * update another state commitment constant * update another state commit constant * use TargetEndTime in cruisectl * represent target end time as uint64 * lint / mocks * use unix time in block time controller * add todod for epoch fallback flag * wip * wip - begin adding duration * add TargetDuration to models, conversion * bump core-contracts version got this error: ``` go: module github.com/onflow/flow-core-contracts@9e8417b found (v0.0.0-20231120143830-9e8417b56122), but does not contain package github.com/onflow/flow-core-contracts/lib/go/templates ``` Maybe because templates did not change since previous version? * update state commitment constants in tests * tidy * update mocks * track target duration in block time ctl * remove transition_time * update tests * fix last 2 tests - add helper functions for concisely converting between various time/duration types. - use ns-level precision in view measurement, primarily so that we can verify this implementation is consistent with the Python simulation. For real-world, second-level precision is likely fine. * Update consensus/hotstuff/cruisectl/block_time_controller.go Co-authored-by: Alexander Hentschel <alex.hentschel@axiomzen.co> * Update consensus/hotstuff/cruisectl/block_time_controller_test.go * Update consensus/hotstuff/cruisectl/block_time_controller_test.go * lint * Apply suggestions from code review Co-authored-by: Alexander Hentschel <alex.hentschel@axiomzen.co> * add docs * add docs for config setter/getters * gofmt --------- Co-authored-by: Misha <15269764+gomisha@users.noreply.github.com> Co-authored-by: Alexander Hentschel <alex.hentschel@axiomzen.co>

Backporting VN epoch test flakiness fix from master

jordanschalm added 3 commits November 8, 2023 17:08

test2.3

ac62aa6

test 2.3

b22d993

test 2.4 - slow block rate

54bdfc2

jordanschalm requested a review from peterargue as a code owner November 9, 2023 02:12

gomisha approved these changes Nov 9, 2023

View reviewed changes

gomisha mentioned this pull request Nov 9, 2023

[CI][Testing] Add BuildJet CI runners for unit tests, integration tests; cache Docker images #4906

Merged

jordanschalm requested a review from kc1116 November 9, 2023 16:15

kc1116 approved these changes Nov 9, 2023

View reviewed changes

jordanschalm added this pull request to the merge queue Nov 9, 2023

Merged via the queue into master with commit 9f00fa6 Nov 9, 2023
36 checks passed

jordanschalm deleted the jordan/vn-test-block-rate-fix branch November 9, 2023 17:49

gomisha added a commit that referenced this pull request Nov 9, 2023

unquarantined epoch flaky test after it was fixed

b60901b

#4975

jordanschalm mentioned this pull request Nov 29, 2023

[CI][Testing] Add BuildJet CI runners for unit tests, integration tests; cache Docker images - backport #5015

Merged

jordanschalm added a commit that referenced this pull request Nov 29, 2023

Cherry-pick changes from #4975

9c4f2c3

Backporting VN epoch test flakiness fix from master

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Flaky Test] Modifies block rate in VN test to address sealing lagging finalization #4975

[Flaky Test] Modifies block rate in VN test to address sealing lagging finalization #4975

jordanschalm commented Nov 9, 2023

codecov-commenter commented Nov 9, 2023 •

edited

Loading

gomisha left a comment

[Flaky Test] Modifies block rate in VN test to address sealing lagging finalization #4975

[Flaky Test] Modifies block rate in VN test to address sealing lagging finalization #4975

Conversation

jordanschalm commented Nov 9, 2023

codecov-commenter commented Nov 9, 2023 • edited Loading

Codecov Report

gomisha left a comment

Choose a reason for hiding this comment

codecov-commenter commented Nov 9, 2023 •

edited

Loading