-
Notifications
You must be signed in to change notification settings - Fork 179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Flaky Test] Modifies block rate in VN test to address sealing lagging finalization #4975
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #4975 +/- ##
=======================================
Coverage 55.75% 55.75%
=======================================
Files 959 959
Lines 89270 89270
=======================================
+ Hits 49773 49774 +1
+ Misses 35752 35750 -2
- Partials 3745 3746 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix! Ran it locally multiple times and it seems to be very stable, taking about 175 seconds to run.
* unit tests, integration tests using buildjet-4vcpu runner * listTargetPackages() returns map of CI runners * test for custom runners * generateTestMatrix() stores CI runner * more testing with custom CI runners * CI test - 2 unit tests running buildjet runners * increase engine unit tests to 8 vCPUs * resource manager full load test (8 vCPUs) * network/test tests unquarantined, increased runners * increased to 16 vCPUs * increased engine tests to 16 vCPUs * insecure module uses buildjet 4 vcpu runner * bft framework integration test using buildjet * engine tests split up to 6 jobs with default runners * added back all integration tests * upgraded runners for some integration tests, engine/execution * dummy commit to kick off CI * split up TestScriptExecutionAndGetAccounts into 3 tests * updated runners for Epoch Cohort2, Access Integration Tests * AN integration tests split up into 3 cohorts (skip unit, other integration tests) * docker build using GitHub cache * remove saving Docker image locally * cache v3, saving local tar before caching, 8vcpu * add 11 docker images to docker-images.tar, 16 vcpu * docker save multiline fix; skip localnet-test * docker build increase to 32 vcpu * buildjet cache instead of GitHub cache * lint fix * activate more integration tests with default runners, using cached Docker * docker-build downsized to 16 vcpu * integration test without relic build * put back relic build for integration tests * put back all remaining integration tests using BuildJet runners * integration tests reverted to ubuntu-latest runners * re-activated unit-test, unit-test-modules jobs * added (failing) test for sub-sub packages * sub-sub packages support * sub-sub package test enhanced * updated other tests to support new sub-sub packages * split up engine/execution unit tests into 3 jobs * engine/execution/ingestion upgraded to 4 vcpu * engine/execution/ingestion:buildjet-8vcpu * network/test, network/p2p increased to 8 vcpu; added remaining network tests * BFT Framework, Epoch Cohort 2 - increased to 4 vcpu * upgraded runners for Access Cohort 1, Epoch Cohort 1, 2 * epoch cohort2 upgraded to 16 vcpu * BFT Framework upgrade to 8 vcpu, epoch cohort2 removed flaky test, downgrade to 4 vcpu * quarantined flaky test - TestEpochJoinAndLeaveVN * quarantined flaky test - TestSealingAndVerificationPassThrough * lint fix * network/test split up into network/test/cohort1, network/test/cohort2 * network/p2p split up into 4 subpackages network/p2p/connection network/p2p/scoring network/p2p/p2pnode network/p2p * extracted network/alsp into separate job * extracted module/dkg into separate job * engine/execution/ingestion dowgraded to stock runner * added engine parent package as separate job * added storage parent package as separate job * added state package as separate job * removed localnet-test job * engine/execution/ingestion upgraded to 2 vcpu * network/p2p/p2pnode upgraded to 2 vcpu * module, engine upgraded to 2 vcpu * network/p2p/p2pnode upgraded to 4 vcpu * network/test/cohort2 upgraded to 4 vcpu * engine upgraded to 4 vcpu * engine/execution/ingestion upgraded to 4 vcpu, network/test/cohort1 upgraded to 2 vcpu * network/test/cohort1 upgraded to 4 vcpu * engine/execution/ingestion upgraded to 8 vcpu * network/test/cohort1 upgraded to 8 vcpu * BFT (Protocol) upgraded to 4 vcpu * network/test/cohort1 upgraded to 16 vcpu * BFT (Protocol), Epoch Cohort1 upgraded to 8 vcpu * noop push to kick off CI * Module (integration) upgraded to 4 vcpu * TestUnicastRateLimit_Messages flaky test quarantined * noop push to kick off CI * buildjet/cache@v3 => actions/cache@v3 * test retries increased to 5, timeout increased to 35 mins * module job increased to 4 vcpu * unquarantined epoch flaky test after it was fixed #4975 * unlink flaky test monitor workflow from running when ci changes * clean up * lint fix * add TargetDuration to models, conversion * bump core-contracts version got this error: ``` go: module github.com/onflow/flow-core-contracts@9e8417b found (v0.0.0-20231120143830-9e8417b56122), but does not contain package github.com/onflow/flow-core-contracts/lib/go/templates ``` Maybe because templates did not change since previous version? * update state commitment constants in tests * tidy * update mocks * Update state/protocol/inmem/convert.go Co-authored-by: Alexander Hentschel <alex.hentschel@axiomzen.co> --------- Co-authored-by: Misha <15269764+gomisha@users.noreply.github.com> Co-authored-by: Alexander Hentschel <alex.hentschel@axiomzen.co>
…e`, `TargetDuration` (#5023) * unit tests, integration tests using buildjet-4vcpu runner * listTargetPackages() returns map of CI runners * test for custom runners * generateTestMatrix() stores CI runner * more testing with custom CI runners * CI test - 2 unit tests running buildjet runners * increase engine unit tests to 8 vCPUs * resource manager full load test (8 vCPUs) * network/test tests unquarantined, increased runners * increased to 16 vCPUs * increased engine tests to 16 vCPUs * insecure module uses buildjet 4 vcpu runner * bft framework integration test using buildjet * engine tests split up to 6 jobs with default runners * added back all integration tests * upgraded runners for some integration tests, engine/execution * dummy commit to kick off CI * split up TestScriptExecutionAndGetAccounts into 3 tests * updated runners for Epoch Cohort2, Access Integration Tests * AN integration tests split up into 3 cohorts (skip unit, other integration tests) * docker build using GitHub cache * remove saving Docker image locally * cache v3, saving local tar before caching, 8vcpu * add 11 docker images to docker-images.tar, 16 vcpu * docker save multiline fix; skip localnet-test * docker build increase to 32 vcpu * buildjet cache instead of GitHub cache * lint fix * activate more integration tests with default runners, using cached Docker * docker-build downsized to 16 vcpu * integration test without relic build * put back relic build for integration tests * put back all remaining integration tests using BuildJet runners * integration tests reverted to ubuntu-latest runners * re-activated unit-test, unit-test-modules jobs * added (failing) test for sub-sub packages * sub-sub packages support * sub-sub package test enhanced * updated other tests to support new sub-sub packages * split up engine/execution unit tests into 3 jobs * engine/execution/ingestion upgraded to 4 vcpu * engine/execution/ingestion:buildjet-8vcpu * network/test, network/p2p increased to 8 vcpu; added remaining network tests * BFT Framework, Epoch Cohort 2 - increased to 4 vcpu * upgraded runners for Access Cohort 1, Epoch Cohort 1, 2 * epoch cohort2 upgraded to 16 vcpu * BFT Framework upgrade to 8 vcpu, epoch cohort2 removed flaky test, downgrade to 4 vcpu * quarantined flaky test - TestEpochJoinAndLeaveVN * quarantined flaky test - TestSealingAndVerificationPassThrough * lint fix * network/test split up into network/test/cohort1, network/test/cohort2 * network/p2p split up into 4 subpackages network/p2p/connection network/p2p/scoring network/p2p/p2pnode network/p2p * extracted network/alsp into separate job * extracted module/dkg into separate job * engine/execution/ingestion dowgraded to stock runner * added engine parent package as separate job * added storage parent package as separate job * added state package as separate job * removed localnet-test job * engine/execution/ingestion upgraded to 2 vcpu * network/p2p/p2pnode upgraded to 2 vcpu * module, engine upgraded to 2 vcpu * network/p2p/p2pnode upgraded to 4 vcpu * network/test/cohort2 upgraded to 4 vcpu * engine upgraded to 4 vcpu * engine/execution/ingestion upgraded to 4 vcpu, network/test/cohort1 upgraded to 2 vcpu * network/test/cohort1 upgraded to 4 vcpu * engine/execution/ingestion upgraded to 8 vcpu * network/test/cohort1 upgraded to 8 vcpu * BFT (Protocol) upgraded to 4 vcpu * network/test/cohort1 upgraded to 16 vcpu * BFT (Protocol), Epoch Cohort1 upgraded to 8 vcpu * noop push to kick off CI * Module (integration) upgraded to 4 vcpu * TestUnicastRateLimit_Messages flaky test quarantined * noop push to kick off CI * buildjet/cache@v3 => actions/cache@v3 * test retries increased to 5, timeout increased to 35 mins * module job increased to 4 vcpu * unquarantined epoch flaky test after it was fixed #4975 * unlink flaky test monitor workflow from running when ci changes * clean up * lint fix * add TargetEndTime to epoch models * update service event parsing * fix some typos and godocs * update mocks * lint * lint * pin new version of core-contracts - adds TargetEndTime field - pinned to branch - needs to be updated when release is created * update expected state commitment * update another state commitment constant * update another state commit constant * use TargetEndTime in cruisectl * represent target end time as uint64 * lint / mocks * use unix time in block time controller * add todod for epoch fallback flag * wip * wip - begin adding duration * add TargetDuration to models, conversion * bump core-contracts version got this error: ``` go: module github.com/onflow/flow-core-contracts@9e8417b found (v0.0.0-20231120143830-9e8417b56122), but does not contain package github.com/onflow/flow-core-contracts/lib/go/templates ``` Maybe because templates did not change since previous version? * update state commitment constants in tests * tidy * update mocks * track target duration in block time ctl * remove transition_time * update tests * fix last 2 tests - add helper functions for concisely converting between various time/duration types. - use ns-level precision in view measurement, primarily so that we can verify this implementation is consistent with the Python simulation. For real-world, second-level precision is likely fine. * Update consensus/hotstuff/cruisectl/block_time_controller.go Co-authored-by: Alexander Hentschel <alex.hentschel@axiomzen.co> * Update consensus/hotstuff/cruisectl/block_time_controller_test.go * Update consensus/hotstuff/cruisectl/block_time_controller_test.go * lint * Apply suggestions from code review Co-authored-by: Alexander Hentschel <alex.hentschel@axiomzen.co> * add docs * add docs for config setter/getters * gofmt --------- Co-authored-by: Misha <15269764+gomisha@users.noreply.github.com> Co-authored-by: Alexander Hentschel <alex.hentschel@axiomzen.co>
Backporting VN epoch test flakiness fix from master
This PR modifies the block rate in the Verification Node test to address sealing lagging finalization. In the VN epoch integration test only, we enable
require-seals-for-approvals
, which requires VN participation in sealing. In practice, this results in sealing lagging finalization (see here for some tests) by up to 100 blocks.To fix the problem, I increased the block rate delay for consensus nodes, from 100ms to 250 ms. This PR also updates several test utilities for observing what is going on while the test runs.