Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Flaky Test] Modifies block rate in VN test to address sealing lagging finalization #4975

Merged
merged 3 commits into from
Nov 9, 2023

Conversation

jordanschalm
Copy link
Member

This PR modifies the block rate in the Verification Node test to address sealing lagging finalization. In the VN epoch integration test only, we enable require-seals-for-approvals, which requires VN participation in sealing. In practice, this results in sealing lagging finalization (see here for some tests) by up to 100 blocks.

  • This over-ran the default snapshot history limit, preventing the test from validating state
  • This also causes epoch transition failures, because DKG requires somewhat up-to-date sealing w.r.t. the DKG phase length, which is small in these tests.

To fix the problem, I increased the block rate delay for consensus nodes, from 100ms to 250 ms. This PR also updates several test utilities for observing what is going on while the test runs.

@codecov-commenter
Copy link

codecov-commenter commented Nov 9, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (1990e2f) 55.75% compared to head (54bdfc2) 55.75%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #4975   +/-   ##
=======================================
  Coverage   55.75%   55.75%           
=======================================
  Files         959      959           
  Lines       89270    89270           
=======================================
+ Hits        49773    49774    +1     
+ Misses      35752    35750    -2     
- Partials     3745     3746    +1     
Flag Coverage Δ
unittests 55.75% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
engine/access/rpc/backend/backend.go 75.54% <ø> (ø)

... and 8 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@gomisha gomisha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! Ran it locally multiple times and it seems to be very stable, taking about 175 seconds to run.

@jordanschalm jordanschalm added this pull request to the merge queue Nov 9, 2023
Merged via the queue into master with commit 9f00fa6 Nov 9, 2023
36 checks passed
@jordanschalm jordanschalm deleted the jordan/vn-test-block-rate-fix branch November 9, 2023 17:49
gomisha added a commit that referenced this pull request Nov 9, 2023
jordanschalm added a commit that referenced this pull request Nov 23, 2023
* unit tests, integration tests using buildjet-4vcpu runner

* listTargetPackages() returns map of CI runners

* test for custom runners

* generateTestMatrix() stores CI runner

* more testing with custom CI runners

* CI test - 2 unit tests running buildjet runners

* increase engine unit tests to 8 vCPUs

* resource manager full load test (8 vCPUs)

* network/test tests unquarantined, increased runners

* increased to 16 vCPUs

* increased engine tests to 16 vCPUs

* insecure module uses buildjet 4 vcpu runner

* bft framework integration test using buildjet

* engine tests split up to 6 jobs with default runners

* added back all integration tests

* upgraded runners for some integration tests, engine/execution

* dummy commit to kick off CI

* split up TestScriptExecutionAndGetAccounts into 3 tests

* updated runners for Epoch Cohort2, Access Integration Tests

* AN integration tests split up into 3 cohorts (skip unit, other integration tests)

* docker build using GitHub cache

* remove saving Docker image locally

* cache v3, saving local tar before caching, 8vcpu

* add 11 docker images to docker-images.tar, 16 vcpu

* docker save multiline fix; skip localnet-test

* docker build increase to 32 vcpu

* buildjet cache instead of GitHub cache

* lint fix

* activate more integration tests with default runners, using cached Docker

* docker-build downsized to 16 vcpu

* integration test without relic build

* put back relic build for integration tests

* put back all remaining integration tests using BuildJet runners

* integration tests reverted to ubuntu-latest runners

* re-activated unit-test, unit-test-modules jobs

* added (failing) test for sub-sub packages

* sub-sub packages support

* sub-sub package test enhanced

* updated other tests to support new sub-sub packages

* split up engine/execution unit tests into 3 jobs

* engine/execution/ingestion upgraded to 4 vcpu

* engine/execution/ingestion:buildjet-8vcpu

* network/test, network/p2p increased to 8 vcpu; added remaining network tests

* BFT Framework, Epoch Cohort 2 - increased to 4 vcpu

* upgraded runners for Access Cohort 1, Epoch Cohort 1, 2

* epoch cohort2 upgraded to 16 vcpu

* BFT Framework upgrade to 8 vcpu, epoch cohort2 removed flaky test, downgrade to 4 vcpu

* quarantined flaky test - TestEpochJoinAndLeaveVN

* quarantined flaky test - TestSealingAndVerificationPassThrough

* lint fix

* network/test split up into network/test/cohort1, network/test/cohort2

* network/p2p split up into 4 subpackages

network/p2p/connection network/p2p/scoring network/p2p/p2pnode network/p2p

* extracted network/alsp into separate job

* extracted module/dkg into separate job

* engine/execution/ingestion dowgraded to stock runner

* added engine parent package as separate job

* added storage parent package as separate job

* added state package as separate job

* removed localnet-test job

* engine/execution/ingestion upgraded to 2 vcpu

* network/p2p/p2pnode upgraded to 2 vcpu

* module, engine upgraded to 2 vcpu

* network/p2p/p2pnode upgraded to 4 vcpu

* network/test/cohort2 upgraded to 4 vcpu

* engine upgraded to 4 vcpu

* engine/execution/ingestion upgraded to 4 vcpu, network/test/cohort1 upgraded to 2 vcpu

* network/test/cohort1 upgraded to 4 vcpu

* engine/execution/ingestion upgraded to 8 vcpu

* network/test/cohort1 upgraded to 8 vcpu

* BFT (Protocol) upgraded to 4 vcpu

* network/test/cohort1 upgraded to 16 vcpu

* BFT (Protocol), Epoch Cohort1 upgraded to 8 vcpu

* noop push to kick off CI

* Module (integration) upgraded to 4 vcpu

* TestUnicastRateLimit_Messages flaky test quarantined

* noop push to kick off CI

* buildjet/cache@v3 => actions/cache@v3

* test retries increased to 5, timeout increased to 35 mins

* module job increased to 4 vcpu

* unquarantined epoch flaky test after it was fixed

#4975

* unlink flaky test monitor workflow from running when ci changes

* clean up

* lint fix

* add TargetDuration to models, conversion

* bump core-contracts version

got this error:
```
go: module github.com/onflow/flow-core-contracts@9e8417b found (v0.0.0-20231120143830-9e8417b56122), but does not contain package github.com/onflow/flow-core-contracts/lib/go/templates
```

Maybe because templates did not change since previous version?

* update state commitment constants in tests

* tidy

* update mocks

* Update state/protocol/inmem/convert.go

Co-authored-by: Alexander Hentschel <alex.hentschel@axiomzen.co>

---------

Co-authored-by: Misha <15269764+gomisha@users.noreply.github.com>
Co-authored-by: Alexander Hentschel <alex.hentschel@axiomzen.co>
jordanschalm added a commit that referenced this pull request Nov 24, 2023
…e`, `TargetDuration` (#5023)

* unit tests, integration tests using buildjet-4vcpu runner

* listTargetPackages() returns map of CI runners

* test for custom runners

* generateTestMatrix() stores CI runner

* more testing with custom CI runners

* CI test - 2 unit tests running buildjet runners

* increase engine unit tests to 8 vCPUs

* resource manager full load test (8 vCPUs)

* network/test tests unquarantined, increased runners

* increased to 16 vCPUs

* increased engine tests to 16 vCPUs

* insecure module uses buildjet 4 vcpu runner

* bft framework integration test using buildjet

* engine tests split up to 6 jobs with default runners

* added back all integration tests

* upgraded runners for some integration tests, engine/execution

* dummy commit to kick off CI

* split up TestScriptExecutionAndGetAccounts into 3 tests

* updated runners for Epoch Cohort2, Access Integration Tests

* AN integration tests split up into 3 cohorts (skip unit, other integration tests)

* docker build using GitHub cache

* remove saving Docker image locally

* cache v3, saving local tar before caching, 8vcpu

* add 11 docker images to docker-images.tar, 16 vcpu

* docker save multiline fix; skip localnet-test

* docker build increase to 32 vcpu

* buildjet cache instead of GitHub cache

* lint fix

* activate more integration tests with default runners, using cached Docker

* docker-build downsized to 16 vcpu

* integration test without relic build

* put back relic build for integration tests

* put back all remaining integration tests using BuildJet runners

* integration tests reverted to ubuntu-latest runners

* re-activated unit-test, unit-test-modules jobs

* added (failing) test for sub-sub packages

* sub-sub packages support

* sub-sub package test enhanced

* updated other tests to support new sub-sub packages

* split up engine/execution unit tests into 3 jobs

* engine/execution/ingestion upgraded to 4 vcpu

* engine/execution/ingestion:buildjet-8vcpu

* network/test, network/p2p increased to 8 vcpu; added remaining network tests

* BFT Framework, Epoch Cohort 2 - increased to 4 vcpu

* upgraded runners for Access Cohort 1, Epoch Cohort 1, 2

* epoch cohort2 upgraded to 16 vcpu

* BFT Framework upgrade to 8 vcpu, epoch cohort2 removed flaky test, downgrade to 4 vcpu

* quarantined flaky test - TestEpochJoinAndLeaveVN

* quarantined flaky test - TestSealingAndVerificationPassThrough

* lint fix

* network/test split up into network/test/cohort1, network/test/cohort2

* network/p2p split up into 4 subpackages

network/p2p/connection network/p2p/scoring network/p2p/p2pnode network/p2p

* extracted network/alsp into separate job

* extracted module/dkg into separate job

* engine/execution/ingestion dowgraded to stock runner

* added engine parent package as separate job

* added storage parent package as separate job

* added state package as separate job

* removed localnet-test job

* engine/execution/ingestion upgraded to 2 vcpu

* network/p2p/p2pnode upgraded to 2 vcpu

* module, engine upgraded to 2 vcpu

* network/p2p/p2pnode upgraded to 4 vcpu

* network/test/cohort2 upgraded to 4 vcpu

* engine upgraded to 4 vcpu

* engine/execution/ingestion upgraded to 4 vcpu, network/test/cohort1 upgraded to 2 vcpu

* network/test/cohort1 upgraded to 4 vcpu

* engine/execution/ingestion upgraded to 8 vcpu

* network/test/cohort1 upgraded to 8 vcpu

* BFT (Protocol) upgraded to 4 vcpu

* network/test/cohort1 upgraded to 16 vcpu

* BFT (Protocol), Epoch Cohort1 upgraded to 8 vcpu

* noop push to kick off CI

* Module (integration) upgraded to 4 vcpu

* TestUnicastRateLimit_Messages flaky test quarantined

* noop push to kick off CI

* buildjet/cache@v3 => actions/cache@v3

* test retries increased to 5, timeout increased to 35 mins

* module job increased to 4 vcpu

* unquarantined epoch flaky test after it was fixed

#4975

* unlink flaky test monitor workflow from running when ci changes

* clean up

* lint fix

* add TargetEndTime to epoch models

* update service event parsing

* fix some typos and godocs

* update mocks

* lint

* lint

* pin new version of core-contracts

- adds TargetEndTime field
- pinned to branch - needs to be updated when release is created

* update expected state commitment

* update another state commitment constant

* update another state commit constant

* use TargetEndTime in cruisectl

* represent target end time as uint64

* lint / mocks

* use unix time in block time controller

* add todod for epoch fallback flag

* wip

* wip - begin adding duration

* add TargetDuration to models, conversion

* bump core-contracts version

got this error:
```
go: module github.com/onflow/flow-core-contracts@9e8417b found (v0.0.0-20231120143830-9e8417b56122), but does not contain package github.com/onflow/flow-core-contracts/lib/go/templates
```

Maybe because templates did not change since previous version?

* update state commitment constants in tests

* tidy

* update mocks

* track target duration in block time ctl

* remove transition_time

* update tests

* fix last 2 tests

- add helper functions for concisely converting between various
  time/duration types.
- use ns-level precision in view measurement, primarily so that we can
  verify this implementation is consistent with the Python simulation.
  For real-world, second-level precision is likely fine.

* Update consensus/hotstuff/cruisectl/block_time_controller.go

Co-authored-by: Alexander Hentschel <alex.hentschel@axiomzen.co>

* Update consensus/hotstuff/cruisectl/block_time_controller_test.go

* Update consensus/hotstuff/cruisectl/block_time_controller_test.go

* lint

* Apply suggestions from code review

Co-authored-by: Alexander Hentschel <alex.hentschel@axiomzen.co>

* add docs

* add docs for config setter/getters

* gofmt

---------

Co-authored-by: Misha <15269764+gomisha@users.noreply.github.com>
Co-authored-by: Alexander Hentschel <alex.hentschel@axiomzen.co>
jordanschalm added a commit that referenced this pull request Nov 29, 2023
Backporting VN epoch test flakiness fix from master
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants