/SlowEst/SlowTests #1884

Merged
merged 2 commits into from
Aug 21, 2024
@tbro (Contributor) commented Aug 21, 2024

Fixes an error of trying to be clever at the expense of clarity. With this change, the Slow Tests workflow is now called Slow Tests.

Doesn't change any functionality, so if Pipelines pass it should be good.

@tbro tbro merged commit 232529d into main Aug 21, 2024
15 checks passed
@tbro tbro deleted the tb/drop-silly-name branch August 21, 2024 13:18
lukaszrzasik added a commit that referenced this pull request Sep 9, 2024
jbearer added a commit that referenced this pull request Sep 24, 2024
* Test framework for restartability

Set up some Rust automation for tests that spin up a sequencer
network and restart various combinations of nodes, checking that we
recover liveness. Instantiate the framework with several combinations
of nodes as outlined in https://www.notion.so/espressosys/Persistence-catchup-and-restartability-cf4ddb79df2e41a993e60e3beaa28992.

As expected, the tests where we restart >f nodes do not pass yet,
and are ignored. The others pass locally.

There are many things left to test here, including:
* Testing with a valid libp2p setup
* Testing with _only_ libp2p and no CDN
* Checking integrity of the DA/query service during and after restart
But this is a pretty good starting point.

I considered doing this with something more dynamic like Bash or
Python scripting, leaning on our existing docker-compose or process-compose
infrastructure to spin up a network. I avoided this for a few reasons:
* process-compose is annoying to script and in particular has limited
  capabilities for shutting down and starting up processes
* both docker-compose and process-compose make it hard to dynamically
  choose the network topology
* once the basic test infrastructure is out of the way, Rust is far
  easier to work with for writing new checks and assertions. For
  example, checking for progress is way easier when we can plug
  directly into the HotShot event stream, vs subscribing to some
  stream via HTTP and parsing responses with jq
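
To illustrate the last point, here is a minimal sketch of a progress check written directly against an event stream. It is not the HotShot API: the `Event` enum, the `wait_for_decides` helper, and the use of a standard-library channel in place of the real (async) event stream are all assumptions made for illustration.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical, simplified stand-in for a consensus event.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    Decide { height: u64 },
    Other,
}

/// Wait until `count` decide events have been seen, or fail after `timeout`.
/// Returns the height of the last decide observed.
pub fn wait_for_decides(
    events: &Receiver<Event>,
    count: usize,
    timeout: Duration,
) -> Result<u64, String> {
    let deadline = Instant::now() + timeout;
    let mut seen = 0;
    let mut last_height = 0;
    while seen < count {
        // Time left before the overall deadline; zero once it has passed.
        let remaining = deadline.saturating_duration_since(Instant::now());
        match events.recv_timeout(remaining) {
            Ok(Event::Decide { height }) => {
                seen += 1;
                last_height = height;
            }
            // Events that do not indicate progress are skipped.
            Ok(Event::Other) => {}
            Err(RecvTimeoutError::Timeout) => {
                return Err(format!("timed out after {seen} of {count} decides"));
            }
            Err(RecvTimeoutError::Disconnected) => {
                return Err("event stream closed".to_string());
            }
        }
    }
    Ok(last_height)
}
```

A restart test could call a helper like this once before stopping nodes and once after bringing them back, and treat a timeout on the second call as a loss of liveness.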

* Add hotshot-query-service/testing feature, remove `testing` from default features

* Configure libp2p in restart tests

* Return an error instead of panicking if node initialization fails

This is needed for the restart tests, where initialization can sometimes
fail after a restart due to the libp2p port not being deallocated by the
OS quickly enough. This necessitates a retry loop, which means all error
cases need to return an error rather than panicking.
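
A retry loop of this kind might look roughly like the sketch below. The `InitError` type, the `init_with_retry` helper, and its parameters are hypothetical names, not the ones used in the sequencer; the point is only that a fallible initializer returning `Result` can be retried, while one that panics cannot.

```rust
use std::fmt;
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical error type standing in for whatever node initialization returns.
#[derive(Debug)]
pub struct InitError(pub String);

impl fmt::Display for InitError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "node initialization failed: {}", self.0)
    }
}

impl std::error::Error for InitError {}

/// Call `init` up to `max_attempts` times, sleeping `delay` between failures.
/// Returns the first success, or the last error if every attempt fails.
pub fn init_with_retry<T, F>(
    mut init: F,
    max_attempts: usize,
    delay: Duration,
) -> Result<T, InitError>
where
    F: FnMut() -> Result<T, InitError>,
{
    let mut last_err = InitError("no attempts were made".to_string());
    for attempt in 1..=max_attempts {
        match init() {
            Ok(node) => return Ok(node),
            Err(err) => {
                eprintln!("attempt {attempt}/{max_attempts} failed: {err}");
                last_err = err;
                // Give the OS time to release the port before trying again.
                if attempt < max_attempts {
                    sleep(delay);
                }
            }
        }
    }
    Err(last_err)
}
```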

* Adjust timeouts and thresholds so tests consistently pass

* Deterministically avoid port collisions

* Improve debug logging for event handling tasks

* Update API database in sync with consensus storage

Previously, the database used by the query API was populated by an event
handling task completely separate from consensus storage.
This could lead to a situation where consensus storage has already
been updated with a newly decided leaf, but API storage has not, and
then the node restarts. Consensus then thinks it is on a later leaf,
but the query API has never seen this leaf and never will, and thus cannot
make it available: a DA failure.

With this change, the query database is now populated from the consensus
storage, so that consensus storage is authoritative, and the query database
is guaranteed to always eventually reflect the status of consensus storage.
The movement of data from consensus storage to query storage is tied in with
consensus garbage collection, so that we do not delete any data until we are
sure it has been recorded in the DA database, if appropriate.
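
The ordering described here (copy to the query database first, delete from consensus storage only afterwards) can be sketched as follows. This is an in-memory illustration under stated assumptions, not the actual persistence code: `ConsensusStore`, `QueryDb`, the `fail_writes` switch, and this `collect_garbage` signature are all hypothetical.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory stand-in for consensus storage.
#[derive(Default)]
pub struct ConsensusStore {
    /// Decided leaves keyed by view number; the leaf is an opaque string here.
    pub leaves: BTreeMap<u64, String>,
}

/// Hypothetical in-memory stand-in for the query (DA) database.
#[derive(Default)]
pub struct QueryDb {
    pub leaves: BTreeMap<u64, String>,
    /// When true, writes fail, simulating a database outage.
    pub fail_writes: bool,
}

impl QueryDb {
    pub fn insert(&mut self, view: u64, leaf: &str) -> Result<(), String> {
        if self.fail_writes {
            return Err(format!("could not write view {view}"));
        }
        self.leaves.insert(view, leaf.to_string());
        Ok(())
    }
}

/// Copy every leaf up to and including `decided_view` into the query database,
/// deleting each one from consensus storage only after its copy succeeded.
/// Returns the number of leaves moved, which is zero when nothing new was decided.
pub fn collect_garbage(
    consensus: &mut ConsensusStore,
    query: &mut QueryDb,
    decided_view: u64,
) -> Result<usize, String> {
    let views: Vec<u64> = consensus
        .leaves
        .range(..=decided_view)
        .map(|(view, _)| *view)
        .collect();
    let mut moved = 0;
    for view in views {
        let leaf = consensus.leaves[&view].clone();
        // If this fails we return early and the leaf stays in consensus storage,
        // so a later call (for example after a restart) can retry it.
        query.insert(view, &leaf)?;
        consensus.leaves.remove(&view);
        moved += 1;
    }
    Ok(moved)
}
```

Because deletion follows a successful write, a crash between the two steps leaves the leaf in both stores rather than in neither, which is the safe direction for availability.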

This also obsoletes the in-memory payload storage in HotShot, since we are
now able to load full payloads from storage on each decide, if available.

* Bug fixes

* Don't panic in SQL persistence `collect_garbage` when no new leaves
  are decided
* Don't fail fs persistence `load_quorum_proposals` when the proposals
  directory does not exist
* Better logging for libp2p startup

* Improved logging around decide events

* Use a more robust method for deciding which leaves have already been processed

Store last processed leaf view in Postgres rather than trying to
dead reckon.
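
As a rough illustration of the idea, the sketch below keeps an explicit marker of the last processed view and skips anything at or below it. The `Leaf` and `ProcessedMarker` types and the `process_new_leaves` function are hypothetical; in the real change the marker lives in Postgres, whereas here it is a plain struct, and leaves are assumed to arrive in ascending view order.

```rust
/// Hypothetical decided leaf, identified only by its view number here.
#[derive(Debug, Clone, PartialEq)]
pub struct Leaf {
    pub view: u64,
}

/// Hypothetical stand-in for the persisted record of the last processed view.
#[derive(Debug, Default)]
pub struct ProcessedMarker {
    pub last_processed_view: Option<u64>,
}

/// Process only leaves newer than the stored marker, advancing the marker as we go.
/// Returns the views that were actually processed.
pub fn process_new_leaves(marker: &mut ProcessedMarker, decided: &[Leaf]) -> Vec<u64> {
    let mut processed = Vec::new();
    for leaf in decided {
        let is_new = match marker.last_processed_view {
            Some(last) => leaf.view > last,
            None => true,
        };
        if is_new {
            // Real processing of the leaf would happen here.
            processed.push(leaf.view);
            marker.last_processed_view = Some(leaf.view);
        }
    }
    processed
}
```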

* Disable "restart all" tests

These tests require non-DA nodes to store merklized state. Disabling
until we have that functionality.

* update the cdn

* Move event back-processing to async task so it doesn't block startup

* Document restart function

* Avoid blocking drop with task cancellation if shut_down is explicitly called

* Avoid blocking drop at end of restart tests

* Mark restart tests slow

* Mark restart tests as heavy

* Increase slow test timeout

* Don't capture test output so I can debug timing out job

* Revert "/SlowEst/SlowTests (#1884)"

This reverts commit 232529d.

* Orchestrator does not check authorization

* Avoid shutting down in Context::drop if shut_down was already called
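
One common way to get this behavior is a flag that `shut_down` sets and `Drop` respects. The sketch below is an assumption about the general pattern, not the actual `Context` type: the fields, the constructor, and the counter used to observe shutdowns are all hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a node context that owns background tasks.
pub struct Context {
    shut_down_called: bool,
    /// Counts how many times the expensive shutdown work actually ran.
    shutdowns: Arc<AtomicUsize>,
}

impl Context {
    pub fn new(shutdowns: Arc<AtomicUsize>) -> Self {
        Self {
            shut_down_called: false,
            shutdowns,
        }
    }

    /// Explicit shutdown. Safe to call more than once.
    pub fn shut_down(&mut self) {
        if self.shut_down_called {
            return;
        }
        self.shut_down_called = true;
        // A real implementation would cancel tasks and wait for them here.
        self.shutdowns.fetch_add(1, Ordering::SeqCst);
    }
}

impl Drop for Context {
    fn drop(&mut self) {
        // Takes the early return if shut_down() already ran, so drop does no extra work.
        self.shut_down();
    }
}
```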

* Pull in patched query service

* Update query service

* Update query service

* Update query service

* Fix names in slow test job

---------

Co-authored-by: Jeb Bearer <jeb@espressosys.com>
Co-authored-by: Rob <rob@espressosys.com>