/SlowEst/SlowTests #1884

tbro · 2024-08-21T12:13:52Z

Fixes the mistake of trying to be clever at the expense of clarity. With this change, the Slow Tests workflow is now simply called Slow Tests.

Doesn't change any functionality, so if Pipelines pass it should be good.

This reverts commit 232529d.

* Test framework for restartability

  Set up some Rust automation for tests that spin up a sequencer network and restart various combinations of nodes, checking that we recover liveness. Instantiate the framework with several combinations of nodes as outlined in https://www.notion.so/espressosys/Persistence-catchup-and-restartability-cf4ddb79df2e41a993e60e3beaa28992. As expected, the tests where we restart >f nodes do not pass yet, and are ignored. The others pass locally.

  There are many things left to test here, including:
  * Testing with a valid libp2p setup
  * Testing with _only_ libp2p and no CDN
  * Checking integrity of the DA/query service during and after restart

  But this is a pretty good starting point.

  I considered doing this with something more dynamic like Bash or Python scripting, leaning on our existing docker-compose or process-compose infrastructure to spin up a network. I avoided this for a few reasons:
  * process-compose is annoying to script and in particular has limited capabilities for shutting down and starting up processes
  * both docker-compose and process-compose make it hard to dynamically choose the network topology
  * once the basic test infrastructure is out of the way, Rust is far easier to work with for writing new checks and assertions. For example, checking for progress is way easier when we can plug directly into the HotShot event stream, vs subscribing to some stream via HTTP and parsing responses with jq

* Add hotshot-query-service/testing feature, remove `testing` from default features
* Configure libp2p in restart tests
* Return an error instead of panicking if node initialization fails

  This is needed for the restart tests, where initialization can sometimes fail after a restart due to the libp2p port not being deallocated by the OS quickly enough. This necessitates a retry loop, which means all error cases need to return an error rather than panicking.
* Adjust timeouts and thresholds so tests consistently pass
* Deterministically avoid port collisions
* Improve debug logging for event handling tasks
* Update API database in sync with consensus storage

  Previously, the database used by the query API was populated from a completely separate event handling task than the consensus storage. This could lead to a situation where consensus storage has already been updated with a newly decided leaf, but API storage has not, and then the node restarts, so that consensus thinks it is on a later leaf, but the query API has never and will never see this leaf, and thus cannot make it available: a DA failure.

  With this change, the query database is now populated from the consensus storage, so that consensus storage is authoritative, and the query database is guaranteed to always eventually reflect the status of consensus storage. The movement of data from consensus storage to query storage is tied in with consensus garbage collection, so that we do not delete any data until we are sure it has been recorded in the DA database, if appropriate.

  This also obsoletes the in-memory payload storage in HotShot, since we are now able to load full payloads from storage on each decide, if available.

* Bug fixes
* Don't panic in SQL persistence `collect_garbage` when no new leaves are decided
* Don't fail fs persistence `load_quorum_proposals` when the proposals directory does not exist
* Better logging for libp2p startup
* Improved logging around decide events
* Use a more robust method for deciding which leaves have already been processed

  Store last processed leaf view in Postgres rather than trying to dead reckon.

* Disable "restart all" tests

  These tests require non-DA nodes to store merklized state. Disabling until we have that functionality.
* update the cdn
* Move event back-processing to async task so it doesn't block startup
* Document restart function
* Avoid blocking drop with task cancellation if shut_down is explicitly called
* Avoid blocking drop at end of restart tests
* Mark restart tests slow
* Mark restart tests as heavy
* Increase slow test timeout
* Don't capture test output so I can debug timing out job
* Revert "/SlowEst/SlowTests (#1884)"

  This reverts commit 232529d.

* Orchestrator does not check authorization
* Avoid shutting down in Context::drop if shut_down was already called
* Pull in patched query service
* Update query service
* Update query service
* Update query service
* Fix names in slow test job

---------

Co-authored-by: Jeb Bearer <jeb@espressosys.com>
Co-authored-by: Rob <rob@espressosys.com>
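
For readers following the "Return an error instead of panicking if node initialization fails" item above, here is a minimal sketch of the kind of retry loop it describes. It is not the code from the referenced commit: `InitError`, `init_with_retry`, the attempt count, and the delay are all hypothetical names and values chosen for illustration. The only point it makes is that a retry loop needs initialization to return a `Result`, since a panic would abort the test before a second attempt.

```rust
use std::{fmt, thread, time::Duration};

/// Hypothetical error type standing in for whatever node initialization returns.
#[derive(Debug)]
struct InitError(String);

impl fmt::Display for InitError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl std::error::Error for InitError {}

/// Call `init` up to `max_attempts` times, sleeping `delay` after each failure
/// except the last. Returns the first success, or the last error seen.
fn init_with_retry<T>(
    max_attempts: usize,
    delay: Duration,
    mut init: impl FnMut(usize) -> Result<T, InitError>,
) -> Result<T, InitError> {
    let mut last_err = InitError("no attempts made".into());
    for attempt in 1..=max_attempts {
        match init(attempt) {
            Ok(node) => return Ok(node),
            Err(err) => {
                eprintln!("init attempt {attempt} failed: {err}");
                last_err = err;
                if attempt < max_attempts {
                    thread::sleep(delay);
                }
            }
        }
    }
    Err(last_err)
}

fn main() {
    // Simulate a port that the OS has not yet released for the first two attempts.
    let result = init_with_retry(5, Duration::from_millis(10), |attempt| {
        if attempt < 3 {
            Err(InitError("address already in use".into()))
        } else {
            Ok(attempt)
        }
    });
    assert_eq!(result.unwrap(), 3);
}
```

The closure receives the attempt number only so the example can simulate a transient failure; a real initializer would presumably ignore it.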

/SlowEst/SlowTests

3fdde0c

tbro requested review from nomaxg, philippecamacho, ImJeremyHe, sveitser, jbearer and imabdulbasit as code owners August 21, 2024 12:13

hypenate

7304cd3

tbro assigned sveitser Aug 21, 2024

rob-maron approved these changes Aug 21, 2024

tbro merged commit 232529d into main Aug 21, 2024
15 checks passed

tbro deleted the tb/drop-silly-name branch August 21, 2024 13:18

lukaszrzasik added a commit that referenced this pull request Sep 9, 2024

Revert "/SlowEst/SlowTests (#1884)"

36df3c8

This reverts commit 232529d.
