/SlowEst/SlowTests #1884
Merged
tbro requested review from nomaxg, philippecamacho, ImJeremyHe, sveitser, jbearer and imabdulbasit as code owners (August 21, 2024 12:13)
rob-maron approved these changes (Aug 21, 2024)
lukaszrzasik added a commit that referenced this pull request (Sep 9, 2024)
This reverts commit 232529d.
jbearer added a commit that referenced this pull request (Sep 24, 2024)
* Test framework for restartability

  Set up some Rust automation for tests that spin up a sequencer network and restart various combinations of nodes, checking that we recover liveness. Instantiate the framework with several combinations of nodes as outlined in https://www.notion.so/espressosys/Persistence-catchup-and-restartability-cf4ddb79df2e41a993e60e3beaa28992.

  As expected, the tests where we restart >f nodes do not pass yet, and are ignored. The others pass locally.

  There are many things left to test here, including:
  * Testing with a valid libp2p setup
  * Testing with _only_ libp2p and no CDN
  * Checking integrity of the DA/query service during and after restart

  But this is a pretty good starting point.

  I considered doing this with something more dynamic like Bash or Python scripting, leaning on our existing docker-compose or process-compose infrastructure to spin up a network. I avoided this for a few reasons:
  * process-compose is annoying to script and in particular has limited capabilities for shutting down and starting up processes
  * both docker-compose and process-compose make it hard to dynamically choose the network topology
  * once the basic test infrastructure is out of the way, Rust is far easier to work with for writing new checks and assertions. For example, checking for progress is way easier when we can plug directly into the HotShot event stream, vs subscribing to some stream via HTTP and parsing responses with jq

* Add hotshot-query-service/testing feature, remove `testing` from default features

* Configure libp2p in restart tests

* Return an error instead of panicking if node initialization fails

  This is needed for the restart tests, where initialization can sometimes fail after a restart due to the libp2p port not being deallocated by the OS quickly enough. This necessitates a retry loop, which means all error cases need to return an error rather than panicking.

* Adjust timeouts and thresholds so tests consistently pass

* Deterministically avoid port collisions

* Improve debug logging for event handling tasks

* Update API database in sync with consensus storage

  Previously, the database used by the query API was populated by an event handling task completely separate from the one that updates consensus storage. This could lead to a situation where consensus storage has already been updated with a newly decided leaf, but API storage has not, and then the node restarts, so that consensus thinks it is on a later leaf, but the query API has never seen and will never see this leaf, and thus cannot make it available: a DA failure.

  With this change, the query database is now populated from the consensus storage, so that consensus storage is authoritative, and the query database is guaranteed to always eventually reflect the status of consensus storage. The movement of data from consensus storage to query storage is tied in with consensus garbage collection, so that we do not delete any data until we are sure it has been recorded in the DA database, if appropriate.

  This also obsoletes the in-memory payload storage in HotShot, since we are now able to load full payloads from storage on each decide, if available.

* Bug fixes

* Don't panic in SQL persistence `collect_garbage` when no new leaves are decided

* Don't fail fs persistence `load_quorum_proposals` when the proposals directory does not exist

* Better logging for libp2p startup

* Improved logging around decide events

* Use a more robust method for deciding which leaves have already been processed

  Store last processed leaf view in Postgres rather than trying to dead reckon.

* Disable "restart all" tests

  These tests require non-DA nodes to store merklized state. Disabling until we have that functionality.

* update the cdn

* Move event back-processing to async task so it doesn't block startup

* Document restart function

* Avoid blocking drop with task cancellation if shut_down is explicitly called

* Avoid blocking drop at end of restart tests

* Mark restart tests slow

* Mark restart tests as heavy

* Increase slow test timeout

* Don't capture test output so I can debug timing out job

* Revert "/SlowEst/SlowTests (#1884)"

  This reverts commit 232529d.

* Orchestrator does not check authorization

* Avoid shutting down in Context::drop if shut_down was already called

* Pull in patched query service

* Update query service

* Update query service

* Update query service

* Fix names in slow test job

---------

Co-authored-by: Jeb Bearer <jeb@espressosys.com>
Co-authored-by: Rob <rob@espressosys.com>
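For readers unfamiliar with the pattern, the "return an error instead of panicking" item in the commit message above can be illustrated with a minimal sketch of a retry loop around a fallible initializer. This is not the sequencer's actual code: `InitError`, `init_with_retry`, and the choice to treat only a busy port as transient are all hypothetical names and assumptions made for illustration.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical error type standing in for whatever node initialization returns.
#[derive(Debug, PartialEq)]
pub enum InitError {
    /// Transient: the OS has not yet released the port (assumed retryable).
    PortInUse(u16),
    /// Anything else is treated as non-retryable in this sketch.
    Fatal(String),
}

/// Call `init` up to `max_attempts` times, sleeping `delay` between attempts.
/// Because `init` returns a `Result` rather than panicking, the caller can
/// decide which failures are worth retrying.
pub fn init_with_retry<T, F>(
    mut init: F,
    max_attempts: usize,
    delay: Duration,
) -> Result<T, InitError>
where
    F: FnMut() -> Result<T, InitError>,
{
    // Returned only if `max_attempts` is zero and `init` is never called.
    let mut last_err = InitError::Fatal("max_attempts must be at least 1".to_string());
    for attempt in 1..=max_attempts {
        match init() {
            Ok(node) => return Ok(node),
            Err(err @ InitError::PortInUse(_)) => {
                // Remember the transient error and wait before the next attempt.
                last_err = err;
                if attempt < max_attempts {
                    sleep(delay);
                }
            }
            // Non-transient errors are surfaced immediately.
            Err(fatal) => return Err(fatal),
        }
    }
    Err(last_err)
}
```

If the initializer panicked on a busy port, a wrapper like this could not recover, which is why every error path has to return an error for the retry loop to work.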
Fixes an error of trying to be clever at the expense of clarity. With this change, the workflow is now called Slow Tests. This doesn't change any functionality, so if the pipelines pass it should be good.