-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ledger-tool: verify: add --record-slot-hashes and --check-slot-hashes #29189
ledger-tool: verify: add --record-slot-hashes and --check-slot-hashes #29189
Conversation
Btw while doing this I thought about potentially reusing the existing |
I think I would prefer it if there was only one CLI option instead of two. That way one does not need to change / retype the command but can simply reuse the command from the history, which makes for a slightly better UX. This would be implemented by checking if the file at the given path currently exists, then act as |
Needs to be rebased to solve the downstream CI failure. |
Dang it, my bad on letting this fall off my radar. Let me take a look in the next couple days |
I was chatting with jwash about something else and this PR came up (as we've both had a need to do this kind of thing in the past). jwash was pondering if routing logging to files would accomplish the same thing, and I think I agree. ledger-tool will run with info by default, which will emit the following log (from bank.rs): info!(
"bank frozen: {} hash: {} accounts_delta: {} signature_count: {} last_blockhash: {} capitalization: {}",
self.slot(),
hash,
accounts_delta_hash.hash,
self.signature_count(),
self.last_blockhash(),
self.capitalization(),
); So, it seems like the following sequence could accomplish the same (supposing we're going to run twice)
Granted this isn't json format, but since we only care to compare results between runs (ie don't need to load them up to do additional analysis), json v. csv. plain text doesn't really matter so much. Additionally, storing logs from verify might be desired/beneficial anyways for additional logs, such as the following (also from bank.rs) that give some metadata about the accounts updated in the given slot. info!(
"accounts hash slot: {} stats: {:?}",
self.slot(),
accounts_delta_hash.stats,
); Given above, I'm inclined to think that logging is sufficient for the case you described; what do you think? |
Yes, but doesn't this patch provide much better DX? The way I use this now is not only to debug known consensus issues, is to detect issues early as I work every day. I have been working on a long lived branch that changes the program runtime significantly. I used to work for a week or two, then replay/run validators, find issues, fix. Now I have generated a "good" set of hashes for a couple of ledgers. Every day I do my changes, commit, do a quick smoke check replay and carry on hacking. Only if I get a bad hash then I turn on logs and start debugging. I have already caught at least two issues this way, and fixing them was much easier than the other times I accidentally broke consensus since I found the bug right after doing a change, so all the required context was still loaded in my brain. Sure I could do the same after every commit with grep and diff and shell scripts. But I don't, and I suspect that nobody does because it's extra work. 😋 So again, this reduces friction. And finally, if you've been debugging consensus issues for a long time I get why maybe this feature doesn't make a big difference to you. But just to give you an idea, before I worked on this patch I had no clue what a frozen bank even was. I just wanted something to tell me whether my rbpf/runtime change broke something or not. Either way, I don't really feel strongly about this change and I'm happy to keep carrying it in my own branch. But I don't see a strong reason not to have it. The patch is trivial and the change to blockstore proc is only 2 lines of code. |
I agree with @alessandrod, while this technically does not provide any new functionality which was previously unachievable, that does not mean it doesn't provide a lot of value. In this case you can automatically have it stop at any corrupted hash, no need to diff and search the diff for the mismatch. Also, 5 steps become two. Only thing I would change is to unite both CLI args into one, see #29189 (comment) |
Re-thinking this and reading both of your replies, I think I'm onboard with the change. While the log / grep method I mentioned is workable, it is definitely clunky. For me where I may only be hunting a consensus breakage once-a-quarter when it hits one of the clusters, that's not too bad and I can go refollow the step-by-step wiki I wrote for myself 😄 . But, for the two of you (and potentially others on the VM team), this seems to be a valuable part of your daily workflow that will help reduce downstream bugs. So ideologically I'm onboard, but let me circle back later today (promise it won't be weeks this time) after mulling the 1 v 2 arg item. In the meantime, please rebase on tip-of-master; you'll see that there was a vulnerability in one of our deps, and the tip of master has been updated to use a patched version of that dep. |
688c048
to
5fd7381
Compare
This adds: --record-slot-hashes <FILENAME> Record slot hashes to file so they can be used as reference in future runs. See --check-slot-hashes. --check-slot-hashes <FILENAME> Check that slot hashes match expected reference values. See --record-slot-hashes. The former can be used to dump a list of (slot, hash) to a json file during a replay. The latter can be used to check slot hashes against previously recorded values. This is useful for debugging consensus failures, eg: # on good commit/branch ledger-tool verify --record-slot-hashes good.json ... # on bad commit or potentially consensus breaking branch ledger-tool verify --check-slot-hashes good.json ... On a hash mismatch an error will be logged with the expected hash vs the computed hash.
@alessandrod @steviez I won't let you off the hook so easily ;) |
5fd7381
to
14fe9e6
Compare
This adds:
The former can be used to dump a list of (slot, hash) to a json file during a replay. The latter can be used to check slot hashes against previously recorded values.
This is useful for debugging consensus failures, eg:
On a hash mismatch an error will be logged with the expected hash vs the computed hash.
This is my first foray outside of the program runtime, I have no clue what I'm doing ✌️