ledger-tool: verify: add --record-slot-hashes and --check-slot-hashes #29189

alessandrod · 2022-12-09T23:25:15Z

This adds:

--record-slot-hashes <FILENAME>
    Record slot hashes to file so they can be used as reference in future runs. See --check-slot-hashes.

--check-slot-hashes <FILENAME>
    Check that slot hashes match expected reference values. See --record-slot-hashes.

The former can be used to dump a list of (slot, hash) to a json file during a replay. The latter can be used to check slot hashes against previously recorded values.

This is useful for debugging consensus failures, eg:

# on good commit/branch
ledger-tool verify --record-slot-hashes good.json ...

# on bad commit or potentially consensus breaking branch
ledger-tool verify --check-slot-hashes good.json ...

On a hash mismatch an error will be logged with the expected hash vs the computed hash.

This is my first foray outside of the program runtime, I have no clue what I'm doing ✌️

alessandrod · 2022-12-09T23:27:48Z

Btw while doing this I thought about potentially reusing the existing entry_callback but then noticed that it seems to be unused? Maybe it should be removed?

Lichtso · 2022-12-10T15:38:16Z

I think I would prefer it if there was only one CLI option instead of two. That way one does not need to change / retype the command but can simply reuse the command from the history, which makes for a slightly better UX.

This would be implemented by checking if the file at the given path currently exists, then act as --check-slot-hashes otherwise as --record-slot-hashes.

Lichtso · 2022-12-13T19:23:14Z

Needs to be rebased to solve the downstream CI failure.

steviez · 2023-01-06T08:18:08Z

Dang it, my bad on letting this fall off my radar. Let me take a look in the next couple days

steviez · 2023-01-09T21:02:29Z

I was chatting with jwash about something else and this PR came up (as we've both had a need to do this kind of thing in the past). jwash was pondering if routing logging to files would accomplish the same thing, and I think I agree.

ledger-tool will run with info by default, which will emit the following log (from bank.rs):

        info!(
            "bank frozen: {} hash: {} accounts_delta: {} signature_count: {} last_blockhash: {} capitalization: {}",
            self.slot(),
            hash,
            accounts_delta_hash.hash,
            self.signature_count(),
            self.last_blockhash(),
            self.capitalization(),
        );

So, it seems like the following sequence could accomplish the same (supposing we're going to run twice)

solana-ledger-tool verify <some args> &> verify.log
grep -o "bank frozen: .*" verify.log > verify_bank_hashes.txt
// Repeat above steps with different version and diff .txt files

Granted this isn't json format, but since we only care to compare results between runs (ie don't need to load them up to do additional analysis), json v. csv. plain text doesn't really matter so much.

Additionally, storing logs from verify might be desired/beneficial anyways for additional logs, such as the following (also from bank.rs) that give some metadata about the accounts updated in the given slot.

        info!(
            "accounts hash slot: {} stats: {:?}",
            self.slot(),
            accounts_delta_hash.stats,
        );

Given above, I'm inclined to think that logging is sufficient for the case you described; what do you think?

alessandrod · 2023-01-09T22:49:56Z

solana-ledger-tool verify &> verify.log
grep -o "bank frozen: .*" verify.log > verify_bank_hashes.txt
// Repeat above steps with different version and diff .txt files

Yes, but doesn't this patch provide much better DX? The way I use this now is not only to debug known consensus issues, is to detect issues early as I work every day.

I have been working on a long lived branch that changes the program runtime significantly. I used to work for a week or two, then replay/run validators, find issues, fix. Now I have generated a "good" set of hashes for a couple of ledgers. Every day I do my changes, commit, do a quick smoke check replay and carry on hacking. Only if I get a bad hash then I turn on logs and start debugging. I have already caught at least two issues this way, and fixing them was much easier than the other times I accidentally broke consensus since I found the bug right after doing a change, so all the required context was still loaded in my brain.

Sure I could do the same after every commit with grep and diff and shell scripts. But I don't, and I suspect that nobody does because it's extra work. 😋 So again, this reduces friction. And finally, if you've been debugging consensus issues for a long time I get why maybe this feature doesn't make a big difference to you. But just to give you an idea, before I worked on this patch I had no clue what a frozen bank even was. I just wanted something to tell me whether my rbpf/runtime change broke something or not.

Either way, I don't really feel strongly about this change and I'm happy to keep carrying it in my own branch. But I don't see a strong reason not to have it. The patch is trivial and the change to blockstore proc is only 2 lines of code.

Lichtso · 2023-01-10T00:21:06Z

I agree with @alessandrod, while this technically does not provide any new functionality which was previously unachievable, that does not mean it doesn't provide a lot of value. In this case you can automatically have it stop at any corrupted hash, no need to diff and search the diff for the mismatch. Also, 5 steps become two.

Only thing I would change is to unite both CLI args into one, see #29189 (comment)

steviez · 2023-01-10T18:28:20Z

Re-thinking this and reading both of your replies, I think I'm onboard with the change. While the log / grep method I mentioned is workable, it is definitely clunky. For me where I may only be hunting a consensus breakage once-a-quarter when it hits one of the clusters, that's not too bad and I can go refollow the step-by-step wiki I wrote for myself 😄 .

But, for the two of you (and potentially others on the VM team), this seems to be a valuable part of your daily workflow that will help reduce downstream bugs.

So ideologically I'm onboard, but let me circle back later today (promise it won't be weeks this time) after mulling the 1 v 2 arg item. In the meantime, please rebase on tip-of-master; you'll see that there was a vulnerability in one of our deps, and the tip of master has been updated to use a patched version of that dep.

…llback

This adds: --record-slot-hashes <FILENAME> Record slot hashes to file so they can be used as reference in future runs. See --check-slot-hashes. --check-slot-hashes <FILENAME> Check that slot hashes match expected reference values. See --record-slot-hashes. The former can be used to dump a list of (slot, hash) to a json file during a replay. The latter can be used to check slot hashes against previously recorded values. This is useful for debugging consensus failures, eg: # on good commit/branch ledger-tool verify --record-slot-hashes good.json ... # on bad commit or potentially consensus breaking branch ledger-tool verify --check-slot-hashes good.json ... On a hash mismatch an error will be logged with the expected hash vs the computed hash.

Lichtso · 2023-02-08T12:42:55Z

@alessandrod @steviez I won't let you off the hook so easily ;)

Lichtso requested a review from steviez December 10, 2022 15:35

github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Dec 29, 2022

github-actions bot closed this Jan 6, 2023

Lichtso reopened this Jan 9, 2023

Lichtso removed the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 9, 2023

alessandrod force-pushed the ledger-tool-check-slot-hashes branch from 688c048 to 5fd7381 Compare January 10, 2023 21:54

github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 26, 2023

github-actions bot closed this Feb 3, 2023

Lichtso reopened this Feb 3, 2023

github-actions bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label Feb 6, 2023

alessandrod added 2 commits February 7, 2023 05:44

ledger/blockstore_processor: rename ProcessCallback to ProcessEntryCa…

2c3e853

…llback

alessandrod force-pushed the ledger-tool-check-slot-hashes branch from 5fd7381 to 14fe9e6 Compare February 8, 2023 22:14

github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Feb 23, 2023

github-actions bot closed this Mar 2, 2023

alessandrod reopened this Mar 2, 2023

github-actions bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label Mar 3, 2023

github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Mar 20, 2023

steviez removed the stale [bot only] Added to stale content; results in auto-close after a week. label Mar 20, 2023

github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Apr 4, 2023

github-actions bot closed this Apr 12, 2023

seanyoung mentioned this pull request Nov 28, 2023

ledger-tool: verify: add --record-slots and --verify-slots #34246

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ledger-tool: verify: add --record-slot-hashes and --check-slot-hashes #29189

ledger-tool: verify: add --record-slot-hashes and --check-slot-hashes #29189

alessandrod commented Dec 9, 2022

alessandrod commented Dec 9, 2022

Lichtso commented Dec 10, 2022

Lichtso commented Dec 13, 2022

steviez commented Jan 6, 2023

steviez commented Jan 9, 2023

alessandrod commented Jan 9, 2023

Lichtso commented Jan 10, 2023 •

edited

Loading

steviez commented Jan 10, 2023

Lichtso commented Feb 8, 2023

ledger-tool: verify: add --record-slot-hashes and --check-slot-hashes #29189

ledger-tool: verify: add --record-slot-hashes and --check-slot-hashes #29189

Conversation

alessandrod commented Dec 9, 2022

alessandrod commented Dec 9, 2022

Lichtso commented Dec 10, 2022

Lichtso commented Dec 13, 2022

steviez commented Jan 6, 2023

steviez commented Jan 9, 2023

alessandrod commented Jan 9, 2023

Lichtso commented Jan 10, 2023 • edited Loading

steviez commented Jan 10, 2023

Lichtso commented Feb 8, 2023

Lichtso commented Jan 10, 2023 •

edited

Loading