Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ledger-tool: verify: add --record-slot-hashes and --check-slot-hashes #29189

Conversation

alessandrod
Copy link
Contributor

This adds:

--record-slot-hashes <FILENAME>
    Record slot hashes to file so they can be used as reference in future runs. See --check-slot-hashes.

--check-slot-hashes <FILENAME>
    Check that slot hashes match expected reference values. See --record-slot-hashes.

The former can be used to dump a list of (slot, hash) to a json file during a replay. The latter can be used to check slot hashes against previously recorded values.

This is useful for debugging consensus failures, eg:

# on good commit/branch
ledger-tool verify --record-slot-hashes good.json ...

# on bad commit or potentially consensus breaking branch
ledger-tool verify --check-slot-hashes good.json ...

On a hash mismatch an error will be logged with the expected hash vs the computed hash.

This is my first foray outside of the program runtime, I have no clue what I'm doing ✌️

@alessandrod
Copy link
Contributor Author

Btw while doing this I thought about potentially reusing the existing entry_callback but then noticed that it seems to be unused? Maybe it should be removed?

@Lichtso Lichtso requested a review from steviez December 10, 2022 15:35
@Lichtso
Copy link
Contributor

Lichtso commented Dec 10, 2022

I think I would prefer it if there was only one CLI option instead of two. That way one does not need to change / retype the command but can simply reuse the command from the history, which makes for a slightly better UX.

This would be implemented by checking if the file at the given path currently exists, then act as --check-slot-hashes otherwise as --record-slot-hashes.

@Lichtso
Copy link
Contributor

Lichtso commented Dec 13, 2022

Needs to be rebased to solve the downstream CI failure.

@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Dec 29, 2022
@github-actions github-actions bot closed this Jan 6, 2023
@steviez
Copy link
Contributor

steviez commented Jan 6, 2023

Dang it, my bad on letting this fall off my radar. Let me take a look in the next couple days

@Lichtso Lichtso reopened this Jan 9, 2023
@Lichtso Lichtso removed the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 9, 2023
@steviez
Copy link
Contributor

steviez commented Jan 9, 2023

I was chatting with jwash about something else and this PR came up (as we've both had a need to do this kind of thing in the past). jwash was pondering if routing logging to files would accomplish the same thing, and I think I agree.

ledger-tool will run with info by default, which will emit the following log (from bank.rs):

        info!(
            "bank frozen: {} hash: {} accounts_delta: {} signature_count: {} last_blockhash: {} capitalization: {}",
            self.slot(),
            hash,
            accounts_delta_hash.hash,
            self.signature_count(),
            self.last_blockhash(),
            self.capitalization(),
        );

So, it seems like the following sequence could accomplish the same (supposing we're going to run twice)

solana-ledger-tool verify <some args> &> verify.log
grep -o "bank frozen: .*" verify.log > verify_bank_hashes.txt
// Repeat above steps with different version and diff .txt files

Granted this isn't json format, but since we only care to compare results between runs (ie don't need to load them up to do additional analysis), json v. csv. plain text doesn't really matter so much.

Additionally, storing logs from verify might be desired/beneficial anyways for additional logs, such as the following (also from bank.rs) that give some metadata about the accounts updated in the given slot.

        info!(
            "accounts hash slot: {} stats: {:?}",
            self.slot(),
            accounts_delta_hash.stats,
        );

Given above, I'm inclined to think that logging is sufficient for the case you described; what do you think?

@alessandrod
Copy link
Contributor Author

solana-ledger-tool verify &> verify.log
grep -o "bank frozen: .*" verify.log > verify_bank_hashes.txt
// Repeat above steps with different version and diff .txt files

Yes, but doesn't this patch provide much better DX? The way I use this now is not only to debug known consensus issues, is to detect issues early as I work every day.

I have been working on a long lived branch that changes the program runtime significantly. I used to work for a week or two, then replay/run validators, find issues, fix. Now I have generated a "good" set of hashes for a couple of ledgers. Every day I do my changes, commit, do a quick smoke check replay and carry on hacking. Only if I get a bad hash then I turn on logs and start debugging. I have already caught at least two issues this way, and fixing them was much easier than the other times I accidentally broke consensus since I found the bug right after doing a change, so all the required context was still loaded in my brain.

Sure I could do the same after every commit with grep and diff and shell scripts. But I don't, and I suspect that nobody does because it's extra work. 😋 So again, this reduces friction. And finally, if you've been debugging consensus issues for a long time I get why maybe this feature doesn't make a big difference to you. But just to give you an idea, before I worked on this patch I had no clue what a frozen bank even was. I just wanted something to tell me whether my rbpf/runtime change broke something or not.

Either way, I don't really feel strongly about this change and I'm happy to keep carrying it in my own branch. But I don't see a strong reason not to have it. The patch is trivial and the change to blockstore proc is only 2 lines of code.

@Lichtso
Copy link
Contributor

Lichtso commented Jan 10, 2023

I agree with @alessandrod, while this technically does not provide any new functionality which was previously unachievable, that does not mean it doesn't provide a lot of value. In this case you can automatically have it stop at any corrupted hash, no need to diff and search the diff for the mismatch. Also, 5 steps become two.

Only thing I would change is to unite both CLI args into one, see #29189 (comment)

@steviez
Copy link
Contributor

steviez commented Jan 10, 2023

Re-thinking this and reading both of your replies, I think I'm onboard with the change. While the log / grep method I mentioned is workable, it is definitely clunky. For me where I may only be hunting a consensus breakage once-a-quarter when it hits one of the clusters, that's not too bad and I can go refollow the step-by-step wiki I wrote for myself 😄 .

But, for the two of you (and potentially others on the VM team), this seems to be a valuable part of your daily workflow that will help reduce downstream bugs.

So ideologically I'm onboard, but let me circle back later today (promise it won't be weeks this time) after mulling the 1 v 2 arg item. In the meantime, please rebase on tip-of-master; you'll see that there was a vulnerability in one of our deps, and the tip of master has been updated to use a patched version of that dep.

@alessandrod alessandrod force-pushed the ledger-tool-check-slot-hashes branch from 688c048 to 5fd7381 Compare January 10, 2023 21:54
@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 26, 2023
@github-actions github-actions bot closed this Feb 3, 2023
@Lichtso Lichtso reopened this Feb 3, 2023
@github-actions github-actions bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label Feb 6, 2023
This adds:

    --record-slot-hashes <FILENAME>
        Record slot hashes to file so they can be used as reference in future runs. See --check-slot-hashes.

    --check-slot-hashes <FILENAME>
        Check that slot hashes match expected reference values. See --record-slot-hashes.

The former can be used to dump a list of (slot, hash) to a json file
during a replay. The latter can be used to check slot hashes against
previously recorded values.

This is useful for debugging consensus failures, eg:

    # on good commit/branch
    ledger-tool verify --record-slot-hashes good.json ...

    # on bad commit or potentially consensus breaking branch
    ledger-tool verify --check-slot-hashes good.json ...

On a hash mismatch an error will be logged with the expected hash vs the
computed hash.
@Lichtso
Copy link
Contributor

Lichtso commented Feb 8, 2023

@alessandrod @steviez I won't let you off the hook so easily ;)

@alessandrod alessandrod force-pushed the ledger-tool-check-slot-hashes branch from 5fd7381 to 14fe9e6 Compare February 8, 2023 22:14
@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Feb 23, 2023
@github-actions github-actions bot closed this Mar 2, 2023
@alessandrod alessandrod reopened this Mar 2, 2023
@github-actions github-actions bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label Mar 3, 2023
@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Mar 20, 2023
@steviez steviez removed the stale [bot only] Added to stale content; results in auto-close after a week. label Mar 20, 2023
@github-actions github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Apr 4, 2023
@github-actions github-actions bot closed this Apr 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale [bot only] Added to stale content; results in auto-close after a week.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants