ledger-tool: verify: add --record-slots and --verify-slots #34246

seanyoung · 2023-11-28T15:37:46Z

This adds:

    --record-slots <FILENAME>
        Write the slot hashes to this file, optionally with all the acounts

    --record-slots-config hash-only|bank
        Store the bank (=accounts) json file, or not.

    --verify-slots <FILENAME>
        Verify slot hashes against this file.

The first case can be used to dump a list of (slot, hash) to a json file
during a replay. The second case can be used to check slot hashes against
previously recorded values.

This is useful for debugging consensus failures, eg:

    # on good commit/branch
    ledger-tool verify --record-slots good.json --record-slots-config=bank

    # on bad commit or potentially consensus breaking branch
    ledger-tool verify --verify-slots good.json

On a hash mismatch an error will be logged with the expected hash vs the computed hash.

Problem

Summary of Changes

Fixes #

seanyoung · 2023-11-28T15:39:06Z

This is #29189 with the following changes:

rebased
single command line option
tested

codecov · 2023-11-28T20:03:39Z

Codecov Report

Attention: Patch coverage is 66.66667% with 6 lines in your changes are missing coverage. Please review.

Project coverage is 81.7%. Comparing base (6ee3bb9) to head (e334f34).
Report is 18 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff            @@
##           master   #34246    +/-   ##
========================================
  Coverage    81.7%    81.7%            
========================================
  Files         834      834            
  Lines      224299   224830   +531     
========================================
+ Hits       183361   183835   +474     
- Misses      40938    40995    +57

alessandrod · 2023-11-28T20:52:54Z

Thanks for picking this up! Since I did the original patch, this was added to the validator and ledger tool #32632. So it seems that this change is now kinda redundant? One can achieve the same by doing a first run with --write-bank-file, diff the bank files, then use --halt-at-slot to stop at the offending slot. Even if it's a couple extra steps, the bank file includes fields that likely make debugging the divergence faster.

steviez · 2023-11-29T07:01:42Z

One can achieve the same by doing a first run with --write-bank-file, diff the bank files, then use --halt-at-slot to stop at the offending slot. Even if it's a couple extra steps, the bank file includes fields that likely make debugging the divergence faster.

A few notes about the --write-bank-file flag:

It outputs the account of every touched account for a given slot to a human readable json file. As such, it has non-trivial file size (one from mnb from several months ago was 20 MB for one slot)
- I only mention the file size as I'm not sure how many slots your typical "test" ledger is. Assuming 20 MB is representative (not sure that it is), that's 1 GB for every 50 slots. If you're running on the beefy dev server, then space probably not a concern. A laptop might be another
Currently, the flag only supports creating a file for the working bank (highest slot at end of replay). It could be an enhancement to allow creating these at every slot along the way
To your last point, absolutely, these files shave a bunch of time off the divergence debugging process. You'll be able to quickly tell which account(s) differed, which is often enough to isolate down to a single transaction that differed.

I uploaded one of these files here for the sake of seeing what exactly it contains:
212112050-GoZ4MYBB21YiiVCHWZHzKaMebtvk1sFWBqpzT3nZwk6s.json

alessandrod · 2023-11-29T07:27:11Z

One can achieve the same by doing a first run with --write-bank-file, diff the bank files, then use --halt-at-slot to stop at the offending slot. Even if it's a couple extra steps, the bank file includes fields that likely make debugging the divergence faster.

A few notes about the --write-bank-file flag:
* It outputs the account of every touched account for a given slot to a human readable json file. As such, it has non-trivial file size (one from mnb from several months ago was 20 MB for one slot)

Oh I see, that's a lot larger than I thought. Maybe we can add an option to opt out of including the actual account data?

  * I only mention the file size as I'm not sure how many slots your typical "test" ledger is. Assuming 20 MB is representative (not sure that it is), that's 1 GB for every 50 slots. If you're running on the beefy dev server, then space probably not a concern. A laptop might be another
* Currently, the flag only supports creating a file for the working bank (highest slot at end of replay). It could be an enhancement to allow creating these at every slot along the way

Yeah for the purpose of smoke testing the runtime I usually replay very large ledgers, the more slots the better really. And data for each slot would have to be generated, preferably by changing the format to include multiple slots, dumping hundreds of files doesn't seem ideal.

* To your last point, absolutely, these files shave a bunch of time off the divergence debugging process. You'll be able to quickly tell which account(s) differed, which is often enough to isolate down to a single transaction that differed.

But so at the moment you must still separately figure out which slot diverges, then generate a bank file for it and diff right? If the format was changed to include multiple slots in order, then from the diff itself it would be possible to tell when the divergence happened I think right?

What I'm thinking is this: I've used the code in this PR many times and it helped me find many runtime bugs early, so obvs I would like to have the functionality in some form. The way I've used it was to find the offending broken slot, then add logs to strategic places to figure out diffs hashes etc... which --write-bank-file conveniently outputs to a file. So somehow merging these two features makes a lot of sense I think?

steviez · 2023-11-30T08:09:10Z

Oh I see, that's a lot larger than I thought. Maybe we can add an option to opt out of including the actual account data?

I'm onboard with this, and this idea would be inline with another idea I had to optionally include transaction information. The transaction information would further float size so not great for a total test bench, but potentially useful in automating some of the tedious work of honing in on the bug.

Yeah for the purpose of smoke testing the runtime I usually replay very large ledgers, the more slots the better really.

Yep, understood

And data for each slot would have to be generated, preferably by changing the format to include multiple slots, dumping hundreds of files doesn't seem ideal.

Yeah, I think that is reasonable as well. As you can see, there is some metadata that won't change (namely the version string but could add more down the road). But, we could add everything else into a slots array, and for the sake of solana-validator dumping a single slot when it has diverged, we can just output the array of size 1.

What I'm thinking is this: I've used the code in this PR many times and it helped me find many runtime bugs early, so obvs I would like to have the functionality in some form. The way I've used it was to find the offending broken slot, then add logs to strategic places to figure out diffs hashes etc... which --write-bank-file conveniently outputs to a file. So somehow merging these two features makes a lot of sense I think?

Got it. At a high level, I see two current cases for this feature:

When a validator diverges, generate this file. The file can then also be generated with a known, good version and diffed. This is useful for us comparing version-to-version, as well as for multiple clients (the Firedancer team expressing desire for this feature was the final nudge to push this in)
Kind of similar to above, but in your case, you're testing just runtime changes via solana-ledger-tool. For this case, the variable level of verbosity seems to fit:
- You generate a file with just the components that go into bank hash as your "first level" for comparison to use as your known good set
- You then develop some feature and want to test it; so you run and check against the known set
  - If the bank hashes match throughout replay, you're good to go
  - If not, then you should have the diverging slot. Then, you potentially go back and do --write-bank-file with the extra details (ie accounts) with the known-good-version AND your under-development-version, and diff those files to hone in.

From above, 1. is more me telling you how I use it; do you think 2. accurately captures how you would want to use this feature if we incorporate the "optional" details?

seanyoung · 2023-11-30T15:43:28Z

I'm making changes to the runtime and testing with this feature. Hopefully by the time I'm done I'll have some bright ideas, or maybe not. We'll see

steviez · 2023-12-01T05:10:03Z

I'm making changes to the runtime and testing with this feature. Hopefully by the time I'm done I'll have some bright ideas, or maybe not. We'll see

Sounds good! I'll be out next week (Dec 4-8), but back the week after (Dec 11). Can review then, or if you still have other things you're working on (or other things that you would rather work on), I could also take this task off of your hands as well.

alessandrod · 2023-12-01T07:59:25Z

Yeah, I think that is reasonable as well. As you can see, there is some metadata that won't change (namely the version string but could add more down the road). But, we could add everything else into a slots array, and for the sake of solana-validator dumping a single slot when it has diverged, we can just output the array of size 1.

Yep makes sense

What I'm thinking is this: I've used the code in this PR many times and it helped me find many runtime bugs early, so obvs I would like to have the functionality in some form. The way I've used it was to find the offending broken slot, then add logs to strategic places to figure out diffs hashes etc... which --write-bank-file conveniently outputs to a file. So somehow merging these two features makes a lot of sense I think?

Got it. At a high level, I see two current cases for this feature:
1. When a validator diverges, generate this file. The file can then also be generated with a known, good version and diffed. This is useful for us comparing version-to-version, as well as for multiple clients (the Firedancer team expressing desire for this feature was the final nudge to push this in)

2. Kind of similar to above, but in your case, you're testing just runtime changes via `solana-ledger-tool`. For this case, the variable level of verbosity seems to fit:
   
   * You generate a file with just the components that go into bank hash as your "first level" for comparison to use as your known good set
   * You then develop some feature and want to test it; so you run and check against the known set
     
     * If the bank hashes match throughout replay, you're good to go
     * If not, then you should have the diverging slot. Then, you potentially go back and do `--write-bank-file` with the extra details (ie accounts) with the known-good-version AND your under-development-version, and diff those files to hone in.
From above, 1. is more me telling you how I use it; do you think 2. accurately captures how you would want to use this feature if we incorporate the "optional" details?

Yep 2 is exactly what I'd want! :)

seanyoung · 2023-12-01T12:09:37Z

Looks like we've settled on a design, it does look good. I'll implement it

seanyoung · 2023-12-19T10:44:31Z

Here is the re-worked version, that I've been using for debugging #34057. There is a new option --verify-slots-details which takes either none, tx, bank, or bank-tx. With bank-tx, for each slot the bank is included and also the transactions. The transactions are in batches and included the instructions/accounts/data and results, and also crucially a per-tx-batch bank hash, so you know exactly at which tx batch things go wrong.

Please let me know what you think.

steviez · 2023-12-19T17:21:35Z

Here is the re-worked version, that I've been using for debugging #34057. There is a new option --verify-slots-details which takes either none, tx, bank, or bank-tx. With bank-tx, for each slot the bank is included and also the transactions. The transactions are in batches and included the instructions/accounts/data and results, and also crucially a per-tx-batch bank hash, so you know exactly at which tx batch things go wrong.

Please let me know what you think.

Hi Sean, I'll take a look today. One request for the future - could you please include new changes in separate commits as opposed to combining with the existing commit ? Even if they end up getting combined on merge, it is nice to be able to flip through the commit-by-commit diffs (ie if I want to see what has changed with your latest update only)

steviez

Thanks for looking/thinking about this! I'm glad you are also interested / see value in adding information for mid-slot (ie tx-by-tx); I've done something similar and have heard others express value in getting that. The general direction of this PR is good, maybe just some tweaks to how we specifically do several things.

That being said, I think we should take an incremental approach to keep PR's smaller

I have a PR that extends the existing BankHashDetails to allow multiple slots: Allow bank hash file to contain details for multiple slots #34516
With that, we could update the file to optionally include / exclude the accounts contents
- The current BankHashDetails file includes them by default, but we probably want to exclude those details if we want to generate a file for 1000's of slots
Rebase this PR on top of the above two and enable ledger-tool to create / read those files
In a subsequent PR, add support for storing "intermediate" (ie tx-by-tx) details or other improvements

What do you think ?

ledger/src/blockstore_processor.rs

ledger-tool/src/main.rs

seanyoung · 2024-01-06T16:37:59Z

Thanks for looking/thinking about this! I'm glad you are also interested / see value in adding information for mid-slot (ie tx-by-tx); I've done something similar and have heard others express value in getting that. The general direction of this PR is good, maybe just some tweaks to how we specifically do several things.

That being said, I think we should take an incremental approach to keep PR's smaller
* I have a PR that extends the existing `BankHashDetails` to allow multiple slots: [Allow bank hash file to contain details for multiple slots #34516](https://github.com/solana-labs/solana/pull/34516)

* With that, we could update the file to optionally include / exclude the accounts contents
  
  * The current `BankHashDetails` file includes them by default, but we probably want to exclude those details if we want to generate a file for 1000's of slots

* Rebase this PR on top of the above two and enable ledger-tool to create / read those files

* In a subsequent PR, add support for storing "intermediate" (ie tx-by-tx) details or other improvements
What do you think ?

Makes a lot of sense. Now this PR does not include transactions but does optionally include the bank, and writes the new BankHashDetails format. @steviez please let me know what you think.

ledger-tool/src/main.rs

steviez · 2024-02-27T07:11:21Z

ledger-tool/src/main.rs

+                            // writing the json file ends up with a syscall for each number, comma, indentation etc.
+                            // use BufWriter to speed things up
+
+                            let writer = std::io::BufWriter::new(recorded_slots_file);
+
+                            serde_json::to_writer_pretty(writer, &bank_hashes).unwrap();


This is duplicate code with stuff we have in bank_hash_details.rs, would be nice to single source that

fn write_bank_hash_details_file() write a single bank, and here we are writing a Vec<BankHashDetails>. Also fn write_bank_hash_details_file() is used from fn dump_then_repair_correct_slots() as well.

I am not sure how I can consolidate the two.

Hmm yeah, maybe not as easy as I thought. I'd say we can leave it for now

The long term vision is implementing something like BankHashDetailsStreamer. This object could hold the serializer, and then dump out details bank-by-bank. Will definitely help out if someone runs the command with "accounts" mode on a large number of slots.

You could specify the config at the start, the object could retain the config. Then, when you call stream_bank() or whatever we decide to name it, the streamer extracts appropriate details (just the bank hash OR all the hashes + accounts).

To the end user, this is mostly invisible tho. So, like I said, I'm happy to push the PR as-is and then potentially go refactor after the fact

Sure, that makes sense. For virtual machine debugging it would be super-useful to have the transactions in the file too. Let start with this and go from there

ledger-tool/src/main.rs

steviez · 2024-02-29T09:11:51Z

ledger-tool/src/main.rs

+                        // .default_value() does not work with .conflicts_with() in clap 2.33
+                        // .conflicts_with("verify_slots")
+                        // https://github.com/clap-rs/clap/issues/1605#issuecomment-722326915


As you mentioned, would be nice if we were on recent CLAP so this bug didn't hinder us.

Instead, can we do the check ourselves instead. Ie, something like:

if arg_matches.is_present("record_slots") && arg_matches.is_present("verify_slots") { eprintln!("Some text, maybe mirror what the CLAP text would be if there weren't that bug"); }

What do you think ? Also, if we do this, we should do it fairly early in the "verify" match case, certainly before calling load_and_process_ledger_or_exit!()

Yes, good point. I've added it.

This adds: --record-slots <FILENAME> Write the slot hashes to this file. --record-slots-config hash-only|bank Store the bank (=accounts) json file, or not. --verify-slots <FILENAME> Verify slot hashes against this file. The first case can be used to dump a list of (slot, hash) to a json file during a replay. The second case can be used to check slot hashes against previously recorded values. This is useful for debugging consensus failures, eg: # on good commit/branch ledger-tool verify --record-slots good.json --record-slots-config=bank # on bad commit or potentially consensus breaking branch ledger-tool verify --verify-slots good.json On a hash mismatch an error will be logged with the expected hash vs the computed hash.

steviez

Cool, let's push this (especially with repo changeover) and we can tweak things from there

seanyoung · 2024-03-01T08:39:53Z

Thanks @steviez

…bs#34246) ledger-tool: verify: add --verify-slots and --verify-slots-details This adds: --record-slots <FILENAME> Write the slot hashes to this file. --record-slots-config hash-only|accounts Store the bank (=accounts) json file, or not. --verify-slots <FILENAME> Verify slot hashes against this file. The first case can be used to dump a list of (slot, hash) to a json file during a replay. The second case can be used to check slot hashes against previously recorded values. This is useful for debugging consensus failures, eg: # on good commit/branch ledger-tool verify --record-slots good.json --record-slots-config=accounts # on bad commit or potentially consensus breaking branch ledger-tool verify --verify-slots good.json On a hash mismatch an error will be logged with the expected hash vs the computed hash.

seanyoung requested review from alessandrod, Lichtso and steviez November 28, 2023 19:54

seanyoung marked this pull request as ready for review November 28, 2023 19:54

seanyoung force-pushed the ledger-tool-check-slot-hashes branch 2 times, most recently from dd47f23 to efd77f3 Compare December 5, 2023 15:26

steviez mentioned this pull request Dec 19, 2023

Allow bank hash file to contain details for multiple slots #34516

Merged

seanyoung force-pushed the ledger-tool-check-slot-hashes branch from efd77f3 to 53a74c4 Compare December 19, 2023 10:15

seanyoung changed the title ~~ledger-tool: verify: add --verify-slot-hashes~~ ledger-tool: verify: add --verify-slots and --verify-slots-details Dec 19, 2023

seanyoung force-pushed the ledger-tool-check-slot-hashes branch 2 times, most recently from 4e0b427 to 57ed4a9 Compare December 19, 2023 10:40

seanyoung force-pushed the ledger-tool-check-slot-hashes branch 3 times, most recently from 41f85e8 to 678940b Compare December 19, 2023 11:49

steviez reviewed Dec 19, 2023

View reviewed changes

ledger/src/blockstore_processor.rs Outdated Show resolved Hide resolved

ledger-tool/src/main.rs Outdated Show resolved Hide resolved

ledger-tool/src/main.rs Outdated Show resolved Hide resolved

github-actions bot added stale [bot only] Added to stale content; results in auto-close after a week. and removed stale [bot only] Added to stale content; results in auto-close after a week. labels Jan 3, 2024

seanyoung force-pushed the ledger-tool-check-slot-hashes branch from 678940b to 8013ba6 Compare January 5, 2024 18:01

seanyoung force-pushed the ledger-tool-check-slot-hashes branch from 8013ba6 to edcbb88 Compare January 6, 2024 15:21

github-actions bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 22, 2024

steviez removed the stale [bot only] Added to stale content; results in auto-close after a week. label Jan 23, 2024

github-actions bot added stale [bot only] Added to stale content; results in auto-close after a week. and removed stale [bot only] Added to stale content; results in auto-close after a week. labels Feb 6, 2024

seanyoung force-pushed the ledger-tool-check-slot-hashes branch 2 times, most recently from 258d0eb to 4aa9daa Compare February 13, 2024 00:13

steviez reviewed Feb 23, 2024

View reviewed changes

ledger-tool/src/main.rs Outdated Show resolved Hide resolved

seanyoung force-pushed the ledger-tool-check-slot-hashes branch from 4aa9daa to 4e7f09d Compare February 24, 2024 10:13

steviez reviewed Feb 27, 2024

View reviewed changes

seanyoung force-pushed the ledger-tool-check-slot-hashes branch 7 times, most recently from 88a01d0 to 0f9eec7 Compare February 28, 2024 12:02

steviez reviewed Feb 29, 2024

View reviewed changes

seanyoung force-pushed the ledger-tool-check-slot-hashes branch 2 times, most recently from e707b53 to 20a8add Compare February 29, 2024 09:38

seanyoung force-pushed the ledger-tool-check-slot-hashes branch from 20a8add to e334f34 Compare February 29, 2024 14:46

steviez approved these changes Mar 1, 2024

View reviewed changes

seanyoung changed the title ~~ledger-tool: verify: add --verify-slots and --verify-slots-details~~ ledger-tool: verify: add --record-slots and --verify-slots Mar 1, 2024

seanyoung merged commit 9bb59aa into solana-labs:master Mar 1, 2024
35 checks passed

seanyoung deleted the ledger-tool-check-slot-hashes branch March 1, 2024 08:39

steviez mentioned this pull request Mar 14, 2024

Add option to record transactions to ledger-tool anza-xyz/agave#181

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ledger-tool: verify: add --record-slots and --verify-slots #34246

ledger-tool: verify: add --record-slots and --verify-slots #34246

seanyoung commented Nov 28, 2023 •

edited

Loading

seanyoung commented Nov 28, 2023

codecov bot commented Nov 28, 2023 •

edited

Loading

alessandrod commented Nov 28, 2023

steviez commented Nov 29, 2023

alessandrod commented Nov 29, 2023

steviez commented Nov 30, 2023

seanyoung commented Nov 30, 2023

steviez commented Dec 1, 2023

alessandrod commented Dec 1, 2023

seanyoung commented Dec 1, 2023

seanyoung commented Dec 19, 2023

steviez commented Dec 19, 2023

steviez left a comment •

edited

Loading

seanyoung commented Jan 6, 2024

steviez Feb 27, 2024

seanyoung Feb 27, 2024

steviez Feb 29, 2024

seanyoung Feb 29, 2024

steviez Feb 29, 2024

seanyoung Feb 29, 2024

steviez left a comment

seanyoung commented Mar 1, 2024

ledger-tool: verify: add --record-slots and --verify-slots #34246

ledger-tool: verify: add --record-slots and --verify-slots #34246

Conversation

seanyoung commented Nov 28, 2023 • edited Loading

Problem

Summary of Changes

seanyoung commented Nov 28, 2023

codecov bot commented Nov 28, 2023 • edited Loading

Codecov Report

alessandrod commented Nov 28, 2023

steviez commented Nov 29, 2023

alessandrod commented Nov 29, 2023

steviez commented Nov 30, 2023

seanyoung commented Nov 30, 2023

steviez commented Dec 1, 2023

alessandrod commented Dec 1, 2023

seanyoung commented Dec 1, 2023

seanyoung commented Dec 19, 2023

steviez commented Dec 19, 2023

steviez left a comment • edited Loading

Choose a reason for hiding this comment

seanyoung commented Jan 6, 2024

steviez Feb 27, 2024

Choose a reason for hiding this comment

seanyoung Feb 27, 2024

Choose a reason for hiding this comment

steviez Feb 29, 2024

Choose a reason for hiding this comment

seanyoung Feb 29, 2024

Choose a reason for hiding this comment

steviez Feb 29, 2024

Choose a reason for hiding this comment

seanyoung Feb 29, 2024

Choose a reason for hiding this comment

steviez left a comment

Choose a reason for hiding this comment

seanyoung commented Mar 1, 2024

seanyoung commented Nov 28, 2023 •

edited

Loading

codecov bot commented Nov 28, 2023 •

edited

Loading

steviez left a comment •

edited

Loading