Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate ledger snapshot checksum #6047

Draft
wants to merge 3 commits into
base: release/10.1.3
Choose a base branch
from

Conversation

geo2a
Copy link

@geo2a geo2a commented Dec 4, 2024

Description

IntersectMBO/ouroboros-consensus#1319, when merged, will checksum the ledger state snapshots to detect data corruption. This PR integrates this change into cardano-node.

In addition to the snapshot file itself, a .checksum file is written to dist, which contains the CRC32 checksum of the snapshot data:

> ls mainnet/db/ledger
2223  2223.checksum

> cat mainnet/db/ledger/2223.checksum
8b578fb4

To preserve backwards compatibility, the absence of the checksum file does not block the node from starting up. Instead, a warning is issued when restoring from a snapshot that is missing the checksum file:

...
Listening on http://127.0.0.1:12798
[geo2a-la:cardano.node.ChainDB:Warning:5] [2024-12-04 13:09:30.72 UTC] Checksum file is missing for snapshot DiskSnapshot {dsNumber = 2223, dsSuffix = Nothing}
...

If the checksum differs from the file, there's a number of scenarios that all ultimately lead to the deletion of the snapshot and an attempt to restore from an older one:

  • Data corruption of the snapshot file:
    [geo2a-la:cardano.node.ChainDB:Error:5] [2024-12-04 13:18:42.93 UTC] Invalid snapshot DiskSnapshot {dsNumber = 2223, dsSuffix = Nothing}InitFailureRead ReadSnapshotDataCorruption This is most likely an expected change in the serialization format, which currently requires a chain replay
    
    To model data corruption for testing purposes, one could use dd, i.e.:
    printf '\x0\x0' | dd of=mainnet/db/ledger/2223 bs=1 seek=100 count=3 conv=notrunc
  • The .checksum file is invalid:
    [geo2a-la:cardano.node.ChainDB:Error:5] [2024-12-04 13:12:29.41 UTC] Invalid snapshot DiskSnapshot {dsNumber = 2223, dsSuffix = Nothing}InitFailureRead (ReadSnapshotInvalidChecksumFile 21127.checksum) This is most likely an expected change in the serialization format, which currently requires a chain replay
    
    that means violated the following condition: length str == 8 && all isHexDigit str

The feature is on by default and can be disabled by the following configuration option:

DoDiskSnapshotChecksum: false

@geo2a geo2a force-pushed the geo2a/10.1.3-ledger-snapshot-checksum branch 3 times, most recently from 76334d7 to a208e5a Compare December 9, 2024 14:50
- Categorise `LedgerDB.SnapshotMissingChecksum` trace as `Warning`
- Expose snapshot checksum switch in config file
@geo2a geo2a force-pushed the geo2a/10.1.3-ledger-snapshot-checksum branch 2 times, most recently from bb20145 to 00efdba Compare December 9, 2024 15:35
github-merge-queue bot pushed a commit to IntersectMBO/ouroboros-consensus that referenced this pull request Dec 10, 2024
…1319)

Fixes #892

Integration into `cardano-node`:
IntersectMBO/cardano-node#6047. This uses the
branch
[geo2a/issue-892-checksum-snaphot-file-release-ouroboros-consensus-0.21.0.0-backport](https://github.com/IntersectMBO/ouroboros-consensus/tree/geo2a/issue-892-checksum-snaphot-file-release-ouroboros-consensus-0.21.0.0-backport)
which is the backport of this PR onto the most resent release of the
`ouroboros-consensus` package.

In this PR, we change the reading and writing disk snapshots of ledger
state. When a snapshot is taken and written to disk, an additional file
with the `.checksum` extension is written alongside it. The checksum
file contains a string that represent the CRC32 checksum of the
snapshot.

The checksum is calculated incrementally, alongside writing the snapshot
to disk. When a snapshot is read from dist, the checksum is again
calculated and compared to the tracked one. If the checksum is
different, `readSnaphot` returns the `ReadSnapshotDataCorruption` error,
indicating data corruption.

The checksum is calculated incrementally, alongside reading a writing
the data. On write, we use the
[`hPutAllCRC`](https://input-output-hk.github.io/fs-sim/fs-api/src/System.FS.CRC.html#hPutAllCRC)
function from `fs-sim`, and on read we modify the
[readIncremental](https://github.com/IntersectMBO/ouroboros-consensus/blob/892-checksum-snaphot-file/ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Util/CBOR.hs#L191)
function to compute the checksum as data is read.

To enable seamless integration into `cardano-node`, we make the check
optional. When initialising the ledger state from a snapshot in
`initLedgerDB`, we issue a warning in case the checksum file is missing
for a snapshot, but do not fail as in case of invalid snapshots.

The `db-analyser` tool ignores the checksum files by default when
reading the snapshots. We add `--disk-snapshot-checksum` flag to enabled
the check. When writing a snapshot to disk, e.g. as a result of the
`--store-ledger` analysis, `db-analyser` will always write
calculate the checksum and write it into the snapshot's `.checksum`
file.

**Tests**

There state machine test in `Test.Ouroboros.Storage.LedgerDB.OnDisk` is
relevant to this feature, and has caught a number of silly mistakes in
the process of its implementation, for example forgetting to delete a
checksum file when the snapshot is deleted.

The model in the test does not track checksums, and I do not think it
can (or should) be augmented to do that. Howerver, the `Snap` and
`Restore` events are now parameterised by the checksum flag, and the
values for the flag are randomised when generating these events. This
leads to testing the following properties:
- this feature is backwards-compatible, i.e. the `Restore` events will
always lead to restoring from a snapshot, even if `Snap` events do not
write checksum files (i.e. their flag is `NoDoDiskSnapshotChecksum` ~=
`False`).
- If the interpretation of the `DoDiskSnapshotChecksum` flag changes in
the code base and becomes strict, i.e. hard fail if the checksum file is
missing, this test will discover that.

**Effects on Performance**:

Running `db-analyser` to read a ledger snapshot and store the snapshot
of the state at the next slot shows a difference of 2 seconds on my
machine. See a comment below for the logs.

To precisely evaluate the effects, we need a micro-benchmark of the
reading and writing of snapshots with and without the checksum
calculation.
@geo2a geo2a force-pushed the geo2a/10.1.3-ledger-snapshot-checksum branch from 00efdba to 58c5f1f Compare December 12, 2024 12:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants