Data corruption #3509

Closed
andresilva opened this issue Feb 15, 2018 · 11 comments

Comments

@andresilva

andresilva commented Feb 15, 2018

Expected behavior

Don't cause data corruption.

Actual behavior

Corruption: block checksum mismatch: expected 3482696050, got 3888739091 in parity/chains/ethereum/db/906a34e69aec8c0d/archive/db/011705.sst offset 35210665 size 16261

Steps to reproduce the behavior

Non-deterministic.


I’m one of the contributors to Parity, one of the Ethereum clients. We use RocksDB to store all of the client’s data, and it is currently running on 5000+ nodes ranging from server-grade to commodity hardware, mostly with SSDs. Parity is implemented in Rust, so we use our own library to wrap RocksDB’s C API (https://github.com/paritytech/rust-rocksdb/).

A couple of users have been reporting data corruption. We’ve tried our best to determine whether this is a hardware error, but some users see the issue across different hard drives. We recently fixed an issue where we wouldn’t properly close RocksDB on shutdown, yet users are still reporting corruption after that fix was rolled out. So far we’ve been unable to reproduce any of these issues ourselves, so we’re relying on the data and feedback our users can provide.
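
For context, here is a minimal sketch of what a clean open/close cycle looks like at the RocksDB C++ layer. The actual fix lives in the Rust wrapper around the C API, so this is only illustrative, and the database path is hypothetical.

#include <cassert>
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;

  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
  assert(s.ok());

  // ... normal reads and writes ...

  // Clean shutdown: flush the memtable so nothing depends on WAL replay,
  // then destroy the handle, which releases the LOCK file and open file handles.
  db->Flush(rocksdb::FlushOptions());
  delete db;  // the C API equivalent is rocksdb_close()
  return 0;
}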

Attached is a RocksDB log provided by one of our users that shows the corruption issue.

Is there anything we should be watching for that could be triggering this corruption, or should we just try to rule out faulty hardware?

rocksdb005.log

@siying
Contributor

siying commented Feb 23, 2018

Based on the log file, the corrupted file has been successfully compacted without a checksum error:

2018/01/31-21:16:29.024324 7f514a9ff700 [rocksdb/db/compaction_job.cc:1453] [col1] Compaction start summary: Base version 5663 Base level 0, inputs: [11737(251KB) 11720(205KB) 11705(262KB) 11690(326KB) 11670(368KB)], [11654(64MB) 11655(15MB)]
2018/01/31-21:16:29.024336 7f514a9ff700 EVENT_LOG_v1 {"time_micros": 1517462189024328, "job": 5645, "event": "compaction_started", "files_L0": [11737, 11720, 11705, 11690, 11670], "files_L5": [11654, 11655], "score": 2.5, "input_data_size": 84790680}
...
2018/01/31-21:16:30.144267 7f514a9ff700 [rocksdb/db/compaction_job.cc:1211] [col1] [JOB 5645] Compacted 5@0 + 2@5 files to L5 => 84810297 bytes
2018/01/31-21:16:30.154410 7f514a9ff700 (Original Log Time 2018/01/31-21:16:30.154339) [rocksdb/db/compaction_job.cc:629] [col1] compacted to: base level 5 max bytes base 93847552 files[1 0 0 0 0 2 5] max score 0.90, MB/sec: 75.7 rd, 75.7 wr, level 5, files in(5, 2) out(2) MB in(1.4, 79.5) out(80.9), read-write-amplify(117.1) write-amplify(58.6) OK, records in: 319359, records dropped: 0
2018/01/31-21:16:30.154415 7f514a9ff700 (Original Log Time 2018/01/31-21:16:30.154368) EVENT_LOG_v1 {"time_micros": 1517462190154356, "job": 5645, "event": "compaction_finished", "compaction_time_micros": 1119893, "output_level": 5, "num_output_files": 2, "total_output_size": 84810297, "num_input_records": 319359, "num_output_records": 319359, "num_subcompactions": 1, "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [1, 0, 0, 0, 0, 2, 5]}
...
2018/01/31-21:16:30.154762 7f514a9ff700 EVENT_LOG_v1 {"time_micros": 1517462190154759, "job": 5645, "event": "table_file_deletion", "file_number": 11705}

It means that the file was OK at least when generated.
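
As a side note, newer RocksDB releases expose DB::VerifyChecksum(), which re-reads every live SST block and recomputes its checksum; running it periodically can help narrow down when the on-disk data went bad. A rough sketch (the path is hypothetical, and a database with extra column families would need the open-with-column-families variant):

#include <iostream>
#include "rocksdb/db.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  rocksdb::Status s = rocksdb::DB::Open(options, "/path/to/archive/db", &db);
  if (!s.ok()) {
    std::cerr << "open failed: " << s.ToString() << std::endl;
    return 1;
  }
  s = db->VerifyChecksum();  // returns a Corruption status on a checksum mismatch
  std::cout << (s.ok() ? "checksums OK" : s.ToString()) << std::endl;
  delete db;
  return s.ok() ? 0 : 1;
}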

It is theoretically possible that RocksDB somehow forgot to call fsync() or failed to notice an fsync() failure, but to me that is unlikely. We just fixed a bug related to that in 5.6 and added unit tests, so the version you are running (5.8.8) is relatively well covered in this respect.
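
For anyone who wants to rule out the missing-fsync theory on their own workload, here is a hedged sketch (not Parity's actual code) of forcing synced writes through the C++ API; setting use_fsync in Options additionally makes SST file writes use fsync instead of fdatasync.

#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

rocksdb::Status write_synced(rocksdb::DB* db) {
  rocksdb::WriteBatch batch;
  batch.Put("key", "value");  // hypothetical key/value

  rocksdb::WriteOptions wopts;
  wopts.sync = true;          // fsync/fdatasync the WAL before acknowledging the write

  return db->Write(wopts, &batch);
}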

How did you copy the data into the parity/chains/ethereum/db/906a34e69aec8c0d/archive/db/ directory? Is it possible that the copy had a problem?

@andresilva
Author

Thank you for looking into this. The database was populated by syncing the Ethereum blockchain, so the data wasn't copied as a whole. I've looked at the previously reported corruption issues (related to DeleteRange), but we don't use that feature. In fact, we don't really use any "advanced" RocksDB features.

@siying
Contributor

siying commented Mar 14, 2018

@andresilva what does "populated by syncing the Ethereum blockchain" mean? Do you mean that all the data is generated in place in parity/chains/ethereum/db/906a34e69aec8c0d/archive/db/ and is never moved?

@andresilva
Author

Sorry for being vague in my description. Yes, I meant that all the data is generated in place. It is possible that some users move their data folder (and yes, this could corrupt the data), but under normal operation the data is never moved and is only interacted with through the RocksDB API (no files are touched). In this particular case, for which I provided logs, the data wasn't moved.

@siying
Contributor

siying commented Mar 14, 2018

@andresilva I don't understand then. As I pasted above, the file 011705.sst was supposed to have been deleted at 1/31 21:16:

2018/01/31-21:16:30.154762 7f514a9ff700 EVENT_LOG_v1 {"time_micros": 1517462190154759, "job": 5645, "event": "table_file_deletion", "file_number": 11705}

When did you see the corruption reported?

But anyway, I saw some corruption reported at the end of the log file for another file. I will dig into that one further.

@siying
Contributor

siying commented Mar 14, 2018

There isn't much information to dig into further. If you run sst_dump against the file parity/chains/ethereum/db/906a34e69aec8c0d/archive/db/004774.sst, what is the output?
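
For reference, sst_dump can read through every block of the file and verify its checksums. A rough C++ equivalent using SstFileReader is sketched below; SstFileReader only appeared in releases newer than the 5.8.8 build discussed here, so treat it as an illustration of what the tool does rather than something to run against that exact version.

#include <iostream>
#include "rocksdb/options.h"
#include "rocksdb/sst_file_reader.h"

int main(int argc, char** argv) {
  if (argc < 2) {
    std::cerr << "usage: check_sst <path/to/file.sst>" << std::endl;
    return 1;
  }
  rocksdb::Options options;  // default comparator and table factory
  rocksdb::SstFileReader reader(options);

  rocksdb::Status s = reader.Open(argv[1]);
  if (s.ok()) {
    s = reader.VerifyChecksum();  // re-reads every block and recomputes its CRC
  }
  std::cout << (s.ok() ? "OK" : s.ToString()) << std::endl;
  return s.ok() ? 0 : 1;
}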

@fanatid

fanatid commented May 28, 2018

I keep receiving this error. I have tried 3 different SSDs (NVMe & SATA); one of them was completely new. Memtest86 says the memory is OK.
Right now there are ~50 issues with the same error in the Parity repository: https://github.com/paritytech/parity/issues?utf8=%E2%9C%93&q=checksum+mismatch
@andresilva if I can help with testing, let me know.

@damageco

damageco commented May 18, 2020

An old bug, but it is still current for me. I'm testing the IOTA project. The RocksDB version used is quite old, since it's version 6.3.6.

Here is the stack trace:

org.rocksdb.RocksDBException: block checksum mismatch: expected 335318565, got 604031437  in /iri/data//002913.sst offset 11521057 size 2477
	at org.rocksdb.RocksDB.write0(Native Method) ~[iri-1.9.0.jar:1.9.0]
	at org.rocksdb.RocksDB.write(RocksDB.java:1421) ~[iri-1.9.0.jar:1.9.0]
	at com.iota.iri.storage.rocksDB.RocksDBPersistenceProvider.saveBatch(RocksDBPersistenceProvider.java:358) ~[iri-1.9.0.jar:1.9.0]
	at com.iota.iri.storage.Tangle.saveBatch(Tangle.java:148) ~[iri-1.9.0.jar:1.9.0]
	at com.iota.iri.controllers.TransactionViewModel.store(TransactionViewModel.java:450) ~[iri-1.9.0.jar:1.9.0]
	at com.iota.iri.network.pipeline.ReceivedStage.process(ReceivedStage.java:55) ~[iri-1.9.0.jar:1.9.0]
	at com.iota.iri.network.pipeline.TransactionProcessingPipelineImpl.lambda$addStage$0(TransactionProcessingPipelineImpl.java:125) [iri-1.9.0.jar:1.9.0]
	at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_181]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_181]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_181]

Any clue?

@fanatid

fanatid commented May 18, 2020

@damageco in my case the problem was in RAM, but memtest did not detect it. I found it by manually removing each module until Parity was OK.

@adamretter
Collaborator

It is hard to say what would cause this. It could possibly even be a storage issue with your hardware.

@ramvadiv
Contributor

Since we have not had any follow-up data, we are closing this issue.
