Data corruption #3509

Closed
andresilva opened this issue Feb 15, 2018 · 11 comments

Comments

@andresilva

andresilva commented Feb 15, 2018

Expected behavior

Don't cause data corruption.

Actual behavior

Corruption: block checksum mismatch: expected 3482696050, got 3888739091 in parity/chains/ethereum/db/906a34e69aec8c0d/archive/db/011705.sst offset 35210665 size 16261

Steps to reproduce the behavior

Non-deterministic.


I’m one of the contributors to Parity, one of the Ethereum clients. We use RocksDB to store all of the client’s data, and it is currently running on 5000+ nodes ranging from server-grade to commodity hardware, mostly with SSDs. Parity is implemented in Rust, so we use our own library to wrap RocksDB’s C API (https://github.com/paritytech/rust-rocksdb/).

A couple of users have been reporting data corruption. We’ve tried our best to determine whether this is a hardware error, but some users see the issue across different hard drives. We recently fixed an issue where we wouldn’t properly close RocksDB on shutdown, yet users are still reporting corruption after that fix was rolled out. So far we’ve been unable to reproduce any of these issues ourselves, so we’re relying on the data and feedback our users can provide.
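
For context, here is a minimal sketch of what a clean open/close cycle looks like at the RocksDB C++ layer. The actual fix lives in the Rust wrapper around the C API, so this is only illustrative, and the database path is hypothetical.

#include <cassert>
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;

  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
  assert(s.ok());

  // ... normal reads and writes ...

  // Clean shutdown: flush the memtable so nothing depends on WAL replay,
  // then destroy the handle, which releases the LOCK file and open file handles.
  db->Flush(rocksdb::FlushOptions());
  delete db;  // the C API equivalent is rocksdb_close()
  return 0;
}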

Attached is a RocksDB log provided by one of our users that shows the corruption issue.

Is there anything we should be watching for that could be triggering this corruption, or should we just try to rule out faulty hardware?

rocksdb005.log

@siying
Contributor

siying commented Feb 23, 2018

Based on the log file, the corrupted file has been successfully compacted without a checksum error:

2018/01/31-21:16:29.024324 7f514a9ff700 [rocksdb/db/compaction_job.cc:1453] [col1] Compaction start summary: Base version 5663 Base level 0, inputs: [11737(251KB) 11720(205KB) 11705(262KB) 11690(326KB) 11670(368KB)], [11654(64MB) 11655(15MB)]
2018/01/31-21:16:29.024336 7f514a9ff700 EVENT_LOG_v1 {"time_micros": 1517462189024328, "job": 5645, "event": "compaction_started", "files_L0": [11737, 11720, 11705, 11690, 11670], "files_L5": [11654, 11655], "score": 2.5, "input_data_size": 84790680}
...
2018/01/31-21:16:30.144267 7f514a9ff700 [rocksdb/db/compaction_job.cc:1211] [col1] [JOB 5645] Compacted 5@0 + 2@5 files to L5 => 84810297 bytes
2018/01/31-21:16:30.154410 7f514a9ff700 (Original Log Time 2018/01/31-21:16:30.154339) [rocksdb/db/compaction_job.cc:629] [col1] compacted to: base level 5 max bytes base 93847552 files[1 0 0 0 0 2 5] max score 0.90, MB/sec: 75.7 rd, 75.7 wr, level 5, files in(5, 2) out(2) MB in(1.4, 79.5) out(80.9), read-write-amplify(117.1) write-amplify(58.6) OK, records in: 319359, records dropped: 0
2018/01/31-21:16:30.154415 7f514a9ff700 (Original Log Time 2018/01/31-21:16:30.154368) EVENT_LOG_v1 {"time_micros": 1517462190154356, "job": 5645, "event": "compaction_finished", "compaction_time_micros": 1119893, "output_level": 5, "num_output_files": 2, "total_output_size": 84810297, "num_input_records": 319359, "num_output_records": 319359, "num_subcompactions": 1, "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [1, 0, 0, 0, 0, 2, 5]}
...
2018/01/31-21:16:30.154762 7f514a9ff700 EVENT_LOG_v1 {"time_micros": 1517462190154759, "job": 5645, "event": "table_file_deletion", "file_number": 11705}

It means that the file was OK at least when generated.
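
As a side note, newer RocksDB releases expose DB::VerifyChecksum(), which re-reads every live SST block and recomputes its checksum; running it periodically can help narrow down when the on-disk data went bad. A rough sketch (the path is hypothetical, and a database with extra column families would need the open-with-column-families variant):

#include <iostream>
#include "rocksdb/db.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  rocksdb::Status s = rocksdb::DB::Open(options, "/path/to/archive/db", &db);
  if (!s.ok()) {
    std::cerr << "open failed: " << s.ToString() << std::endl;
    return 1;
  }
  s = db->VerifyChecksum();  // returns a Corruption status on a checksum mismatch
  std::cout << (s.ok() ? "checksums OK" : s.ToString()) << std::endl;
  delete db;
  return s.ok() ? 0 : 1;
}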

It is theoretically possible that RocksDB somehow forgot to call fsync() or failed to notice an fsync() failure, but to me that is unlikely. We just fixed a bug related to that in 5.6 and added unit tests, so the version you are running (5.8.8) is relatively well covered in this respect.
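
For anyone who wants to rule out the missing-fsync theory on their own workload, here is a hedged sketch (not Parity's actual code) of forcing synced writes through the C++ API; setting use_fsync in Options additionally makes SST file writes use fsync instead of fdatasync.

#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

rocksdb::Status write_synced(rocksdb::DB* db) {
  rocksdb::WriteBatch batch;
  batch.Put("key", "value");  // hypothetical key/value

  rocksdb::WriteOptions wopts;
  wopts.sync = true;          // fsync/fdatasync the WAL before acknowledging the write

  return db->Write(wopts, &batch);
}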

How did you copy the data into the parity/chains/ethereum/db/906a34e69aec8c0d/archive/db/ directory? Is it possible that the copy had a problem?

@andresilva
Author

Thank you for looking into this. The database was populated by syncing the Ethereum blockchain, so the data wasn't copied as a whole. I've looked at the previously reported corruption issues (related to DeleteRange), but we don't use that feature. In fact, we don't really use any "advanced" RocksDB features.

@siying
Contributor

siying commented Mar 14, 2018

@andresilva what does "populated by syncing the Ethereum blockchain" mean? Do you mean that all the data is generated in place in parity/chains/ethereum/db/906a34e69aec8c0d/archive/db/ and is never moved?

@andresilva
Author

Sorry for being vague in my description. Yes, I meant that all the data is generated in place. It is possible that some users move their data folder (and yes, this could corrupt the data), but under normal operation the data is never moved and is only interacted with through the RocksDB API (no files are touched). In this particular case, for which I provided logs, the data wasn't moved.

@siying
Contributor

siying commented Mar 14, 2018

@andresilva I don't understand then. As I pasted above, the file 011705.sst was supposed to have been deleted at 1/31 21:16:

2018/01/31-21:16:30.154762 7f514a9ff700 EVENT_LOG_v1 {"time_micros": 1517462190154759, "job": 5645, "event": "table_file_deletion", "file_number": 11705}

When did you see the corruption reported?

But anyway, I saw some corruption reported at the end of the log file for another file. I will dig into that one further.

@siying
Contributor

siying commented Mar 14, 2018

There isn't much information to dig into further. If you run sst_dump against the file parity/chains/ethereum/db/906a34e69aec8c0d/archive/db/004774.sst, what is the output?
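
For reference, sst_dump can read through every block of the file and verify its checksums. A rough C++ equivalent using SstFileReader is sketched below; SstFileReader only appeared in releases newer than the 5.8.8 build discussed here, so treat it as an illustration of what the tool does rather than something to run against that exact version.

#include <iostream>
#include "rocksdb/options.h"
#include "rocksdb/sst_file_reader.h"

int main(int argc, char** argv) {
  if (argc < 2) {
    std::cerr << "usage: check_sst <path/to/file.sst>" << std::endl;
    return 1;
  }
  rocksdb::Options options;  // default comparator and table factory
  rocksdb::SstFileReader reader(options);

  rocksdb::Status s = reader.Open(argv[1]);
  if (s.ok()) {
    s = reader.VerifyChecksum();  // re-reads every block and recomputes its CRC
  }
  std::cout << (s.ok() ? "OK" : s.ToString()) << std::endl;
  return s.ok() ? 0 : 1;
}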

@fanatid

fanatid commented May 28, 2018

I keep receiving this error. I have tried 3 different SSDs (NVMe & SATA); one of them was completely new. Memtest86 says the memory is OK.
Right now there are ~50 issues with the same error in the Parity repository: https://github.com/paritytech/parity/issues?utf8=%E2%9C%93&q=checksum+mismatch
@andresilva if I can help with testing, let me know.

@damageco

damageco commented May 18, 2020

An old bug, but it is still current for me. I'm testing the IOTA project. The RocksDB version used is quite old, since it's version 6.3.6.

Here is the stack trace:

org.rocksdb.RocksDBException: block checksum mismatch: expected 335318565, got 604031437  in /iri/data//002913.sst offset 11521057 size 2477
	at org.rocksdb.RocksDB.write0(Native Method) ~[iri-1.9.0.jar:1.9.0]
	at org.rocksdb.RocksDB.write(RocksDB.java:1421) ~[iri-1.9.0.jar:1.9.0]
	at com.iota.iri.storage.rocksDB.RocksDBPersistenceProvider.saveBatch(RocksDBPersistenceProvider.java:358) ~[iri-1.9.0.jar:1.9.0]
	at com.iota.iri.storage.Tangle.saveBatch(Tangle.java:148) ~[iri-1.9.0.jar:1.9.0]
	at com.iota.iri.controllers.TransactionViewModel.store(TransactionViewModel.java:450) ~[iri-1.9.0.jar:1.9.0]
	at com.iota.iri.network.pipeline.ReceivedStage.process(ReceivedStage.java:55) ~[iri-1.9.0.jar:1.9.0]
	at com.iota.iri.network.pipeline.TransactionProcessingPipelineImpl.lambda$addStage$0(TransactionProcessingPipelineImpl.java:125) [iri-1.9.0.jar:1.9.0]
	at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_181]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_181]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_181]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_181]
	at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_181]

Any clue?

@fanatid

fanatid commented May 18, 2020

@damageco in my case the problem was in RAM, but memtest did not detect it. I found it by manually removing each module until Parity was OK.

@adamretter
Collaborator

It is hard to say what would cause this. It could possibly even be a storage issue with your hardware.

@ramvadiv
Contributor

Since we have not had any follow-up data, we are closing this issue.
