tip: 461
title: Optimize data consistency for system abnormals
author: ray.wu <ray.wu@tron.network>
discussions to: https://github.com/tronprotocol/tips/issues/461
status: Final
type: Standards Track
category: Core
created: 2022-09-02
This TIP proposes a solution to TRON database inconsistency when the system is shut down ungracefully.
At present, most public chains in the industry use embedded databases as their underlying storage, and TRON is no exception. TRON currently supports two kinds of KV storage, LevelDB and RocksDB, with LevelDB as the default configuration and the storage solution adopted by most nodes. However, using LevelDB as storage currently has a problem: when the machine goes down abnormally, there is a certain chance that the database will become inconsistent. This article describes the causes of this database corruption and the corresponding solutions.
TRON adopts DPoS as its consensus mechanism. Unlike PoW, in which block finality is probabilistic, DPoS has explicit block finality logic. TRON currently has 27 SR nodes, and only after a block has been approved by 2/3 + 1 of the SRs is it considered final. At the database level this means: blocks that have not reached finality are stored in the in-memory database, while blocks marked as final need to be written to the underlying database (LevelDB) in time.
TRON currently adopts a multi-layer storage architecture of memory + LevelDB. Different types of data correspond to different LevelDB instances, and block data that has not reached finality is stored in the in-memory database. When a block becomes final, the data in the in-memory database is flushed to the corresponding LevelDB databases. However, LevelDB does not support atomic batch writes across multiple instances, so if the system exits abnormally during the writing process, data inconsistency is likely.
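As a hedged illustration of this non-atomicity (this is not TRON code; it uses the org.iq80.leveldb Java API with purely illustrative paths, keys, and values), the sketch below commits one block's changes to two separate LevelDB instances with two independent batch writes, so a crash between the two calls leaves the instances inconsistent:

import java.io.File;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;
import org.iq80.leveldb.WriteBatch;
import static org.iq80.leveldb.impl.Iq80DBFactory.bytes;
import static org.iq80.leveldb.impl.Iq80DBFactory.factory;

public class MultiInstanceWriteDemo {
  public static void main(String[] args) throws Exception {
    // Two independent LevelDB instances, e.g. an "account" store and a "block" store.
    DB accountDb = factory.open(new File("output/account"), new Options().createIfMissing(true));
    DB blockDb = factory.open(new File("output/block"), new Options().createIfMissing(true));
    try (WriteBatch accountBatch = accountDb.createWriteBatch()) {
      accountBatch.put(bytes("demo-address"), bytes("balance-after-block-100"));
      accountDb.write(accountBatch); // atomic, but only within this one instance
    }
    // An abnormal shutdown here leaves the account store at block 100 while the
    // block store is still at block 99: the two databases are now inconsistent.
    try (WriteBatch blockBatch = blockDb.createWriteBatch()) {
      blockBatch.put(bytes("block-100"), bytes("raw-block-bytes"));
      blockDb.write(blockBatch);
    }
    accountDb.close();
    blockDb.close();
  }
}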
To solve this problem, TRON introduced the V2 version of the database, which added the Checkpoint mechanism. Before flushing the in-memory database to LevelDB, a Checkpoint is created. The Checkpoint contains all the finalized data of this block, and LevelDB's `writeBatch` function is used to ensure that the data is written atomically. After the Checkpoint has been written successfully, the flush operation of the business databases is performed.
The system may go down abnormally while the databases are being written. When the node restarts, it automatically checks whether the Checkpoint exists; if it does, the data in the Checkpoint is written back to the databases again. In this way, the problem of data inconsistency can be solved.
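A minimal sketch of this recovery step, assuming (as in the checkpoint layout described later in this TIP) that each checkpoint key carries the name of its target database as a prefix; decodeDbName, stripPrefix and getDbByName are hypothetical helpers, not the actual java-tron API:

// Replay a checkpoint found at startup back into the business databases.
// decodeDbName / stripPrefix / getDbByName are illustrative helper names.
void recoverFromCheckpoint(Map<byte[], byte[]> checkpointEntries) {
  for (Map.Entry<byte[], byte[]> e : checkpointEntries.entrySet()) {
    String dbName = decodeDbName(e.getKey());           // which business database this entry belongs to
    byte[] originalKey = stripPrefix(e.getKey());       // the key without the database-name prefix
    getDbByName(dbName).put(originalKey, e.getValue()); // write the finalized data back
  }
  // Once every entry has been replayed, the business databases are consistent again.
}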
But there is currently a problem: to ensure write performance, all the LevelDB databases use the asynchronous write mode. In this mode, even if the `writeBatch` function returns normally, there is no guarantee that the Checkpoint data has actually reached the disk, because it may still be sitting in the OS page cache. If the program exits normally, the OS persists the Checkpoint after a certain interval, so data consistency is preserved. However, if the program exits abnormally (for example, the machine goes down), there is no guarantee that the Checkpoint data has been flushed to disk, and in that case the data in the business databases may be inconsistent.
Therefore, the current problems can be summarized as follows:
- The checkpoint cannot cover the missing data in all databases
- Some databases flush to disk slowly and fall more than two checkpoints behind
- The checkpoint itself flushes slowly and lags behind some business databases
- The checkpoint is empty, so the data cannot be recovered
Goal: solve the problem of inconsistent database data caused by abnormal machine/OS shutdown.
- LevelDB: an embedded high-performance KV database implemented by Google
- RocksDB: a high-performance KV database based on the LevelDB model, with many additional optimizations on top of it
- Checkpoint: stores all data changed by a block, to ensure data consistency
- writeBatch: LevelDB's atomic batch-write interface
To ensure that, by retaining multiple checkpoints, the checkpoint set can cover all missing data in the databases, the following two conditions must be met:
- The block height of the earliest checkpoint <= the lowest block height of the data persisted in the business databases
- The block height of the latest checkpoint >= the highest block height of the data persisted in the business databases
Condition 1: Under the default configuration of mainstream OSes, data in the page cache is synced to disk after at most about 35 s. Therefore, the checkpoints need to retain at least the data changes generated within the last 35 s.
Condition 2: This condition implicitly requires that, for data of the same block height, the checkpoint must be flushed to disk before the business databases. Normally a write first goes to the OS page cache and is flushed to disk according to the OS policy; to meet this condition, the checkpoint must be forcibly written to disk, which causes some performance loss.
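As an illustrative example (the numbers are hypothetical): if, at the moment of an abnormal shutdown, the slowest business database has only persisted data up to block 1,000 while the fastest has already persisted block 1,005, the retained checkpoints must reach back to block 1,000 or earlier (condition 1) and forward to block 1,005 or later (condition 2); only then can replaying them bring every database up to the same, consistent height.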
However, the data size of a single checkpoint is controllable, and testing LevelDB with `writeSync` set to true and to false shows a performance gap of only about 3~4 ms, so the impact on overall performance is acceptable. The configuration parameter `checkpoint.sync` is therefore added:
- `sync = true`: every time a checkpoint is created, the underlying DB is forced to write with `writeSync(true)`, ensuring that the checkpoint is flushed successfully.
- `sync = false`: consistent with the original logic and better performing, but the latest checkpoint data may be lost, which can cause recovery to fail when the machine goes down.
Users can choose either option according to their own needs.
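At the storage layer, `checkpoint.sync` essentially maps to the sync flag of the write options used for the checkpoint batch write. The fragment below is a minimal sketch using the org.iq80.leveldb types (DB, WriteBatch, WriteOptions); the checkpointDb handle and the batch contents are assumed to be prepared elsewhere:

// Assumes org.iq80.leveldb.DB, WriteBatch and WriteOptions are available.
void writeCheckpoint(DB checkpointDb, WriteBatch batch, boolean checkpointSync) {
  // sync(true): fsync before returning, so the checkpoint survives a machine crash (~3-4 ms slower).
  // sync(false): return once the data reaches the OS page cache; faster, but the newest
  //              checkpoint may be lost if the machine goes down before the OS flushes it.
  checkpointDb.write(batch, new WriteOptions().sync(checkpointSync));
}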
The old version of the checkpoint is stored uniformly in the `tmp` database, and the corresponding database name is added as a prefix to the key of each entry so that the data can easily be recovered into the corresponding database. The new version of the checkpoint keeps multiple copies of the old-style checkpoint, distinguished by subdirectories. Subdirectory names take the form `timestamp_blockNumber`.
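For instance, a checkpoint created at timestamp 1662105600000 for block 43210000 would live in a subdirectory named 1662105600000_43210000 (the concrete numbers are illustrative), and the two fields can be recovered by splitting the name on the underscore, which is what the pruning logic shown later relies on:

// Build the checkpoint directory name: <creation timestamp>_<finalized block number>.
String dirName = System.currentTimeMillis() + "_" + currentBlockNum;  // e.g. "1662105600000_43210000"

// Parse it back later (e.g. when pruning): index 0 is the timestamp, index 1 is the block number.
String[] parts = dirName.split("_");
long timestamp = Long.parseLong(parts[0]);
long blockNumber = Long.parseLong(parts[1]);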
For compatibility, the original checkpoint logic is retained, the checkpoint v2 logic is added, and the checkpoint version is selected through the configuration file. The recovery logic of a single checkpoint in v2 is consistent with v1; on top of that, multiple copies are created and distinguished by different database directories. The main logic in pseudocode:
private void createCheckpointV2() {
  CheckPointV2Store checkPointV2Store = null;
  // block number covered by this checkpoint; stays -1 if no block data is found
  long currentBlockNum = -1;
  try {
    // collect all changed KVs of this flush into one batch, keyed by a dbName prefix
    Map<WrappedByteArray, WrappedByteArray> batch = new HashMap<>();
    for (Chainbase db : dbs) {
      Snapshot head = db.getHead();
      if (Snapshot.isRoot(head)) {
        return;
      }
      String dbName = db.getDbName();
      Snapshot next = head.getRoot();
      for (int i = 0; i < flushCount; ++i) {
        next = next.getNext();
        SnapshotImpl snapshot = (SnapshotImpl) next;
        DB<Key, Value> keyValueDB = snapshot.getDb();
        for (Map.Entry<Key, Value> e : keyValueDB) {
          Key k = e.getKey();
          Value v = e.getValue();
          batch.put(WrappedByteArray.of(Bytes.concat(simpleEncode(dbName), k.getBytes())),
              WrappedByteArray.of(v.encode()));
          if ("block".equals(dbName)) {
            currentBlockNum = new BlockCapsule(v.getBytes()).getNum();
          }
        }
      }
    }
    if (currentBlockNum == -1) {
      logger.error("create checkpoint failed, currentBlockNum: {}", currentBlockNum);
      System.exit(-1);
    }
    // checkpoint directory name: timestamp_blockNumber
    String dbName = System.currentTimeMillis() + "_" + currentBlockNum;
    checkPointV2Store = getCheckpointDB(dbName);
    // one atomic batch write; sync behavior follows the checkpoint.sync configuration
    checkPointV2Store.getDbSource().updateByBatch(batch.entrySet().stream()
        .map(e -> Maps.immutableEntry(e.getKey().getBytes(), e.getValue().getBytes()))
        .collect(HashMap::new, (m, k) -> m.put(k.getKey(), k.getValue()), HashMap::putAll),
        WriteOptionsWrapper.getInstance().sync(CommonParameter
            .getInstance().getStorage().isCheckpointSync()));
  } catch (Exception e) {
    throw new TronDBException(e);
  } finally {
    if (checkPointV2Store != null) {
      checkPointV2Store.close();
    }
  }
}
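A note on the design above: all changed key-values of a flush are collected into one batch, and each key is prefixed with the name of its source database; this is what allows a single atomic batch write to cover every business database, and what lets the recovery logic route each entry back to the correct database afterwards.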
Checkpoints keep accumulating over time, so a strategy is needed to clear expired ones. The number of checkpoints retained must be enough to cover the maximum OS page-flush interval; the checkpoint expiration time is currently set to 2 minutes.
For robustness and for the version check at startup, the checkpoint directory must always retain at least one checkpoint. To prevent the deletion mechanism from clearing all checkpoint directories, a minimum number of checkpoints is set, currently tentatively 3.
The cleaning strategy uses a scheduled task that traverses the checkpoint directory and applies the above rules; checkpoints beyond the threshold are deleted.
Part of the pseudocode:
public void init() {
  // prune expired checkpoints periodically
  pruneCheckpointThread.scheduleWithFixedDelay(() -> {
    try {
      if (isV2Open() && !unChecked) {
        pruneCheckpoint();
      }
    } catch (Throwable t) {
      logger.error("Exception in prune checkpoint", t);
    }
  }, 10000, 3600, TimeUnit.MILLISECONDS);
}
private void pruneCheckpoint() {
  if (unChecked) {
    return;
  }
  List<String> cpList = getCheckpointList();
  if (cpList == null) {
    return;
  }
  if (cpList.size() < 3) {
    return;
  }
  // only checkpoints beyond the 3 newest are candidates for pruning
  for (String cp : cpList.subList(0, cpList.size() - 3)) {
    long timestamp = Long.parseLong(cp.split("_")[0]);
    long blockNumber = Long.parseLong(cp.split("_")[1]);
    // keep checkpoints younger than the 2-minute expiration time
    if (System.currentTimeMillis() - timestamp < ONE_MINUTE_MILLS * 2) {
      break;
    }
    String checkpointPath = Paths.get(StorageUtils.getOutputDirectoryByDbName(CHECKPOINT_V2_DIR),
        CommonParameter.getInstance().getStorage().getDbDirectory(), CHECKPOINT_V2_DIR).toString();
    if (!FileUtil.recursiveDelete(Paths.get(checkpointPath, cp).toString())) {
      logger.error("checkpoint prune failed, number: {}", blockNumber);
      return;
    }
    logger.debug("checkpoint prune success, number: {}", blockNumber);
  }
}
The user can choose whether to use the v1 or v2 checkpoint mode, and can also choose whether to force-flush the checkpoint.
// config.conf
storage {
  ....
  checkpoint.version = 2
  checkpoint.sync = true
}
// code
public void check() {
  if (!isV2Open()) {
    // once v2 checkpoints exist, switching back to v1 is not allowed
    if (checkPointV2Store.getDbSource().allKeys().size() > 0) {
      logger.error("db check failed, can't convert checkpoint from v2 to v1");
      System.exit(-1);
    }
    checkV1();
  } else {
    checkV2();
  }
}
public void flush() {
  ...
  if (shouldBeRefreshed()) {
    try {
      long start = System.currentTimeMillis();
      if (!isV2Open()) {
        // v1: the single checkpoint is deleted and recreated on every flush
        deleteCheckpoint();
        createCheckpoint();
      } else {
        // v2: a new checkpoint copy is created for every flush
        createCheckpointV2();
      }
      ...
    }
  }
}
To ensure a smooth program upgrade, the old and new versions must remain compatible, with the checkpoint version switched through configuration.
Key points:
- The v1 function is retained and its code is not deleted
- A checkpoint database is added as the data storage for the v2 version
- A new checkpoint configuration item is added to the configuration file, allowing users to specify the checkpoint version
- During the startup check, the current checkpoint version is read from the configuration file and the corresponding version's logic is selected
- When creating a checkpoint, the corresponding `createCheckpoint` logic is selected according to the configured version
- Once a node has started with v2, switching back to v1 is not allowed
Pseudocode of the v2 check logic:
private void checkV2() {
  logger.info("checkpoint version: {}", CommonParameter.getInstance().getStorage().getCheckpointVersion());
  logger.info("checkpoint sync: {}", CommonParameter.getInstance().getStorage().isCheckpointSync());
  List<String> cpList = getCheckpointList();
  if (cpList == null || cpList.size() == 0) {
    logger.info("checkpoint size is 0, using v1 recover");
    checkV1();
    deleteCheckpoint();
    return;
  }
  ...... // recover logic
}
When the v2 version is started for the first time, the following situations exist:
- Node with block height = 0:
  - both the `tmp` directory and the `checkpoint` directory are empty
- Node with block height > 0:
  - `tmp` is not empty and `checkpoint` is empty
  - `tmp` is empty (because of the delete operation) and `checkpoint` is empty

To allow unified handling, whether the node is syncing from block 0 or is already partly or fully synchronized, whenever the `checkpoint` directory is empty, the `tmp` database is used to restore the data, and the check returns directly after this operation completes.
In addition: because `checkV1()` and `deleteCheckpoint()` in the above logic are not an atomic operation, there is a small probability that the machine goes down right after both have executed, with the data written back by `checkV1()` not yet fully flushed while `deleteCheckpoint()` has already removed the checkpoint data from disk; in that case the database damage cannot be repaired after a restart. Data corruption is therefore possible on the very first start with v2, or if a crash happens while `checkV1()` is executing. This situation is not considered here.
In the future, we will consider adding functions such as data consistency check.