
Checkpoint an entire table at a consistent point in time #1032

Closed
bmatican opened this issue Mar 19, 2019 · 0 comments
Assignees
Labels
area/docdb (YugabyteDB core features), kind/new-feature (This is a request for a completely new feature), priority/high (High Priority)

Comments

@bmatican
Contributor

We need the ability to ask for a checkpoint of an entire table with a consistent time cut across it. This will be useful for a number of features:

cc @mbautin @amitanandaiyer @robertpang

@bmatican bmatican added the kind/new-feature This is a request for a completely new feature label Mar 19, 2019
@bmatican bmatican added the area/docdb YugabyteDB core features label Oct 29, 2019
spolitov added a commit that referenced this issue Jan 20, 2020
Summary:
Moved and cleaned up snapshot-related classes.

This is a preparation diff for transaction-aware backup.

Test Plan: Jenkins

Reviewers: bogdan, oleg

Reviewed By: oleg

Subscribers: nicolas, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D7797
spolitov added a commit that referenced this issue Jan 25, 2020
…iles

Summary: transaction_participant.cc became too big; this diff splits it into separate files.

Test Plan: Jenkins

Reviewers: timur, mikhail, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D7827
spolitov added a commit that referenced this issue Feb 6, 2020
Summary:
For a transaction-aware snapshot we need to patch the existing RocksDB files to ignore records
newer than a specified hybrid time.
This diff adds support for that.
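
To illustrate the idea (not the actual RocksDB patch, which stores the cutoff in SST file metadata and applies it during iteration), here is a minimal C++ sketch of skipping records newer than a per-file hybrid-time cutoff; the types and names are hypothetical:

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-ins for YugabyteDB types.
using HybridTime = uint64_t;

struct Record {
  std::string key;
  HybridTime write_time;
};

// Return only the records visible at the snapshot hybrid time, i.e. skip
// anything written after the cutoff recorded for this file.
std::vector<Record> FilterByHybridTime(const std::vector<Record>& records,
                                       HybridTime cutoff) {
  std::vector<Record> visible;
  for (const auto& r : records) {
    if (r.write_time <= cutoff) {
      visible.push_back(r);
    }
  }
  return visible;
}

int main() {
  std::vector<Record> records = {{"k1", 90}, {"k2", 105}, {"k3", 100}};
  for (const auto& r : FilterByHybridTime(records, /* cutoff = */ 100)) {
    std::cout << r.key << "\n";  // prints k1 and k3; k2 is newer than the cutoff
  }
}
```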

Test Plan: ybd --gtest_filter DocDBTest.SetHybridTimeFilter

Reviewers: mikhail, bogdan, timur

Reviewed By: timur

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D7891
spolitov added a commit that referenced this issue Mar 1, 2020
…pant

Summary:
For a transaction-aware snapshot we should apply intents for all transactions
that were committed before the snapshot hybrid time.

This diff adds this functionality to the transaction participant.
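
As a rough illustration of the selection rule (not the actual transaction participant code), a sketch with hypothetical types:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using HybridTime = uint64_t;
using TransactionId = std::string;

// Hypothetical view of the participant's state: commit hybrid times of
// transactions whose intents have not yet been applied to the regular DB.
std::vector<TransactionId> TransactionsToApply(
    const std::map<TransactionId, HybridTime>& committed,
    HybridTime snapshot_ht) {
  std::vector<TransactionId> result;
  for (const auto& [txn_id, commit_ht] : committed) {
    // Transactions committed at or before the snapshot hybrid time must have
    // their intents applied before the checkpoint is taken.
    if (commit_ht <= snapshot_ht) {
      result.push_back(txn_id);
    }
  }
  return result;
}
```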

Test Plan: ybd --gtest_filter SnapshotTxnTest.ResolveIntents

Reviewers: bogdan, mikhail, timur

Reviewed By: mikhail, timur

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D7861
@bmatican bmatican added the priority/high High Priority label Mar 6, 2020
spolitov added a commit that referenced this issue Mar 15, 2020
Summary:
This diff implements transaction-aware snapshot creation for a given
table or set of tables. The workflow is as follows:

1. The master leader replicates a CREATE_ON_MASTER snapshot operation
   in the master Raft group. This operation contains the tablet ids for
   the snapshot and the snapshot hybrid time. The snapshot hybrid time
   is computed as the master's current time + max clock skew (see the
   sketch after this list). This method of choosing the snapshot hybrid
   time guarantees that all data present in the relevant table(s) at
   the moment we start creating the snapshot is included in the
   snapshot.

2. The master leader sends snapshot requests to leaders of all
   involved tablets (i.e. all tablets of tables we are creating a
   snapshot for).  Each of these requests includes the snapshot id and
   hybrid time.

3. Each leader of an involved tablet receives a snapshot request,
   waits until the hybrid time on the tablet server passes the snapshot
   time, and applies intents for all transactions that were committed
   before the snapshot hybrid time.  This results in replicating
   APPLYING records for the respective transactions in the Raft groups
   of these tablets.

4. The tablet leader replicates a CREATE_ON_TABLET snapshot operation
   using Raft.

5. When the CREATE_ON_TABLET Raft operation is applied (after Raft
   replication), each Raft group member of the tablet does the
   following:
   - Flush regular RocksDB.
   - Create a checkpoint for the regular RocksDB in a temporary
     directory.
   - Update the temporary checkpoint, patching necessary SST file
     metadata with a special mark to ignore records past the snapshot
     hybrid time.
   - Move the temporary checkpoint to the snapshot directory.

6. After replicating and applying the CREATE_ON_TABLET Raft operation
   the tablet leader reports to the master that the snapshot has been
   successfully created on that tablet.

7. After the master leader receives such responses from all tablets,
   it marks the snapshot as complete.
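
A minimal sketch of the timing logic in steps 1 and 3, with hypothetical clock types (the real code uses the cluster's hybrid clock):

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

using HybridTime = uint64_t;  // hypothetical stand-in for yb::HybridTime

// Step 1 (master): pick the snapshot hybrid time far enough in the future
// that no node can still accept a write older than it.
HybridTime ChooseSnapshotHybridTime(HybridTime master_now,
                                    HybridTime max_clock_skew) {
  return master_now + max_clock_skew;
}

// Step 3 (tablet leader): before applying intents and checkpointing, wait
// until the local hybrid time has passed the snapshot hybrid time.
void WaitUntilPassed(HybridTime snapshot_ht,
                     const std::function<HybridTime()>& local_now) {
  while (local_now() <= snapshot_ht) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
}
```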

The following functionality is currently missing and will be added by
upcoming diffs:

1. Request snapshot status
2. Long-running request retry from master to tablet leaders
3. Correct retry handling
4. Retain necessary logs in master
5. Release snapshot state from memory
6. Appropriate tests
7. Retain history cutoff during snapshot
8. After creating a snapshot, check the history cutoff in each file
   and make sure it is not greater than the snapshot hybrid time.

Test Plan: ybd --cxx-test backup-txn-test

Reviewers: oleg, mikhail

Reviewed By: mikhail

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8100
spolitov added a commit that referenced this issue Mar 20, 2020
Summary:
This diff implements the following transaction-aware snapshot functionality:
1) List transaction-aware snapshots.
2) Restore a transaction-aware snapshot.
3) List transaction-aware snapshot restorations.

Test Plan: ybd --cxx-test backup-txn-test

Reviewers: mikhail

Reviewed By: mikhail

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8143
spolitov added a commit that referenced this issue Mar 27, 2020
Summary:
Compactions perform garbage collection of deleted or overwritten records
in DocDB using a parameter called "history cutoff". When a compaction
starts, the history cutoff timestamp for that compaction is calculated
as the minimum of current time minus some configurable retention
interval, and the oldest active read point in this tablet replica.
Prior to this diff, this decision was being made independently on each
tablet replica.
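
Before this diff, each replica computed the cutoff on its own; a sketch of that calculation, using hypothetical names:

```cpp
#include <algorithm>
#include <cstdint>

using HybridTime = uint64_t;  // hypothetical stand-in

// History cutoff used by a compaction on one replica: versions that are
// invisible at this time may be garbage-collected.
HybridTime ComputeLocalHistoryCutoff(HybridTime now,
                                     HybridTime retention_interval,
                                     HybridTime oldest_active_read_point) {
  // Keep at least `retention_interval` of history, and never discard history
  // still needed by an in-progress read on this replica.
  return std::min(now - retention_interval, oldest_active_read_point);
}
```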

When creating a transaction-aware consistent snapshot, we first select
a snapshot timestamp, then apply the provisional records of all
transactions that could have committed before that timestamp, and then
replicate a Raft record that triggers the creation of a RocksDB
checkpoint of the regular RocksDB on each replica of each tablet of the
table. (The provisional records RocksDB does not need to participate in
the snapshot.) Each of these snapshot Raft records contains the
snapshot timestamp. We need a way to make sure that by the time this
record gets applied (in the Raft sense of the word "applied"), the view
of the tablet's data at the snapshot timestamp is still available on
all replicas of the tablet, i.e. the relevant history has not been
garbage-collected.

To that end, in this diff we move the control over the history cutoff
timestamp to the tablet leader. We introduce a new Raft operation type,
HISTORY_CUTOFF_OP. The leader adds this operation to the Raft queue
automatically when the hybrid time of the last operation tracked by
MvccManager (such as write and transaction update operations) exceeds
the current history cutoff, but not more than once per the time
interval specified by the new history_cutoff_propagation_interval_ms
flag. Another new flag, enable_history_cutoff_propagation, determines
whether we use this new approach (when the flag is set to true), or the
old approach when each replica selects the history cutoff independently
at the time of starting a compaction (when the flag is false, which is
the default). We can set enable_history_cutoff_propagation to true
after finishing a rolling upgrade of a YugabyteDB cluster to a version
containing this commit. If we simply enabled the new behavior by
default, upgraded leaders would start sending the new Raft operation
type to followers that would not understand it.
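
A rough sketch of the leader-side trigger described above (the two flag names are from this summary; everything else is hypothetical):

```cpp
#include <cstdint>

using HybridTime = uint64_t;
using MonoTimeMs = int64_t;

// Hypothetical condensed form of the leader's decision: replicate a
// HISTORY_CUTOFF_OP only when the tracked hybrid time has moved past the
// current cutoff and we have not propagated a cutoff too recently.
bool ShouldPropagateHistoryCutoff(
    bool enable_history_cutoff_propagation,              // flag from this diff
    MonoTimeMs history_cutoff_propagation_interval_ms,   // flag from this diff
    HybridTime last_tracked_op_ht,                       // from MvccManager
    HybridTime current_history_cutoff,
    MonoTimeMs now_ms, MonoTimeMs last_propagation_ms) {
  if (!enable_history_cutoff_propagation) {
    return false;  // old behavior: each replica picks its own cutoff
  }
  return last_tracked_op_ht > current_history_cutoff &&
         now_ms - last_propagation_ms >= history_cutoff_propagation_interval_ms;
}
```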

Also in this diff we are moving the tracking of the multi-set of hybrid
times of active read operations from Tablet to TabletRetentionPolicy.
TabletRetentionPolicy becomes an entity that keeps track of the
committed history cutoff and the oldest read point, and computes
history cutoff values to be used in local compactions and propagated to
followers.

Other improvements:

- Deduplicating code for creating a log prefix for a tablet/peer
  combination.
- Deduplicating code for creating responses for denied votes.
- Creating a utility data structure FirstEntryMetadata for data that is
  stored with the first entry of a log batch, i.e. restart-safe coarse
  time and committed OpId.
- Renaming committed_index, which is actually an OpId, to
  committed_op_id.

Test Plan: Jenkins

Reviewers: amitanand, mikhail, timur

Reviewed By: mikhail, timur

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D7975
spolitov added a commit that referenced this issue Mar 31, 2020
Summary:
This diff implements persistence for transaction-aware snapshots. After
initiation, a snapshot is stored in the system catalog's RocksDB and loaded
into memory after a master restart. The snapshot state is also updated after
the snapshot is complete.

Other changes:

- Creating a RAII class for keeping track of the number of "prepare"
  operations, instead of using manual and error-prone increments and
  decrements (a minimal sketch follows this list).

- Moving SysCatalogWriter from sys_catalog-internal.h to
  sys_catalog_writer.{h,cc}.

- Deduplicating the logic for applying snapshot Raft operations between
  SnapshotOperation::DoReplicated and TabletSnapshots::Apply.
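
For the RAII item above, a minimal sketch of the idea (hypothetical names, not the actual class):

```cpp
#include <atomic>

// Increments a counter of in-flight "prepare" operations on construction and
// decrements it on destruction, so the count cannot leak on early returns or
// exceptions the way manual increments/decrements can.
class ScopedPrepareOperation {
 public:
  explicit ScopedPrepareOperation(std::atomic<int>* counter) : counter_(counter) {
    counter_->fetch_add(1);
  }
  ~ScopedPrepareOperation() { counter_->fetch_sub(1); }

  ScopedPrepareOperation(const ScopedPrepareOperation&) = delete;
  ScopedPrepareOperation& operator=(const ScopedPrepareOperation&) = delete;

 private:
  std::atomic<int>* counter_;
};
```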

Test Plan: ybd --gtest_filter BackupTxnTest.Persistence

Reviewers: mikhail

Reviewed By: mikhail

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8166
spolitov added a commit that referenced this issue Apr 4, 2020
Summary:
Implemented the ability to delete a transaction-aware snapshot.

First, DELETE_ON_MASTER is replicated in the master tablet. After that, the
master sends a DELETE_ON_TABLET request to all the tablets. Each tablet
deletes its local snapshot copy and responds to the master.

Also moved SubmitWrite and part of the transaction-aware snapshot creation
to MasterSnapshotCoordinator.

Test Plan: ybd --gtest_filter BackupTxnTest.Delete

Reviewers: mikhail

Reviewed By: mikhail

Subscribers: oleg, ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8220
spolitov added a commit that referenced this issue Apr 5, 2020
Summary: This diff adds a test that creates a transaction-aware snapshot, then imports its metadata.

Test Plan: ybd --gtest_filter BackupTxnTest.ImportMeta

Reviewers: oleg

Reviewed By: oleg

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8232
spolitov added a commit that referenced this issue Apr 13, 2020
Summary:
This diff fixes several issues related to replaying snapshot-related operations during master bootstrap:
1) On bootstrap failure we try to dump the current replay state, but this could cause a crash, since the replayed entry could already have been removed from pending_replicates and other data structures. Fixed by handling nullptr there.
2) schema_with_ids_ in SysCatalog is initialized after bootstrap, which prevents snapshot-related operations from being replayed. Fixed by initializing schema_with_ids_ in the constructor.
3) SnapshotOperationState does not have a hybrid time while being replayed. Fixed by setting the hybrid time from the replicate message.
4) tablet_ in TabletPeer is not yet initialized while applying the snapshot operation state during bootstrap. Fixed by using the tablet from the operation state.
5) A snapshot could be inserted twice during bootstrap: first because of log replay, and second during sys catalog load.

Also added a restart without flush before BackupTxnTest tear down, to test replaying snapshot operations during bootstrap.

Test Plan: ybd --gtest_filter BackupTxnTest.*

Reviewers: mikhail, oleg, bogdan

Reviewed By: oleg

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8277
spolitov added a commit that referenced this issue Apr 15, 2020
Summary:
After snapshot creation or deletion is initiated, we should retry it unless an unrecoverable error happens.
Currently only the history cutoff can cause a snapshot to fail: we must use the same hybrid
time for the snapshot on all tablets to keep transactional consistency, and if a tablet has
already moved its history cutoff past that point, it could have discarded the required values.
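
A hedged sketch of this retry policy (hypothetical types; the real retries run from the master's snapshot coordinator):

```cpp
#include <chrono>
#include <thread>

enum class OpResult { kDone, kRetryableError, kHistoryCutoffPassed };

// Keep retrying a snapshot create/delete step until it succeeds or hits the
// one unrecoverable case: a tablet whose history cutoff already passed the
// snapshot hybrid time, so the consistent view is no longer available.
template <typename Op>
bool RetryUntilDoneOrUnrecoverable(Op op) {
  for (;;) {
    switch (op()) {
      case OpResult::kDone:
        return true;
      case OpResult::kHistoryCutoffPassed:
        return false;  // unrecoverable: the snapshot must be marked as failed
      case OpResult::kRetryableError:
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        break;  // retry
    }
  }
}
```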

Also, this diff contains the following improvements:
1) Extracted polling logic from TransactionCoordinator to a separate utility class.
2) Fixed a crash when a retryable master task times out.
3) Replaced LOG with LOG_WITH_PREFIX in retryable master tasks.

Test Plan: ybd --gtest_filter BackupTxnTest.Retry

Reviewers: oleg, mikhail

Reviewed By: mikhail

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8256
spolitov added a commit that referenced this issue Apr 30, 2020
Summary:
This diff adds a test that verifies transaction-aware backup consistency.
It checks that each transaction is either fully backed up or not backed up at all.
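
A sketch of the all-or-nothing property the test verifies (hypothetical data shapes, not the actual test code):

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>

// For each transaction, the rows it wrote, plus the rows found after
// restoring the snapshot. Consistency means every transaction is restored
// either completely or not at all.
bool BackupIsTransactionallyConsistent(
    const std::map<std::string, std::set<std::string>>& written_by_txn,
    const std::set<std::string>& restored_rows) {
  for (const auto& [txn_id, rows] : written_by_txn) {
    std::size_t found = 0;
    for (const auto& row : rows) {
      found += restored_rows.count(row);
    }
    if (found != 0 && found != rows.size()) {
      return false;  // partially restored transaction => inconsistent backup
    }
  }
  return true;
}
```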

Test Plan: ybd --gtest_filter BackupTxnTest.Consistency

Reviewers: mikhail, oleg, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8377
@bmatican bmatican closed this as completed May 6, 2020