
Checkpoint an entire table at a consistent point in time #1032

Closed
bmatican opened this issue Mar 19, 2019 · 0 comments
Assignees
Labels
area/docdb (YugabyteDB core features), kind/new-feature (This is a request for a completely new feature), priority/high (High Priority)

Comments

@bmatican
Contributor

We need the ability to ask for a checkpoint of an entire table with a consistent time cut across it. This will be useful for a number of features:

cc @mbautin @amitanandaiyer @robertpang

@bmatican bmatican added the kind/new-feature This is a request for a completely new feature label Mar 19, 2019
@bmatican bmatican added the area/docdb YugabyteDB core features label Oct 29, 2019
spolitov added a commit that referenced this issue Jan 20, 2020
Summary:
Moved and cleaned up snapshot-related classes.

This is a preparation diff for transaction-aware backup.

Test Plan: Jenkins

Reviewers: bogdan, oleg

Reviewed By: oleg

Subscribers: nicolas, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D7797
spolitov added a commit that referenced this issue Jan 25, 2020
…iles

Summary: transaction_participant.cc became too big; this diff splits it into separate files.

Test Plan: Jenkins

Reviewers: timur, mikhail, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D7827
spolitov added a commit that referenced this issue Feb 6, 2020
Summary:
For a transaction-aware snapshot we need to patch the existing RocksDB files to ignore records
newer than a specified hybrid time.
This diff adds support for that.
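
To illustrate the idea (not the actual RocksDB patch, which stores the cutoff in SST file metadata and applies it during iteration), here is a minimal C++ sketch of skipping records newer than a per-file hybrid-time cutoff; the types and names are hypothetical:

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-ins for YugabyteDB types.
using HybridTime = uint64_t;

struct Record {
  std::string key;
  HybridTime write_time;
};

// Return only the records visible at the snapshot hybrid time, i.e. skip
// anything written after the cutoff recorded for this file.
std::vector<Record> FilterByHybridTime(const std::vector<Record>& records,
                                       HybridTime cutoff) {
  std::vector<Record> visible;
  for (const auto& r : records) {
    if (r.write_time <= cutoff) {
      visible.push_back(r);
    }
  }
  return visible;
}

int main() {
  std::vector<Record> records = {{"k1", 90}, {"k2", 105}, {"k3", 100}};
  for (const auto& r : FilterByHybridTime(records, /* cutoff = */ 100)) {
    std::cout << r.key << "\n";  // prints k1 and k3; k2 is newer than the cutoff
  }
}
```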

Test Plan: ybd --gtest_filter DocDBTest.SetHybridTimeFilter

Reviewers: mikhail, bogdan, timur

Reviewed By: timur

Subscribers: kannan, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D7891
spolitov added a commit that referenced this issue Mar 1, 2020
…pant

Summary:
For a transaction-aware snapshot we should apply intents for all transactions
that were committed before the snapshot hybrid time.

This diff adds this functionality to the transaction participant.
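
As a rough illustration of the selection rule (not the actual transaction participant code), a sketch with hypothetical types:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using HybridTime = uint64_t;
using TransactionId = std::string;

// Hypothetical view of the participant's state: commit hybrid times of
// transactions whose intents have not yet been applied to the regular DB.
std::vector<TransactionId> TransactionsToApply(
    const std::map<TransactionId, HybridTime>& committed,
    HybridTime snapshot_ht) {
  std::vector<TransactionId> result;
  for (const auto& [txn_id, commit_ht] : committed) {
    // Transactions committed at or before the snapshot hybrid time must have
    // their intents applied before the checkpoint is taken.
    if (commit_ht <= snapshot_ht) {
      result.push_back(txn_id);
    }
  }
  return result;
}
```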

Test Plan: ybd --gtest_filter SnapshotTxnTest.ResolveIntents

Reviewers: bogdan, mikhail, timur

Reviewed By: mikhail, timur

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D7861
@bmatican bmatican added the priority/high High Priority label Mar 6, 2020
spolitov added a commit that referenced this issue Mar 15, 2020
Summary:
This diff implements transaction-aware snapshot creation for a given
table or set of tables. The workflow is as follows:

1. The master leader replicates a CREATE_ON_MASTER snapshot operation
   in the master Raft group. This operation contains the tablet ids for
   the snapshot and the snapshot hybrid time. The snapshot hybrid time
   is computed as the master's current time + max clock skew (see the
   sketch after this list). This method of choosing the snapshot hybrid
   time guarantees that all data present in the relevant table(s) at
   the moment we start creating the snapshot is included in the
   snapshot.

2. The master leader sends snapshot requests to leaders of all
   involved tablets (i.e. all tablets of tables we are creating a
   snapshot for).  Each of these requests includes the snapshot id and
   hybrid time.

3. Each leader of an involved tablet receives a snapshot request,
   waits until the hybrid time on the tablet server passes the snapshot
   time, and applies intents for all transactions that were committed
   before the snapshot hybrid time.  This results in replicating
   APPLYING records for the respective transactions in the Raft groups
   of these tablets.

4. The tablet leader replicates a CREATE_ON_TABLET snapshot operation
   using Raft.

5. When the CREATE_ON_TABLET Raft operation is applied (after Raft
   replication), each Raft group member of the tablet does the
   following:
   - Flush regular RocksDB.
   - Create a checkpoint for the regular RocksDB in a temporary
     directory.
   - Update the temporary checkpoint, patching necessary SST file
     metadata with a special mark to ignore records past the snapshot
     hybrid time.
   - Move the temporary checkpoint to the snapshot directory.

6. After replicating and applying the CREATE_ON_TABLET Raft operation
   the tablet leader reports to the master that the snapshot has been
   successfully created on that tablet.

7. After the master leader receives such responses from all tablets,
   it marks the snapshot as complete.
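
A minimal sketch of the timing logic in steps 1 and 3, with hypothetical clock types (the real code uses the cluster's hybrid clock):

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

using HybridTime = uint64_t;  // hypothetical stand-in for yb::HybridTime

// Step 1 (master): pick the snapshot hybrid time far enough in the future
// that no node can still accept a write older than it.
HybridTime ChooseSnapshotHybridTime(HybridTime master_now,
                                    HybridTime max_clock_skew) {
  return master_now + max_clock_skew;
}

// Step 3 (tablet leader): before applying intents and checkpointing, wait
// until the local hybrid time has passed the snapshot hybrid time.
void WaitUntilPassed(HybridTime snapshot_ht,
                     const std::function<HybridTime()>& local_now) {
  while (local_now() <= snapshot_ht) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
}
```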

The following functionality is currently missing and will be added by
upcoming diffs:

1. Request snapshot status
2. Long-running request retry from master to tablet leaders
3. Correct retry handling
4. Retain necessary logs in master
5. Release snapshot state from memory
6. Appropriate tests
7. Retain history cutoff during snapshot
8. After creating a snapshot, check the history cutoff in each file
   and make sure it is not greater than the snapshot hybrid time.

Test Plan: ybd --cxx-test backup-txn-test

Reviewers: oleg, mikhail

Reviewed By: mikhail

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8100
spolitov added a commit that referenced this issue Mar 20, 2020
Summary:
This diff implements the following transaction-aware snapshot functionality:
1) List transaction-aware snapshots.
2) Restore a transaction-aware snapshot.
3) List transaction-aware snapshot restorations.

Test Plan: ybd --cxx-test backup-txn-test

Reviewers: mikhail

Reviewed By: mikhail

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8143
spolitov added a commit that referenced this issue Mar 27, 2020
Summary:
Compactions perform garbage collection of deleted or overwritten records
in DocDB using a parameter called "history cutoff". When a compaction
starts, the history cutoff timestamp for that compaction is calculated
as the minimum of current time minus some configurable retention
interval, and the oldest active read point in this tablet replica.
Prior to this diff, this decision was being made independently on each
tablet replica.
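
Before this diff, each replica computed the cutoff on its own; a sketch of that calculation, using hypothetical names:

```cpp
#include <algorithm>
#include <cstdint>

using HybridTime = uint64_t;  // hypothetical stand-in

// History cutoff used by a compaction on one replica: versions that are
// invisible at this time may be garbage-collected.
HybridTime ComputeLocalHistoryCutoff(HybridTime now,
                                     HybridTime retention_interval,
                                     HybridTime oldest_active_read_point) {
  // Keep at least `retention_interval` of history, and never discard history
  // still needed by an in-progress read on this replica.
  return std::min(now - retention_interval, oldest_active_read_point);
}
```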

When creating a transaction-aware consistent snapshot, we first select
a snapshot timestamp, then apply the provisional records of all
transactions that could have committed before that timestamp, and then
replicate a Raft record that triggers the creation of a RocksDB
checkpoint of the regular RocksDB on each replica of each tablet of the
table. (The provisional records RocksDB does not need to participate in
the snapshot.) Each of these snapshot Raft records contains the
snapshot timestamp. We need a way to make sure that by the time this
record gets applied (in the Raft sense of the word "applied"), the view
of the tablet's data at the snapshot timestamp is still available on
all replicas of the tablet, i.e. the relevant history has not been
garbage-collected.

To that end, in this diff we move the control over the history cutoff
timestamp to the tablet leader. We introduce a new Raft operation type,
HISTORY_CUTOFF_OP. The leader adds this operation to the Raft queue
automatically when the hybrid time of the last operation tracked by
MvccManager (such as write and transaction update operations) exceeds
the current history cutoff, but not more than once per the time
interval specified by the new history_cutoff_propagation_interval_ms
flag. Another new flag, enable_history_cutoff_propagation, determines
whether we use this new approach (when the flag is set to true), or the
old approach when each replica selects the history cutoff independently
at the time of starting a compaction (when the flag is false, which is
the default). We can set enable_history_cutoff_propagation to true
after finishing a rolling upgrade of a YugabyteDB cluster to a version
containing this commit. If we simply enabled the new behavior by
default, upgraded leaders would start sending the new Raft operation
type to followers that would not understand it.
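
A rough sketch of the leader-side trigger described above (the two flag names are from this summary; everything else is hypothetical):

```cpp
#include <cstdint>

using HybridTime = uint64_t;
using MonoTimeMs = int64_t;

// Hypothetical condensed form of the leader's decision: replicate a
// HISTORY_CUTOFF_OP only when the tracked hybrid time has moved past the
// current cutoff and we have not propagated a cutoff too recently.
bool ShouldPropagateHistoryCutoff(
    bool enable_history_cutoff_propagation,              // flag from this diff
    MonoTimeMs history_cutoff_propagation_interval_ms,   // flag from this diff
    HybridTime last_tracked_op_ht,                       // from MvccManager
    HybridTime current_history_cutoff,
    MonoTimeMs now_ms, MonoTimeMs last_propagation_ms) {
  if (!enable_history_cutoff_propagation) {
    return false;  // old behavior: each replica picks its own cutoff
  }
  return last_tracked_op_ht > current_history_cutoff &&
         now_ms - last_propagation_ms >= history_cutoff_propagation_interval_ms;
}
```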

Also in this diff we are moving the tracking of the multi-set of hybrid
times of active read operations from Tablet to TabletRetentionPolicy.
TabletRetentionPolicy becomes an entity that keeps track of the
committed history cutoff and the oldest read point, and computes
history cutoff values to be used in local compactions and propagated to
followers.

Other improvements:

- Deduplicating code for creating a log prefix for a tablet/peer
  combination.
- Deduplicating code for creating responses for denied votes.
- Creating a utility data structure FirstEntryMetadata for data that is
  stored with the first entry of a log batch, i.e. restart-safe coarse
  time and committed OpId.
- Renaming committed_index, which is actually an OpId, to
  committed_op_id.

Test Plan: Jenkins

Reviewers: amitanand, mikhail, timur

Reviewed By: mikhail, timur

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D7975
spolitov added a commit that referenced this issue Mar 31, 2020
Summary:
This diff implements persistence for transaction-aware snapshots. After
initiation, a snapshot is stored in the system catalog's RocksDB and loaded
into memory after a master restart. The snapshot state is also updated after
the snapshot is complete.

Other changes:

- Creating a RAII class for keeping track of the number of "prepare"
  operations, instead of using manual and error-prone increments and
  decrements (a minimal sketch follows this list).

- Moving SysCatalogWriter from sys_catalog-internal.h to
  sys_catalog_writer.{h,cc}.

- Deduplicating the logic for applying snapshot Raft operations between
  SnapshotOperation::DoReplicated and TabletSnapshots::Apply.
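
For the RAII item above, a minimal sketch of the idea (hypothetical names, not the actual class):

```cpp
#include <atomic>

// Increments a counter of in-flight "prepare" operations on construction and
// decrements it on destruction, so the count cannot leak on early returns or
// exceptions the way manual increments/decrements can.
class ScopedPrepareOperation {
 public:
  explicit ScopedPrepareOperation(std::atomic<int>* counter) : counter_(counter) {
    counter_->fetch_add(1);
  }
  ~ScopedPrepareOperation() { counter_->fetch_sub(1); }

  ScopedPrepareOperation(const ScopedPrepareOperation&) = delete;
  ScopedPrepareOperation& operator=(const ScopedPrepareOperation&) = delete;

 private:
  std::atomic<int>* counter_;
};
```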

Test Plan: ybd --gtest_filter BackupTxnTest.Persistence

Reviewers: mikhail

Reviewed By: mikhail

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8166
spolitov added a commit that referenced this issue Apr 4, 2020
Summary:
Implemented the ability to delete a transaction-aware snapshot.

First, DELETE_ON_MASTER is replicated in the master tablet. After that, the
master sends a DELETE_ON_TABLET request to all the tablets. Each tablet
deletes its local snapshot copy and responds to the master.

Also moved SubmitWrite and part of the transaction-aware snapshot creation
to MasterSnapshotCoordinator.

Test Plan: ybd --gtest_filter BackupTxnTest.Delete

Reviewers: mikhail

Reviewed By: mikhail

Subscribers: oleg, ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8220
spolitov added a commit that referenced this issue Apr 5, 2020
Summary: This diff adds a test that creates a transaction-aware snapshot, then imports its metadata.

Test Plan: ybd --gtest_filter BackupTxnTest.ImportMeta

Reviewers: oleg

Reviewed By: oleg

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8232
spolitov added a commit that referenced this issue Apr 13, 2020
Summary:
This diff fixes several issues related to replaying snapshot-related operations during master bootstrap:
1) On bootstrap failure we try to dump the current replay state, but this could cause a crash, since the replayed entry could already have been removed from pending_replicates and other data structures. Fixed by handling nullptr there.
2) schema_with_ids_ in SysCatalog is initialized after bootstrap, which prevents snapshot-related operations from being replayed. Fixed by initializing schema_with_ids_ in the constructor.
3) SnapshotOperationState does not have a hybrid time while being replayed. Fixed by setting the hybrid time from the replicate message.
4) tablet_ in TabletPeer is not yet initialized while applying the snapshot operation state during bootstrap. Fixed by using the tablet from the operation state.
5) A snapshot could be inserted twice during bootstrap: first because of log replay, and second during sys catalog load.

Also added a restart without flush before BackupTxnTest tear down, to test replaying snapshot operations during bootstrap.

Test Plan: ybd --gtest_filter BackupTxnTest.*

Reviewers: mikhail, oleg, bogdan

Reviewed By: oleg

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8277
spolitov added a commit that referenced this issue Apr 15, 2020
Summary:
After snapshot creation or deletion is initiated, we should retry it unless an unrecoverable error happens.
Currently only the history cutoff can cause a snapshot to fail: we must use the same hybrid
time for the snapshot on all tablets to keep transactional consistency, and if a tablet has
already moved its history cutoff past that point, it could have discarded the required values.
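
A hedged sketch of this retry policy (hypothetical types; the real retries run from the master's snapshot coordinator):

```cpp
#include <chrono>
#include <thread>

enum class OpResult { kDone, kRetryableError, kHistoryCutoffPassed };

// Keep retrying a snapshot create/delete step until it succeeds or hits the
// one unrecoverable case: a tablet whose history cutoff already passed the
// snapshot hybrid time, so the consistent view is no longer available.
template <typename Op>
bool RetryUntilDoneOrUnrecoverable(Op op) {
  for (;;) {
    switch (op()) {
      case OpResult::kDone:
        return true;
      case OpResult::kHistoryCutoffPassed:
        return false;  // unrecoverable: the snapshot must be marked as failed
      case OpResult::kRetryableError:
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        break;  // retry
    }
  }
}
```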

Also, this diff contains the following improvements:
1) Extracted polling logic from TransactionCoordinator to a separate utility class.
2) Fixed a crash when a retryable master task times out.
3) Replaced LOG with LOG_WITH_PREFIX in retryable master tasks.

Test Plan: ybd --gtest_filter BackupTxnTest.Retry

Reviewers: oleg, mikhail

Reviewed By: mikhail

Subscribers: ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8256
spolitov added a commit that referenced this issue Apr 30, 2020
Summary:
This diff adds a test that verifies transaction-aware backup consistency.
It checks that each transaction is either fully backed up or not backed up at all.
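
A sketch of the all-or-nothing property the test verifies (hypothetical data shapes, not the actual test code):

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>

// For each transaction, the rows it wrote, plus the rows found after
// restoring the snapshot. Consistency means every transaction is restored
// either completely or not at all.
bool BackupIsTransactionallyConsistent(
    const std::map<std::string, std::set<std::string>>& written_by_txn,
    const std::set<std::string>& restored_rows) {
  for (const auto& [txn_id, rows] : written_by_txn) {
    std::size_t found = 0;
    for (const auto& row : rows) {
      found += restored_rows.count(row);
    }
    if (found != 0 && found != rows.size()) {
      return false;  // partially restored transaction => inconsistent backup
    }
  }
  return true;
}
```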

Test Plan: ybd --gtest_filter BackupTxnTest.Consistency

Reviewers: mikhail, oleg, bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8377
@bmatican bmatican closed this as completed May 6, 2020