[Design Proposal] Add Remote Translog for Improved Durability #5476
Labels: discuss, enhancement, feedback needed
Overview
OpenSearch is the search and analytics suite powering popular use cases such as application search, log analytics, and more. While OpenSearch is part of critical business and application workflows, it is seldom used as a primary data store because there are no strong guarantees on data durability: the cluster is susceptible to data loss in case of hardware failure. In this document, we propose providing strong durability guarantees in OpenSearch using a remote translog store in conjunction with the remote segment store.
Background
Lucene commits are expensive and cannot be performed on every write operation. OpenSearch uses the translog (transaction log) to record operations performed since the last Lucene commit. However, if the shard is not replicated and the data on local storage is lost due to a power failure or a faulty node, those operations are lost and the user has to restore the data from the last known backup. This not only causes disruption but also amounts to intervals of availability loss, which might be undesirable for certain log/security analytics workloads.
In addition, certain users do not require search availability and would like to reduce cost by decoupling durability from availability.
Challenges
Today, data indexed in OpenSearch clusters is stored on local disk. To achieve durability, that is, to ensure that committed operations are not lost on hardware failure, users either back up segments using snapshots or add more replica copies. However, snapshots do not fully guarantee durability: data indexed since the last snapshot can be lost in case of failure. On the other hand, adding replicas is cost prohibitive.
Remote Segment Store solves the problem to some extent, but it can only provide refresh-level durability guarantees. To provide request-level durability, we need to persist the translog in a remote store as well: Remote Translog Store raises the guarantee to request level by durably storing the operations received after the last refresh.
Proposed Solution
As a part of improving the overall durability of OpenSearch, we intend to back up both segments and transaction logs. Backing up segments is taken up as part of the Remote Segment Store feature. Here we focus on durably storing the translog in a remote translog store and on its interaction with the remote segment store.
Tenets
Indexing Flow
The incoming bulk request is split into per-shard bulk requests, which get indexed into the underlying engine (each shard is a Lucene engine). The IndexShard is responsible for indexing operations into Lucene and, once complete, writes those operations to the translog per request for durability. Periodically, one of the writer threads acquires a semaphore, moves all buffered operations to the FileChannel, and fsyncs them to disk.
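To make the write path concrete, here is a minimal, simplified sketch of this buffering and sync behaviour. The class and field names are illustrative and do not match the actual IndexShard/Translog implementation; it only demonstrates the pattern of appending operations to an in-memory buffer and letting whichever writer thread wins the semaphore flush and fsync to disk.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Semaphore;

// Hypothetical, simplified sketch of the per-shard write path: index into the
// engine, append the operation to an in-memory buffer, and let whichever
// writer thread acquires the semaphore flush and fsync the buffer to disk.
class ShardTranslogSketch {
    private final FileChannel channel;
    private final Semaphore syncLock = new Semaphore(1);
    private ByteBuffer buffer = ByteBuffer.allocate(1 << 20); // 1 MiB in-memory buffer

    ShardTranslogSketch(Path translogFile) throws IOException {
        this.channel = FileChannel.open(translogFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    // Called per operation after it has been indexed into Lucene.
    synchronized void add(byte[] serializedOperation) {
        if (buffer.remaining() < serializedOperation.length) {
            buffer = grow(buffer, serializedOperation.length);
        }
        buffer.put(serializedOperation);
    }

    // One of the writer threads acquires the semaphore, moves the buffered bytes
    // to the channel and fsyncs; the other threads simply rely on that sync.
    void sync() throws IOException {
        if (syncLock.tryAcquire()) {
            try {
                ByteBuffer toWrite;
                synchronized (this) {
                    buffer.flip();
                    toWrite = buffer;
                    buffer = ByteBuffer.allocate(1 << 20);
                }
                while (toWrite.hasRemaining()) {
                    channel.write(toWrite);
                }
                channel.force(true); // fsync translog data and metadata
            } finally {
                syncLock.release();
            }
        }
    }

    private static ByteBuffer grow(ByteBuffer old, int extra) {
        ByteBuffer bigger = ByteBuffer.allocate(Math.max(old.capacity() * 2, old.position() + extra));
        old.flip();
        bigger.put(old);
        return bigger;
    }
}
```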
With a remote translog, we no longer need to write operations to the transaction log on the replica, since the remote store already makes the operations durable, which is the job the replica translog has been responsible for so far.
Primary Term Validation
This is detailed here. To summarize, with a remote translog store we do not need to fan out most requests to the replicas; we only need to perform primary term validation on writes.
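As an illustration only (the class and method names below are hypothetical, not the actual transport actions), primary term validation on a replica boils down to rejecting requests that carry a stale primary term, without applying the write or touching a local translog:

```java
// Illustrative sketch: the replica only validates that the sender still holds
// the primary term it claims, instead of applying the full replicated write.
class PrimaryTermValidatorSketch {
    private volatile long currentPrimaryTerm;

    PrimaryTermValidatorSketch(long initialTerm) {
        this.currentPrimaryTerm = initialTerm;
    }

    // Bumped by the cluster when a new primary is promoted.
    void updateTerm(long newTerm) {
        if (newTerm > currentPrimaryTerm) {
            currentPrimaryTerm = newTerm;
        }
    }

    // Called for each write request fanned out for term validation only.
    void validate(long requestPrimaryTerm) {
        if (requestPrimaryTerm < currentPrimaryTerm) {
            throw new IllegalStateException("stale primary term " + requestPrimaryTerm
                    + ", current term is " + currentPrimaryTerm);
        }
        // No translog write on the replica: durability is provided by the remote store.
    }
}
```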
Remote Transfer
Transfer Flow
Once the translog has been synced to disk, the transfer manager uploads the translog generation files and their checkpoint files to the remote store.
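A rough sketch of this transfer step, assuming a generic blob-store abstraction (the RemoteBlobStore interface and file naming below are hypothetical):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical remote store abstraction; the real implementation would be
// backed by the repository / object-store plugin.
interface RemoteBlobStore {
    void upload(String blobName, InputStream data, long length) throws IOException;
}

// Sketch of the transfer manager: after the local fsync, upload the translog
// generation file together with its checkpoint so both are recoverable.
class TranslogTransferSketch {
    private final RemoteBlobStore remoteStore;

    TranslogTransferSketch(RemoteBlobStore remoteStore) {
        this.remoteStore = remoteStore;
    }

    void transferGeneration(Path translogFile, Path checkpointFile) throws IOException {
        uploadFile(translogFile);   // e.g. translog-N.tlog
        uploadFile(checkpointFile); // e.g. translog-N.ckp, uploaded after the data it describes
    }

    private void uploadFile(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            remoteStore.upload(file.getFileName().toString(), in, Files.size(file));
        }
    }
}
```

In this sketch the checkpoint is uploaded only after the generation file it describes, so the remote state never contains a checkpoint that references data that has not yet been uploaded.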
Remote Translog Files and Location
Contents of Remote Translog Files
Remote Translog Download Interaction
Translog Durability Modes
All operations are written to the translog, synced to disk, and uploaded to the remote store in the sync path. However, the remote store can add extra latency in the sync path, which might impact performance and might not be acceptable to users who want the same performance characteristics as segment replication. Some of these users, such as log analytics workloads, might be fine with losing a bounded amount of data at worst.
Smaller object store writes are usually costly, so batching them over a time window or up to a minimum size can provide a balance between cost and durability guarantees.
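For illustration, such batching could look roughly like the sketch below; the thresholds and the Uploader hook are assumptions for the example, not proposed settings or APIs:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch: buffer small translog payloads and upload them as a single object
// once either a size threshold or a time window is crossed.
class BatchingUploaderSketch {
    interface Uploader {
        void upload(byte[] batch) throws IOException;
    }

    private static final long MIN_BATCH_BYTES = 1 << 20;     // 1 MiB (illustrative)
    private static final long MAX_BATCH_AGE_MILLIS = 10_000; // 10 s window (illustrative)

    private final Uploader uploader;
    private final List<byte[]> pending = new ArrayList<>();
    private long pendingBytes = 0;
    private long batchStartMillis = System.currentTimeMillis();

    BatchingUploaderSketch(Uploader uploader) {
        this.uploader = uploader;
    }

    synchronized void add(byte[] payload) throws IOException {
        pending.add(payload);
        pendingBytes += payload.length;
        boolean bigEnough = pendingBytes >= MIN_BATCH_BYTES;
        boolean oldEnough = System.currentTimeMillis() - batchStartMillis >= MAX_BATCH_AGE_MILLIS;
        if (bigEnough || oldEnough) {
            flushBatch();
        }
    }

    private void flushBatch() throws IOException {
        ByteArrayOutputStream batch = new ByteArrayOutputStream();
        for (byte[] payload : pending) {
            batch.write(payload, 0, payload.length);
        }
        uploader.upload(batch.toByteArray()); // one object store write for the whole batch
        pending.clear();
        pendingBytes = 0;
        batchStartMillis = System.currentTimeMillis();
    }
}
```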
Sync durability
With the remote store, writing operations to the remote store becomes the long pole and governs overall latency, which is most noticeable with smaller bulk payloads containing few documents, or with single-document inserts or updates. Having said that, throughput remains almost constant even as latencies plateau, because only a single writer thread does the heavy lifting of performing remote I/O. In return, this durability mode guarantees that all acknowledged operations have been persisted with a high level of durability.
Async durability
OpenSearch natively supports async durability, which allows operations to be fsynced at a fixed interval rather than per request. This boosts performance, as operations are buffered up to that interval and fsynced only then, reducing the frequency of costly fsyncs without blocking on acknowledgements. The caveat is that all operations acknowledged since the last fsync can be lost in the event of a crash.
The expectation for users opting for async durability is that they are either fine with losing, say, X seconds (e.g. 10 s or 1 min, the upload interval) worth of data, as defined by their RPO SLA, or have queues on their end that can be used to re-drive the last X seconds of data. Re-driving operations might not be authoritative, since we do not expose the sequence number of the last fsynced operation per shard.
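For illustration, async durability amounts to a background task that periodically performs the fsync and remote upload while writes are acknowledged immediately; the existing index.translog.durability and index.translog.sync_interval index settings are the natural knobs to carry this behaviour over to the remote store. The sketch below is a simplified approximation, not the actual implementation:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: acknowledge writes as soon as they are buffered, and let a background
// task fsync the translog and upload it to the remote store every interval.
// Anything buffered after the last successful sync is lost on a crash, which
// bounds the potential data loss by the configured interval.
class AsyncDurabilitySketch implements AutoCloseable {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // syncAction is assumed to perform the local fsync plus the remote upload.
    AsyncDurabilitySketch(Runnable syncAction, long syncIntervalMillis) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                syncAction.run();
            } catch (Exception e) {
                // Swallow and retry on the next tick; a scheduled task that throws
                // would otherwise stop being rescheduled.
            }
        }, syncIntervalMillis, syncIntervalMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        scheduler.shutdown();
    }
}
```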
Recovery
Failover
When the primary goes down and a replica is promoted, the promoted primary will need to integrate with the remote translog store.
Translog Deletion Policy / Garbage Collection
With sync durability, each operation is logged to the transaction log and uploaded to the remote store before an acknowledgement is sent back to the client. Additionally, at every refresh and commit, the uncommitted data in the form of segments is uploaded to the remote segment store. When we need to recover data at any point in time, we piggyback on the segment store for the vast majority of the data set and only recover uncommitted operations from the remote translog store, to keep recovery time low and leverage the physical segments that are already created. This means the amount of data in the remote translog store keeps growing, so we need to ensure we purge it.
OpenSearch uses a flush threshold to bound the amount of translog on disk at any point, after which the engine triggers a flush to commit segments. The flush triggers a rollover to a new translog generation and then trims older translog generation files (readers) that are no longer needed, i.e. those containing only sequence numbers at or below the local checkpoint of the new safe commit. The unreferenced translog generations are purged as per a deletion policy. Using a remote translog deletion policy, we can purge older generation files in the remote translog in the same flow, up to the sequence number that has been durably persisted. This can run in the background without affecting translog operation latency.
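A minimal sketch of what such a remote translog deletion policy could look like, assuming hypothetical class names and a remote deletion hook:

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch of a remote translog deletion policy: after a flush establishes a new
// safe commit, remote generations whose operations all fall below the durably
// persisted sequence number can be deleted in the background.
class RemoteTranslogDeletionPolicySketch {
    // Hypothetical deletion hook backed by the remote store client.
    interface RemoteDeleter {
        void deleteGeneration(long generation) throws IOException;
    }

    // generation -> highest sequence number contained in that remote generation
    private final ConcurrentSkipListMap<Long, Long> remoteGenerations = new ConcurrentSkipListMap<>();

    void onGenerationUploaded(long generation, long maxSeqNo) {
        remoteGenerations.put(generation, maxSeqNo);
    }

    // Invoked after flush/rollover, off the indexing path.
    void trimUnreferenced(long minRequiredSeqNo, RemoteDeleter deleter) throws IOException {
        if (remoteGenerations.isEmpty()) {
            return;
        }
        // Never delete the current (latest) generation, even if fully persisted.
        for (Map.Entry<Long, Long> entry : remoteGenerations.headMap(remoteGenerations.lastKey()).entrySet()) {
            if (entry.getValue() < minRequiredSeqNo) {
                deleter.deleteGeneration(entry.getKey());
                remoteGenerations.remove(entry.getKey());
            }
        }
    }
}
```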
Data Integrity
Ensuring zero data corruption and no silent data errors at all times is key to maintaining a durable data store. We ensure high data integrity by leveraging Lucene checksums and validating them at all interaction points. Every operation in the translog generation files has a checksum associated with it, persisted in the file. The checkpoint and metadata files have a codec header and footer; the codec footer contains a Lucene checksum of the file contents, which is verified each time the file is read back. If, while recovering operations from the store, we detect that the checksums of operations or their corresponding checkpoints do not match, the system will refuse to recover from the corrupted files.
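As a simplified illustration of the per-operation check (the real translog format and framing differ, and the class below is hypothetical):

```java
import java.util.zip.CRC32;

// Illustrative check: every operation read back from a downloaded translog
// generation carries a checksum; recovery fails fast on any mismatch.
final class TranslogChecksumSketch {
    static void verify(byte[] operationBytes, long expectedChecksum) {
        CRC32 crc = new CRC32();
        crc.update(operationBytes, 0, operationBytes.length);
        if (crc.getValue() != expectedChecksum) {
            throw new IllegalStateException("translog operation is corrupted: checksum mismatch");
        }
    }
}
```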
To ensure the system is protected against silent data errors, which might creep in due to a buggy CPU core or hardware malfunction (albeit very rare), it is imperative to perform periodic recoveries to proactively catch any corruption. This ensures that whenever we actually need to recover, there are no surprises, and we can be confident that all backed-up data is actually restorable.
Correctness
The replication model, which uses logical replication, and its interaction with the remote store need a TLA+ model for formal verification of the design, and additionally system testing frameworks like Jepsen (https://jepsen.io/) for validating the implementation. The system needs to be battle tested against various failure modes: network partitions, failovers, GC pauses, and slow disk or network I/O.
FAQs
1. On which indices can remote translog work?
Remote Translog will work only on indices with Remote Segment Store enabled, which in turn requires Segment Replication.
2. Is only object store supported for storing the translog?
Yes, for the initial implementation we are relying only on object stores. However, the design can be extended to support other stores as well.