[Design Proposal] Add Remote Translog for Improved Durability #5476
Labels: discuss, enhancement, feedback needed
Overview
OpenSearch is the search and analytics suite powering popular use cases such as application search, log analytics, and more. While OpenSearch is part of critical business and application workflows, it is seldom used as a primary data store because there are no strong guarantees on data durability: the cluster is susceptible to data loss in case of hardware failure. In this document, we propose providing strong durability guarantees in OpenSearch using a remote translog store in conjunction with the remote segment store.
Background
Lucene commits are expensive and cannot be performed on every write operation. OpenSearch uses the translog (transaction log) to record operations performed since the last Lucene commit. However, if the shard is not replicated and the data on local storage is lost due to a power failure or a faulty node, those operations are lost and the user has to restore the data from the last known backup. This not only causes disruption but also amounts to intervals of availability loss, which might be undesirable for certain log/security analytics workloads.
In addition, certain users do not require search availability and would like to reduce cost by decoupling durability from availability.
Challenges
Today, data indexed in OpenSearch clusters is stored on local disk. To achieve durability, that is, to ensure that committed operations are not lost on hardware failure, users either back up segments using snapshots or add more replica copies. However, snapshots do not fully guarantee durability: data indexed since the last snapshot can be lost in case of failure. On the other hand, adding replicas is cost prohibitive.
Remote Segment Store solves the problem to some extent, but it can only provide refresh-level durability guarantees. To provide request-level durability, we need to persist the translog in a remote store as well: Remote Translog Store raises the guarantee to request level by durably storing the operations received after the last refresh.
Proposed Solution
As a part of improving the overall durability of OpenSearch, we intend to back up both segments and transaction logs. Backing up segments is taken up as part of the Remote Segment Store feature. Here we focus on durably storing the translog in a remote translog store and on its interaction with the remote segment store.
Tenets
Indexing Flow
The incoming bulk request is split into per-shard bulk requests, which get indexed into the underlying engine (each shard is a Lucene engine). The IndexShard is responsible for indexing operations into Lucene and, once complete, writes those operations to the translog per request for durability. Periodically, one of the writer threads acquires a semaphore, moves all buffered operations to the FileChannel, and fsyncs them to disk.
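To make the write path concrete, here is a minimal, simplified sketch of this buffering and sync behaviour. The class and field names are illustrative and do not match the actual IndexShard/Translog implementation; it only demonstrates the pattern of appending operations to an in-memory buffer and letting whichever writer thread wins the semaphore flush and fsync to disk.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Semaphore;

// Hypothetical, simplified sketch of the per-shard write path: index into the
// engine, append the operation to an in-memory buffer, and let whichever
// writer thread acquires the semaphore flush and fsync the buffer to disk.
class ShardTranslogSketch {
    private final FileChannel channel;
    private final Semaphore syncLock = new Semaphore(1);
    private ByteBuffer buffer = ByteBuffer.allocate(1 << 20); // 1 MiB in-memory buffer

    ShardTranslogSketch(Path translogFile) throws IOException {
        this.channel = FileChannel.open(translogFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    // Called per operation after it has been indexed into Lucene.
    synchronized void add(byte[] serializedOperation) {
        if (buffer.remaining() < serializedOperation.length) {
            buffer = grow(buffer, serializedOperation.length);
        }
        buffer.put(serializedOperation);
    }

    // One of the writer threads acquires the semaphore, moves the buffered bytes
    // to the channel and fsyncs; the other threads simply rely on that sync.
    void sync() throws IOException {
        if (syncLock.tryAcquire()) {
            try {
                ByteBuffer toWrite;
                synchronized (this) {
                    buffer.flip();
                    toWrite = buffer;
                    buffer = ByteBuffer.allocate(1 << 20);
                }
                while (toWrite.hasRemaining()) {
                    channel.write(toWrite);
                }
                channel.force(true); // fsync translog data and metadata
            } finally {
                syncLock.release();
            }
        }
    }

    private static ByteBuffer grow(ByteBuffer old, int extra) {
        ByteBuffer bigger = ByteBuffer.allocate(Math.max(old.capacity() * 2, old.position() + extra));
        old.flip();
        bigger.put(old);
        return bigger;
    }
}
```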
With a remote translog, we no longer need to write operations to the transaction log on the replica, since the remote store already makes the operations durable, which is the job the replica translog has been responsible for so far.
Primary Term Validation
This is detailed here. To summarize, with a remote translog store we do not need to fan out most requests to the replicas; we only need to perform primary term validation on writes.
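As an illustration only (the class and method names below are hypothetical, not the actual transport actions), primary term validation on a replica boils down to rejecting requests that carry a stale primary term, without applying the write or touching a local translog:

```java
// Illustrative sketch: the replica only validates that the sender still holds
// the primary term it claims, instead of applying the full replicated write.
class PrimaryTermValidatorSketch {
    private volatile long currentPrimaryTerm;

    PrimaryTermValidatorSketch(long initialTerm) {
        this.currentPrimaryTerm = initialTerm;
    }

    // Bumped by the cluster when a new primary is promoted.
    void updateTerm(long newTerm) {
        if (newTerm > currentPrimaryTerm) {
            currentPrimaryTerm = newTerm;
        }
    }

    // Called for each write request fanned out for term validation only.
    void validate(long requestPrimaryTerm) {
        if (requestPrimaryTerm < currentPrimaryTerm) {
            throw new IllegalStateException("stale primary term " + requestPrimaryTerm
                    + ", current term is " + currentPrimaryTerm);
        }
        // No translog write on the replica: durability is provided by the remote store.
    }
}
```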
Remote Transfer
Transfer Flow
Once the translog has been synced to disk, the transfer manager uploads the translog generation files and their checkpoint files to the remote store.
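A rough sketch of this transfer step, assuming a generic blob-store abstraction (the RemoteBlobStore interface and file naming below are hypothetical):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical remote store abstraction; the real implementation would be
// backed by the repository / object-store plugin.
interface RemoteBlobStore {
    void upload(String blobName, InputStream data, long length) throws IOException;
}

// Sketch of the transfer manager: after the local fsync, upload the translog
// generation file together with its checkpoint so both are recoverable.
class TranslogTransferSketch {
    private final RemoteBlobStore remoteStore;

    TranslogTransferSketch(RemoteBlobStore remoteStore) {
        this.remoteStore = remoteStore;
    }

    void transferGeneration(Path translogFile, Path checkpointFile) throws IOException {
        uploadFile(translogFile);   // e.g. translog-N.tlog
        uploadFile(checkpointFile); // e.g. translog-N.ckp, uploaded after the data it describes
    }

    private void uploadFile(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            remoteStore.upload(file.getFileName().toString(), in, Files.size(file));
        }
    }
}
```

In this sketch the checkpoint is uploaded only after the generation file it describes, so the remote state never contains a checkpoint that references data that has not yet been uploaded.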
Remote Translog Files and Location
Contents of Remote Translog Files
Remote Translog Download Interaction
Translog Durability Modes
All operations are written to the translog, synced to disk, and uploaded to the remote store in the sync path. However, the remote store can add extra latency in the sync path, which might impact performance and might not be acceptable to users who want the same performance characteristics as segment replication. Some of these users, such as log analytics workloads, might be fine with losing a bounded amount of data at worst.
Smaller object store writes are usually costly, so batching them over a time window or up to a minimum size can provide a balance between cost and durability guarantees.
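For illustration, such batching could look roughly like the sketch below; the thresholds and the Uploader hook are assumptions for the example, not proposed settings or APIs:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch: buffer small translog payloads and upload them as a single object
// once either a size threshold or a time window is crossed.
class BatchingUploaderSketch {
    interface Uploader {
        void upload(byte[] batch) throws IOException;
    }

    private static final long MIN_BATCH_BYTES = 1 << 20;     // 1 MiB (illustrative)
    private static final long MAX_BATCH_AGE_MILLIS = 10_000; // 10 s window (illustrative)

    private final Uploader uploader;
    private final List<byte[]> pending = new ArrayList<>();
    private long pendingBytes = 0;
    private long batchStartMillis = System.currentTimeMillis();

    BatchingUploaderSketch(Uploader uploader) {
        this.uploader = uploader;
    }

    synchronized void add(byte[] payload) throws IOException {
        pending.add(payload);
        pendingBytes += payload.length;
        boolean bigEnough = pendingBytes >= MIN_BATCH_BYTES;
        boolean oldEnough = System.currentTimeMillis() - batchStartMillis >= MAX_BATCH_AGE_MILLIS;
        if (bigEnough || oldEnough) {
            flushBatch();
        }
    }

    private void flushBatch() throws IOException {
        ByteArrayOutputStream batch = new ByteArrayOutputStream();
        for (byte[] payload : pending) {
            batch.write(payload, 0, payload.length);
        }
        uploader.upload(batch.toByteArray()); // one object store write for the whole batch
        pending.clear();
        pendingBytes = 0;
        batchStartMillis = System.currentTimeMillis();
    }
}
```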
Sync durability
With the remote store, writing operations to the remote store becomes the long pole and governs overall latency, which is most noticeable with smaller bulk payloads containing few documents, or with single-document inserts or updates. Having said that, throughput remains almost constant even as latencies plateau, because only a single writer thread does the heavy lifting of performing remote I/O. In return, this durability mode guarantees that all acknowledged operations have been persisted with a high level of durability.
Async durability
OpenSearch natively supports async durability, which allows operations to be fsynced at a fixed interval rather than per request. This boosts performance, as operations are buffered up to that interval and fsynced only then, reducing the frequency of costly fsyncs without blocking on acknowledgements. The caveat is that all operations acknowledged since the last fsync can be lost in the event of a crash.
The expectation for users opting for async durability is that they are either fine with losing, say, X seconds (e.g. 10 s or 1 min, the upload interval) worth of data, as defined by their RPO SLA, or have queues on their end that can be used to re-drive the last X seconds of data. Re-driving operations might not be authoritative, since we do not expose the sequence number of the last fsynced operation per shard.
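For illustration, async durability amounts to a background task that periodically performs the fsync and remote upload while writes are acknowledged immediately; the existing index.translog.durability and index.translog.sync_interval index settings are the natural knobs to carry this behaviour over to the remote store. The sketch below is a simplified approximation, not the actual implementation:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: acknowledge writes as soon as they are buffered, and let a background
// task fsync the translog and upload it to the remote store every interval.
// Anything buffered after the last successful sync is lost on a crash, which
// bounds the potential data loss by the configured interval.
class AsyncDurabilitySketch implements AutoCloseable {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // syncAction is assumed to perform the local fsync plus the remote upload.
    AsyncDurabilitySketch(Runnable syncAction, long syncIntervalMillis) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                syncAction.run();
            } catch (Exception e) {
                // Swallow and retry on the next tick; a scheduled task that throws
                // would otherwise stop being rescheduled.
            }
        }, syncIntervalMillis, syncIntervalMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public void close() {
        scheduler.shutdown();
    }
}
```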
Recovery
Failover
When the primary goes down and a replica is promoted, the promoted primary will need to integrate with the remote translog store.
Translog Deletion Policy / Garbage Collection
With sync durability, each operation is logged to the transaction log and uploaded to the remote store before an acknowledgement is sent back to the client. Additionally, at every refresh and commit, the uncommitted data in the form of segments is uploaded to the remote segment store. When we need to recover data at any point in time, we piggyback on the segment store for the vast majority of the data set and only recover uncommitted operations from the remote translog store, to keep recovery time low and leverage the physical segments that are already created. This means the amount of data in the remote translog store keeps growing, so we need to ensure we purge it.
OpenSearch uses a flush threshold to bound the amount of translog on disk at any point, after which the engine triggers a flush to commit segments. The flush triggers a rollover to a new translog generation and then trims older translog generation files (readers) that are no longer needed, i.e. those containing only sequence numbers at or below the local checkpoint of the new safe commit. The unreferenced translog generations are purged as per a deletion policy. Using a remote translog deletion policy, we can purge older generation files in the remote translog in the same flow, up to the sequence number that has been durably persisted. This can run in the background without affecting translog operation latency.
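A minimal sketch of what such a remote translog deletion policy could look like, assuming hypothetical class names and a remote deletion hook:

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch of a remote translog deletion policy: after a flush establishes a new
// safe commit, remote generations whose operations all fall below the durably
// persisted sequence number can be deleted in the background.
class RemoteTranslogDeletionPolicySketch {
    // Hypothetical deletion hook backed by the remote store client.
    interface RemoteDeleter {
        void deleteGeneration(long generation) throws IOException;
    }

    // generation -> highest sequence number contained in that remote generation
    private final ConcurrentSkipListMap<Long, Long> remoteGenerations = new ConcurrentSkipListMap<>();

    void onGenerationUploaded(long generation, long maxSeqNo) {
        remoteGenerations.put(generation, maxSeqNo);
    }

    // Invoked after flush/rollover, off the indexing path.
    void trimUnreferenced(long minRequiredSeqNo, RemoteDeleter deleter) throws IOException {
        if (remoteGenerations.isEmpty()) {
            return;
        }
        // Never delete the current (latest) generation, even if fully persisted.
        for (Map.Entry<Long, Long> entry : remoteGenerations.headMap(remoteGenerations.lastKey()).entrySet()) {
            if (entry.getValue() < minRequiredSeqNo) {
                deleter.deleteGeneration(entry.getKey());
                remoteGenerations.remove(entry.getKey());
            }
        }
    }
}
```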
Data Integrity
Ensuring zero data corruption and no silent data errors at all times is key to maintaining a durable data store. We ensure high data integrity by leveraging Lucene checksums and validating them at all interaction points. Every operation in the translog generation files has a checksum associated with it, persisted in the file. The checkpoint and metadata files have a codec header and footer; the codec footer contains a Lucene checksum of the file contents, which is verified each time the file is read back. If, while recovering operations from the store, we detect that the checksums of operations or their corresponding checkpoints do not match, the system will refuse to recover from the corrupted files.
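As a simplified illustration of the per-operation check (the real translog format and framing differ, and the class below is hypothetical):

```java
import java.util.zip.CRC32;

// Illustrative check: every operation read back from a downloaded translog
// generation carries a checksum; recovery fails fast on any mismatch.
final class TranslogChecksumSketch {
    static void verify(byte[] operationBytes, long expectedChecksum) {
        CRC32 crc = new CRC32();
        crc.update(operationBytes, 0, operationBytes.length);
        if (crc.getValue() != expectedChecksum) {
            throw new IllegalStateException("translog operation is corrupted: checksum mismatch");
        }
    }
}
```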
To ensure the system is protected against silent data errors, which might creep in due to a buggy CPU core or hardware malfunction (albeit very rare), it is imperative to perform periodic recoveries to proactively catch any corruption. This ensures that whenever we actually need to recover, there are no surprises, and we can be confident that all backed-up data is actually restorable.
Correctness
The replication model, which uses logical replication, and its interaction with the remote store need a TLA+ model for formal verification of the design, and additionally system testing frameworks like Jepsen (https://jepsen.io/) for validating the implementation. The system needs to be battle tested against various failure modes: network partitions, failovers, GC pauses, and slow disk or network I/O.
FAQs
1. On which indices can remote translog work?
Remote Translog will work only on indices with Remote Segment Store enabled, which in turn requires Segment Replication.
2. Is only object store supported for storing the translog?
Yes, for the initial implementation we are relying only on object stores. However, the design can be extended to support other stores as well.