Replies: 7 comments 26 replies
-
Regarding wrapping the guard and S3 write in a distributed lock: I came across https://crates.io/crates/dynalock recently, which may be worth taking a dependency on.
-
The problem you mentioned exists in our current filesystem backend as well. For example, a writer could crash right after it created the new log entry, but before fully writing out the log content. When this happens, we would end up with a corrupted log entry. The core of the problem here is that the transaction commit semantics are modeled by the creation of the file, without taking the file content into account. The same analysis applies to the failure case you mentioned with DynamoDB, where putting/reserving a key in DynamoDB only indicates "creation" of the file, not a file with fully flushed content. I think it would be more fitting to change the […]

Back to S3 + DynamoDB: my original suggestion of leveraging S3's new consistent read capability won't fly, as you already pointed out. It looks like, to be 100% safe, we would still need to use DynamoDB for listing so we can get that […]

Here is one way to do it, in our commit loop:
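The commit-loop idea might be sketched like this; a minimal, self-contained simulation where a `HashMap` stands in for DynamoDB and `put_if_absent` models a `PutItem` request guarded by an `attribute_not_exists(path)` condition expression (all names here are illustrative, not delta-rs APIs):

```rust
use std::collections::HashMap;

/// In-memory stand-in for DynamoDB. `put_if_absent` models a `PutItem`
/// request guarded by an `attribute_not_exists(path)` condition expression.
pub struct MockDynamoDb {
    items: HashMap<String, Vec<u8>>,
}

impl MockDynamoDb {
    pub fn new() -> Self {
        Self { items: HashMap::new() }
    }

    /// Fails if the key already exists, mirroring a failed condition check.
    pub fn put_if_absent(&mut self, path: &str, body: &[u8]) -> Result<(), String> {
        if self.items.contains_key(path) {
            return Err(format!("AlreadyExists: {}", path));
        }
        self.items.insert(path.to_string(), body.to_vec());
        Ok(())
    }
}

/// Optimistic commit loop: try the next version until the conditional
/// put wins, then return the version that was committed.
pub fn commit(db: &mut MockDynamoDb, mut version: u64, log: &[u8]) -> u64 {
    loop {
        let path = format!("_delta_log/{:020}.json", version);
        match db.put_if_absent(&path, log) {
            Ok(()) => return version,
            // Someone else committed this version first: rebase and retry.
            Err(_) => version += 1,
        }
    }
}
```

On a failed condition check the writer rebases to the next version and retries, which is the same optimistic concurrency loop delta-rs already runs against other backends.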
To make it backwards compatible with readers using vanilla S3 backends, we can replicate logs from DynamoDB into S3 in a dedicated single-threaded replicator/writer:
-
Another way to do it is using the distributed lock design you mentioned to serialize the write:
Notice that in this design, we only need to use DynamoDB for locking. We don't need to use it to implement […] This design is simpler compared to the one that uses DynamoDB for listing, at the cost of slower commit time and the need to handle the lock-expiration edge case.
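As a rough sketch of the lock-based variant, assuming a DynamoDB lock item carrying an owner and a lease duration (the in-memory `LockClient` below is a stand-in, not a real DynamoDB client):

```rust
use std::time::{Duration, Instant};

/// Stand-in for a DynamoDB lock item: who owns it, when the lease started,
/// and how long the lease lasts.
struct Lock {
    owner: String,
    acquired_at: Instant,
    lease: Duration,
}

/// Toy lock client; a real implementation would use conditional
/// `PutItem`/`UpdateItem` calls against a DynamoDB lock table instead.
pub struct LockClient {
    current: Option<Lock>,
}

impl LockClient {
    pub fn new() -> Self {
        Self { current: None }
    }

    /// Acquire succeeds if no lock exists or the existing lease has expired.
    pub fn try_acquire(&mut self, owner: &str, lease: Duration) -> bool {
        let free_or_expired = self
            .current
            .as_ref()
            .map(|l| l.acquired_at.elapsed() > l.lease)
            .unwrap_or(true);
        if free_or_expired {
            self.current = Some(Lock {
                owner: owner.to_string(),
                acquired_at: Instant::now(),
                lease,
            });
        }
        free_or_expired
    }

    /// Only the current owner may release the lock.
    pub fn release(&mut self, owner: &str) {
        if self.current.as_ref().map(|l| l.owner == owner).unwrap_or(false) {
            self.current = None;
        }
    }
}
```

The expiration path is exactly the edge case mentioned above: a writer whose lease ran out may still believe it holds the lock, which is what makes this design trickier than it looks.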
-
There are a lot of interesting discussions around this topic in delta-io/delta#41 as well, which covered many of the points we discussed here.
-
Fantastic suggestions all around @houqp. I am all on board for implementing "atomic rename" instead of "create if not exists". Since the use case I am primarily interested in begins with headless multi-process streaming writers appending to the same table and competing over the next version, I personally prefer approach 2 you mention above. I think approach 1 would land me right back in a spot where I would either need a dedicated replication process or another not-yet-mentioned lock to let separate threads in each process compete over which process is doing the replication. Approach 2 does not have this same development overhead as far as I can tell.
-
@mosyp @xianwill here is an expanded potential atomic rename design with a fencing token.

**Before the optimistic commit loop**

Write log entries to a temp S3 location (…).

**At the beginning of the optimistic commit loop**

Compute the next version from the cached delta table version in memory. It is required to compute the version outside of the lock for non-append-only writes. For append-only writes, we could optimize further and get the latest version from the DynamoDB lock.

**Atomic rename**
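The fencing-token part can be illustrated with a toy model: the lock service hands out a strictly increasing token on each acquisition, and the storage side rejects any commit carrying a token older than the highest one it has seen (hypothetical names, not a delta-rs API):

```rust
/// Toy model of fencing-token protection. The lock service hands out a
/// strictly increasing token with every acquisition; the storage side
/// rejects any commit whose token is older than the highest one seen.
pub struct FencedStore {
    highest_token_seen: u64,
}

impl FencedStore {
    pub fn new() -> Self {
        Self { highest_token_seen: 0 }
    }

    /// Stand-in for the "atomic rename" commit: succeed only if the
    /// caller's fencing token is not stale.
    pub fn rename_if_fresh(&mut self, token: u64) -> Result<(), &'static str> {
        if token < self.highest_token_seen {
            // The lock expired and was re-acquired by someone with a newer
            // token; this writer must abort instead of clobbering the log.
            return Err("stale fencing token");
        }
        self.highest_token_seen = token;
        Ok(())
    }
}
```

This is what protects against a writer that paused (GC, network partition) past its lease and then woke up believing it still holds the lock.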
-
**Atomic rename for S3 v2.0**

We introduce new […]

Variables: […]

The result of […]:

a) The lock is not acquired: try again (still in progress or someone else took it).

b) Acquired a released or new lock: […]

c) The current lock is expired: this is a dangerous zone; the previous owner is in an undefined state, e.g. halted, crashed, etc. At this point we have to repair the rename commit (if any) of the previous owner: […]
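The three outcomes above can be expressed as a small decision function (illustrative types only, not delta-rs code):

```rust
/// Observed state of the lock item (illustrative types only).
pub enum LockState {
    /// Someone else holds a live lease.
    HeldByOther,
    /// Released, or the lock item does not exist yet.
    Free,
    /// The lease ran out; the previous owner is in an undefined state.
    Expired,
}

/// What the writer should do next for each state.
pub enum Action {
    Retry,
    Proceed,
    Repair,
}

pub fn next_action(state: LockState) -> Action {
    match state {
        // (a) still in progress or someone else took it: try again.
        LockState::HeldByOther => Action::Retry,
        // (b) acquired a released or new lock: continue with the commit.
        LockState::Free => Action::Proceed,
        // (c) expired lock: repair the previous owner's rename commit
        // (if any) before doing anything else.
        LockState::Expired => Action::Repair,
    }
}
```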
-
S3 does not have an API that supports the "Create if not exists" semantics that the other storage backends have. This discussion is to work out what a wrapper on top of S3 should look like to provide these semantics when writing delta log entries.
Ultimately, we need the `put_obj` implementation within the S3 backend to fail with `StorageError::AlreadyExists`, as the fs backend does, iff a log entry with the same version has already been written.

DynamoDB is frequently mentioned as a metastore on top of S3 to provide this type of API. delta-rs should provide functionality to: […]
This discussion is intended to work out the schema and protocol to include "Create if not exists" semantics in the `put_obj` implementation of the S3 backend.
To kick the discussion off with a naive approach:
Consider a DynamoDB schema with only a `path` attribute. A `put_obj` implementation may look like:

Already, we have some problems. If there is a system failure between the DynamoDB guard write and the delta log file write for version n, the delta log file for version n will not be written. The delta protocol requires contiguous, monotonically-increasing integer versions, so this must be resolved before other log writers can continue the optimistic concurrency loop with any chance of success.
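The naive two-step implementation described above could be sketched as follows, with an in-memory guard set standing in for the DynamoDB table and a map standing in for S3 (illustrative names, not the delta-rs API); the comment marks the crash window:

```rust
use std::collections::{HashMap, HashSet};

/// Naive backend sketch: a guard set standing in for the DynamoDB table
/// and a map standing in for the S3 bucket.
pub struct NaiveS3Backend {
    dynamo_guard: HashSet<String>,
    s3: HashMap<String, Vec<u8>>,
}

impl NaiveS3Backend {
    pub fn new() -> Self {
        Self { dynamo_guard: HashSet::new(), s3: HashMap::new() }
    }

    pub fn put_obj(&mut self, path: &str, body: &[u8]) -> Result<(), String> {
        // Step 1: conditional guard write, i.e. PutItem with
        // `attribute_not_exists(path)`. `insert` returns false if the
        // path was already reserved.
        if !self.dynamo_guard.insert(path.to_string()) {
            return Err(format!("StorageError::AlreadyExists: {}", path));
        }
        // !! Crash window: if the process dies here, version n is reserved
        // !! in DynamoDB but the log entry below never lands in S3, leaving
        // !! a hole that blocks every other writer's commit loop.

        // Step 2: write the actual log entry to S3.
        self.s3.insert(path.to_string(), body.to_vec());
        Ok(())
    }
}
```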
We could implement a distributed lock w/ timeout around the DynamoDb write and S3 log entry file write, but now we have a distributed lock inside of an optimistic concurrency loop which smells pretty funny to me. Perhaps this is tolerable since the goal is to bring the S3 semantics up to speed with other storage backends.
@houqp + @rtyler - your turn now.