
[Storage System] Support for AWS S3 (multiple clusters/drivers/JVMs) #41

Closed
3 of 9 tasks
tdas opened this issue May 11, 2019 · 35 comments
Labels
enhancement New feature or request

Comments

@tdas
Contributor

tdas commented May 11, 2019

This is the official issue for discussing support for Delta Lake on S3 while writing from multiple clusters. The challenges of S3 support are explained in #39 . While #39 tracks the work for a simpler solution that works only when all write operations go through the same cluster/driver/JVM, this issue tracks the larger problem of making it work with multiple clusters.

Please use this thread to discuss and vote on ideas.

Update 2022-01-13
We have begun working with an open-source contributor on the design + implementation of this feature using DynamoDB to provide the mutual-exclusion that S3 is lacking.

Here's the public design doc.

The current status is:

  • PR feedback document + not-yet-public (still WIP) design doc
  • implement PR feedback
  • refactor base PR to use new storage-dynamodb SBT project
  • refactor base PR's python integration tests cc @allisonport-db
  • refactor to Java
  • refactor-out project to separate module (to isolate AWS dependencies)
  • ease-of-use improvements (e.g. default tables, capacity modes, etc.)
  • potential performance improvements
  • 0th commit DynamoDB-empty-table check
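For background, the mutual-exclusion primitive DynamoDB supplies here is the conditional write. A minimal, illustrative sketch of that primitive (not the actual implementation; table and attribute names are made up):

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

def put_if_absent(table_name: str, table_path: str, file_name: str) -> bool:
    """Atomically record a commit file, failing if another writer got there first."""
    try:
        dynamodb.put_item(
            TableName=table_name,
            Item={
                "tablePath": {"S": table_path},  # partition key
                "fileName": {"S": file_name},    # sort key, e.g. 00000000000000000042.json
            },
            # The write succeeds only if no item with this key exists yet.
            ConditionExpression="attribute_not_exists(fileName)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # lost the race: another writer already claimed this version
        raise
```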
@turtlemonvh

The Terraform project has a nice configurable state locking system:
https://www.terraform.io/docs/state/locking.html

The default locking mechanism for S3 storage of their state files is based on DynamoDB:
https://www.terraform.io/docs/backends/types/s3.html

A similar mechanism here would be a good first cut, but it would be great to make this something users can customize. I am esp. thinking of my own use cases where we may run against an S3-like system (e.g. minio) in some environments where DynamoDB is not available, but I may have access to a different state locking mechanism (e.g. etcd, esp. if I'm running Spark on kubernetes).

@tdas
Contributor Author

tdas commented May 15, 2019

Yeah, it will be good to design a generic database-backed (or kvstore-backed) log store that can be plugged in with any database or key-value store implementation.
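One hypothetical shape for such a pluggable interface (illustrative only; the names do not match the actual Delta LogStore API):

```python
from abc import ABC, abstractmethod
from typing import Iterator, Tuple


class ExternalCommitStore(ABC):
    """Hypothetical pluggable backend supplying the two guarantees S3 lacks."""

    @abstractmethod
    def put_if_absent(self, table_path: str, version: int, entry: bytes) -> bool:
        """Atomically record a commit; return False if this version was already claimed."""

    @abstractmethod
    def list_from(self, table_path: str, start_version: int) -> Iterator[Tuple[int, bytes]]:
        """Return all known commits >= start_version, in order and without gaps."""


# Concrete subclasses could back this with DynamoDB, ZooKeeper, etcd, FoundationDB, etc.
```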

@uctopcu

uctopcu commented May 23, 2019

An IPC mechanism between drivers, such as a message queue, can also be utilized for coordination of the transactional changes. This could remove external storage dependencies.

@gourav-sg

I think it might be great to also consider costs and performance based on the architecture we choose and its performance implications (https://aws.amazon.com/dynamodb/pricing/on-demand/). I remember TD using RocksDB, if I remember correctly, for storing the state of Structured Streaming. We could also use ElastiCache (https://aws.amazon.com/elasticache/pricing/). Let me know in case I can be of any help :)

@zsxwing zsxwing changed the title [Storage System] Support for AWS S3 (multiple clusters) [Storage System] Support for AWS S3 (multiple clusters/drivers/JVMs) Jun 14, 2019
@ryanworl

ryanworl commented Jul 14, 2019

FoundationDB would be a great choice to solve the two problems listed in #39. It is the only open-source, highly-available, and scalable database that provides strictly serializable, arbitrary read-write ACID transactions. It is backed by a team of extremely talented developers at Apple and a few other places, some of whom have been there since nearly the beginning, when it was a commercial product and before it was acquired. FoundationDB has Java bindings, and the Java bindings are the most mature of all the bindings because they are used by Apple and others in production at petabyte scale.

Snowflake uses FoundationDB for the very similar use case of storing the metadata for their cloud data warehouse which, like Delta, stores data in an object storage system such as S3. FoundationDB could also be used for storing other metadata besides the transaction log, such as zone maps for pruning of partition files or directories while executing queries.

FoundationDB provides strict serializability, which is the strongest isolation level possible. It is broadly defined as serializability from database theory combined with linearizability from concurrency theory. This is the same isolation level offered by Spanner from Google and FaunaDB, both of which are closed-source products. Spanner is also not even available outside of Google Cloud Platform. In practice, strict serializability means transactions are atomically executed at a single point in time, and those transactions are visible in a real-time order. FoundationDB is also causally consistent, which is important for use cases such as what I'm proposing below where invariants need to be maintained across transactions.

I think using a database with such strong guarantees is important because metadata corruption from using a weaker isolation level or asynchronous durability in the metadata store would be an absolute disaster. Metadata corruption is essentially impossible to recover from because, unlike a back-up of a database which may be able to provide a consistent snapshot of itself, a system which combines S3 PLUS a metadata store cannot be backed up in a consistent fashion. Even if you restore your metadata store from a back-up that is consistent unto itself (and not all provide that option), you would have to manually determine the state of the tables and rectify that yourself between the metadata store and the objects in S3.

Systems which will almost certainly be suggested, but which should be rejected because they suffer from weak isolation or asynchronous durability include:

  1. MySQL. The default isolation level is not serializable. Even if you turn it up manually or use SELECT ... FOR UPDATE, MySQL replication is asynchronous. This could lead to lost commits if a failover happens at an inopportune moment in the Delta commit protocol. Snowflake originally tried to use MySQL and switched to FoundationDB.
  2. Postgres. See MySQL, the answer is the same.
  3. Redis. Redis by default does offer the necessary isolation, but it suffers from the same async replication problems as MySQL and Postgres and throws async durability into the mix on top, making it even harder to predict the behavior during failures.
  4. Literally any message queue or other caching product. Just don't even try. See all answers above and more.

Systems which do fulfill the requirements, but which suffer other drawbacks that can be worked around in various ways:

  1. ZooKeeper/etcd/consul. These would all satisfy mutual exclusion without issue because they provide linearizability, replicate synchronously, and sync to disk before acknowledging. Where they may fall over in general is storing large volumes of metadata in high-volume tables. These systems cannot outgrow the capacity of a single machine, and their performance suffers heavily if they ever need to go to disk. This limits the data size in practice to the capacity of the main memory of a single machine. If Databricks were to offer this as a service to customers using S3, making the metadata service available in a multi-tenant fashion would be costly because of these limitations. If the only problem were lack of mutual exclusion in S3, these systems would all be a good fit. But because of the lack of consistent directory listing, you need to store the metadata too. Some scheme involving storing only pointers to metadata files which are themselves stored in S3 in random keys (not in the transaction log ordering) could be devised to lessen this limitation.
  2. DynamoDB. DynamoDB does satisfy the property of mutual exclusion, and it also recently added support for transactions. This means it can commit atomically to increment the table version number and add an entry into the commit log. What DynamoDB does not provide, however, is an ordered key-value store. To ensure you can read data back out in order, you have a few options, none of which are good. The first option is to use the same hash key for every commit log entry and put the version in the sort key, which limits your total data size to the capacity of a single DynamoDB shard, which is 10GB. A second option is a scheme using pointers and random hash keys, such that data can be spread throughout the entire cluster, but reading cannot be done concurrently and you have to read each item individually to read the whole commit log. A third option and tempting one is to rely on global secondary indexes, but those do not offer consistent range reads, which leaves you back where you started. If your metadata will never exceed 10GB and the capacity of a single shard (1000QPS), DynamoDB would potentially work. I don't think that is realistic as a hard limit to impose on every user of a big data system though.

The two problems could be solved as such:

  1. Atomic "Put if not Present" When the user begins a transaction in Delta, Delta begins a FoundationDB transaction and reads the latest version of the table. The user then performs whatever read and write operations to S3 to stage their changes, recording the read and write sets as specified in the Delta protocol. Once the user is ready to commit, Delta starts a new FoundationDB transaction and reads the latest version of the table. If the version has not changed, it is always safe to proceed since there were no concurrent writes. If the version has changed, the client performs conflict detection with the read and write sets already recorded by the Delta protocol. If validation passes, Delta commits the new transaction log entry at whatever the next version is according to what was just read from the version key in FoundationDB. FoundationDB will ensure mutual exclusion on the second transaction because all concurrent writers will perform a read on the version key, which will ensure a conflict is detected if one occurs. The client can retry as many times as necessary at the FoundationDB level without exposing the conflicts to the user, unless the conflict is a true conflict and not a conflict from mutual exclusion on the version counter. Another wrinkle, which I hope is already covered by the Delta protocol, is that a client must have some way of knowing which commits were made by it so that in the case of retries it can achieve idempotent commits. This is the case regardless of which locking mechanism is used (i.e. not related to FoundationDB), but should be mentioned.

  2. Consistent directory listing FoundationDB provides a range read API and is an ordered key-value store. This allows you to e.g. mimic the S3 pattern of hierarchical keys for partitioning files. For the transaction log, all of the data would be stored in FoundationDB, and all of FoundationDB's guarantees would apply.
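As a concrete illustration of item 1 above, here is a rough sketch of the version check-and-advance step using the FoundationDB Python bindings; the key layout is made up and Delta's read/write-set conflict detection is elided:

```python
import fdb

fdb.api_version(630)
db = fdb.open()

@fdb.transactional
def try_commit(tr, table_key: bytes, expected_version: int, log_entry: bytes) -> bool:
    """Advance the table version and record the commit atomically.

    FoundationDB's optimistic concurrency means that if another writer touches
    the version key between our read and our commit, this transaction is
    retried, so mutual exclusion on the version counter falls out for free.
    """
    current = tr[table_key + b"/version"]
    current_version = int(current) if current.present() else 0
    if current_version != expected_version:
        return False  # someone else committed first; run Delta conflict detection
    new_version = current_version + 1
    tr[table_key + b"/version"] = str(new_version).encode()
    tr[table_key + (b"/log/%020d" % new_version)] = log_entry
    return True

# committed = try_commit(db, b"/delta/my_table", expected_version=41, log_entry=b"...")
```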

I would be happy to talk to anyone on the Delta team if they were interested in pursuing this.

@marmbrus
Contributor

Thanks for taking the time to write such a complete and thoughtful post @ryanworl!

In general, I agree with your analysis, though there is one point I would like to clarify: the current implementation of Delta uses Spark to process the transaction log, which is stored alongside the actual data. As such, I think a system like Dynamo or ZooKeeper could be used to determine the "winner" of any given table version (problem #1) and also as the source for what the newest version of the table is (problem #2) without needing to store large quantities of data themselves.

Now, I think you could also move the entire transaction log itself into something like FoundationDB, and build a version of the Delta log protocol that operates as SQL queries against this system. This could possibly improve the latency of querying the latest version of the table. That said, that is a much larger effort than what we were anticipating when we created this ticket :)

@ryanworl

ryanworl commented Jul 15, 2019

@marmbrus Thank you for the clarification. I wasn't sure if I was correct about the relationship between the consistent directory listing limitation and the table version.

How is problem 2 solved if there is no consistent directory listing in S3? By issuing a single read for each version number?

EDIT:

I see from this comment in the S3 driver a proposed solution:

 * Regarding directory listing, this implementation:
 * - returns a list by merging the files listed from S3 and recently-written
 *   files from the cache.

Is this adequate to ensure all metadata files are visible to all clients before they attempt to commit? I am not sure. The scenario I envision is if the metadata file one version previous to the current version is not yet visible to a client, and it attempts to read the current version of the table. If it misses the second-to-last metadata file, it would see a corrupt version of the table, but think it has successfully read the table and potentially committed a new write because the current version did happen to be visible. It could successfully CAS the current version to the new version it committed without reading the second-to-last metadata file.

I'm sure you've all been thinking about this more than I have since learning about Delta, so I am probably not correct here.

@marmbrus
Contributor

The number of the latest version is all we need to correctly reconstruct the current state of a table. We will list to find the latest checkpoint, but it is not incorrect to start with an earlier one, in the case where a later checkpoint is omitted by eventually consistent listing (you'll just do a bit more work to reconstruct the snapshot).

Other side notes:
S3 does guarantee read-after-write consistency for new keys. Delta ensures that both data files (named with GUIDs) and log files (named with monotonically increasing integers) are always "new".

@marmbrus
Contributor

If it misses the second-to-last metadata file, it would see a corrupt version of the table

We'll never "miss" a metadata file, as they are named with contiguous, monotonically increasing integers. So as long as we know the latest, we can figure out the others.

However, if that cache is missing the latest value (because a write came from another cluster) you might see a stale (but not corrupted) version of the table.

@ryanworl

ryanworl commented Jul 15, 2019

Let me lay out the scenario I'm talking about more explicitly.

Three competing clients A, B, and C, are all modifying the table concurrently. They proceed along a timeline represented by T0, T1, etc. Table version is represented as V0, etc.

At T0, A increments to V1 from V0, issues a Put of metadata file for V1.
At T1, B increments to V2 from V1, issues a Put of metadata file for V2.
At T2, A increments to V3 from V2, issues a Put of metadata file for V3.

If client C has nothing in cache and is starting work after T2, it could go to the version mutex server and see the version is V3 now because the version server is linearizable.

If client C then proceeds to do a directory listing at T2, is it guaranteed to see ALL the files for V1, V2 and V3 in the directory listing? I don't think that is the case.

If the client does individual reads for each version key between what it has in cache and the latest version, it would see everything because S3 has read-after-write consistency for newly written keys. But if it solely relies on the content of the directory listing and doesn't detect the gap from e.g. receiving a directory listing with V0,V1, and V3 metadata files, it would miss a file and therefore read a corrupt snapshot of the table.

I don't think S3 guarantees that objects are visible in any specific order relative to when they were written. Specifically the documentation says:

A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.

An additional wrinkle is that if a client successfully increments the version number and crashes before committing the metadata file to S3, it would leave a hole. This hole would be difficult to differentiate from an eventual consistency artifact in the directory listing, without attempting to read the key directly.

@ryanworl

ryanworl commented Jul 15, 2019

Another issue worth considering is the scenario where a client A increments the version, but before it has written the metadata file (GC pause, network partition, etc), client B reads the version from the version server and issues a GET for the key where the newly written version metadata would be before it is written.

This sequence would invalidate read-after-write consistency from S3.

From the documentation:

Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all regions with one caveat. The caveat is that if you make a HEAD or GET request to the key name (to find if the object exists) before creating the object, Amazon S3 provides eventual consistency for read-after-write.

@marmbrus
Contributor

Thanks for the clarification! I agree that there are a bunch of edge cases to consider here, but I don't think they are insurmountable. Specific thoughts inline below.

But if it solely relies on the content of the directory listing and doesn't detect the gap from e.g. receiving a directory listing with V0,V1, and V3 metadata files, it would miss a file and therefore read a corrupt snapshot of the table.

You are right, you must always load a contiguous set of versions, based on some available snapshot (latest is most efficient, but not required for correctness) and the latest version available (from the transactional store).

The current implementation will refuse to load a table if there are any missing version files.
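A minimal illustration of that kind of guard (a hypothetical helper, not the actual Delta code):

```python
from typing import List


def assert_contiguous(versions: List[int], start_version: int) -> None:
    """Refuse to build a snapshot from a commit listing that has gaps.

    `versions` are the numeric suffixes of the _delta_log JSON files the client
    managed to list; a gap means the listing is stale (or a writer crashed
    mid-commit), so the snapshot must not be reconstructed from it.
    """
    expected = list(range(start_version, start_version + len(versions)))
    if sorted(versions) != expected:
        raise RuntimeError(
            f"Non-contiguous commit versions {sorted(versions)}, expected {expected}; "
            "refusing to load a possibly corrupt snapshot."
        )
```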

An additional wrinkle is that if a client successfully increments the version number and crashes before committing the metadata file to S3, it would leave a hole. This hole would be difficult to differentiate from an eventual consistency artifact in the directory listing, without attempting to read the key directly.

Agreed, we would not want to sacrifice the liveness of the protocol here when a client that "locks" a version dies without completing the operation.

I think we can avoid an issue here by including information about the contents of the commit atomically with the operation that "claims" that version. For example, you could write the contents of Vn to a temporary file and include a pointer to that file in the lock table of whatever transactional storage system we are using.

In this case, any reader who comes along and sees that the transaction log is behind the lock table (either due to a crash or due to eventual consistency) could finish that operation. If multiple readers do this concurrently, there are no issues with eventual consistency as they are all writing the same data.

As an optimization, if the version is small (which is the common case), you could avoid the indirection and just include the data in the lock request. (This is how the commit service in Databricks for Managed Delta Lake works.)

Another issue worth considering is the scenario where a client A increments the version, but before it has written the metadata file (GC pause, network partition, etc), client B reads the version from the version server and issues a GET for the key where the newly written version metadata would be before it is written.
This sequence would invalidate read-after-write consistency from S3.

You can avoid this by confirming the existence of a file using list, before ever doing a GET or HEAD. However, as you point out, if we are trusting a different transactional store, we might prime S3's "negative-cache" and thus invalidate the read-after-write guarantee. The result would be incorrect "FileNotFound" errors.

I think a solution to the negative cache is to be robust to these errors and retry automatically, or as suggested above, "complete" the missing operation yourself.
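Putting those pieces together, here is a rough, illustrative sketch of the claim-then-complete flow (DynamoDB stands in for "whatever transactional storage system we are using"; all names are made up and error handling is minimal):

```python
import uuid
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
ddb = boto3.client("dynamodb")


def commit(bucket: str, log_prefix: str, lock_table: str, table_path: str,
           version: int, body: bytes) -> bool:
    # 1. Stage the commit contents under a unique temporary key (always a "new" S3 key).
    tmp_key = f"{log_prefix}/.tmp/{version:020d}.{uuid.uuid4()}.json"
    s3.put_object(Bucket=bucket, Key=tmp_key, Body=body)

    # 2. Claim the version in the lock table, pointing at the staged file.
    #    The conditional write makes this a put-if-absent: exactly one writer wins.
    try:
        ddb.put_item(
            TableName=lock_table,
            Item={"tablePath": {"S": table_path},
                  "fileName": {"S": f"{version:020d}.json"},
                  "tempPath": {"S": tmp_key},
                  "complete": {"BOOL": False}},
            ConditionExpression="attribute_not_exists(fileName)",
        )
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # lost the race; rerun conflict detection against the winner
        raise

    # 3. Copy the staged file to its final _delta_log location, then mark the entry
    #    complete. A reader that finds an incomplete entry can safely redo this step,
    #    because every repairer writes identical bytes to the same destination key.
    s3.copy_object(Bucket=bucket,
                   Key=f"{log_prefix}/{version:020d}.json",
                   CopySource={"Bucket": bucket, "Key": tmp_key})
    ddb.update_item(
        TableName=lock_table,
        Key={"tablePath": {"S": table_path},
             "fileName": {"S": f"{version:020d}.json"}},
        UpdateExpression="SET complete = :t",
        ExpressionAttributeValues={":t": {"BOOL": True}},
    )
    return True
```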

@ryanworl

I think we can avoid an issue here by including information about the contents of the commit atomically with the operation that "claims" that version. For example, you could write the contents of Vn to a temporary file and include a pointer to that file in the lock table of whatever transactional storage system we are using.

In this case, any reader who comes along and sees that the transaction log is behind the lock table (either due to a crash or due to eventual consistency) could finish that operation. If multiple readers do this concurrently, there are no issues with eventual consistency as they are all writing the same data.

As an optimization, if the version is small (which is the common case), you could avoid the indirection and just include the data in the lock request. (This is how the commit service in Databricks for Managed Delta Lake works.)

This was my general thinking as well. By committing to a temporary file in S3 first, then writing the temporary file's location simultaneously with the version CAS operation, you avoid ever doing speculative GETs and retain RaW consistency for all operations. The small file optimization is a nice touch too which would reduce commit latency by one extra S3 write in the common case.

Thanks for walking me through this. I am confident this will work with the only requirement being a key value store that can do CAS on two keys at the same time as well as two simultaneous linearizable reads on those same two keys, one being the version counter and the other being the last committed metadata file location/inline value. Storing the entire history of metadata file locations (or a snapshot plus recent history) as I suggested in my original post in the KV store would just be a performance optimization to reduce polling of the S3 directory listing API waiting for the writes to either achieve consistency, or be retried by a non-failing client.

If Databricks is interested in exploring FoundationDB for metadata storage (especially for Managed Delta Lake), I have some availability soon. You can reach me at ryantworl at gmail dot com.

My episode on the Data Engineering Podcast explaining FoundationDB was about a month before yours explaining Delta Lake @marmbrus 😄

@rogue-one

rogue-one commented Aug 26, 2019

EMR clusters provide the EMRFS filesystem as an option, which provides a consistent view of S3 backed by DynamoDB tables. EMRFS provides both

  1. an atomic put-if-not-present feature to avoid overwriting transaction logs
  2. and consistent listing of directories.

Most production-grade EMR clusters are expected to use EMRFS, so enabling concurrent writes to a Delta table from different clusters should require very minimal code change too...

@marmbrus
Contributor

@rogue-one Thanks for the pointer. I don't see anything that guarantees mutual exclusion there, but perhaps I'm missing it. Do you have a pointer to the specific API that provides this for EMRFS?

Note that this guarantee is stronger than "consistent listing", as you need to avoid the race condition between two writers who both check and find the file to be missing. The storage system needs to atomically check-and-create the file in question.

@iley

iley commented Oct 16, 2019

I wonder if the approach used in S3SingleDriverLogStore can be easily generalised to address this issue. One could use an external system (e.g. DynamoDB) for distributed locking and caching. We could introduce a trait for locking and caching to allow for multiple implementations.

@iley

iley commented Nov 8, 2019

@marmbrus do you see any potential issues with the approach I described? I'm considering implementing it

@tdas
Contributor Author

tdas commented Nov 8, 2019

I agree with your idea that we should introduce an abstraction that makes this pluggable. However, distributed locking is probably not the best way forward. See the argument @marmbrus made above regarding the liveness risks of using distributed locks. It's better to invest time and effort in a design that does not suffer from obvious pitfalls.

@gourav-sg

gourav-sg commented Nov 10, 2019 via email

@fabboe

fabboe commented Nov 14, 2019

@tdas could you elaborate on the liveness risks with distributed locks again? And, given the current design, can they be addressed at all in a LogStore?

@mmazek

mmazek commented Apr 14, 2020

For everyone interested we have a solution for this - using AWS DynamoDB table: #339

LantaoJin added a commit to LantaoJin/delta that referenced this issue May 27, 2020
@zpencerq

zpencerq commented Dec 3, 2020

With the release of strong read-after-write consistency in S3, is this still an issue?

@ryanworl

ryanworl commented Dec 3, 2020

Yes. The basis of this issue is S3 doesn't have a compare-and-swap operation. Consistent listing makes some operations easier, though.

The accepted solution using DynamoDB works because DynamoDB has a compare-and-swap operation.

@SreeramGarlapati

Hi folks, this PR looks like a viable approach for this problem: #339.
Are there any blockers or additional work/contributions needed to get this merged? Happy to contribute if that helps.

@rtyler
Member

rtyler commented Mar 16, 2021

@SreeramGarlapati the DynamoDBLogStore approach defined in #339 requires that all parties (readers/writers) know about the log store being in DynamoDB. That would mean basically everything has to be configured to use the "non-standard" log store.

The approach we're exploring in delta-rs for this problem is to instead use a DynamoDB-based locking mechanism on the writer-side only, which would allow the readers to continue to look at the JSON file in S3. Meanwhile, writers would need to coordinate around this DynamoDB-based lock.
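Conceptually, that writer-side lock can be a single DynamoDB item acquired with a conditional put and released (or left to expire) once the commit file is in place. A rough sketch follows; it is not the actual delta-rs lock client, and the table/attribute names are made up:

```python
import time
import uuid
from typing import Optional

import boto3
from botocore.exceptions import ClientError

ddb = boto3.client("dynamodb")


def acquire_lock(table: str, lock_key: str, lease_seconds: int = 20) -> Optional[str]:
    """Try to take the writer lock; returns an owner token on success, None otherwise."""
    owner = str(uuid.uuid4())
    now = int(time.time())
    try:
        ddb.put_item(
            TableName=table,
            Item={"key": {"S": lock_key},
                  "owner": {"S": owner},
                  "expiresAt": {"N": str(now + lease_seconds)}},
            # Succeed if nobody holds the lock, or the previous holder's lease expired.
            ConditionExpression="attribute_not_exists(#k) OR expiresAt < :now",
            ExpressionAttributeNames={"#k": "key"},
            ExpressionAttributeValues={":now": {"N": str(now)}},
        )
        return owner
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return None
        raise


def release_lock(table: str, lock_key: str, owner: str) -> None:
    """Release only if we still own the lock (protects against expired leases)."""
    ddb.delete_item(
        TableName=table,
        Key={"key": {"S": lock_key}},
        ConditionExpression="#o = :owner",
        ExpressionAttributeNames={"#o": "owner"},
        ExpressionAttributeValues={":owner": {"S": owner}},
    )
```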

@mmgaggle

mmgaggle commented Mar 31, 2021

Amazon S3 consistency semantics have improved over the years, as @zpencerq pointed out. Also, there exist S3 dialects that have offered stronger consistency semantics for a long time - for example the S3 API provided by Ceph.

Note that even when directory listings do offer strong consistency, they are still paginated in all S3 SDKs. The S3 endpoint sees a series of HTTP GET requests corresponding to the lexicographic ranges of interest. I would strongly advise against approaches which use the S3 key name to embed associations between databases, tables, and partitions. Apache Iceberg avoids this by individually tracking data files in a table instead of directories. The data-file-to-table tracking information is maintained in an Avro-serialized metadata file/object.

Now a query planner can understand which data files correspond to a particular table (and even at a particular point in time) by issuing a single request, instead of O(pages) requests.

If data files are always written to new keys, and tracked in a per-table metadata file/object, then the remaining challenge is to provide a mechanism for per-table / catalog locking. With the above, the reliance on (and volume of data stored in) ZK/etcd/FoundationDB is considerably less, which is good because capacity in those systems often commands a higher premium than an object store.

@findepi

findepi commented Apr 1, 2021

Has an implementation of LogStore for S3 that would require no external system been considered?

Given S3's strong consistency guarantees, the read-related guarantees are trivially satisfied now. The outstanding issue is the exclusive put (write if not exists) that's necessary for new objects, but it seems this can be implemented with optimistic locking (a separate lock object) and backoff. Of course, it wouldn't work great for highly concurrent writes, but those can be additionally coordinated within a cluster, so that the storage-level concurrency is low (like a "synchronous commit" in traditional databases).

The potential advantage of not having to manage a separate storage system for write coordination would probably be interesting.

@mmgaggle

mmgaggle commented Apr 1, 2021

Are the new committers available in hadoop-aws helpful in achieving the desired semantics?

@emanuelh-cloud

Can the AWS S3 Object Lock feature (https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html) somehow help us with the "write if not exists" AWS S3 consistency issue?

@zsxwing
Member

zsxwing commented Apr 25, 2021

Are the new committers available in hadoop-aws helpful in achieving the desired semantics?

No. The new S3 committer solves a different problem. It doesn't handle concurrent writes.

Can the AWS S3 Object Lock feature (https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html) somehow help us with the "write if not exists" AWS S3 consistency issue?

No, it doesn't. S3 Object Lock is to allow users to pin an object to a specific version so that readers always see the pinned version even if the object is overwritten. It's not a lock providing mutual exclusion.

@boonware

Hi all. I have not been following this thread but I've come across it while trying to find a solution to the problem "Concurrent writes to the same Delta table from multiple Spark drivers can lead to data loss." Does the solution in #339 solve this?

Additionally, are there any constraints on the solution using S3 specifically? Alternative yet similar storage platforms exist such as IBM Cloud's Cloud Object Storage, which actually provides partial support for the S3 API.

@dennyglee
Contributor

As @zsxwing noted, the key issue of concern is the lack of putIfAbsent in S3 at this time. Yes, #339 should help with this problem, and as noted in #339 we're actively working on updates to the LogStore API to make it easier to merge in. I'll let others chime in on IBM Cloud Object Storage, but this issue does not arise with ADLS Gen2 or Google Cloud Storage.

@emanuelh-cloud

The question: the "spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore" setting is necessary, for example, if one Spark driver hosts many jobs that write concurrently to the same AWS S3 Delta table. But when only one job is running and writing to the AWS S3 Delta table (meaning NO concurrent writes to the same table), the "spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore" configuration is NOT necessary because of https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency.
Am I correct?

@gourav-sg

gourav-sg commented May 29, 2021 via email

LantaoJin added a commit to LantaoJin/delta that referenced this issue Jun 15, 2021
jbguerraz pushed a commit to jbguerraz/delta that referenced this issue Jul 6, 2022
Resolves delta-io#41

This PR addresses issue delta-io#41 - Support for AWS S3 (multiple clusters/drivers/JVMs).

It implements a few ideas from the delta-io#41 discussion:

- provides a generic base class BaseExternalLogStore for storing the listing of commit files
in an external DB. This class may easily be extended for a specific DB backend
- stores the contents of a commit in a temporary file and links to it in the DB's row,
so that incomplete write operations can be finished while reading
- provides a concrete DynamoDBLogStore implementation extending BaseExternalLogStore
- implementations for other DB backends should be simple to add
(a ZooKeeper implementation is almost ready; I can create a separate PR if anyone is interested)

- unit tests in `ExternalLogStoreSuite`, which use `InMemoryLogStore` to mock `DynamoDBLogStore`
- a Python integration test in `storage-dynamodb/integration_test/dynamodb_logstore.py` which tests concurrent readers and writers
- that integration test can also run using `FailingDynamoDBLogStore`, which injects errors into the runtime execution to test error edge cases
- this solution has also been stress-tested (by SambaTV) on an Amazon EMR cluster
(multiple test jobs writing thousands of parallel transactions to a single Delta table)
and no data loss has been observed so far

To enable DynamoDBLogStore, set the following Spark property:
`spark.delta.logStore.class=io.delta.storage.DynamoDBLogStore`

The following configuration properties are recognized:

io.delta.storage.DynamoDBLogStore.tableName - table name (defaults to 'delta_log')
io.delta.storage.DynamoDBLogStore.region - AWS region (defaults to 'us-east-1')

Closes delta-io#1044

Co-authored-by: Scott Sandre <scott.sandre@databricks.com>
Co-authored-by: Allison Portis <allison.portis@databricks.com>

Signed-off-by: Scott Sandre <scott.sandre@databricks.com>
GitOrigin-RevId: 7c276f95be92a0ebf1eaa9038d118112d25ebc21
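For reference, this is roughly how those properties might be supplied when building a PySpark session. The exact way the `io.delta.storage.DynamoDBLogStore.*` keys are picked up (Spark conf vs. Hadoop conf, and their final names) may differ between Delta versions, so treat this purely as a sketch with placeholder values:

```python
from pyspark.sql import SparkSession

# Assumes the Delta Lake and DynamoDB log store jars are already on the classpath.
spark = (
    SparkSession.builder
    .appName("delta-s3-multi-cluster-writes")
    .config("spark.delta.logStore.class", "io.delta.storage.DynamoDBLogStore")
    .config("io.delta.storage.DynamoDBLogStore.tableName", "delta_log")  # placeholder
    .config("io.delta.storage.DynamoDBLogStore.region", "us-east-1")     # placeholder
    .getOrCreate()
)

# Placeholder path: concurrent writers in different clusters would all point here.
spark.range(10).write.format("delta").mode("append").save("s3a://my-bucket/tables/events")
```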
@soumilshah1995

Any updates here #1498
