S3 Support #907
Comments
My personal feeling is that we probably don't want to have every storage system inside the DataFusion crate by default, but either: …
A separate crate is a good idea. What would you think of keeping the code as part of the workspace?
Another question I wanted to throw out, on implementation details (probably putting the cart before the horse, but I wanted to ask anyway): should we use Rusoto (in maintenance mode) or the new AWS Rust SDK (in alpha, not on crates.io until GA)? My current thinking is to use Rusoto for now, as it will probably be a while until the AWS SDK reaches GA, but I want to hear everyone's thoughts on this.
Yes, I think separate crates are a very good idea, both to avoid DataFusion requiring a massive dependency stack and to keep compile times reasonable.
I think this is reasonable for "core" integrations, though going this path also subjects those crates to the Apache Arrow governance model (among other things, a slower release cycle), which may not be needed here. Maybe we could start it as its own repo / separate crate with whoever does it, and then figure out later whether we want to bring it back into Apache Arrow at all.
FWIW we took this approach in IOx (starting with Rusoto) and it has been working well for us: https://github.com/influxdata/influxdb_iox/blob/main/object_store/Cargo.toml#L21-L23 (we have been using it in AWS for several months now and have not hit any issues with it).
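For anyone weighing the Rusoto option mentioned above, fetching a single object looks roughly like the following. This is a minimal sketch assuming the `rusoto_core`, `rusoto_s3`, `futures`, and `tokio` crates; the bucket and key names are placeholders.

```rust
use futures::TryStreamExt;
use rusoto_core::Region;
use rusoto_s3::{GetObjectRequest, S3Client, S3};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Region and credentials come from the usual AWS environment/config chain.
    let client = S3Client::new(Region::UsEast1);

    // Placeholder bucket and key, purely illustrative.
    let request = GetObjectRequest {
        bucket: "my-bucket".to_owned(),
        key: "data/example.parquet".to_owned(),
        ..Default::default()
    };

    let output = client.get_object(request).await?;

    // The body is a stream of byte chunks; collect it into memory for the example.
    let body = output.body.expect("object should have a body");
    let bytes: Vec<u8> = body.map_ok(|chunk| chunk.to_vec()).try_concat().await?;
    println!("read {} bytes from S3", bytes.len());

    Ok(())
}
```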
I also think a separate crate is a good idea. The design that @yjshen came up with made it very easy to implement different remote storage backends as self-contained plugins. Maybe it's finally time to create that general Rust database GitHub org that @andygrove mentioned at apache/datafusion-sqlparser-rs#308 (comment)? We could use it as an incubator for various experiments that need collaboration, instead of using personal namespaces.
I think the upside of this is that we won't need to deal with IP clearance if we ever need to donate it back to DataFusion. The downside is what Andrew said earlier: a slower release cycle due to Apache governance overhead. That is not too bad for a stable project, but for newer, fast-evolving projects it might slow things down unnecessarily.
Wow, that's great!
In case anyone is interested, I will also work on HDFS support right after this.
If we are developing crates that are extensions to DataFusion, then I would be nervous about creating a separate org and community with its own governance for these crates. I think it would be fine for generic database-related crates that don't have "arrow" or "datafusion" in the name, such as the … crate. We already have the ability to create …
Wanted to reopen the discussion here: do we have a consensus on which approach we want to go with? Now that the first set of remote storage PRs has been merged, this can now be implemented.
I think there is consensus on the architecture within DataFusion. Now that #950 has been merged, I think we would need to rework the various table providers (…).
@alamb That makes sense. My thought, though, is that S3 support can be worked on in parallel (although it cannot be utilized until the table providers are updated), since the scaffolding is there with the ObjectReader and ObjectStore traits. In order to do that, however, either a repo would need to be created or one of the other approaches given previously adopted.
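To make the "self-contained plugin" idea concrete, here is a rough sketch of the pattern being discussed. Note that the trait below is a simplified stand-in, not DataFusion's actual ObjectStore/ObjectReader signatures, and the S3-backed type and `register` hook are hypothetical.

```rust
use std::sync::Arc;

use async_trait::async_trait;

// Simplified stand-in for DataFusion's ObjectStore/ObjectReader traits;
// the real trait methods (listing, readers, metadata) differ in detail.
#[async_trait]
pub trait SimpleObjectStore: Send + Sync {
    /// List the object paths under a prefix such as "s3://bucket/prefix/".
    async fn list(&self, prefix: &str) -> Result<Vec<String>, String>;
    /// Fetch the full contents of one object.
    async fn get(&self, path: &str) -> Result<Vec<u8>, String>;
}

/// Hypothetical S3-backed implementation that would live in its own crate,
/// so the core engine only depends on the trait, not on the AWS stack.
pub struct S3ObjectStore {
    pub client: rusoto_s3::S3Client,
}

#[async_trait]
impl SimpleObjectStore for S3ObjectStore {
    async fn list(&self, _prefix: &str) -> Result<Vec<String>, String> {
        // Would call list_objects_v2 and map the results to paths.
        unimplemented!()
    }

    async fn get(&self, _path: &str) -> Result<Vec<u8>, String> {
        // Would call get_object and collect the body.
        unimplemented!()
    }
}

/// The engine side only ever sees `Arc<dyn SimpleObjectStore>`, which is what
/// keeps the S3 (or HDFS, or Azure) dependencies out of the core crate.
pub fn register(_scheme: &str, _store: Arc<dyn SimpleObjectStore>) {
    // In DataFusion this corresponds to registering the store with the
    // execution context so "s3://..." paths resolve to this plugin.
}
```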
Taking @andygrove's comment into account, unless anyone volunteers to help create the …
Starting to work on an experimental crate within …. I haven't tracked / studied the requirements for releasing alphas/betas in fine detail, but I think as long as they are clearly marked as not being official Apache releases, it would be fine to put them on crates.io.
@alamb @houqp I'm really interested in getting S3 support added. If no one else is working on this, I should be able to pick it up soon (of course I'll likely have questions along the way). Based on the above, it seems the approach would be to add it as a crate in arrow-datafusion. Assuming this is still the case, is there anything else I should know before starting?
@matthewmturner we created https://github.com/datafusion-contrib to host these community maintained extensions so it's easier for us to evolve the core datafusion code base.
I also recommend taking a look at the new official AWS Rust SDK; it seems a lot more mature than when we started this discussion.
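For reference, listing a bucket with the official SDK looks roughly like this. It is a sketch assuming the `aws-config` and `aws-sdk-s3` crates as they were at the time (still pre-1.0, so details such as `load_from_env` and the shape of `contents()` may have changed since), with a placeholder bucket name.

```rust
use aws_sdk_s3::Client;

#[tokio::main]
async fn main() -> Result<(), aws_sdk_s3::Error> {
    // Region and credentials are resolved from the environment / AWS profile.
    let config = aws_config::load_from_env().await;
    let client = Client::new(&config);

    // Placeholder bucket name, purely illustrative.
    let resp = client
        .list_objects_v2()
        .bucket("my-bucket")
        .send()
        .await?;

    // In the early SDK, `contents()` returned an Option over the object list.
    for object in resp.contents().unwrap_or_default() {
        println!("{}", object.key().unwrap_or("<no key>"));
    }

    Ok(())
}
```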
Ok. Will check it out.
@houqp Separate but related to S3: I'm interested in adding the ability to register an AWS Glue catalog / database. Is that the type of functionality I could expect DataFusion to have? If so, do you think it makes sense to bundle it with the S3 functionality? I would of course start with S3, but I'm asking for planning purposes (i.e. looking at Rusoto vs the AWS SDK).
I think so, but I think the Glue catalog integration should be implemented as a table provider plugin in a separate crate, not as part of the S3 object store plugin.
@houqp for these cloud-based services, what do you think about a naming convention like the following: datafusion-[cloud provider]-[service], i.e. …
My suggestion would be to name the repos by extension type, for example: …
Makes sense. Do you think for Glue it might be better to use catalog instead of table?
Yeah, I think that's a better name 👍
FWIW, one example of using the AWS S3 object store interface is in IOx, in case that is helpful: https://github.com/influxdata/influxdb_iox/blob/main/object_store/src/aws.rs
thank you
@houqp @alamb I'm very interested in getting S3 / Glue support added to datafusion-cli. Do you know of a way to feature-gate other libraries without adding their code to the datafusion-cli codebase? I'm just concerned that the datafusion-cli codebase could become messy if we end up adding code and a feature gate for each piece of functionality like this (i.e. AWS S3, AWS Glue, Azure Blob, custom SQL, etc.). Or maybe that's okay? What do you think? As an aside, and for your information, someone emailed me wanting to add Azure Blob Storage support to DataFusion, but said they could work on S3 first. I've created a repo in the …
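One common way to keep such code out of the default build (a sketch of the general Cargo feature pattern, not the actual datafusion-cli layout; the feature and module names below are made up) is to mark each integration's dependencies as optional in Cargo.toml, tie them to features, and confine the glue code to small modules behind `cfg(feature = ...)`:

```rust
// main.rs of a CLI, with each cloud integration behind an optional feature.
// In Cargo.toml the corresponding dependencies would be `optional = true`
// and enabled by hypothetical features named "s3", "glue", etc.

#[cfg(feature = "s3")]
mod s3_support {
    /// Register the S3 object store with the execution context.
    /// Only compiled when the binary is built with `--features s3`.
    pub fn register() {
        println!("S3 object store registered");
    }
}

#[cfg(feature = "glue")]
mod glue_support {
    /// Register the Glue catalog. Only compiled with `--features glue`.
    pub fn register() {
        println!("Glue catalog registered");
    }
}

fn main() {
    #[cfg(feature = "s3")]
    s3_support::register();

    #[cfg(feature = "glue")]
    glue_support::register();

    // ... rest of the CLI ...
}
```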
FYI there is some great work from @yahoNanJing in this area on #1062
Hi guys, … If it is useful, I'm happy to donate it.
@seddonm1 I invited you to the datafusion-contrib org as well; could you work with @matthewmturner to figure out how you want to collaborate on this?
@houqp yup, sounds good. @seddonm1 it would be fantastic if you could donate it. Do you have a preference for how you would like to do that? If you want to contribute directly, I could work on setting up CI in the repo and testing what you've done on my side. Let me know how you would like to proceed, and thanks so much for the donation!
@houqp thanks, I have joined that org. I can do a PR tomorrow AU time.
Thanks @seddonm1. MinIO provides S3 API compatibility for Azure Blob Storage. I'll try running the tests today with Azure Blob Storage.
@seddonm1 ok, I will look into that. Also, once I'm comfortable with the usage, I can work on the documentation. From a code perspective, is there anything in particular that we should be aware of, i.e. any issues with the AWS Rust SDK?
For those waiting for this functionality, a PR has been raised: datafusion-contrib/datafusion-objectstore-s3#2. If you want to help test, you can follow the instructions in the PR and it should work. The abstraction is good @rdettai, and it will be even better once the …
Folks, thank you for the great work on this issue. I see that the repo https://github.com/datafusion-contrib/datafusion-objectstore-s3 has the relevant code added. I am a little hazy on how to get these two projects (datafusion & datafusion-objectstore-s3) compiled and working together. Are there any pointers for this?
@nitisht the README on https://github.com/datafusion-contrib/datafusion-objectstore-s3 has some examples of how to use it, and you could also look at our tests to see all the functionality that we have tested (https://github.com/datafusion-contrib/datafusion-objectstore-s3/blob/main/src/object_store/aws.rs). If there is something that is not clear, or functionality you need that we haven't yet added, could you please create an issue in that repo?
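For readers landing here later, the wiring looks roughly like the following. This is a sketch from memory of that crate's early API, so the `S3FileSystem` import path, the `register_object_store` call, and the bucket/path are assumptions; the repo's README is the authoritative example.

```rust
use std::sync::Arc;

use datafusion::prelude::*;
// Assumed import path; the crate's module layout may differ.
use datafusion_objectstore_s3::object_store::aws::S3FileSystem;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let mut ctx = ExecutionContext::new();

    // Assumed API: build the S3 object store from the ambient AWS
    // credentials and register it for the "s3" URI scheme.
    let s3 = Arc::new(S3FileSystem::default().await);
    ctx.register_object_store("s3", s3);

    // Placeholder bucket/path: once registered, s3:// URIs can be used
    // wherever a local path would be, e.g. when registering a Parquet table.
    ctx.register_parquet("trips", "s3://my-bucket/path/to/data.parquet")
        .await?;

    let df = ctx.sql("SELECT COUNT(*) FROM trips").await?;
    df.show().await?;

    Ok(())
}
```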
Thank you for this pointer @matthewmturner. This is precisely what I was looking for; everything is working great in my local testing.
Now that there is a PR in flight for remote storage systems (#811), I wanted to ask what the plan is to implement S3 support. Is this something that will live within the DataFusion project, or will it be maintained outside of it? Also, does anyone plan on working on this, or is this something you are looking for a contributor for?
Also, I'm new to the project, so what is the best way to ask these kinds of questions: GitHub issues or the mailing list? Thanks!