Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3 Support #907

Closed
ehenry2 opened this issue Aug 19, 2021 · 39 comments
Closed

S3 Support #907

ehenry2 opened this issue Aug 19, 2021 · 39 comments
Labels
enhancement New feature or request

Comments

@ehenry2
Copy link

ehenry2 commented Aug 19, 2021

Now that there is a PR in-flight for remote storage systems (#811 ), I wanted to ask what the plan was to implement s3 support. Is this something that will live within the datafusion project or maintained outside? Also, does anyone plan on working on this, or is this something that you are looking for a contributor for?

Also, I'm new to the project, so what is the best way to ask these kind of questions...github issues or on the email list? thanks!

@ehenry2 ehenry2 added the enhancement New feature or request label Aug 19, 2021
@Dandandan
Copy link
Contributor

My personal feeling is that we probably don't want to have every storage system inside the DataFusion crate by default but either:

  • Keep separate crates, like datafusion-s3 which has implementations for some interfaces, and maybe also exposes some helper functions
  • Have some optional features (e.g. s3) living inside the crate.

Wydt @alamb @houqp @andygrove @jorgecarleitao

@jimexist
Copy link
Member

My personal feeling is that we probably don't want to have every storage system inside the DataFusion crate by default but either:

  • Keep separate crates, like datafusion-s3 which has implementations for some interfaces, and maybe also exposes some helper functions

  • Have some optional features (e.g. s3) living inside the crate.

Wydt @alamb @houqp @andygrove @jorgecarleitao

A separate crate is a good idea.

How would you think of keeping the code as part of the workspace?

@ehenry2
Copy link
Author

ehenry2 commented Aug 19, 2021

Another question I wanted to throw out too, on implementation details (probably putting the cart before the horse, but wanted to ask anyway), was should we use Rusoto (in maintenance mode) or the new AWS rust sdk (in Alpha, not on crates.io until GA). My current thinking is probably to use Rusoto for now, as it will probably be a while until the AWS SDK gets into GA, but want to hear everyone thoughts on this.

@alamb
Copy link
Contributor

alamb commented Aug 19, 2021

Keep separate crates, like datafusion-s3 which has implementations for some interfaces, and maybe also exposes some helper functions

Yes I think separate crates is a very good idea to avoid datafusion requiring a massive dependency stack and to keep compile time reasonable.

How would you think of keeping the code as part of the workspace?

I think this is a reasonable for "core" integrations, though going this path also subjects those crates to the apache arrow governance model (among other things slower release cycle) which may not be needed here

Maybe we could start it as its own repo / separate crate with whoever does it and then we can figure out if we want to bring it back into apache arrow if any.

was should we use Rusoto (in maintenance mode) or the new AWS rust sdk (in Alpha, not on crates.io until GA). My current thinking is probably to use Rusoto for now, as it will probably be a while until the AWS SDK gets into GA, but want to hear everyone thoughts on this.

FWIW we took this approach in IOx (starting with Rusto) and it has been working well for us: https://github.com/influxdata/influxdb_iox/blob/main/object_store/Cargo.toml#L21-L23 (we have been using it in AWS for several months now and we have not hit any issues with it)

@houqp
Copy link
Member

houqp commented Aug 20, 2021

I also think separate crate is a good idea. The design that @yjshen came up with made it very easy to implement different remote storage backends as self-contained plugins.

Maybe it's finally the time to create that general Rust database Github Org that @andygrove mentioned at apache/datafusion-sqlparser-rs#308 (comment)? We could use it as the incubator for various experiments that need collaborations instead of using personal namespace.

How would you think of keeping the code as part of the workspace?

I think the upside of this is we won't need to deal with IP clearance if we ever need to donate it back to datafusion. The downside is what Andrew said earlier, slower release cycle due to apache governance overhead, which is not too bad for a stable project. For newer and fast evolving projects, it might slow things down unnecessarily.

@yjshen
Copy link
Member

yjshen commented Aug 20, 2021

Maybe it's finally the time to create that general Rust database Github Org that @andygrove mentioned at apache/datafusion-sqlparser-rs#308 (comment)? We could use it as the incubator for various experiments that need collaborations instead of using a personal namespace.

Wow, that's great!

@yjshen
Copy link
Member

yjshen commented Aug 20, 2021

If someone may be interested, I will also work on HDFS support right after this.

@andygrove
Copy link
Member

Maybe it's finally the time to create that general Rust database Github Org that @andygrove mentioned at sqlparser-rs/sqlparser-rs#308 (comment)? We could use it as the incubator for various experiments that need collaborations instead of using personal namespace.

If we are developing crates that are extensions to DataFusion then I would be nervous about creating a separate org & community with its own governance for these crates. I think it would be fine for generic database-related crates that don't have "arrow" or "datafusion" in the name, such as the sqlparser-rs crate.

We already have the ability to create arrow-experimental-* repos where we do not need so much governance. Perhaps we should try that route first?

@ehenry2
Copy link
Author

ehenry2 commented Sep 14, 2021

Wanted to reopen the discussion here...do we have a consensus on which approach we want to go with? Now that the first set of remote storage PRs are merged, this can now be implemented.

@alamb
Copy link
Contributor

alamb commented Sep 14, 2021

I think there is a consensus on the architecture within datafusion, now that #950 has been merged

I think we now would need to rework the various table providers (CsvExec, ParquetExec, etc) to use that interface rather than File or Read directly, and then we would be ready to create an S3 plugin for DataFusion

@ehenry2
Copy link
Author

ehenry2 commented Sep 14, 2021

@alamb That makes sense, my thought though is that s3 support can be worked on in parallel (although not able to utilized until the table providers are updated) as the scaffolding is there with the Object Reader and Object Store traits. In order to do that, however, either a repo would need to be created (or one of the other approaches given previously adopted).

@houqp
Copy link
Member

houqp commented Sep 15, 2021

Taking @andygrove 's comment into account, unless anyone volunteers to help create the arrow-experimental-datafusion-s3 repo, it might be easier to follow @jimexist 's suggestion to create and manage the s3 plugin crate inside datafusion's workspace. Downstream applications can pull in the plugin dependency by git hash if we are behind on official releases. Also I believe we can release alpha/beta versions to crates.io without requiring an official vote?

@alamb
Copy link
Contributor

alamb commented Sep 15, 2021

Starting to work on an experimental crate within arrow-datafusion seems like a good approach to me.

I haven't tracked / studied the requirements for releasing alpha/betas in fine detail, but I think as long as they are clearly marked as not official apache releases, it would be fine to put them on crates.io

@rdettai
Copy link
Contributor

rdettai commented Sep 24, 2021

#1010 (more exactly rdettai#1) is taking care of re-organizing the TableProviders and ExecutionPlans to integrate the ObjectStore abstraction

@matthewmturner
Copy link
Contributor

@alamb @houqp I'm really interested in getting s3 support added.

If no one else is working on this I should be able to pick it up soon (of course I'll likely have questions along the way). Based on the above it seems the approach would be to add it as crate in arrow-datafusion. Assuming this is still the case is there anything else I should know before starting?

@houqp
Copy link
Member

houqp commented Dec 24, 2021

@matthewmturner we created https://github.com/datafusion-contrib to host these community maintained extensions so it's easier for us to evolve the core datafusion code base.

@houqp
Copy link
Member

houqp commented Dec 24, 2021

I also recommend taking a look at the new official AWS rust SDK, it seems a lot more mature since we started this discussion.

@matthewmturner
Copy link
Contributor

I also recommend taking a look at the new official AWS rust SDK, it seems a lot more mature since we started this discussion.

Ok. Will check it out.

@matthewmturner
Copy link
Contributor

matthewmturner commented Dec 25, 2021

@houqp Separate but related to s3 I'm interested in adding the ability to register a AWS glue catalog / database.

Is that type of functionality something I could expect datafusion to have? If so, do you think it makes sense to bundle with s3 functionality? I would of course start with s3. But just for planning purposes (I.e looking at rusoto vs AWS sdk)

@houqp
Copy link
Member

houqp commented Dec 25, 2021

I think so, but i think the GLUE catalog integration should be implemented as a table provider plugin in a separate crate, not as part of the s3 object store plugin.

@matthewmturner
Copy link
Contributor

@houqp for these cloud based services what do you think about a naming convention like the following:

datafusion-[cloud provider]-[service]

i.e.
datafusion-aws-s3
datafusion-aws-glue

@houqp
Copy link
Member

houqp commented Dec 26, 2021

My suggestion would be naming the repos by extension types, for example: datafusion-objectstore-s3, datafusion-table-glue, etc. But I don't have a strong opinion on this :)

@matthewmturner
Copy link
Contributor

makes sense. do you think for glue it might be better to use catalog instead of table?

@houqp
Copy link
Member

houqp commented Dec 26, 2021

Yeah, i think that's a better name 👍

@alamb
Copy link
Contributor

alamb commented Dec 26, 2021

FWIW one example of using the aws s3 object store interface is in IOx: https://github.com/influxdata/influxdb_iox/blob/main/object_store/src/aws.rs in case that is helpful

@matthewmturner
Copy link
Contributor

FWIW one example of using the aws s3 object store interface is in IOx: https://github.com/influxdata/influxdb_iox/blob/main/object_store/src/aws.rs in case that is helpful

thank you

@matthewmturner
Copy link
Contributor

@houqp @alamb im very interested in getting s3 / glue support added to datafusion cli. do you know of a way to feature gate other libraries without adding their code to datafusion cli codebase? im just concerned that datafusion cli codebase could become messy if we end up adding the code and feature gate for each functionality like this (i.e. aws s3, aws glue, azure blob, custom sql, etc...). or maybe thats okay? what do you think?

As an aside, and for your information, someone emailed me and wanted to add azure blob storage support to datafusion but said they could work on s3 first. ive created a repo in the datafusion-contrib organization (https://github.com/datafusion-contrib/datafusion-objectstore-s3) and will point them to the Influxdb IOx implementation for inspiration.

@alamb
Copy link
Contributor

alamb commented Jan 4, 2022

FYI there is some great work from @yahoNanJing in this area on #1062

@seddonm1
Copy link
Contributor

seddonm1 commented Jan 5, 2022

Hi guys,
I built this which works correctly and uses the offical Amazon SDK: https://gist.github.com/seddonm1/2fb5a6892989fe7bf246022a7bd586ee

If it is useful I'm happy to donate it.

@houqp
Copy link
Member

houqp commented Jan 5, 2022

@seddonm1 I invited you to the datafusion contrib org as well, could you work with @matthewmturner to figure out how you want to collaborate on this?

@matthewmturner
Copy link
Contributor

@houqp yup sounds good.

@seddonm1 that would be fantastic if you could donate it. do you have a preference for how you would like to do that? if you want to contribute directly i could work on setting up CI in the repo and testing what you've done on my side.

let me know how you would like to proceed and thanks so much for the donation!

@seddonm1
Copy link
Contributor

seddonm1 commented Jan 5, 2022

@houqp thanks have joined that org.
@matthewmturner I have updated my example with the tests I ran against the Minio docker container (https://docs.min.io/docs/minio-docker-quickstart-guide) (so ignore the passwords as they are just the suggested values). I suggest something similar for testing in CI.

I can do a PR tomorrow AU time.

@gopik
Copy link

gopik commented Jan 5, 2022

Thanks @seddonm1. minio provides s3 api compatibility for azure blob storage. I'll try running the tests today with azure blob storage.

@matthewmturner
Copy link
Contributor

@seddonm1 ok i will look into that. also once im comfortable with the usage i can work on the documentation.

from a code perspective, is there anything in particular that we should be aware of? i.e. any issues with aws rust sdk?

@seddonm1
Copy link
Contributor

seddonm1 commented Jan 5, 2022

for those waiting for this functionality a PR has been raised: datafusion-contrib/datafusion-objectstore-s3#2

If you want to help test you can follow the instructions in the PR and it should work. The abstraction is good @rdettai and will be even better once the async fn chunk_reader can be used 👍

@nitisht
Copy link
Contributor

nitisht commented Feb 4, 2022

Folks, thank you for great work on this issue. I see that repo https://github.com/datafusion-contrib/datafusion-objectstore-s3 has the relevant code added. I am a little hazy on how do I get these two projects (datafusion & datafusion-objectstore-s3) compiled and work together. Are there any pointers towards this?

@matthewmturner
Copy link
Contributor

@nitisht the README on https://github.com/datafusion-contrib/datafusion-objectstore-s3 has some examples how to use it and you could also look at our tests to see all functionality that we have tested (https://github.com/datafusion-contrib/datafusion-objectstore-s3/blob/main/src/object_store/aws.rs).

If there is something that is not clear - or a functionality you need that we havent yet added could you please create an issue in that repo?

@matthewmturner
Copy link
Contributor

@alamb @houqp im thinking we can close this issue now.

@houqp houqp closed this as completed Feb 5, 2022
@nitisht
Copy link
Contributor

nitisht commented Feb 5, 2022

you could also look at our tests to see all functionality

Thank you for this pointer @matthewmturner . This is precisely what I was looking for, everything working great in my local testing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests