ObjectStore API to read from remote storage systems #950
Conversation
Hi @yjshen ! Thanks for opening this new PR with the remote storage system only! Sorry I didn't have the chance to submit all the feedback to the design document in time.
I think it brings a lot of value compared to how things work currently. But I still have the feeling that this is not the ideal level of abstraction. In my opinion (but this might be overthinking), the structure that generates the list of files (ObjectStore.list()) should take:
- a URI (a string, just like the prefix currently), which could be a plain path (bucket+prefix) for an object store like S3, but could also be something a bit more evolved:
  - an S3 location with hive partitioning (URI=bucket/prefix?partition=year&partition=month)
  - a delta table (URI=bucket/prefix?versionAsOf=v2)
- an expression, so that we can push the filter down into the generation of the file list. This is VERY important for very large datasets with lots of files, where listing all files takes too long.
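A rough, hypothetical sketch of what such a richer listing signature could look like (the `FileMeta` and `Expr` types below are placeholders invented for illustration, not the types defined in this PR, whose real API is async and stream-based):

```rust
/// Placeholder for an object's metadata (illustrative only).
pub struct FileMeta {
    pub path: String,
    pub size: u64,
}

/// Placeholder for a DataFusion-style filter expression, e.g. `year = 2021`.
pub struct Expr;

/// Sketch of the proposed listing entry point: the store receives a
/// URI-like location (which may encode partitioning or versioning hints)
/// plus an optional filter, so it can prune the file list itself.
pub trait ObjectStore: Send + Sync {
    fn list(&self, uri: &str, filter: Option<&Expr>) -> Vec<FileMeta>;
}
```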
@rdettai Thanks for reviewing 👍
I think this could be achieved inside the S3 object store implementation with another PR on the
I think the current, non-filtering version of the listing is kept here for simplicity. Check more discussion on this in the doc here.
Agreed, my point is mostly about the naming/doc; I find
Understood! As the goal of this PR is to set the interface, I thought we might as well include everything that will be required, and I am pretty sure that filter pushdown to the file listing is a must-have. Note that it can be included in the interface and not used in the implementations at first, just like the filter pushdown in
Overall looks good to me, thanks @yjshen !
I think this is a good step. Thank you @yjshen !
I propose leaving this open for the weekend to gather any additional comments, and if no one objects, merge it in on Monday.
@rdettai on the topics of prefix semantics and filter pushdown: in my mind, these are higher-level logic/abstractions that are not natively supported by the object store. For example, I am under the assumption that
I am also having a hard time imagining how filter pushdown expressions would be used to help the object store build a better or more efficient API call. It seems like this kind of higher-level logic should be handled at layers above ObjectStore? They seem to be object-store agnostic. As a result, it would be better to just implement them once in abstraction layers like
Does this make sense to you? Or am I missing specific features from some object stores that can support native filter pushdown optimizations at the API level?
@rdettai @houqp After checking the DataFusion and delta-rs code, I think it is more natural to have the TableProvider deal with filters; it is the abstraction over a table, and therefore also a suitable entity for table partitions (inferred or user-provided). The laziness of file listing could be achieved by the TableProvider, relying on the ObjectStore to only list related files.
In the case of hive partitioning, you need an extra capability from the object store to achieve that: listing the folders (e.g. for S3, here). Otherwise you cannot know what partitions are there and build the appropriate prefixes. Maybe this example helps to give more context: https://docs.google.com/document/d/1nKXPvolft1_DuVjVK7ceOx37arCdQLKc_Im-TfZh9dk/edit?usp=sharing
Thanks @rdettai for the detailed google doc write-up. I left some comments there :) I think the object store interface proposed in this PR does support listing object keys with partition prefixes, for example:
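A small, self-contained illustration of the idea, with made-up paths and a hypothetical helper (a prefix listing simply selects the object keys under a hive-style partition "directory"):

```rust
/// Hypothetical stand-in: a prefix listing keeps exactly the object keys
/// under a given partition "directory", e.g. `table/year=2021/month=09/`.
fn keys_under_prefix<'a>(keys: &'a [String], prefix: &str) -> Vec<&'a String> {
    keys.iter().filter(|k| k.starts_with(prefix)).collect()
}

fn main() {
    let keys = vec![
        "table/year=2021/month=09/part-0.parquet".to_string(),
        "table/year=2021/month=10/part-0.parquet".to_string(),
    ];
    // Selecting one partition is just a longer prefix.
    let selected = keys_under_prefix(&keys, "table/year=2021/month=09/");
    assert_eq!(selected.len(), 1);
}
```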
@yjshen @houqp you are right, I think that maybe with the delimiter argument added to
I am still wondering how we could replicate the hive partitioning support for
Thanks @rdettai @houqp for the optional delimiter. I was thinking
After taking both partitioned and non-partitioned tables in one remote store into consideration, this original
For raw file tables like a folder of parquet, csv and ndjson files, I think the partition discovery and pruning logic should be agnostic to both file formats and object stores. We could implement this logic as a shared module, which gets invoked in the ParquetTable, CsvFile and NdJsonFile table providers to resolve the filtered-down object list. Then these table providers will pass down the object paths to the corresponding format-specific physical execution plans.
Given that the Hive catalog manages a collection of databases and tables with their corresponding metadata, I think CatalogProvider is probably the right abstraction for it. The Hive catalog provider could query the Hive metastore for a given schema and table name. It can then construct a format-specific table provider based on the Hive metastore response.
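A rough idea of the shared, format-agnostic pruning step described above (entirely hypothetical; the function and path names are invented for illustration):

```rust
/// Hypothetical shared helper: given listed object paths and the requested
/// hive-style partition values, keep only the matching files. Parquet, CSV
/// and NDJSON table providers could all reuse this before handing the
/// resulting paths to their format-specific execution plans.
fn prune_partitioned_paths(paths: &[String], wanted: &[(&str, &str)]) -> Vec<String> {
    paths
        .iter()
        .filter(|p| {
            // A path matches if it contains every requested `key=value` segment.
            wanted.iter().all(|(k, v)| p.contains(&format!("{}={}", k, v)))
        })
        .cloned()
        .collect()
}

fn main() {
    let paths = vec![
        "table/year=2021/month=09/part-0.parquet".to_string(),
        "table/year=2020/month=01/part-0.parquet".to_string(),
    ];
    let kept = prune_partitioned_paths(&paths, &[("year", "2021")]);
    assert_eq!(kept, vec!["table/year=2021/month=09/part-0.parquet".to_string()]);
}
```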
LGTM, I will leave it to @alamb to do the final merge so he can take a final look since the last round of updates.
    prefix: &str,
    _delimiter: Option<String>,
) -> Result<FileMetaStream> {
    list_all(prefix.to_owned()).await
We can maybe mark this as // TODO, as the delimiter is ignored 😉
I think it's fine to leave out the TODO since the partitioned file read is not supported?
It seems to me that if you don't comply with the API, a TODO is appropriate even if nobody is using the feature yet 😃
Also, if the method can return folders, the name of the struct returned (FileMeta) is kind of misleading.
PathMeta? FilePathMeta? 😂 I've run out of naming ideas.
Naming is hard indeed :D
How about ListEntry or ListItem? We can define it as an enum of FileMeta and a prefix string, to distinguish between regular objects and "folder" prefixes.
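One possible shape for that enum (hypothetical; the FileMeta fields are assumed for illustration, not taken from the PR):

```rust
/// Assumed stand-in for the object metadata struct.
pub struct FileMeta {
    pub path: String,
    pub size: u64,
}

/// An item returned by a delimiter-aware listing: either a concrete
/// object, or a common "folder" prefix such as `table/year=2021/`.
pub enum ListEntry {
    FileMeta(FileMeta),
    Prefix(String),
}
```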
References:
- S3 returns prefixes in the <CommonPrefixes> section: https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html
- GCS returns prefixes in the response object's prefixes property: https://cloud.google.com/storage/docs/json_api/v1/objects/list#response
- Azure Blob is similar to S3 and returns prefix strings in its BlobPrefix section: https://docs.microsoft.com/en-us/rest/api/storageservices/enumerating-blob-resources#DelimitedBlobList
There might be value in having two list APIs as well: one for listing objects without delimiters, which returns a stream of FileMeta, and another for listing "directories" with delimiters, which could return either FileMeta or prefix strings.
I am thinking there are use cases where we only need to perform a listing without delimiters. For example, after we have the partition "folders" fully discovered, we will just use the plain list call to get the final list of objects using partition prefixes. Returning a stream of FileMeta directly in this case could avoid enum-matching overhead at the call site.
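A sketch of that split (types re-declared here so the snippet stands alone; synchronous and Vec-based for brevity, whereas the PR's API is async and stream-based):

```rust
pub struct FileMeta {
    pub path: String,
}

pub enum ListEntry {
    FileMeta(FileMeta),
    Prefix(String),
}

pub trait ObjectStore: Send + Sync {
    /// Flat listing: every object under `prefix`, always plain FileMeta,
    /// so callers need no enum matching.
    fn list_file(&self, prefix: &str) -> Vec<FileMeta>;

    /// Delimiter-aware listing: objects plus "folder" prefixes directly
    /// under `prefix`, for walking one partition level at a time.
    fn list_dir(&self, prefix: &str, delimiter: &str) -> Vec<ListEntry>;
}
```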
FWIW in IOx we went back and forth on this as well. There we have one LIST API (https://github.com/influxdata/influxdb_iox/blob/main/object_store/src/lib.rs#L95-L98) that takes an ObjectStorePath, and the path itself can distinguish between "is a (logical) directory" and "is a filename": https://github.com/influxdata/influxdb_iox/blob/main/object_store/src/path.rs#L30-L42
@alamb that's interesting, do you have any opinion on the design here?
Since list without a delimiter always returns object keys, not logical directory keys, I feel like returning FileMeta in the list call provides a better, more statically typed interface for callers.
In IOx, do you remember what the reason was for going back to ObjectStorePath as the return type for regular list calls?
Python test failure seems unrelated, could you please retrigger this single GitHub action @houqp? Thanks!
Looks like the python test is broken on master.
@alamb What's your opinion on the current
@alamb is on vacation this week :P
Sorry for the delay @yjshen -- I will review this carefully today.
I think it looks good 👍
We have resolved the python CI failure on master, so I am going to merge this PR in as-is. Thanks again @yjshen
Amazing work on laying out a solid foundation for IO abstraction in DataFusion @yjshen !
Which issue does this PR close?
Closes #616.
Rationale for this change
Currently, we can only read files from the local filesystem, since we use std::fs in ParquetExec. It would be nice to add support for reading files that reside on storage systems such as HDFS, Amazon S3, etc.
This PR substitutes #811 by only adding the ObjectStore interfaces, leaving the work of hooking these APIs up into the rest of the system and rewriting the existing data sources (like Parquet, etc.) as separate follow-ups.
What changes are included in this PR?
Introduce the ObjectStore API as an abstraction over the underlying storage systems, such as the local filesystem, HDFS, S3, etc., and make the ObjectStore implementation pluggable through the ObjectStoreRegistry in DataFusion's ExecutionContext.
Are there any user-facing changes?
Users can provide implementations of the ObjectStore trait, register them into ExecutionContext's registry, and run queries against data that resides in remote storage systems as well as the local filesystem (the only default implementation of ObjectStore).
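A minimal sketch of how the pluggable-registry idea fits together (hypothetical and simplified: the registry maps a URI scheme to a store, while the PR's real trait is async and stream-based):

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified stand-in for the ObjectStore trait introduced by this PR.
pub trait ObjectStore: Send + Sync {
    /// List all object paths under `prefix`.
    fn list_file(&self, prefix: &str) -> Vec<String>;
}

/// Maps a URI scheme (e.g. "file", "s3", "hdfs") to a registered store,
/// so data sources can resolve remote paths through the registry
/// instead of going straight to std::fs.
#[derive(Default)]
pub struct ObjectStoreRegistry {
    stores: HashMap<String, Arc<dyn ObjectStore>>,
}

impl ObjectStoreRegistry {
    pub fn register(&mut self, scheme: &str, store: Arc<dyn ObjectStore>) {
        self.stores.insert(scheme.to_string(), store);
    }

    pub fn get(&self, scheme: &str) -> Option<Arc<dyn ObjectStore>> {
        self.stores.get(scheme).cloned()
    }
}
```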