feat(read_ file funcs): infer from compressed formats #2639

I think we'd want to match on all compression types here. We can use datafusion's `FileCompressionType` for this.
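
For illustration, a minimal sketch of what that inference could look like, assuming datafusion's `FileCompressionType` and its `FromStr` impl (the module path varies across datafusion versions, and `infer_compression` is a hypothetical helper, not code from this PR):

```rust
use std::str::FromStr;

// Module path differs between datafusion versions; recent releases expose it here.
use datafusion::datasource::file_format::file_compression_type::FileCompressionType;

/// Hypothetical helper: treat the last extension segment of a path as a
/// compression extension, falling back to UNCOMPRESSED when it isn't one.
fn infer_compression(path: &str) -> FileCompressionType {
    path.rsplit('.')
        .next()
        .and_then(|ext| FileCompressionType::from_str(ext).ok())
        .unwrap_or(FileCompressionType::UNCOMPRESSED)
}

fn main() {
    // `from_str` covers the compression types datafusion knows about
    // ("gz", "bz2", "xz", "zst", ...), so nothing is hard-coded here.
    assert_eq!(infer_compression("data/table.csv.gz"), FileCompressionType::GZIP);
    assert_eq!(
        infer_compression("data/table.csv"),
        FileCompressionType::UNCOMPRESSED
    );
}
```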

I'll use this to include all compression types. I overlooked the code logic you suggested while making commits; I'll use it properly as well.

Can we use the `read_` functions here?

Does it make sense to gzip a parquet file, given its internal compression and the fact that gzip will make it difficult or impossible to partition anything from the file?

I think if arrow's parquet reader supports `gz` compression, then we should too. Whether it's wise to gzip parquet is another question.

I mean, I don't think this is a very strong opinion, but I think this is likely to lead people to very poor experiences, and given that, maybe it'd be best to disallow it? There's no reason we need to recapitulate arrow's or datafusion's mistakes.

But then if people already have gzipped parquet files created from <insert tool>, those files are unreadable. I don't think we should put this limitation on the readers. I could see this being a valid argument for writing, though.

I don't think we have a write path that compresses any objects, anyway. I'm pretty sure there aren't tools that support this so much as there are people running `gzip` from the shell, but that doesn't make it reasonable or correct. Should we support use cases that are fundamentally misunderstandings or mistakes, especially when they're likely to produce bad experiences, just because someone might stumble into them?

It's much easier to add something like this later, if there's a valid, non-pathological use case, than it is to spend [the lifetime of the product] doing damage control. Sometimes an error in these cases is kinder than supporting something that's broken but possible.

I'm not 100% sure that this will actually work.

I think we should still pass this in to the planner even if it's invalid. I don't think this stage of planning should be concerned with plan validity; it's only concerned with creating the logical plan from a semantically correct query, even if that plan itself is invalid.

I'm not sure that makes a lot of sense, or at least it's not consistent with the implementation. If we want to implement what you describe, then we split the right side of the string on `.`, match the penultimate segment against the formats, and match the final segment against the compression formats. Really, if it's not this code's responsibility to decide whether the reading function can handle the compression formats, then we should just ignore them here. But I don't think that's what we actually want, and I don't think "pass invalid arguments through and hope that the downstream code handles it correctly" is a reasonable design; more importantly, it makes it harder to return a clear error message to users in invalid cases.
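
For illustration, a minimal sketch of the splitting described above; the extension lists and the `split_format_and_compression` name are made up for the example, not taken from the PR:

```rust
/// Illustrative extension lists only; not the lists the PR actually uses.
const FORMAT_EXTS: &[&str] = &["csv", "parquet", "ndjson", "bson"];
const COMPRESSION_EXTS: &[&str] = &["gz", "bz2", "xz", "zst"];

/// Split the trailing part of the path on '.', matching the final segment
/// against the compression extensions and the penultimate one against the
/// file formats. Returns (format, optional compression).
fn split_format_and_compression(path: &str) -> Option<(&str, Option<&str>)> {
    let name = path.rsplit('/').next().unwrap_or(path);
    let mut segments = name.rsplit('.');
    let last = segments.next()?;

    if COMPRESSION_EXTS.contains(&last) {
        // e.g. "table.csv.gz" -> ("csv", Some("gz"))
        let format = segments.next().filter(|s| FORMAT_EXTS.contains(s))?;
        Some((format, Some(last)))
    } else if FORMAT_EXTS.contains(&last) {
        // e.g. "table.csv" -> ("csv", None)
        Some((last, None))
    } else {
        // Neither a known format nor a known compression: the spot where a
        // clear, user-facing error could be produced instead of `None`.
        None
    }
}

fn main() {
    assert_eq!(
        split_format_and_compression("bucket/table.csv.gz"),
        Some(("csv", Some("gz")))
    );
    assert_eq!(split_format_and_compression("table.txt"), None);
}
```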

It's not "hoping downstream code handles it correctly". It's a matter of separation of logic and keeping things DRY. The table function already needs to handle this scenario; it's redundant to check it here as well.

My concern is primarily with centralizing the validation logic for all of the formats and table functions (and the DDL-created dispatches to these table providers), so that we can have clear error messages and avoid duplicating validation logic.

Sort of? The table functions are going to error if they can't read the data or if it's in the wrong format. It's maybe not necessary for them to validate that the path itself is well-formed.

It is redundant, but so is enumerating every possible compression algorithm and extension variant. We shouldn't need to care about `.bson.gz` vs `.bson.gzip` (and so forth). I think the reasonable solution to getting something here that's helpful and useful is: split on `.`; if the last or second-to-last element in the resulting sequence is `bson`, `ndjson`, `jsonl`, or `csv`, then dispatch to the appropriate function; and keep `.json` out of this as much as possible because of the ambiguity between text sequences and line-separated json.
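
A sketch of that dispatch rule under the same assumptions; `dispatch_format` and the sample paths are illustrative only, not the actual table functions:

```rust
/// Formats we dispatch on; `json` is deliberately left out because of the
/// ambiguity between text sequences and line-separated json.
const FORMATS: &[&str] = &["bson", "ndjson", "jsonl", "csv"];

/// Hypothetical dispatch: look at the last and second-to-last '.'-separated
/// elements of the filename and return the matching format, if any.
fn dispatch_format(path: &str) -> Option<&'static str> {
    let name = path.rsplit('/').next().unwrap_or(path);
    let mut segments = name.rsplit('.');
    let last = segments.next();
    let second_to_last = segments.next();

    [last, second_to_last]
        .into_iter()
        .flatten()
        .find_map(|seg| FORMATS.iter().copied().find(|&f| seg.eq_ignore_ascii_case(f)))
}

fn main() {
    // ".bson.gz" vs ".bson.gzip" doesn't matter: "bson" is the
    // second-to-last element either way.
    assert_eq!(dispatch_format("data.bson.gz"), Some("bson"));
    assert_eq!(dispatch_format("data.bson.gzip"), Some("bson"));
    assert_eq!(dispatch_format("logs.jsonl"), Some("jsonl"));
    assert_eq!(dispatch_format("dump.json"), None); // json intentionally excluded
}
```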