feat(read_file funcs): infer from compressed formats #2639
Conversation
This needs tests, and there are a few questions.
{
    return Ok(OwnedTableReference::Partial {
        schema: "public".into(),
        table: "ndjson_scan".into(),
can we use the read_ functions here?
} else if filename.contains(".bson.gz") {
    return Ok(OwnedTableReference::Partial {
        schema: "public".into(),
        table: "read_bson".into(),
    });
I'm not 100% sure that this will actually work.
I think we should still pass this in to the planner even if it's invalid. I don't think this stage of the planning should be concerned with plan validity. It's only concerned with creating the logical plan from a semantically correct query, even if that plan itself is invalid.
I'm not sure that makes a lot of sense, or at least it isn't consistent with the implementation. If we want to implement what you describe, then we should split the right side of the string on ".", match the penultimate segment against the formats, and match the final segment against the compression formats. Really, if it's not this code's responsibility to decide whether the reading function can handle the compression formats, then we should ignore them here entirely.
But I don't think that's what we actually want. "Pass invalid arguments through and hope that the downstream code handles them correctly" isn't a reasonable design, and (more importantly) it makes it harder to return a clear error message to users in invalid cases.
It's not "hoping downstream code handles it correctly". It's a matter of separation of logic and keeping things DRY. The table function already needs to handle this scenario, so it's redundant to check it here as well:
select * from read_bson('some_file.bson.gz'); # table function already needs to check it
select * from 'some_file.bson.gz'; # just expands to above.
> It's a matter of separation of logic and keeping things DRY.
My concern is primarily for centralizing the encoding of the validation logic for all of the formats and table functions (and DDL-created dispatches to these table providers), so that we can have clear error messages, and avoid having duplicate validation logic.
> The table function already needs to handle this scenario.
Sort of? The table functions are going to error if they can't read the data or if it's in the wrong format. It's maybe not necessary for them to validate that the path itself is well-formed.
> It's redundant to check it here also.
It is redundant, but so is enumerating every possible compression algorithm and extension variant. We shouldn't need to care about .bson.gz vs .bson.gzip (and so forth). I think the reasonable solution to getting something here that's helpful and useful is:
- split the string on "." and, if the last or second-to-last element in the resulting sequence is bson, ndjson, jsonl, or csv, dispatch to the appropriate function (see the sketch after this list).
- let's leave .json out of this as much as possible because of the ambiguity between text sequences and line-separated json.
- write tests to validate that the table functions behave correctly with compression and propagate reasonable error messages otherwise.
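A minimal sketch of that dispatch, purely for illustration (the helper name, the compression list, and returning the function name as a string are assumptions here, not the actual implementation in this PR):

fn infer_read_func(filename: &str) -> Option<&'static str> {
    // hypothetical compression list; the real set should come from whatever
    // the readers actually support
    let compression_exts = ["gz", "gzip", "bz2", "xz", "zst"];

    // look only at the last and second-to-last dot-separated segments
    let parts: Vec<&str> = filename.rsplit('.').collect();
    let format = match parts.as_slice() {
        [last, second, ..] if compression_exts.contains(last) => *second,
        [last, ..] => *last,
        [] => return None,
    };

    match format {
        "bson" => Some("read_bson"),
        "ndjson" | "jsonl" => Some("read_ndjson"),
        "csv" => Some("read_csv"),
        // deliberately no "json" arm because of the text-sequence ambiguity
        _ => None,
    }
}

With that shape, "data.bson.gz" and "data.bson" both dispatch to read_bson, while "data.json" falls through and can produce a clear error.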
} else if filename.contains(".parquet.gz") {
    return Ok(OwnedTableReference::Partial {
        schema: "public".into(),
        table: "parquet_scan".into(),
    });
Does it make sense to gzip a parquet file, given its internal compression and the fact that gzip will make it difficult or impossible to read the file in partitioned chunks?
I think if arrow's parquet reader supports gz compression, then we should too. Whether it's wise to gzip parquet is another question.
I mean, this isn't a very strong opinion, but I think this is likely to lead people to very poor experiences, and given that, maybe it'd be best to disallow it? There's no reason we need to recapitulate arrow's or datafusion's mistakes.
But then if people already have gzipped parquet files created by <insert tool>, those files become unreadable. I don't think we should put this limitation on the readers. I could see this being a valid argument for the write path, though.
> But then if people already have gzipped parquet files created by <insert tool>, those files become unreadable. I don't think we should put this limitation on the readers.
I don't think we have a write path that compresses any objects, anyway.
I'm pretty sure there aren't tools that support this so much as there are people running gzip from the shell, but that doesn't make it reasonable or correct.
Should we support use-cases that are fundamentally misunderstandings or mistakes, especially when they're likely to produce bad experiences, just because someone might stumble into them?
It's much easier to add something like this later, if a valid, non-pathological use-case shows up, than it is to spend [the lifetime of the product] doing damage control. Sometimes an error in these cases is kinder than supporting something that's possible but broken.
Thanks for picking this one up @hemanth94!
While not mentioned in the original issue, I think it'd be pretty easy to include support for all supported compression types via datafusion's FileCompressionType.
We should also include a few simple slts. For this, I'd suggest compressing the userdata1 dataset that's used for parquet, ndjson & csv and saving the result to the same respective directories, e.g. testdata/csv/userdata1.csv -> testdata/csv/userdata1.csv.gz and testdata/parquet/userdata1.parquet -> testdata/parquet/userdata1.parquet.gz.
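As a rough sketch of what one of those slts could look like (the query directive and the 1000-row count are assumptions based on the existing userdata1 fixtures):

# hypothetical slt for the gzipped csv fixture
query I
select count(*) from './testdata/csv/userdata1.csv.gz'
----
1000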
@@ -261,10 +268,45 @@ fn infer_func_for_file(path: &str) -> Result<OwnedTableReference> {
            schema: "public".into(),
            table: "read_bson".into(),
        },
        "gz" => {
I think we'd want to match on all compression types here. We can use datafusion's FileCompressionType for this:
if let Ok(compression_type) = ext.parse::<FileCompressionType>() {
    let ext = compression_type.get_ext();
    let path = path.trim_end_matches(ext.as_str());
    // then just recursively call the function with the extension removed
    infer_func_for_file(path)
} else {
    match ext.as_str() { /* ... */ }
}
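To illustrate the behavior this relies on, a small standalone sketch (the module paths are assumptions and differ between datafusion versions, and the GetExt trait has to be in scope for get_ext()):

// assumed imports; exact paths vary across datafusion versions
use datafusion::common::GetExt;
use datafusion::datasource::file_format::file_compression_type::FileCompressionType;

fn main() {
    // "gz" parses to the GZIP compression type...
    let compression: FileCompressionType = "gz".parse().unwrap();
    // ...whose extension (with the leading dot) can be trimmed off the path,
    // so the recursive call then sees "userdata1.parquet"
    let trimmed = "userdata1.parquet.gz".trim_end_matches(compression.get_ext().as_str());
    assert_eq!(trimmed, "userdata1.parquet");

    // a non-compression extension fails to parse and falls through to the
    // normal format matching
    assert!("parquet".parse::<FileCompressionType>().is_err());
}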
I'll use this to include all compression types. I overlooked the logic you suggested while making my commits; I'll apply it properly as well. Sure, I will try this. Will do this as well.
Co-authored-by: universalmind303 <universalmind.candy@gmail.com>
SLTs are failing with this error; it seems the parquet "gz" format combination is not supported.
testdata/sqllogictests/infer.slt
query
select count(*) from './testdata/parquet/userdata1.parquet.gz'
----
1000
Can we also test other combinations?
I think that parquet shouldn't work and we should add a test that ensures we don't change that behavior.
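For example, a hypothetical slt along these lines could pin that down (the error directive and the expectation that this path is rejected are assumptions):

# hypothetical: gzipped parquet should be rejected rather than silently misread
statement error
select count(*) from './testdata/parquet/userdata1.parquet.gz'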
This seems correct and like the desired behavior.
I just wanted to check in and see where this was and if there was anything you needed from us on this.
Hello @tychoish, I think this can be merged now. I made the last changes as you requested. That's all.
Thanks for the contribution @hemanth94!
Fixes #2602
Inferring functions from compressed file formats.
@universalmind303 Please suggest changes if needed.
Thanks.