
feat(read_ file funcs): infer from compressed formats #2639

Merged (26 commits into GlareDB:main from Hem_Dev2, Feb 21, 2024)

Conversation

hemanth94 (Contributor):

Fixes #2602
This adds inference of the read_ table functions from compressed file formats.
@universalmind303 Please suggest changes if needed.

Thanks.

tychoish (Contributor) left a comment:

This needs tests, and there are a few questions.

crates/datafusion_ext/src/planner/relation/mod.rs (two outdated review threads, resolved)
{
    return Ok(OwnedTableReference::Partial {
        schema: "public".into(),
        table: "ndjson_scan".into(),
    });
Can we use the read_ functions here?

Comment on lines 302 to 306
} else if filename.contains(".bson.gz") {
    return Ok(OwnedTableReference::Partial {
        schema: "public".into(),
        table: "read_bson".into(),
    });
I'm not 100% sure that this will actually work.

universalmind303 (Contributor), Feb 13, 2024:

I think we should still pass this in to the planner even if it's invalid. I don't think this stage of the planning should be concerned with plan validity. It's only concerned with creating the logical plan from a semantically correct query, even if that plan itself is invalid.


I'm not sure that makes a lot of sense, or at least it's not consistent with the implementation. If we want to implement what you describe, then we would split the right side of the string on ".", match the second-to-last segment against the formats, and match the last segment against the compression formats. Really, if it's not this code's responsibility to decide whether the reading function can handle the compression formats, then we should just ignore them here.

But I don't think that's what we actually want. I don't think "pass invalid arguments through and hope that the downstream code handles it correctly" is a reasonable design, and (more importantly) it makes it harder to return a clear error message to users in invalid cases.

universalmind303 (Contributor), Feb 13, 2024:

It's not "hoping the downstream code handles it correctly"; it's a matter of separation of logic and keeping things DRY. The table functions already need to handle this scenario; it's redundant to check it here as well:

select * from read_bson('some_file.bson.gz'); # table function already needs to check it
select * from 'some_file.bson.gz'; # just expands to above.


> It's a matter of separation of logic and keeping things DRY.

My concern is primarily with centralizing the validation logic for all of the formats and table functions (and the DDL-created dispatches to these table providers), so that we can have clear error messages and avoid duplicate validation logic.

> The table functions already need to handle this scenario.

Sort of? The table functions are going to error if they can't read the data or if it's in the wrong format. It's maybe not necessary for them to validate that the path itself is well-formed.

> It's redundant to check it here as well.

It is redundant, but so is enumerating every possible compression algorithm and extension variant. We shouldn't need to care about .bson.gz vs. .bson.gzip (and so forth). I think the reasonable way to get something here that's helpful and useful is:

  • Split the string on "."; if the last or second-to-last element in the resulting sequence is bson, ndjson, jsonl, or csv, dispatch to the appropriate function (see the sketch after this list).
  • Leave .json out of this as much as possible, because of the ambiguity between text sequences and line-separated JSON.
  • Write tests to validate that the table functions behave correctly with compression and propagate reasonable error messages otherwise.
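
A minimal sketch of that dispatch (a hypothetical helper, not the code merged in this PR; csv_scan is an assumption, while ndjson_scan, read_bson, and parquet_scan appear in the diff above):

// Hypothetical sketch of the proposed inference: look at the last one or two
// path extensions, treat the trailing one as a possible compression suffix,
// and dispatch on the format segment.
fn infer_read_func(filename: &str) -> Option<&'static str> {
    let mut segments = filename.rsplit('.');
    let last = segments.next()?;
    let second_to_last = segments.next().unwrap_or("");

    // The format is either the final extension ("data.csv") or the segment
    // just before a compression extension ("data.csv.gz").
    let format = if is_known_format(last) {
        last
    } else if is_known_format(second_to_last) {
        second_to_last
    } else {
        return None;
    };

    // .json is deliberately left out because of the text-sequence vs
    // line-separated JSON ambiguity mentioned above.
    Some(match format {
        "csv" => "csv_scan",
        "ndjson" | "jsonl" => "ndjson_scan",
        "bson" => "read_bson",
        _ => unreachable!(),
    })
}

fn is_known_format(ext: &str) -> bool {
    matches!(ext, "csv" | "ndjson" | "jsonl" | "bson")
}

In this sketch the compression suffix only decides which segment holds the format; whether the reader actually supports that compression is left to the table functions and the tests described in the last bullet.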

Comment on lines 292 to 296
} else if filename.contains(".parquet.gz") {
    return Ok(OwnedTableReference::Partial {
        schema: "public".into(),
        table: "parquet_scan".into(),
    });

Does it make sense to gzip a parquet file, given its internal compression and the fact that gzip will make it difficult or impossible to partition anything from the file?


I think if arrow's parquet reader supports gz compression, then we should too.

If it's wise to gzip parquet is another question.


I mean, I don't hold this as a very strong opinion, but I think this is likely to lead people to very poor experiences, and given that, maybe it'd be best to disallow it?

There's no reason we need to recapitulate arrow's or datafusion's mistakes.

universalmind303 (Contributor), Feb 13, 2024:

But then if people have gzipped parquet files already created by <insert tool>, they're unreadable. I don't think we should put this limitation on the readers. I could see this being a valid argument for writing, though.


> But then if people have gzipped parquet files already created by <insert tool>, they're unreadable. I don't think we should put this limitation on the readers.

I don't think we have a write path that compresses any objects, anyway.

I'm pretty sure this is less about tools that produce such files and more about people running gzip from the shell, but that doesn't make it reasonable or correct.

Should we support use cases that are fundamentally misunderstandings or mistakes, especially when they're likely to produce bad experiences, just because someone might stumble into them?

It's much easier to add something like this later, if there's a valid, non-pathological use case, than it is to spend [the lifetime of the product] doing damage control. Sometimes an error in these cases is kinder than supporting something that's broken but possible.

crates/datafusion_ext/src/planner/relation/mod.rs (outdated review thread, resolved)
universalmind303 (Contributor) left a comment:

Thanks for picking this one up @hemanth94!

While not mentioned in the original issue, I think it'd be pretty easy to include support for all supported compression types via datafusion's FileCompressionType.

We should also include a few simple SLTs. For this, I'd suggest compressing the userdata1 dataset that's used for parquet, ndjson, and csv, and saving it to the same respective directories.

ex: testdata/csv/userdata1.csv -> testdata/csv/userdata1.csv.gz
testdata/parquet/userdata1.parquet -> testdata/parquet/userdata1.parquet.gz


@@ -261,10 +268,45 @@ fn infer_func_for_file(path: &str) -> Result<OwnedTableReference> {
schema: "public".into(),
table: "read_bson".into(),
},
"gz" => {

I think we'd want to match on all compression types here.

We can use datafusion's FileCompressionType for this.

if let Ok(compression_type) = ext.parse::<FileCompressionType>() {
  let ext = compression_type.get_ext();
  let path = path.trim_end_matches(&ext);
  // then just recursively call the function with the compression extension removed
  infer_func_for_file(path)
} else {
  match ext.as_str() { /* ... */ }
}
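
Assuming FileCompressionType parses extensions such as gz, bz2, xz, and zst (the variants exercised by the SLTs later in this PR), a path like userdata1.csv.gz would first have its compression suffix stripped and then be re-inferred as userdata1.csv, so the format match stays in one place instead of enumerating every format/compression pair by hand.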

hemanth94 (Contributor, PR author):

I'll use this to include all compression types.

I overlooked the logic you suggested while making my commits; I'll apply it properly as well.

crates/datafusion_ext/src/planner/relation/mod.rs (outdated review thread, resolved)
hemanth94 (Contributor, PR author):

> While not mentioned in the original issue, I think it'd be pretty easy to include support for all supported compression types via datafusion's FileCompressionType.

Sure, I will try this.

> We should also include a few simple SLTs. For this, I'd suggest compressing the userdata1 dataset that's used for parquet, ndjson, and csv, and saving it to the same respective directories.

> ex: testdata/csv/userdata1.csv -> testdata/csv/userdata1.csv.gz
> testdata/parquet/userdata1.parquet -> testdata/parquet/userdata1.parquet.gz

Will do this as well.

hemanth94 (Contributor, PR author):

The SLTs are failing with this error; it seems the parquet "gz" format is not supported:

internal error: cannot execute simple query: db error: ERROR: External error: compression not supported for parquet

greyscaled changed the title from "Fix: Inferring functions from compressed file formats" to "feat(read_ file funcs): infer from compressed formats" on Feb 15, 2024
Comment on lines 72 to 75
query
select count(*) from './testdata/parquet/userdata1.parquet.gz'
----
1000

Can we also test other combinations?

I think parquet shouldn't work, and we should add a test that ensures we don't change that behavior.

tychoish (Contributor):

> The SLTs are failing with this error; it seems the parquet "gz" format is not supported.

This seems correct, and like the desired behavior.

universalmind303 self-requested a review on February 16, 2024, 21:14
tychoish (Contributor):

I just wanted to check in and see where this was and if there was anything you needed from us on this.

hemanth94 (Contributor, PR author):

Hello, @tychoish

I think this can be merged now.

These are the last changes I made, as you requested:

  • I added bz2, xz, and zst compressed variants for csv, JSON, and Parquet, with respective SLTs, which pass.
  • I changed the SLTs for parquet to expect an error, since Parquet files are supposed to fail with compressed formats.

That's all.
Thanks,

tychoish enabled auto-merge (squash) on February 21, 2024, 13:49
tychoish merged commit ef464f1 into GlareDB:main on Feb 21, 2024
25 checks passed
hemanth94 deleted the Hem_Dev2 branch on February 21, 2024, 15:06
universalmind303 (Contributor) left a comment:

Thanks for the contribution @hemanth94!

Successfully merging this pull request may close these issues: select * from 'some_file.json.gz