
feat(read_ file funcs): infer from compressed formats #2639

Merged (26 commits into GlareDB:main from Hem_Dev2, Feb 21, 2024)

Conversation

hemanth94 (Contributor):

Fixes #2602
This adds inference of the read_ table functions from compressed file formats.
@universalmind303 Please suggest changes if needed.

Thanks.

tychoish (Contributor) left a comment:

This needs tests, and there are a few questions.

crates/datafusion_ext/src/planner/relation/mod.rs (two outdated review threads, resolved)
{
    return Ok(OwnedTableReference::Partial {
        schema: "public".into(),
        table: "ndjson_scan".into(),
    });
Can we use the read_ functions here?

Comment on lines 302 to 306
} else if filename.contains(".bson.gz") {
    return Ok(OwnedTableReference::Partial {
        schema: "public".into(),
        table: "read_bson".into(),
    });
I'm not 100% sure that this will actually work.

universalmind303 (Contributor), Feb 13, 2024:

I think we should still pass this in to the planner even if it's invalid. I don't think this stage of the planning should be concerned with plan validity. It's only concerned with creating the logical plan from a semantically correct query, even if that plan itself is invalid.


I'm not sure that makes a lot of sense, or at least it's not consistent with the implementation. If we want to implement what you describe, then we would split the right side of the string on ".", match the second-to-last segment against the formats, and match the last segment against the compression formats. Really, if it's not this code's responsibility to decide whether the reading function can handle the compression formats, then we should just ignore them here.

But I don't think that's what we actually want. I don't think "pass invalid arguments through and hope that the downstream code handles it correctly" is a reasonable design, and (more importantly) it makes it harder to return a clear error message to users in invalid cases.

universalmind303 (Contributor), Feb 13, 2024:

It's not "hoping the downstream code handles it correctly"; it's a matter of separation of logic and keeping things DRY. The table functions already need to handle this scenario; it's redundant to check it here as well:

select * from read_bson('some_file.bson.gz'); # table function already needs to check it
select * from 'some_file.bson.gz'; # just expands to above.


> It's a matter of separation of logic and keeping things DRY.

My concern is primarily with centralizing the validation logic for all of the formats and table functions (and the DDL-created dispatches to these table providers), so that we can have clear error messages and avoid duplicate validation logic.

> The table functions already need to handle this scenario.

Sort of? The table functions are going to error if they can't read the data or if it's in the wrong format. It's maybe not necessary for them to validate that the path itself is well-formed.

> It's redundant to check it here as well.

It is redundant, but so is enumerating every possible compression algorithm and extension variant. We shouldn't need to care about .bson.gz vs. .bson.gzip (and so forth). I think the reasonable way to get something here that's helpful and useful is:

  • Split the string on "."; if the last or second-to-last element in the resulting sequence is bson, ndjson, jsonl, or csv, dispatch to the appropriate function (see the sketch after this list).
  • Leave .json out of this as much as possible, because of the ambiguity between text sequences and line-separated JSON.
  • Write tests to validate that the table functions behave correctly with compression and propagate reasonable error messages otherwise.
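
A minimal sketch of that dispatch (a hypothetical helper, not the code merged in this PR; csv_scan is an assumption, while ndjson_scan, read_bson, and parquet_scan appear in the diff above):

// Hypothetical sketch of the proposed inference: look at the last one or two
// path extensions, treat the trailing one as a possible compression suffix,
// and dispatch on the format segment.
fn infer_read_func(filename: &str) -> Option<&'static str> {
    let mut segments = filename.rsplit('.');
    let last = segments.next()?;
    let second_to_last = segments.next().unwrap_or("");

    // The format is either the final extension ("data.csv") or the segment
    // just before a compression extension ("data.csv.gz").
    let format = if is_known_format(last) {
        last
    } else if is_known_format(second_to_last) {
        second_to_last
    } else {
        return None;
    };

    // .json is deliberately left out because of the text-sequence vs
    // line-separated JSON ambiguity mentioned above.
    Some(match format {
        "csv" => "csv_scan",
        "ndjson" | "jsonl" => "ndjson_scan",
        "bson" => "read_bson",
        _ => unreachable!(),
    })
}

fn is_known_format(ext: &str) -> bool {
    matches!(ext, "csv" | "ndjson" | "jsonl" | "bson")
}

In this sketch the compression suffix only decides which segment holds the format; whether the reader actually supports that compression is left to the table functions and the tests described in the last bullet.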

Comment on lines 292 to 296
} else if filename.contains(".parquet.gz") {
    return Ok(OwnedTableReference::Partial {
        schema: "public".into(),
        table: "parquet_scan".into(),
    });

Does it make sense to gzip a parquet file, given its internal compression and the fact that gzip will make it difficult or impossible to partition anything from the file?


I think if arrow's parquet reader supports gz compression, then we should too.

If it's wise to gzip parquet is another question.


I mean, I don't hold this as a very strong opinion, but I think this is likely to lead people to very poor experiences, and given that, maybe it'd be best to disallow it?

There's no reason we need to recapitulate arrow's or datafusion's mistakes.

universalmind303 (Contributor), Feb 13, 2024:

But then if people have gzipped parquet files already created by <insert tool>, they're unreadable. I don't think we should put this limitation on the readers. I could see this being a valid argument for writing, though.


> But then if people have gzipped parquet files already created by <insert tool>, they're unreadable. I don't think we should put this limitation on the readers.

I don't think we have a write path that compresses any objects, anyway.

I'm pretty sure this is less about tools that produce such files and more about people running gzip from the shell, but that doesn't make it reasonable or correct.

Should we support use cases that are fundamentally misunderstandings or mistakes, especially when they're likely to produce bad experiences, just because someone might stumble into them?

It's much easier to add something like this later, if there's a valid, non-pathological use case, than it is to spend [the lifetime of the product] doing damage control. Sometimes an error in these cases is kinder than supporting something that's broken but possible.

crates/datafusion_ext/src/planner/relation/mod.rs (outdated review thread, resolved)
universalmind303 (Contributor) left a comment:

Thanks for picking this one up @hemanth94!

While not mentioned in the original issue, I think it'd be pretty easy to include support for all supported compression types via datafusion's FileCompressionType.

We should also include a few simple SLTs. For this, I'd suggest compressing the userdata1 dataset that's used for parquet, ndjson, and csv, and saving it to the same respective directories.

ex: testdata/csv/userdata1.csv -> testdata/csv/userdata1.csv.gz
testdata/parquet/userdata1.parquet -> testdata/parquet/userdata1.parquet.gz


@@ -261,10 +268,45 @@ fn infer_func_for_file(path: &str) -> Result<OwnedTableReference> {
schema: "public".into(),
table: "read_bson".into(),
},
"gz" => {

I think we'd want to match on all compression types here.

We can use datafusion's FileCompressionType for this.

if let Ok(compression_type) = ext.parse::<FileCompressionType>() {
  let ext = compression_type.get_ext();
  let path = path.trim_end_matches(&ext);
  // then just recursively call the function with the compression extension removed
  infer_func_for_file(path)
} else {
  match ext.as_str() { /* ... */ }
}
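
Assuming FileCompressionType parses extensions such as gz, bz2, xz, and zst (the variants exercised by the SLTs later in this PR), a path like userdata1.csv.gz would first have its compression suffix stripped and then be re-inferred as userdata1.csv, so the format match stays in one place instead of enumerating every format/compression pair by hand.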

hemanth94 (Contributor, PR author):

I'll use this to include all compression types.

I overlooked the logic you suggested while making my commits; I'll apply it properly as well.

crates/datafusion_ext/src/planner/relation/mod.rs (outdated review thread, resolved)
hemanth94 (Contributor, PR author):

> While not mentioned in the original issue, I think it'd be pretty easy to include support for all supported compression types via datafusion's FileCompressionType.

Sure, I will try this.

> We should also include a few simple SLTs. For this, I'd suggest compressing the userdata1 dataset that's used for parquet, ndjson, and csv, and saving it to the same respective directories.

> ex: testdata/csv/userdata1.csv -> testdata/csv/userdata1.csv.gz
> testdata/parquet/userdata1.parquet -> testdata/parquet/userdata1.parquet.gz

Will do this as well.

hemanth94 (Contributor, PR author):

The SLTs are failing with this error; it seems the parquet "gz" format is not supported:

internal error: cannot execute simple query: db error: ERROR: External error: compression not supported for parquet

greyscaled changed the title from "Fix: Inferring functions from compressed file formats" to "feat(read_ file funcs): infer from compressed formats" on Feb 15, 2024
Comment on lines 72 to 75
query
select count(*) from './testdata/parquet/userdata1.parquet.gz'
----
1000

Can we also test other combinations?

I think parquet shouldn't work, and we should add a test that ensures we don't change that behavior.

tychoish (Contributor):

> The SLTs are failing with this error; it seems the parquet "gz" format is not supported.

This seems correct, and like the desired behavior.

universalmind303 self-requested a review on February 16, 2024, 21:14
tychoish (Contributor):

I just wanted to check in and see where this was and if there was anything you needed from us on this.

hemanth94 (Contributor, PR author):

Hello, @tychoish

I think this can be merged now.

These are the last changes I made, as you requested:

  • I added bz2, xz, and zst compressed variants for csv, JSON, and Parquet, with respective SLTs, which pass.
  • I changed the SLTs for parquet to expect an error, since Parquet files are supposed to fail with compressed formats.

That's all.
Thanks,

tychoish enabled auto-merge (squash) on February 21, 2024, 13:49
tychoish merged commit ef464f1 into GlareDB:main on Feb 21, 2024
25 checks passed
hemanth94 deleted the Hem_Dev2 branch on February 21, 2024, 15:06
universalmind303 (Contributor) left a comment:

Thanks for the contribution @hemanth94!

Successfully merging this pull request may close these issues: select * from 'some_file.json.gz