[BUG] Panic when querying table with wrong partition columns order #9785

MohamedAbdeen21 · 2024-03-24T20:41:07Z

Describe the bug

Assume a table test that is partitioned on columns a then b, creating an external table and setting wrong order for the partition columns (b then a) doesn't raise an error, but querying it causes a panic.

To Reproduce

copy these queries into panic.sql

CREATE EXTERNAL TABLE test(name string, year string, month string) PARTITIONED BY (year, month) STORED AS csv LOCATION 'tmp/';
INSERT INTO test VALUES('name', '2024', '03');

-- notice the different ordering in the PARTITIONED BY clause
CREATE EXTERNAL TABLE test2(name string, year string, month string) PARTITIONED BY (month, year) STORED AS csv LOCATION 'tmp/';
SELECT * FROM test2;

And run

$ datafusion-cli -f panic.sql
DataFusion CLI v36.0.0
0 rows in set. Query took 0.002 seconds.

+-------+
| count |
+-------+
| 1     |
+-------+
1 row in set. Query took 0.004 seconds.

0 rows in set. Query took 0.001 seconds.

thread 'main' panicked at /home/user/dev/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs:261:54:
index out of bounds: the len is 0 but the index is 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Expected behavior

Should never panic. Instead, return an error on table creation or at least on the queries after.

Additional context

No response

The text was updated successfully, but these errors were encountered:

alamb · 2024-03-24T21:27:51Z

Thanks for the report -- I agree this should never panic but rather erorr

devinjdangelo · 2024-03-24T21:38:52Z

I haven't dug into exactly what is going wrong here, but it does make me think a breaking change to a hiveQL inspired syntax where partition columns are only declared following the partition by keyword may avoid edge cases like this more easily.

#9599 is open right now to support a new syntax in a backwards compatible way (so would not fix this outright).

Thank your for finding this @MohamedAbdeen21 ! Are you able to trigger a similar panic using the new syntax as well?

MohamedAbdeen21 · 2024-03-24T21:45:20Z

Yeah, this is irrelevant of the syntax used.

The new syntax is merely translated to the old syntax as to avoid adding more code to the rest of the system, so as long as the issue is outside the parser, syntax used should be irrelevant.

Just to double-check, here's the same query with the new syntax:

CREATE EXTERNAL TABLE test(name string) PARTITIONED BY (year string, month string) STORED AS csv LOCATION 'tmp/';
INSERT INTO test VALUES('name', '2024', '03');

-- notice the different ordering in the PARTITIONED BY clause
CREATE EXTERNAL TABLE test2(name string) PARTITIONED BY (month string, year string) STORED AS csv LOCATION 'tmp/';
SELECT * FROM test2;

I'd say this is safe to work on regardless of the status of 9599.

MohamedAbdeen21 · 2024-03-24T21:49:55Z

By the way, is the create statement lazy?

alamb · 2024-03-24T21:55:37Z

By the way, is the create statement lazy?

It is lazy in the sense that it doesn't read data until the table appears in a query. CREATE often does do basic operations like reading metadata to determine schema.

It definitely doesn't buffer all the data into memory, the way an "eager" dataframe API like pandas or polars(eager) would

MohamedAbdeen21 · 2024-03-26T09:28:30Z

I'm not sure what to do next regarding this. Do we just prevent the panic and leave it at that? do we perform schema validation if the table exists?

Also keep in mind that if we only prevent the panic, queries such as the following can make a table useless.

CREATE EXTERNAL TABLE test(name string) PARTITIONED BY (year string, month string) STORED AS parquet LOCATION 'tmp/';
-- notice the different ordering in the PARTITIONED BY clause
INSERT INTO test VALUES('name', '2024', '03');

CREATE EXTERNAL TABLE test2(name string) PARTITIONED BY (month string, year string) STORED AS parquet LOCATION 'tmp/';

-- now table have both partitions (year -> month and month -> year)
INSERT INTO test2 VALUES('name', '2024', '03');

-- both fail
SELECT * FROM test;
SELECT * FROM test2;

devinjdangelo · 2024-03-26T14:32:26Z

I'm not sure what to do next regarding this. Do we just prevent the panic and leave it at that? do we perform schema validation if the table exists?

We should prevent the panic, but I think it is reasonable behavior to fail the query and return an error. The examples in this issue are creating two tables with different schemas in the same external storage location and writing to both. This effectively corrupts both tables, since now there are two different table definitions mixed in the same physical location. After the inserts, the directory looks like:

/year=2024/month=03
/month=2024/year=03

When the datafusion read path encounters this, the two options would be:

Silently ignore hive style paths which do not conform to the expected schema (so if the partitions are year/month, only scan within the outer year folder, ignoring the outer month folder)
Return an error that unexpected paths were encountered

Depending on the context, I could see either behavior being desired. We could perhaps provide a configuration option that allows users to control how Datafusion will handle unexpected partition paths.

MohamedAbdeen21 · 2024-03-26T14:49:35Z

I don't like the first option. If one person creates a table partitioned on year/month and another person reads it as month/year, the second person won't see any updates the first person have done, and vice versa.

I think inserts using different partitions should be rejected, but again, that requires changing both insert and select statements.

I'm trying to think about which will be easier for users to debug and for us to maintain, error-ing on creation or on queries after? Or maybe don't show any errors, but re-order the partition columns in the create statement to match the data on disk

devinjdangelo · 2024-03-26T16:11:22Z

I don't like the first option. If one person creates a table partitioned on year/month and another person reads it as month/year, the second person won't see any updates the first person have done, and vice versa.

I see the two tables as distinct tables. They happen to share the same base file path, but reordering the partitions changes the schema and where the files are actually stored within the base path.

You can run into similar issues with non partitioned tables if you write different schemas with different column orderings to the same base path.

We could potentially return an error earlier by checking the schema of any file present during the create external table execution plan. That way we prevent a user from creating a table with an incorrect schema over top of an existing table with a different schema.

MohamedAbdeen21 · 2024-03-30T20:32:50Z

Sorry for the late response.

We could potentially return an error earlier by checking the schema of any file present during the create external table execution plan. That way we prevent a user from creating a table with an incorrect schema over top of an existing table with a different schema.

I think this is the solution that makes the most sense from the user's perspective, my only problem is that I feel like it's easier said than done.

I'll try to take a look around if I manage to find some free time. For now, I'll leave the assignment empty in case anyone else is interested in picking it up.

samuelcolvin · 2024-04-28T22:29:06Z

Would be great to have a new release, I've just run into this too.

alamb · 2024-04-30T13:07:26Z

Would be great to have a new release, I've just run into this too.

Added to #10217 tracker

MohamedAbdeen21 added the bug Something isn't working label Mar 24, 2024

MohamedAbdeen21 mentioned this issue Apr 2, 2024

Validate partitions columns in CREATE EXTERNAL TABLE if table already exists. #9912

Merged

alamb closed this as completed in #9912 Apr 5, 2024

alamb mentioned this issue Apr 30, 2024

DataFusion 38.0.0 Release #10217

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Panic when querying table with wrong partition columns order #9785

[BUG] Panic when querying table with wrong partition columns order #9785

MohamedAbdeen21 commented Mar 24, 2024

alamb commented Mar 24, 2024

devinjdangelo commented Mar 24, 2024 •

edited

Loading

MohamedAbdeen21 commented Mar 24, 2024 •

edited

Loading

MohamedAbdeen21 commented Mar 24, 2024

alamb commented Mar 24, 2024

MohamedAbdeen21 commented Mar 26, 2024

devinjdangelo commented Mar 26, 2024

MohamedAbdeen21 commented Mar 26, 2024

devinjdangelo commented Mar 26, 2024

MohamedAbdeen21 commented Mar 30, 2024

samuelcolvin commented Apr 28, 2024

alamb commented Apr 30, 2024

[BUG] Panic when querying table with wrong partition columns order #9785

[BUG] Panic when querying table with wrong partition columns order #9785

Comments

MohamedAbdeen21 commented Mar 24, 2024

Describe the bug

To Reproduce

Expected behavior

Additional context

alamb commented Mar 24, 2024

devinjdangelo commented Mar 24, 2024 • edited Loading

MohamedAbdeen21 commented Mar 24, 2024 • edited Loading

MohamedAbdeen21 commented Mar 24, 2024

alamb commented Mar 24, 2024

MohamedAbdeen21 commented Mar 26, 2024

devinjdangelo commented Mar 26, 2024

MohamedAbdeen21 commented Mar 26, 2024

devinjdangelo commented Mar 26, 2024

MohamedAbdeen21 commented Mar 30, 2024

samuelcolvin commented Apr 28, 2024

alamb commented Apr 30, 2024

devinjdangelo commented Mar 24, 2024 •

edited

Loading

MohamedAbdeen21 commented Mar 24, 2024 •

edited

Loading