-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Panic when querying table with wrong partition columns order #9785
Comments
Thanks for the report -- I agree this should never panic but rather erorr |
I haven't dug into exactly what is going wrong here, but it does make me think a breaking change to a hiveQL inspired syntax where partition columns are only declared following the partition by keyword may avoid edge cases like this more easily. #9599 is open right now to support a new syntax in a backwards compatible way (so would not fix this outright). Thank your for finding this @MohamedAbdeen21 ! Are you able to trigger a similar panic using the new syntax as well? |
Yeah, this is irrelevant of the syntax used. The new syntax is merely translated to the old syntax as to avoid adding more code to the rest of the system, so as long as the issue is outside the parser, syntax used should be irrelevant. Just to double-check, here's the same query with the new syntax: CREATE EXTERNAL TABLE test(name string) PARTITIONED BY (year string, month string) STORED AS csv LOCATION 'tmp/';
INSERT INTO test VALUES('name', '2024', '03');
-- notice the different ordering in the PARTITIONED BY clause
CREATE EXTERNAL TABLE test2(name string) PARTITIONED BY (month string, year string) STORED AS csv LOCATION 'tmp/';
SELECT * FROM test2; I'd say this is safe to work on regardless of the status of 9599. |
By the way, is the create statement lazy? |
It is lazy in the sense that it doesn't read data until the table appears in a query. It definitely doesn't buffer all the data into memory, the way an "eager" dataframe API like pandas or polars(eager) would |
I'm not sure what to do next regarding this. Do we just prevent the panic and leave it at that? do we perform schema validation if the table exists? Also keep in mind that if we only prevent the panic, queries such as the following can make a table useless. CREATE EXTERNAL TABLE test(name string) PARTITIONED BY (year string, month string) STORED AS parquet LOCATION 'tmp/';
-- notice the different ordering in the PARTITIONED BY clause
INSERT INTO test VALUES('name', '2024', '03');
CREATE EXTERNAL TABLE test2(name string) PARTITIONED BY (month string, year string) STORED AS parquet LOCATION 'tmp/';
-- now table have both partitions (year -> month and month -> year)
INSERT INTO test2 VALUES('name', '2024', '03');
-- both fail
SELECT * FROM test;
SELECT * FROM test2; |
We should prevent the panic, but I think it is reasonable behavior to fail the query and return an error. The examples in this issue are creating two tables with different schemas in the same external storage location and writing to both. This effectively corrupts both tables, since now there are two different table definitions mixed in the same physical location. After the inserts, the directory looks like:
When the datafusion read path encounters this, the two options would be:
Depending on the context, I could see either behavior being desired. We could perhaps provide a configuration option that allows users to control how Datafusion will handle unexpected partition paths. |
I don't like the first option. If one person creates a table partitioned on year/month and another person reads it as month/year, the second person won't see any updates the first person have done, and vice versa. I think inserts using different partitions should be rejected, but again, that requires changing both insert and select statements. I'm trying to think about which will be easier for users to debug and for us to maintain, error-ing on creation or on queries after? Or maybe don't show any errors, but re-order the partition columns in the create statement to match the data on disk |
I see the two tables as distinct tables. They happen to share the same base file path, but reordering the partitions changes the schema and where the files are actually stored within the base path. You can run into similar issues with non partitioned tables if you write different schemas with different column orderings to the same base path. We could potentially return an error earlier by checking the schema of any file present during the create external table execution plan. That way we prevent a user from creating a table with an incorrect schema over top of an existing table with a different schema. |
Sorry for the late response.
I think this is the solution that makes the most sense from the user's perspective, my only problem is that I feel like it's easier said than done. I'll try to take a look around if I manage to find some free time. For now, I'll leave the assignment empty in case anyone else is interested in picking it up. |
Would be great to have a new release, I've just run into this too. |
Added to #10217 tracker |
Describe the bug
Assume a table
test
that is partitioned on columnsa
thenb
, creating an external table and setting wrong order for the partition columns (b
thena
) doesn't raise an error, but querying it causes a panic.To Reproduce
copy these queries into panic.sql
And run
Expected behavior
Should never panic. Instead, return an error on table creation or at least on the queries after.
Additional context
No response
The text was updated successfully, but these errors were encountered: