-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: preserve more dictionaries when coercing types #10221
fix: preserve more dictionaries when coercing types #10221
Conversation
There is a test failure in sqllogictests
here is the test case: There could be unintended consequences to this change for casting so I will look into it |
Can we please add a test to https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/dictionary.slt with the explain plan showing the correct filter being added? |
3a6c04e
to
7fe8eeb
Compare
The test failure is from datafusion/datafusion/expr/src/type_coercion/functions.rs Lines 186 to 201 in 7fe8eeb
|
There are other places array prepend/append:
list type coercions: datafusion/datafusion/expr/src/type_coercion/other.rs Lines 25 to 34 in 7fe8eeb
datafusion/datafusion/expr/src/type_coercion/other.rs Lines 39 to 54 in 7fe8eeb
datafusion/datafusion/optimizer/src/analyzer/type_coercion.rs Lines 151 to 172 in 7fe8eeb
datafusion/datafusion/optimizer/src/analyzer/type_coercion.rs Lines 236 to 270 in 7fe8eeb
|
| (other_type, d @ Dictionary(_, value_type)) | ||
if preserve_dictionaries && value_type.as_ref() == other_type => | ||
(d @ Dictionary(_, _), other_type) | (other_type, d @ Dictionary(_, _)) | ||
if preserve_dictionaries && can_cast_types(other_type, d) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, we should check whether value_type
and other_type
are coercible. If yes return the new dict with new value type
Dict(k, i8) and i64 -> Dict(k, i64)
Dict(k, i64) and i8 -> Dict(k, i64)
Instead of Some(d.clone())
we should create a new dict with new value type.
And, I think we should use the coerce logic in comparison_coercion
not can_cast_types
, because the logic is not totally the same.
| (other_type, d @ Dictionary(key_type, value_type))
if preserve_dictionaries && comparison_coercion(value_type.as_ref(), other_type).is_some() =>
{
let new_value = comparison_coercion(&value_type, other_type).unwrap();
Some(DataType::Dictionary(key_type.clone(), Box::new(new_value)))
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jayzhan211 Thanks for the tip. I initially avoiding calling comparison_coercion
because the way I was doing it would cause an infinite recursion, but calling comparison_coercion
on the inner value type and then wrapping in a new Dictionary
as you suggested solves that issue.
I was hoping this would fix the test failure, but unfortunately it does not so I will need to look into reworking the type coercion logic for VariadicEqual
functions
3a53770
to
8901246
Compare
} else { | ||
value_type | ||
} | ||
}; | ||
match (lhs_type, rhs_type) { | ||
( | ||
Dictionary(_lhs_index_type, lhs_value_type), | ||
Dictionary(_rhs_index_type, rhs_value_type), | ||
) => comparison_coercion(lhs_value_type, rhs_value_type), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any particular reason why this case doesn't also preserve the dictionary type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that I know of
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably because value type is enough for comparison
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should not preserve dict for comparison overall 🤔 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we need to solve the issue is avoiding casting from value to dict for column
, because casting for column is costly compare with casting constant.
Given the example, if we preserve dict, we still ends up casting column (utf8) to Dict (i32,utf8), but in this case, we can cast the const from i64 to utf8 and it is enough.
statement ok
create table test as values
('row1', arrow_cast(1, 'Utf8')),
('row2', arrow_cast(2, 'Utf8')),
('row3', arrow_cast(3, 'Utf8'))
;
# query using an string '1' which must be coerced into a dictionary string
query TT
SELECT * from test where column2 = arrow_cast(2, 'Dictionary(Int32, Int64)');
----
row2 2
query TT
explain SELECT * from test where column2 = arrow_cast(2, 'Dictionary(Int32, Int64)');
----
logical_plan
01)Filter: CAST(test.column2 AS Dictionary(Int32, Utf8)) = Dictionary(Int32, Utf8("2"))
02)--TableScan: test projection=[column1, column2]
physical_plan
01)CoalesceBatchesExec: target_batch_size=8192
02)--FilterExec: CAST(column2@1 AS Dictionary(Int32, Utf8)) = 2
03)----MemoryExec: partitions=1, partition_sizes=[1]
statement ok
drop table test;
expect plan
01)Filter: test.column2 = Utf8("2")
02)--TableScan: test projection=[column1, column2]
physical_plan
01)CoalesceBatchesExec: target_batch_size=8192
02)--FilterExec: column2@1 = 2
03)----MemoryExec: partitions=1, partition_sizes=[1]
I think we should not preserve dict, but need specialization on comparing dict vs non-dict case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
02)--FilterExec: CAST(column2@1 AS Dictionary(Int32, Utf8)) = 2
Yes I agree that looks bad. It should be unwrapped.
Thank you for that example 💯
Maybe we could extend
match (left.as_mut(), right.as_mut()) { |
CAST(col, ..) = const
for other datatypes 🤔
I can try to do so later this weekend. Or would you like to try it @erratic-pattern ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW I tried the example from @jayzhan211 on main and it doesn't put a cast on the filter -- thus I agree this PR would be a regression if merged as is. I will dismiss my review
DataFusion CLI v37.1.0
> create table test as values
('row1', arrow_cast(1, 'Utf8')),
('row2', arrow_cast(2, 'Utf8')),
('row3', arrow_cast(3, 'Utf8'))
;
0 row(s) fetched.
Elapsed 0.050 seconds.
> explain SELECT * from test where column2 = arrow_cast(2, 'Dictionary(Int32, Int64)');
+---------------+---------------------------------------------------+
| plan_type | plan |
+---------------+---------------------------------------------------+
| logical_plan | Filter: test.column2 = Utf8("2") |
| | TableScan: test projection=[column1, column2] |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192 |
| | FilterExec: column2@1 = 2 |
| | MemoryExec: partitions=1, partition_sizes=[1] |
| | |
+---------------+---------------------------------------------------+
2 row(s) fetched.
Elapsed 0.053 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @jayzhan211 that we probably don't want to cast everything to dictionaries in the way that we are currently doing it, and what we really want is a way for the optimizer to avoid expensive casts of dictionary columns, and more generally to just avoid column casts in favor of casting constants and scalar expressions.
I think what we have works fine for now and fixes performance issues we're seeing on dictionary columns, but should be improved for the general case in subsequent PRs that redesign the type coercion logic.
8901246
to
f47e74d
Compare
It looks like the failing test case is no longer throwing an error after I added @jayzhan211 's suggestion:
I think it's now complaining that the type is no longer an integer and is instead
|
Marking as draft as I think this PR is still in progress and the CI is not yet passing |
That seems like you could just update the test. Perhaps via https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest#updating-tests-completion-mode |
I may try to approach a solution from a different angle here, to avoid needing to refactor the type coercion logic too much without any clear understanding of what the impacts could be. I think a more specific optimization might look something like |
FWIW I think the coercion is the right approach here. Let me take a look at this PR and see if I can find something |
Or even more generally, you could ignore whether the operands are column references or constants and just have:
|
I think it is the right approach for this one particular case, but the Maybe we need to decouple the non-comparison type coercions from using the |
this comment also seems to hint at a possible refactor: datafusion/datafusion/expr/src/type_coercion/functions.rs Lines 190 to 194 in f47e74d
we could potentially rewrite datafusion/datafusion/expr/src/type_coercion/functions.rs Lines 262 to 271 in f47e74d
|
This diff shows a list of types that coalesce can accept. There might be others but this would serve as a good starting point: |
I updated the tests, but I'm still unsure if it's okay to ignore this change in behavior. It seems like this could be a regression in other cases where we might be converting things to dictionaries for no reason. It would be nice to have some issue tracking this, or maybe there is one already. A possibly easy change to make in the short-term is to refactor away from using I just feels like we're playing Jenga with type coercion logic right now. The coercion logic should prioritize avoiding expensive casts, I think. |
I think the reason to convert to a dictonary is that the other side of the comparsion is already a dictionary, which thus avoids convering both arguments (though I may be missing something) Maybe @viirya has some thoughts given he worked on coercion as part of #9459 recently
Perhaps #5928 ? There appear to be a bunch of open tickets about coercion https://github.com/apache/datafusion/issues?q=is%3Aissue+is%3Aopen+coercion
This seems reasonable to me (aka have special rules for
The key to ensuring we don't break existing code is to rely on the tests. I agree the type coercion logic could be improved -- specifically I think it needs to have some encapsulation rather than being spread out through a bunch of free functions. Is this something you are interested in helping out with? I think the first thing to do would be to try and explain how the current logic works (and in that process we will likely uncover ways to make it better). Then, would improve the code the structure to encapsulate the logic (into structs / enums perhaps -- like I did recently in #10216). Once the logic is more encapsulated, then I think we'll be able to reason about it and feel good about how it fits into the overall picture I think the difference in casting a scalar integer to a scalar dictionary is neglible. The difference casting a column to a different type is likely subtantial (though casting string --> dictionary doesn't require any data copying) |
For this PR I suggest:
Then as a follow on, if you want to take on the refactoring of type coecion we can start doing that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @erratic-pattern and @jayzhan211
I pushed a few new tests and I think it is ready to go. I think we should wait a day or so to let other reviewers have a change
In terms of "type coercion logic seems quite brittle, I agree. It would be really nice to try and make it easier to understand / change without worrying of unintended consequences
@@ -1768,52 +1768,61 @@ SELECT make_array(1, 2, 3); | |||
[1, 2, 3] | |||
|
|||
# coalesce static empty value | |||
query T | |||
SELECT COALESCE('', 'test') | |||
query TT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated these tests to show what the coerced type was as well
explain SELECT * from test where column2 = 1; | ||
---- | ||
logical_plan | ||
01)Filter: test.column2 = Dictionary(Int32, Utf8("1")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the key is that there is no CAST on column2
here
query I | ||
select coalesce(34, arrow_cast(123, 'Dictionary(Int32, Int8)')); | ||
# test coercion Int and Dictionary | ||
query ?T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As @erratic-pattern noted the difference here is that now that the output type of cealesce is Dictionary
rather than Int.
I'm wondering why coalesce has the signature VariadicEqual, since it takes a first non-null value, ideally, coercion is not necessary. I think In this PR, coalesce do the coercion internally, I forgot why we do coercion but not returning first non-null value with the type it has Output in this PR:
query IT
select coalesce(arrow_cast(1, 'Int32'), arrow_cast(1, 'Int64')), arrow_typeof(coalesce(arrow_cast(1, 'Int32'), arrow_cast(1, 'Int64')))
----
1 Int64
Expect:
1 Int32 Change to |
I can't remember either but I remember it was tricky -- I suggest we open a new ticket / discussion about it rather than trying to change the behavior in this PR |
7ed5503
to
7f64b4a
Compare
rebased onto main |
I had a chat with @erratic-pattern the plan: We'll put this PR into draft mode
|
I believe this is superceded by #10323 |
Which issue does this PR close?
Closes #10220
Rationale for this change
Loosen the restriction on when type coercion will preserve dictionary types to prevent slow casting of columns with dictionary type.
What changes are included in this PR?
Are these changes tested?
There are already tests that would fail on type coercion. Let me know if there additional tests you'd like to see.
Are there any user-facing changes?
Certain coercion operations will now preserve dictionary encoding where before they would downcast to the value type