Slow comparisons to dictionary columns with type coercion #10220

Closed
alamb opened this issue Apr 24, 2024 · 3 comments · Fixed by #10323
Labels
enhancement New feature or request

@alamb
Contributor

alamb commented Apr 24, 2024

Is your feature request related to a problem or challenge?

In InfluxDB we use Dictionary(Int32, Utf8) columns a lot.

Queries like this (with string constants) work great and are very fast

SELECT ... WHERE column = '1'

Queries like this (note that 1 is an integer, not the string '1') run very slowly

SELECT ... WHERE column = 1

@erratic-pattern and I tracked this down to an issue/limitation in type coercion:

Reproducer

DataFusion CLI v37.1.0
> create table test as values (arrow_cast('1', 'Dictionary(Int32, Utf8)'));
0 row(s) fetched.
Elapsed 0.010 seconds.

> select arrow_typeof(column1) from test;
+----------------------------+
| arrow_typeof(test.column1) |
+----------------------------+
| Dictionary(Int32, Utf8)    |
+----------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.

> explain SELECT * from test where column1 = 1;
+---------------+---------------------------------------------------+
| plan_type     | plan                                              |
+---------------+---------------------------------------------------+
| logical_plan  | Filter: CAST(test.column1 AS Utf8) = Utf8("1")    |
|               |   TableScan: test projection=[column1]            |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192       |
|               |   FilterExec: CAST(column1@0 AS Utf8) = 1         |
|               |     MemoryExec: partitions=1, partition_sizes=[1] |
|               |                                                   |
+---------------+---------------------------------------------------+
2 row(s) fetched.
Elapsed 0.003 seconds.

I think this shows the core problem:

| logical_plan  | Filter: CAST(test.column1 AS Utf8) = Utf8("1")    |

It basically shows the column is being converted to a string, rather than the constant being converted to the correct type.
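
To illustrate why the direction of the cast matters, here is a minimal sketch in plain Python. It is a toy model of a dictionary-encoded column (a short list of distinct values plus one integer key per row), not Arrow's or DataFusion's actual API; the function names are made up for this example and nulls are ignored.

```python
# Toy model of a Dictionary(Int32, Utf8) column: distinct values plus one key per row.
# These structures and function names are hypothetical, for illustration only.
values = ["1", "2", "3"]
keys = [0, 1, 0, 2, 0, 1]


def filter_with_column_cast(keys, values, literal):
    """Model of `CAST(column AS Utf8) = literal`: decode every row, then compare strings."""
    decoded = [values[k] for k in keys]  # materializes one string per row
    return [s == literal for s in decoded]


def filter_with_literal_coercion(keys, values, literal):
    """Model of `column = Dictionary(literal)`: compare each distinct value once,
    then look the result up by key for every row."""
    value_matches = [v == literal for v in values]  # one string comparison per distinct value
    return [value_matches[k] for k in keys]


slow = filter_with_column_cast(keys, values, "1")
fast = filter_with_literal_coercion(keys, values, "1")
```

Both functions return the same mask, but the first does one string comparison per row after decoding the whole column, while the second does one string comparison per distinct value and an integer lookup per row.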

Not only does this mean the column is decoded (un-encoded) for the comparison, it also means that PruningPredicate doesn't work.
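
For readers unfamiliar with pruning, the sketch below shows the general idea of skipping containers (files or row groups) using min/max statistics when the predicate is a plain `column = literal`. It is a simplified model in Python, not the PruningPredicate API; `ColumnStats`, `can_skip_for_equality`, and the file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Hypothetical per-container min/max statistics for a string column."""
    min_value: str
    max_value: str


def can_skip_for_equality(stats, literal):
    """`column = literal` can only match if min_value <= literal <= max_value."""
    return literal < stats.min_value or literal > stats.max_value


# Made-up statistics for two containers.
containers = {
    "file_a": ColumnStats("0", "5"),
    "file_b": ColumnStats("7", "9"),
}

# Only containers whose range could contain "1" need to be read.
to_scan = [name for name, stats in containers.items()
           if not can_skip_for_equality(stats, "1")]
```

This kind of range check applies directly when the predicate refers to the bare column. When the column is wrapped in a cast, as in the plan above, the predicate no longer has that shape.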

Describe the solution you'd like

I would like the query to go fast lol

Specifically, I think the filter should look like this (no cast on the column, and instead the constant type matches)

| logical_plan  | Filter: test.column1 = Dictionary(Int32, Utf8("1")) |

Note this is what happens if you compare the dictionary column to a string literal:

> explain SELECT * from test where column1 = '1';
+---------------+-----------------------------------------------------+
| plan_type     | plan                                                |
+---------------+-----------------------------------------------------+
| logical_plan  | Filter: test.column1 = Dictionary(Int32, Utf8("1")) |
|               |   TableScan: test projection=[column1]              |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192         |
|               |   FilterExec: column1@0 = 1                         |
|               |     MemoryExec: partitions=1, partition_sizes=[1]   |
|               |                                                     |
+---------------+-----------------------------------------------------+
2 row(s) fetched.
Elapsed 0.002 seconds.

Describe alternatives you've considered

We could potentially update the coercion logic to coerce 1 to Dictionary(.. "1"), or maybe update the unwrap_cast_in_comparison logic.
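
As a rough sketch of the first alternative, the Python below models a coercion rule that prefers casting the literal to the column's type over casting the column. It is an assumption-laden toy, not DataFusion's coercion code and not what #10221 or #10323 implement; `Column`, `Literal`, `Cast`, `cast_literal`, and `coerce_equality` are all hypothetical names.

```python
from dataclasses import dataclass

DICT_UTF8 = "Dictionary(Int32, Utf8)"


@dataclass(frozen=True)
class Column:
    name: str
    data_type: str


@dataclass(frozen=True)
class Literal:
    value: object
    data_type: str


@dataclass(frozen=True)
class Cast:
    expr: object
    data_type: str


def cast_literal(lit, target_type):
    """Toy constant folding: only string-like targets are modeled here."""
    return Literal(str(lit.value), target_type)


def coerce_equality(left, right):
    """Return (left, right) with matching types, casting a literal rather than a column."""
    if left.data_type == right.data_type:
        return left, right
    if isinstance(left, Column) and isinstance(right, Literal):
        return left, cast_literal(right, left.data_type)
    if isinstance(left, Literal) and isinstance(right, Column):
        return cast_literal(left, right.data_type), right
    # Fallback: cast both sides to a common type, which is what hurts in the plan above.
    return Cast(left, "Utf8"), Cast(right, "Utf8")


column = Column("column1", DICT_UTF8)
coerced_left, coerced_right = coerce_equality(column, Literal(1, "Int64"))
```

Here the column is returned untouched and the integer literal becomes a string literal tagged with the dictionary type, which matches the shape of the desired `test.column1 = Dictionary(Int32, Utf8("1"))` filter.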

Additional context

No response

@erratic-pattern
Contributor

erratic-pattern commented Apr 24, 2024

I have a PR that fixes this: #10221. Here is the explain output after making the change:

> explain SELECT * from test where column1 = 1;
+---------------+-----------------------------------------------------+
| plan_type     | plan                                                |
+---------------+-----------------------------------------------------+
| logical_plan  | Filter: test.column1 = Dictionary(Int32, Utf8("1")) |
|               |   TableScan: test projection=[column1]              |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192         |
|               |   FilterExec: column1@0 = 1                         |
|               |     MemoryExec: partitions=1, partition_sizes=[1]   |
|               |                                                     |
+---------------+-----------------------------------------------------+
2 row(s) fetched.
Elapsed 0.008 seconds.

However, it looks like some tests are failing, so I am still looking into it.

@erratic-pattern
Contributor

#10323 is ready for review and avoids the previously discussed issues with #10221

@alamb
Contributor Author

alamb commented Apr 30, 2024

Thanks @erratic-pattern -- I hope to look at this tomorrow morning
