Casting from Binary --> Utf8 to evaluate `LIKE` slows down some ClickBench queries #12509

alamb · 2024-09-17T15:31:30Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

While working on enabling StringView by default in #12092 I noticed that some of the clickbench queries got 10% slower and looked into it.

The plan looks like this:

DataFusion CLI v41.0.0
+---------------+----------------------------------------------------------------------------------------------------$
| plan_type     | plan                                                                                               $
+---------------+----------------------------------------------------------------------------------------------------$
| physical_plan | AggregateExec: mode=Final, gby=[], aggr=[count(*)]                                                 $
|               |   CoalescePartitionsExec                                                                           $
|               |     AggregateExec: mode=Partial, gby=[], aggr=[count(*)]                                           $
|               |       ProjectionExec: expr=[]                                                                      $
|               |         CoalesceBatchesExec: target_batch_size=8192                                                $
|               |           FilterExec: CAST(URL@0 AS Utf8View) LIKE %google%                                        $
|               |             ParquetExec: file_groups={16 groups: [[Users/andrewlamb/Software/datafusion/benchmarks/$
+---------------+----------------------------------------------------------------------------------------------------$
2 row(s) fetched.
Elapsed 0.065 seconds.

When looking at the flamegraphs, you can see the CAST spends a huge amount of time validating utf8 (more time than actually evaluating the LIKE predicate actually):

Here are the full flamegraphs for comparison:
q20-flamegraph-main
q20-flamegraph-stringview

I belive the issue is here:

|               |           FilterExec: CAST(URL@0 AS Utf8View) LIKE %google%

This filter first *CASTs the URL column to Utf8View and then evaluates LIKE`

Converting BinaryArray --> StringArrayas is done without StringView is relatively faster because it is done with a single large function call

However, converting BinaryViewArrar --> StringViewArray is not as it makes many small function calls. The parquet reader has a special optimization for this as descsribed in "Section 2.1: From binary to strings" of the Using StringView / German Style Strings to Make Queries Faster: Part 1 - Reading Parquet from @XiangpengHao

Describe the solution you'd like
I would like this query to go as fast / faster with Utf8View / BinaryView enabled.

Bonus points if it went faster even without Utf-8 enabled

Describe alternatives you've considered

Option 1: `LIKE` for binary

One option is to skip validating UTF8 entirely and evaluate LIKE directly on binary. This would mean if the column is read as binary we could cast the argument '%google%' to binary and then evaluate LIKE directly on the binary column. This would skip validaitng utf8 completely

Unfortunately, it appears that the like kernel is only implemented for StringArray and StringViewArray at the moment, not BinaryArray: https://docs.rs/arrow-string/53.0.0/src/arrow_string/like.rs.html#110-149

Another related option would be to potentially special case the LIKE rewite in this case for just prefix / contians / suffix -- in this case rewrite <binary> LIKE <const that starts and ends with '%'> --> <binary> CONTAINS <string>

Option 2: resolve the column as `Utf8` rather than `Binary`

For some reason the schema of hits.parquet (the single file from ClickBench) has the URL column (and others) as Utf8 (strings) but the hits_partitioned file resolves it as Binary. \

We could change the schema resolution logicic to resolve the column as a String instead.

This option is probably slower than option 1 but I think it is more inline with what the intended semantics (these columns contain logical stirngs) and the parquet reader includes the fast read path for such strings and would be more general.

Filed #12510 to track this ideae

Additional context

The text was updated successfully, but these errors were encountered:

alamb · 2024-09-22T12:58:18Z

I have been thinking about this, and I came up with a third option which is to "push the casting into the scan"

Consider this plan for q28:

+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------$
| plan_type     | plan                                                                                                                                                                                                                                                                                                             $
+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------$
| logical_plan  | Sort: l DESC NULLS FIRST, fetch=25                                                                                                                                                                                                                                                                               $
|               |   Projection: regexp_replace(hits_partitioned.Referer,Utf8("^https?://(?:www\.)?([^/]+)/.*$"),Utf8("\1")) AS k, avg(character_length(hits_partitioned.Referer)) AS l, count(*) AS c, min(hits_partitioned.Referer)                                                                                               $
|               |     Filter: count(*) > Int64(100000)                                                                                                                                                                                                                                                                             $
|               |       Aggregate: groupBy=[[regexp_replace(__common_expr_1 AS hits_partitioned.Referer, Utf8("^https?://(?:www\.)?([^/]+)/.*$"), Utf8("\1"))]], aggr=[[avg(CAST(character_length(__common_expr_1 AS hits_partitioned.Referer) AS Float64)), count(Int64(1)) AS count(*), min(hits_partitioned.Referer)]]          $
|               |         Projection: CAST(hits_partitioned.Referer AS Utf8) AS __common_expr_1, hits_partitioned.Referer                                                                                                                                                                                                          $
|               |           Filter: hits_partitioned.Referer != BinaryView("")                                                                                                                                                                                                                                                     $
|               |             TableScan: hits_partitioned projection=[Referer], partial_filters=[hits_partitioned.Referer != BinaryView("")]

The

 Projection: CAST(hits_partitioned.Referer AS Utf8) AS __common_expr_1, hits_partitioned.Referer

Is what is causing a non trivial slowdown.

The issue is that hits_partitioned.Referer is read as a BinaryView

The problem is that BinaryView --> Utf8View conversion is much slower than reading Utf8View directly out of the parquet file due to the Utf8 optimization described by @XiangpengHao in "Section 2.1: From binary to strings" of the string view blog.

Option 3: Implement push down casting (maybe as an Analyzer rule??)

The theory here is that some readers( such as the parquet reader) can produce the data more effiicently in a particular format than creating it first in one format before datafusion casts it to another.

So the plan above would basically push the cast down so the parquet reader read the hits_partitioned.Referer as a Utf8View to begin with

alamb · 2024-10-05T10:11:30Z

❤️

BTW I think option 3 is the ideal solution (as it would be general purpose and likely benefit all plans, not just ClickBench) -- push the casting into the scan as the TableProvider may be able to create the desired datatype more efficiently than casting after the fact.

Another idea that @tustvold and I talked about yesterday would be to improve the performance of casting BinaryView --> Utf8View, inspired by @XiangpengHao 's trick for the parquet reader:

Attempt to validate the entire buffer (not string by string) is Utf8
If the entire buffer is utf8, then for each large array element validate that it doesn't point into the middle of a Utf8 codepoint
If the entire buffer is not utf8, then fall back to validating each long string one by one
(I'll try and write up a ticket later about this)

alamb · 2024-10-05T13:00:11Z

Thank you for looking at this one @jayzhan211

alamb · 2024-10-05T13:14:28Z

BTW the more I think about this one, the more I think the "right" fix might just be to setup the query with the appropriate schema (aka tell DataFusion to treat the string columns as strings with CREATE EXTERNAL TABLE... rather than trying to do some sort of type casting contortions) 🤔

alamb · 2024-10-05T13:17:26Z

Leaving it as binary messes up the results of the queres too

For example, if you run this query

SELECT "SearchEngineID", "SearchPhrase", COUNT(*) AS c FROM 'hits_partitioned' WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "SearchPhrase" ORDER BY c DESC LIMIT 10;

You get this arguably nonsensical result:

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ datafusion-cli -f q.sql
DataFusion CLI v42.0.0
+----------------+--------------------------------------------------------------------------------------------------+-------+
| SearchEngineID | SearchPhrase                                                                                     | c     |
+----------------+--------------------------------------------------------------------------------------------------+-------+
| 2              | d0bad0b0d180d0b5d0bbd0bad0b8                                                                     | 46258 |
| 2              | d0bcd0b0d0bdd0b3d18320d0b220d0b7d0b0d180d0b0d0b1d0b5d0b920d0b3d180d0b0d0bcd0b0                   | 18871 |
| 2              | d181d0bcd0bed182d180d0b5d182d18c20d0bed0bdd0bbd0b0d0b9d0bd                                       | 16905 |
| 3              | d0b0d0bbd0b1d0b0d182d180d183d182d0b4d0b8d0bd                                                     | 16748 |
| 2              | d181d0bcd0bed182d180d0b5d182d18c20d0bed0bdd0bbd0b0d0b9d0bd20d0b1d0b5d181d0bfd0bbd0b0d182d0bdd0be | 14909 |
| 2              | d0b0d0bbd0b1d0b0d182d180d183d182d0b4d0b8d0bd                                                     | 13716 |
| 2              | d18dd0bad0b7d0bed0b8d0b4d0bdd18bd0b5                                                             | 13414 |
| 2              | d181d0bcd0bed182d180d0b5d182d18c                                                                 | 13108 |
| 3              | d0bad0b0d180d0b5d0bbd0bad0b8                                                                     | 12815 |
| 2              | d0b4d180d183d0b6d0bad0b520d0bfd0bed0bcd0b5d189d0b5d0bdd0b8d0b5                                   | 11946 |
+----------------+--------------------------------------------------------------------------------------------------+-------+
10 row(s) fetched.
Elapsed 0.561 seconds.

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$

vs if you run the same query with the proper schema:

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ datafusion-cli -f q.sql
DataFusion CLI v42.0.0
+----------------+---------------------------+-------+
| SearchEngineID | SearchPhrase              | c     |
+----------------+---------------------------+-------+
| 2              | карелки                   | 46258 |
| 2              | мангу в зарабей грама     | 18871 |
| 2              | смотреть онлайн           | 16905 |
| 3              | албатрутдин               | 16748 |
| 2              | смотреть онлайн бесплатно | 14909 |
| 2              | албатрутдин               | 13716 |
| 2              | экзоидные                 | 13414 |
| 2              | смотреть                  | 13108 |
| 3              | карелки                   | 12815 |
| 2              | дружке помещение          | 11946 |
+----------------+---------------------------+-------+
10 row(s) fetched.
Elapsed 0.569 seconds.

I will file a proper ticket for this

alamb · 2024-10-05T13:32:41Z

If you are looking to help StringView along @jayzhan211 I would recommend #12771

I think if we fix the schema of hte table for hits_partitioned to what it is supposed to be, the queries will be quite fast

jayzhan211 · 2024-10-06T01:08:09Z

I think if we fix the schema of hte table for hits_partitioned to what it is supposed to be, the queries will be quite fast

If we always check whether a binary is UTF-8 encoded before converting it to a string, will this add overhead for cases where the binary is not UTF-8 encoded?

jayzhan211 · 2024-10-06T07:48:03Z

After changing the table schema from Binary to Utf8, the query becomes slower.

alamb · 2024-10-06T11:05:55Z

If we always check whether a binary is UTF-8 encoded before converting it to a string, will this add overhead for cases where the binary is not UTF-8 encoded?

I think there are two things going on:

Reading a column as a BinaryViewArray and then casting to Utf8ViewArray is significantly slower than reading the data from parquet as a Utf8ViewArray (due to optimizations in the parquet decoder).
Reading a column as a BinaryArray and then casting to Utf8Array is about the same speed as reading as Utf8Array

So I think the core issue is that for hits_partitioned the "string" columns in the schema are marked as binary (not Utf8) and thus the slower conversion path is used

After changing the table schema from Binary to Utf8, the query becomes slower.

🤔 I don't understand that

jayzhan211 · 2024-10-06T11:36:20Z

So I think the core issue is that for hits_partitioned the "string" columns in the schema are marked as binary (not Utf8) and thus the slower conversion path is used

I'm not quite familiar with Parquet yet, do you mean you can modify the file schema to String? I thought we should keep the file schema as binary but change the mapped schema (table schema) to utf8/utf8view. Therefore, we need to validate the binary to know whether the array is utf8 or not. If it is utf8, we can convert it to string, if not we need to keep it as binary. In this case, we have an additional UTF8 validation check, although not sure if the overhead is negligible or not.

I only tried modifying the table schema to utf8 without any utf8 validation, but the query is slower compared to main.
If we want to try binary to utf8view, I think we need to support arrow-rs casting first.

The one only modify the table schema without any utf8 validation. The idea is to read the binary array column as utf8 directly.

#12777

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃      main ┃ parquet-binary ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │ 9844.61ms │     10490.19ms │ 1.07x slower │
└──────────────┴───────────┴────────────────┴──────────────┘

The things I could try is

Support binary to utf8view and read the column as utf8view and see if it improves
Read as binary/binary view and improve the casting to utf8/utf8view.

alamb · 2024-10-07T11:15:17Z

Hi @jayzhan211 -- sorry my confusion -- I have written up what I think we should do here: #12788

This is part of (but not all) of what we need to turn on StringView and have it go faster always. #11682 lists the other things

alamb added the enhancement New feature or request label Sep 17, 2024

This was referenced Sep 17, 2024

Enable reading StringViewArray by default from Parquet #12092

Closed

[Epic] Complete Initial StringView in DataFusion #11752

Closed

alamb transferred this issue from apache/arrow-rs Sep 17, 2024

alamb mentioned this issue Sep 17, 2024

Consider resolving a clickbench files as Utf8 (rather than binary) #12510

Closed

alamb mentioned this issue Sep 22, 2024

Implement fast min/max accumulator for binary / strings (now it uses the slower path) #6906

Closed

jayzhan211 self-assigned this Oct 5, 2024

alamb mentioned this issue Oct 7, 2024

Enable reading StringView by default from Parquet (schema_force_string_view) by default #11682

Closed

alamb closed this as completed Oct 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Casting from Binary --> Utf8 to evaluate `LIKE` slows down some ClickBench queries #12509

Casting from Binary --> Utf8 to evaluate `LIKE` slows down some ClickBench queries #12509

alamb commented Sep 17, 2024 •

edited

Loading

alamb commented Sep 22, 2024

alamb commented Oct 5, 2024 •

edited

Loading

alamb commented Oct 5, 2024

alamb commented Oct 5, 2024

alamb commented Oct 5, 2024

alamb commented Oct 5, 2024 •

edited

Loading

jayzhan211 commented Oct 6, 2024 •

edited

Loading

jayzhan211 commented Oct 6, 2024

alamb commented Oct 6, 2024

jayzhan211 commented Oct 6, 2024 •

edited

Loading

alamb commented Oct 7, 2024 •

edited

Loading

Casting from Binary --> Utf8 to evaluate LIKE slows down some ClickBench queries #12509

Casting from Binary --> Utf8 to evaluate LIKE slows down some ClickBench queries #12509

Comments

alamb commented Sep 17, 2024 • edited Loading

Option 1: LIKE for binary

Option 2: resolve the column as Utf8 rather than Binary

alamb commented Sep 22, 2024

Option 3: Implement push down casting (maybe as an Analyzer rule??)

alamb commented Oct 5, 2024 • edited Loading

alamb commented Oct 5, 2024

alamb commented Oct 5, 2024

alamb commented Oct 5, 2024

alamb commented Oct 5, 2024 • edited Loading

jayzhan211 commented Oct 6, 2024 • edited Loading

jayzhan211 commented Oct 6, 2024

alamb commented Oct 6, 2024

jayzhan211 commented Oct 6, 2024 • edited Loading

alamb commented Oct 7, 2024 • edited Loading

Casting from Binary --> Utf8 to evaluate `LIKE` slows down some ClickBench queries #12509

Casting from Binary --> Utf8 to evaluate `LIKE` slows down some ClickBench queries #12509

alamb commented Sep 17, 2024 •

edited

Loading

Option 1: `LIKE` for binary

Option 2: resolve the column as `Utf8` rather than `Binary`

alamb commented Oct 5, 2024 •

edited

Loading

alamb commented Oct 5, 2024 •

edited

Loading

jayzhan211 commented Oct 6, 2024 •

edited

Loading

jayzhan211 commented Oct 6, 2024 •

edited

Loading

alamb commented Oct 7, 2024 •

edited

Loading