Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support IN lists with more than three constants in predicates for bloom filters #8436

Closed
alamb opened this issue Dec 6, 2023 · 15 comments · Fixed by #8654
Closed

Support IN lists with more than three constants in predicates for bloom filters #8436

alamb opened this issue Dec 6, 2023 · 15 comments · Fixed by #8654
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Dec 6, 2023

Is your feature request related to a problem or challenge?

BloomFilter support was added in #7821 by @hengfeiyang ❤️

There is partial support for optimizing queries that have IN List predicates,. as suggested by @Ted-Jiang : #7821 (comment) and tested via https://github.com/apache/arrow-datafusion/blob/0d7cab055cb39d6df751e070af5a0bf5444e3849/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L1056-L1084

However, this only supports queries where there are three or fewer items in the IN list:

SELECT * 
FROM parquet_file 
WHERE col IN ('foo', 'bar', 'baz')

It only works for small numbers of constants because the current implementation only checks for predicates like col = 'foo' OR col = 'bar'. The reason this works for InLists is that with small numbers of items ( 3) are rewritten to OR chains) by this code in the optimizer:

https://github.com/apache/arrow-datafusion/blob/0d7cab055cb39d6df751e070af5a0bf5444e3849/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L500-L549

Thus, the the current bloom filter code will not work for queries with large numbers (more than the THRESHOLD_INLINE_INLIST) of constants in the IN list, such as

SELECT * 
FROM parquet_file 
WHERE col IN (
  'constant1',
  'constant2',
  ..,
  'constant99',
  'constant100',
)

Describe the solution you'd like

I would like the bloom filter code to directly support InListExpr and thus also support IN / NOT IN queries with large numbers of constants

In terms of implementation, after #8437 is merged and #8376 is closed, this should be a straightforward matter of:

  1. Adding support in LiteralGurantee code (see Add LiteralGuarantee on columns to extract conditions required for PhysicalExpr expressions to evaluate to true #8437 )
  2. Add tests in LiteralGurantee
  3. Add a integration test for Bloom filters in datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs

Describe alternatives you've considered

No response

Additional context

Found while I was working on #8376

@my-vegetable-has-exploded
Copy link
Contributor

I'd like to have a try.

@alamb
Copy link
Contributor Author

alamb commented Dec 6, 2023

@my-vegetable-has-exploded -- thank you -- can you please implement it in terms of LiteralGuarantee #8437 ?

@hengfeiyang
Copy link
Contributor

hengfeiyang commented Dec 7, 2023

@alamb @my-vegetable-has-exploded I found if there are more than three values the bloomfilter also can't work.

This works

SELECT *  FROM tbl where (trace_id='3c7dbf90d1a66e3faffa344519c3bac3' OR trace_id='1' OR trace_id='2') LIMIT 150;

This doesn't works

SELECT *  FROM tbl where (trace_id='3c7dbf90d1a66e3faffa344519c3bac3' OR trace_id='1' OR trace_id='2' OR trace_id='3') LIMIT 150;

This is the same problem, because more than three values will convert OR to InList.

@my-vegetable-has-exploded
Copy link
Contributor

I think we can use Short-circuit evaluation here, if left satisfy some condition, we can skip right.
https://github.com/apache/arrow-datafusion/blob/5e8b0e09228925b01c8bcc7afe448a7487347872/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L217-L232
If needed, I can open pr to do it. Thanks.

@alamb
Copy link
Contributor Author

alamb commented Dec 7, 2023

I think we can use Short-circuit evaluation here, if left satisfy some condition, we can skip right.

I am in the middle of rewriting the bloom filter implementation to be more general (see #8442).

I believe the new (not yet merged) code correctly handles predicates like where trace_id='3c7dbf90d1a66e3faffa344519c3bac3' OR trace_id='1' OR trace_id='2' OR trace_id='3'

However, the new code does not handle explicit IN expressions

This ticket was perhaps a bit over eager -- basically I recommend not changing the existing implementation as I am rewriting it.

However, if you would like to change the existing code, that is also fine, I will manage the conflicts as part of my PRs.

@my-vegetable-has-exploded
Copy link
Contributor

I recommend not changing the existing implementation as I am rewriting it.

Get it.

@alamb
Copy link
Contributor Author

alamb commented Dec 23, 2023

The relevant PRs have been merged now, so I think it would be possible to add support to LiteralGuarantees which would automatically add support for BloomFilters (once #8442 is merged)

@my-vegetable-has-exploded
Copy link
Contributor

thanks @hengfeiyang @alamb

@domyway
Copy link

domyway commented Dec 28, 2023

if items more than 10 , it still can’t use bloom filter

@my-vegetable-has-exploded
Copy link
Contributor

if items more than 10 , it still can’t use bloom filter

Can you provide an example of reproducing the problem? Thanks.

@domyway
Copy link

domyway commented Dec 28, 2023

if items more than 10 , it still can’t use bloom filter

Can you provide an example of reproducing the problem? Thanks.

SELECT * FROM tbl where (trace_id IN('3c7dbf90d1a66e3faffa344519c3bac3', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29')) LIMIT 10;

@alamb
Copy link
Contributor Author

alamb commented Dec 28, 2023

How do you know the bloom filter isn't being used? Is there a reproducer (a parquet file) you can share?

It appears that there is no good way to know if the bloom filter code is working via logging or metrics 🤔
https://github.com/apache/arrow-datafusion/blob/f39c040ace0b34b0775827907aa01d6bb71cbb14/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L111-L168

@alamb
Copy link
Contributor Author

alamb commented Dec 28, 2023

I wonder if #8669 from @yahoNanJing is related (which basically adds IN list pruning based on min/max statistics rather than on the bloom filters)

@domyway
Copy link

domyway commented Dec 29, 2023

How do you know the bloom filter isn't being used? Is there a reproducer (a parquet file) you can share?

It appears that there is no good way to know if the bloom filter code is working via logging or metrics 🤔

https://github.com/apache/arrow-datafusion/blob/f39c040ace0b34b0775827907aa01d6bb71cbb14/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L111-L168

I conducted a test locally by writing 200GB of data. When using a Bloom filter for queries, I observed that the query only takes 0.1 seconds, whereas without using the Bloom filter, the query takes 1 second. If a query takes 1 second, I can infer that it is not using the Bloom filter because using the Bloom filter should yield results within 0.1 seconds.

@alamb
Copy link
Contributor Author

alamb commented Dec 30, 2023

I conducted a test locally by writing 200GB of data

Thank you @domyway -- I filed #8685 with your report to get some more visibility to this issue so others may be able to find it and help. Let's continue the conversation there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants