Implement adaptive BloomFilter algorithm #251

dai-chen · 2024-02-09T22:53:11Z

Description

This pull request (PR) introduces an enhanced BloomFilter implementation that extends the capabilities of the previously added classic BloomFilter. This new implementation builds the BloomFilter adaptively, determining optimal parameters for construction without relying on prior knowledge of cardinality.

Documentation: user manual will be updated in the next PR of adding SQL support.

Adaptive BloomFilter parameters
- num_candidates: by default adaptive algorithm uses 10 candidates
- fpp: false positive probability (passed to underlying BloomFilter algorithm)
Classic BloomFilter parameters
- num_items: expected maximum number of distinct items (NDV)
- fpp: same as above

PR Planned

Implement BloomFilter skipping index building logic #242
Implement BloomFilter query rewrite (without pushdown optimization) #248
Implement BloomFilter query pushdown optimization #271
Implement adaptive BloomFilter algorithm #251 [Current]
Support bloom filter type in Flint SQL

Detailed Design

New Classes

Added new abstraction BloomFilterFactory that creates or deserialize BloomFilter.
Added new AdaptiveBloomFilter that builds BloomFilter adaptively.

Adaptive Algorithm

The Adaptive BloomFilter algorithm dynamically adjusts to varying cardinalities by initially creating 10 BloomFilters, each with a doubled number of expected number of items (NDV, number of distinct values) as candidates. Upon inserting a unique element, the cardinality counter increments. Because the BloomFilter's put item result can determine if it's the first time the item has been seen, this is achieved using the put item result of the largest BloomFilter candidate, which is the most accurate.

At last, the algorithm selects the candidate with an NDV just greater than the current cardinality as the best candidate. To reduce the overhead introduced by more candidates, candidates with NDV smaller than the best are ignored during the put items or merge operations. In the scenario that the cardinality exceeds the largest candidate's NDV, the algorithm designates the largest candidate as the best, even if its false positive rate (FPP) decreases due to overflow.

In summary, this adaptive approach ensures the use of a BloomFilter with the right size, even in the absence of prior knowledge of cardinality, ensuring optimal performance and accuracy in diverse scenarios.

Distributed BloomFilter Aggregation

For better understanding how this works in Spark, here is the basic workflow when we run query such as SELECT input_file_name(), bloom_filter_agg(clientip) FROM http_logs GROUP BY input_file_name():

BloomFilterAgg creates BloomFilter instance and put items
BloomFilterAgg serializes BloomFilter and merge those within same bucket together after shuffle
BloomFilterAgg serializes BloomFilter again as final output result

Benchmark Test #

Below are the initial benchmark test results. We will conduct a comprehensive test after all PRs have been finalized and merged into the main branch. As of now, the conclusions drawn from the initial results are as follows:

With Prior Knowledge of Uniform Cardinality
- When users has prior knowledge of the cardinality, and the cardinality of each file is similar, it is recommended to employ the non-adaptive BloomFilter algorithm.
Without Prior Knowledge or Large Variations in Cardinality:
- In scenarios where the cardinality is unknown or where cardinality exhibits large variations, it is advisable to utilize the default adaptive BloomFilter algorithm. [Test case T7]
- Otherwise, the non-adaptive algorithm may still consume significant disk space. [Test case T5 and T6]

Test Case	Indexing Latency (sec)	Index Size	File Scanned	Query Latency (sec)	Comment
T1: No Index	0	0	1045	641
T2: ValueSet	610	336k	1033	651	Only 12 files has lower cardinality than 100
T3: ValueSet(50k)	663	193m	535	455
T4: ValueSet(1m)	856	1g	1	13
T5: BloomFilter(200k)	740	543m	1	14	user has prior knowledge of maximum cardinality, but cardinality varies significantly across files
T6: BloomFilter(1m)	784	1.7g	1	15
T7: BloomFilter(adaptive)	783	241m	10	21	9 false positives

Issues Resolved

#206

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Chen Dai <daichen@amazon.com>

penghuo · 2024-03-13T16:03:17Z

The Adaptive BloomFilter algorithm dynamically adjusts to varying cardinalities by initially creating 10 BloomFilters, each with a doubled number of expected number of items (NDV, number of distinct values) as candidates.

By default, Adaptive BloomFilter can handle 1024*1024 NDV at most?. If user already know the range of NDV? User should use ClassicBloomFilters, right?

dai-chen · 2024-03-13T17:20:36Z

The Adaptive BloomFilter algorithm dynamically adjusts to varying cardinalities by initially creating 10 BloomFilters, each with a doubled number of expected number of items (NDV, number of distinct values) as candidates.

By default, Adaptive BloomFilter can handle 1024*1024 NDV at most?. If user already know the range of NDV? User should use ClassicBloomFilters, right?

Yes. For the recommendation, I updated here: #251 (comment). Will redo the benchmark and post test result and conclusion once all PRs merged. Thanks!

penghuo

Thx!

dai-chen added enhancement New feature or request 0.2 labels Feb 9, 2024

dai-chen self-assigned this Feb 9, 2024

dai-chen changed the title ~~Implement adaptive bloom filter algorithm~~ Implement adaptive BloomFilter algorithm Feb 9, 2024

dai-chen added 0.3 and removed 0.2 labels Feb 28, 2024

dai-chen mentioned this pull request Mar 5, 2024

Implement BloomFilter query pushdown optimization #271

Merged

5 tasks

dai-chen added 5 commits March 7, 2024 16:46

Add basic adaptive BF and factory

f59cd97

Signed-off-by: Chen Dai <daichen@amazon.com>

Refactor BloomFilter factory

4170422

Signed-off-by: Chen Dai <daichen@amazon.com>

Add number of candidate param

454d0cd

Signed-off-by: Chen Dai <daichen@amazon.com>

Add BloomFilter candidate class

c697cf3

Signed-off-by: Chen Dai <daichen@amazon.com>

Change default NDV to 1024

11be2f9

Signed-off-by: Chen Dai <daichen@amazon.com>

dai-chen force-pushed the implement-adaptive-bloom-filter branch from 2343d4c to 11be2f9 Compare March 8, 2024 00:47

dai-chen added 5 commits March 8, 2024 14:33

Fix broken IT

003ed7d

Signed-off-by: Chen Dai <daichen@amazon.com>

Add UT for bloom filter factory

21d5434

Signed-off-by: Chen Dai <daichen@amazon.com>

Refactor adaptive BF and UT

b0cbe34

Signed-off-by: Chen Dai <daichen@amazon.com>

Merge branch 'main' into implement-adaptive-bloom-filter

4104789

Add comment and more UT for adaptive BloomFilter

3df4893

Signed-off-by: Chen Dai <daichen@amazon.com>

dai-chen marked this pull request as ready for review March 12, 2024 16:45

dai-chen requested review from rupal-bq, vmmusings, penghuo, anirudha, kaituo and YANG-DB as code owners March 12, 2024 16:45

penghuo approved these changes Mar 13, 2024

View reviewed changes

Merge branch 'main' into implement-adaptive-bloom-filter

e888854

dai-chen merged commit 8cdc171 into opensearch-project:main Mar 13, 2024
4 checks passed

dai-chen deleted the implement-adaptive-bloom-filter branch March 13, 2024 18:50

This was referenced Mar 13, 2024

Add BloomFilter skipping index SQL support #283

Merged

[Feature] OpenSearch and Apache Spark Integration #3

Closed

dai-chen mentioned this pull request Mar 21, 2024

Add skipping index benchmark test #291

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement adaptive BloomFilter algorithm #251

Implement adaptive BloomFilter algorithm #251

dai-chen commented Feb 9, 2024 •

edited

Loading

penghuo commented Mar 13, 2024

dai-chen commented Mar 13, 2024

penghuo left a comment

Implement adaptive BloomFilter algorithm #251

Implement adaptive BloomFilter algorithm #251

Conversation

dai-chen commented Feb 9, 2024 • edited Loading

Description

PR Planned

Detailed Design

New Classes

Adaptive Algorithm

Distributed BloomFilter Aggregation

Benchmark Test #

Issues Resolved

penghuo commented Mar 13, 2024

dai-chen commented Mar 13, 2024

penghuo left a comment

Choose a reason for hiding this comment

dai-chen commented Feb 9, 2024 •

edited

Loading