Implement BloomFilter query rewrite (without pushdown optimization) #248
Conversation
Signed-off-by: Chen Dai <daichen@amazon.com>
```diff
@@ -132,16 +132,23 @@ public void writeTo(OutputStream out) throws IOException {
    * @param in input stream
    * @return bloom filter
    */
-  public static BloomFilter readFrom(InputStream in) throws IOException {
-    DataInputStream dis = new DataInputStream(in);
+  public static BloomFilter readFrom(InputStream in) {
```
Need to try-catch here because Spark codegen doesn't allow checked exceptions.
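A minimal sketch of that pattern, using a hypothetical `readVersion` helper (not the actual Flint method): the checked `IOException` is caught and rethrown unchecked, so the method can be invoked from Spark-generated code, which cannot declare checked exceptions.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BloomFilterIO {

  // Hypothetical helper illustrating the pattern: catch the checked
  // IOException and rethrow it unchecked for codegen compatibility.
  public static int readVersion(InputStream in) {
    try {
      DataInputStream dis = new DataInputStream(in);
      return dis.readInt(); // e.g. a serialized version header
    } catch (IOException e) {
      throw new RuntimeException("Failed to deserialize bloom filter", e);
    }
  }

  public static void main(String[] args) {
    byte[] bytes = {0, 0, 0, 1}; // big-endian int 1
    System.out.println(readVersion(new ByteArrayInputStream(bytes)));
  }
}
```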
```scala
override def eval(input: InternalRow): Any = {
  val value = valueExpression.eval(input)
  if (value == null) {
    null
```
Why is the eval result null? Should bloomFilter.test(null) return false?
Following Spark SQL NULL semantics, NULL is ignored in BloomFilterAgg, so NULL is returned for bloom_filter_might_contain(clientip, NULL). Reference: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BloomFilterMightContain.scala#L100
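A minimal sketch of this three-valued NULL propagation, using a hypothetical stand-in for the expression (a plain Set in place of a real bloom filter): a NULL input yields NULL (unknown) rather than false, so rows are never filtered out by mistake.

```java
import java.util.Set;

public class NullSemanticsSketch {

  // Hypothetical stand-in for bloom_filter_might_contain; Boolean models
  // SQL three-valued logic (TRUE / FALSE / null = unknown).
  static Boolean mightContain(Set<Long> filter, Long value) {
    if (value == null) {
      return null; // NULL input propagates to a NULL result
    }
    return filter.contains(value);
  }

  public static void main(String[] args) {
    Set<Long> filter = Set.of(42L);
    System.out.println(mightContain(filter, 42L));  // true
    System.out.println(mightContain(filter, null)); // null
  }
}
```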
As I understand it, what's discussed here happens only with WHERE clientip = NULL. The concern is that it gets rewritten to bloom_filter_might_contain(clientip, NULL), which would skip source files by mistake. I did some testing and found that col = NULL is optimized away by Spark directly, because it always returns an empty result:
```
spark-sql> EXPLAIN SELECT `@timestamp`, request FROM ds_tables.http_logs WHERE clientip = null;
== Physical Plan ==
LocalTableScan <empty>, [@timestamp#5, request#7]
```
Force-pushed from a81ae4d to ce21393.
Description
Implemented the BloomFilter skipping index query rewrite by introducing a new BloomFilterMightContain expression. This internal expression represents BloomFilter queries, aligning with the approach taken in a previous PR that added BloomFilterAgg. In the absence of pushdown optimization in the Flint data source, this PR also updates the integration tests to validate both code generation and evaluation execution.
PR Planned
Documentation
https://github.com/dai-chen/opensearch-spark/blob/implement-bloom-filter-query-rewrite-no-pushdown/docs/index.md#feature-highlights
Issues Resolved
#206
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.