Implement BloomFilter query rewrite (without pushdown optimization) #248
Conversation
Signed-off-by: Chen Dai <daichen@amazon.com>
```diff
@@ -132,16 +132,23 @@ public void writeTo(OutputStream out) throws IOException {
    * @param in input stream
    * @return bloom filter
    */
-  public static BloomFilter readFrom(InputStream in) throws IOException {
-    DataInputStream dis = new DataInputStream(in);
+  public static BloomFilter readFrom(InputStream in) {
```
Need to try-catch here because Spark codegen doesn't allow checked exceptions.
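A minimal sketch of that pattern, using a hypothetical `readVersion` helper (not the actual Flint method): the checked `IOException` is caught and rethrown unchecked, so the method can be invoked from Spark-generated code, which cannot declare checked exceptions.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BloomFilterIO {

  // Hypothetical helper illustrating the pattern: catch the checked
  // IOException and rethrow it unchecked for codegen compatibility.
  public static int readVersion(InputStream in) {
    try {
      DataInputStream dis = new DataInputStream(in);
      return dis.readInt(); // e.g. a serialized version header
    } catch (IOException e) {
      throw new RuntimeException("Failed to deserialize bloom filter", e);
    }
  }

  public static void main(String[] args) {
    byte[] bytes = {0, 0, 0, 1}; // big-endian int 1
    System.out.println(readVersion(new ByteArrayInputStream(bytes)));
  }
}
```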
```scala
override def eval(input: InternalRow): Any = {
  val value = valueExpression.eval(input)
  if (value == null) {
    null
```
Why is the eval result null? Should bloomFilter.test(null) return false?
Following Spark SQL NULL semantics, NULL is ignored in BloomFilterAgg, so NULL is returned for bloom_filter_might_contain(clientip, NULL). Reference: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BloomFilterMightContain.scala#L100
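A minimal sketch of this three-valued NULL propagation, using a hypothetical stand-in for the expression (a plain Set in place of a real bloom filter): a NULL input yields NULL (unknown) rather than false, so rows are never filtered out by mistake.

```java
import java.util.Set;

public class NullSemanticsSketch {

  // Hypothetical stand-in for bloom_filter_might_contain; Boolean models
  // SQL three-valued logic (TRUE / FALSE / null = unknown).
  static Boolean mightContain(Set<Long> filter, Long value) {
    if (value == null) {
      return null; // NULL input propagates to a NULL result
    }
    return filter.contains(value);
  }

  public static void main(String[] args) {
    Set<Long> filter = Set.of(42L);
    System.out.println(mightContain(filter, 42L));  // true
    System.out.println(mightContain(filter, null)); // null
  }
}
```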
As I understand it, what's discussed here happens only with WHERE clientip = NULL. The concern is that it gets rewritten to bloom_filter_might_contain(clientip, NULL), which would skip source files by mistake. I did some testing and found that col = NULL is optimized away by Spark directly, because it always returns an empty result:
```
spark-sql> EXPLAIN SELECT `@timestamp`, request FROM ds_tables.http_logs WHERE clientip = null;
== Physical Plan ==
LocalTableScan <empty>, [@timestamp#5, request#7]
```
Force-pushed from a81ae4d to ce21393.
Description
Implemented the BloomFilter skipping index query rewrite by introducing a new BloomFilterMightContain expression. This internal expression represents BloomFilter queries, aligning with the approach taken in a previous PR that added BloomFilterAgg. In the absence of pushdown optimization in the Flint data source, this PR also updates the integration tests to validate both code generation and evaluation execution.
PR Planned
Documentation
https://github.com/dai-chen/opensearch-spark/blob/implement-bloom-filter-query-rewrite-no-pushdown/docs/index.md#feature-highlights
Issues Resolved
#206
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.