Provide hybrid scan setting for consistency requirement #1819

dai-chen
Collaborator

@dai-chen dai-chen commented Jul 7, 2023

Description

Add a hybrid scan mode that covers the latest source files that have not yet been refreshed into the Flint index:

  1. For a skipping index, a source file unknown to the skipping index is not skipped, because we cannot be sure whether it contains the answer to a query. (This case is covered in this PR.)
  2. For a covering index/MV, we need to union the results from the unknown files with the index data.
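
The rule in item 1 can be pictured with a small, dependency-free Scala sketch. It is not the code in this PR: `IndexEntry`, `selectFiles`, and the `hybridScan` flag are hypothetical names chosen for illustration, and the sketch assumes that, without hybrid scan, a file the index has never seen is simply dropped. The `Map` lookup plays the role of a left join from source files to index entries.

```scala
// Hypothetical model of one skipping index entry: the file it describes and
// whether the index says the file may contain rows matching the query.
final case class IndexEntry(filePath: String, mayMatch: Boolean)

object HybridScanSketch {

  // Emulates a left join of source files against index entries, keyed by path.
  def selectFiles(
      sourceFiles: Seq[String],
      indexEntries: Seq[IndexEntry],
      hybridScan: Boolean): Seq[String] = {
    val byPath: Map[String, IndexEntry] =
      indexEntries.map(e => e.filePath -> e).toMap

    sourceFiles.filter { path =>
      byPath.get(path) match {
        case Some(entry) => entry.mayMatch // known file: trust the index
        case None        => hybridScan     // unknown file: keep only in hybrid mode (assumption)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val source = Seq("s3://bucket/f1", "s3://bucket/f2", "s3://bucket/f3")
    val index = Seq(
      IndexEntry("s3://bucket/f1", mayMatch = true),
      IndexEntry("s3://bucket/f2", mayMatch = false))

    // f3 is not in the index yet, so only hybrid mode keeps it.
    println(selectFiles(source, index, hybridScan = true))
    println(selectFiles(source, index, hybridScan = false))
  }
}
```

With hybrid scan on, the sketch keeps `f1` (matched by the index) and `f3` (unknown to the index) and skips `f2`; with it off, only `f1` survives.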

Issues Resolved

opensearch-project/opensearch-spark#2

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

dai-chen added 3 commits July 7, 2023 13:11
Signed-off-by: Chen Dai <daichen@amazon.com>
Signed-off-by: Chen Dai <daichen@amazon.com>
Signed-off-by: Chen Dai <daichen@amazon.com>
@dai-chen dai-chen added enhancement New feature or request Flint labels Jul 7, 2023
@dai-chen dai-chen self-assigned this Jul 7, 2023

codecov bot commented Jul 7, 2023

Codecov Report

Merging #1819 (136de9f) into feature/flint (91b2a06) will not change coverage.
The diff coverage is n/a.

@@               Coverage Diff                @@
##             feature/flint    opensearch-project/sql#1819   +/-   ##
================================================
  Coverage            97.19%   97.19%           
  Complexity            4107     4107           
================================================
  Files                  371      371           
  Lines                10464    10464           
  Branches               706      706           
================================================
  Hits                 10170    10170           
  Misses                 287      287           
  Partials                 7        7           
Flag Coverage Δ
sql-engine 97.19% <ø> (ø)

Flags with carried forward coverage won't be shown.

Signed-off-by: Chen Dai <daichen@amazon.com>
@dai-chen dai-chen marked this pull request as ready for review July 7, 2023 21:21
partitions
  .flatMap(_.files.map(f => f.getPath.toUri.toString))
  .toDF(FILE_PATH_COLUMN)
  .join(indexScan, Seq(FILE_PATH_COLUMN), "left")
Collaborator

I assume the left join is expensive. Is there a magic number (threshold) beyond which we should avoid the left join, meaning we scan without the index? If that makes sense, we can open an issue to track it at the perf test stage.

Collaborator Author

Sure, we will cover this in our benchmark after this is merged. Thanks!

@dai-chen dai-chen merged commit 5da4f0a into opensearch-project:feature/flint Jul 11, 2023
@dai-chen dai-chen deleted the add-hybrid-scan-mode-rebased branch July 11, 2023 20:56