Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix lucene codec after lucene version bumped to 9.12 #2195

Merged
merged 1 commit into from
Oct 8, 2024

Conversation

navneet1v
Copy link
Collaborator

@navneet1v navneet1v commented Oct 7, 2024

Description

Fix lucene codec after lucene version bumped to 9.12 Currently the version bump has happened only for main branch hence we are not doing any backport here.

2.x port done for core: opensearch-project/OpenSearch#16211

This change includes:

  1. Bumping up the Lucene Codec to 912.
  2. New KNN9120Codec class added.
  3. Changes in the NativeEngineFieldVectorsWriter to pass the FlatFieldVectorsWriter since in Lucene 912, the capability of FlatFieldVectorsWriter adding the vectorValue to a passed VectorFieldWriter, the VectorFieldWriters who are using the FlatFieldVectorsWriter have to call the addValue function of FlatFieldVectorsWriter. Ref: Reduce heap usage for knn index writers apache/lucene#13538
  4. Set Compress flag for Lucene SQ bits > 4 to false. Ref: https://github.com/apache/lucene/blob/branch_9_12/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsFormat.java#L113-L116.

Related Issues

Resolves #2193

Check List

  • New functionality has been documented.
  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@navneet1v navneet1v force-pushed the lucene_codec branch 2 times, most recently from bcdaeec to 614163a Compare October 7, 2024 22:02
@navneet1v navneet1v added v2.18.0 Maintenance Add support for new versions of OpenSearch/Dashboards from upstream and removed skip-changelog v2.18.0 labels Oct 7, 2024
@navneet1v
Copy link
Collaborator Author

navneet1v commented Oct 7, 2024

Seems like the min distribution is still not updated with the lucene version. Hence the builds are failing. Will check on Jenkins.

Ref: https://build.ci.opensearch.org/view/all/job/distribution-build-opensearch/

@navneet1v
Copy link
Collaborator Author

Conversation happening with the build team on this thread: https://opensearch.slack.com/archives/C04UTNM338A/p1728341414662579

@navneet1v
Copy link
Collaborator Author

navneet1v commented Oct 8, 2024

On further checking and talking to build team we found out that lucene upgrade was merged in opensearch in last 12hrs, which updated the maven repo(which happens with every merge in main branch) but the min distribution runs once in 24hrs hence min distribution is not updated.

Thanks @gaiksaya for helping here. A new build is triggered ref: https://build.ci.opensearch.org/blue/organizations/jenkins/publish-opensearch-min-snapshots/detail/publish-opensearch-min-snapshots/1818/pipeline/ Once it is completed I will re-run the GH actions.

@navneet1v
Copy link
Collaborator Author

navneet1v commented Oct 8, 2024

On doing deep-dive on the failed ITs I found out that lucene has changed the way they were using the FlatFieldVectorWriter. This will require more changes in the code to ensure that tests are passing since the changes are in the indexing path. I am working on the fix for this. Will try to raise a PR by today.

Ref: apache/lucene#13538

Earlier the Lucene99FlatVectorsWriter.FieldWriter we calling the KNNFieldWriter as a delegate. Now it is not calling anymore, hence we need to call Lucene99FlatVectorsWriter.FieldWriter.addValue from out NativeEngineFieldsVectorWriter.addValue

ryanbogan
ryanbogan previously approved these changes Oct 8, 2024
ryanbogan
ryanbogan previously approved these changes Oct 8, 2024
Signed-off-by: Navneet Verma <navneev@amazon.com>
Copy link
Collaborator

@shatejas shatejas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall NativeEngineWriter perspective it looks good, for isCompress flag please rely on approval from other maintainers.

  • We can look at if there is a way to leverage FlatVectorFieldsWriter to get the vectors maybe as a follow up for this. I know NativeEngineWriter uses Map and FlatVectorFieldsWriter uses List but might be worth it if we can leverage to reduce ram usage similar to lucene.
  • I see use of any() in unit tests, those matchers don't make for tight tests. Try to verify with exact values if possible

@navneet1v
Copy link
Collaborator Author

Overall NativeEngineWriter perspective it looks good for isCompress flag please rely on approval from other maintainers.

  • We can look at if there is a way to leverage FlatVectorFieldsWriter to get the vectors maybe as a follow up for this. I know NativeEngineWriter uses Map and FlatVectorFieldsWriter uses Map but might be worth it if we can leverage at to reduce ram.
  • I see use of any() in unit tests, those matchers don't make for tight tests. Try to verify with exact values if possible

the use of any is added in where we are completely mocking the NativeEngineFieldVectorsWriter flush and merge tests, for other places I have removed it already.

We can look at if there is a way to leverage FlatVectorFieldsWriter to get the vectors maybe as a follow up for this. I know NativeEngineWriter uses Map and FlatVectorFieldsWriter uses Map but might be worth it if we can leverage at to reduce ram

Yes, this is a good suggestion. I have it my mind but problem is if I do it right now the scope of the PR will be huge. Already it had items which came as a part of interface changes.

@navneet1v navneet1v merged commit 3322912 into opensearch-project:main Oct 8, 2024
30 of 31 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 9, 2024
Signed-off-by: Navneet Verma <navneev@amazon.com>
(cherry picked from commit 3322912)
@navneet1v
Copy link
Collaborator Author

Adding backport label as core has backported the lucene upgrade PR to 2.x branch: opensearch-project/OpenSearch#16211

naveentatikonda pushed a commit that referenced this pull request Oct 9, 2024
…2196)

* Fix lucene codec after lucene version bumped to 9.12 (#2195)

Signed-off-by: Navneet Verma <navneev@amazon.com>
(cherry picked from commit 3322912)

* Update CHANGELOG.md

Signed-off-by: Navneet Verma <navneev@amazon.com>

---------

Signed-off-by: Navneet Verma <navneev@amazon.com>
Co-authored-by: Navneet Verma <navneev@amazon.com>
@navneet1v
Copy link
Collaborator Author

  • We can look at if there is a way to leverage FlatVectorFieldsWriter to get the vectors maybe as a follow up for this. I know NativeEngineWriter uses Map and FlatVectorFieldsWriter uses List but might be worth it if we can leverage to reduce ram usage similar to lucene.

Created a GH issue for the fix: #2207

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Maintenance Add support for new versions of OpenSearch/Dashboards from upstream
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix k-NN build due to lucene upgrade
6 participants