Fix: avoid ZSTD codec from overriding service codec factory. #7037

Merged
merged 3 commits into from
Apr 13, 2023

Conversation

mulugetam
Contributor

@mulugetam mulugetam commented Apr 6, 2023

The new ZSTD compression codec adds ZSTD to the existing compression codecs (default, best_compression, and lucene_default). With this PR, the custom codec plugin supplies a custom codec service factory only when index.codec is set to ZSTD or ZSTDNODICT.

Description

Fixes issue #7012 by explicitly avoiding the creation of a custom codec service unless the index.codec value is either ZSTD or ZSTDNODICT.
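
A minimal sketch of the gating described above, using stand-in types rather than the real OpenSearch classes. `CodecGateSketch`, `CodecServiceFactoryStub`, and `customFactoryFor` are hypothetical names introduced only for illustration; the actual change is made in the plugin's `getCustomCodecServiceFactory`, which reads the value from `IndexSettings`.

```java
import java.util.Optional;
import java.util.Set;

// Stand-in sketch, not the OpenSearch API. All type and method names are hypothetical.
final class CodecGateSketch {

    /** Placeholder for a plugin-provided codec service factory. */
    static final class CodecServiceFactoryStub {}

    // Codec names contributed by the custom codec plugin, per the PR description.
    private static final Set<String> CUSTOM_CODECS = Set.of("ZSTD", "ZSTDNODICT");

    /**
     * Returns a custom factory only when the index.codec value names one of the
     * plugin's own codecs; otherwise returns empty so other plugins are left alone.
     */
    static Optional<CodecServiceFactoryStub> customFactoryFor(String indexCodec) {
        if (indexCodec != null && CUSTOM_CODECS.contains(indexCodec)) {
            return Optional.of(new CodecServiceFactoryStub());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(customFactoryFor("ZSTD").isPresent());             // true
        System.out.println(customFactoryFor("best_compression").isPresent()); // false
    }
}
```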

Issues Resolved

Resolves #7012

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • [x] Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

  - addresses opensearch-project#7012

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>
Member

@dblock dblock left a comment

This works, but feels awfully specific to the fact that we have these two custom codecs in the project. A few ideas before I hit approve/merge.

  1. Can we check whether the codec is not custom (aka well known codecs) instead of whether it's a known custom codec?
  2. Does it make sense to use names such as "CUSTOM:ZSTD" and then look for the "CUSTOM:" prefix instead, or is that a silly idea?
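
The two ideas above could look roughly like the predicates below. This is only a sketch: `CustomCodecPredicates` and both method names are hypothetical, the well-known set is taken from the codec names listed in the PR description, and the "CUSTOM:" prefix is the naming convention floated in idea 2, not an existing one.

```java
import java.util.Set;

// Hypothetical helper contrasting the two proposals; not part of OpenSearch.
final class CustomCodecPredicates {

    // Built-in codec names mentioned in the PR description.
    private static final Set<String> WELL_KNOWN =
        Set.of("default", "best_compression", "lucene_default");

    // Prefix proposed in idea 2 (an assumption, not an existing convention).
    private static final String CUSTOM_PREFIX = "CUSTOM:";

    /** Idea 1: anything that is not a well-known codec is treated as custom. */
    static boolean isCustomByExclusion(String codec) {
        return !WELL_KNOWN.contains(codec);
    }

    /** Idea 2: only names carrying the explicit prefix are treated as custom. */
    static boolean isCustomByPrefix(String codec) {
        return codec.startsWith(CUSTOM_PREFIX);
    }

    public static void main(String[] args) {
        System.out.println(isCustomByExclusion("ZSTD"));      // true
        System.out.println(isCustomByPrefix("CUSTOM:ZSTD"));  // true
        System.out.println(isCustomByPrefix("ZSTD"));         // false
    }
}
```

Idea 1 needs no change to the names users type, while idea 2 makes the intent explicit at the cost of a new naming rule.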

@github-actions
Contributor

github-actions bot commented Apr 6, 2023

Gradle Check (Jenkins) Run Completed with:

     */
    @Override
    public Optional<CodecServiceFactory> getCustomCodecServiceFactory(final IndexSettings indexSettings) {
-        return Optional.of(new CustomCodecServiceFactory());
+        String codec = indexSettings.getValue(EngineConfig.INDEX_CODEC_SETTING);
Contributor

Please add a unit test for this class.

@mulugetam
Contributor Author

  1. ... the codec is not custom

@dblock Do you mean requiring the user to specify all custom codecs as, for example, index.codec: CUSTOM:ZSTD instead of just index.codec: ZSTD, and then returning a custom CodecServiceFactory only if (codec.startsWith("CUSTOM:"))?

@navneet1v
Contributor

One comment: if an index sets this value to use the custom codec while another plugin also supplies a codec (say, for a k-NN index), which codec will be picked up? Or will it lead to failures?

Example (might result in a failure; can we check this case?):

PUT my-knn-index-1
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100,
      "codec": "ZSTD"
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 2,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      },
      "my_vector2": {
        "type": "knn_vector",
        "dimension": 4,
        "method": {
          "name": "hnsw",
          "space_type": "innerproduct",
          "engine": "faiss",
          "parameters": {
            "ef_construction": 256,
            "m": 48
          }
        }
      }
    }
  }
}

I have created issue #7032 to explore possible solutions.

One suggestion: can we add Javadoc to this plugin, and to the EnginePlugin class that provides this interface, stating that index creation can fail if an index tries to use two codecs?
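
To make the failure mode concrete, here is a rough sketch of what can go wrong when two plugins both offer a codec service factory for the same index. It assumes the node rejects more than one factory per index; that rule, along with `CodecFactoryConflictSketch`, `EnginePluginStub`, and `resolve`, is a hypothetical stand-in for illustration and not the actual OpenSearch code.

```java
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical model of factory resolution across plugins; not the OpenSearch API.
final class CodecFactoryConflictSketch {

    /** Stand-in for a plugin that may offer a codec service factory (named by a string here). */
    interface EnginePluginStub {
        Optional<String> customCodecServiceFactory(String indexCodec);
    }

    /**
     * Collects the factories offered for an index and fails if more than one
     * plugin offers one (assumed behaviour, see the lead-in).
     */
    static Optional<String> resolve(List<EnginePluginStub> plugins, String indexCodec) {
        List<String> factories = plugins.stream()
            .map(p -> p.customCodecServiceFactory(indexCodec))
            .filter(Optional::isPresent)
            .map(Optional::get)
            .collect(Collectors.toList());
        if (factories.size() > 1) {
            throw new IllegalStateException("multiple codec service factories offered: " + factories);
        }
        return factories.stream().findFirst();
    }

    public static void main(String[] args) {
        EnginePluginStub knn = codec -> Optional.of("knn-factory");
        // A plugin that only offers a factory for its own codec name.
        EnginePluginStub gatedZstd = codec ->
            "ZSTD".equals(codec) ? Optional.of("zstd-factory") : Optional.<String>empty();

        System.out.println(resolve(List.of(knn, gatedZstd), "default")); // Optional[knn-factory]
        try {
            resolve(List.of(knn, gatedZstd), "ZSTD");
        } catch (IllegalStateException e) {
            System.out.println("conflict: " + e.getMessage());
        }
    }
}
```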

  - Removed custom classes for CodecService and CodecServiceFactory.
  - Also removed PerFieldMappingPostingFormatCodec -- not required.
  - Added documentation.

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>
@mulugetam
Contributor Author

I reviewed the code again to see if I could do it without overriding the existing codec service factory, and I can. So, I have removed the CodecService classes altogether.

Note that the custom compression codecs are registered by calling the org.apache.lucene.codecs.Codec constructor here. The index.codec setting accepts the custom compression codecs because its validation calls Codec.availableCodecs().contains(s) here in EngineConfig.

@dblock The names for custom compression codecs, ZSTD and ZSTDNODICT, cannot contain non-alphanumeric characters; Lucene's NamedSPILoader forbids it.
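
As a rough illustration of that naming restriction, the check below approximates the rule as I understand it: a service name must be shorter than 128 characters and consist only of ASCII letters and digits. `ServiceNameCheckSketch` and `isValidServiceName` are hypothetical names; this is a stand-alone approximation, not Lucene's own code.

```java
// Stand-alone approximation of the service-name rule; not Lucene's implementation.
final class ServiceNameCheckSketch {

    /** Returns true when the name is under 128 chars and purely ASCII alphanumeric. */
    static boolean isValidServiceName(String name) {
        if (name.length() >= 128) {
            return false;
        }
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean asciiLetterOrDigit =
                ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z') || ('0' <= c && c <= '9');
            if (!asciiLetterOrDigit) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidServiceName("ZSTD"));        // true
        System.out.println(isValidServiceName("ZSTDNODICT"));  // true
        System.out.println(isValidServiceName("CUSTOM:ZSTD")); // false, ':' is not alphanumeric
    }
}
```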

@navneet1v @martin-gaievski Can you check if this also addresses #7032?

@github-actions
Contributor

github-actions bot commented Apr 7, 2023

Gradle Check (Jenkins) Run Completed with:

@codecov-commenter

codecov-commenter commented Apr 7, 2023

Codecov Report

Merging #7037 (0cf7be4) into main (53b128f) will decrease coverage by 0.07%.
The diff coverage is 0.00%.

@@             Coverage Diff              @@
##               main    #7037      +/-   ##
============================================
- Coverage     70.78%   70.72%   -0.07%     
- Complexity    59269    59278       +9     
============================================
  Files          4823     4820       -3     
  Lines        283985   283962      -23     
  Branches      40953    40952       -1     
============================================
- Hits         201026   200820     -206     
- Misses        66403    66693     +290     
+ Partials      16556    16449     -107     
Impacted Files                                           Coverage Δ
...ch/index/codec/customcodecs/CustomCodecPlugin.java    0.00% <ø> (ø)
...customcodecs/Lucene95CustomStoredFieldsFormat.java    25.00% <ø> (ø)
...opensearch/index/codec/customcodecs/ZstdCodec.java    80.00% <ø> (ø)
.../index/codec/customcodecs/ZstdCompressionMode.java    84.09% <ø> (ø)
...arch/index/codec/customcodecs/ZstdNoDictCodec.java    80.00% <ø> (ø)
.../codec/customcodecs/ZstdNoDictCompressionMode.java    76.71% <ø> (ø)
...ter/snapshots/get/TransportGetSnapshotsAction.java    41.58% <0.00%> (ø)

... and 505 files with indirect coverage changes

@navneet1v
Contributor

navneet1v commented Apr 7, 2023

I reviewed the code again to see if I could do it without overriding the existing codec service factory, and I can. So, I have removed the CodecService classes altogether.

@mulugetam

Now that you have removed all of the CodecServiceFactory classes, I want to know how the codec gets used, or gets attached to a particular index.

Is it the case that if someone specifies index.codec: ZSTD, the ZstdCodec will be picked up if it is already present?

@mulugetam
Contributor Author

mulugetam commented Apr 7, 2023

Is it the case that if someone specifies index.codec: ZSTD, the ZstdCodec will be picked up if it is already present?

@navneet1v Yes. That happens here, which looks up Lucene's registered codecs here.

Lucene's setCodec(new ZstdCodec()) is called, just as setCodec(new Lucene95Codec(Lucene95Codec.Mode.BEST_COMPRESSION)) is called for index.codec: best_compression.
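
A simplified model of that lookup, under stated assumptions: built-in names are registered first, and any codec discovered through the service loader is then added under its own name, so `index.codec: ZSTD` resolves without a custom factory. `CodecLookupSketch` and `CodecStub` are hypothetical stand-ins, and the description strings are placeholders rather than real codec instances.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of name-to-codec resolution; not the OpenSearch CodecService.
final class CodecLookupSketch {

    /** Placeholder for a Lucene codec instance. */
    static final class CodecStub {
        final String description;

        CodecStub(String description) {
            this.description = description;
        }
    }

    private final Map<String, CodecStub> codecs = new HashMap<>();

    /** @param serviceLoaded codecs assumed to be discovered via the service loader, keyed by name */
    CodecLookupSketch(Map<String, CodecStub> serviceLoaded) {
        // Built-in aliases come first.
        codecs.put("default", new CodecStub("default codec"));
        codecs.put("best_compression", new CodecStub("Lucene95Codec(BEST_COMPRESSION)"));
        // Service-loaded codecs are added under their own names without replacing the aliases.
        serviceLoaded.forEach(codecs::putIfAbsent);
    }

    /** Resolves an index.codec value, failing for unknown names. */
    CodecStub codec(String name) {
        CodecStub codec = codecs.get(name);
        if (codec == null) {
            throw new IllegalArgumentException("failed to find codec [" + name + "]");
        }
        return codec;
    }

    public static void main(String[] args) {
        CodecLookupSketch service = new CodecLookupSketch(Map.of("ZSTD", new CodecStub("ZstdCodec")));
        System.out.println(service.codec("ZSTD").description);             // ZstdCodec
        System.out.println(service.codec("best_compression").description); // Lucene95Codec(BEST_COMPRESSION)
    }
}
```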

Flamegraphs for the run below, with default, best_compression, and ZSTD compression codecs, are available here (search for "compress").

opensearch-benchmark execute_test \
--workload=geonames \
--workload-params='{"index_settings":{"index.codec": "compression_codec"}}' \
--target-hosts=127.0.0.1:9200 \
--kill-running-processes \
--pipeline=benchmark-only \
--include-tasks=delete-index,create-index,index-append

@mulugetam mulugetam changed the title from "Fix: enable ZSTD codec only if index.codec is set to ZSTD." to "Fix: avoid ZSTD codec from overriding service codec factory." Apr 7, 2023
@reta
Collaborator

reta commented Apr 7, 2023

I reviewed the code again to see if I could do it without overriding the existing codec service factory, and I can. So, I have removed the CodecService classes altogether.

This is awesome, @mulugetam. Basically, the standard service loader mechanism is entirely sufficient here, right?

@mulugetam
Contributor Author

This is awesome, @mulugetam. Basically, the standard service loader mechanism is entirely sufficient here, right?

@reta Yes.

  - Zstandard version 1.5.5 contains a bug fix for a rare corruption error
    described here: https://github.com/facebook/zstd/releases/tag/v1.5.5. The
    zstd-jni version we use here, 1.5.5-1, uses Zstandard v1.5.5.

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>
@mulugetam
Contributor Author

@reta I have also upgraded the zstd-jni version from 1.5.4-1 to 1.5.5-1. Version 1.5.5-1 is based on ZSTD version 1.5.5 that addresses the rare corruption bug described here: https://github.com/facebook/zstd/releases/tag/v1.5.5

@github-actions
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationIT.testScrollCreatedOnReplica
      1 org.opensearch.indices.replication.SegmentReplicationIT.testReplicaHasDiffFilesThanPrimary
      1 org.opensearch.indices.replication.SegmentReplicationIT.testPitCreatedOnReplica

@dblock dblock merged commit 569e90c into opensearch-project:main Apr 13, 2023
@dblock dblock added the backport 2.x Backport to 2.x branch label Apr 13, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Apr 13, 2023
* Fix: enable ZSTD codec only if index.codec is set to ZSTD.

  - addresses #7012

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>

* Removed custom CodecService and CodecServiceFactory classes.

  - Removed custom classes for CodecService and CodecServiceFactory.
  - Also removed PerFieldMappingPostingFormatCodec -- not required.
  - Added documentation.

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>

* Bump zstd-jni version from 1.5.4-1 to 1.5.5-1.

  - Zstandard version 1.5.5 contains a bug fix for a rare corruption error
    described here: https://github.com/facebook/zstd/releases/tag/v1.5.5. The
    zstd-jni version we use here, 1.5.5-1, uses Zstandard v1.5.5.

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>

---------

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>
(cherry picked from commit 569e90c)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
reta pushed a commit that referenced this pull request Apr 13, 2023
…7149)

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
austintlee pushed a commit to austintlee/OpenSearch that referenced this pull request Apr 28, 2023
…rch-project#7037)

Signed-off-by: Mulugeta Mammo <mulugeta.mammo@intel.com>
Labels
backport 2.x Backport to 2.x branch skip-changelog
Development

Successfully merging this pull request may close these issues.

[BUG] Zstd new codec is a breaking change for kNN plugin
5 participants