[ML] Add prefix strings option to trained models #101978

davidkyle · 2023-11-09T17:10:51Z

Certain NLP models, such as multilingual-e5-large, require a prefix string to be applied to the input text. For asymmetric tasks such as information retrieval, the prefix can differ between ingesting the data and searching it. For example, a text embedding model can have one prefix applied when the model is evaluated as part of a kNN search and a different prefix when ingesting documents.

An example model configuration with prefix strings is:

{
    "model_type": "pytorch",
    "inference_config": {
        "text_embedding": {
            "tokenization": {
                "xlm_roberta": {
                    "do_lower_case": false,
                    "with_special_tokens": true,
                    "max_sequence_length": 512,
                    "truncate": "first",
                    "span": -1
                }
            },
            "embedding_size": 384
        }
    },
    "prefix_strings": {
        "search": "this is a query",
        "ingest": "this is a passage"
    }
}
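To make the search/ingest distinction concrete, here is a minimal Python sketch of how a prefix might be chosen by context and prepended to the input. It is an illustration only, not the Elasticsearch implementation: `apply_prefix` and `infer` are hypothetical names, and the plain concatenation with no separator is an assumption. Only the `prefix_strings`, `search`, and `ingest` keys come from the configuration above.

```python
import json

# Only "prefix_strings", "search" and "ingest" come from the example
# configuration; every function name below is hypothetical.
MODEL_CONFIG = json.loads("""
{
    "model_type": "pytorch",
    "prefix_strings": {
        "search": "this is a query",
        "ingest": "this is a passage"
    }
}
""")


def apply_prefix(text, model_config, prefix_type):
    """Prepend the prefix configured for prefix_type, if there is one."""
    prefixes = model_config.get("prefix_strings") or {}
    prefix = prefixes.get(prefix_type)
    if not prefix:
        # No prefix configured for this context: leave the input unchanged.
        return text
    # Assumption: plain concatenation, with no separator inserted.
    return prefix + text


def infer(text, model_config, prefix_type):
    """Stand-in for an inference call: returns the text the model would see."""
    return apply_prefix(text, model_config, prefix_type)


query_input = infer("how do prefixes work?", MODEL_CONFIG, "search")
passage_input = infer("Prefixes mark the role of a text.", MODEL_CONFIG, "ingest")
print(query_input)
print(passage_input)
```

Under this assumption the prefix string has to carry its own trailing separator. The E5 family, for instance, documents `query: ` and `passage: ` as its prefixes, each ending in a space.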

Many files are touched by this change, but the bulk of the work is quite simple: define the configuration object and pass the prefix type (context) parameter down through the inference calls.

elasticsearchmachine · 2023-11-10T12:49:12Z

Hi @davidkyle, I've created a changelog YAML for you.

elasticsearchmachine · 2023-11-10T13:01:00Z

Pinging @elastic/ml-core (Team:ML)

docs/reference/ml/trained-models/apis/put-trained-models.asciidoc

droberts195

LGTM, but if we're going to have packaged models on GCS that use this feature then the new section will need to also go in this class:

elasticsearch/x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/inference/trainedmodel/ModelPackageConfig.java

Line 33 in 8d6ded3

public class ModelPackageConfig implements ToXContentObject, Writeable {

szabosteve

Docs LGTM, thanks!

davidkyle · 2023-11-13T12:45:04Z

@elasticmachine update branch

elasticmachine · 2023-11-13T12:45:06Z

merge conflict between base and head

…er (elastic#101912) This commit addresses an issue in the passage formatter of the unified highlighter, where overlapping terms were not correctly expanded to be highlighted as a single object. The fix in this commit involves adjusting the expansion logic to consider the maximum end offset during the process, as matches are initially sorted by ascending start offset and then by ascending end offset.

We encountered a bad third-party S3 repository implementation which incorrectly rejects empty multipart uploads. This anomaly is detected by some repository analysis runs, but not by all of them. This commit adds a specific check for this incompatibility so that it can be reported reliably.

* Add inference counts by NLP model to the machine learning usage stats.
* Update docs/changelog/101915.yaml
* Add inference_counts_by_model to yamlRestTest.
* Strip leading dot from internal model IDs.
* Add last access and task type to the stats by model.
* Change stats_by_model for map to list
* Simplify code.
* Fix style

davidkyle · 2023-11-13T15:48:27Z

Closing as I've made a mess with git. #102089 has been raised as a replacement.

elasticsearchmachine added the v8.12.0 label Nov 9, 2023

davidkyle added 3 commits November 10, 2023 09:58

Prefix strings

9ed028f

fix the tests

2d4dc4c

prefix type in the request

c4d3327

davidkyle force-pushed the prefix-strings branch from 4bc50f5 to c4d3327 Compare November 10, 2023 12:16

davidkyle changed the title ~~Add prefix strings option to trained models~~ [ML] Add prefix strings option to trained models Nov 10, 2023

davidkyle added >enhancement :ml Machine learning labels Nov 10, 2023

Update docs/changelog/101978.yaml

f8936fa

davidkyle marked this pull request as ready for review November 10, 2023 13:00

elasticsearchmachine added the Team:ML Meta label for the ML team label Nov 10, 2023

docs

14a58ab

szabosteve reviewed Nov 10, 2023

docs/reference/ml/trained-models/apis/put-trained-models.asciidoc (resolved)

docs/reference/ml/trained-models/apis/put-trained-models.asciidoc (outdated, resolved)

droberts195 approved these changes Nov 10, 2023

davidkyle and others added 2 commits November 13, 2023 12:42

Update docs/reference/ml/trained-models/apis/put-trained-models.asciidoc

36fa520

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

Update docs/reference/ml/trained-models/apis/put-trained-models.asciidoc

7189fc0

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

szabosteve approved these changes Nov 13, 2023

DaveCTurner and others added 9 commits November 13, 2023 14:20

Avoid negative DesiredBalanceStats#lastConvergedIndex (elastic#101998)

b7f19ad

The initial state of the desired-balance allocator has `lastConvergedIndex` set to `-1`. This is not important to represent in the stats, so with this commit we map it to zero.

[DOCS] DISSECT does not support reference keys (elastic#102002)

df91976

Mute tests for elastic#102000 (elastic#102006)

1d77446

It's in the title: these two tests are failing nearly 100% of the time. For elastic#102000

Mute test for elastic#102010 (elastic#102011)

aaded1b

Muting this one that keeps failing for elastic#102010

Remove some more explicit SearchResponse use in tests (elastic#102003)

1f9b9fb

The title says the whole story

Allowing non-dynamic index settings to be updated by automatically un…

a4de390

…assigning shards (elastic#101723)

ES-6566: Move the calculation of data tier usage stats to individual …

3639787

…nodes (elastic#100230) (elastic#101599)

[ci] Fix build scan annotations on Windows (elastic#101990)

1c09614

gmarouli and others added 26 commits November 13, 2023 14:21

Revert "ES-6566: Move the calculation of data tier usage stats to ind…

cf55c12

…ividual nodes (elastic#100230) (elastic#101599)" (elastic#102042) Reverting because the new action is not properly handled in a mixed cluster.

Fix caching for StandaloneRestIntegTest tasks (elastic#102043)

395dd6b

This fixes caching issues when running StandaloneRestIntegTest tasks, introduced in the aftermath of elastic#101923. This should fix elastic#102015.

Mute UpgradeClusterClientYamlTestSuiteIT

14e7635

Revert "Mute UpgradeClusterClientYamlTestSuiteIT"

fba946c

This reverts commit e9e948d.

Remove explicit search responses from IndexAliasIT (elastic#101996)

8693c79

Just like the other ones.

Removing the use of Version.CURRENT from watcher (elastic#102045)

f5ac741

null check on searchResponse (elastic#102017)

e18d126

closes elastic#102016

Update tests to decrement ref count (elastic#102044)

0e5d09e

Part of the broader work covered in elastic#102030

Updates tests in:
- HighlighterWithAnalyzersTests.java
- TokenCountFieldMapperIntegrationIT
- GeoIpDownloaderIT.java
- DataStreamIT.java

Remove explicit SearchResponse references from server bucket aggs (pa…

62a5a09

…rt1) (elastic#102035)

Remove explicit SearchResponse references from server bucket aggs (pa…

2058d22

…rt2) (elastic#102036)

Remove explicit SearchResponse references from server bucket aggs (pa…

9e9b42e

…rt3) (elastic#102037)

Remove explicit SearchResponse references from server bucket aggs (pa…

f481bad

…rt4) (elastic#102038)

Repo analysis: allow configuration of register ops (elastic#102051)

74d9593

Adds the `?register_operation_count` parameter, which controls the number of register operations separately from the number of regular blob operations.

Using VersionInformation explicitly (elastic#101962)

9281730

Remove some more explicit SearchResponse use in tests Pt2 (elastic#10…

1280662

…2029) The title says the whole story part 2

[DOCS] Clarify ES|QL grok escaping (elastic#102059)

4f91fb1

[DOCS] Fix typo (elastic#101791)

b9f8b89

Remove obsolete version check for autoid creates (elastic#101623)

35631ae

Update connector templates to use a historical feature (elastic#101531)

27080db

Removing explicit SearchResponse usages in tests - v3 (elastic#102019)

baa3aa9

Tests covered in this PR:
* `org.elasticsearch.percolator.PercolatorQuerySearchIT`

AwaitsFix for elastic#102070

86d1a95

Removing explicit SearchResponse usages in tests (elastic#102008)

0da28fd

Tests covered in this PR:
* `org.elasticsearch.xpack.enrich.EnrichPolicyRunnerTests`
* `org.elasticsearch.index.engine.frozen.FrozenIndexIT`
* `org.elasticsearch.index.engine.frozen.FrozenIndexTests`

Prefix strings

97d6325

Add prefix strings to model package config

99fbdbf

davidkyle added the cloud-deploy Publish cloud docker image for Cloud-First-Testing label Nov 13, 2023

davidkyle mentioned this pull request Nov 13, 2023

[ML] Add prefix strings option to trained models #102089

Merged

davidkyle closed this Nov 13, 2023
