
[ML] Add prefix strings option to trained models #101978

Closed
wants to merge 46 commits into from

davidkyle
Member

@davidkyle davidkyle commented Nov 9, 2023

Certain NLP models such as multilingual-e5-large require a prefix string to be applied to the input text. For asymmetric tasks such as information retrieval, the prefix can differ between ingesting the data and searching it. For example, a text embedding model can have one prefix applied when the model is evaluated as part of a kNN search and a different prefix when ingesting documents.

An example model configuration with prefix strings is:

{
    "model_type": "pytorch",
    "inference_config": {
        "text_embedding": {
            "tokenization": {
                "xlm_roberta": {
                    "do_lower_case": false,
                    "with_special_tokens": true,
                    "max_sequence_length": 512,
                    "truncate": "first",
                    "span": -1
                }
            },
            "embedding_size": 384
        }
    },
    "prefix_strings": {
        "search": "this is a query",
        "ingest": "this is a passage"
    }
}

Many files are touched by this change, but the bulk of the work is quite simple: define the configuration object and pass the prefix type (context) parameter down through the inference calls.
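As a rough illustration of that idea (not the code in this PR), the sketch below picks a prefix from a `prefix_strings` section based on the calling context and prepends it to the input text. The function names `select_prefix` and `apply_prefix` are hypothetical, and plain concatenation with no added separator is an assumption.

```python
from typing import Optional

# Hypothetical sketch; names and behaviour are assumptions, not the
# Elasticsearch implementation.
PREFIX_TYPES = ("search", "ingest")


def select_prefix(model_config: dict, prefix_type: Optional[str]) -> str:
    """Return the prefix configured for the given context, or "" if none applies."""
    if prefix_type is None:
        # No context supplied: leave the input untouched.
        return ""
    if prefix_type not in PREFIX_TYPES:
        raise ValueError(f"unknown prefix type: {prefix_type!r}")
    # A model without a prefix_strings section simply gets no prefix.
    return model_config.get("prefix_strings", {}).get(prefix_type, "")


def apply_prefix(model_config: dict, text: str, prefix_type: Optional[str] = None) -> str:
    """Prepend the context-specific prefix to the input text."""
    # Assumption: the prefix is concatenated as-is, so any separator
    # (such as a trailing space) must be part of the configured string.
    return select_prefix(model_config, prefix_type) + text


config = {
    "prefix_strings": {
        "search": "this is a query",
        "ingest": "this is a passage",
    }
}

search_input = apply_prefix(config, " what is a prefix?", "search")
ingest_input = apply_prefix(config, " a prefix is added to the text", "ingest")
```

The caller (a kNN search or an ingest pipeline, in the terms of the description above) only has to state which context it is in; the model configuration decides what text, if any, is added.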

@davidkyle davidkyle changed the title Add prefix strings option to trained models [ML] Add prefix strings option to trained models Nov 10, 2023
@davidkyle davidkyle added >enhancement :ml Machine learning labels Nov 10, 2023
@elasticsearchmachine
Collaborator

Hi @davidkyle, I've created a changelog YAML for you.

@davidkyle davidkyle marked this pull request as ready for review November 10, 2023 13:00
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Nov 10, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

Contributor

@droberts195 droberts195 left a comment

LGTM, but if we're going to have packaged models on GCS that use this feature then the new section will need to also go in this class:

davidkyle and others added 2 commits November 13, 2023 12:42
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
Contributor

@szabosteve szabosteve left a comment

Docs LGTM, thanks!

@davidkyle
Member Author

@elasticmachine update branch

@elasticmachine
Collaborator

merge conflict between base and head

DaveCTurner and others added 9 commits November 13, 2023 14:20
The initial state of the desired-balance allocator has
`lastConvergedIndex` set to `-1`. This is not important to represent in
the stats, so with this commit we map it to zero.
…er (elastic#101912)

This commit addresses an issue in the passage formatter of the unified highlighter, where overlapping terms were not correctly expanded to be highlighted as a single object. The fix in this commit involves adjusting the expansion logic to consider the maximum end offset during the process, as matches are initially sorted by ascending start offset and then by ascending end offset.
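
The expansion described in that commit message can be sketched as a standard interval merge. This is a minimal illustration, not the unified highlighter's code: the function name `merge_overlapping` is hypothetical, and matches are assumed to be `(start, end)` offset pairs with an exclusive end.

```python
def merge_overlapping(matches):
    """Merge overlapping (start, end) spans into single spans."""
    # Tuples sort by start offset first, then by end offset.
    ordered = sorted(matches)
    merged = []
    for start, end in ordered:
        if merged and start < merged[-1][1]:
            # Overlaps the current span: keep the maximum end offset seen
            # so far, because a later match can end before an earlier one.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


spans = merge_overlapping([(0, 10), (2, 5), (7, 12), (20, 25)])
```

Taking the end offset of the most recent match instead of the maximum would shrink the first span to end at 5 after `(2, 5)`, so `(7, 12)` would no longer be merged into it.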
It's in the title these two tests are failing near 100% of the time.

for elastic#102000
Muting this one that keeps failing for elastic#102010
gmarouli and others added 26 commits November 13, 2023 14:21
…ividual nodes (elastic#100230) (elastic#101599)" (elastic#102042)

Reverting because the new action is not properly handled in a mixed
cluster.
This fixes running caching issues for StandaloneRestIntegTest tasks as
an aftermath of elastic#101923

This should fix elastic#102015
Part of the broader work covered in
elastic#102030

Updates tests in:

- HighlighterWithAnalyzersTests.java
- TokenCountFieldMapperIntegrationIT
- GeoIpDownloaderIT.java
- DataStreamIT.java
We encountered a bad third-party S3 repository implementation which
incorrectly rejects empty multipart uploads. This anomaly is detected by
some repository analysis runs, but not by all of them. This commit adds
a specific check for this incompatibility so that it can be reported
reliably.
Adds the `?register_operation_count` parameter, which controls
the number of register operations separately from the number of regular
blob operations.
* Add inference counts by NLP model to the machine learning usage stats.

* Update docs/changelog/101915.yaml

* Add inference_counts_by_model to yamlRestTest.

* Strip leading dot from internal model IDs.

* Add last access and task type to the stats by model.

* Change stats_by_model from map to list

* Simplify code.

* Fix style
Tests covered in this PR:

* `org.elasticsearch.percolator.PercolatorQuerySearchIT`
Tests covered in this PR:

* `org.elasticsearch.xpack.enrich.EnrichPolicyRunnerTests`
* `org.elasticsearch.index.engine.frozen.FrozenIndexIT`
* `org.elasticsearch.index.engine.frozen.FrozenIndexTests`
@davidkyle davidkyle added the cloud-deploy Publish cloud docker image for Cloud-First-Testing label Nov 13, 2023
@davidkyle
Member Author

Closing as I've made a mess with git. #102089 is raised as a replacement

@davidkyle davidkyle closed this Nov 13, 2023
Labels
cloud-deploy Publish cloud docker image for Cloud-First-Testing >enhancement :ml Machine learning Team:ML Meta label for the ML team v8.12.0