Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] adds new question_answering NLP task for extracting answers to questions from a document #85958

Merged

Conversation

benwtrent
Copy link
Member

This commit adds a new question_answering task.

The question_answering task allows supplying a question in the inference config update.

When storing the model config for inference:

"inference_config": {
  "question_answering": {
    "tokenization": {...}, // tokenization settings, recommend doing 386 max sequence length with 128 span, and no truncating
    "max_answer_length": 15 // the max answer length to consider
  }
}

Then when calling _infer or running with in a pipeline, add the question you want answered given the context provided by the document text

{
  "docs":[{ "text_field": <some long text field to extract answer}],
  "inference_config": {
    "question_answering": {
      "question": <Question desiring answer>
    }
  }
}

The response then looks like:

{
    "predicted_value": <string subsection of the document that is the answer>
    "start_offset": <Char offset in document to start>,
    "end_offset": <char offset end of the answer,
    "prediction_probability": <prediction score>
}

Some models tested:

@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Apr 18, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticsearchmachine
Copy link
Collaborator

Hi @benwtrent, I've created a changelog YAML for you.

…om:benwtrent/elasticsearch into feature/ml-add-question-answering-nlp-task
Copy link
Member

@davidkyle davidkyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nits, nothing more.

LGTM I added the cloud-deploy label so I can test it out

@@ -98,7 +98,8 @@ public List<TokenizationResult.Tokens> tokenize(String seq, Tokenization.Truncat
);
// Make sure we do not end on a word
if (splitEndPos != tokenIds.size()) {
while (Objects.equals(tokenPositionMap.get(splitEndPos), tokenPositionMap.get(splitEndPos - 1))) {
while (splitEndPos > splitStartPos + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what is happening here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to verify that the end is actually after the start. If we don't we could get into a death spiral where the splitting never moves forward (think adversarial case of very small sequences with overlaps that are almost the same as the sequence size).

for (int j = 0; j < maxLength; j++) {
builder.value(0);
// Just a single sequence within this tokenization
if (inputTokens.seqPairOffset <= 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good fix

* @param sequenceId Unique sequence id for this tokenization
* @return tokenization result for the sequence pair
*/
public List<TokenizationResult.Tokens> tokenize(String seq1, String seq2, Tokenization.Truncate truncate, int span, int sequenceId) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very similar to

tokenize( String seq1, InnerTokenization innerResultSeq1, String seq2, Tokenization.Truncate truncate, int sequenceId )

but with the spanning logic added. Is there an opportunity to refactor here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, but the previous one shouldn't do spanning at all. Mainly because its an optimization to allow a single tokenization to be used multiple times. We just have no way of representing a spanned batched encoding (4d tensor?).

I think some refactoring could be done as a whole, but it would take more churn on the rest of the tokenization code.

Co-authored-by: David Kyle <david.kyle@elastic.co>
@benwtrent
Copy link
Member Author

@elasticmachine update branch

@benwtrent benwtrent added the cloud-deploy Publish cloud docker image for Cloud-First-Testing label May 3, 2022
@benwtrent
Copy link
Member Author

@elasticmachine update branch

@benwtrent
Copy link
Member Author

@elasticmachine update branch

dhrubo-os pushed a commit to opensearch-project/opensearch-py-ml that referenced this pull request Oct 18, 2022
Adds support for `question_answering` NLP models within the pytorch model uploader.

Related: elastic/elasticsearch#85958
Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
dhrubo-os pushed a commit to opensearch-project/opensearch-py-ml that referenced this pull request Oct 18, 2022
Adds support for `question_answering` NLP models within the pytorch model uploader.

Related: elastic/elasticsearch#85958
Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
dhrubo-os added a commit to opensearch-project/opensearch-py-ml that referenced this pull request Oct 19, 2022
* [ML] Improve NLP model import by using nicely defined types (#459)

This adds some more definite types for our NLP tasks and tokenization configurations.

This is the first step in allowing users to more easily import their own transformer models via something other than hugging face.

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* [ML] add support for question_answering NLP tasks (#457)

Adds support for `question_answering` NLP models within the pytorch model uploader.

Related: elastic/elasticsearch#85958
Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* [ML] improve general pytorch model import and add tests (#463)

This improves the user consumed functions and classes for PyTorch NLP model upload to Elasticsearch.

Previously it was difficult to wrap your own module for uploading to Elasticsearch.

This commit splits some classes out, adds new ones, and adds tests showing how to wrap some simple modules.

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* Release 8.2.0

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* [ML] fixes decision tree classifier upload to account for probabilities (#465)

This switches our sklearn.DecisionTreeClassifier serialization logic to account for multi-valued leaves in the tree.

The key difference between our inference and DecisionTreeClassifier, is that we run a softMax over the leaf where sklearn simply normalizes the results.

This means that our "probabilities" returned will be different than sklearn.

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* Add authentication methods for import model script (#466)

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* Ignore type checking for `agg_value`

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* [DOCS] Adds question_answering task type for eland_import_hub_model

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* Stop explicitly pulling master

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* Remove 'numpydoc' to stop reformatting

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* Also pin traitlets

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* [DOCS] Include missing attributes (#468)

Co-authored-by: Seth Michael Larson <seth.larson@elastic.co>
Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* [ML] ensure quantization is applied (#472)

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* Freeze the traced PyTorch model

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* Bump minimum PyTorch version to 1.11

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* [ML] adds new auto task type that attempts to automatically determine NLP task type from model config (#475)

For many model types, we don't need to require the task requested. We can infer the task type based on the model configuration and architecture.

This commit makes the `task-type` parameter optional for the model up load script and adds logic for auto-detecting the task type based on the 🤗 model.

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* added opensearch as dependency

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* replaced core mentions of elasticsearch client w opensearch

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* changed index names for testing

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* modified test dataframes to accommodate opensearch indexing

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* fixed aggregatable field name tests

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* fixing pytests that mention indices of ed/pd dataframes

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* fixed equality boolean filter to accommodate terminology difference in elastic vs open search

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* fixed pytests with indexing issues, geolocation field renaming issues

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* modified test setup code to work for opensearch

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* reverted many erroneous "fixes" to tests

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* fixed opensearch integration so remaining non-ml tests run

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* added initial connection to predicting with sagemaker

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* added sagemaker predict api

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* added band-aid to fix iterating over rows

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* debugging indexing issue

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* reverted indexing change for sagemaker predict

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* added deprecation warnings to ml module

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* refactoring elasticsearch names to opensearch

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* continued renaming opensearch variables

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* more renaming changes

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* first commit for ml common integration

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* PoC for model upload

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* renamed model chunk uploading path

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* added total chunks to model upload

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* fixed docstring typo

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* added first iteration of custom model load supprot

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* removed unsupported features

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* renaming all instances of elastic in code

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* created new dev requirements file

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* typo fix

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* PR feedback

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* implement PR feedback

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* PR feedback

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* implement pr feedback

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* Update README.md

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* added demo materials

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* refactoring

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* refactoring code and changed code to address some of the deprection warning

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* adding header license info

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* formatted code with black, isort, mypy

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* updating git ci workflow

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* refactoring code + adding pytest in the ci workflow

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* removing test from ci workflow

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* setup CI for integration test

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* adding files required for CI

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* adding files which got deleted during merge

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* adding deleted files by git merge

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
Co-authored-by: Seth Michael Larson <seth.larson@elastic.co>
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
Co-authored-by: Nigel Small <nasmall@pm.me>
Co-authored-by: David Kyle <david.kyle@elastic.co>
Co-authored-by: Thomas Ma <thomayinma@gmail.com>
Co-authored-by: Thomas Ma <31194803+LEFTA98@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cloud-deploy Publish cloud docker image for Cloud-First-Testing >enhancement :ml Machine learning release highlight Team:ML Meta label for the ML team v8.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants