[ML] adds new question_answering NLP task for extracting answers to questions from a document #85958

benwtrent · 2022-04-18T16:21:01Z

This commit adds a new question_answering task.

The question_answering task allows supplying a question in the inference config update.

When storing the model config for inference:

"inference_config": {
  "question_answering": {
    "tokenization": {...}, // tokenization settings; recommended: max sequence length of 386, span of 128, and no truncation
    "max_answer_length": 15 // the max answer length to consider
  }
}
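
To make the role of `max_answer_length` concrete, here is a minimal sketch of how extractive question answering commonly picks an answer: score every candidate (start, end) token pair and keep the best one whose length does not exceed the limit. This is the general technique, not the code in this PR; the class name `BestSpan`, the method `bestSpan`, and the score arrays are hypothetical.

```java
// Hypothetical illustration; not taken from QuestionAnsweringProcessor.
class BestSpan {
    /**
     * Returns {start, end} (inclusive token indices) maximising
     * startScores[start] + endScores[end], considering only spans that
     * cover at most maxAnswerLength tokens.
     */
    static int[] bestSpan(double[] startScores, double[] endScores, int maxAnswerLength) {
        int bestStart = -1;
        int bestEnd = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int start = 0; start < startScores.length; start++) {
            // The end index may not run past the array or past the length limit.
            int endLimit = Math.min(endScores.length, start + maxAnswerLength);
            for (int end = start; end < endLimit; end++) {
                double score = startScores[start] + endScores[end];
                if (score > bestScore) {
                    bestScore = score;
                    bestStart = start;
                    bestEnd = end;
                }
            }
        }
        return new int[] { bestStart, bestEnd };
    }
}
```

A smaller limit prunes long candidate spans, which keeps the search cheap and avoids answers that swallow most of the passage.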

Then, when calling _infer or running within a pipeline, add the question you want answered, given the context provided by the document text:

{
  "docs": [{ "text_field": <some long text field to extract the answer from> }],
  "inference_config": {
    "question_answering": {
      "question": <the question to answer>
    }
  }
}

The response then looks like:

{
    "predicted_value": <string subsection of the document that is the answer>,
    "start_offset": <char offset in the document where the answer starts>,
    "end_offset": <char offset in the document where the answer ends>,
    "prediction_probability": <prediction score>
}
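
As a rough sketch of how the offsets relate to `predicted_value`, the snippet below cuts the answer back out of the original text. It assumes `start_offset` is inclusive and `end_offset` is exclusive, which the description above does not state; the class `AnswerSlice` and its method are hypothetical.

```java
// Hypothetical helper; the inclusive/exclusive offset convention is an assumption.
class AnswerSlice {
    /** Returns the part of the document that the two character offsets delimit. */
    static String slice(String document, int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset > document.length() || startOffset > endOffset) {
            throw new IllegalArgumentException("offsets out of range for document");
        }
        return document.substring(startOffset, endOffset);
    }

    public static void main(String[] args) {
        String document = "The library opens at nine and closes at five.";
        String predictedValue = "nine";
        // Stand-ins for the start_offset and end_offset fields of a response.
        int startOffset = document.indexOf(predictedValue);
        int endOffset = startOffset + predictedValue.length();
        System.out.println(slice(document, startOffset, endOffset).equals(predictedValue));
    }
}
```

Offsets like these let a client highlight the answer in place without searching the document for the answer string again.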

Some models tested:


elasticmachine · 2022-04-18T16:21:05Z

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine · 2022-04-18T16:21:26Z

Hi @benwtrent, I've created a changelog YAML for you.


davidkyle

Nits, nothing more.

LGTM. I added the cloud-deploy label so I can test it out.

...ava/org/elasticsearch/xpack/core/ml/inference/results/QuestionAnsweringInferenceResults.java

...ain/java/org/elasticsearch/xpack/core/ml/inference/trainedmodel/QuestionAnsweringConfig.java

...in/ml/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/QuestionAnsweringProcessor.java

davidkyle · 2022-05-03T10:29:49Z

...lugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/tokenizers/NlpTokenizer.java

@@ -98,7 +98,8 @@ public List<TokenizationResult.Tokens> tokenize(String seq, Tokenization.Truncat
            );
            // Make sure we do not end on a word
            if (splitEndPos != tokenIds.size()) {
-                while (Objects.equals(tokenPositionMap.get(splitEndPos), tokenPositionMap.get(splitEndPos - 1))) {
+                while (splitEndPos > splitStartPos + 1


not sure what is happening here

We need to verify that the end is actually after the start. If we don't, we could get into a death spiral where the splitting never moves forward (think of the adversarial case of very small sequences with overlaps that are almost the same size as the sequence).
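
The guard can be illustrated with a self-contained sketch. This is not the `NlpTokenizer` code (the hunk above is truncated); it is a simplified window splitter with hypothetical names (`SpanSplitter`, `split`, `wordIndex`). It backs the window end off a word boundary but never lets it reach the start, and it also forces the next start forward, an extra safeguard assumed here for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Simplified, hypothetical sketch of spanning with a forward-progress guard.
class SpanSplitter {
    /**
     * Splits token positions [0, wordIndex.size()) into windows {start, end}
     * (end exclusive) of at most windowSize tokens. wordIndex.get(i) is the
     * index of the word that token i belongs to.
     */
    static List<int[]> split(List<Integer> wordIndex, int windowSize, int overlap) {
        if (windowSize < 1) {
            throw new IllegalArgumentException("windowSize must be at least 1");
        }
        List<int[]> windows = new ArrayList<>();
        int n = wordIndex.size();
        int start = 0;
        while (start < n) {
            int end = Math.min(start + windowSize, n);
            if (end != n) {
                // Avoid splitting inside a word, but keep at least one token in
                // the window; without the first condition the end could walk
                // back to (or past) the start.
                while (end > start + 1 && Objects.equals(wordIndex.get(end), wordIndex.get(end - 1))) {
                    end--;
                }
            }
            windows.add(new int[] { start, end });
            if (end == n) {
                break;
            }
            // Step back by the overlap, but always advance by at least one token.
            start = Math.max(end - overlap, start + 1);
        }
        return windows;
    }
}
```

With five tokens that all belong to one word, a window of 2, and an overlap of 1, the inner loop would otherwise walk the end back to the start; with the guard the splitter emits four non-empty windows and terminates.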

davidkyle · 2022-05-03T11:27:09Z

...ml/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/tokenizers/TokenizationResult.java

-            for (int j = 0; j < maxLength; j++) {
-                builder.value(0);
+            // Just a single sequence within this tokenization
+            if (inputTokens.seqPairOffset <= 0) {


...lugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/tokenizers/NlpTokenizer.java

davidkyle · 2022-05-03T12:24:05Z

...lugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/nlp/tokenizers/NlpTokenizer.java

+     * @param sequenceId Unique sequence id for this tokenization
+     * @return tokenization result for the sequence pair
+     */
+    public List<TokenizationResult.Tokens> tokenize(String seq1, String seq2, Tokenization.Truncate truncate, int span, int sequenceId) {


This looks very similar to

tokenize( String seq1, InnerTokenization innerResultSeq1, String seq2, Tokenization.Truncate truncate, int sequenceId )

but with the spanning logic added. Is there an opportunity to refactor here?

Maybe, but the previous one shouldn't do spanning at all, mainly because it's an optimization that allows a single tokenization to be used multiple times. We just have no way of representing a spanned, batched encoding (a 4d tensor?).

I think some refactoring could be done as a whole, but it would take more churn on the rest of the tokenization code.


benwtrent · 2022-05-03T12:40:38Z

@elasticmachine update branch

benwtrent · 2022-05-03T13:28:05Z

@elasticmachine update branch

benwtrent · 2022-05-04T12:52:58Z

@elasticmachine update branch


[ML] adds new question_answering NLP task for extracting answers to questions from a document

2fd31cd

benwtrent added >enhancement :ml Machine learning v8.3.0 labels Apr 18, 2022

elasticmachine added the Team:ML Meta label for the ML team label Apr 18, 2022

Update docs/changelog/85958.yaml

eb88551

benwtrent mentioned this pull request Apr 18, 2022

[ML] add support for question_answering NLP tasks elastic/eland#457

Merged

benwtrent added 2 commits April 18, 2022 16:18

fixing testing

94fc760

Merge branch 'feature/ml-add-question-answering-nlp-task' of github.com:benwtrent/elasticsearch into feature/ml-add-question-answering-nlp-task

8bb46da

davidkyle approved these changes May 3, 2022


Apply suggestions from code review

1ba0d33

Co-authored-by: David Kyle <david.kyle@elastic.co>

Merge branch 'master' into feature/ml-add-question-answering-nlp-task

6e9dba9

benwtrent added the cloud-deploy Publish cloud docker image for Cloud-First-Testing label May 3, 2022

Merge branch 'master' into feature/ml-add-question-answering-nlp-task

18651b1

elasticmachine and others added 3 commits May 4, 2022 22:23

Merge branch 'master' into feature/ml-add-question-answering-nlp-task

15488a4

fixing results serialization

6c09d24

fixing ingest processor

31b693f

benwtrent merged commit b7f24bd into elastic:master May 4, 2022

benwtrent deleted the feature/ml-add-question-answering-nlp-task branch May 4, 2022 17:15

benwtrent added a commit to elastic/eland that referenced this pull request May 4, 2022

[ML] add support for question_answering NLP tasks (#457)

70fadc9

Adds support for `question_answering` NLP models within the pytorch model uploader. Related: elastic/elasticsearch#85958

This was referenced May 4, 2022

[ML] add question_answering nlp configuration types elastic/elasticsearch-specification#1675

Closed

[ML] add question_answering config elements elastic/elasticsearch-specification#1687

Merged

jgowdyelastic mentioned this pull request May 11, 2022

[ML] Adding UI for question_answering model testing elastic/kibana#132033

Merged


This was referenced May 18, 2022

Adds question answering task to Extract information NLP docs elastic/stack-docs#2138

Merged

[DOCS] Adds settings of question_answering to inference_config of PUT and infer trained model APIs #86895

Merged

lcawl mentioned this pull request May 20, 2022

[DOCS] Add question_answering NLP task type elastic/eland#467

Merged

craigtaverner added the release highlight label Jun 22, 2022
