Execute Enrich policy task with wait_for_completion=false does not retain task status after completion #70554

Closed
askids opened this issue Mar 18, 2021 · 13 comments
Labels: >bug, :Data Management/Ingest Node, Team:Data Management

Comments

@askids

askids commented Mar 18, 2021

Elasticsearch version (bin/elasticsearch --version): 7.8.1
OpenJDK 64-Bit Server VM warning: Ignoring option UseConcMarkSweepGC; support was removed in 14.0
OpenJDK 64-Bit Server VM warning: Ignoring option CMSInitiatingOccupancyFraction; support was removed in 14.0
OpenJDK 64-Bit Server VM warning: Ignoring option UseCMSInitiatingOccupancyOnly; support was removed in 14.0
Version: 7.8.1, Build: unknown/unknown/b5ca9c58fb664ca8bf9e4057fc229b3396bf3a89/2020-07-21T16:40:44.668009Z, JVM: 14.0.1

Plugins installed: [readonlyrest - 1.28.0]

JVM version (java -version):
openjdk 14.0.1 2020-04-14
openJDK Runtime Environment AdoptOpenJDK <build 14.0.1+7>
openJDK 64-Bit Server VM AdoptOpenJDK <build 14.0.1+7, mixed mode, sharing>

OS version (uname -a if on a Unix-like system): Windows 2012 R2

Description of the problem including expected versus actual behavior:
When we execute an enrich policy with the parameter wait_for_completion=false, we get the task id back, but we cannot consistently query the status of the task via the GET _tasks/<task_id> endpoint. If we query immediately, the response shows completed as false along with the status. Subsequent attempts return different kinds of errors, depending on how long after the execution the GET request was made.

The expected behavior is that GET _tasks returns the proper status even after the task has completed. Without the completion status of the task, we won't be able to implement a reliable polling process to verify that the enrich policy execution completed successfully. We are required to update the enrich index daily to pick up updated data from the source index, so we need to be able to get the task status reliably after executing the policy.

Steps to reproduce:

  1. Create enrich policy
  2. Execute the enrich policy with parameter wait_for_completion=false
  3. Perform GET _tasks/<task_id> with the returned task id
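
Steps 2 and 3 can be sketched as plain HTTP requests. This is only an illustration, not part of the original report: the base URL (a local cluster without authentication) and the helper names are assumptions, and the requests are built but not sent.

```python
from urllib.parse import quote
from urllib.request import Request

# Assumption: a local cluster without authentication. Adjust for your setup.
BASE_URL = "http://localhost:9200"


def execute_policy_request(policy_name, base_url=BASE_URL):
    """Step 2: start the policy execution in the background."""
    path = f"/_enrich/policy/{quote(policy_name, safe='')}/_execute"
    return Request(f"{base_url}{path}?wait_for_completion=false", method="POST")


def get_task_request(task_id, base_url=BASE_URL):
    """Step 3: look up the task by the id returned from step 2."""
    # Task ids look like "<node id>:<number>", so the colon is left unescaped.
    return Request(f"{base_url}/_tasks/{quote(task_id, safe=':')}", method="GET")
```

Each request can be sent with `urllib.request.urlopen(request)`. Note that `urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses, so the error bodies have to be read from the exception.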

Provide logs (if relevant):

POST /_enrich/policy/my_enrich_policy_name/_execute?wait_for_completion=false

GET _tasks/oFKHJq8iSi69dXLxKh7EMA:4907254

{
  "completed" : false,
  "task" : {
    "node" : "oFKHJq8iSi69dXLxKh7EMA",
    "id" : 4907254,
    "type" : "enrich",
    "action" : "policy_execution",
    "status" : {
      "phase" : "RUNNING"
    },
    "description" : "my_enrich_policy_name",
    "start_time_in_millis" : 1616054637089,
    "running_time_in_nanos" : 7782110814,
    "cancellable" : false,
    "parent_task_id" : "oFKHJq8iSi69dXLxKh7EMA:4907195",
    "headers" : { }
  }
}

{
  "error" : {
    "root_cause" : [
      {
        "type" : "transport_serialization_exception",
        "reason" : "Failed to deserialize response from handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler]"
      }
    ],
    "type" : "transport_serialization_exception",
    "reason" : "Failed to deserialize response from handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler]",
    "caused_by" : {
      "type" : "illegal_argument_exception",
      "reason" : "Unknown NamedWriteable [org.elasticsearch.tasks.Task$Status][enrich-policy-execution]"
    }
  },
  "status" : 500
}


{
  "error" : {
    "root_cause" : [
      {
        "type" : "resource_not_found_exception",
        "reason" : "task [oFKHJq8iSi69dXLxKh7EMA:4907254] isn't running and hasn't stored its results"
      }
    ],
    "type" : "resource_not_found_exception",
    "reason" : "task [oFKHJq8iSi69dXLxKh7EMA:4907254] isn't running and hasn't stored its results"
  },
  "status" : 404
}

Thanks!
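
For reference, the three kinds of responses shown in the logs can be told apart programmatically. The following is a minimal sketch; the function name and the returned labels are made up for illustration, and it only looks at fields that appear in the responses above.

```python
def classify_task_lookup(http_status, body):
    """Map a GET _tasks/<task_id> response to a coarse label.

    `body` is the parsed JSON response. The labels are illustrative only.
    """
    error = body.get("error")
    error_type = error.get("type") if isinstance(error, dict) else None

    if http_status == 200 and "completed" in body:
        return "completed" if body["completed"] else "running"
    if error_type == "transport_serialization_exception":
        # The 500 response: the task status could not be deserialized.
        return "serialization_error"
    if http_status == 404 and error_type == "resource_not_found_exception":
        # The task is no longer running and no result was stored,
        # so success and failure cannot be distinguished here.
        return "gone_without_result"
    return "unexpected"
```

A poller built on this can only report "gone_without_result" once the task finishes, which is exactly the gap this issue describes.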

@askids askids added >bug needs:triage Requires assignment of a team area label labels Mar 18, 2021
@matriv matriv added the :Data Management/Ingest Node Execution or management of Ingest Pipelines including GeoIP label Mar 23, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Mar 23, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@martijnvg
Member

There are two parts to this request:

  • There is a serialization error. The ExecuteEnrichPolicyTask isn't registered correctly. (this needs to be fixed)
  • The node task that performs the policy execution needs to be stored in the .tasks index, otherwise the status can't be known after a node task has been executed.

Referencing #51628, since that will redefine how the task APIs should be used. In light of that, perhaps we should have a dedicated api to query the status of async policy executions (instead of the above second bullet point).

@gwbrown gwbrown removed the needs:triage Requires assignment of a team area label label Mar 26, 2021
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue May 4, 2021
When executing the enrich execute policy api and not waiting for completion,
querying for the task via the task list api can result in a serialization error.

Relates to elastic#70554
@martijnvg martijnvg self-assigned this May 4, 2021
@martijnvg
Member

There is a serialization error. The ExecuteEnrichPolicyTask isn't registered correctly. (this needs to be fixed)

Actually, this is already fixed via #62364, and the fix is available from version 7.10. So upgrading should fix that serialisation error in your case.

@martijnvg
Member

Instead of checking the tasks api for the status of a policy that is executing in the background, I think it is easier to use the enrich stats api: GET /_enrich/_stats. This includes details about policies that are currently executing.

It returns something like this:

{
    "executing_policies": [
        {
            "name": "my-policy",
            "task": {
                "node": "mYT-5C6tRTm9_v6q5GF22w",
                "id": 5190,
                "type": "enrich",
                "action": "policy_execution",
                "status": {
                    "phase": "RUNNING"
                },
                "description": "my-policy",
                "start_time_in_millis": 1620140016776,
                "running_time_in_nanos": 1266045350,
                "cancellable": false,
                "parent_task_id": "mYT-5C6tRTm9_v6q5GF22w:5189",
                "headers": {}
            }
        }
    ],
    "coordinator_stats": [
       ...
    ]
}

This is also more useful, since it returns the task information on a per-policy basis (by name), so it is easier to look up and there is no need to record the task id that the execute policy api returns.
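
A minimal sketch of polling by policy name, assuming the response shape shown above. The helper names, the polling interval, and the timeout are illustrative; `fetch_stats` is a caller-supplied function that returns the parsed GET /_enrich/_stats response.

```python
import time


def executing_task(stats, policy_name):
    """Return the task entry for a currently executing policy, or None."""
    for entry in stats.get("executing_policies", []):
        if entry.get("name") == policy_name:
            return entry.get("task")
    return None


def wait_while_executing(fetch_stats, policy_name,
                         interval_s=5.0, timeout_s=600.0, sleep=time.sleep):
    """Poll until the policy is no longer listed as executing.

    Returns True once the policy is absent from executing_policies and
    False if the timeout expires while it is still listed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if executing_task(fetch_stats(), policy_name) is None:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A policy that is absent from executing_policies is only known not to be running at that moment. The response shown carries no record of past executions, so this loop cannot tell a successful run from a failed one.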

@askids If this api also returned the task information from past executions (the last execution for each policy), would this allow you to consistently fetch the status of a policy execution?

@askids
Author

askids commented May 4, 2021

@askids If this api also returned the task information from past executions (the last execution for each policy), would this allow you to consistently fetch the status of a policy execution?

Yes @martijnvg, that can work if it shows the last execution of each policy along with its status. But currently, if there are no executing policies, it won't show anything. So we wouldn't be able to tell whether the execution was successful or whether the list was empty because the execution was cancelled, failed, etc.

@askids
Author

askids commented May 4, 2021

There is a serialization error. The ExecuteEnrichPolicyTask isn't registered correctly. (this needs to be fixed)

Actually, this is already fixed via #62364, and the fix is available from version 7.10. So upgrading should fix that serialisation error in your case.

We are scheduled to upgrade to 7.10.2 (from 7.8.1) in another 3 weeks. Maybe I can verify it then on the newer version.

@martijnvg
Member

But currently, if there are no executing policies, it won't show anything. So we wouldn't be able to tell whether the execution was successful or whether the list was empty because the execution was cancelled, failed, etc.

Yes, this is something that I think can be improved in the current enrich stats api.

We are scheduled to upgrade to 7.10.2 (from 7.8.1) in another 3 weeks. Maybe I can verify it then on the newer version.

That would be great!

@askids
Author

askids commented May 24, 2021

Hi @martijnvg,

We completed the upgrade to 7.10.2. I checked for the serialization issue with the GET _tasks api, and I no longer get that error. When I keep running GET _tasks, it now moves directly from RUNNING status to resource_not_found_exception after the task has completed. So at least one part of the reported issue seems to be fixed. That leaves us with the main issue of trying to find task status of a completed enrich task using task id.

Thanks!

@martijnvg
Member

@askids Thanks for letting us know!

That leaves us with the main issue of trying to find task status of a completed enrich task using task id.

I've opened #73353 to track this feature request.

@askids
Author

askids commented Jul 18, 2021

@martijnvg we upgraded to 7.10.2. Now I am starting to see the same issue with reindex activity as well. When I run reindex with wait_for_completion=false and use the returned task id to get the status using GET _tasks/<task_id>, on many occasions (even when the task is still running) I get the same error as originally reported: "isn't running and hasn't stored its results". Should I submit a separate issue for it?

@martijnvg
Member

@askids I think the get task api should be used to retrieve the information about the reindex task. The get task api should check the tasks index in case the task has completed execution. You should use the task id returned from the reindex api as the argument to the get task api.
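
A rough sketch of that flow for reindex. It assumes the reindex api called with wait_for_completion=false returns a body of the form {"task": "<node id>:<number>"}, and that a stored task result carries either a `response` object (with a `failures` list for reindex) or an `error` object. The helper names are made up, and the field names should be checked against the documentation for your version.

```python
def task_lookup_path(reindex_response):
    """Build the get task api path from the async reindex response."""
    return "/_tasks/" + reindex_response["task"]


def summarize_task(body):
    """Summarize a parsed get task api response for a reindex task."""
    if not body.get("completed", False):
        return "running"
    if "error" in body:
        # Assumed: a failed task stores an "error" object instead of "response".
        return "failed"
    failures = body.get("response", {}).get("failures", [])
    return "completed_with_failures" if failures else "succeeded"
```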

@askids
Author

askids commented Jul 20, 2021

Yes. That is what we have always been doing. But since the recent upgrade to 7.10.2, when we run a reindex task with wait_for_completion=false, the returned task id cannot be queried using the GET _tasks api. It works for some ids and not for others. If I run multiple reindex tasks from dev tools in one shot, none of the returned ids can be queried. If I run one reindex script at a time, the returned id can be queried.

Initially, I thought that the reindex script was bad. But I could see the doc count on the index increasing, as it was a long-running process, and yet I was getting the "isn't running and hasn't stored its results" message. So either the reindex API returned the wrong task id, or GET _tasks is not able to pull up the status due to some other issue.

@martijnvg
Member

If the get task api doesn't return a task for a completed async reindex execution, then I think that is a bug. As far as I can see, that should work (whereas for the execute policy api this is currently not implemented). Opening a separate issue for this makes sense.
