
[Ingest Management] Can't update the system package #82580

Closed · mtojek opened this issue Nov 4, 2020 · 22 comments
Labels: Team:Fleet (Team label for Observability Data Collection Fleet team)

Comments

mtojek (Contributor) commented Nov 4, 2020

Hi,

It looks like the following command doesn't work as intended (or at least not under this specific condition):

$ curl -u elastic:${PASSWORD} -k -X POST https://<host>:443/api/fleet/epm/packages/system-0.9.0 -H 'kibana-xsf: blah' -H 'kbn-xsrf: blah' -H 'Content-Type: application/json' -d '{ "force": true }'

Result:

{"statusCode":502,"error":"Bad Gateway","message":"'404 Not Found' error response from package registry at https://epr-staging.elastic.co/package/system/0.7.0/"}

Expected result: the system package "0.7.0" may not exist anymore, but Kibana should install the selected version, which is available.
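
For anyone reproducing this, one way to double-check which versions the staging registry actually serves for the system package is to query its search endpoint directly. This is only a sketch: the /search endpoint and the all=true parameter are assumptions about the package registry API, not something taken from this issue.

# List the system package versions known to the staging registry (all=true is assumed to return every version, not only the latest)
$ curl -s "https://epr-staging.elastic.co/search?package=system&all=true"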

mtojek added the Team:Fleet label on Nov 4, 2020
elasticmachine (Contributor)

Pinging @elastic/ingest-management (Team:Ingest Management)

ruflin (Contributor) commented Nov 4, 2020

@neptunian Could you have a look at the above and comment on the expected behaviour?

skh (Contributor) commented Nov 4, 2020

My current theory is that the update failed and the call to the registry for the old package happened during rollback.

This is a known issue and will be fixed with #81110

Were there other errors in the Kibana log before the 404?
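
To make the suspected flow concrete, here is a minimal TypeScript sketch of an upgrade-with-rollback path that would produce exactly this symptom. The type and function names are hypothetical, not Fleet's actual code:

// Sketch only: shows how a failed install of system-0.9.0 can surface a
// registry 404 for the previously installed system-0.7.0 during rollback.
type PackageArchive = { name: string; version: string; body: ArrayBuffer };

// Placeholder for a registry call such as GET https://epr-staging.elastic.co/package/system/0.7.0/
async function fetchFromRegistry(name: string, version: string): Promise<PackageArchive> {
  const res = await fetch(`https://epr-staging.elastic.co/package/${name}/${version}/`);
  if (!res.ok) {
    // This is the kind of error that ends up in the HTTP 502 response above.
    throw new Error(`'${res.status} ${res.statusText}' error response from package registry`);
  }
  return { name, version, body: await res.arrayBuffer() };
}

// Placeholder for installing the package assets into Kibana / Elasticsearch.
async function installAssets(pkg: PackageArchive): Promise<void> {
  // ...
}

async function upgradePackage(name: string, newVersion: string, previousVersion: string): Promise<void> {
  try {
    await installAssets(await fetchFromRegistry(name, newVersion)); // e.g. system-0.9.0
  } catch (installError) {
    // Rollback: re-fetch the previously installed version. If that version
    // (e.g. system-0.7.0) has been removed from the registry, this call 404s,
    // and the 404 masks the original installation error in the API response.
    await installAssets(await fetchFromRegistry(name, previousVersion));
  }
}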

mtojek (Contributor, Author) commented Nov 4, 2020

Not sure how to check logs in this cloud environment; maybe @kuisathaverat can help here. AFAIK nothing was found during yesterday's investigation.

kuisathaverat (Contributor)

I'll send you the instructions on Slack.

skh (Contributor) commented Nov 4, 2020

There are errors of this type before the 404s that look related:

09:44:18.000
kibana.log
{ Error: Saved object [dashboard/system-0d3f2380-fa78-11e6-ae9b-81e5311e8cab] not found
    at Function.createGenericNotFoundError (/usr/share/kibana/src/core/server/saved_objects/service/lib/errors.js:136:37)
    at SavedObjectsRepository.delete (/usr/share/kibana/src/core/server/saved_objects/service/lib/repository.js:574:46)
    at process._tickCallback (internal/process/next_tick.js:68:7)
  data: null,
  isBoom: true,
  isServer: false,
  output:
   { statusCode: 404,
     payload:
      { statusCode: 404,
        error: 'Not Found',
        message:
         'Saved object [dashboard/system-0d3f2380-fa78-11e6-ae9b-81e5311e8cab] not found' },
     headers: {} },
  reformat: [Function],
  typeof: [Function: notFound],
  [Symbol(SavedObjectsClientErrorCode)]: 'SavedObjectsClient/notFound' }

Also for search/system-eb0039f0-fa7f-11e6-a1df-a78bd7504d38 and dashboard/system-277876d0-fa2c-11e6-bbd3-29c986c96e5a

kuisathaverat (Contributor)

I see tons of logs like this one:

{"type":"log","@timestamp":"2020-11-04T11:55:25+00:00","tags":["info","plugins","ingestManager"],"pid":6,"message":"Custom registry url is an experimental feature and is unsupported."}

and when I access Fleet, these two:

{"type":"log","@timestamp":"2020-11-04T11:57:36+00:00","tags":["error","plugins","ingestManager"],"pid":6,"message":"[cluster_block_exception] index [.transform-internal-005] blocked by: [FORBIDDEN/8/index write (api)]; response from /_transform/endpoint.metadata_current-default-0.16.1: {\"error\":{\"root_cause\":[{\"type\":\"cluster_block_exception\",\"reason\":\"index [.transform-internal-005] blocked by: [FORBIDDEN/8/index write (api)];\"}],\"type\":\"runtime_exception\",\"reason\":\"runtime_exception: Failed to persist transform configuration\",\"caused_by\":{\"type\":\"cluster_block_exception\",\"reason\":\"index [.transform-internal-005] blocked by: [FORBIDDEN/8/index write (api)];\"}},\"status\":500}"}

{"type":"error","@timestamp":"2020-11-04T11:56:50+00:00","tags":[],"pid":6,"level":"error","error":{"message":"Internal Server Error","name":"Error","stack":"Error: Internal Server Error\n at HapiResponseAdapter.toError (/usr/share/kibana/src/core/server/http/router/response_adapter.js:132:19)\n at HapiResponseAdapter.toHapiResponse (/usr/share/kibana/src/core/server/http/router/response_adapter.js:86:19)\n at HapiResponseAdapter.handle (/usr/share/kibana/src/core/server/http/router/response_adapter.js:81:17)\n at Router.handle (/usr/share/kibana/src/core/server/http/router/router.js:164:34)\n at process._tickCallback (internal/process/next_tick.js:68:7)"},"url":"https://dev-next-oblt.elastic.dev/api/fleet/setup","message":"Internal Server Error"}

kuisathaverat (Contributor) commented Nov 4, 2020

I have checked the Elasticsearch cluster status: it is red, which can explain why it cannot write to that index.

{
  "cluster_name" : "XXXXXXXXXX",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 12,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 768,
  "active_shards" : 1168,
  "relocating_shards" : 0,
  "initializing_shards" : 11,
  "unassigned_shards" : 238,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 26,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 773,
  "active_shards_percent_as_number" : 82.42766407904023
}
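
For reference, output like the above comes from the Elasticsearch cluster health API; with the same kind of placeholders as the curl in the description:

$ curl -u elastic:${PASSWORD} -k "https://<es-host>:443/_cluster/health?pretty"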

kuisathaverat (Contributor)

With the cluster in yellow, we get the same error:

07:34:35.000
kibana.log
[cluster_block_exception] index [.transform-internal-005] blocked by: [FORBIDDEN/8/index write (api)]; response from /_transform/endpoint.metadata_current-default-0.16.1: {"error":{"root_cause":[{"type":"cluster_block_exception","reason":"index [.transform-internal-005] blocked by: [FORBIDDEN/8/index write (api)];"}],"type":"runtime_exception","reason":"runtime_exception: Failed to persist transform configuration","caused_by":{"type":"cluster_block_exception","reason":"index [.transform-internal-005] blocked by: [FORBIDDEN/8/index write (api)];"}},"status":500}

kuisathaverat (Contributor) commented Nov 4, 2020

The index is frozen. I think we reported this before; it has neither an alias nor an ILM policy.

[Screenshot: 2020-11-04 at 13:36:57]

After unfreezing the index, the issue is gone.
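
For reference, on a 7.x cluster the block and frozen settings on that index can be inspected, and the index unfrozen, with something like the following (host and credentials are placeholders; the index name is the one from the logs above):

# Inspect the index settings to confirm the write block / frozen flag
$ curl -u elastic:${PASSWORD} -k "https://<es-host>:443/.transform-internal-005/_settings?pretty"

# Unfreeze the index (the _unfreeze API available on 7.x)
$ curl -u elastic:${PASSWORD} -k -X POST "https://<es-host>:443/.transform-internal-005/_unfreeze"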

skh (Contributor) commented Nov 4, 2020

As the symptom in the original description will be addressed by #81110, can we close this one?

mtojek (Contributor, Author) commented Nov 4, 2020

Actually, it's up to the team. I didn't dive deeper into the Kibana issue, but I assume that the package won't disappear once installed, right? It isn't a temporary cache?

If you decide to close this issue, I suggest prioritizing the other one, because replacing staged packages with new versions is a relatively common use case (e.g. we accumulate many snapshots until we have a major version; snapshots will disappear once the package is promoted).

EDIT:

I'm not quite sure about the root cause of #82580 (comment); it looks like it's a bit different and related to Endpoint. /cc @jonathan-buttner

nnamdifrankie (Contributor)

@kuisathaverat

{"type":"log","@timestamp":"2020-11-04T11:57:36+00:00","tags":["error","plugins","ingestManager"],"pid":6,"message":"[cluster_block_exception] index [.transform-internal-005] blocked by: [FORBIDDEN/8/index write (api)]; response from /_transform/endpoint.metadata_current-default-0.16.1: {\"error\":{\"root_cause\":[{\"type\":\"cluster_block_exception\",\"reason\":\"index [.transform-internal-005] blocked by: [FORBIDDEN/8/index write (api)];\"}],\"type\":\"runtime_exception\",\"reason\":\"runtime_exception: Failed to persist transform configuration\",\"caused_by\":{\"type\":\"cluster_block_exception\",\"reason\":\"index [.transform-internal-005] blocked by: [FORBIDDEN/8/index write (api)];\"}},\"status\":500}"}

{"type":"error","@timestamp":"2020-11-04T11:56:50+00:00","tags":[],"pid":6,"level":"error","error":{"message":"Internal Server Error","name":"Error","stack":"Error: Internal Server Error\n at HapiResponseAdapter.toError (/usr/share/kibana/src/core/server/http/router/response_adapter.js:132:19)\n at HapiResponseAdapter.toHapiResponse (/usr/share/kibana/src/core/server/http/router/response_adapter.js:86:19)\n at HapiResponseAdapter.handle (/usr/share/kibana/src/core/server/http/router/response_adapter.js:81:17)\n at Router.handle (/usr/share/kibana/src/core/server/http/router/router.js:164:34)\n at process._tickCallback (internal/process/next_tick.js:68:7)"},"url":"https://dev-next-oblt.elastic.dev/api/fleet/setup","message":"Internal Server Error"}

Regarding this error, I reached out to the ML team; here is their response on the likely cause:

afaik this happens when the disk gets full. Similar issues happen for .kibana. Does the problem persist? I think in former versions of ES you had to manually unblock an index after running out of disk space, but they introduced a fix for that which automatically makes indices writable again once disk space is available.

I also recall from the channel that there are a lot of documents on the server; perhaps we should try resizing the host and disk, or clearing up disk space on the machine.
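
As a side note, a disk-full condition usually shows up as a FORBIDDEN/12/index read-only / allow delete block rather than the FORBIDDEN/8/index write (api) block seen in the logs here. If it were the flood-stage block, it could be cleared manually with something like the following (host and credentials are placeholders):

$ curl -u elastic:${PASSWORD} -k -X PUT "https://<es-host>:443/.transform-internal-005/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "index.blocks.read_only_allow_delete": null }'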

ph (Contributor) commented Nov 4, 2020

This issue is starting to get confusing. The initial description seems to be caused by the package being gone from the registry; as @skh mentioned, this will be addressed by #81110.

So I think I would close this one. @neptunian Can you double-check?

The installation problem mentioned by @nnamdifrankie should be tracked in a separate issue.

nnamdifrankie (Contributor)

@kuisathaverat

Can you please create a ticket for the frozen index issue? In that ticket we should evaluate the physical health of the server, e.g. disk space etc. I did not set up that server, so I will defer to you on how to proceed.

kuisathaverat (Contributor)

Regarding the disk-full theory: this cluster has about 5 TB of disk space and we are using about 3 TB, so I think we are far from filling the disk. Also, if we ran out of disk space, everything would blow up (I know from experience). Something else has to have triggered this index freeze.
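
Per-node disk usage can be confirmed with the cat allocation API (host and credentials are placeholders):

$ curl -u elastic:${PASSWORD} -k "https://<es-host>:443/_cat/allocation?v"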

kuisathaverat (Contributor) commented Nov 4, 2020

@nnamdifrankie
On which repo should I create the issue? Do I have to add any labels to the issue?

kevinlog (Contributor) commented Nov 4, 2020

@kuisathaverat

On which repo should I create the issue? Do I have to add any labels to the issue?

You can create it in the Kibana public repo and add the label "Team:Onboarding and Lifecycle Mgt".

neptunian (Contributor) commented Nov 5, 2020

Agreed. If there is an error installing some version of a package, it will try to roll back. There should have been log messages describing the problem, that a rollback was being attempted, and that it failed.

mtojek (Contributor, Author) commented Nov 5, 2020

I understand that there should be an error stored in the logs, but on the other hand, the one reported in the REST response is really confusing. The user tries to install 0.9.0 and receives a response reporting a problem with a non-existent 0.7.0. Do you think we can improve the error message?
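
As an illustration only (not a proposal for specific Fleet code; the names below are hypothetical), a rollback failure could wrap both errors so the API response points at the real failure:

// Hypothetical error type that keeps both the original install failure and the rollback failure visible.
class PackageRollbackError extends Error {
  constructor(
    pkgName: string,
    failedVersion: string,
    rollbackVersion: string,
    installError: Error,
    rollbackError: Error
  ) {
    super(
      `installing ${pkgName}-${failedVersion} failed: ${installError.message}; ` +
        `rolling back to ${pkgName}-${rollbackVersion} also failed: ${rollbackError.message}`
    );
    this.name = 'PackageRollbackError';
  }
}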

neptunian (Contributor)

I tested the scenario out and the response I get is:

{
    "statusCode": 500,
    "error": "Internal Server Error",
    "message": "blee blah blah"
}

The response was the error that caused the rollback in the first place. I'm not sure this is the best message either, but I don't get any mention of the other package.

The logs looked like:

server    log   [08:53:11.536] [error][ingestManager][plugins] Error: blee blah blah
    at _installPackage (/Users/sandy/dev/elastic/kibana/x-pack/plugins/ingest_manager/server/services/epm/packages/_install_package.ts:105:37)
    at process._tickCallback (internal/process/next_tick.js:68:7)
server    log   [08:53:11.541] [error][ingestManager][plugins] rolling back to nginx-0.2.3 after error installing nginx-0.2.4
server    log   [09:06:34.146] [error][ingestManager][plugins] failed to uninstall or rollback package after installation error RegistryResponseError: '400 Bad Request' error response from package registry at https://epr-snapshot.elastic.co/package/nginx/0.2.3
server   error  [08:53:10.220]  Error: Internal Server Error
    at HapiResponseAdapter.toError (/Users/sandy/dev/elastic/kibana/src/core/server/http/router/response_adapter.ts:132:19)
    at HapiResponseAdapter.toHapiResponse (/Users/sandy/dev/elastic/kibana/src/core/server/http/router/response_adapter.ts:82:19)
    at HapiResponseAdapter.handle (/Users/sandy/dev/elastic/kibana/src/core/server/http/router/response_adapter.ts:77:17)
    at Router.handle (/Users/sandy/dev/elastic/kibana/src/core/server/http/router/router.ts:273:34)

mtojek (Contributor, Author) commented Nov 6, 2020

I think we're talking about different errors. Please look at the one I posted in the issue description (HTTP 502).

ph closed this as completed on Mar 18, 2021.