
Migrate status service and status page to New Platform #41983

Closed
11 of 15 tasks
joshdover opened this issue Jul 25, 2019 · 23 comments
Labels
Feature:Legacy Removal Issues related to removing legacy Kibana Feature:New Platform NeededFor:Monitoring Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc

Comments

@joshdover
Contributor

joshdover commented Jul 25, 2019

Subtasks:


Original issue content

Status service

From the lifecycle RFC:
Core should expose a global mechanism for core services and plugins to signal their status. This is equivalent to the legacy status API kibana.Plugin.status which allowed plugins to set their status to e.g. 'red' or 'green'. The exact design of this API is outside of the scope of this RFC.

What is important is that there is a global mechanism to signal status changes, which Core then makes visible to system administrators in the Kibana logs and the /status HTTP API. Plugins should be able to inspect and subscribe to status changes from any of their dependencies.

This will provide an obvious mechanism for plugins to signal that the conditions required for the plugin to operate are not currently met and that manual intervention might be required. Status changes can happen in both setup and start lifecycles, e.g.:

[setup] a required remote host is down
[start] a remote host which was up during setup starts returning connection timeout errors.
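
To make this concrete, here is a minimal sketch of how a plugin could push such status changes, assuming an Observable-based contract; the core.status.setStatus call and the status shape below are illustrative, not an existing API:

import { BehaviorSubject } from 'rxjs';

// Illustrative status shape; levels loosely mirror the legacy red/yellow/green idea.
type StatusLevel = 'available' | 'degraded' | 'unavailable';
interface PluginStatus {
  level: StatusLevel;
  summary?: string;
}

class MyPlugin {
  // The plugin owns a subject it can push status changes into at any point of its lifecycle.
  private readonly status$ = new BehaviorSubject<PluginStatus>({ level: 'available' });

  public setup(core: { status: { setStatus(status$: BehaviorSubject<PluginStatus>): void } }) {
    core.status.setStatus(this.status$);
    // [setup] a required remote host is down
    this.pingRemoteHost().catch(() =>
      this.status$.next({ level: 'unavailable', summary: 'remote host is unreachable' })
    );
  }

  public start() {
    // [start] a host that was up during setup starts returning connection timeouts
    setInterval(async () => {
      try {
        await this.pingRemoteHost();
        this.status$.next({ level: 'available' });
      } catch {
        this.status$.next({ level: 'degraded', summary: 'remote host is timing out' });
      }
    }, 30_000);
  }

  private async pingRemoteHost(): Promise<void> {
    // check the dependency here
  }
}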

Status API and page

Kibana currently exposes an api/status endpoint and an associated status_page which renders this output:

name: config.get('server.name'),
uuid: config.get('server.uuid'),
version: {
  number: config.get('pkg.version').replace(matchSnapshot, ''),
  build_hash: config.get('pkg.buildSha'),
  build_number: config.get('pkg.buildNum'),
  build_snapshot: matchSnapshot.test(config.get('pkg.version'))
},
status: kbnServer.status.toJSON(), // https://github.com/elastic/kibana/blob/2a290a14066d4da2b626bb0b4e4e9d0193853230/src/legacy/server/status/server_status.js#L111
metrics: kbnServer.metrics // https://github.com/elastic/kibana/issues/46563 https://github.com/elastic/kibana/blob/ec481861799ed8dcced9cafd8112e5b26e641c54/src/legacy/server/status/lib/metrics.js#L57-L68

The status page app is rendered as a hiddenUiApp

and migrating this will be blocked by

The api/stats endpoint is currently created in the same legacy plugin. It won't be migrated to Core, but depends on an equivalent to kbnServer.metrics being exposed from Core.

...kbnServer.metrics // latest metrics captured from the ops event listener in src/legacy/server/status/index

@joshdover joshdover added Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Feature:New Platform labels Jul 25, 2019
@elasticmachine
Contributor

Pinging @elastic/kibana-platform

@joshdover joshdover assigned joshdover and rudolf and unassigned joshdover Nov 12, 2019
@rudolf rudolf changed the title Migrate status page to New Platform Migrate status service and status page to New Platform Nov 19, 2019
@joshdover joshdover assigned eliperelman and unassigned rudolf Jan 14, 2020
@tsullivan
Member

tsullivan commented Jan 17, 2020

Hi,

The api/stats endpoint is currently created in the same legacy plugin. It won't be migrated to Core, but depends on an equivalent to kbnServer.metrics being exposed from Core.

I created the /api/stats endpoint with the idea that the stats data isn't suitable for Monitoring consumption, since the values come directly from Hapi.

What are the thoughts on preserving /api/stats and making it a full replacement for /api/status?

BTW, the stats abbreviation was chosen to be similar to the Elasticsearch endpoints, such as _cluster/stats

@joshdover
Contributor Author

I think we should keep these APIs separate and just have the status page UI consume both the status API and the stats API.

Migrating the stats API is part of #46563
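
For illustration, the status page UI could then fetch both payloads and merge them for rendering; the endpoint paths below are the existing ones, the merged shape is hypothetical:

// Hypothetical client-side helper for the status page.
async function loadStatusPageData() {
  const [status, stats] = await Promise.all([
    fetch('/api/status').then((res) => res.json()),
    fetch('/api/stats').then((res) => res.json()),
  ]);
  return { status, stats };
}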

@joshdover
Contributor Author

joshdover commented Feb 27, 2020

Here are my initial thoughts on how plugin & core service statuses should work. This is not intended to replace an RFC on the concept, but to organize and document the original thinking before investigating all use cases.

Please poke holes in this 😄

High-level design concepts

More expressive status levels

Right now we have red, yellow, and green. These don't really explain much or have a consistent semantic meaning.

We could benefit from having status levels with explicit meaning and associated behaviors:

enum ServiceStatusLevel {
  available,   // everything is working!
  degraded,    // some features may not be working
  unavailable, // the service/plugin is unavailable, but other functions should work
  fatal        // block all user functions and show status page
               // (reserved for core services?)
}

Statuses reflect dependencies between plugins

In legacy, it's common for plugins to use the "mirror plugin status" concept to inherit their status from another plugin (most commonly, the Elasticsearch plugin).

It seems beneficial for this concept to be baked into the design of the new service:

  • A plugin's status should default to the highest-severity status of its required dependencies
  • A plugin's status should default to degraded if any of its optional dependencies are >= degraded

Kibana should always try to keep as much functionality working as possible

In the legacy system, if any plugin changes to "red", pretty much all of Kibana's UI becomes blocked by the status page.

This prevents the user from using some of the built-in management and debug tools to diagnose and correct the problem. For instance, if Machine Learning is in the unavailable state, the user should still be able to use the Console app, License management, and Elasticsearch management tools to diagnose & fix the issue.

This is the purpose of the distinction between unavailable and fatal. I suspect that very few plugins should ever need to go into a fatal state. The few exceptions:

  • Security may want to trigger the fatal state if authentication is broken in some way (e.g. license expiration)
  • Some key core services may need to block (e.g. Saved Object migrations)

Plugin status does not alter which plugins are enabled

Anything that cannot be recovered from without restarting Kibana completely should be throwing exceptions during setup or start rather than setting an unavailable or fatal status.

Therefore, we should not disable plugins because they are currently unavailable since removing plugins from the dependency tree requires an entire restart of the Core lifecycles (essentially an in-place restart of Kibana).

Core services & plugins should use the same status mechanism

Pretty self-explanatory. There should be a single concept that backs the status of the different components in the system, and they should easily interop with one another.

API Design

enum ServiceStatusLevel {
  available,   // everything is working!
  degraded,    // some features may not be working
  unavailable, // the service/plugin is unavailable, but other functions should work
  fatal        // block all user functions and show status page
               // (reserved for core services?)
}

interface ServiceStatus {
  level: ServiceStatusLevel;
  summary?: string;
  detail?: string;
  meta?: object;
}
interface StatusSetup {
  // Allows a plugin to specify a custom status dependent on its own criteria.
  // See calculation section below on how this is combined with dependency statuses
  setStatus(status$: Observable<ServiceStatus>): Observable<ServiceStatus>;

  // Exposes plugin status for dependencies of current plugin.
  // Type could be inferred by the generic type arg provided by plugins.
  getPluginStatuses$(): Observable<Record<string, ServiceStatus>>;

  // Statuses for all of core's services. Can be used with `inheritStatus` utility
  // for expressing dependent statuses on core services.
  core$: Observable<{
    http: ServiceStatus;
    elasticsearch: ServiceStatus;
    savedObjects: ServiceStatus;
    uiSettings: ServiceStatus;
  }>;
}

Status calculation

// Utility for merging several statuses together and producing a single status with the 
// most severe status up to the maxLevel.
const inheritStatus:
  (
    statuses$: Array<Observable<ServiceStatus>>,
    maxLevel?: ServiceStatusLevel
  ) => Observable<ServiceStatus>;
// Pseudo-code calculation of a plugin's status
function calculatePluginStatus(requiredDeps: string[], optionalDeps: string[], pluginCustomStatus$?: Observable<ServiceStatus>) {
  const requiredDepStatus$ = inheritStatus(
    requiredDeps.map(dep => getStatusForPlugin(dep))
  );
  const optionalDepStatus$ = inheritStatus(
    optionalDeps.map(dep => getStatusForPlugin(dep)),
    ServiceStatusLevel.degraded // Optional dependencies are capped to 'degraded'
  );
  return inheritStatus([
    requiredDepStatus$,
    optionalDepStatus$,
    // `pluginCustomStatus$` is only set if plugin called `status.setStatus()`
    ...(pluginCustomStatus$ ? [pluginCustomStatus$] : [])
  ]);
}
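
A rough usage sketch with a toy implementation of inheritStatus, just to make the combination rules concrete; the string levels and severity ordering are assumptions based on the enum above:

import { combineLatest, of, Observable } from 'rxjs';
import { map } from 'rxjs/operators';

// Severity ordering assumed from the ServiceStatusLevel enum above.
const severity = { available: 0, degraded: 1, unavailable: 2, fatal: 3 } as const;
type Level = keyof typeof severity;
interface ServiceStatus { level: Level; summary?: string; }

const inheritStatus = (
  statuses$: Array<Observable<ServiceStatus>>,
  maxLevel: Level = 'fatal'
): Observable<ServiceStatus> =>
  combineLatest(statuses$).pipe(
    map((statuses) => {
      // Pick the most severe status...
      const worst = statuses.reduce((a, b) => (severity[a.level] >= severity[b.level] ? a : b));
      // ...but never report anything more severe than maxLevel.
      return severity[worst.level] > severity[maxLevel] ? { ...worst, level: maxLevel } : worst;
    })
  );

// A required dependency is unavailable; an optional one is also unavailable but capped to degraded:
const required$ = inheritStatus([of<ServiceStatus>({ level: 'unavailable', summary: 'es down' })]);
const optional$ = inheritStatus([of<ServiceStatus>({ level: 'unavailable' })], 'degraded');
inheritStatus([required$, optional$]).subscribe((status) => console.log(status));
// -> { level: 'unavailable', summary: 'es down' }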

Open questions

  • Do plugins need to be able to override the default inheritance of statuses from their dependencies?
  • Do we need a status service on the frontend as well as the backend?

@rudolf
Contributor

rudolf commented Feb 28, 2020

I really like the semantic statuses and the status inheritance.

Do plugins need to be able to override the default inheritance of statuses from their dependencies?

I think the default of degraded if a dependency is >= degraded makes sense, but there are probably many plugins which would be unavailable if one of their dependencies were unavailable. E.g. I don't think dashboard could do anything useful without the data plugin, so it might want to set its status to "unavailable" if the data plugin is unavailable.
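
Purely as an illustration of such an override, using the StatusSetup contract proposed above (the escalation mapping and plugin names are assumptions):

import { Observable } from 'rxjs';
import { map } from 'rxjs/operators';

interface ServiceStatus { level: 'available' | 'degraded' | 'unavailable' | 'fatal'; summary?: string; }

// Hypothetical: dashboard escalates to 'unavailable' whenever the data plugin is anything
// worse than available, instead of accepting the default 'degraded' cap.
function setupDashboardStatus(status: {
  getPluginStatuses$(): Observable<Record<string, ServiceStatus>>;
  setStatus(status$: Observable<ServiceStatus>): void;
}) {
  status.setStatus(
    status.getPluginStatuses$().pipe(
      map((deps): ServiceStatus =>
        deps.data.level === 'available'
          ? { level: 'available' }
          : { level: 'unavailable', summary: 'the data plugin is not available' }
      )
    )
  );
}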

Anything that cannot be recovered from without restarting Kibana completely, should be throwing exceptions during setup or start rather than setting a unavailable or fatal status.

I like this and I think it's the "correct" way. Exceptions are reserved for exceptional circumstances that we didn't anticipate and have no idea how to handle, so the only valid response is to crash and see if it might work the second time Kibana starts up.

The only risk, which I'm not sure how to mitigate, is that crashing Kibana becomes the default error handling. Many plugins don't have error handling for network exceptions or ES exceptions (usually rate-limiting or response data too large).

I also think we should flesh out how plugin APIs work when a plugin is degraded. One way would be for API methods to throw an exception when that method is degraded. So if APM was unable to create its agent configuration index, trying to call methods that rely on this index would throw an exception. This requires every consumer to know about, catch, and ignore these exceptions; if they don't, calling a degraded API method will crash Kibana.

This isn't any worse than in legacy, but it is a challenge to making Kibana resilient and improving uptime.
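
Roughly what such a fail-fast guard could look like (the APM-flavoured names here are hypothetical):

type StatusLevel = 'available' | 'degraded' | 'unavailable';

// Hypothetical plugin contract method that throws while the feature backing it is degraded.
// Every consumer has to catch this; an unhandled rejection is exactly the crash risk described above.
class AgentConfigurationClient {
  constructor(private readonly currentLevel: () => StatusLevel) {}

  public async getConfiguration(serviceName: string) {
    if (this.currentLevel() !== 'available') {
      throw new Error('agent configuration index is not available');
    }
    return this.queryIndex(serviceName);
  }

  private async queryIndex(serviceName: string) {
    // query the agent configuration index here
    return { serviceName, settings: {} };
  }
}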

@joshdover
Contributor Author

The problem described makes sense to me 👍

I'm unclear on what the first suggested solution means:

There was a suggestion for the status service to provide an API allowing plugins to enforce a status check. Btw, the licensing service does it.

This would be an API to allow plugins to register their own status check that would get executed by Core? Maybe we could leverage the ServiceSetup#set mechanism proposed in the original RFC for this (possibly with some changes)

As an alternative, we can intercept all network requests to ES and detect status changes (503 status in responses) to update the status. Or even allow plugins to claim to perform a request to ES with retry.

This may be difficult. From my understanding it's possible for some APIs in Elasticsearch to return 503 Unavailable while other APIs are working fine. Maybe we could set the ES status to degraded when Core's check is passing but some API calls are returning 503s? I just worry that 503s on some APIs may be irrelevant to many plugins. I think we need to analyze the actual failure behavior in ES and then determine what global behavior may (or may not) make sense here.

@joshdover
Contributor Author

@azasypkin do you think we need this solved in Core before #65472 can proceed?

@mshustov
Contributor

mshustov commented May 20, 2020

Maybe we could leverage the ServiceSetup#set mechanism proposed in the original RFC for this (possibly with some changes)

As I understand it, set allows a plugin to specify its own status. But here a plugin needs to force the status service to perform its status checks immediately, without waiting for the polling interval to complete. The licensing plugin provides a refresh method that triggers a license re-fetch: https://github.com/elastic/kibana/blob/master/x-pack/plugins/licensing/server/types.ts#L60
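
The escape hatch could look roughly like this on the status contract (method name modelled on licensing's refresh(); everything here is illustrative, not a proposed final API):

import { Observable } from 'rxjs';

interface ServiceStatus { level: 'available' | 'degraded' | 'unavailable' | 'fatal'; summary?: string; }

// Hypothetical extension of the proposed StatusSetup: alongside setStatus, expose a way to
// force an immediate re-check instead of waiting for the next polling interval.
interface StatusSetupWithRefresh {
  setStatus(status$: Observable<ServiceStatus>): void;
  // Recompute statuses right now and resolve with the result.
  refresh(): Promise<ServiceStatus>;
}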

@joshdover
Contributor Author

Got it, a manual refresh would be easy enough to add. The only complex part would be exposing a new manualTrigger$ or similar argument to core services and as part of the API exposed to plugins (when set is introduced).

@azasypkin
Member

@azasypkin do you think we need this solved in Core before #65472 can proceed?

Not necessarily, I'd rather wait a bit more to give you all enough time to come up with a reasonable solution. I can try to replicate something similar to what Larry did here and switch to a proper solution as soon as you provide it.

@pgayvallet
Contributor

Just a note of something I discovered while working on #67979:

In the legacy platform, there was a mechanism to display the status page instead of the app when the server was not ready:

server.route({
  path: '/app/{id}/{any*}',
  method: 'GET',
  async handler(req, h) {
    const id = req.params.id;
    const app = server.getUiAppById(id);
    try {
      if (kbnServer.status.isGreen()) {
        return await h.renderApp(app);
      } else {
        return await h.renderStatusPage();
      }
    } catch (err) {
      // (error handling elided from the original snippet)
    }
  },
});
Probably important to note that this is currently broken, and has been for a long time:

  • We now wait for ES to be ready before actually starting the plugins, so the real app is never displayed until the server is green (only the kibana is not ready yet body content)
  • Now that we are based on an SPA (for non-legacy mode), the client-side router would display the app page instead of the status page after core_system's boot.

This is not a blocker for #67979, and we have not received any complaints about this broken feature, but we will probably want to be able to display the status page during the server's startup, and I don't see that task in the issue task list.

@lukeelmers
Member

Closing as this is essentially complete, and the main outstanding task (#72831) is being tracked separately.
