
Add core metrics service #58623

Merged: 13 commits merged into elastic:master on Mar 3, 2020

Conversation

@pgayvallet (Contributor) commented on Feb 26, 2020

Summary

Fix #46563

  • Add a new metrics core service, and expose the associated API in core's public setup API.
  • The core OpsMetrics structure reproduces the one generated in src/legacy/server/status/lib/metrics.js.
  • The core collector implementations no longer rely on oppsy (though they are based on oppsy's implementation). Once all usages have been adapted to use this new core API, we should be able to remove oppsy from our dependencies.
  • This PR only exposes the new API and does not adapt any plugin/legacy code to use it. Migrating consumers to the new service/API is outside the scope of this PR (see Migrate status service and status page to New Platform #41983 for the status page/plugin, for example).

Checklist

For maintainers

Dev Docs

A new metrics API is available from core, allowing retrieval of various metrics regarding the HTTP server, the process, and OS load/usage:

core.metrics.getOpsMetrics$().subscribe(metrics => {
  // do something with the metrics
})

@pgayvallet added labels: Feature:New Platform, Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc), v7.7.0, v8.0.0 (Feb 26, 2020)
@elasticmachine (Contributor):

Pinging @elastic/kibana-platform (Team:Platform)

@pgayvallet added the release_note:plugin_api_changes label (Feb 26, 2020)
Comment on lines +40 to +44
it('collects event loop delay', async () => {
  const metrics = await collector.collect();

  expect(metrics.event_loop_delay).toBeGreaterThan(0);
});
@pgayvallet (author):

Mocking getEventLoopDelay is a pain, so I only tested that we collect an actual > 0 value here.
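
For reference, a minimal sketch of the timer-drift technique such a collector can use (illustrative only, not necessarily this PR's implementation):

    // Measure event loop delay as the drift between a requested timeout and
    // the time that actually elapsed before the callback ran.
    function measureEventLoopDelay(intervalMs = 100): Promise<number> {
      const start = process.hrtime();
      return new Promise(resolve => {
        setTimeout(() => {
          const [seconds, nanoseconds] = process.hrtime(start);
          const elapsedMs = seconds * 1000 + nanoseconds / 1e6;
          resolve(Math.max(0, elapsedMs - intervalMs));
        }, intervalMs);
      });
    }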

Comment on lines +29 to +33
describe('ServerMetricsCollector', () => {
  let server: HttpService;
  let collector: ServerMetricsCollector;
  let hapiServer: HapiServer;
  let router: IRouter;
@pgayvallet (author):

The ServerMetricsCollector collects data from the hapi server. Unit testing it was a pain, as a lot of the hapi server's methods and behavior would need to be properly mocked, so I wrote an integration test instead (I also think this is more appropriate than a unit test for this specific collector).
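
A sketch of that integration-test approach: register a real route, hit it over HTTP, then assert on the collected stats (the sendGet helper and the requests.total counter are assumptions based on this test file, not confirmed names):

    it('collects request stats', async () => {
      // Real route on a real server: no hapi mocking required.
      router.get({ path: '/', validate: false }, async (ctx, req, res) => res.ok({ body: '' }));
      await server.start();

      await sendGet('/');
      await sendGet('/');

      const metrics = await collector.collect();
      expect(metrics.requests.total).toEqual(2);
    });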

Comment on lines 108 to 123
it.skip('collect connection count', async () => {
  const waitSubject = new Subject();

  router.get({ path: '/', validate: false }, async (ctx, req, res) => {
    await waitSubject.pipe(take(1)).toPromise();
    return res.ok({ body: '' });
  });
  await server.start();

  let metrics = await collector.collect();
  expect(metrics.concurrent_connections).toEqual(0);

  sendGet('/');
  await delay(20);
  metrics = await collector.collect();
  expect(metrics.concurrent_connections).toEqual(1);
@pgayvallet (author):

This is trying to test this part of the code:

    const connections = await new Promise<number>(resolve => {
      this.server.listener.getConnections((_, count) => {
        resolve(count);
      });
    });

I thought this.server.listener.getConnections returned the number of open/pending connections, so I tried to test it by keeping handlers waiting, but the test fails (metrics.concurrent_connections is always equal to 0).

The snippet is directly copied from:

    server.listener.getConnections((_, count) => {
      event.concurrent_connections = count;
      // captures (performs transforms on) the latest event data and stashes
      // the metrics for status/stats API payload
      metrics.capture(event).then(data => {
        kbnServer.metrics = data;
      });
    });

So it's very unlikely it doesn't work (or at least doesn't do what it's supposed to do); however, I don't know how to properly integration test it. (I could unit test it by mocking server.listener.getConnections, but mocking the whole hapi server just for that felt like overkill.)

If someone has an idea...

Reviewer (Contributor):

sendGet doesn't send a request; you need to call end with a callback. https://visionmedia.github.io/superagent/

    supertest(hapiServer.listener)
      .get('/')
      .end(() => null);

@pgayvallet (author):

It does when you await it; using both end and awaiting it afterwards throws an error stating that.

@pgayvallet (author):

Oh, but I'm not awaiting it here... Thanks.
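
So the fix is presumably to fire the request without awaiting it, keeping the handler pending while the collector runs; a sketch of the corrected fragment (releasing the handler via waitSubject.next is an assumption):

    // Fire the request without awaiting it so the handler stays pending
    // and the connection remains open during collection.
    supertest(hapiServer.listener)
      .get('/')
      .end(() => null);

    await delay(20);
    const metrics = await collector.collect();
    expect(metrics.concurrent_connections).toEqual(1);

    waitSubject.next('go'); // release the pending handler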

Comment on lines +51 to +53
return {
  getOpsMetrics$: () => metricsObservable,
};
@pgayvallet (author):

I exposed this in the setup API, even though the observable doesn't emit until core's start phase. Let me know if we'd prefer moving it to the start API instead.

Reviewer (Contributor):

I think it makes sense if we implement lazy logic

Reviewer (Contributor):

Makes sense to me to maintain the pattern of "register things in setup"
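
Both options are compatible with a ReplaySubject-backed service: the observable can be handed out in setup() while collection only starts in start(). A minimal sketch, assuming the fields shown in the class below (not the merged code):

    import { Observable, ReplaySubject } from 'rxjs';
    import { OpsMetrics } from './types'; // the OpsMetrics type introduced in this PR

    class MetricsServiceSketch {
      private readonly metrics$ = new ReplaySubject<OpsMetrics>(1);
      private collectInterval?: NodeJS.Timeout;

      public setup() {
        // Safe to hand out now: subscribers simply receive nothing until
        // start() begins pushing samples into the ReplaySubject.
        const metricsObservable: Observable<OpsMetrics> = this.metrics$.asObservable();
        return { getOpsMetrics$: () => metricsObservable };
      }

      public start(collector: { collect(): Promise<OpsMetrics> }, intervalMs: number) {
        this.collectInterval = setInterval(async () => {
          this.metrics$.next(await collector.collect());
        }, intervalMs);
      }

      public stop() {
        if (this.collectInterval) clearInterval(this.collectInterval);
        this.metrics$.complete();
      }
    }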

Comment on lines +35 to +40
export class MetricsService
  implements CoreService<InternalMetricsServiceSetup, InternalMetricsServiceStart> {
  private readonly logger: Logger;
  private metricsCollector?: OpsMetricsCollector;
  private collectInterval?: NodeJS.Timeout;
  private metrics$ = new ReplaySubject<OpsMetrics>(1);
@pgayvallet (author):

I created a metrics module/service instead of naming it ops. I thought that if at some point we want to expose other metrics, it would make more sense to have a global service for that.

Comment on lines +55 to +66
export interface OpsMetrics {
  /** Process related metrics */
  process: OpsProcessMetrics;
  /** OS related metrics */
  os: OpsOsMetrics;
  /** server response time stats */
  response_times: OpsServerMetrics['response_times'];
  /** server requests stats */
  requests: OpsServerMetrics['requests'];
  /** number of current concurrent connections to the server */
  concurrent_connections: OpsServerMetrics['concurrent_connections'];
}
@pgayvallet (author):

Waiting for @chrisronline to reply to #46563 (comment) to know whether we can regroup the server metrics under a server property instead of exposing them all at the root level as was done in legacy.

Reviewer (Contributor):

Are these in snake_case format just to maintain compatibility with legacy? Seems like we should rename to camelCase

@pgayvallet (author):

It is, see #46563. If we think this is not acceptable and that consumers should adapt, I could both rename everything to camelCase and create the server property I mentioned. I'm not sure how much the existing structure is allowed to change; maybe you can answer?
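
For illustration, the hypothetical regrouped, camelCase shape under discussion would look something like this (not what this PR ships):

    export interface OpsMetricsAlternative {
      /** Process related metrics */
      process: OpsProcessMetrics;
      /** OS related metrics */
      os: OpsOsMetrics;
      /** Server metrics regrouped under a single property, in camelCase */
      server: {
        responseTimes: OpsServerMetrics['response_times'];
        requests: OpsServerMetrics['requests'];
        concurrentConnections: OpsServerMetrics['concurrent_connections'];
      };
    }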

@pgayvallet pgayvallet marked this pull request as ready for review February 27, 2020 09:43
@pgayvallet pgayvallet requested a review from a team as a code owner February 27, 2020 09:43
export const opsConfig = {
  path: 'ops',
  schema: schema.object({
    interval: schema.number({ defaultValue: 5000 }),
Reviewer (Contributor):

nit: what if we used the more descriptive duration type? schema.duration({ defaultValue: '5s' })
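
Applied to the config definition above, the suggestion would read as follows (a sketch; schema here is @kbn/config-schema):

    import { schema } from '@kbn/config-schema';

    export const opsConfig = {
      path: 'ops',
      schema: schema.object({
        // schema.duration parses human-readable values such as '5s' into a
        // duration object instead of a raw number of milliseconds
        interval: schema.duration({ defaultValue: '5s' }),
      }),
    };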


if (platform === 'linux') {
  try {
    const distro = (await promisify(getos)()) as LinuxOs;
Reviewer (Contributor):

nit: could be created once instead of recreating on every call
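
That is, hoist the promisified function to module scope instead of rebuilding it on every call; a sketch (the LinuxOs shape is assumed here, and getDistroRelease is an illustrative name):

    import getos from 'getos';
    import { promisify } from 'util';

    // Created once at module load rather than on every collect() call.
    const getDistroAsync = promisify(getos);

    // Assumed minimal shape of getos' linux result, for this sketch only.
    type LinuxOs = { os: 'linux'; dist: string; release: string };

    async function getDistroRelease(platform: NodeJS.Platform): Promise<string | undefined> {
      if (platform !== 'linux') return undefined;
      try {
        const distro = (await getDistroAsync()) as LinuxOs;
        return distro.release;
      } catch {
        return undefined;
      }
    }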

 */
export interface OpsOsMetrics {
  /** The os platform */
  platform: string;
Reviewer (Contributor):

nit: platform: NodeJS.Platform

    max: 0,
  };

  constructor(private readonly server: HapiServer) {
Reviewer (Contributor):

What if we prevent hapi server leakage and invert dependencies? The HTTP server could implement the collect interface instead.

@pgayvallet (author):

We are inside core and no hapi reference is leaking outside of it, so I would say it's alright. But I can move the server collector inside http and expose it from the internal http contract if you think this is better / more future proof.

Reviewer (Contributor):

> We are inside core and no hapi reference is leaking outside of it, so I would say it's alright. But I can move the server collector inside http and expose it from the internal http contract if you think this is better / more future proof.

It's up to you, but I'd prefer to have one place to update if we decide to get rid of hapi one day.
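
A sketch of that inversion (hypothetical names; the idea is that core only sees a generic collector contract while http owns the hapi details):

    import { OpsServerMetrics } from './types'; // the server metrics type from this PR

    // Generic contract owned by the metrics service.
    export interface MetricsCollector<T> {
      collect(): Promise<T>;
    }

    // Hypothetical addition to http's internal setup contract: http implements
    // the collector against its own hapi server and only exposes the interface,
    // so the hapi dependency stays in one place.
    export interface InternalHttpServiceSetupSketch {
      getServerMetricsCollector(): MetricsCollector<OpsServerMetrics>;
    }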

(Two resolved review threads on src/core/server/metrics/metrics_service.ts.)
@pgayvallet (author):

retest


      free_in_bytes: os.freemem(),
      used_in_bytes: os.totalmem() - os.freemem(),
    },
    uptime_in_millis: os.uptime() * 1000,
Reviewer (Contributor):

Same as other comment, can we rename to camelCase?

@pgayvallet (author):

Remaining points to decide on:

Don't have a strong opinion on any of them, but we need a consensus.

@mshustov (Contributor) left a review:

not necessary

optional. #58623 (comment)

can be done as follow-up

@pgayvallet (author):

Created #59113 to track the follow-up improvements

@kibanamachine (Contributor):

💚 Build Succeeded

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@pgayvallet merged commit 64ffae3 into elastic:master on Mar 3, 2020
pgayvallet added a commit to pgayvallet/kibana that referenced this pull request Mar 3, 2020
* create base service and collectors

* wire the service into server, add mock

* add collector tests

* add main collector test

* export metric types from server

* add service and server tests

* updates generated doc

* improve doc

* nits and comments

* add disconnected requests test
pgayvallet added a commit that referenced this pull request Mar 3, 2020
@chrisronline (Contributor):

How do I access this in NP? This doesn't have a metrics key

@pgayvallet (author):

@chrisronline Wow. I guess I just forgot to add it to the public contract. Will address that asap.

@pgayvallet (author):

Created #59294

@chrisronline (Contributor):

I'm comparing the data returned from this new API to what currently exists in the monitoring code base, and I'm seeing an issue (there might be more, but this is the first one I dug into).

In the existing system, we request data, which then "flushes" the rolling event system and "resets" the existing state. Essentially, all max and average stats are limited to the period between when the first event is collected and when the entire buffer is flushed, which is ideal, as each .monitoring-kibana-* document is collected every 10s (by default) and we want to ensure each document contains data representative of that time period.

However, this API seems to buffer the data for the entire duration of the process, resulting in different reported values from the existing system.

Perhaps we should add a way to flush the system, or potentially introduce instances of the metrics collector that can support flushing some intermediate state (preserving the idea that the main metrics collector never resets its local state)? I don't know whether other plugins depend on this data or not.

@pgayvallet (author):

@chrisronline

> However, this API seems to buffer the data for the entire duration of the process, resulting in different reported values from the existing system.

Yeah, it seems I missed the fact that the hapi network monitor actually resets its state/requests after every collection.

> I don't know whether other plugins depend on this data or not.

The monitoring plugin and the OSS server monitoring (src/legacy/server/status/collectors/get_ops_stats_collector.js) seem to be the only consumers of this API.

> Perhaps we should add a way to flush the system, or potentially introduce instances of the metrics collector that can support flushing some intermediate state

  • Exposing an API to flush the global collector/metrics doesn't look acceptable in the current implementation, as a single consumer should not impact the other consumers.

  • Introducing an additional layer in between doesn't seem doable, as I don't really see how it would be capable of computing the average request value for a given time interval with the data the current collector provides.

Leading to:

I think falling back to oppsy's behavior, by resetting the network collector after every collection (every getOpsMetrics$ observable emission), is the most reasonable thing to do, as this is what both consumers of the API expect anyway. It also seems reasonable to let the consumers handle any potential data aggregation (and none does at the moment).

WDYT?
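
A sketch of what that reset-on-collection would look like inside the service's collection loop (opsMetricsCollector, serverMetricsCollector, metrics$, and collectIntervalMs are illustrative names, not the actual follow-up code):

    const collectInterval = setInterval(async () => {
      const metrics = await opsMetricsCollector.collect();
      // Reset the rolling request/response counters after each emission,
      // matching oppsy, so every sample only covers the interval since the
      // previous collection.
      serverMetricsCollector.reset();
      metrics$.next(metrics);
    }, collectIntervalMs);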

@chrisronline (Contributor):

Sure, that works for me! Thanks!

@pgayvallet (author):

@chrisronline created #59551
