Add core metrics service #58623
Conversation
Pinging @elastic/kibana-platform (Team:Platform)
it('collects event loop delay', async () => {
  const metrics = await collector.collect();

  expect(metrics.event_loop_delay).toBeGreaterThan(0);
});
Mocking getEventLoopDelay is a pain, so I only tested that we collect an actual > 0 value here.
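For reference, here is a minimal sketch of how this kind of event-loop-delay measurement is commonly implemented (an illustration of the general technique, not necessarily the exact helper used in this PR): schedule a zero-delay timer and measure how late it actually fires.

```ts
// Sketch only: approximate event loop delay by measuring how late a
// zero-delay timer fires. Anything above 0ms is time the event loop
// spent busy before it could run the callback, which is why the test
// above only asserts a value greater than 0.
function getEventLoopDelay(): Promise<number> {
  const start = process.hrtime();
  return new Promise<number>((resolve) => {
    setTimeout(() => {
      const [seconds, nanoseconds] = process.hrtime(start);
      resolve(seconds * 1000 + nanoseconds / 1e6);
    }, 0);
  });
}
```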
describe('ServerMetricsCollector', () => {
  let server: HttpService;
  let collector: ServerMetricsCollector;
  let hapiServer: HapiServer;
  let router: IRouter;
The ServerMetricsCollector collects data from the HAPI server. Unit testing that was a pain, as a lot of the HAPI server methods and behavior would need to be properly mocked, so I wrote integration tests instead (I also think this is more appropriate than unit tests for this specific collector).
it.skip('collect connection count', async () => {
  const waitSubject = new Subject();

  router.get({ path: '/', validate: false }, async (ctx, req, res) => {
    await waitSubject.pipe(take(1)).toPromise();
    return res.ok({ body: '' });
  });
  await server.start();

  let metrics = await collector.collect();
  expect(metrics.concurrent_connections).toEqual(0);

  sendGet('/');
  await delay(20);
  metrics = await collector.collect();
  expect(metrics.concurrent_connections).toEqual(1);
This is trying to test this part of the code:
const connections = await new Promise<number>(resolve => {
  this.server.listener.getConnections((_, count) => {
    resolve(count);
  });
});
I thought this.server.listener.getConnections was returning the number of opened/pending connections, so I tried to test it by keeping handlers waiting, but the test fails (metrics.concurrent_connections is always equal to 0).
The snippet is directly copied from kibana/src/legacy/server/status/index.js, lines 41 to 49 in 8e9a8a8:
server.listener.getConnections((_, count) => {
  event.concurrent_connections = count;
  // captures (performs transforms on) the latest event data and stashes
  // the metrics for status/stats API payload
  metrics.capture(event).then(data => {
    kbnServer.metrics = data;
  });
});
So it's very unlikely it doesn't work (or at least doesn't do what it's supposed to do); however, I don't know how to properly integration test it. (I could unit test it by mocking server.listener.getConnections, but mocking the whole hapi server just for that felt like overkill; a sketch of what that could look like is below.)
If someone has an idea...
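For what it's worth, a rough sketch of the unit-test alternative mentioned above, mocking only getConnections on a fake listener. This is illustrative only: the real collector also gathers request and response-time stats from other hapi server APIs, so a complete mock would need more than this.

```ts
it('collects connection count from the listener', async () => {
  // fake only the piece of the hapi server this assertion needs (illustrative)
  const fakeListener = {
    getConnections: (cb: (err: Error | null, count: number) => void) => cb(null, 3),
  };
  const collector = new ServerMetricsCollector({ listener: fakeListener } as any);

  const metrics = await collector.collect();
  expect(metrics.concurrent_connections).toEqual(3);
});
```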
sendGet doesn't send a request; you need to call end with a callback. https://visionmedia.github.io/superagent/
supertest(hapiServer.listener)
  .get('/')
  .end(() => null);
It does when you await it. Using both end and awaiting it afterwards throws an error stating that.
Oh, but I'm not awaiting it here... Thanks.
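So the fix for the skipped test is essentially the reviewer's suggestion: fire the request without awaiting it so the connection stays open while collecting, then release the pending handler. A rough sketch, reusing hapiServer, waitSubject, delay and collector from the test above:

```ts
// fire-and-forget: .end() actually sends the request, and we deliberately
// don't await it so the connection stays open during collection
supertest(hapiServer.listener)
  .get('/')
  .end(() => null);

await delay(20);
const metrics = await collector.collect();
expect(metrics.concurrent_connections).toEqual(1);

// release the handler that is parked on waitSubject so the request can complete
waitSubject.next('done');
```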
return {
  getOpsMetrics$: () => metricsObservable,
};
I exposed this in the setup API, even if the observable doesn't emit until core's start phase. Tell me if we prefer moving it to the start API instead.
I think it makes sense if we implement lazy logic
Makes sense to me to maintain the pattern of "register things in setup"
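Roughly, keeping the "register things in setup" pattern works here because the contract returns a ReplaySubject-backed observable that can be handed out before collection starts. A simplified sketch based on the snippets shown in this PR (config handling, teardown, and the real type definitions are omitted; the two interfaces below are stand-ins):

```ts
import { Observable, ReplaySubject } from 'rxjs';

// Stand-ins for the PR's real types, just to keep the sketch self-contained.
interface OpsMetrics { [key: string]: unknown }
interface OpsMetricsCollector { collect(): Promise<OpsMetrics> }

export class MetricsService {
  private readonly metrics$ = new ReplaySubject<OpsMetrics>(1);

  public setup() {
    // Safe to expose at setup: subscribers simply won't receive a value
    // until the first collection runs in start().
    return { getOpsMetrics$: (): Observable<OpsMetrics> => this.metrics$.asObservable() };
  }

  public start(collector: OpsMetricsCollector, interval: number) {
    setInterval(async () => {
      this.metrics$.next(await collector.collect());
    }, interval);
  }
}
```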
export class MetricsService
  implements CoreService<InternalMetricsServiceSetup, InternalMetricsServiceStart> {
  private readonly logger: Logger;
  private metricsCollector?: OpsMetricsCollector;
  private collectInterval?: NodeJS.Timeout;
  private metrics$ = new ReplaySubject<OpsMetrics>(1);
I created a metrics module / service instead of naming it ops. I thought that if at some point we want to expose other metrics, it would make more sense to have a global service for that.
export interface OpsMetrics {
  /** Process related metrics */
  process: OpsProcessMetrics;
  /** OS related metrics */
  os: OpsOsMetrics;
  /** server response time stats */
  response_times: OpsServerMetrics['response_times'];
  /** server requests stats */
  requests: OpsServerMetrics['requests'];
  /** number of current concurrent connections to the server */
  concurrent_connections: OpsServerMetrics['concurrent_connections'];
}
Waiting for @chrisronline to reply to #46563 (comment) to know if we can regroup the server metrics under a server property, instead of exposing them all at the root level as was done in legacy. A sketch of what that regrouping could look like is below.
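For illustration, the regrouping being discussed might look something like this (a hypothetical shape, not what this PR ships):

```ts
// Hypothetical OpsMetrics shape with server-related metrics nested under a
// `server` property instead of being spread at the root level.
export interface OpsMetrics {
  /** Process related metrics */
  process: OpsProcessMetrics;
  /** OS related metrics */
  os: OpsOsMetrics;
  /** HTTP server related metrics */
  server: {
    response_times: OpsServerMetrics['response_times'];
    requests: OpsServerMetrics['requests'];
    concurrent_connections: OpsServerMetrics['concurrent_connections'];
  };
}
```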
Are these in snake_case format just to maintain compatibility with legacy? Seems like we should rename to camelCase
It is, see #46563. If we think this is not acceptable and that consumers should adapt, I could both rename everything to camelCase and create the server property I spoke about. Not sure exactly how much the existing structure is allowed to change, maybe you can answer?
export const opsConfig = {
  path: 'ops',
  schema: schema.object({
    interval: schema.number({ defaultValue: 5000 }),
nit: what if we use a more descriptive duration type? schema.duration({ defaultValue: '5s' }),
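The suggested change would look roughly like this (assuming @kbn/config-schema's schema.duration, which accepts duration strings such as '5s'):

```ts
import { schema } from '@kbn/config-schema';

export const opsConfig = {
  path: 'ops',
  schema: schema.object({
    // duration type instead of a raw number of milliseconds
    interval: schema.duration({ defaultValue: '5s' }),
  }),
};
```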
if (platform === 'linux') {
  try {
    const distro = (await promisify(getos)()) as LinuxOs;
nit: could be created once instead of recreating on every call
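The nit is about hoisting the promisified call out of the collection method so it is only created once, roughly:

```ts
import getos from 'getos';
import { promisify } from 'util';

// created once at module scope instead of on every collect() call
const getDistroAsync = promisify(getos);

// ...later, inside the collector:
// const distro = (await getDistroAsync()) as LinuxOs;
```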
 */
export interface OpsOsMetrics {
  /** The os platform */
  platform: string;
nit: platform: NodeJS.Platform
    max: 0,
  };

  constructor(private readonly server: HapiServer) {
What if we prevent hapi server leakage and invert dependencies? The HTTP server could implement the collect interface instead.
We are inside core and no hapi reference is leaking outside of it, so I would say it's alright. But I can move the server collector inside http and expose it from the internal http contract if you think this is better / more future-proof.
We are inside core and no hapi reference is leaking outside of it, so I would say it's alright. But I can move the server collector inside http and expose it from the internal http contract if you think this is better / more future-proof.
It's up to you, but I'd prefer to have one place to update if we decide to get rid of hapi one day.
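For reference, the inversion being discussed could look roughly like this: the http module implements a generic collector interface and exposes it on its internal contract, so the metrics service never touches hapi directly. The names below (MetricsCollector, getServerMetricsCollector) are hypothetical.

```ts
// Hypothetical shapes only; not the contract shipped in this PR.
export interface MetricsCollector<T> {
  collect(): Promise<T>;
}

export interface InternalHttpServiceSetup {
  // ...existing internal http setup contract...
  getServerMetricsCollector: () => MetricsCollector<OpsServerMetrics>;
}
```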
retest
    free_in_bytes: os.freemem(),
    used_in_bytes: os.totalmem() - os.freemem(),
  },
  uptime_in_millis: os.uptime() * 1000,
Same as other comment, can we rename to camelCase?
Remaining points to decide on:
Don't have a strong opinion on any of them, but we need a consensus.
- Should we use an interval observable instead of setInterval (that would force moving the getOpsMetrics$ API to the start contract) - Add core metrics service #58623 (comment)
  not necessary (see the sketch after this list)
- Should we move the server collector to the http module to avoid leakage of the hapi API outside of the http module - Add core metrics service #58623 (comment)
  optional. #58623 (comment)
- Can we take this NP migration opportunity to change the ops metrics structure (at least in core) - Add core metrics service #58623 (comment)
  can be done as a follow-up
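A rough sketch of the "interval observable" alternative from the first point, reusing the collector and config names that appear earlier in this PR (this is not what was merged; setInterval was kept):

```ts
import { interval } from 'rxjs';
import { exhaustMap } from 'rxjs/operators';

// Collection driven by an rxjs interval instead of setInterval.
// `collector` and `config.interval` are the names used elsewhere in this PR.
const metrics$ = interval(config.interval).pipe(
  // exhaustMap skips ticks that fire while a collection is still in flight
  exhaustMap(() => collector.collect())
);
```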
Created #59113 to track the follow-up improvements
💚 Build Succeeded
* create base service and collectors
* wire the service into server, add mock
* add collector tests
* add main collector test
* export metric types from server
* add service and server tests
* updates generated doc
* improve doc
* nits and comments
* add disconnected requests test
How do I access this in NP? This doesn't have a …
@chrisronline Wow. I guess I just forgot to add it to the public contract. Will address that asap.
Created #59294
I'm comparing the data returned from this new API to what currently exists in the monitoring code base, and I'm seeing an issue (there might be more, but this is the first I dug into).
In the existing system, we request data which then "flushes" the event rolling system, which "resets" the existing state. So essentially, all …
However, this API seems to buffer the data for the entire duration of the process, resulting in different reported values from the existing system. Perhaps we should add a way to flush the system, or potentially introduce instances of the metrics collector that can support flushing some intermediate state (so as to preserve the idea that the main metrics collector never resets its local state)? I don't know if there are other plugins that depend on this data or not.
Yeah, it seems I missed the fact that the HAPI network monitor actually resets its state / requests after every collection.
The monitoring plugin and the oss server monitoring (…)
Leading to: I think falling back to oppsy behavior, by resetting the network collector after every collection (every …). WDYT?
Sure, that works for me! Thanks!
@chrisronline created #59551
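To make the agreed-on behavior concrete, here is a minimal sketch of a collector that resets its window after every collection, mirroring the oppsy / legacy behavior described above. The real change lives in #59551; the names and fields below are assumptions.

```ts
interface ResponseTimes {
  count: number;
  total_in_millis: number;
  max_in_millis: number;
}

class ResettableResponseTimeCollector {
  private stats: ResponseTimes = { count: 0, total_in_millis: 0, max_in_millis: 0 };

  // called from the server's response hooks
  public record(durationMs: number) {
    this.stats.count += 1;
    this.stats.total_in_millis += durationMs;
    this.stats.max_in_millis = Math.max(this.stats.max_in_millis, durationMs);
  }

  public collect(): { avg_in_millis: number; max_in_millis: number } {
    const { count, total_in_millis, max_in_millis } = this.stats;
    return {
      avg_in_millis: count === 0 ? 0 : total_in_millis / count,
      max_in_millis,
    };
  }

  public reset() {
    // called after every collection so each interval only reports its own window,
    // matching the oppsy / legacy behavior described above
    this.stats = { count: 0, total_in_millis: 0, max_in_millis: 0 };
  }
}
```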
Summary
Fix #46563
- Adds a metrics core service, and exposes the associated API in core's public setup API.
- OpsMetrics reproduces the structure generated in src/legacy/server/status/lib/metrics.js.
- The core collectors implementation no longer relies on oppsy (but is based on the oppsy implementation). Once all usages have been adapted to use this new core API, we should be able to remove oppsy from our dependencies.
Checklist
For maintainers
Dev Docs
A new metrics API is available from core, and allows retrieving various metrics regarding the HTTP server, process, and OS load/usage.
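As a usage illustration only (assuming the public contract ends up exposing this as core.metrics, which is what the follow-up #59294 is about), a plugin could consume it roughly like this; the import path and plugin name are placeholders:

```ts
import { CoreSetup, Plugin } from 'src/core/server';

export class MyMetricsConsumerPlugin implements Plugin {
  public setup(core: CoreSetup) {
    // subscribe to the ops metrics emitted on every collection interval
    core.metrics.getOpsMetrics$().subscribe((metrics) => {
      console.log('event loop delay (ms):', metrics.process.event_loop_delay);
    });
  }

  public start() {}
  public stop() {}
}
```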