perf_hooks: add event loop delay sampler #25378

jasnell · 2019-01-07T19:38:45Z

Inspired by (and code borrowed from) https://github.com/mafintosh/event-loop-delay

Adds a simple event loop delay sampler to perf_hooks.

See included test for example use.

Checklist

* `make -j4 test` (UNIX), or `vcbuild test` (Windows) passes
* tests and/or benchmarks are included
* documentation is changed or added
* commit message follows commit guidelines

nodejs-github-bot · 2019-01-07T19:38:47Z

@jasnell build started: https://ci.nodejs.org/blue/organizations/jenkins/node-test-pull-request-lite-pipeline/detail/node-test-pull-request-lite-pipeline/2167/pipeline

jasnell · 2019-01-07T19:39:35Z

CI: https://ci.nodejs.org/job/node-test-pull-request/19973/

doc/api/perf_hooks.md

lib/perf_hooks.js

src/node_perf.cc

src/node_perf.h

mcollina

LGTM

addaleax

(just requesting changes to make sure this doesn’t land as-is)

doc/api/perf_hooks.md

sam-github · 2019-01-08T20:00:03Z

I could use some convincing that this time-based sampling approach is useful, or that I'm missing the point of the code.

Loop delays tend to be irregular, and to only block a single "turn" of the loop, don't they? Won't this approach miss those delays?

It seems to be much more useful to measure every loop, and track max and min (and possibly average) over some time interval.

src/node_perf.cc

jasnell · 2019-01-08T23:18:25Z

@sam-github ... @mafintosh can likely better describe the rationale behind this particular algorithm (given that he's the one that wrote it :-) ...) ... For the most part, however, this is giving the accumulated delay if the length of time between iterations exceeds the threshold.

jasnell · 2019-01-08T23:36:11Z

I've added some documentation. Essentially, this measures the accumulated delay if the previous turn of the event loop takes longer than a threshold determined by the resolution.

@sam-github ... what you're suggesting is a different way of measuring that we can also do relatively easily.
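A rough sketch of the accumulation described above may help make it concrete. Everything here is illustrative: `createSampler`, `sample`, and `delayNs` are hypothetical names, not the code in this PR, and subtracting the threshold (rather than the resolution) from each long interval is an assumption.

```javascript
// Hypothetical sketch of threshold-based delay accumulation.
// Not the PR's implementation; names and details are assumptions.
function createSampler(thresholdNs) {
  let prevNs = null; // timestamp of the previous sample, in nanoseconds
  let delayNs = 0;   // accumulated delay, in nanoseconds

  return {
    // Call once per timer tick with a monotonic timestamp in nanoseconds.
    sample(nowNs) {
      if (prevNs !== null) {
        const deltaNs = nowNs - prevNs;
        // Only intervals longer than the threshold contribute to the delay.
        // Assumption: the excess over the threshold is what gets accumulated.
        if (deltaNs > thresholdNs) {
          delayNs += deltaNs - thresholdNs;
        }
      }
      prevNs = nowNs;
    },
    get delayNs() {
      return delayNs;
    },
  };
}
```

In a real program `sample()` would be driven by a repeating timer firing every `resolution` milliseconds, so a blocked loop shows up as one long interval between two samples.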

doc/api/perf_hooks.md

jasnell · 2019-01-08T23:47:45Z

Attempted to clarify a different way

jasnell · 2019-01-09T00:10:22Z

New CI: https://ci.nodejs.org/job/node-test-pull-request/20006/

doc/api/perf_hooks.md

sam-github · 2019-01-09T16:14:14Z

doc/api/perf_hooks.md

+
+The accumulated event loop delay (`delay`) is determined by:
+
+* Calculating a `threshold` in milliseconds as `5e6 + (resolution * 1e6)`.


If my unit conversions are right, isn't this an obscure way of saying the threshold is 1.5 times the resolution?
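The arithmetic can be checked with a small sketch. It assumes `resolution` is given in milliseconds and that the formula's result is in nanoseconds (the quoted text says milliseconds, which does not match the `1e6` factor); `thresholdNs` and `ratio` are hypothetical helper names.

```javascript
// Threshold formula as quoted from the draft docs.
// Assumption: resolution is in milliseconds, the result in nanoseconds.
function thresholdNs(resolutionMs) {
  return 5e6 + resolutionMs * 1e6;
}

// Ratio of the threshold to the resolution, both expressed in nanoseconds.
function ratio(resolutionMs) {
  return thresholdNs(resolutionMs) / (resolutionMs * 1e6);
}
```

Under that reading the threshold is the resolution plus a fixed 5 ms, so it is 1.5 times the resolution only when the resolution is 10 ms; at 20 ms the ratio is 1.25.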

sam-github · 2019-01-09T16:16:20Z

> Essentially, this measures the accumulated delay

Does it, though? The docs say it samples at `resolution` milliseconds, so it only measures periodically, making it easy to miss, for example, occasional calls to sync APIs.

It's not at all clear to me why the threshold is scaled from the sampling period, or why the sampling period is called resolution.

@mafintosh -- comments?

jasnell · 2019-01-09T20:48:08Z

> Does it, though? The docs say it samples at resolution milliseconds, so it only measures periodically, making it easy to miss, for example, occasional calls to sync APIs.

Well, yes, using a sampling approach means the data will miss some things but it also allows it to work with significantly less overhead. We certainly could also do the min/max/average calculation on every event loop tick but at a definitely higher overhead cost.

If you think that it's worthwhile, I can implement both approaches configurable via the options.

// Sampling-method
performance.startEventLoopTimer({ resolution: 10 })

// Per-event-tick method
performance.startEventLoopTimer({ resolution: 0 })

We can collect the min, max, rolling average in addition to the accumulated delay.

addaleax · 2019-01-28T23:04:43Z

@nodejs/diagnostics Any thoughts on this?

devsnek · 2019-01-28T23:18:53Z

why is this on the performance object instead of the perf_hooks export?

gireeshpunathil · 2019-01-29T03:56:11Z

I guess the most common production workloads will only be interested in seeing a pattern (average delay over time), its steady-state values, and outliers if any; not necessarily high-precision data that is accurate at the loop-iteration level. So a sampling method that strikes a balance between useful information and performance efficiency looks good to me.

On the other hand, having so many options to configure may lead to:

* a lot more usability issues, misconfigurations and questions
* unwanted comparison at the micro level, only to spend time debugging false positives

In short, I am good with the changes as is.

Add a sampling-based event loop delay monitor.

```js
const { monitorEventLoopDelay } = require('perf_hooks');
const h = monitorEventLoopDelay();
h.enable();
h.disable();
console.log(h.percentiles);
console.log(h.min);
console.log(h.max);
console.log(h.mean);
console.log(h.stddev);
console.log(h.percentile(50));
```

From: mcollina/native-hdr-histogram@c63e971

jasnell · 2019-02-07T02:24:54Z

Turns out there's a patch @mcollina didn't tell me about ;-D and since he's hanging out in Hawaii right now I gotta give him a hard time about it :-)

New CI: https://ci.nodejs.org/job/node-test-pull-request/20631/

Flarna · 2019-02-07T09:29:39Z

@jasnell I wonder a little bit about your test results, as it seems that 1000ms tasks in the event loop would not be detected as such. Most likely this was caused by my test script, which didn't allow setting the busy time on the command line, only "test modes". I updated the script and retested, also in other environments.

If the event loop is busy with tasks longer than the 10ms sample interval, this is detected reliably in all setups. But measuring an idle Node.js app gives worse results than measuring an app busy with 10ms tasks, even on the physical Linux box.

What size of "detectable delays" did you have in mind here? Maybe the docs should include some hints on the achievable resolution?
I think some testing on other platforms should also be done to check whether there are other limitations.
Or is the sample I use too simple, leading to misleading interpretations on my side?

The VM on my private notebook is for sure not a relevant production setup (maybe delays are caused by CPU sleep states,...) so I think these results are not that important.

Linux VM on my old private notebook

start NaN
h.min=10.027008ms
h.max=161.349631ms
h.mean=13.187902367634125ms
h.stddev=5.629075695044025ms
start NaN
h.min=10.018816ms
h.max=32.292863ms
h.mean=11.474917361897475ms
h.stddev=2.308021168909449ms
start 0
h.min=9.232384ms
h.max=56.688639ms
h.mean=10.035087593038822ms
h.stddev=0.9816034647940286ms
start 0
h.min=9.33888ms
h.max=15.638527ms
h.mean=10.002511242161441ms
h.stddev=0.1461050645134329ms
start 10
h.min=10.010624ms
h.max=31.686655ms
h.mean=10.068520733131923ms
h.stddev=0.6172102376239719ms
start 10
h.min=10.002432ms
h.max=25.968639ms
h.mean=10.043363193569993ms
h.stddev=0.42167699874933723ms
start 100
h.min=100.007936ms
h.max=111.935487ms
h.mean=100.13551274666666ms
h.stddev=0.9452201842299545ms
start 100
h.min=100.007936ms
h.max=112.394239ms
h.mean=100.11191978666666ms
h.stddev=0.8513607073454165ms

Linux VM on my work PC

start NaN
h.min=10.010624ms
h.max=18.563071ms
h.mean=11.065823704905938ms
h.stddev=0.7107936502571325ms
start NaN
h.min=10.018816ms
h.max=15.474687ms
h.mean=11.278423034223392ms
h.stddev=0.759150300218026ms
start 0
h.min=9.0112ms
h.max=19.365887ms
h.mean=10.017919914495659ms
h.stddev=0.36377668699754595ms
start 0
h.min=9.125888ms
h.max=20.742143ms
h.mean=10.02488607486631ms
h.stddev=0.4253760072540978ms
start 10
h.min=10.010624ms
h.max=18.874367ms
h.mean=10.07886864516129ms
h.stddev=0.4418468141594953ms
start 10
h.min=10.010624ms
h.max=21.381119ms
h.mean=10.092213232839837ms
h.stddev=0.5521641804813799ms
start 100
h.min=100.007936ms
h.max=104.660991ms
h.mean=100.09597269333332ms
h.stddev=0.43897114293681383ms
start 100
h.min=100.007936ms
h.max=111.083519ms
h.mean=100.12764842666667ms
h.stddev=0.7444034454799928ms

Native Linux on a colleague's PC

start NaN
h.min=10.027008ms
h.max=13.008895ms
h.mean=10.126542735989196ms
h.stddev=0.07465781045266873ms
start NaN
h.min=10.0352ms
h.max=19.038207ms
h.mean=10.136660093274756ms
h.stddev=0.17323677977705812ms
start 0
h.min=9.289728ms
h.max=10.092543ms
h.mean=10.000237177725909ms
h.stddev=0.015177624171784065ms
start 0
h.min=9.29792ms
h.max=10.166271ms
h.mean=9.999846562187397ms
h.stddev=0.016481384728421315ms
start 10
h.min=10.002432ms
h.max=12.623871ms
h.mean=10.012928427378965ms
h.stddev=0.08814869597128885ms
start 10
h.min=10.002432ms
h.max=12.967935ms
h.mean=10.012034905206942ms
h.stddev=0.08011965038652569ms
start 100
h.min=100.007936ms
h.max=103.350271ms
h.mean=100.05949098666666ms
h.stddev=0.228135157621938ms
start 100
h.min=100.007936ms
h.max=102.760447ms
h.mean=100.05796181333334ms
h.stddev=0.20355572563408403ms

Updated Script

const { monitorEventLoopDelay } = require('perf_hooks');

let delay = +process.argv[2];
const duration = 1000 * 30;
const ns2ms = 1000 * 1000;

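// Block the event loop for roughly `delay` milliseconds, then reschedule.
function busy() {
  if (delay > 0) {
    const now = process.hrtime();
    let d;
    do {
      // hrtime returns [seconds, nanoseconds]; count both parts so that
      // delays of 1000ms or more terminate.
      d = process.hrtime(now);
    } while (d[0] * 1000 + d[1] / ns2ms < delay);
  }
  setImmediate(busy);
}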

const h = monitorEventLoopDelay({ resolution: 10 });

function endMeasurement() {
  h.disable();
  console.log(`h.min=${h.min / ns2ms}ms`);
  console.log(`h.max=${h.max / ns2ms}ms`);
  console.log(`h.mean=${h.mean / ns2ms}ms`);
  console.log(`h.stddev=${h.stddev / ns2ms}ms`);
  process.exit();
}

function start() {
  console.log(`start ${delay}`);
  h.enable();
  if (delay >= 0) {
    busy();
  }
}

start();
setTimeout(endMeasurement, duration);
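The busy-wait in `busy()` has to account for the whole-seconds part of `process.hrtime()` as well as the nanosecond remainder, otherwise delays of 1000ms or more cannot be simulated. One possible way to sidestep the two-part tuple is `process.hrtime.bigint()`. The helper name `busyWait` is hypothetical and this is only a sketch, not part of the posted script.

```javascript
// Block the current thread for roughly `ms` milliseconds.
// Sketch only; `busyWait` is an illustrative name.
function busyWait(ms) {
  // Treat NaN, zero and negative values as "do not block".
  if (!(ms > 0)) return;

  // process.hrtime.bigint() returns a single monotonic nanosecond counter,
  // so there is no separate seconds field to forget about.
  const end = process.hrtime.bigint() + BigInt(Math.round(ms * 1e6));
  while (process.hrtime.bigint() < end) {
    // spin
  }
}
```

With this helper, `busy()` could call `busyWait(delay)` and then reschedule itself with `setImmediate(busy)` as before.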

jasnell · 2019-02-07T15:21:25Z

Resume CI due to unrelated failure: https://ci.nodejs.org/job/node-test-pull-request/20637/

jasnell · 2019-02-07T18:12:33Z

@nodejs/build ... any ideas here?

10:26:58 Makefile:449: recipe for target 'clear-stalled' failed
10:26:58 make[1]: *** [clear-stalled] Error 123
10:26:58 Makefile:533: recipe for target 'run-ci' failed
10:26:58 make: *** [run-ci] Error 2
10:26:59 Build step 'Conditional steps (multiple)' marked build as failure
10:26:59 Performing Post build task...
10:26:59 Match found for : : True
10:26:59 Logical operation result is TRUE
10:26:59 Running script  : #/bin/bash

https://ci.nodejs.org/job/node-test-commit-arm/21996/nodes=ubuntu1604-arm64/console

Resolved

sam-github · 2019-02-07T18:37:48Z

Like @mcollina, #25378 (comment), I agree that loop health is critical for production monitoring.

@jasnell There hasn't been any attempt at convincing me that sampling has been seen to be effective in the wild, see #25378 (comment) (no comment from mafintosh), and this question is still open: #25378 (comment)

I've found @bnoordhuis's approach of measuring every loop, but only recording min/max/avg, to be low overhead (it's a timer/check, a comparison, and some math per loop; the overhead is dwarfed by the actual code that runs per loop, and most loops are not going to exceed the min/max anyhow), and very useful.

But if I'm such a big fan of an alternative approach, I should submit my own code to do it :-). It's you who is doing the work here, and any loop health stats are better than none. This approach doesn't prevent someone from trying another approach in the future, so I've no objection.
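For comparison, the per-loop bookkeeping described above (a comparison and some math per iteration) might look roughly like the following. This is a sketch under assumptions: `LoopStats` is a hypothetical name, and how each loop's duration is obtained (for example from a native hook around each iteration) is left out entirely.

```javascript
// Hypothetical min/max/average tracker for per-iteration loop durations.
// Sketch only; measuring each iteration's duration is out of scope here.
class LoopStats {
  constructor() {
    this.reset();
  }

  // Clear all statistics, e.g. at the start of each reporting interval.
  reset() {
    this.count = 0;
    this.sum = 0;
    this.min = Infinity;
    this.max = 0;
  }

  // Record the duration of one loop iteration (any consistent time unit).
  record(duration) {
    this.count += 1;
    this.sum += duration;
    if (duration < this.min) this.min = duration;
    if (duration > this.max) this.max = duration;
  }

  // Average duration over all recorded iterations, or 0 if none.
  get mean() {
    return this.count === 0 ? 0 : this.sum / this.count;
  }
}
```

Each `record()` call is constant-time, which is the property the low-overhead argument relies on; unlike a histogram it keeps no percentiles.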

addaleax · 2019-02-08T14:00:47Z

Resume CI again: https://ci.nodejs.org/job/node-test-pull-request/20672/

Add a sampling-based event loop delay monitor.

```js
const { monitorEventLoopDelay } = require('perf_hooks');
const h = monitorEventLoopDelay();
h.enable();
h.disable();
console.log(h.percentiles);
console.log(h.min);
console.log(h.max);
console.log(h.mean);
console.log(h.stddev);
console.log(h.percentile(50));
```

PR-URL: #25378
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Stephen Belanger <admin@stephenbelanger.com>
Reviewed-By: Richard Lau <riclau@uk.ibm.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>

From: mcollina/native-hdr-histogram@c63e97151dcff9b9aed1d8ea5e4f5964c69be32f

PR-URL: #25378
Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Stephen Belanger <admin@stephenbelanger.com>
Reviewed-By: Richard Lau <riclau@uk.ibm.com>
Reviewed-By: Anna Henningsen <anna@addaleax.net>

jasnell · 2019-02-08T19:27:07Z

Landed in bcdd228 and d999b55


BethGriggs · 2019-08-20T15:25:34Z

@jasnell, Should this land in v10.x? Please add the lts-watch label if so

BethGriggs · 2019-09-03T15:54:27Z

Could you please raise a backport PR? This change doesn't land cleanly on v10.x

nodejs-github-bot added the c++ and perf_hooks labels on Jan 7, 2019

Trott reviewed Jan 7, 2019
