-
Notifications
You must be signed in to change notification settings - Fork 30k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf_hooks: add event loop delay sampler #25378
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(just requesting changes to make sure this doesn’t land as-is)
I could use some convincing that this time-based sampling approach is useful, or that I'm missing the point of the code. Loop delays tend to be irregular, and to only block a single "turn" of the loop, don't they? Won't this approach miss those delays? It seems to be much more useful to measure every loop, and track max and min (and possibly average) over some time interval. |
@sam-github ... @mafintosh can likely better describe the rationale behind this particular algorithm (given that he's the one that wrote it :-) ...) ... For the most part, however, this is giving the accumulated delay if the length of time between iterations exceeds the threshold. |
I've added some documentation. Essentially, this measures the accumulated delay if the previous turn of the event loop takes longer than a threshold determined by the resolution. @sam-github ... what you're suggesting is a different way of measuring that we can also do relatively easily. |
Attempted to clarify a different way |
597b295
to
4e8dacd
Compare
doc/api/perf_hooks.md
Outdated
|
||
The accumulated event loop delay (`delay`) is determined by: | ||
|
||
* Calculating a `threshold` in milliseconds as `5e6 + (resolution * 1e6)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If my unit conversions are right, isn't this an obscure way of saying the threshold is 1.5 times the resolution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep
Does it, though? The docs say it samples at Its not at all clear to me why the @mafintosh -- comments? |
Well, yes, using a sampling approach means the data will miss some things but it also allows it to work with significantly less overhead. We certainly could also do the min/max/average calculation on every event loop tick but at a definitely higher overhead cost. If you think that it's worthwhile, I can implement both approaches configurable via the // Sampling-method
performance.startEventLoopTimer({ resolution: 10 })
// Per-event-tick method
performance.startEventLoopTimer({ resolution: 0 }) We can collect the min, max, rolling average in addition to the accumulated delay. |
@nodejs/diagnostics Any thoughts on this? |
why is this on the performance object instead of the perf_hooks export? |
I guess the most common production workloads will only be interested to see a pattern (average delay over time), its steady state values, and outliers if any; not necessarily a high precision data that is accurate at the loop iteration level. So sampling method that strikes a balance between useful information and performance efficiency looks good to me. On the contrary, having so many options to configure may lead to:
In short, I am good with the changes as is. |
Add a sampling-based event loop delay monitor. ```js const { monitorEventLoopDelay } = require('perf_hooks'); const h = monitorEventLoopDelay(); h.enable(); h.disable(); console.log(h.percentiles); console.log(h.min); console.log(h.max); console.log(h.mean); console.log(h.stddev); console.log(h.percentile(50)); ```
b281780
to
386f06b
Compare
Turns out there's a patch @mcollina didn't tell me about ;-D and since he's hanging out in Hawaii right now I gotta give him a hard time about it :-) New CI: https://ci.nodejs.org/job/node-test-pull-request/20631/ |
@jasnell I wonder a little bit about your test results as it seems that 1000ms tasks in the event loop would be not detected as that. Most likely caused by my test script which didn't allow to set busy time on command line, just "test modes". Updated script and retested also in other environments. If event loop is busy with tasks longer then the 10ms sample interval it's stable detected in all setups. But measurement of an idle node.js app results in worse results compared to a node.js app busy with 10ms tasks even on the physical linux box. Which size of "detectable delays" had you in mind here? Maybe the docs should include some hints on the reachable resolution? The VM on my private notebook is for sure not a relevant production setup (maybe delays are caused by CPU sleep states,...) so I think these results are not that important. Linux VM on my old private notebook
Linux VM on my work PC
Native Linux on a collegues PC
Updated Script
const { monitorEventLoopDelay } = require('perf_hooks');
let delay = +process.argv[2];
const duration = 1000 * 30;
const ns2ms = 1000 * 1000;
function busy() {
if (delay > 0) {
const now = process.hrtime();
let d;
do {
d = process.hrtime(now);
} while (d[1] / ns2ms < delay);
}
setImmediate(busy);
}
const h = monitorEventLoopDelay({ resolution: 10 });
function endMeasurement() {
h.disable();
console.log(`h.min=${h.min / ns2ms}ms`);
console.log(`h.max=${h.max / ns2ms}ms`);
console.log(`h.mean=${h.mean / ns2ms}ms`);
console.log(`h.stddev=${h.stddev / ns2ms}ms`);
process.exit();
}
function start() {
console.log(`start ${delay}`);
h.enable();
if (delay >= 0) {
busy();
}
}
start();
setTimeout(endMeasurement, duration); |
Resume CI due to unrelated failure: https://ci.nodejs.org/job/node-test-pull-request/20637/ |
@nodejs/build ... any ideas here?
https://ci.nodejs.org/job/node-test-commit-arm/21996/nodes=ubuntu1604-arm64/console |
Like @mcollina, #25378 (comment), I agree that loop health is critical for production monitoring. @jasnell There hasn't been any attempt at convincing me that sampling has been seen effective in the wild, see #25378 (comment) (no comment from mafintosh), and this question is open, #25378 (comment) I've found @bnoordhuis 's approach of measuring every loop, but only recording min/max/avg to be low overhead (its a timer/check, a comparison, and some math per loop, the overhead is dwarfed by the actual code that runs per loop, and most loops are not going to exceed the min/max anyhow), and very useful. But, if I'm such a big fan of an alternative approach, I should submit my own code to do it :-). Its you who are doing the work here, and any loop health stats are better than none. This approach doesn't prevent someone trying another approach in the future, so I've no objection. |
Resume CI again: https://ci.nodejs.org/job/node-test-pull-request/20672/ |
Add a sampling-based event loop delay monitor. ```js const { monitorEventLoopDelay } = require('perf_hooks'); const h = monitorEventLoopDelay(); h.enable(); h.disable(); console.log(h.percentiles); console.log(h.min); console.log(h.max); console.log(h.mean); console.log(h.stddev); console.log(h.percentile(50)); ``` PR-URL: #25378 Reviewed-By: Matteo Collina <matteo.collina@gmail.com> Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com> Reviewed-By: Stephen Belanger <admin@stephenbelanger.com> Reviewed-By: Richard Lau <riclau@uk.ibm.com> Reviewed-By: Anna Henningsen <anna@addaleax.net>
From: mcollina/native-hdr-histogram@c63e97151dcff9b9aed1d8ea5e4f5964c69be32fideps: PR-URL: #25378 Reviewed-By: Matteo Collina <matteo.collina@gmail.com> Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com> Reviewed-By: Stephen Belanger <admin@stephenbelanger.com> Reviewed-By: Richard Lau <riclau@uk.ibm.com> Reviewed-By: Anna Henningsen <anna@addaleax.net>
Add a sampling-based event loop delay monitor. ```js const { monitorEventLoopDelay } = require('perf_hooks'); const h = monitorEventLoopDelay(); h.enable(); h.disable(); console.log(h.percentiles); console.log(h.min); console.log(h.max); console.log(h.mean); console.log(h.stddev); console.log(h.percentile(50)); ``` PR-URL: #25378 Reviewed-By: Matteo Collina <matteo.collina@gmail.com> Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com> Reviewed-By: Stephen Belanger <admin@stephenbelanger.com> Reviewed-By: Richard Lau <riclau@uk.ibm.com> Reviewed-By: Anna Henningsen <anna@addaleax.net>
From: mcollina/native-hdr-histogram@c63e97151dcff9b9aed1d8ea5e4f5964c69be32fideps: PR-URL: #25378 Reviewed-By: Matteo Collina <matteo.collina@gmail.com> Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com> Reviewed-By: Stephen Belanger <admin@stephenbelanger.com> Reviewed-By: Richard Lau <riclau@uk.ibm.com> Reviewed-By: Anna Henningsen <anna@addaleax.net>
@jasnell, Should this land in v10.x? Please add the |
Could you please raise a backport PR? This change doesn't land cleanly on v10.x |
Inspired by (and code borrowed from) https://github.com/mafintosh/event-loop-delay
/cc @mafintosh @mcollina
Adds a simple event loop delay sampler to perf_hooks.
See included test for example use.
Checklist
make -j4 test
(UNIX), orvcbuild test
(Windows) passes