
Fine performance metrics meta-issue #7665

Open
crusaderky opened this issue Mar 17, 2023 · 2 comments

crusaderky commented Mar 17, 2023

XREFs

In #7586, we started collecting very granular metrics on how workers are spending their time.
Demo: https://gist.github.com/crusaderky/a97f870c51260e63a1c14c20b762f666

As of that PR, we collect metrics in Worker.digests_total about:

  • Worker.execute, broken down by task prefix and activity, with special treatment for failed and cancelled tasks
  • Worker.gather_dep, broken down by activity, with special treatment for failed and cancelled transfers
  • Worker.get_data, broken down by activity
  • WorkerMemoryMonitor._spill, broken down by activity
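
For reference, a minimal sketch (not part of the PR) of peeking at these raw counters directly on the workers; it assumes a running distributed Client and relies on Client.run() injecting the worker into a parameter named dask_worker:

    from distributed import Client

    client = Client()  # or connect to an existing cluster

    def dump_digests(dask_worker):
        # Worker.digests_total is a mapping of ever-increasing key -> float totals
        return dict(dask_worker.digests_total)

    # {worker address: {metric key: cumulative value}}
    per_worker_metrics = client.run(dump_digests)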

This issue is a meta-tracker of all potential follow-ups, as well as a place to discuss high level design and cost/benefit ratios holistically.

The follow-ups can be broken down into two high-level threads:

  • Improve quality and usability of the collected data
  • What we do with the data

plus assorted finishing touches.

crusaderky commented Mar 31, 2023

Current data schema

#7701 (currently in main git tip; will be in release 2023.4.1) introduces a new mapping Scheduler.cumulative_worker_metrics.

These are ever-increasing key -> float pairs.

The keys are as follows. Note that there are more keys than the ones listed below, and that while all keys listed below are tuples, some keys may be bare strings.
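
A minimal sketch of reading this mapping from a client, assuming a running cluster (Client.run_on_scheduler() injects the scheduler into a parameter named dask_scheduler):

    from distributed import Client

    client = Client()  # or connect to an existing cluster

    metrics = client.run_on_scheduler(
        lambda dask_scheduler: dict(dask_scheduler.cumulative_worker_metrics)
    )
    # Keys are a mix of tuples and bare strings, so sort by their string form
    for key, value in sorted(metrics.items(), key=str):
        print(key, value)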

Worker.execute() metrics

Format

("execute", <task prefix>, <activity>, <unit>) -> float amount

All metrics with the same unit are additive.
In a hypothetical, perfect scenario where all workers run tasks back-to-back non-stop, the seconds metrics would add up to the number of threads on the cluster multiplied by the cluster uptime (although seceded tasks will skew that; see #7675).
Metrics are captured upon task termination, so in the case of long-running tasks, if you scrape them frequently you may observe artificial spikes that exceed your scraping interval (see #7677).
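
To illustrate the additivity, here's a rough sketch of summing the execute seconds per prefix and comparing the grand total against threads × uptime; it reuses the metrics mapping and client from the snippet above, and the uptime figure is an assumption you need to supply yourself:

    from collections import defaultdict

    seconds_by_prefix = defaultdict(float)
    for key, value in metrics.items():
        if not isinstance(key, tuple) or len(key) != 4:
            continue
        context, prefix, activity, unit = key
        if context == "execute" and unit == "seconds":
            seconds_by_prefix[prefix] += value

    total_seconds = sum(seconds_by_prefix.values())
    nthreads = sum(client.nthreads().values())  # total threads on the cluster
    uptime = 3600.0  # assumed cluster uptime in seconds; substitute your own measurement
    # In the ideal back-to-back scenario described above this would approach 1.0
    print(f"occupancy = {total_seconds / (nthreads * uptime):.2f}")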

Individual labels

Deserialize run_spec

  • ("execute", <prefix>, "deserialize", "seconds")

Unspill inputs

  • ("execute", <prefix>, "disk-read", "seconds")
  • ("execute", <prefix>, "disk-read", "count")
  • ("execute", <prefix>, "disk-read", "bytes")
  • ("execute", <prefix>, "decompress", "seconds")
  • ("execute", <prefix>, "deserialize", "seconds") (overlaps with run_spec deserialization)

Run task in thread

  • ("execute", <prefix>, "executor", "seconds")
  • ("execute", <prefix>, "thread-cpu", "seconds")
  • ("execute", <prefix>, "thread-noncpu", "seconds")
  • ("execute", <prefix>, <arbitrary user-defined label>, <arbitrary user-defined unit>)

Spill output (will disappear after #4424)

  • ("execute", <prefix>, "serialize", "seconds")
  • ("execute", <prefix>, "compress", "seconds")
  • ("execute", <prefix>, "disk-write", "seconds")
  • ("execute", <prefix>, "disk-write", "count")
  • ("execute", <prefix>, "disk-write", "bytes")

Delta to end-to-end runtime as seen from the worker state machine

  • ("execute", <prefix>, "other", "seconds")

Future additions

Failed tasks

Time wasted on non-successful tasks.
These metrics replace the time metrics listed above.

  • ("execute", <prefix>, "failed", "seconds")
  • ("execute", <prefix>, "cancelled", "seconds")

Worker.gather_dep metrics

Format

("gather-dep", <activity>, <unit>) -> float amount

All metrics with the same unit are additive.
A worker may have more than one network comm active at the same time, so the seconds metrics will likely add up to more than the uptime of the cluster.
Metrics are captured upon termination of a gather_dep call, so in the case of long-running transfers, if you scrape frequently you may observe artificial spikes.
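
For example, a small sketch of breaking the receiving side down by activity (reusing the metrics mapping fetched earlier in this comment):

    gather_dep_seconds = {
        key[1]: value  # key[1] is the activity
        for key, value in metrics.items()
        if isinstance(key, tuple) and len(key) == 3
        and key[0] == "gather-dep" and key[2] == "seconds"
    }
    # e.g. {"network": ..., "decompress": ..., "deserialize": ..., "other": ...}
    print(gather_dep_seconds)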

Individual labels

Worker.gather_dep() method

  • ("gather-dep", "network", "seconds")
  • ("gather-dep", "decompress", "seconds")
  • ("gather-dep", "deserialize", "seconds")

Spill output (will disappear after #4424)

  • ("gather-dep", "serialize", "seconds")
  • ("gather-dep", "compress", "seconds")
  • ("gather-dep", "disk-write", "seconds")
  • ("gather-dep", "disk-write", "count")
  • ("gather-dep", "disk-write", "bytes")

Delta to end-to-end runtime as seen from the worker state machine

  • ("gather-dep", "other", "seconds")

Future additions

Failed transfers

Time wasted on non-successful transfers.
These metrics replace the time metrics listed above.

  • ("gather-dep", "busy", "seconds")
  • ("gather-dep", "missing", "seconds")
  • ("gather-dep", "failed", "seconds")
  • ("gather-dep", "cancelled", "seconds")

Worker.get_data() metrics

Format

("get-data", <activity>, <unit>) -> float amount

All metrics with the same unit are additive.
A worker may have more than one network comm active at the same time, so the seconds metrics will likely add up to more than the uptime of the cluster.
Metrics are captured upon termination of a get_data call, so in the case of long-running transfers, if you scrape frequently you may observe artificial spikes.

Individual labels

Unspill

  • ("get-data", "disk-read", "seconds")
  • ("get-data", "disk-read", "count")
  • ("get-data", "disk-read", "bytes")
  • ("get-data", "decompress", "seconds")
  • ("get-data", "deserialize", "seconds")

Send over the network

  • ("get-data", "serialize", "seconds")
  • ("get-data", "compress", "seconds")
  • ("get-data", "network", "seconds")
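
Since get-data is the sending counterpart of gather-dep, a quick sketch (again reusing the metrics mapping fetched earlier) of contrasting the two sides of a transfer:

    sent = metrics.get(("get-data", "network", "seconds"), 0.0)
    received = metrics.get(("gather-dep", "network", "seconds"), 0.0)
    print(f"network seconds: sending={sent:.1f} receiving={received:.1f}")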

crusaderky commented

Summary from an offline meeting with @fjetter, @hendrikmakait and @ntabris:

  • We should prioritize showing the metrics we're already collecting over adding/refining metrics
  • "End-to-end" metrics are the most useful. Time series can give some additional insights, but not many more.
  • The inclination to expose everything to Prometheus as a go-to is more about the fact that it's easy to do so and less about the data being exposed as time series; actually, in this instance time metrics could cause confusion due to #7677 (Fine performance metrics: Meter currently-executing tasks).
  • We could reuse the scheduler-side construct of "computation" to define what "end-to-end" means. Its data could straightforwardly be extracted with scheduler plugins (e.g. coiled's).
  • There is concern / confusion as to why it appears that sometimes there are overlapping computations. We need to investigate exactly how this happens and decide what to do when the heartbeat sends new metrics and there's more than one computation running.
  • From a UX perspective, it would be valuable to expose the metrics as a cluster total and then let the user drill down into individual computations.
