
Fine performance metrics meta-issue #7665

Open
crusaderky opened this issue Mar 17, 2023 · 2 comments

crusaderky commented Mar 17, 2023

XREFs

In #7586, we started collecting very granular metrics on how workers are spending their time.
Demo: https://gist.github.com/crusaderky/a97f870c51260e63a1c14c20b762f666

As of that PR, we collect metrics in Worker.digests_total about:

  • Worker.execute, broken down by task prefix and activity, with special treatment for failed and cancelled tasks
  • Worker.gather_dep, broken down by activity, with special treatment for failed and cancelled transfers
  • Worker.get_data, broken down by activity
  • WorkerMemoryMonitor._spill, broken down by activity
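
For reference, a minimal sketch (not part of the PR) of peeking at these raw counters directly on the workers; it assumes a running distributed Client and relies on Client.run() injecting the worker into a parameter named dask_worker:

    from distributed import Client

    client = Client()  # or connect to an existing cluster

    def dump_digests(dask_worker):
        # Worker.digests_total is a mapping of ever-increasing key -> float totals
        return dict(dask_worker.digests_total)

    # {worker address: {metric key: cumulative value}}
    per_worker_metrics = client.run(dump_digests)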

This issue is a meta-tracker of all potential follow-ups, as well as a place to discuss high level design and cost/benefit ratios holistically.

The follow-ups can be broken down into two high-level threads:

  • Improve quality and usability of the collected data
  • What we do with the data

plus assorted finishing touches.

crusaderky commented Mar 31, 2023

Current data schema

#7701 (currently in main git tip; will be in release 2023.4.1) introduces a new mapping Scheduler.cumulative_worker_metrics.

These are ever-increasing key -> float pairs.

The keys are as follows. Note that there are more keys than the ones listed below, and that while all keys listed below are tuples, some keys may be bare strings.
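
A minimal sketch of reading this mapping from a client, assuming a running cluster (Client.run_on_scheduler() injects the scheduler into a parameter named dask_scheduler):

    from distributed import Client

    client = Client()  # or connect to an existing cluster

    metrics = client.run_on_scheduler(
        lambda dask_scheduler: dict(dask_scheduler.cumulative_worker_metrics)
    )
    # Keys are a mix of tuples and bare strings, so sort by their string form
    for key, value in sorted(metrics.items(), key=str):
        print(key, value)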

Worker.execute() metrics

Format

("execute", <task prefix>, <activity>, <unit>) -> float amount

All metrics with the same unit are additive.
In a hypothetical, perfect scenario where all workers run tasks back-to-back non-stop, the seconds metrics would add up to the number of threads on the cluster multiplied by the cluster uptime (although seceded tasks will skew that; see #7675).
Metrics are captured upon task termination, so in the case of long-running tasks, if you scrape them frequently you may observe artificial spikes that exceed your scraping interval (see #7677).
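
To illustrate the additivity, here's a rough sketch of summing the execute seconds per prefix and comparing the grand total against threads × uptime; it reuses the metrics mapping and client from the snippet above, and the uptime figure is an assumption you need to supply yourself:

    from collections import defaultdict

    seconds_by_prefix = defaultdict(float)
    for key, value in metrics.items():
        if not isinstance(key, tuple) or len(key) != 4:
            continue
        context, prefix, activity, unit = key
        if context == "execute" and unit == "seconds":
            seconds_by_prefix[prefix] += value

    total_seconds = sum(seconds_by_prefix.values())
    nthreads = sum(client.nthreads().values())  # total threads on the cluster
    uptime = 3600.0  # assumed cluster uptime in seconds; substitute your own measurement
    # In the ideal back-to-back scenario described above this would approach 1.0
    print(f"occupancy = {total_seconds / (nthreads * uptime):.2f}")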

Individual labels

Deserialize run_spec

  • ("execute", <prefix>, "deserialize", "seconds")

Unspill inputs

  • ("execute", <prefix>, "disk-read", "seconds")
  • ("execute", <prefix>, "disk-read", "count")
  • ("execute", <prefix>, "disk-read", "bytes")
  • ("execute", <prefix>, "decompress", "seconds")
  • ("execute", <prefix>, "deserialize", "seconds") (overlaps with run_spec deserialization)

Run task in thread

  • ("execute", <prefix>, "executor", "seconds")
  • ("execute", <prefix>, "thread-cpu", "seconds")
  • ("execute", <prefix>, "thread-noncpu", "seconds")
  • ("execute", <prefix>, <arbitrary user-defined label>, <arbitrary user-defined unit>)

Spill output (will disappear after #4424)

  • ("execute", <prefix>, "serialize", "seconds")
  • ("execute", <prefix>, "compress", "seconds")
  • ("execute", <prefix>, "disk-write", "seconds")
  • ("execute", <prefix>, "disk-write", "count")
  • ("execute", <prefix>, "disk-write", "bytes")

Delta to end-to-end runtime as seen from the worker state machine

  • ("execute", <prefix>, "other", "seconds")

Future additions

Failed tasks

Time wasted on non-successful tasks.
These metrics replace the time metrics listed above.

  • ("execute", <prefix>, "failed", "seconds")
  • ("execute", <prefix>, "cancelled", "seconds")

Worker.gather_dep metrics

Format

("gather-dep", <activity>, <unit>) -> float amount

All metrics with the same unit are additive.
A worker may have more than one network comm active at the same time, so the seconds metrics will likely add up to more than the uptime of the cluster.
Metrics are captured upon termination of a gather_dep call, so in the case of long-running transfers, if you scrape frequently you may observe artificial spikes.
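
For example, a small sketch of breaking the receiving side down by activity (reusing the metrics mapping fetched earlier in this comment):

    gather_dep_seconds = {
        key[1]: value  # key[1] is the activity
        for key, value in metrics.items()
        if isinstance(key, tuple) and len(key) == 3
        and key[0] == "gather-dep" and key[2] == "seconds"
    }
    # e.g. {"network": ..., "decompress": ..., "deserialize": ..., "other": ...}
    print(gather_dep_seconds)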

Individual labels

Worker.gather_dep() method

  • ("gather-dep", "network", "seconds")
  • ("gather-dep", "decompress", "seconds")
  • ("gather-dep", "deserialize", "seconds")

Spill output (will disappear after #4424)

  • ("gather-dep", "serialize", "seconds")
  • ("gather-dep", "compress", "seconds")
  • ("gather-dep", "disk-write", "seconds")
  • ("gather-dep", "disk-write", "count")
  • ("gather-dep", "disk-write", "bytes")

Delta to end-to-end runtime as seen from the worker state machine

  • ("gather-dep", "other", "seconds")

Future additions

Failed transfers

Time wasted on non-successful transfers.
These metrics replace the time metrics listed above.

  • ("gather-dep", "busy", "seconds")
  • ("gather-dep", "missing", "seconds")
  • ("gather-dep", "failed", "seconds")
  • ("gather-dep", "cancelled", "seconds")

Worker.get_data() metrics

Format

("get-data", <activity>, <unit>) -> float amount

All metrics with the same unit are additive.
A worker may have more than one network comm active at the same time, so the seconds metrics will likely add up to more than the uptime of the cluster.
Metrics are captured upon termination of a get_data call, so in the case of long-running transfers, if you scrape frequently you may observe artificial spikes.

Individual labels

Unspill

  • ("get-data", "disk-read", "seconds")
  • ("get-data", "disk-read", "count")
  • ("get-data", "disk-read", "bytes")
  • ("get-data", "decompress", "seconds")
  • ("get-data", "deserialize", "seconds")

Send over the network

  • ("get-data", "serialize", "seconds")
  • ("get-data", "compress", "seconds")
  • ("get-data", "network", "seconds")
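
Since get-data is the sending counterpart of gather-dep, a quick sketch (again reusing the metrics mapping fetched earlier) of contrasting the two sides of a transfer:

    sent = metrics.get(("get-data", "network", "seconds"), 0.0)
    received = metrics.get(("gather-dep", "network", "seconds"), 0.0)
    print(f"network seconds: sending={sent:.1f} receiving={received:.1f}")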

crusaderky commented

Summary from an offline meeting with @fjetter, @hendrikmakait and @ntabris:

  • We should prioritize showing the metrics we're already collecting over adding/refining metrics
  • "End-to-end" metrics are the most useful. Time series can give some additional insights, but not many more.
  • The inclination to expose everything to Prometheus as a go-to is more about the fact that it's easy to do so and less about the data being exposed as time series; actually, in this instance time metrics could cause confusion due to #7677 (Fine performance metrics: Meter currently-executing tasks).
  • We could reuse the scheduler-side construct of "computation" to define what "end-to-end" means. Its data could straightforwardly be extracted with scheduler plugins (e.g. coiled's).
  • There is concern / confusion as to why it appears that sometimes there are overlapping computations. We need to investigate exactly how this happens and decide what to do when the heartbeat sends new metrics and there's more than one computation running.
  • From a UX perspective, it would be valuable to expose the metrics as a cluster total and then let the user drill down into individual computations.
