
Metric for total compute time by task prefix, including currently-executing tasks #7601

Open
gjoseph92 opened this issue Feb 28, 2023 · 2 comments


@gjoseph92
Collaborator

In #7406, we added a Prometheus metric for total on-thread compute time per task prefix. Taking its rate basically gives you total CPU-seconds per second. This is quite a useful overview metric to show on Grafana, since it can a) show generally which tasks were running when, and b) show how well-utilized the cluster was (if you have 20 threads, 20 seconds of compute time per second would be 100% utilization).

However, because this metric is only updated when tasks complete, if tasks are longer than the prometheus scrape interval, the metric can look quite misleading:
[screenshot: Grafana plot comparing the two metrics described below]

Here we're comparing rate(dask_scheduler_tasks_compute_seconds_total[15s]) and dask_worker_tasks{state="executing"}. The second metric was just updated by @crusaderky in #7506; it's basically a gauge of the current number of tasks in state executing on each worker.

You'd expect these metrics to be pretty close to each other: if nthreads tasks were in state executing, you'd expect about nthreads CPU-seconds of compute time per second. However, they look wildly different. I suspect part (or most) of this is because the tasks take much longer than the Prometheus scrape interval.

If you have a 5-minute task, you'll get a bunch of scrapes with no increase in the tasks_compute_seconds_total metric, then one scrape where it jumps up by 5 minutes' worth of compute time. I'm hoping that explains the spikiness here.
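
To make the arithmetic concrete, here's a toy sketch (not dask code) of a counter that only moves when the task finishes, scraped every 15 seconds:

```python
scrape_interval = 15   # seconds between Prometheus scrapes
task_duration = 300    # one 5-minute task, started at t=0
counter = 0.0          # tasks_compute_seconds_total, bumped only at completion
previous = 0.0

for t in range(scrape_interval, 2 * task_duration + 1, scrape_interval):
    if t - scrape_interval < task_duration <= t:
        counter += task_duration  # the single update, when the task completes
    rate = (counter - previous) / scrape_interval  # what rate(...[15s]) would show
    previous = counter
    print(f"t={t:3d}s  counter={counter:5.0f}  rate={rate:4.1f} CPU-s/s")

# The task really contributes 1 CPU-second per second the whole time, but the
# plotted rate is 0 for 19 scrapes and then spikes to 20 for a single scrape.
```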

Ideally we could add another metric (or update this metric) to include time tasks have spent on the threadpool so far, even when they haven't completed yet.
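
Roughly, I'm imagining something like this on the worker side (purely a sketch; `compute_seconds_completed`, `active_compute_started`, and `key_prefix` are made-up names, not actual distributed internals):

```python
import time

from prometheus_client.core import CounterMetricFamily


def key_prefix(key):
    # Simplified stand-in for dask's key-splitting: "sum-abc123" -> "sum"
    name = key[0] if isinstance(key, tuple) else key
    return name.rsplit("-", 1)[0]


def collect_compute_seconds(worker):
    """Hypothetical collector: completed on-thread compute time per prefix, plus
    the time already spent on tasks that are still executing, so long-running
    tasks show up in the metric before they finish."""
    now = time.monotonic()
    family = CounterMetricFamily(
        "dask_worker_tasks_compute_seconds",
        "On-thread compute time per task prefix, including in-progress tasks",
        labels=["task_prefix_name"],
    )
    totals = dict(worker.compute_seconds_completed)  # made-up: {prefix: seconds}
    for key, started in worker.active_compute_started.items():  # made-up: {key: start time}
        prefix = key_prefix(key)
        totals[prefix] = totals.get(prefix, 0.0) + (now - started)
    for prefix, seconds in totals.items():
        family.add_metric([prefix], seconds)
    yield family
```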


Though it's also interesting that even if you set the aggregation interval to something very long like 30min, the averaged-out rate is still much lower than the theoretical max:
[screenshot: the same comparison with a 30-minute aggregation interval]

Because dask_worker_tasks is a gauge, it's also possible that it overestimates utilization (it's only sampled every 5s, or whatever the scrape interval is; any idle time between scrapes is not captured).

The high worker CPU in this case makes me think it's probably not that far off, though.

@gjoseph92
Collaborator Author

After discussion with @crusaderky, this overlaps a lot (in terms of code changes) with current work on #7565. So the sequence will be:

  1. Some variant of #7586 (Fine performance metrics for execute, gather_dep, etc.)
  2. Linking that up with Prometheus
  3. Then, looking into refactoring that tracing to include currently-elapsed time

@fjetter
Member

fjetter commented Mar 14, 2023

I do not agree with the notion that these two lines necessarily need to align. This is all about semantics, i.e. how the actual counter is defined. These two plots are not showing the same information: one metric is defined as "total finished compute time" while the other is "tasks currently in state executing on workers". There is a correlation, of course, but I don't see why they have to be aligned.


This is mildly off topic and likely not helpful for this specific case, but please note that Prometheus can indeed export metrics with a custom, specific timestamp, i.e. if we know the value of something at a given time, we can export that value for that exact time. This likely won't help much with "smearing out" a metric as you're intending here, but it has come up frequently enough in the past that I felt the need to share it.

Looking at something like the GaugeMetricFamily we're using, you'll notice that GaugeMetricFamily.add_metric actually exposes a timestamp kwarg which can be used to specify the exact time when an event occurred.

To illustrate, I set this for the partially_saturated metric below:

```
# /metrics
dask_scheduler_workers{state="idle"} 0.0
# See how the value and the timestamp is exported
dask_scheduler_workers{state="partially_saturated"} 0.0 1678802020560
dask_scheduler_workers{state="saturated"} 0.0
dask_scheduler_workers{state="paused_or_retiring"} 0.0
```
