-
Notifications
You must be signed in to change notification settings - Fork 456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add tokio metrics collector #5647
base: main
Are you sure you want to change the base?
Conversation
2340 tests run: 2225 passed, 0 failed, 115 skipped (full report)Flaky tests (2)Postgres 16
Postgres 15
Code coverage (full report)
The comment gets automatically updated with the latest test results
ac0b314 at 2023-10-26T22:06:47.381Z :recycle: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is
which, I know, doesn't implement the prometheus::core::Collector.
But, it abstracts the idea of looking at the RuntimeMetrics in fixed intervals. Which I think is a very good idea, because many of the tokio::RuntimeMetrics fields are aggregations already.
For example, I think to detect periods of increased executor stalls due to, say, slow blocking IO, you'd want to sample&reset max_busy_duration
at a given frequency. (I could be wrong about this though, haven't looked deeply at RuntimeMetrics because they seemed out of reach due to the cfg tokio_unstable
)
I did look into this, it seemed like managing a timer tasks would add more overhead and complexity than sampling the runtime on demand.
Are they? I see only 2 metrics that fit this:
I'm not keen on enabling poll_count histograms as it's going to add more overhead:
So really it's only 1 aggregate in this case. The rest of the quantities are counters and gauges.
We can get worker_total_busy_duration as a counter, couldn't you use a sum(rate()) construction to measure this. Unless you think 15s granularity is far too low resolution. |
Regarding gauges and sampling rates, it's been requested to use counter pairs tokio-rs/tokio#4073 (comment) (which I know @koivunej will be happy about) |
Excellent :) Yes, this brings joy for me :) |
Well this is the primary use case that I have for these metrics in pageserver. We're ready to take a constant overhead in exchange for getting that observability. |
Hmm, fair enough. |
I was curious so I ran a quick HTTP benchmark and it only had a 1% degradation in performance |
Problem
It would be interesting to see runtime stats over time
Summary of changes
Adds a prometheus collector for tokio runtime metrics.
TODO:
Checklist before requesting a review
Checklist before merging