storage: granular IO metrics #112898
Would've helped with cockroachlabs/support#2673.
As part of this, we should also do #104114. We should have one goroutine that knows the mapping of path to disk and is responsible for reading the disk stats file in /proc/. The stats it reads can be used to power these granular IO metrics, the per-store metrics described in #104114, and the disk stats plumbed into admission control. A rough sketch of such a goroutine follows.
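Here is a minimal, hypothetical sketch of that idea: a single goroutine owns the store-path-to-device mapping, polls `/proc/diskstats`, and publishes the latest snapshot for consumers. The package, type, and method names are illustrative, not the actual CockroachDB implementation.

```go
// Package diskmon is a hypothetical sketch of a central disk-stats poller.
package diskmon

import (
	"bufio"
	"os"
	"strconv"
	"strings"
	"sync"
	"time"
)

// Stats holds a subset of the cumulative counters in /proc/diskstats.
type Stats struct {
	ReadsCompleted  uint64
	WritesCompleted uint64
	SectorsRead     uint64
	SectorsWritten  uint64
}

// Monitor owns the mapping of store paths to block devices and the most
// recent snapshot of per-device counters.
type Monitor struct {
	mu          sync.Mutex
	latest      map[string]Stats  // keyed by device name, e.g. "nvme0n1"
	storeDevice map[string]string // store path -> backing block device
}

// Run polls /proc/diskstats at the given interval until stop is closed.
func (m *Monitor) Run(interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if snap, err := readDiskStats("/proc/diskstats"); err == nil {
				m.mu.Lock()
				m.latest = snap
				m.mu.Unlock()
			}
		}
	}
}

// LatestForStore returns the most recent stats for the device backing the
// given store path, if the mapping and a snapshot are known.
func (m *Monitor) LatestForStore(storePath string) (Stats, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	dev, ok := m.storeDevice[storePath]
	if !ok {
		return Stats{}, false
	}
	s, ok := m.latest[dev]
	return s, ok
}

func readDiskStats(path string) (map[string]Stats, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	out := make(map[string]Stats)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Each line: major minor name reads ... writes ... (14+ columns).
		fields := strings.Fields(sc.Text())
		if len(fields) < 10 {
			continue
		}
		parse := func(i int) uint64 {
			v, _ := strconv.ParseUint(fields[i], 10, 64)
			return v
		}
		out[fields[2]] = Stats{
			ReadsCompleted:  parse(3),
			SectorsRead:     parse(5),
			WritesCompleted: parse(7),
			SectorsWritten:  parse(9),
		}
	}
	return out, sc.Err()
}
```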
This is scheduled for the 24.1 release cycle.
In November I started prototyping this and #104114 in jbowens@7103152. The reading of mountpoints may not actually be necessary. |
115375: changefeedccl: reduce rebalancing memory usage from O(ranges) to O(spans) r=jayshrivastava a=jayshrivastava

### sql: count ranges per partition in PartitionSpans
This change updates span partitioning to count ranges while making partitions. This allows callers to rebalance partitions based on range counts without having to iterate over the spans to count ranges.
Release note: None
Epic: None

### changefeedccl: reduce rebalancing memory usage from O(ranges) to O(spans) #115375
Previously, `rebalanceSpanPartitions` would use O(ranges) memory. This change rewrites it to use range iterators, reducing the memory usage to O(spans). This change also adds a randomized test to assert that all spans are accounted for after rebalancing, plus one more unit test.
Informs: #113898
Epic: None

### changefeedccl: add rebalancing checks
This change adds extra test coverage for partition rebalancing in changefeeds. It adds checks which are performed after rebalancing to assert that the output list of spans covers exactly the same keys as the input list of spans. These checks are expensive, so they only run if the environment variable `COCKROACH_CHANGEFEED_TESTING_REBALANCING_CHECKS` is true. This variable is true in cdc roachtests and unit tests.
Release note: None
Epic: None

119885: storage: support per-store IO metrics with fine granularity r=jbowens,abarganier a=CheranMahalingam

Currently, timeseries metrics are collected on a 10s interval, which hides momentary spikes in IO. This commit introduces a central disk monitoring system that polls for disk stats at a 100ms interval. Additionally, the current system accumulates disk metrics across all block devices, which includes noise from unrelated processes. This commit also adds support for exporting per-store IO metrics (i.e. IO stats for block devices that map to stores used by Cockroach). These changes will be followed up by a PR to remove the need for customers to specify disk names when setting the provisioned bandwidth for each store, as described in #109350.
Fixes: #104114, #112898.
Informs: #89786.
Epic: None.
Release note: None.

120649: changefeedccl: avoid undefined behavior in distribution test r=wenyihu6 a=jayshrivastava

The `rangeDistributionTester` would sometimes calculate log(0) when determining the node to move a range to. Most of the time, this would produce a garbage value which gets ignored. Sometimes, it may return a valid node id, causing the range distribution to be wrong and the test to fail. This change updates the tester to handle this edge case.
Closes: #120470
Release note: None

Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com>
Co-authored-by: Cheran Mahalingam <cheran.mahalingam@cockroachlabs.com>
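To illustrate how fine-grained rates can be derived from a 100ms polling loop like the one PR 119885 describes, here is a hedged, self-contained sketch. The struct and function names are hypothetical; the only factual assumption is that `/proc/diskstats` reports cumulative counters and sectors in fixed 512-byte units.

```go
package diskmon

import "time"

// diskCounters is one cumulative snapshot of the counters of interest.
type diskCounters struct {
	readsCompleted, writesCompleted uint64
	sectorsRead, sectorsWritten     uint64
}

// rates holds per-second rates derived over a single polling interval.
type rates struct {
	readIOPS, writeIOPS               float64
	readBytesPerSec, writeBytesPerSec float64
}

// ratesBetween converts two cumulative snapshots taken `elapsed` apart
// (e.g. ~100ms) into per-second rates. /proc/diskstats always reports
// sectors in units of 512 bytes, independent of the device's sector size.
func ratesBetween(prev, cur diskCounters, elapsed time.Duration) rates {
	const sectorSize = 512
	secs := elapsed.Seconds()
	return rates{
		readIOPS:         float64(cur.readsCompleted-prev.readsCompleted) / secs,
		writeIOPS:        float64(cur.writesCompleted-prev.writesCompleted) / secs,
		readBytesPerSec:  float64(cur.sectorsRead-prev.sectorsRead) * sectorSize / secs,
		writeBytesPerSec: float64(cur.sectorsWritten-prev.sectorsWritten) * sectorSize / secs,
	}
}
```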
Our existing timeseries metrics are collected on a 10s interval. This coarse granularity makes it impossible to detect events of high variance within the 10s interval. A momentary spike of 16k IO operations in 1 second can be presented as 1.6k IOPS over the 10s interval. A spike like this could force IO operations to queue, inducing latency beyond what our customers consider acceptable, without leaving a trace of the latency's source.
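A toy illustration of the arithmetic, assuming one 16k-op burst inside an otherwise idle 10-second window: the 10s average reports 1.6k IOPS, while only per-second sampling exposes the true peak.

```go
package main

import "fmt"

func main() {
	// One second of the 10s window sees a 16,000-operation burst.
	perSecondOps := []float64{0, 0, 0, 16000, 0, 0, 0, 0, 0, 0}

	var sum, peak float64
	for _, v := range perSecondOps {
		sum += v
		if v > peak {
			peak = v
		}
	}
	fmt.Printf("10s average: %.0f IOPS\n", sum/10) // 1600 — the spike is flattened
	fmt.Printf("1s peak:     %.0f IOPS\n", peak)   // 16000 — visible only at 1s granularity
}
```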
We should collect finer, per-second samples for select metrics rather than waiting for our timeseries infrastructure to support finer resolution across the board. We can surface these metrics through a few strategies:
Jira issue: CRDB-32668