*: fine-grained cpu attribution #82625

irfansharif · 2022-06-08T19:24:37Z

This is the tracking issue for #82356. We expect the work here to break down into the following deliverables.

Must have:

Write a program to build Go archives for various OS/arch combinations CRDB officially supports/makes use of internally (linux arm64+amd64, darwin arm64+amd64, freebsd amd64, windows amd64). A prototype using hand-rolled scripts (but without CGO support) is here.
- The official steps, in code form, are are here.
- We probably want a docker image with the right cross-compilers to run this program under, these archives need to be built with CGO support.
- Being able to generate these archives in CI as downloadable artifacts would be a nice bonus. Ditto if we're able to run the Go tests alongside.
When bumping Go versions (major or minor), updating our steps to use a manually generated archive from above which also applies the patchset proposed in rfcs: fine-grained cpu attribution #82356 (kept in a cockroachdb/go repo, or checked into cockroachdb/cockroach as a patch file).
Introduce a library in cockroachdb/cockroach to access per-goroutine running time. This needs to be gated behind a build tag to error/zero out when not built using the patched runtime.

Nice to have:

Support a per-store CPU usage breakdown by (i) ranges and (ii) tenants, powered by per-goroutine running time. We'd make this accessible through vtables and include results in debug zips, serving as a CPU-only version of today's hot-ranges report. A prototype of this is available here.
Surfacing per-statement cluster-wide CPU usage as part of EXPLAIN ANALYZE. This would likely require integrating the library above into the Stopper and recording as tracing events total running CPU time for a given request (propagated all the way up to the gateway). Tracked in sql: record CPU seconds per statement #87213.

82013: storage: add MVCC point synthesizing iterator r=jbowens a=erikgrinaker This patch adds `pointSynthesizingIter`, an MVCC iterator which wraps an arbitrary MVCC iterator and synthesizes point keys for range keys at their start key and where they overlap point keys. It can optionally synthesize around the SeekGE seek key too, which is useful for point operations like `MVCCGet` where we may want to return a synthetic tombstone for an MVCC range tombstone if there is no existing point key. This will primarily be used to handle MVCC range tombstones in MVCC scans and gets, as well as during MVCC conflict checks, which allows much of this logic to remain unchanged and simplified (in particular, `pebbleMVCCScanner` will not need any changes). However, this patch does not make use of the iterator yet, since both it and Pebble will need further performance optimizations for use in hot paths. For now, correctness is sufficient, and only basic attempts at performance optimization have been made. Touches #70412. Release note: None 82547: opt: fix inverted key column type in testcat r=mgartner a=mgartner Previously, inverted key columns in the test catalog were incorrectly given the type of their source column. Now they have the correct type, `types.EncodedKey`. Release note: None 82635: bazel: unlist unsupported os+arch combos r=irfansharif a=irfansharif We don't support these for CRDB builds. Even windows is experimental, but lets keep that around anyway since we publish binaries for it. +cc #82625. Release note: None Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com> Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com> Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>

irfansharif · 2022-07-04T15:14:31Z

Connecting some dots. From an internal doc around admission control, which would benefit from:

CPU timers: A measurement of total cpu time used by a goroutine (analogous to a
thread cpu timer), would allow us to compensate tenants who experience high
contention or high IO time, and so spend more time blocked after admission than
others. This is covered in what we are asking for in
https://github.com/golang/go/issues/41554

I was also curious about how the Go runtime is able to pace its GC to use only 30% CPU, and came across https://github.com/golang/proposal/blob/master/design/44167-gc-pacer-redesign.md#a-note-about-cpu-utilization (this whole document is fantastic). This was in the context of trying to understand how we could possibly do similar per-node CPU control for tenants and/or ranges, or get more accurate resource use signals. Turns out Go internally just reads off of nanotime(), using it to compute running time for individual goroutines doing GC assist work, without trying to be accurate about OS threads being descheduled. +cc golang/go#41554 (comment).

https://github.com/irfansharif/go/blob/2eb8b6eec65d3d214c07067db474bbb93de3443a/src/runtime/mgcmark.go#L592-L601

Package grunning is a library that's able to retrieve on-CPU running time for individual goroutines. It relies on using a patched Go and provides a primitive for fine-grained CPU attribution and control through a single API: package grunning // Time returns the time spent by the current goroutine in the // running state. func Time() time.Duration The motivating RFC is over at cockroachdb#82356. Informs cockroachdb#82625. We build CRDB using use the patched Go runtime for all officially supported platforms when built using Bazel (cockroachdb#84867). Engineers commonly building CRDB also use happen to use two platforms we don't use a patched Go for: - FreeBSD (we don't have cross-compilers setup), and - M1/M2 Macs (we don't have a code-signing pipeline, yet). We use '(darwin && arm64) || freebsd || !bazel' as the build tag to exclude such platforms. See cockroachdb#84867 for more details. This package tests various properties we should expect over the running time value. It does not make assertions given the CI environments we run these under (CPU-starved, lot of OS thread pre-emption, dissimilar to healthy CRDB deployments). This is also why they're skipped under stress. Still, these tests are useful to understand the properties we expect running time to have: === RUN TestEquivalentGoroutines thread=03 expected≈10.00% got= 9.98% of on-cpu time thread=06 expected≈10.00% got=10.00% of on-cpu time thread=02 expected≈10.00% got=10.01% of on-cpu time thread=10 expected≈10.00% got=10.01% of on-cpu time thread=07 expected≈10.00% got= 9.99% of on-cpu time thread=04 expected≈10.00% got= 9.99% of on-cpu time thread=09 expected≈10.00% got=10.00% of on-cpu time thread=01 expected≈10.00% got= 9.99% of on-cpu time thread=08 expected≈10.00% got=10.02% of on-cpu time thread=05 expected≈10.00% got=10.02% of on-cpu time --- PASS: TestEquivalentGoroutines (0.56s) === RUN TestProportionalGoroutines thread=01 got 1.82% of on-cpu time: expected≈ 1.00x got=1.00x thread=02 got 3.64% of on-cpu time: expected≈ 2.00x got=2.00x thread=03 got 5.47% of on-cpu time: expected≈ 3.00x got=3.00x thread=04 got 7.28% of on-cpu time: expected≈ 4.00x got=4.00x thread=05 got 9.09% of on-cpu time: expected≈ 5.00x got=4.99x thread=06 got 10.91% of on-cpu time: expected≈ 6.00x got=5.99x thread=07 got 12.73% of on-cpu time: expected≈ 7.00x got=6.99x thread=08 got 14.54% of on-cpu time: expected≈ 8.00x got=7.99x thread=09 got 16.36% of on-cpu time: expected≈ 9.00x got=8.99x thread=10 got 18.16% of on-cpu time: expected≈10.00x got=9.97x --- PASS: TestProportionalGoroutines (1.72s) === RUN TestPingPongHog pinger/ponger expected≈1.00x got=0.96x --- PASS: TestPingPongHog (0.91s) Release note: None

85850: importer: support {} format for arrays in CSV r=ecwall a=rafiss fixes #84631 Release note (sql change): Arrays can now be imported in a CSV file using the {} format, similar to COPY FROM. Importing array expressions (e.g. ARRAY[1, 2, 3]) is still supported as well. 85854: clusterversion,storage: remove 22.1 PebbleFormat version gates r=jbowens a=celiala This commit removes the following 22.1 version gates: - PebbleFormatBlockPropertyCollector - PebbleFormatSplitUserKeysMarked Cleanup was done following guidance from [21.2 cleanup](#74270 (comment)): > For the most part, if the gates were just simple if !version.IsActive { return x } or something, I just removed the block, and even if it was a little more complicated, like args = [x]; if version { args = append(args, y) }; foo(args) I still tried to mostly inline it such that it looked natural (i.e. remove that append and make it args = [x, y]). > However for just a couple more complicated cases that were referring to <21.2 versions that needed to be replaced when those were deleted, I added a placeholder clusterversion.TODOPre21_2 alias for 21.2. Replacing those calls with this alias shouldn't change their behavior -- it was already always true, since the code today should never run in a <21.2 cluster -- but means we can delete those older versions in the meantime and then the owners of these bits can decide how to update them. Partially addresses #80663 Release note: none 85909: kvserver: instrument RaftTransport workers with pprof labels r=tbg,erikgrinaker a=pavelkalinnikov The unused arguments in the method signature were used to identify goroutines in traces. This no longer works after Go 1.17 started passing arguments via registers. This commit adds pprof labels when starting these goroutines, to have a cleaner code, more readable traces, and to work around the new Go convention. Release note: None 85977: grunning: add library for precise on-CPU time measurement r=irfansharif a=irfansharif Package grunning is a library that's able to retrieve on-CPU running time for individual goroutines. It relies on using a patched Go and provides a primitive for fine-grained CPU attribution and control through a single API: package grunning // Time returns the time spent by the current goroutine in the // running state. func Time() time.Duration The motivating RFC is over at #82356. Informs #82625. We build CRDB using use the patched Go runtime for all officially supported platforms when built using Bazel (#84867). Engineers commonly building CRDB also use happen to use two platforms we don't use a patched Go for: - FreeBSD (we don't have cross-compilers setup), and - M1/M2 Macs (we don't have a code-signing pipeline, yet). We use '(darwin && arm64) || freebsd || !bazel' as the build tag to exclude such platforms. See #84867 for more details. This package tests various properties we should expect over the running time value. It does not make assertions given the CI environments we run these under (CPU-starved, lot of OS thread pre-emption, dissimilar to healthy CRDB deployments). This is also why they're skipped under stress. Still, these tests are useful to understand the properties we expect running time to have: === RUN TestEquivalentGoroutines thread=03 expected≈10.00% got= 9.98% of on-cpu time thread=06 expected≈10.00% got=10.00% of on-cpu time thread=02 expected≈10.00% got=10.01% of on-cpu time thread=10 expected≈10.00% got=10.01% of on-cpu time thread=07 expected≈10.00% got= 9.99% of on-cpu time thread=04 expected≈10.00% got= 9.99% of on-cpu time thread=09 expected≈10.00% got=10.00% of on-cpu time thread=01 expected≈10.00% got= 9.99% of on-cpu time thread=08 expected≈10.00% got=10.02% of on-cpu time thread=05 expected≈10.00% got=10.02% of on-cpu time --- PASS: TestEquivalentGoroutines (0.56s) === RUN TestProportionalGoroutines thread=01 got 1.82% of on-cpu time: expected≈ 1.00x got=1.00x thread=02 got 3.64% of on-cpu time: expected≈ 2.00x got=2.00x thread=03 got 5.47% of on-cpu time: expected≈ 3.00x got=3.00x thread=04 got 7.28% of on-cpu time: expected≈ 4.00x got=4.00x thread=05 got 9.09% of on-cpu time: expected≈ 5.00x got=4.99x thread=06 got 10.91% of on-cpu time: expected≈ 6.00x got=5.99x thread=07 got 12.73% of on-cpu time: expected≈ 7.00x got=6.99x thread=08 got 14.54% of on-cpu time: expected≈ 8.00x got=7.99x thread=09 got 16.36% of on-cpu time: expected≈ 9.00x got=8.99x thread=10 got 18.16% of on-cpu time: expected≈10.00x got=9.97x --- PASS: TestProportionalGoroutines (1.72s) === RUN TestPingPongHog pinger/ponger expected≈1.00x got=0.96x --- PASS: TestPingPongHog (0.91s) Release note: None 85987: sql: fix aggregation of statistics r=maryliag a=maryliag Previously, because we were using a join, we were double counting statistics when we had the same fingerprint in memory and persisted. This commit adds a `DISTINCT` so we only count them once. Fixes #85958 Release note: None 86008: sql: logic test to inform maximum builtin function oid change r=chengxiong-ruan a=chengxiong-ruan This commit adds a logic test to let engineers who added a new builtin function know that the new builtin function is constructed earlier than some existing builtin functions at init time. Release note: None Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com> Co-authored-by: Celia La <celia@cockroachlabs.com> Co-authored-by: Pavel Kalinnikov <pavel@cockroachlabs.com> Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com> Co-authored-by: Marylia Gutierrez <marylia@cockroachlabs.com> Co-authored-by: Chengxiong Ruan <chengxiongruan@gmail.com>

irfansharif · 2022-09-12T20:00:18Z

The necessary work is done.

joshimhoff · 2022-10-26T13:24:30Z

This is super cool. I didn't know we actually implemented this until now. @irfansharif, where do we use this capability today?

irfansharif · 2022-10-26T13:36:30Z

Admission control for now, #86638, coming to a 22.2 CC cluster near you. It's pretty sick, check out #89709 (comment) for ex. In 23.1 I hope we start using it in the allocator: #83490.

joshimhoff · 2022-10-26T13:42:41Z

So great!

irfansharif added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-distribution Relating to rebalancing and leasing. A-kv-observability labels Jun 8, 2022

blathers-crl bot added T-dev-inf T-kv KV Team labels Jun 8, 2022

irfansharif changed the title *: fine-grained cpu attribution for v22.2 *: fine-grained cpu attribution Jun 8, 2022

irfansharif mentioned this issue Jun 8, 2022

bazel: unlist unsupported os+arch combos #82635

Merged

mari-crl added the sync-me-7 label Jun 9, 2022

jlinder added the sync-me-8 label Jun 10, 2022

rickystewart mentioned this issue Jul 21, 2022

ci: add scripts to build patched versions of Go runtime #84867

Merged

irfansharif mentioned this issue Aug 11, 2022

grunning: add library for precise on-CPU time measurement #85977

Merged

exalate-issue-sync bot removed the T-dev-inf label Aug 12, 2022

jordanlewis mentioned this issue Aug 31, 2022

sql: record CPU seconds per statement #87213

Open

3 tasks

irfansharif closed this as completed Sep 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

*: fine-grained cpu attribution #82625

*: fine-grained cpu attribution #82625

irfansharif commented Jun 8, 2022 •

edited

Loading

irfansharif commented Jul 4, 2022

irfansharif commented Sep 12, 2022

joshimhoff commented Oct 26, 2022

irfansharif commented Oct 26, 2022 •

edited

Loading

joshimhoff commented Oct 26, 2022

*: fine-grained cpu attribution #82625

*: fine-grained cpu attribution #82625

Comments

irfansharif commented Jun 8, 2022 • edited Loading

irfansharif commented Jul 4, 2022

irfansharif commented Sep 12, 2022

joshimhoff commented Oct 26, 2022

irfansharif commented Oct 26, 2022 • edited Loading

joshimhoff commented Oct 26, 2022

irfansharif commented Jun 8, 2022 •

edited

Loading

irfansharif commented Oct 26, 2022 •

edited

Loading