diff --git a/tools/tracing/README.md b/tools/tracing/README.md new file mode 100644 index 0000000000..ab368fd01d --- /dev/null +++ b/tools/tracing/README.md @@ -0,0 +1,241 @@ +# MMTk performance tracing + +## Notes for MMTk developers +Please open pull requests if you develop new tools that others might find useful. +When you add new tools, please update this documentation. +If you change MMTk internals that the tracing tools depend on (such as the +definition of `enum WorkBucketStage`), please update the scripts accordingly. + +## Notes for MMTk users +Since some of the tools depend on the MMTk internals, please use the tools +shipped with the MMTk release you use. + +## Tracepoints +Currently, the core provides the following tracepoints. +- `mmtk:collection_initialized()`: GC is enabled +- `mmtk:harness_begin()`: the timing iteration of a benchmark begins +- `mmtk:harness_end()`: the timing iteration of a benchmark ends +- `mmtk:gccontroller_run()`: the GC controller thread enters its work loop +- `mmtk:gcworker_run()`: a GC worker thread enters its work loop +- `mmtk:gc_start()`: a collection epoch starts +- `mmtk:gc_end()`: a collection epoch ends +- `mmtk:process_edges(num_edges: int, is_roots: bool)`: an invocation of the +`process_edges` method. The first argument is the number of edges to be processed, +and the second argument is whether these edges are root edges. +- `mmtk:bucket_opened(id: int)`: a work bucket is opened. The first argument is the +numerical representation of `enum WorkBucketStage`. +- `mmtk:work_poll()`: a work packet is to be polled. +- `mmtk:work(type_name: char *, type_name_len: int)`: a work packet was just +executed. The first argument points to the string of the Rust type name of +the work packet, and the second argument is the length of the string. +- `mmtk:alloc_slow_once_start()`: the allocation slow path starts. +- `mmtk:alloc_slow_once_end()`: the allocation slow path ends.
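+ +For example, a minimal bpftrace script in the same style as the tools below (a sketch: it assumes the `$MMTK` placeholder is substituted with the path of the MMTk binary, as `run.py` does for the shipped tools, and the map names are illustrative) can combine the `gc_start` and `gc_end` tracepoints to record a histogram of GC pause times in microseconds: +``` +usdt:$MMTK:mmtk:gc_start { + @gc_start_ns = nsecs; +} + +usdt:$MMTK:mmtk:gc_end { + @gc_pause_usecs = hist((nsecs - @gc_start_ns) / 1000); +} +```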
+ +## Running tracing tools +The tracing tools are to be invoked by a wrapper script `run.py`. +``` +usage: run.py [-h] [-b BPFTRACE] -m MMTK [-H] [-p] [-f {text,json}] tool + +positional arguments: + tool Name of the bpftrace tool + +optional arguments: + -h, --help show this help message and exit + -b BPFTRACE, --bpftrace BPFTRACE + Path of the bpftrace executable + -m MMTK, --mmtk MMTK Path of the MMTk binary + -H, --harness Only collect data for the timing iteration (harness_begin/harness_end) + -p, --print-script Print the content of the bpftrace script + -f {text,json}, --format {text,json} + bpftrace output format +``` + +- `-b`: the path to the `bpftrace` executable. By default, it uses the `bpftrace` +executable in your `PATH`. We strongly recommend you use the latest statically +compiled `bpftrace` from [upstream](https://github.com/iovisor/bpftrace/releases). +You need sudo permission to run whichever `bpftrace` you choose. +- `-m`: the path to an MMTk binary that contains the tracepoints. +This depends on the binding you use. +For the OpenJDK binding, it should be `jdk/lib/server/libmmtk_openjdk.so` under +your build folder. +To check whether the binary contains tracepoints, you can use `readelf -n`. +You should see a bunch of `stapsdt` notes with `mmtk` as the provider. +- `-H`: pass this flag if you want to measure only the timing iteration of a +benchmark. +By default, the tracing tools will measure the entire execution. +- `-p`: print the entire tracing script before execution. +This is mainly for debugging use. +- `-f`: change the bpftrace output format. +By default, it uses human-readable plain text output (`text`). +You can set this to `json` for easy parsing. + +Please run the tracing tools **before** running the workload. +If you use `-H`, the tracing tools will automatically end when `harness_end` is +called. +Otherwise, you will need to terminate the tools manually with `Ctrl-C`.
+These tools also have a timeout of 1200 seconds so as not to stall unattended +benchmark execution. + +## Tracing tools +### Measuring the time spent in the allocation slow path (`alloc_slow`) +This tool measures the distribution of the allocation slow path time. +The time unit is 400ns, so that the power-of-two histogram bins resolve the +typical slow-path durations with higher fidelity. + +Sample output: +``` +@alloc_slow_hist: +[4, 8) 304 |@ | +[8, 16) 12603 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| +[16, 32) 8040 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[32, 64) 941 |@@@ | +[64, 128) 171 | | +[128, 256) 13 | | +[256, 512) 2 | | +[512, 1K) 0 | | +[1K, 2K) 0 | | +[2K, 4K) 0 | | +[4K, 8K) 0 | | +[8K, 16K) 0 | | +[16K, 32K) 14 | | +[32K, 64K) 37 | | +[64K, 128K) 19 | | +[128K, 256K) 1 | | +``` + +In the above output, we can see that most allocation slow paths finish between +3.2us and 6.4us. +However, there is a long tail, presumably due to GC pauses. + +### Measuring the time spent in different GC stages (`gc_stages`) +This tool measures the time spent in different stages of GC: before `Closure`, +during `Closure`, and after `Closure`. +The time unit is ns. + +Sample output: +``` +@closure_time: 1405302743 +@post_closure_time: 81432919 +@pre_closure_time: 103886118 +``` + +In the above output, overall, the execution spends 1.4s in the main transitive +closure, 103ms before that, and 81ms after that (a total of around 1.5s). + +### Measuring the time spent in the lock-contended state for Rust `Mutex` (`lock_contended`) +This tool measures the time spent in the lock-contended state for Rust `Mutex`es. +The Rust standard library implements `Mutex` using the fast-slow-path paradigm. +Most lock operations take place in inlined fast paths, when there's no contention. +However, when there's contention, +`std::sys::unix::locks::futex_mutex::Mutex::lock_contended` is called.
+ +```rust +#[inline] +pub fn lock(&self) { + if self.futex.compare_exchange(0, 1, Acquire, Relaxed).is_err() { + self.lock_contended(); + } +} + +#[cold] +fn lock_contended(&self) { + // Body elided: the slow path waits on the futex until the lock is released. +} +``` + +MMTk uses Rust `Mutex`es for synchronization, e.g., in allocation slow paths, +and this tool can be useful for measuring the contention in these parts of the code. + +The time unit is 256ns. + +Sample output: +``` +@lock_dist[140637228007056]: +[1] 447 |@@@@ | +[2, 4) 3836 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[4, 8) 3505 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[8, 16) 1354 |@@@@@@@@@@@@@@ | +[16, 32) 832 |@@@@@@@@ | +[32, 64) 1077 |@@@@@@@@@@@ | +[64, 128) 2991 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[128, 256) 4846 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[256, 512) 5013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| +[512, 1K) 1203 |@@@@@@@@@@@@ | +[1K, 2K) 34 | | +[2K, 4K) 15 | | +``` + +In the above output, we can see that the lock instance (140637228007056, or 0x7fe8a8047e90) +roughly has a bimodal distribution in terms of the time spent in the lock-contended +code path. +The first peak is around 512ns\~1024ns, and the second peak is around 66us\~131us. + +If you can't tell which lock instance corresponds to which lock in MMTk, you can trace +the allocation of the `Mutex` and record the stack trace (note that you might want +to compile MMTk with `force-frame-pointers` to obtain better stack traces). + +### Measuring the distribution of `process_edges` packet sizes (`packet_size`) +Most of the GC time is spent in the transitive closure for tracing-based GCs, +and MMTk performs the transitive closure via work packets that call the `process_edges` method. +This tool measures the distribution of the sizes of these work packets, and also +counts root edges separately.
+ +Sample output: +``` +@process_edges_packet_size: +[1] 238 |@@@@@ | +[2, 4) 806 |@@@@@@@@@@@@@@@@@ | +[4, 8) 1453 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[8, 16) 1105 |@@@@@@@@@@@@@@@@@@@@@@@ | +[16, 32) 2410 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| +[32, 64) 1317 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[64, 128) 1252 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[128, 256) 1131 |@@@@@@@@@@@@@@@@@@@@@@@@ | +[256, 512) 2017 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[512, 1K) 1270 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[1K, 2K) 1028 |@@@@@@@@@@@@@@@@@@@@@@ | +[2K, 4K) 874 |@@@@@@@@@@@@@@@@@@ | +[4K, 8K) 1024 |@@@@@@@@@@@@@@@@@@@@@@ | +[8K, 16K) 58 |@ | +[16K, 32K) 5 | | + +@process_edges_root_packet_size: +[1] 71 |@@@@@@@ | +[2, 4) 4 | | +[4, 8) 276 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[8, 16) 495 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| +[16, 32) 477 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[32, 64) 344 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[64, 128) 242 |@@@@@@@@@@@@@@@@@@@@@@@@@ | +[128, 256) 109 |@@@@@@@@@@@ | +[256, 512) 31 |@@@ | +[512, 1K) 33 |@@@ | +[1K, 2K) 75 |@@@@@@@ | +[2K, 4K) 75 |@@@@@@@ | +[4K, 8K) 336 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[8K, 16K) 56 |@@@@@ | +[16K, 32K) 3 | | +``` + +In the above output, we can see that overall, the sizes of the `process_edges` +packets have a unimodal distribution with a peak around 16\~32 edges per packet. +However, if we focus on root edges, the distribution is roughly bimodal, with a +first peak around 8\~16 and a second peak around 4096\~8192. + +## Attribution +If used for research, please cite the following publication (the `BibTeX` record +will be updated once a DOI is assigned). +```bibtex +@inproceedings{conf/mplr/Huang23, + author = {Claire Huang and + Stephen M.
Blackburn and + Zixian Cai}, + title = {Improving Garbage Collection Observability with Performance Tracing}, + booktitle = {Proceedings of the 20th International Conference on Managed Programming + Languages and Runtimes, {MPLR} 2023, Cascais, Portugal, October + 22, 2023}, + publisher = {{ACM}}, + year = {2023} +} +``` diff --git a/tools/tracing/alloc_slow.bt b/tools/tracing/alloc_slow.bt new file mode 100644 index 0000000000..202eea1e6b --- /dev/null +++ b/tools/tracing/alloc_slow.bt @@ -0,0 +1,11 @@ +usdt:$MMTK:mmtk:alloc_slow_once_start { + if (@stats_enabled) { + @alloc_slow_nsecs[tid] = nsecs; + } +} + +usdt:$MMTK:mmtk:alloc_slow_once_end { + if (@stats_enabled) { + @alloc_slow_hist = hist((nsecs - @alloc_slow_nsecs[tid])/400); + } +} diff --git a/tools/tracing/epilogue.bt.fragment b/tools/tracing/epilogue.bt.fragment new file mode 100644 index 0000000000..70b82a4eb7 --- /dev/null +++ b/tools/tracing/epilogue.bt.fragment @@ -0,0 +1,7 @@ +END { + system("rm $TMP_FILE"); +} + +interval:s:1200 { + exit(); +} diff --git a/tools/tracing/gc_stages.bt b/tools/tracing/gc_stages.bt new file mode 100644 index 0000000000..1dab57d99e --- /dev/null +++ b/tools/tracing/gc_stages.bt @@ -0,0 +1,28 @@ +usdt:$MMTK:mmtk:gc_start { + if (@stats_enabled) { + @gc_start_nsecs = nsecs; + } +} + +usdt:$MMTK:mmtk:gc_end { + if (@stats_enabled) { + @post_closure_time += nsecs - @post_closure_nsecs; + } +} + +usdt:$MMTK:mmtk:bucket_opened { + if (@stats_enabled) { + $ns = nsecs; + // Please check enum WorkBucketStage for the numerical values of stages + // Closure is 2 when vo_bit is not set + if (arg0 == 2) { + @closure_nsecs = $ns; + @pre_closure_time += $ns - @gc_start_nsecs; + } + // Release is 14 when vo_bit is not set + if (arg0 == 14) { + @post_closure_nsecs = $ns; + @closure_time += $ns - @closure_nsecs; + } + } +} diff --git a/tools/tracing/lock_contended.bt b/tools/tracing/lock_contended.bt new file mode 100644 index 0000000000..4afb7984a2 --- /dev/null +++ 
b/tools/tracing/lock_contended.bt @@ -0,0 +1,11 @@ +uprobe:$MMTK:_ZN3std3sys4unix5locks11futex_mutex5Mutex14lock_contended* { + if (@stats_enabled) { + @lock_nsecs[tid] = (arg0, nsecs); + } +} + +uretprobe:$MMTK:_ZN3std3sys4unix5locks11futex_mutex5Mutex14lock_contended* { + if (@stats_enabled) { + @lock_dist[@lock_nsecs[tid].0] = hist((nsecs - @lock_nsecs[tid].1)/256); + } +} diff --git a/tools/tracing/packet_size.bt b/tools/tracing/packet_size.bt new file mode 100644 index 0000000000..c50cd98af7 --- /dev/null +++ b/tools/tracing/packet_size.bt @@ -0,0 +1,8 @@ +usdt:$MMTK:mmtk:process_edges { + if (@stats_enabled) { + @process_edges_packet_size = hist(arg0); + if (arg1) { + @process_edges_root_packet_size = hist(arg0); + } + } +} diff --git a/tools/tracing/prologue_with_harness.bt.fragment b/tools/tracing/prologue_with_harness.bt.fragment new file mode 100644 index 0000000000..81f9156a39 --- /dev/null +++ b/tools/tracing/prologue_with_harness.bt.fragment @@ -0,0 +1,10 @@ +usdt:$MMTK:mmtk:harness_begin { + //begin collecting data at harness_begin (start of final iteration) + @stats_enabled = 1; +} + +usdt:$MMTK:mmtk:harness_end { + //end data at harness_end (end of final iteration) + @stats_enabled = 0; + exit(); +} diff --git a/tools/tracing/prologue_without_harness.bt.fragment b/tools/tracing/prologue_without_harness.bt.fragment new file mode 100644 index 0000000000..3ab4b60f4c --- /dev/null +++ b/tools/tracing/prologue_without_harness.bt.fragment @@ -0,0 +1,4 @@ +BEGIN { + //always collect data + @stats_enabled = 1; +} diff --git a/tools/tracing/run.py b/tools/tracing/run.py new file mode 100755 index 0000000000..c30224e9e2 --- /dev/null +++ b/tools/tracing/run.py @@ -0,0 +1,63 @@ +#!/usr/bin/env python3 +from string import Template +from argparse import ArgumentParser +from pathlib import Path +import tempfile +import sys +import os + + +def get_args(): + parser = ArgumentParser() + parser.add_argument("-b", "--bpftrace", type=str, default="bpftrace", + 
help="Path of the bpftrace executable") + parser.add_argument("-m", "--mmtk", type=str, required=True, + help="Path of the MMTk binary") + parser.add_argument("-H", "--harness", action="store_true", + help="Only collect data for the timing iteration (harness_begin/harness_end)") + parser.add_argument("-p", "--print-script", action="store_true", + help="Print the content of the bpftrace script") + parser.add_argument( + "-f", "--format", choices=["text", "json"], default="text", help="bpftrace output format") + parser.add_argument("tool", type=str, help="Name of the bpftrace tool") + return parser.parse_args() + + +def main(): + args = get_args() + here = Path(__file__).parent.resolve() + bpftrace_script = here / f"{args.tool}.bt" + if not bpftrace_script.exists(): + print(f"Tracing script {str(bpftrace_script)} not found.") + sys.exit(1) + mmtk_bin = Path(args.mmtk) + if not mmtk_bin.exists(): + print(f"MMTk binary {str(mmtk_bin)} not found.") + sys.exit(1) + prologue_file = here / \ + ("prologue_with_harness.bt.fragment" if args.harness else "prologue_without_harness.bt.fragment") + prologue = prologue_file.read_text() + epilogue = (here / "epilogue.bt.fragment").read_text() + template = Template(prologue + bpftrace_script.read_text() + epilogue) + with tempfile.NamedTemporaryFile(mode="w+t") as tmp: + content = template.safe_substitute( + MMTK=mmtk_bin, TMP_FILE=tmp.name) + if args.print_script: + print(content) + tmp.write(content) + tmp.flush() + # We use execvp to replace the current process instead of creating + # a subprocess (or sh -c). This is so that when users invoke this from + # the command line, Ctrl-C will be captured by bpftrace instead of the + # outer Python script. The temporary file can then be cleaned up by + # the END probe in bpftrace. + # + # In theory, you can implement this via pty, but it is very finicky + # and doesn't work reliably. 
+ # See also https://github.com/anupli/running-ng/commit/b74e3a13f56dd97f73432d8a391e1d6cd9db8663 + os.execvp("sudo", ["sudo", args.bpftrace, + "--unsafe", "-f", args.format, tmp.name]) + + +if __name__ == "__main__": + main()