diff --git a/tools/tracing/README.md b/tools/tracing/README.md new file mode 100644 index 0000000000..ab368fd01d --- /dev/null +++ b/tools/tracing/README.md @@ -0,0 +1,241 @@ +# MMTk performance tracing + +## Notes for MMTk developers +Please open pull requests if you develop new tools that others might find useful. +When you add new tools, please update this documentation. +If you change MMTk internals that the tracing tools depend on (such as the +definition of `enum WorkBucketStage`), please update the scripts accordingly. + +## Notes for MMTk users +Since some of the tools depend on the MMTk internals, please use the tools +shipped with the MMTk release you use. + +## Tracepoints +Currently, the core provides the following tracepoints. +- `mmtk:collection_initialized()`: GC is enabled +- `mmtk:harness_begin()`: the timing iteration of a benchmark begins +- `mmtk:harness_end()`: the timing iteration of a benchmark ends +- `mmtk:gccontroller_run()`: the GC controller thread enters its work loop +- `mmtk:gcworker_run()`: a GC worker thread enters its work loop +- `mmtk:gc_start()`: a collection epoch starts +- `mmtk:gc_end()`: a collection epoch ends +- `mmtk:process_edges(num_edges: int, is_roots: bool)`: an invocation of the +`process_edges` method. The first argument is the number of edges to be processed, +and the second argument is whether these edges are root edges. +- `mmtk:bucket_opened(id: int)`: a work bucket is opened. The first argument is the +numerical representation of `enum WorkBucketStage`. +- `mmtk:work_poll()`: a work packet is to be polled. +- `mmtk:work(type_name: char *, type_name_len: int)`: a work packet was just +executed. The first argument points to the string of the Rust type name of +the work packet, and the second argument is the length of the string. +- `mmtk:alloc_slow_once_start()`: the allocation slow path starts. +- `mmtk:alloc_slow_once_end()`: the allocation slow path ends.
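+ +For example, a minimal bpftrace script in the same style as the tools below (a sketch: it assumes the `$MMTK` placeholder is substituted with the path of the MMTk binary, as `run.py` does for the shipped tools, and the map names are illustrative) can combine the `gc_start` and `gc_end` tracepoints to record a histogram of GC pause times in microseconds: +``` +usdt:$MMTK:mmtk:gc_start { + @gc_start_ns = nsecs; +} + +usdt:$MMTK:mmtk:gc_end { + @gc_pause_usecs = hist((nsecs - @gc_start_ns) / 1000); +} +```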
+ +## Running tracing tools +The tracing tools are to be invoked by a wrapper script `run.py`. +``` +usage: run.py [-h] [-b BPFTRACE] -m MMTK [-H] [-p] [-f {text,json}] tool + +positional arguments: + tool Name of the bpftrace tool + +optional arguments: + -h, --help show this help message and exit + -b BPFTRACE, --bpftrace BPFTRACE + Path of the bpftrace executable + -m MMTK, --mmtk MMTK Path of the MMTk binary + -H, --harness Only collect data for the timing iteration (harness_begin/harness_end) + -p, --print-script Print the content of the bpftrace script + -f {text,json}, --format {text,json} + bpftrace output format +``` + +- `-b`: the path to the `bpftrace` executable. By default, it uses the `bpftrace` +executable in your `PATH`. We strongly recommend you use the latest statically +compiled `bpftrace` from [upstream](https://github.com/iovisor/bpftrace/releases). +You need sudo permission to run whichever `bpftrace` you choose. +- `-m`: the path to an MMTk binary that contains the tracepoints. +This depends on the binding you use. +For the OpenJDK binding, it should be `jdk/lib/server/libmmtk_openjdk.so` under +your build folder. +To check whether the binary contains tracepoints, you can use `readelf -n`. +You should see a bunch of `stapsdt` notes with `mmtk` as the provider. +- `-H`: pass this flag if you want to measure only the timing iteration of a +benchmark. +By default, the tracing tools will measure the entire execution. +- `-p`: print the entire tracing script before execution. +This is mainly for debugging use. +- `-f`: change the bpftrace output format. +By default, it uses human-readable plain text output (`text`). +You can set this to `json` for easy parsing. + +Please run the tracing tools **before** running the workload. +If you use `-H`, the tracing tools will automatically end when `harness_end` is +called. +Otherwise, you will need to terminate the tools manually with `Ctrl-C`.
+These tools also have a timeout of 1200 seconds so as not to stall unattended +benchmark execution. + +## Tracing tools +### Measuring the time spent in the allocation slow path (`alloc_slow`) +This tool measures the distribution of the allocation slow path time. +The time unit is 400ns, so that the power-of-two histogram bins resolve the +typical slow-path durations with higher fidelity. + +Sample output: +``` +@alloc_slow_hist: +[4, 8) 304 |@ | +[8, 16) 12603 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| +[16, 32) 8040 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[32, 64) 941 |@@@ | +[64, 128) 171 | | +[128, 256) 13 | | +[256, 512) 2 | | +[512, 1K) 0 | | +[1K, 2K) 0 | | +[2K, 4K) 0 | | +[4K, 8K) 0 | | +[8K, 16K) 0 | | +[16K, 32K) 14 | | +[32K, 64K) 37 | | +[64K, 128K) 19 | | +[128K, 256K) 1 | | +``` + +In the above output, we can see that most allocation slow paths finish between +3.2us and 6.4us. +However, there is a long tail, presumably due to GC pauses. + +### Measuring the time spent in different GC stages (`gc_stages`) +This tool measures the time spent in different stages of GC: before `Closure`, +during `Closure`, and after `Closure`. +The time unit is ns. + +Sample output: +``` +@closure_time: 1405302743 +@post_closure_time: 81432919 +@pre_closure_time: 103886118 +``` + +In the above output, overall, the execution spends 1.4s in the main transitive +closure, 103ms before that, and 81ms after that (a total of around 1.5s). + +### Measuring the time spent in the lock-contended state for Rust `Mutex` (`lock_contended`) +This tool measures the time spent in the lock-contended state for Rust `Mutex`es. +The Rust standard library implements `Mutex` using the fast-slow-path paradigm. +Most lock operations take place in inlined fast paths, when there's no contention. +However, when there's contention, +`std::sys::unix::locks::futex_mutex::Mutex::lock_contended` is called.
+ +```rust +#[inline] +pub fn lock(&self) { + if self.futex.compare_exchange(0, 1, Acquire, Relaxed).is_err() { + self.lock_contended(); + } +} + +#[cold] +fn lock_contended(&self) { + // Body elided: the slow path waits on the futex until the lock is released. +} +``` + +MMTk uses Rust `Mutex`es for synchronization, e.g., in allocation slow paths, +and this tool can be useful for measuring the contention in these parts of the code. + +The time unit is 256ns. + +Sample output: +``` +@lock_dist[140637228007056]: +[1] 447 |@@@@ | +[2, 4) 3836 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[4, 8) 3505 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[8, 16) 1354 |@@@@@@@@@@@@@@ | +[16, 32) 832 |@@@@@@@@ | +[32, 64) 1077 |@@@@@@@@@@@ | +[64, 128) 2991 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[128, 256) 4846 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[256, 512) 5013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| +[512, 1K) 1203 |@@@@@@@@@@@@ | +[1K, 2K) 34 | | +[2K, 4K) 15 | | +``` + +In the above output, we can see that the lock instance (140637228007056, or 0x7fe8a8047e90) +roughly has a bimodal distribution in terms of the time spent in the lock-contended +code path. +The first peak is around 512ns\~1024ns, and the second peak is around 66us\~131us. + +If you can't tell which lock instance corresponds to which lock in MMTk, you can trace +the allocation of the `Mutex` and record the stack trace (note that you might want +to compile MMTk with `force-frame-pointers` to obtain better stack traces). + +### Measuring the distribution of `process_edges` packet sizes (`packet_size`) +Most of the GC time is spent in the transitive closure for tracing-based GCs, +and MMTk performs the transitive closure via work packets that call the `process_edges` method. +This tool measures the distribution of the sizes of these work packets, and also +counts root edges separately.
+ +Sample output: +``` +@process_edges_packet_size: +[1] 238 |@@@@@ | +[2, 4) 806 |@@@@@@@@@@@@@@@@@ | +[4, 8) 1453 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[8, 16) 1105 |@@@@@@@@@@@@@@@@@@@@@@@ | +[16, 32) 2410 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| +[32, 64) 1317 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[64, 128) 1252 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[128, 256) 1131 |@@@@@@@@@@@@@@@@@@@@@@@@ | +[256, 512) 2017 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[512, 1K) 1270 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[1K, 2K) 1028 |@@@@@@@@@@@@@@@@@@@@@@ | +[2K, 4K) 874 |@@@@@@@@@@@@@@@@@@ | +[4K, 8K) 1024 |@@@@@@@@@@@@@@@@@@@@@@ | +[8K, 16K) 58 |@ | +[16K, 32K) 5 | | + +@process_edges_root_packet_size: +[1] 71 |@@@@@@@ | +[2, 4) 4 | | +[4, 8) 276 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[8, 16) 495 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| +[16, 32) 477 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[32, 64) 344 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[64, 128) 242 |@@@@@@@@@@@@@@@@@@@@@@@@@ | +[128, 256) 109 |@@@@@@@@@@@ | +[256, 512) 31 |@@@ | +[512, 1K) 33 |@@@ | +[1K, 2K) 75 |@@@@@@@ | +[2K, 4K) 75 |@@@@@@@ | +[4K, 8K) 336 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | +[8K, 16K) 56 |@@@@@ | +[16K, 32K) 3 | | +``` + +In the above output, we can see that overall, the sizes of the `process_edges` +packets have a unimodal distribution with a peak around 16\~32 edges per packet. +However, if we focus on root edges, the distribution is roughly bimodal, with a +first peak around 8\~16 and a second peak around 4096\~8192. + +## Attribution +If used for research, please cite the following publication (the `BibTeX` record +will be updated once a DOI is assigned). +```bibtex +@inproceedings{conf/mplr/Huang23, + author = {Claire Huang and + Stephen M.
Blackburn and + Zixian Cai}, + title = {Improving Garbage Collection Observability with Performance Tracing}, + booktitle = {Proceedings of the 20th International Conference on Managed Programming + Languages and Runtimes, {MPLR} 2023, Cascais, Portugal, October + 22, 2023}, + publisher = {{ACM}}, + year = {2023} +} +``` diff --git a/tools/tracing/alloc_slow.bt b/tools/tracing/alloc_slow.bt new file mode 100644 index 0000000000..202eea1e6b --- /dev/null +++ b/tools/tracing/alloc_slow.bt @@ -0,0 +1,11 @@ +usdt:$MMTK:mmtk:alloc_slow_once_start { + if (@stats_enabled) { + @alloc_slow_nsecs[tid] = nsecs; + } +} + +usdt:$MMTK:mmtk:alloc_slow_once_end { + if (@stats_enabled) { + @alloc_slow_hist = hist((nsecs - @alloc_slow_nsecs[tid])/400); + } +} diff --git a/tools/tracing/epilogue.bt.fragment b/tools/tracing/epilogue.bt.fragment new file mode 100644 index 0000000000..70b82a4eb7 --- /dev/null +++ b/tools/tracing/epilogue.bt.fragment @@ -0,0 +1,7 @@ +END { + system("rm $TMP_FILE"); +} + +interval:s:1200 { + exit(); +} diff --git a/tools/tracing/gc_stages.bt b/tools/tracing/gc_stages.bt new file mode 100644 index 0000000000..1dab57d99e --- /dev/null +++ b/tools/tracing/gc_stages.bt @@ -0,0 +1,28 @@ +usdt:$MMTK:mmtk:gc_start { + if (@stats_enabled) { + @gc_start_nsecs = nsecs; + } +} + +usdt:$MMTK:mmtk:gc_end { + if (@stats_enabled) { + @post_closure_time += nsecs - @post_closure_nsecs; + } +} + +usdt:$MMTK:mmtk:bucket_opened { + if (@stats_enabled) { + $ns = nsecs; + // Please check enum WorkBucketStage for the numerical values of stages + // Closure is 2 when vo_bit is not set + if (arg0 == 2) { + @closure_nsecs = $ns; + @pre_closure_time += $ns - @gc_start_nsecs; + } + // Release is 14 when vo_bit is not set + if (arg0 == 14) { + @post_closure_nsecs = $ns; + @closure_time += $ns - @closure_nsecs; + } + } +} diff --git a/tools/tracing/lock_contended.bt b/tools/tracing/lock_contended.bt new file mode 100644 index 0000000000..4afb7984a2 --- /dev/null +++ 
b/tools/tracing/lock_contended.bt @@ -0,0 +1,11 @@ +uprobe:$MMTK:_ZN3std3sys4unix5locks11futex_mutex5Mutex14lock_contended* { + if (@stats_enabled) { + @lock_nsecs[tid] = (arg0, nsecs); + } +} + +uretprobe:$MMTK:_ZN3std3sys4unix5locks11futex_mutex5Mutex14lock_contended* { + if (@stats_enabled) { + @lock_dist[@lock_nsecs[tid].0] = hist((nsecs - @lock_nsecs[tid].1)/256); + } +} diff --git a/tools/tracing/packet_size.bt b/tools/tracing/packet_size.bt new file mode 100644 index 0000000000..c50cd98af7 --- /dev/null +++ b/tools/tracing/packet_size.bt @@ -0,0 +1,8 @@ +usdt:$MMTK:mmtk:process_edges { + if (@stats_enabled) { + @process_edges_packet_size = hist(arg0); + if (arg1) { + @process_edges_root_packet_size = hist(arg0); + } + } +} diff --git a/tools/tracing/prologue_with_harness.bt.fragment b/tools/tracing/prologue_with_harness.bt.fragment new file mode 100644 index 0000000000..81f9156a39 --- /dev/null +++ b/tools/tracing/prologue_with_harness.bt.fragment @@ -0,0 +1,10 @@ +usdt:$MMTK:mmtk:harness_begin { + //begin collecting data at harness_begin (start of final iteration) + @stats_enabled = 1; +} + +usdt:$MMTK:mmtk:harness_end { + //end data at harness_end (end of final iteration) + @stats_enabled = 0; + exit(); +} diff --git a/tools/tracing/prologue_without_harness.bt.fragment b/tools/tracing/prologue_without_harness.bt.fragment new file mode 100644 index 0000000000..3ab4b60f4c --- /dev/null +++ b/tools/tracing/prologue_without_harness.bt.fragment @@ -0,0 +1,4 @@ +BEGIN { + //always collect data + @stats_enabled = 1; +} diff --git a/tools/tracing/run.py b/tools/tracing/run.py new file mode 100755 index 0000000000..c30224e9e2 --- /dev/null +++ b/tools/tracing/run.py @@ -0,0 +1,63 @@ +#!/usr/bin/env python3 +from string import Template +from argparse import ArgumentParser +from pathlib import Path +import tempfile +import sys +import os + + +def get_args(): + parser = ArgumentParser() + parser.add_argument("-b", "--bpftrace", type=str, default="bpftrace", + 
help="Path of the bpftrace executable") + parser.add_argument("-m", "--mmtk", type=str, required=True, + help="Path of the MMTk binary") + parser.add_argument("-H", "--harness", action="store_true", + help="Only collect data for the timing iteration (harness_begin/harness_end)") + parser.add_argument("-p", "--print-script", action="store_true", + help="Print the content of the bpftrace script") + parser.add_argument( + "-f", "--format", choices=["text", "json"], default="text", help="bpftrace output format") + parser.add_argument("tool", type=str, help="Name of the bpftrace tool") + return parser.parse_args() + + +def main(): + args = get_args() + here = Path(__file__).parent.resolve() + bpftrace_script = here / f"{args.tool}.bt" + if not bpftrace_script.exists(): + print(f"Tracing script {str(bpftrace_script)} not found.") + sys.exit(1) + mmtk_bin = Path(args.mmtk) + if not mmtk_bin.exists(): + print(f"MMTk binary {str(mmtk_bin)} not found.") + sys.exit(1) + prologue_file = here / \ + ("prologue_with_harness.bt.fragment" if args.harness else "prologue_without_harness.bt.fragment") + prologue = prologue_file.read_text() + epilogue = (here / "epilogue.bt.fragment").read_text() + template = Template(prologue + bpftrace_script.read_text() + epilogue) + with tempfile.NamedTemporaryFile(mode="w+t") as tmp: + content = template.safe_substitute( + MMTK=mmtk_bin, TMP_FILE=tmp.name) + if args.print_script: + print(content) + tmp.write(content) + tmp.flush() + # We use execvp to replace the current process instead of creating + # a subprocess (or sh -c). This is so that when users invoke this from + # the command line, Ctrl-C will be captured by bpftrace instead of the + # outer Python script. The temporary file can then be cleaned up by + # the END probe in bpftrace. + # + # In theory, you can implement this via pty, but it is very finicky + # and doesn't work reliably. 
+ # See also https://github.com/anupli/running-ng/commit/b74e3a13f56dd97f73432d8a391e1d6cd9db8663 + os.execvp("sudo", ["sudo", args.bpftrace, + "--unsafe", "-f", args.format, tmp.name]) + + +if __name__ == "__main__": + main()