Performance and Benchmarks #104

KDr2 · 2022-01-07T05:47:15Z

No description provided.

KDr2 · 2022-01-07T08:16:39Z

Interestingly, as of the last commit, calling step_in on a tape is faster than running the model directly. Maybe I missed something?

$ julia --project perf/p0.jl
  2.286 μs (38 allocations: 1.95 KiB)
  97.440 ns (1 allocation: 48 bytes)
  440.273 μs (48 allocations: 2.56 KiB)
  969.364 ns (4 allocations: 288 bytes)

yebai · 2022-01-07T11:47:53Z

It is possible that step_in on the tape executes faster than the original function f(args...), since the tape specialises more (e.g. removing control flow, caching input/output arguments). Note that the total runtime of a Turing inference algorithm also depends on the CTask constructor (see 1, 2):

julia> @btime t = Libtask.CTask(f, args...);
  258.923 ms (766345 allocations: 43.38 MiB)

julia> @btime Libtask.step_in(t.tf.tape, args)
  95.054 ns (1 allocation: 48 bytes)

julia> @btime f(args...)
  2.549 μs (38 allocations: 1.95 KiB)
(2.0, VarInfo (2 variables (μ, σ), dimension 2; logp: -1.2750123006e7))

So it appears that a lot of time is spent repeatedly constructing CTask. Maybe we can speed this up by reusing tapes?
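The idea of reusing tapes can be illustrated with a small memoisation sketch. This is not Libtask's implementation: `FakeTape`, `build_tape`, `cached_tape`, `TAPE_CACHE` and `BUILD_COUNT` are hypothetical names, and the cache key (the function plus the concrete argument types) is an assumption made for illustration only.

```julia
# Hypothetical stand-in for an expensive, per-function construction step
# (in this thread, that role is played by building the tape inside CTask).
const BUILD_COUNT = Ref(0)

struct FakeTape
    f::Function
    argtypes::Tuple
end

function build_tape(f, args...)
    BUILD_COUNT[] += 1                 # count how often the expensive path runs
    return FakeTape(f, map(typeof, args))
end

# Cache keyed by the function and the concrete types of its arguments.
const TAPE_CACHE = Dict{Any,FakeTape}()

function cached_tape(f, args...)
    key = (f, map(typeof, args))
    # get! only runs the do-block when the key is missing.
    return get!(TAPE_CACHE, key) do
        build_tape(f, args...)
    end
end

g(x, y) = x + y

t1 = cached_tape(g, 1, 2.0)   # first call: builds and stores the tape
t2 = cached_tape(g, 3, 4.0)   # same function and argument types: cache hit
t3 = cached_tape(g, 1, 2)     # different argument types: builds again
```

With this sketch the expensive builder runs only twice for the three calls. A real cache would also have to deal with any mutable state stored on a tape and with unbounded growth of the dictionary, neither of which this sketch attempts.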

KDr2 · 2022-01-09T07:32:32Z

Without Cache:

$ julia --project perf/p0.jl
"Directly call..." = "Directly call..."
  2.233 μs (38 allocations: 1.95 KiB)
"CTask construction..." = "CTask construction..."
  410.719 ms (974878 allocations: 59.77 MiB)
"Step in a tape..." = "Step in a tape..."
  90.543 ns (1 allocation: 48 bytes)
"Directly call..." = "Directly call..."
  422.273 μs (48 allocations: 2.56 KiB)
"CTask construction..." = "CTask construction..."
  416.812 ms (974908 allocations: 59.77 MiB)
"Step in a tape..." = "Step in a tape..."
  923.559 ns (4 allocations: 288 bytes)

With IR and Tape Cache:

$ julia --project perf/p0.jl
"Directly call..." = "Directly call..."
  2.117 μs (38 allocations: 1.95 KiB)
"CTask construction..." = "CTask construction..."
  99.222 μs (489 allocations: 22.02 KiB)
"Step in a tape..." = "Step in a tape..."
  87.400 ns (1 allocation: 48 bytes)
"Directly call..." = "Directly call..."
  417.133 μs (48 allocations: 2.56 KiB)
"CTask construction..." = "CTask construction..."
  103.745 μs (495 allocations: 22.48 KiB)
"Step in a tape..." = "Step in a tape..."
  924.314 ns (4 allocations: 288 bytes)
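For a rough sense of scale, the CTask construction timings from the two runs above can be compared directly. The snippet below only restates the reported numbers in seconds; the variable names are illustrative.

```julia
# Timings copied from the two benchmark runs above, converted to seconds.
no_cache   = (410.719e-3, 416.812e-3)   # CTask construction, without cache
with_cache = (99.222e-6, 103.745e-6)    # CTask construction, with IR and tape cache

# Element-wise ratio: how many times faster construction is with the cache.
speedups = no_cache ./ with_cache
println(round.(speedups; digits=1))
```

Both ratios come out at roughly 4000, while the "Directly call" and "Step in a tape" timings are essentially unchanged between the two runs, which is consistent with the cache affecting construction only.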

KDr2 · 2022-01-11T07:51:20Z

In spite of numeric test failures and a few errors, unit tests finished in about 2 hours on my machine:

real    130m46.448s
user    126m4.793s
sys     5m25.685s

yebai

I can confirm that the tests now run correctly - we can rerun the Turing CI once this PR is merged. Fingers crossed!

KDr2 · 2022-01-19T00:47:18Z

This PR is ready to merge. @yebai

KDr2 added 2 commits January 7, 2022 03:37

temporarily add some pkgs to do testing

0efe368

simple benchmarks

463c187

use ir and tape cache

7d9eeb0

KDr2 added 3 commits January 10, 2022 19:23

use LRUCache instead of Dict

b78d3f3

partially copy tape

dd211e9

fix a TArray bug

2e3491d

KDr2 force-pushed the perf branch from af672b5 to 2e3491d January 11, 2022 04:26

yebai requested changes Jan 11, 2022

src/tapedtask.jl (outdated, resolved)

src/tapedtask.jl (resolved)

yebai reviewed Jan 11, 2022

Project.toml (outdated, resolved)

yebai reviewed Jan 11, 2022

src/tapedtask.jl (outdated, resolved)

KDr2 added 2 commits January 12, 2022 00:24

add Project.toml for perf dir

400571c

minor update

3a25240

KDr2 marked this pull request as ready for review January 12, 2022 00:41

devmotion reviewed Jan 12, 2022

perf/p0.jl (resolved)

src/tapedtask.jl (outdated, resolved)

Update src/tapedtask.jl

f24480e

Co-authored-by: David Widmann <devmotion@users.noreply.github.com>

yebai changed the title ~~[WIP] Performance and Benchmarks~~ Performance and Benchmarks Jan 12, 2022

yebai reviewed Jan 12, 2022

perf/src/LibtaskPerf.jl (outdated, resolved)

yebai approved these changes Jan 12, 2022

KDr2 and others added 4 commits January 12, 2022 23:59

remove redundant module

9b548c6

Catch and print error while re-running a (cached) tape.

c6ec201

put new onto tape

e4838e9

copy NewInstruction

e1ae835