Performance Analysis
There are many tools to help assess the performance of Python code. This page gives a few pointers and some general guidelines.
Flame graphs can help visualize quickly where the most time-consuming parts of the code are located, so we can focus our efforts on functions or methods that will benefit the most from optimizations.
$ pip install git+https://github.com/evanhempel/python-flamegraph.git
$ git clone https://github.com/brendangregg/FlameGraph.git
Create a symbolic link to `FlameGraph/flamegraph.pl` from a location on your PATH, e.g., `$VIRTUAL_ENV/bin`.
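For instance, assuming FlameGraph was cloned into the current directory and that `$VIRTUAL_ENV/bin` is on your PATH (both are assumptions; adjust the paths to your setup):

```shell
# Link flamegraph.pl into a directory that is already on the PATH.
ln -s "$PWD/FlameGraph/flamegraph.pl" "$VIRTUAL_ENV/bin/flamegraph.pl"
```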
In order to produce a flame graph of TRON running on the TORSION problem, run
$ python -m flamegraph -o torsion3.log nlp_tron.py torsion.par3.nl
$ flamegraph.pl --title "tron/torsion3" torsion3.log > torsion3.svg
$ open -a Safari torsion3.svg
and voilà:
(right-click on the graph, choose "open in new tab" and start exploring by moving the mouse around).
On the flame graph, the longer a horizontal block, the more time-consuming the corresponding task in the code. In the example above, we see that, as expected, all the time is taken by the `solve` method. The latter essentially breaks down into

- `cauchy`, and
- `projected_newton_step`.
In turn, most of the time spent in `cauchy` is due to `projected_step`, so it is a good idea to try to reduce the time needed by that function. That will be difficult, however, because that function is just one Numpy call.

Similarly, most of the time spent in `projected_newton_step` is due to `truncated_cg` (and a little to `project`, which is again just one Numpy call and seems difficult to improve).

Going one level up in the graph, the time spent in `truncated_cg` is mostly due to operator-vector products, `trqsol`, and other operations inside the `truncated_cg()` function.
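A quick way to confirm such hotspots without generating a full flame graph is the standard library's `cProfile`. The snippet below is a generic sketch, not TRON's actual code: `matvec_heavy` is a made-up stand-in for the operator-vector products that dominate `truncated_cg`.

```python
import cProfile
import io
import pstats

import numpy as np

def matvec_heavy(A, x, iters=50):
    # Stand-in for repeated operator-vector products.
    for _ in range(iters):
        x = A.dot(x)
    return x

A = np.random.rand(200, 200)
x = np.random.rand(200)

pr = cProfile.Profile()
pr.enable()
matvec_heavy(A, x)
pr.disable()

# Print the five most time-consuming entries, sorted by cumulative time.
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(5)
print(s.getvalue())
```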
The Benchmark module lets us compare variants of a function to determine the most efficient implementation in terms of running time. To illustrate, let's take the `trqsol` function from the previous section and isolate it in a file:
```python
import numpy as np
import math

def trqsol(x, p, delta, sqrt):
    ptx = np.dot(p, x)
    ptp = np.dot(p, p)
    xtx = np.dot(x, x)
    dsq = delta**2
    rad = ptx**2 + ptp * (dsq - xtx)
    rad = sqrt(max(rad, 0))
    if ptx > 0:
        sigma = (dsq - xtx) / (ptx + rad)
    elif rad > 0:
        sigma = (rad - ptx) / ptp
    else:
        sigma = 0
    return sigma
```
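As a sanity check (an illustration added here, not from the original page): `trqsol` returns the nonnegative `sigma` such that `||x + sigma * p|| = delta`, i.e., the step that lands on the trust-region boundary. The function is repeated below so the snippet runs on its own.

```python
import numpy as np

def trqsol(x, p, delta, sqrt):
    # Same function as above, repeated so this snippet is self-contained.
    ptx = np.dot(p, x)
    ptp = np.dot(p, p)
    xtx = np.dot(x, x)
    dsq = delta**2
    rad = ptx**2 + ptp * (dsq - xtx)
    rad = sqrt(max(rad, 0))
    if ptx > 0:
        sigma = (dsq - xtx) / (ptx + rad)
    elif rad > 0:
        sigma = (rad - ptx) / ptp
    else:
        sigma = 0
    return sigma

x = np.array([0.1, 0.0, 0.0])
p = np.ones(3)
delta = 2.0
sigma = trqsol(x, p, delta, np.sqrt)
print(round(float(np.linalg.norm(x + sigma * p)), 10))  # → 2.0, i.e., delta
```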
We might wonder whether the square root function from Numpy is really more efficient than the one from the `math` module. That is why `sqrt` is now an input argument of `trqsol()`. In the same file, we add a benchmark:
```python
import benchmark

class Benchmark_trqsol(benchmark.Benchmark):

    each = 10

    def setUp(self):
        n = 10000
        self.x = np.zeros(n); self.x[0] = 0.1
        self.p = np.ones(n)
        self.delta = 1.0

    def test_np(self):
        trqsol(self.x, self.p, self.delta, np.sqrt)

    def test_math(self):
        trqsol(self.x, self.p, self.delta, math.sqrt)

if __name__ == "__main__":
    benchmark.main(format="markdown", numberFormat="%.3f")
```
Running the file produces:
| name | rank | runs | mean | stdev | baseline |
|---|---|---|---|---|---|
| math | 1 | 10 | 1.855e-05 | 7.179e-07 | 1 |
| np | 2 | 10 | 2.398e-05 | 1.195e-05 | 1.293 |
So it turns out that we had better use `math.sqrt` than `np.sqrt` for scalars.
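The same conclusion can be cross-checked with the standard library's `timeit`, without the Benchmark module (a quick sketch; absolute numbers vary by machine):

```python
import math
import timeit

import numpy as np

# Time each square root on a plain Python float, 100000 calls apiece.
t_math = timeit.timeit(lambda: math.sqrt(2.0), number=100000)
t_np = timeit.timeit(lambda: np.sqrt(2.0), number=100000)
print(f"math.sqrt: {t_math:.4f} s, np.sqrt: {t_np:.4f} s")
```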
To be written. See https://github.com/ionelmc/pytest-benchmark
To be written. See https://www.huyng.com/posts/python-performance-analysis
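In the meantime, a tiny timing context manager in the spirit of that post can be sketched as follows (the post's own `Timer` class may differ; this is only an illustration):

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label):
    # Measure wall-clock time around the enclosed block.
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.6f} s")

with timer("sum of squares"):
    total = sum(i * i for i in range(100000))
```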