Performance Analysis

Dominique edited this page Apr 11, 2016 · 1 revision

Introduction

There are many tools to help assess the performance of Python code. This page gives a few pointers and some general guidelines.

Flame Graphs

Flame graphs can help visualize quickly where the most time-consuming parts of the code are located, so we can focus our efforts on functions or methods that will benefit the most from optimizations.

Installation

$ pip install git+https://github.com/evanhempel/python-flamegraph.git
$ git clone https://github.com/brendangregg/FlameGraph.git

Create a symbolic link to FlameGraph/flamegraph.pl from a location on your PATH, e.g., $VIRTUAL_ENV/bin.
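For instance, assuming FlameGraph was cloned into the current directory and a virtualenv is active (so $VIRTUAL_ENV is set), the link can be created with:

```shell
# Adjust the source path to wherever FlameGraph was cloned.
ln -s "$PWD/FlameGraph/flamegraph.pl" "$VIRTUAL_ENV/bin/flamegraph.pl"
```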

Example usage

In order to produce a flame graph of TRON running on the TORSION problem, run

$ python -m flamegraph -o torsion3.log nlp_tron.py torsion.par3.nl
$ flamegraph.pl --title "tron/torsion3" torsion3.log > torsion3.svg
$ open -a Safari torsion3.svg

and voilà:

Torsion3 FlameGraph

(Right-click on the graph, choose "Open in new tab", and explore by hovering over the blocks.)

On a flame graph, the wider a horizontal block, the more time the corresponding function consumes. In the example above, we see that, as expected, essentially all the time is spent in the solve method. The latter breaks down into

  1. cauchy, and
  2. projected_newton_step.

In turn, most of the time spent in cauchy is due to projected_step, so that function is a natural target for optimization. Reducing its cost will be difficult, however, because it consists of a single Numpy call.
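To see why a single vectorized call leaves little room for improvement, here is a sketch (hypothetical code, not TRON's actual implementation) of what a projection onto simple bound constraints typically looks like: the whole operation is one np.clip call, so there is no Python-level loop left to optimize.

```python
import numpy as np

def projected_step(x, l, u):
    # Hypothetical sketch: project x onto the box [l, u] in a single
    # vectorized NumPy call.
    return np.clip(x, l, u)

x = np.array([-2.0, 0.5, 3.0])
print(projected_step(x, -1.0, 1.0))  # each entry clipped into [-1, 1]
```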

Similarly, most of the time spent in projected_newton_step is due to truncated_cg (and a little to project, which is again just one Numpy call and seems difficult to improve).

Going one level up in the graph, the time spent in truncated_cg is mostly due to operator-vector products, trqsol and other operations that are inside the truncated_cg() function.

Benchmarks

The Benchmark module lets us compare variants of a function to determine the most efficient implementation in terms of running time. To illustrate, let's take the trqsol function from the previous section and isolate it in a file:

import numpy as np
import math

def trqsol(x, p, delta, sqrt):
    ptx = np.dot(p, x)
    ptp = np.dot(p, p)
    xtx = np.dot(x, x)
    dsq = delta * delta

    rad = ptx * ptx + ptp * (dsq - xtx)
    rad = sqrt(max(rad, 0))

    if ptx > 0:
        sigma = (dsq - xtx) / (ptx + rad)
    elif rad > 0:
        sigma = (rad - ptx) / ptp
    else:
        sigma = 0
    return sigma
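As a quick sanity check on small hypothetical data: trqsol returns the nonnegative sigma for which x + sigma * p lies on the boundary of the trust region, i.e., ||x + sigma * p|| = delta. The snippet below is self-contained and includes a compact copy of the function:

```python
import math
import numpy as np

# Compact copy of trqsol for a self-contained check: it returns the
# nonnegative sigma such that ||x + sigma * p|| = delta.
def trqsol(x, p, delta, sqrt):
    ptx = np.dot(p, x)
    ptp = np.dot(p, p)
    xtx = np.dot(x, x)
    dsq = delta * delta
    rad = sqrt(max(ptx * ptx + ptp * (dsq - xtx), 0))
    if ptx > 0:
        return (dsq - xtx) / (ptx + rad)
    return (rad - ptx) / ptp if rad > 0 else 0

x = np.array([0.1, 0.0])
p = np.array([1.0, 0.0])
sigma = trqsol(x, p, 1.0, math.sqrt)
print(sigma, np.linalg.norm(x + sigma * p))  # sigma ≈ 0.9, norm ≈ delta = 1
```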

We might wonder if the square root function from Numpy is really more efficient than that from the math module. That's why sqrt is now an input argument of trqsol(). In the same file, we add a benchmark:

import benchmark

class Benchmark_trqsol(benchmark.Benchmark):

    each = 10

    def setUp(self):
        n = 10000
        self.x = np.zeros(n); self.x[0] = 0.1
        self.p = np.ones(n)
        self.delta = 1.0

    def test_np(self):
        trqsol(self.x, self.p, self.delta, np.sqrt)

    def test_math(self):
        trqsol(self.x, self.p, self.delta, math.sqrt)


if __name__ == "__main__":
    benchmark.main(format="markdown", numberFormat="%.3f")

Running the file produces:

| name | rank | runs | mean      | stdev     | baseline |
|------|------|------|-----------|-----------|----------|
| math | 1    | 10   | 1.855e-05 | 7.179e-07 | 1        |
| np   | 2    | 10   | 2.398e-05 | 1.195e-05 | 1.293    |

So it turns out that, for scalars, we are better off using math.sqrt than np.sqrt.
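The same comparison can be reproduced without the benchmark package using the standard timeit module (the numbers below are illustrative; actual timings depend on the machine):

```python
import math
import timeit

import numpy as np

# Time 100000 scalar square roots with each implementation.
t_math = timeit.timeit(lambda: math.sqrt(2.5), number=100000)
t_np = timeit.timeit(lambda: np.sqrt(2.5), number=100000)
print("math.sqrt: %.4f s, np.sqrt: %.4f s" % (t_math, t_np))
```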

py.test Benchmarks

To be written. See https://github.com/ionelmc/pytest-benchmark

Others

To be written. See https://www.huyng.com/posts/python-performance-analysis
