Oprofile
Oprofile is a sampling-based profiler for Linux.
To understand what that means, contrast its architecture with gprof. When gprof profiling is used, a flag (-pg) is given to the compiler, which tells it to instrument the code itself to record what it is doing. The instrumentation records the call graph, while sampling is used to determine where the code is at any given time. This information is stored in a binary file (gmon.out), which is then post-processed by the gprof tool into a human-readable format.
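For reference, a typical gprof session looks something like the following (myprog.c is a hypothetical source file):
gcc -pg -o myprog myprog.c
./myprog
gprof myprog gmon.out > profile.txt
Running the instrumented binary writes gmon.out to the current directory, which gprof then post-processes.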
By contrast, oprofile does not depend on any compiler instrumentation whatsoever. No special compiler directives are necessary to use it. Oprofile loads a special kernel module into the memory of your kernel, and samples the system-wide activity for particular events at a user-specified sampling rate. Examples of particular events might be CPU time, cache misses, or branch mispredictions.
The advantages of oprofile over gprof are the following:
- Oprofile can profile information other than CPU time.
- Oprofile can deliver information specific to each line of code or assembly instruction; the information from gprof is given at the subroutine level.
- Oprofile can tell you how much time your program is spending in libraries, even if those libraries are not compiled with profiling or debugging flags.
The main disadvantages are that its use is more complicated, that it requires root access, and that it is a little buggy.
Install the oprofile package with the usual
sudo apt-get install oprofile
It then needs initialising with
sudo opcontrol --init
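To verify that the kernel module actually loaded, something like
lsmod | grep oprofile
should list an oprofile entry.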
Suppose we wish to profile where the CPU spends its time in a subroutine.
First, find the event you wish to profile with:
ophelp
Ophelp lists all the events available on your CPU. On my CPU, the name for the relevant event is GLOBAL_POWER_EVENTS; on yours, it might instead be called CPU_CLK_UNHALTED.
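The list is long, so it can help to filter it; for example, to search for clock-related events:
ophelp | grep -i clk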
Now, we need to configure the oprofile daemon to tell it what we want to profile.
sudo opcontrol --no-vmlinux
sudo opcontrol --event=GLOBAL_POWER_EVENTS:1000000
The --no-vmlinux refers to oprofile's ability to profile the running Linux kernel, which is irrelevant to us. The 1000000 after GLOBAL_POWER_EVENTS is the sample period: a sample is taken every 1000000 occurrences of the event, so a smaller value means more frequent sampling. If the rate is set too low (a large numeric value), the sampling will be inaccurate; if it is set too high (a small numeric value), your machine will freeze as it spends all its time profiling itself. The oprofile documentation is plastered with warnings about this, for you can easily freeze your machine by setting the value incorrectly. Start with a large number, and work your way down if you find your results are insufficiently accurate.
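To get a feel for the numbers: on a hypothetical 2 GHz CPU, a GLOBAL_POWER_EVENTS count of 1000000 means one sample every 1000000 unhalted clock cycles, i.e. roughly 2 x 10^9 / 10^6 = 2000 samples per second per core. Halving the count doubles the sampling frequency, and the overhead with it.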
Confirm that oprofile has accepted your settings with
sudo opcontrol --status
Now, here comes the action part:
sudo opcontrol --start; sudo opcontrol --reset; sudo opcontrol --status; run_program;
sudo opcontrol --shutdown
where run_program is whatever program is to be profiled. The --reset argument causes the daemon to lose whatever information it had stored; this is so that previous results do not pollute your current profile. The --start and --shutdown are self-explanatory.
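If you want to test the workflow before unleashing it on a real code, a tiny test program with one obvious hot spot is enough. The following is a minimal sketch (the file name and loop bounds are arbitrary):
/* hotspot.c: a deliberately wasteful test program with one obvious hot loop */
#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    long i, j;

    for (i = 0; i < 500; i++)
        for (j = 1; j <= 1000000; j++)
            sum += 1.0 / (double) j;   /* this division should dominate the samples */

    printf("sum = %f\n", sum);   /* print the result so the loop is not optimised away */
    return 0;
}
Compile it with debugging symbols (gcc -g -O2 -o hotspot hotspot.c; keep the -g, for reasons explained below), substitute ./hotspot for run_program above, and most of the samples should land in the inner loop.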
If your program is not compiled with debugging symbols, then line information is not present in the binary and the most oprofile can give you is subroutine-level information. Try
opreport -l /path/to/binary
This produces output like the following (shamelessly copied from the oprofile manual):
Counted GLOBAL_POWER_EVENTS events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 23150
vma samples % image name symbol name
0804be10 14971 28.1993 oprofiled odb_insert
0804afdc 7144 13.4564 oprofiled pop_buffer_value
c01daea0 6113 11.5144 vmlinux __copy_to_user_ll
Ignoring the specific details, it lists the subroutines in order of how many samples are associated with each.
The real power of oprofile comes when it is able to associate the samples with lines of source code. To do this, the binary must be compiled with debugging symbols. Use
opannotate --source --output-dir=annotated /path/to/binary
This will generate a directory called 'annotated' containing all the source code that oprofile can identify. If you compiled your binary in /home/pfarrell/source/program, then the annotated source will be located in annotated/home/pfarrell/source/program. The source code will look something like
...
:static uint64_t pop_buffer_value(struct transient * trans)
11510 1.9661 :{ /* pop_buffer_value total: 89901 15.3566 */
: uint64_t val;
:
10227 1.7469 : if (!trans->remaining) {
: fprintf(stderr, "BUG: popping empty buffer !\n");
: exit(EXIT_FAILURE);
: }
:
: val = get_buffer_value(trans->buffer, 0);
2281 0.3896 : trans->remaining--;
2296 0.3922 : trans->buffer += kernel_pointer_size;
: return val;
10454 1.7857 :}
...
The information before the : on the left is the annotation. The first number is the number of samples associated with each line; the second is the percentage of the total samples associated with it. As you can imagine, this information is very useful for identifying lines that miss cache or mispredict branches.
If samples are associated with a function declaration (i.e., not with lines of source code inside the function definition, but with the line where the function interface is defined), then those samples occurred during the function's initialisation and finalisation (its prologue and epilogue). To see in more detail exactly where, try looking at the assembly output. To do this, use
opannotate --source --assembly /path/to/binary > assembly.out 2>&1
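Since assembly.out can be very large, it is easiest to search it for the subroutine of interest, e.g. (using the function from the example above):
grep -n 'pop_buffer_value' assembly.out
and then jump to the reported line number in your editor.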
This will take a while, as it has to disassemble the entire binary (and since Fluidity binaries are generally huge, this can take up to an hour on a slower machine). But it might be worth it: the information may be invaluable for optimisation.
There are some caveats; see the oprofile manual page at http://oprofile.sourceforge.net/doc/interpreting.html for details. In particular, the line association is not perfect; it can assign blame incorrectly due to hardware latencies. Use your common sense.
By default oprofile may be configured with a maximum call graph depth of 0 (not terribly useful). You can see your current maximum call graph depth with:
sudo opcontrol --status
and change it if required with:
sudo opcontrol --callgraph=DEPTH
After running under oprofile a call graph can be generated with:
sudo opreport -c /path/to/binary
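For example, a complete call-graph session (re-using the hypothetical hotspot binary from earlier, with an arbitrary depth of 16) might look like:
sudo opcontrol --callgraph=16
sudo opcontrol --start; sudo opcontrol --reset; ./hotspot; sudo opcontrol --shutdown
sudo opreport -c ./hotspot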
The availability of the callgraph functionality is dependent upon your platform and kernel version.