Oprofile
Oprofile is a sampling-based profiler for Linux.
To understand what that means, contrast its architecture with gprof. When gprof profiling is used, a flag (-pg) is given to the compiler, which tells it to instrument the code itself to record what it is doing. The instrumentation records the call graph, while sampling is used to determine where the code is at any given time. This information is stored in a binary file (gmon.out), which is then post-processed by the gprof tool into a human-readable format.
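For reference, a typical gprof session looks something like the following (myprog.c is a hypothetical source file):
gcc -pg -o myprog myprog.c
./myprog
gprof myprog gmon.out > profile.txt
Running the instrumented binary writes gmon.out to the current directory, which gprof then post-processes.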
By contrast, oprofile does not depend on any compiler instrumentation whatsoever. No special compiler directives are necessary to use it. Oprofile loads a special kernel module into the memory of your kernel, and samples the system-wide activity for particular events at a user-specified sampling rate. Examples of particular events might be CPU time, cache misses, or branch mispredictions.
The advantages of oprofile over gprof are the following:
- Oprofile can profile information other than CPU time.
- Oprofile can deliver information specific to each line of code or assembly instruction; the information from gprof is given at the subroutine level.
- Oprofile can tell you how much time your program is spending in libraries, even if those libraries are not compiled with profiling or debugging flags.
The main disadvantages are that its use is more complicated, that it requires root access, and that it is a little buggy.
Install the oprofile package with the usual
sudo apt-get install oprofile
It then needs initialising with
sudo opcontrol --init
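To verify that the kernel module actually loaded, something like
lsmod | grep oprofile
should list an oprofile entry.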
Suppose we wish to profile where the CPU spends its time in a subroutine.
First, find the event you wish to profile with:
ophelp
Ophelp lists all the events available on your CPU. On my CPU, the name for the relevant event is GLOBAL_POWER_EVENTS; on yours, it might instead be called CPU_CLK_UNHALTED.
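The list is long, so it can help to filter it; for example, to search for clock-related events:
ophelp | grep -i clk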
Now, we need to configure the oprofile daemon to tell it what we want to profile.
sudo opcontrol --no-vmlinux
sudo opcontrol --event=GLOBAL_POWER_EVENTS:1000000
The --no-vmlinux refers to oprofile's ability to profile the running Linux kernel, which is irrelevant to us. The 1000000 after GLOBAL_POWER_EVENTS is the sample period: a sample is taken every 1000000 occurrences of the event, so a smaller value means more frequent sampling. If the rate is set too low (a large numeric value), the sampling will be inaccurate; if it is set too high (a small numeric value), your machine will freeze as it spends all its time profiling itself. The oprofile documentation is plastered with warnings about this, for you can easily freeze your machine by setting the value incorrectly. Start with a large number, and work your way down if you find your results are insufficiently accurate.
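To get a feel for the numbers: on a hypothetical 2 GHz CPU, a GLOBAL_POWER_EVENTS count of 1000000 means one sample every 1000000 unhalted clock cycles, i.e. roughly 2 x 10^9 / 10^6 = 2000 samples per second per core. Halving the count doubles the sampling frequency, and the overhead with it.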
Confirm that oprofile has accepted your settings with
sudo opcontrol --status
Now, here comes the action part:
sudo opcontrol --start; sudo opcontrol --reset; sudo opcontrol --status; run_program;
sudo opcontrol --shutdown
where run_program is whatever program is to be profiled. The --reset argument causes the daemon to lose whatever information it had stored; this is so that previous results do not pollute your current profile. The --start and --shutdown are self-explanatory.
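If you want to test the workflow before unleashing it on a real code, a tiny test program with one obvious hot spot is enough. The following is a minimal sketch (the file name and loop bounds are arbitrary):
/* hotspot.c: a deliberately wasteful test program with one obvious hot loop */
#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    long i, j;

    for (i = 0; i < 500; i++)
        for (j = 1; j <= 1000000; j++)
            sum += 1.0 / (double) j;   /* this division should dominate the samples */

    printf("sum = %f\n", sum);   /* print the result so the loop is not optimised away */
    return 0;
}
Compile it with debugging symbols (gcc -g -O2 -o hotspot hotspot.c; keep the -g, for reasons explained below), substitute ./hotspot for run_program above, and most of the samples should land in the inner loop.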
If your program is not compiled with debugging symbols, then line information is not present in the binary and the most oprofile can give you is subroutine-level information. Try
opreport -l /path/to/binary
This produces output like the following (shamelessly copied from the oprofile manual):
Counted GLOBAL_POWER_EVENTS events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 23150
vma samples % image name symbol name
0804be10 14971 28.1993 oprofiled odb_insert
0804afdc 7144 13.4564 oprofiled pop_buffer_value
c01daea0 6113 11.5144 vmlinux __copy_to_user_ll
Ignoring the specific details, it lists the subroutines in order of how many samples are associated with each.
The real power of oprofile comes when it is able to associate the samples with lines of source code. To do this, the binary must be compiled with debugging symbols. Use
opannotate --source --output-dir=annotated /path/to/binary
This will generate a directory called 'annotated' containing all the source code that oprofile can identify. If you compiled your binary in /home/pfarrell/source/program, then the annotated source will be located in annotated/home/pfarrell/source/program. The source code will look something like
...
:static uint64_t pop_buffer_value(struct transient * trans)
11510 1.9661 :{ /* pop_buffer_value total: 89901 15.3566 */
: uint64_t val;
:
10227 1.7469 : if (!trans->remaining) {
: fprintf(stderr, "BUG: popping empty buffer !\n");
: exit(EXIT_FAILURE);
: }
:
: val = get_buffer_value(trans->buffer, 0);
2281 0.3896 : trans->remaining--;
2296 0.3922 : trans->buffer += kernel_pointer_size;
: return val;
10454 1.7857 :}
...
The information before the : on the left is the annotation. The first number is the number of samples associated with each line; the second is the percentage of the total samples associated with it. As you can imagine, this information is very useful for identifying lines that miss cache or mispredict branches.
If samples are associated with a function declaration (i.e., not with lines of source code inside the function definition, but with the line where the function interface is defined), then those samples occurred during the function's initialisation and finalisation (its prologue and epilogue). To see in more detail exactly where, try looking at the assembly output. To do this, use
opannotate --source --assembly /path/to/binary > assembly.out 2>&1
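Since assembly.out can be very large, it is easiest to search it for the subroutine of interest, e.g. (using the function from the example above):
grep -n 'pop_buffer_value' assembly.out
and then jump to the reported line number in your editor.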
This will take a while, as it has to disassemble the entire binary (and since Fluidity binaries are generally huge, this can take up to an hour on a slower machine). But it might be worth it: the information may be invaluable for optimisation.
There are some caveats; see the oprofile manual page at http://oprofile.sourceforge.net/doc/interpreting.html for details. In particular, the line association is not perfect; it can assign blame incorrectly due to hardware latencies. Use your common sense.
By default oprofile may be configured with a maximum call graph depth of 0 (not terribly useful). You can see your current maximum call graph depth with:
sudo opcontrol --status
and change it if required with:
sudo opcontrol --callgraph=DEPTH
After running under oprofile a call graph can be generated with:
sudo opreport -c /path/to/binary
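For example, a complete call-graph session (re-using the hypothetical hotspot binary from earlier, with an arbitrary depth of 16) might look like:
sudo opcontrol --callgraph=16
sudo opcontrol --start; sudo opcontrol --reset; ./hotspot; sudo opcontrol --shutdown
sudo opreport -c ./hotspot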
The availability of the callgraph functionality is dependent upon your platform and kernel version.