This document is current for OProfile version 0.7cvs. This document provides some details on the internal workings of OProfile for the interested hacker. This document assumes strong C, working C++, plus some knowledge of kernel internals and CPU hardware.
Only the "new" implementation associated with kernel 2.6 and above is covered here. 2.4 uses a very different kernel module implementation and daemon to produce the sample files.
OProfile is a statistical continuous profiler. In other words, profiles are generated by regularly sampling the current registers on each CPU (in an interrupt handler, the saved PC value at the time of the interrupt is recorded), and converting that runtime PC value into something meaningful to the programmer.
OProfile achieves this by taking the stream of sampled PC values, along with the detail of which task was running at the time of the interrupt, and converting it into a file offset against a particular binary file. Because applications mmap() the code they run (be it /bin/bash, /lib/libfoo.so or whatever), it's possible to find the relevant binary file and offset by walking the task's list of mapped memory areas. Each PC value is thus converted into a tuple of binary-image,offset. This is something that the userspace tools can use directly to reconstruct where the code came from, including the particular assembly instructions, symbol, and source line (via the binary's debug information, if present).
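As a rough illustration, this translation amounts to a walk over the task's mapped memory areas looking for a file-backed mapping that covers the sampled PC. The fragment below is only a simplified sketch of that idea, not the actual OProfile code (the real logic lives in drivers/oprofile/buffer_sync.c and must also cope with kernel addresses, anonymous mappings and so on):

#include <linux/mm.h>
#include <linux/sched.h>

/*
 * Simplified sketch only: map a sampled PC in a task's address space to
 * the backing file and the offset of the PC within that file.
 */
static struct file *pc_to_image_offset(struct mm_struct *mm,
                                       unsigned long pc,
                                       unsigned long *offset)
{
        struct vm_area_struct *vma;

        for (vma = mm->mmap; vma; vma = vma->vm_next) {
                if (pc < vma->vm_start || pc >= vma->vm_end)
                        continue;
                if (!vma->vm_file)
                        continue;       /* anonymous mapping: heap, stack, ... */
                /* file offset of the mapping start, plus offset of the PC within it */
                *offset = (vma->vm_pgoff << PAGE_SHIFT) + (pc - vma->vm_start);
                return vma->vm_file;
        }
        return NULL;    /* no file-backed mapping covers this PC */
}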
Regularly sampling the PC value like this approximates what was actually executed and how often; more often than not, this statistical approximation is good enough to reflect reality. In common operation, the time between each sample interrupt is regulated by a fixed number of clock cycles. This implies that the results will reflect where the CPU is spending the most time; this is obviously a very useful information source for performance analysis.
Sometimes though, an application programmer needs different kinds of information: for example, "which of the source routines cause the most cache misses?". The rise in importance of such metrics in recent years has led many CPU manufacturers to provide hardware performance counters capable of measuring these events at the hardware level. Typically, these counters increment once per event, and generate an interrupt on reaching some pre-defined number of events. OProfile can use these interrupts to generate samples: the profile results are then a statistical approximation of which code caused how many occurrences of the given event.
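As a concrete illustration of the overflow mechanism: on many CPUs, "interrupt after N events" is arranged by loading the counter with the negated count, so that it wraps, and raises its interrupt, after exactly N events; the interrupt handler then reloads it. The helper names below are hypothetical, the arithmetic is the point:

/*
 * Hypothetical helpers, for illustration only: arrange for a counter
 * interrupt every "count" events by loading the counter with -count.
 */
static void program_counter(int counter, unsigned long count)
{
        write_counter(counter, -(long)count);   /* hypothetical: load the counter register */
        enable_counter_interrupt(counter);      /* hypothetical: unmask the overflow interrupt */
}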
There is typically more than one of these counters, so it's possible to set up profiling for several different event types. Using these counters gives us a powerful, low-overhead way of gaining performance metrics. If OProfile, or the CPU, does not support performance counters, a simpler method is used: the kernel timer interrupt feeds samples into OProfile itself.
The rest of this document concerns itself with how we get from receiving samples at interrupt time to producing user-readable profile information.
If OProfile supports the hardware performance counters found on a particular architecture, the code for setting up and managing these counters can be found in the kernel source tree in the relevant arch/<arch>/oprofile/ directory. The architecture-specific implementation works by filling in the oprofile_operations structure at init time. This provides a set of operations such as setup(), start(), stop(), etc. that manage the hardware-specific details of fiddling with the performance counter registers.
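A minimal sketch of what this hook-up looks like is shown below; the exact field names and the init function's signature changed a little over the 2.6 series, so treat the details as illustrative rather than authoritative:

#include <linux/oprofile.h>
#include <linux/init.h>

static int my_setup(void)     { /* program the counter control registers */ return 0; }
static void my_shutdown(void) { /* undo whatever setup() did */ }
static int my_start(void)     { /* enable counter overflow interrupts */ return 0; }
static void my_stop(void)     { /* disable counter overflow interrupts */ }

int __init oprofile_arch_init(struct oprofile_operations *ops)
{
        ops->setup    = my_setup;
        ops->shutdown = my_shutdown;
        ops->start    = my_start;
        ops->stop     = my_stop;
        ops->cpu_type = "my_arch/my_cpu";   /* reported to userspace via oprofilefs */
        return 0;
}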
The other important facility available to the architecture code is oprofile_add_sample(). This is where a particular sample taken at interrupt time is fed into the generic OProfile driver code.
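In other words, the architecture code's overflow (or timer) interrupt handler ends up doing something along these lines; the precise arguments oprofile_add_sample() takes have varied between kernel versions, so this is only a sketch:

/*
 * Sketch of a counter-overflow handler: hand the interrupted context to
 * the generic driver, then re-arm the counter. reset_counter() is a
 * hypothetical helper.
 */
static void my_overflow_handler(struct pt_regs *regs, int counter)
{
        oprofile_add_sample(regs, counter);     /* the sampled PC comes from regs */
        reset_counter(counter);                 /* hypothetical: reload with -count */
}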
OProfile implements a pseudo-filesystem known as "oprofilefs", mounted from userspace at /dev/oprofile. This consists of small files for reporting status to, and receiving configuration from, userspace, as well as the actual character device from which the OProfile userspace daemon reads samples. At setup() time, the architecture-specific code may add further configuration files related to the details of the performance counters. For example, on x86, one numbered directory for each hardware performance counter is added, with files in each for the event type, reset value, etc.
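For a flavour of how those per-counter files get created, here is a sketch of a create_files() hook in the style of the x86 code, using the oprofilefs helper routines; the helper signatures shown follow the 2.6-era interface and should be taken as illustrative:

#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/oprofile.h>

#define NR_CTRS 2       /* illustrative: number of hardware counters */

static struct {
        unsigned long enabled, event, count, unit_mask;
} ctr[NR_CTRS];

static int my_create_files(struct super_block *sb, struct dentry *root)
{
        int i;

        for (i = 0; i < NR_CTRS; ++i) {
                char name[4];
                struct dentry *dir;

                /* one numbered directory per counter: /dev/oprofile/0, /dev/oprofile/1, ... */
                snprintf(name, sizeof name, "%d", i);
                dir = oprofilefs_mkdir(sb, root, name);
                oprofilefs_create_ulong(sb, dir, "enabled",   &ctr[i].enabled);
                oprofilefs_create_ulong(sb, dir, "event",     &ctr[i].event);
                oprofilefs_create_ulong(sb, dir, "count",     &ctr[i].count);
                oprofilefs_create_ulong(sb, dir, "unit_mask", &ctr[i].unit_mask);
        }
        return 0;
}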
The filesystem also contains a stats directory with a number of useful counters for various OProfile events.
This lives in drivers/oprofile/, and forms the core of how OProfile works in the kernel. Its job is to take samples delivered from the architecture-specific code (via oprofile_add_sample()), and buffer this data, in a transformed form as described later, until the data is released to the userspace daemon via the /dev/oprofile/buffer character device.
The OProfile userspace daemon's job is to take the raw data provided by the kernel and write it to disk. It takes the single data stream from the kernel and logs sample data against a number of sample files (found in /var/lib/oprofile/samples/current/). For the benefit of the "separate" functionality, the names/paths of these sample files are mangled to reflect where the samples came from: this can include thread IDs, the binary file path, the event type used, and more.
After this final step from interrupt to disk file, the data is now persistent (that is, changes in the running of the system do not invalidate stored data). So the post-profiling tools can run on this data at any time (assuming the original binary files are still available and unchanged, naturally).
The primary binary image used by an application. This is derived from the kernel and corresponds to the binary started upon running an application: for example, /bin/bash.
An ELF file containing executable code: this includes kernel modules, the kernel itself (a.k.a. vmlinux), shared libraries, and application binaries.
Short for "dentry cookie". A unique ID that can be looked up to provide the full path name of a binary image.
A binary image that is dependent upon an application, used with per-application separation. Most commonly, shared libraries. For example, if /bin/bash is running and we take some samples inside the C library itself due to bash calling library code, then the image /lib/libc.so would be dependent upon /bin/bash.
This refers to the ability to merge several distinct sample files into one set of data at runtime, in the post-profiling tools. For example, per-thread sample files can be merged into one set of data, because they are compatible (i.e. the aggregation of the data is meaningful), but it's not possible to merge sample files for two different events, because there would be no useful meaning to the results.
A collection of profile data that has been collected under the same class template. For example, if we're using opreport to show results after profiling with two performance counters enabled, one counting DATA_MEM_REFS and the other CPU_CLK_UNHALTED, there would be two profile classes, one for each event. Or if we're on an SMP system doing per-cpu profiling, and we ask opreport to show results for each CPU side-by-side, there would be a profile class for each CPU.
The parameters the user passes to the post-profiling tools that limit what sample files are used. This specification is matched against the available sample files to generate a selection of profile data.
The parameters that define what goes in a particular profile class. This includes a symbolic name (e.g. "cpu:1") and the code-usable equivalent.