From: Mikael Pettersson This patch updates perfctr's documentation: - adds new Implementation Notes section to the x86 documentation - some minor fixes in the x86 documentation - adds new documentation on the per-process perfctrs - adds new overview documentation Signed-off-by: Mikael Pettersson Signed-off-by: Andrew Morton --- 25-akpm/Documentation/perfctr/low-level-x86.txt | 122 +++++++- 25-akpm/Documentation/perfctr/overview.txt | 129 ++++++++ 25-akpm/Documentation/perfctr/virtual.txt | 355 ++++++++++++++++++++++++ 3 files changed, 602 insertions(+), 4 deletions(-) diff -puN Documentation/perfctr/low-level-x86.txt~perfctr-documentation-update Documentation/perfctr/low-level-x86.txt --- 25/Documentation/perfctr/low-level-x86.txt~perfctr-documentation-update 2004-07-12 00:04:07.771309056 -0700 +++ 25-akpm/Documentation/perfctr/low-level-x86.txt 2004-07-12 00:04:07.776308296 -0700 @@ -8,6 +8,13 @@ This document only describes x86-specifi For detailed hardware control register layouts, see the manufacturers' documentation. +Contents +======== +- Supported processors +- Contents of +- Processor-specific Notes +- Implementation Notes + Supported processors ==================== - Intel P5, P5MMX, P6, P4. @@ -17,8 +24,8 @@ Supported processors - VIA C3. (bad P6 clone) - Any generic x86 with a TSC. -Contents of -================================= +Contents of +================================ "struct perfctr_sum_ctrs" ------------------------- @@ -116,7 +123,7 @@ On HT P4s, resource conflicts can occur (T0 and T1) in a processor share the same perfctr registers. To prevent conflicts, only thread 0 in each processor is allowed to access the counters. perfctr_cpus_forbidden_mask contains the -smp_processor_id()'s of each processor's thread 1, and it is the +smp_processor_id()s of each processor's thread 1, and it is the responsibility of the high-level driver to ensure that it never accesses the perfctr state from a forbidden thread. @@ -155,7 +162,7 @@ The hardware stores both counters' contr control register, the CESR MSR. The evntsel values are limited to 16 bits each, and are combined by the low-level driver to form the value for the CESR. Apart from that, -the evntsel values direct images of the CESR. +the evntsel values are direct images of the CESR. Bits 0xFE00 in an evntsel value are reserved. At least one evntsel CPL bit (0x00C0) must be set. @@ -244,3 +251,110 @@ If p4.pebs_enable is non-zero: - at least one of bits 10, 9, 2, 1, or 0 must be set - in p4.pebs_matrix_vert, all bits except 1 and 0 must be clear, and at least one of bits 1 and 0 must be set + +Implementation Notes +==================== + +Caching +------- +Each 'struct perfctr_cpu_state' contains two cache-related fields: +- 'id': a unique identifier for the control data contents +- 'isuspend_cpu': the identity of the CPU on which a state containing + interrupt-mode counters was last suspended + +To this the driver adds a per-CPU cache, recording: +- the 'id' of the control data currently in that CPU +- the current contents of each control register + +When perfctr_cpu_update_control() has validated the new control data, +it also updates the id field. + +The driver's internal 'write_control' function, called from the +perfctr_cpu_resume() API function, first checks if the state's id +matches that of the CPU's cache, and if so, returns. Otherwise +it checks each control register in the state and updates those +that do not match the cache. Finally, it writes the state's id +to the cache. Tests on various x86 processor types have shown that +MSR writes are very expensive: the purpose of these cache checks +is to avoid MSR writes whenever possible. + +Unlike accumulation-mode counters, interrupt-mode counters must be +physically stopped when suspended, primilarly to avoid overflow +interrupts in contexts not expecting them, and secondarily to avoid +increments to the counters themselves (see below). + +When suspending interrupt-mode counters, the driver: +- records the CPU identity in the per-CPU cache +- stops each interrupt-mode counter by disabling its control register +- lets the cache and state id values remain the same + +Later, when resuming interrupt-mode counters, the driver: +- if the state and cache id values match: + * the cache id is cleared, to force a reload of the control + registers stopped at suspend (see below) + * if the state's "suspend" CPU identity matches the current CPU, + the counter registers are still valid, and the procedure returns +- if the procedure did not return above, it then loops over each + interrupt-mode counter: + * the counter's control register is physically disabled, unless + the cache indicates that it already is disabled; this is necessary + to prevent premature events and overflow interrupts if the CPU's + registers previously belonged to some other state + * then the counter register itself is restored +After this interrupt-mode specific resume code is complete, the +driver continues by calling 'write_control' as described above. +The state and cache ids will not match, forcing write_control to +reload the disabled interrupt-mode control registers. + +Call-site Backpatching +---------------------- +The x86 family of processors is quite diverse in how their +performance counters work and are accessed. There are three +main designs (P5, P6, and P4) with several variations. +To handle this the processor type detection and initialisation +code sets up a number of function pointers to point to the +correct procedures for the actual CPU type. + +Calls via function pointers are more expensive than direct calls, +so the driver actually performs direct calls to wrappers that +backpatch the original call sites to instead call the actual +CPU-specific functions in the future. + +Unsynchronised code backpatching in SMP systems doesn't work +on Intel P6 processors due to an erratum, so the driver performs +a "finalise backpatching" step after the CPU-specific function +pointers have been set up. This step invokes the API procedures +on a temporary state object, set up to force every backpatchable +call site to be invoked and adjusted. + +Several low-level API procedures are called in the context-switch +path by the per-process perfctrs kernel extension, which motivates +the efforts to reduce runtime overheads as much as possible. + +Overflow Interrupts +------------------- +The x86 hardware enables overflow interrupts via the local +APIC's LVTPC entry, which is only present in P6/K7/K8/P4. + +The low-level driver supports overflow interrupts as follows: +- It reserves a local APIC vector, 0xee, as LOCAL_PERFCTR_VECTOR. +- It adds a local APIC exception handler to entry.S, which + invokes the driver's smp_perfctr_interrupt() procedure. +- It adds code to i8259.c to bind the LOCAL_PERFCTR_VECTOR + interrupt gate to the exception handler in entry.S. +- During processor type detection, it records whether the + processor supports the local APIC, and sets up function pointers + for the suspend and resume operations on interrupt-mode counters. +- When the low-level driver is activated, it enables overflow + interrupts by writing LOCAL_PERFCTR_VECTOR to each CPU's APIC_LVTPC. +- Overflow interrupts now end up in smp_perfctr_interrupt(), which + ACKs the interrupt and invokes the interrupt handler installed + by the high-level service/driver. +- When the low-level driver is deactivated, it disables overflow + interrupts by masking APIC_LVTPC in each CPU. It then releases + the local APIC back to the NMI watchdog. + +At compile-time, the low-level driver indicates overflow interrupt +support by enabling CONFIG_PERFCTR_INTERRUPT_SUPPORT. If the feature +is also available at runtime, it sets the PERFCTR_FEATURE_PCINT flag +in the perfctr_info object. diff -puN /dev/null Documentation/perfctr/overview.txt --- /dev/null 2003-09-15 06:40:47.000000000 -0700 +++ 25-akpm/Documentation/perfctr/overview.txt 2004-07-12 00:04:07.777308144 -0700 @@ -0,0 +1,129 @@ +$Id: perfctr-documentation-update.patch,v 1.1 2004/07/12 05:41:57 akpm Exp $ + +AN OVERVIEW OF PERFCTR +====================== +The perfctr package adds support to the Linux kernel for using +the performance-monitoring counters found in many processors. + +Perfctr is internally organised in three layers: + +- The low-level drivers, one for each supported architecture. + Currently there are two, one for 32 and 64-bit x86 processors, + and one for 32-bit PowerPC processors. + + low-level-api.txt documents the model of the performance counters + used in this package, and the internal API to the low-level drivers. + + low-level-{x86,ppc}.txt provide documentation specific for those + architectures and their low-level drivers. + +- The high-level services. + There is currently one, a kernel extension adding support for + virtualised per-process performance counters. + See virtual.txt for documentation on this kernel extension. + + [There used to be a second high-level service, a simple driver + to control and access all performance counters in all processors. + This driver is currently removed, pending an acceptable new API.] + +- The top-level, which performs initialisation and implements + common procedures and system calls. + +Rationale +--------- +The perfctr package solves three problems: + +- Hardware invariably restricts programming of the performance + counter registers to kernel-level code, and sometimes also + restricts reading the counters to kernel-level code. + + Perfctr adds APIs allowing user-space code access the counters. + In the case of the per-process counters kernel extension, + even non-privileged processes are allowed access. + +- Hardware often limits the precision of the hardware counters, + making them unsuitable for storing total event counts. + + The counts are instead maintained as 64-bit values in software, + with the hardware counters used to derive increments over given + time periods. + +- In a non-modified kernel, the thread state does not include the + performance monitoring counters, and the context switch code + does not save and restore them. In this situation the counters + are system-wide, making them unreliable and inaccurate when used + for monitoring specific processes or specific segments of code. + + The per-process counters kernel extension treats the counter state as + part of the thread state, solving the reliability and accuracy problems. + +Non-goals +--------- +Providing high-level interfaces that abstract and hide the +underlying hardware is a non-goal. Such abstractions can +and should be implemented in user-space, for several reasons: + +- The complexity and variability of the hardware means that + any abstraction would be inaccurate. There would be both + loss of functionality, and presence of functionality which + isn't supportable on any given processor. User-space tools + and libraries can implement this, on top of the processor- + specific interfaces provided by the kernel. + +- The implementation of such an abstraction would be large + and complex. (Consider ESCR register assignment on P4.) + Performing complex actions in user-space simplifies the + kernel, allowing it to concentrate on validating control + data, managing processes, and driving the hardware. + (C.f. the role of compilers.) + +- The abstraction is purely a user-convenience thing. The + kernel-level components have no need for it. + +Common System Calls +=================== +This lists those system calls that are not tied to +a specific high-level service/driver. + +Querying CPU and Driver Information +----------------------------------- +int err = sys_perfctr_info(struct perfctr_info *info, + struct perfctr_cpu_mask *cpus, + struct perfctr_cpu_mask *forbidden); + +This operation retrieves information from the kernel about +the processors in the system. + +If non-NULL, '*info' will be updated with information about the +capabilities of the processor and the low-level driver. + +If non-NULL, '*cpus' will be updated with a bitmask listing the +set of processors in the system. The size of this bitmask is not +statically known, so the protocol is: + +1. User-space initialises cpus->nrwords to the number of elements + allocated for cpus->mask[]. +2. The kernel reads cpus->nrwords, and then writes the required + number of words to cpus->nrwords. +3. If the required number of words is less than the original value + of cpus->nrwords, then an EOVERFLOW error is signalled. +4. Otherwise, the kernel converts its internal cpumask_t value + to the external format and writes that to cpus->mask[]. + +If non-NULL, '*forbidden' will be updated with a bitmask listing +the set of processors in the system on which users must not try +to use performance counters. This is currently only relevant for +hyper-threaded Pentium 4/Xeon systems. The protocol is the same +as for '*cpus'. + +Notes: +- The internal representation of a cpumask_t is as an array of + unsigned long. This representation is unsuitable for user-space, + because it is not binary-compatible between 32 and 64-bit + variants of a big-endian processor. The 'struct perfctr_cpu_mask' + type uses an array of unsigned 32-bit integers. +- The protocol for retrieving a 'struct perfctr_cpu_mask' was + designed to allow user-space to quickly determine the correct + size of the 'mask[]' array. Other system calls use weaker protocols, + which force user-space to guess increasingly larger values in a + loop, until finally an acceptable value was guessed. diff -puN /dev/null Documentation/perfctr/virtual.txt --- /dev/null 2003-09-15 06:40:47.000000000 -0700 +++ 25-akpm/Documentation/perfctr/virtual.txt 2004-07-12 00:04:07.780307688 -0700 @@ -0,0 +1,355 @@ +$Id: perfctr-documentation-update.patch,v 1.1 2004/07/12 05:41:57 akpm Exp $ + +VIRTUAL PER-PROCESS PERFORMANCE COUNTERS +======================================== +This document describes the virtualised per-process performance +counters kernel extension. See "General Model" in low-level-api.txt +for the model of the processor's performance counters. + +Contents +======== +- Summary +- Design & Implementation Notes + * State + * Thread Management Hooks + * Synchronisation Rules + * The Pseudo File System +- API For User-Space + * Opening/Creating the State + * Updating the Control + * Unlinking the State + * Reading the State + * Resuming After Handling Overflow Signal + * Reading the Counter Values +- Limitations / TODO List + +Summary +======= +The virtualised per-process performance counters facility +(virtual perfctrs) is a kernel extension which extends the +thread state to record perfctr settings and values, and augments +the context-switch code to save perfctr values at suspends and +restore them at resumes. This "virtualises" the performance +counters in much the same way as the kernel already virtualises +general-purpose and floating-point registers. + +Virtual perfctrs also adds an API allowing non-privileged +user-space processes to set up and access their perfctrs. + +As this facility is primarily intended to support developers +of user-space code, both virtualisation and allowing access +from non-privileged code are essential features. + +Design & Implementation Notes +============================= + +State +----- +The state of a thread's perfctrs is packaged up in an object of +type 'struct vperfctr'. It consists of CPU-dependent state, a +sampling timer, and some auxiliary administrative data. This is +an independent object, with its own lifetime and access rules. + +The state object is attached to the thread via a pointer in its +thread_struct. While attached, the object records the identity +of its owner thread: this is used for user-space API accesses +from threads other than the owner. + +The state is separate from the thread_struct for several resons: +- It's potentially large, hence it's allocated only when needed. +- It can outlive its owner thread. The state can be opened as + a pseudo file: as long as that file is live, so is the object. +- It can be mapped, via mmap() on the pseudo file's descriptor. + To facilitate this, a full page is allocated and reserved. + +Thread Management Hooks +----------------------- +Virtual perfctrs hooks into several thread management events: + +- exit_thread(): Calls perfctr_exit_thread() to stop the counters + and detach the thread's vperfctr object. + +- copy_thread(): Calls perfctr_copy_thread() to initialise + the child's vperfctr pointer. Currently the settings are + not inherited from parent to child, so the pointer is set + to NULL in the child's thread_struct. + +- switch_to(): + * Calls perfctr_suspend_thread() on the previous thread, to + suspend its counters. + * Calls perfctr_resume_thread() on the next thread, to resume + its counters. Also resets the sampling timer (see below). + +- update_process_times(): Calls perfctr_sample_thread(), which + decrements the sampling timer and samples the counters if the + timer reaches zero. + + Sampling is normally only done at switch_to(), but if too much + time passes before the next switch_to(), a hardware counter may + increment by more than its range (usually 2^32). If this occurs, + the difference from its start value will be incorrect, causing + its updated sum to also be incorrect. The sampling timer is used + to prevent this problem, which has been observed on SMP machines, + and on high clock frequency UP machines. + +- set_cpus_allowed(): Calls perfctr_set_cpus_allowed() to detect + attempts to migrate the thread to a "forbidden" CPU, in which + case a flag in the vperfctr object is set. perfctr_resume_thread() + checks this flag, and if set, marks the counters as stopped and + sends a SIGILL to the thread. + + The notion of forbidden CPUs is a workaround for a design flaw + in hyper-threaded Pentium 4s and Xeons. See low-level-x86.txt + for details. + +To reduce overheads, these hooks are implemented as inline functions +that check if the thread is using perfctrs before calling the code +that implements the behaviour. The hooks also reduce to no-ops if +CONFIG_PERFCTR_VIRTUAL is disabled. + +Synchronisation Rules +--------------------- +There are four types of accesses to a thread's perfctr state: + +1. Thread management events (see above) done by the thread itself. + Suspend, resume, and sample are lock-less. + +2. API operations done by the thread itself. + These are lock-less, except when an individual operation + has specific synchronisation needs. For instance, preemption + is often disabled to prevent accesses due to context switches. + +3. API operations done by a different thread ("monitor thread"). + The owner thread must be suspended for the duration of the operation. + This is ensured by requiring that the monitor thread is ptrace()ing + the owner thread, and that the owner thread is in TASK_STOPPED state. + +4. set_cpus_allowed(). + The kernel does not lock the target during set_cpus_allowed(), + so it can execute concurrently with the owner thread or with + some monitor thread. In particular, the state may be deallocated. + + To solve this problem, both perfctr_set_cpus_allowed() and the + operations that can change the owner thread's perfctr pointer + (creat, unlink, exit) perform a task_lock() on the owner thread + before accessing the perfctr pointer. + + When concurrent set_cpus_allowed() isn't a problem (because the + architecture doesn't have a notion of forbidden CPUs), atomicity + of updates to the thread's perfctr pointer is ensured by disabling + preemption. + +The Pseudo File System +---------------------- +The perfctr state is accessed from user-space via a file descriptor. + +The main reason for this is to enable mmap() on the file descriptor, +which gives read-only access to the state. + +The file descriptor is a handle to the perfctr state object. This +allows a very simple implementation of the user-space 'perfex' +program, which runs another program with given perfctr settings +and reports their final values. Without this handle, monitoring +applications like perfex would have to be implemented like debuggers +in order to catch the target thread's exit and retrieve the counter +values before the exit completes and the state disappears. + +The file for a perfctr state object belongs to the vperfctrs pseudo +file system. Files in this file system support only a few operations: +- mmap() +- release() decrements the perfctr object's reference count and + deallocates the object when no references remain +- the listing of a thread's open file descriptors identifies + perfctr state file descriptors as belonging to "vperfctrfs" +The implementation is based on the code for pipefs. + +In previous versions of the perfctr package, the file descriptors +for perfctr state objects also supported the API's ioctl() method. + +API For User-Space +================== + +Opening/Creating the State +-------------------------- +int fd = sys_vperfctr_open(int tid, int creat); + +'tid' must be the id of a thread, or 0 which is interpreted as an +alias for the current thread. + +This operation returns an open file descriptor which is a handle +on the thread's perfctr state object. + +If 'creat' is non-zero and the object did not exist, then it is +created and attached to the thread. The newly created state object +is inactive, with all control fields disabled and all counters +having the value zero. If 'creat' is non-zero and the object +already existed, then an EEXIST error is signalled. + +If 'tid' does not denote the current thread, then it must denote a +thread that is stopped and under ptrace control by the current thread. + +Notes: +- The access rule in the non-self case is the same as for the + ptrace() system call. It ensures that no other thread, including + the target thread itself, can access or change the target thread's + perfctr state during the operation. +- An open file descriptor for a perfctr state object counts as a + reference to that object; even if detached from its thread the + object will not be deallocated until the last reference is gone. +- The file descriptor can be passed to mmap(), for low-overhead + counter sampling. See "READING THE COUNTER VALUES" for details. +- The file descriptor can be passed to another thread. Accesses + from threads other than the owner are permitted as long as they + posses the file descriptor and use ptrace() for synchronisation. + +Updating the Control +-------------------- +int err = sys_vperfctr_control(int fd, const struct vperfctr_control *control); + +'fd' must be the return value from a call to sys_vperfctr_open(), +The perfctr object must still be attached to its owner thread. + +This operation stops and samples any currently running counters in +the thread, and then updates the control settings. If the resulting +state has any enabled counters, then the counters are restarted. + +Before restarting, the counter sums are reset to zero. However, +if a counter's bit is set in the control object's 'preserve' +bitmask field, then that counter's sum is not reset. The TSC's +sum is only reset if the TSC is disabled in the new state. + +If any of the programmable counters are enabled, then the thread's +CPU affinity mask is adjusted to exclude the set of forbidden CPUs. + +If the control data activates any interrupt-mode counters, then +a signal (specified by the 'si_signo' control field) will be sent +to the owner thread after an overflow interrupt. The documentation +for sys_vperfctr_iresume() describes this mechanism. + +If 'fd' does not denote the current thread, then it must denote a +thread that is stopped and under ptrace control by the current thread. +The perfctr state object denoted by 'fd' must still be attached +to its owner thread. + +Notes: +- It is strongly recommended to memset() the vperfctr_control object + to all-bits-zero before setting the fields of interest. +- Stopping the counters is done by invoking the control operation + with a control object that activates neither the TSC nor any PMCs. + +Unlinking the State +------------------- +int err = sys_vperfctr_unlink(int fd); + +'fd' must be the return value from a call to sys_vperfctr_open(). + +This operation stops and samples the thread's counters, and then +detaches the perfctr state object from the thread. If the object +already had been detached, then no action is performed. + +If 'fd' does not denote the current thread, then it must denote a +thread that is stopped and under ptrace control by the current thread. + +Reading the State +----------------- +int err = sys_vperfctr_read(int fd, struct perfctr_sum_ctrs *sum, + struct vperfctr_control *control); + +'fd' must be the return value from a call to sys_vperfctr_open(). + +This operation copies data from the perfctr state object to +user-space. If 'sum' is non-NULL, then the counter sums are +written to it. If 'control' is non-NULL, then the control data +is written to it. + +If the perfctr state object is attached to the current thread, +then the counters are sampled and updated first. + +If 'fd' does not denote the current thread, then it must denote a +thread that is stopped and under ptrace control by the current thread. + +Notes: +- An alternate and faster way to retrieve the counter sums is described + below. This system call can be used if the hardware does not permit + user-space reads of the counters. + +Resuming After Handling Overflow Signal +--------------------------------------- +int err = sys_vperfctr_iresume(int fd); + +'fd' must be the return value from a call to sys_vperfctr_open(). +The perfctr object must still be attached to its owner thread. + +When an interrupt-mode counter has overflowed, the counters +are sampled and suspended (TSC remains active). Then a signal, +as specified by the 'si_signo' control field, is sent to the +owner thread: the associated 'struct siginfo' has 'si_code' +equal to 'SI_PMC_OVF', and 'si_pmc_ovf_mask' equal to the set +of overflown counters. + +The counters are suspended to avoid generating new performance +counter events during the execution of the signal handler, but +the previous settings are saved. Calling sys_vperfctr_iresume() +restores the previous settings and resumes the counters. Doing +this is optional. + +If 'fd' does not denote the current thread, then it must denote a +thread that is stopped and under ptrace control by the current thread. + +Reading the Counter Values +-------------------------- +The value of a counter is computed from three components: + + value = sum + (now - start); + +Two of these (sum and start) reside in the kernel's state object, +and the third (now) is the contents of the hardware counter. +To perform this computation in user-space requires access to +the state object. This is achieved by passing the file descriptor +from sys_vperfctr_open() to mmap(): + + volatile const struct vperfctr_state *kstate; + kstate = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, 0); + +Reading the three components is a non-atomic operation. If the +thread is scheduled during the operation, the three values will +not be consistent and the wrong result will be computed. +To detect this situation, user-space should check the kernel +state's TSC start value before and after the operation, and +retry the operation in case of a mismatch. + +The algorithm for retrieving the value of counter 'i' is: + + tsc0 = kstate->cpu_state.tsc_start; + for(;;) { + rdpmcl(kstate->cpu_state.pmc[i].map, now); + start = kstate->cpu_state.pmc[i].start; + sum = kstate->cpu_state.pmc[i].sum; + tsc1 = kstate->cpu_state.tsc_start; + if (likely(tsc1 == tsc0)) + break; + tsc0 = tsc1; + } + return sum + (now - start); + +The algorithm for retrieving the value of the TSC is similar, +as is the algorithm for retrieving the values of all counters. + +Notes: +- Since the state's TSC time-stamps are used, the algorithm requires + that user-space enables TSC sampling. +- The algorithm requires that the hardware allows user-space reads + of the counter registers. If this property isn't statically known + for the architecture, user-space should retrieve the kernel's + 'struct perfctr_info' object and check that the PERFCTR_FEATURE_RDPMC + flag is set. + +Limitations / TODO List +======================= +- Perfctr settings are not inherited from parent to child at fork(). + The issue is not fork() but propagating final counts from children + to parents, and allowing user-space to distinguish "self" counts + from "children" counts. + An implementation of this feature is being planned. +- Buffering of overflow samples is not implemented. So far, not a + single user has requested it. _