From: Mikael Pettersson

- core driver files and kernel changes

DESC
perfctr: remove bogus perfctr_sample_thread() calls
EDESC
From: Mikael Pettersson

2.6.10-mm3 added perfctr_sample_thread() calls in
account_{user,system}_time(). I believe these to be bogus:

1. When they are called from update_process_times(), there will be
   two perfctr_sample_thread()s per tick, one of which is redundant.

2. s390's weird timer tick code calls both account_{user,system}_time()
   directly, bypassing update_process_times(). In this case there will
   also be two perfctr_sample_thread()s per tick.

I believe the proper fix is to remove the new calls and, should s390
ever get perfctr support, add _one_ perfctr_sample_thread() call in
s390's account_user_vtime(). The patch below removes the extraneous
calls.

Signed-off-by: Mikael Pettersson

DESC
perfctr: i386
EDESC
From: Mikael Pettersson

- i386 driver and arch changes

DESC
perfctr x86 core updates
EDESC
From: Mikael Pettersson

- Move the perfctr_suspend_thread() call from __switch_to() to the
  beginning of switch_to(). Ensures that suspend actions are done
  while the owner task still is 'current'.

Signed-off-by: Mikael Pettersson

DESC
perfctr x86 driver updates
EDESC
From: Mikael Pettersson

- Add a facility for masking perfctr interrupts. To reduce overheads,
  this is done in software via a per-cpu mask instead of writing to
  the local APIC.

- Mask interrupts when interrupt-mode counters are suspended, and
  unmask when they are resumed. Prevents delayed interrupts (due to
  a HW quirk) from being delivered to the wrong tasks.

- The suspend path records whether any interrupt-mode counters are in
  overflow state. This informs the higher levels that a pending
  interrupt (now masked) must be simulated.

Signed-off-by: Mikael Pettersson

DESC
perfctr: x86 driver cleanup
EDESC
From: Mikael Pettersson

- Provide an API function for checking for pending interrupts. Avoids
  direct structure accesses in higher levels, which is required for
  ppc32.

Signed-off-by: Mikael Pettersson

DESC
Prescott fix for perfctr
EDESC
From: Mikael Pettersson

This eliminates a potential oops in perfctr's x86 initialisation code
when running on a P4 Model 3 Prescott processor. The P4M3 removed two
control registers. I knew that and handled it in the control setup
validation code, but I forgot to also modify the initialisation code
to avoid clearing them.

Perfctr hasn't been hit by this problem on the P4M3 Noconas, but
people are reporting that oprofile and the NMI watchdog oops due to
this on P4M3 Prescotts.

Signed-off-by: Mikael Pettersson

DESC
perfctr x86 update 2
EDESC
From: Mikael Pettersson

Part 2/3 of perfctr control API changes:

- Switch per-counter control fields from struct-of-arrays to
  array-of-struct layout, placed at the end of the perfctr_cpu_control
  struct for flexibility.

- Drop ____cacheline_aligned from the per-cpu cache object.

- In the per-cpu cache object, make the interrupts_masked flag share a
  cache line with the cache id field.

Signed-off-by: Mikael Pettersson

DESC
perfctr: x86_64
EDESC
From: Mikael Pettersson

- x86_64 arch changes

DESC
perfctr x86_64 core updates
EDESC
From: Mikael Pettersson

- Move the perfctr_suspend_thread() call from __switch_to() to the
  beginning of switch_to(). Ensures that suspend actions are done
  while the owner task still is 'current'.

Signed-off-by: Mikael Pettersson
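To make the interrupt masking in the "perfctr x86 driver updates" entry
above concrete, here is a minimal sketch of the scheme as it appears in
the x86 driver later in this patch; the helper and field names match
that driver, but the surrounding context is abbreviated:

    static inline void perfctr_cpu_mask_interrupts(struct per_cpu_cache *cache)
    {
        cache->interrupts_masked = 1;   /* set at suspend; no APIC write */
    }

    static inline void perfctr_cpu_unmask_interrupts(struct per_cpu_cache *cache)
    {
        cache->interrupts_masked = 0;   /* cleared at resume */
    }

    asmlinkage void smp_perfctr_interrupt(struct pt_regs *regs)
    {
        ack_APIC_irq();
        /* A hardware-delayed interrupt may arrive after its counters
           were suspended; drop it here rather than delivering it to
           whatever task is current now. The suspend path has already
           recorded any genuine pending overflow, so the higher levels
           can simulate it at resume. */
        if (get_cpu_cache()->interrupts_masked)
            return;
        irq_enter();
        (*perfctr_ihandler)(instruction_pointer(regs));
        irq_exit();
    }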
DESC
perfctr: PowerPC
EDESC
From: Mikael Pettersson

- PowerPC driver and arch changes

Signed-off-by: Mikael Pettersson

DESC
perfctr: ppc32 driver update
EDESC
From: Mikael Pettersson

- Provide the new API function for checking for pending interrupts: on
  ppc32 it always returns false.

- Enable performance counter interrupts on the later non-broken IBM
  750 series processors (FX DD2.3, and GX).

Signed-off-by: Mikael Pettersson

DESC
perfctr ppc32 MMCR0 handling fixes
EDESC
From: Mikael Pettersson

This patch is a cleanup and correction to perfctr's PPC32 low-level
driver. It is a prerequisite for the next patch, which enables
performance monitor interrupts.

Details from my RELEASE-NOTES entry:

- PPC32: Correct MMCR0 handling for FCECE/TRIGGER. Read MMCR0 at
  suspend and then freeze the counters. Move this code from
  read_counters() to suspend(). At resume, reload MMCR0 to unfreeze
  the counters. Clean up the cstatus checks controlling this
  behaviour.

Signed-off-by: Mikael Pettersson

DESC
perfctr ppc32 update
EDESC
From: Mikael Pettersson

This patch is an update to perfctr's PPC32 low-level driver:

- Add support for the MPC7447A processor.

- Add partial support for the new MPC7448 processor. PLL_CFG decoding
  is not yet implemented, due to lack of docs.

- Enable overflow interrupt support on all G4 processors except those
  with the DEC/TAU/PMI erratum.

- Wrap thread_struct's perfctr pointer in an #ifdef to avoid bloat
  when perfctr is disabled. This was requested by some users in the
  PPC32 embedded world.

Signed-off-by: Mikael Pettersson

DESC
perfctr ppc32 update
EDESC
From: Mikael Pettersson

Part 1/3 of perfctr control API changes:

- Switch per-counter control fields from struct-of-arrays to
  array-of-struct layout, placed at the end of the perfctr_cpu_control
  struct for flexibility.

- Drop ____cacheline_aligned from the per-cpu cache object.

Signed-off-by: Mikael Pettersson

DESC
perfctr: virtualised counters
EDESC
From: Mikael Pettersson

- driver for virtualised (per-process) performance counters

Signed-off-by: Mikael Pettersson

DESC
virtual perfctr illegal sleep
EDESC
From: Mikael Pettersson

This patch fixes an illegal sleep issue in perfctr's virtualised
per-process counters: a spinlock was taken around calls to
perfctr_cpu_{reserve,release}(), which sleep on a mutex. Change the
spinlock to a mutex too.

The problem was reported by Sami Farin. Strangely enough,
DEBUG_SPINLOCK_SLEEP only triggers if I also have PREEMPT enabled. Is
it supposed to be like that?

Signed-off-by: Mikael Pettersson

DESC
Make PERFCTR_VIRTUAL default in Kconfig match recommendation in help text
EDESC
From: Jesper Juhl

A tiny patch to make PERFCTR_VIRTUAL default to Y, matching the
recommendation given in the help text. The help has a very clear
"Say Y" recommendation, and it doesn't make much sense not to enable
this when PERFCTR is set, so it should default to Y, not N as it does
currently.

Signed-off-by: Jesper Juhl

DESC
perfctr ifdef cleanup
EDESC
From:

Cleaning up some #if/#ifdef confusion in the perfctr patch.

DESC
perfctr: Kconfig-related updates
EDESC
From: Mikael Pettersson

- Default CONFIG_PERFCTR_INIT_TESTS to n.

- Change PERFCTR_INTERRUPT_SUPPORT from a conditional #define to a
  Kconfig-derived option. Ditto PERFCTR_CPUS_FORBIDDEN_MASK_NEEDED.

- Add URL and mailing list pointer to the Kconfig help text.

Signed-off-by: Mikael Pettersson

DESC
perfctr virtual updates
EDESC
From: Mikael Pettersson

- When a task is resumed, check whether suspend recorded that an
  overflow interrupt is pending. If so, handle the overflow and
  deliver the signal.

- Split the interrupt handler in two parts: one used only for hardware
  interrupts, and one also used for software-generated interrupts.

- Change the signal generation code to not wake up the target task
  (== current). Avoids lockups when the interrupt/signal is generated
  from switch_to().

- Remove an obsolete comment at vperfctr_suspend().

Signed-off-by: Mikael Pettersson
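As a rough illustration of the MMCR0 discipline described in the
"perfctr ppc32 MMCR0 handling fixes" entry above, assuming the usual
ppc32 SPR accessors and the MMCR0 freeze-counters (FC) bit; the
saved_mmcr0 field name is illustrative, not the driver's exact layout:

    static void ppc32_suspend_sketch(struct perfctr_cpu_state *state)
    {
        /* FCECE/TRIGGER can change MMCR0 behind our back, so read
           the live register rather than trusting a cached copy,
           then freeze the counters before sampling them. */
        state->saved_mmcr0 = mfspr(SPRN_MMCR0);
        mtspr(SPRN_MMCR0, MMCR0_FC);
        /* ... read the PMCs here ... */
    }

    static void ppc32_resume_sketch(const struct perfctr_cpu_state *state)
    {
        /* ... reload the PMCs here ... */
        /* Reloading the saved MMCR0 last unfreezes the counters. */
        mtspr(SPRN_MMCR0, state->saved_mmcr0);
    }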
DESC
perfctr: virtual cleanup
EDESC
From: Mikael Pettersson

Check for pending overflow via the new API function. Skip clearing the
pending-interrupt flag: the low-level driver takes care of that. Both
changes are required for ppc32.

Signed-off-by: Mikael Pettersson

DESC
perfctr ppc32 preliminary interrupt support
EDESC
From: Mikael Pettersson

This patch adds preliminary support for performance monitor interrupts
to perfctr's PPC32 low-level driver. It requires the MMCR0 handling
fixes from the previous patch I sent.

PPC arranges the counters in two disjoint groups, and each group has a
single global interrupt enable bit. This is a problem because the API
assumes per-counter control. The fix is to filter out stray
interrupts, but this is not yet implemented. (On my TODO list.)

Tested on an MPC7455 (G4-type chip). The patch applies cleanly to and
compiles ok in 2.6.9-rc2-mm3, but 2.6.9-rc2-mm3 has other problems on
PPC32, so I tested it with 2.6.9-rc2 vanilla.

Signed-off-by: Mikael Pettersson

DESC
perfctr: reduce stack usage
EDESC
From: Mikael Pettersson

- Reduce stack usage by using kmalloc() instead of the stack for
  temporary state and control copies.

- Eliminate some unnecessary cpumask_t copies. Use the newish
  cpus_intersects() instead of cpus_and(); !cpus_empty().

Signed-off-by: Mikael Pettersson

DESC
perfctr interrupt_support Kconfig fix
EDESC
From: Mikael Pettersson

On x86, PERFCTR_INTERRUPT_SUPPORT is supposed to be a derived,
non-user-controllable option which is set if and only if
X86_LOCAL_APIC is set. However, I broke that logic when I added
preliminary user-selectable interrupt support to ppc32. The patch
below fixes it.

Signed-off-by: Mikael Pettersson

DESC
perfctr low-level documentation
EDESC
From: Mikael Pettersson

This patch adds documentation for perfctr's low-level drivers in
Documentation/perfctr/. The internal API between perfctr's low-level
and high-level drivers is described, as are the architecture-specific
data structures users use to control and inspect the counters.

Signed-off-by: Mikael Pettersson

DESC
perfctr inheritance: driver updates
EDESC
From: Mikael Pettersson

This set of patches adds "inheritance" support to the per-process
performance counters code in 2.6.8-rc1-mm1, bringing it in sync with
the stand-alone perfctr-2.7.4 package.

Inheritance has the following semantics:

- At fork()/clone(), the child gets the same perfctr control settings
  (but fresh/reset counters) as its parent.

- As an exited child is reaped, if it still uses the exact same
  control as its parent, then its final counts (self plus children)
  are merged into its parent's "children counts" state. This is
  analogous to how the kernel handles plain time etc. If either
  parent or child has reprogrammed their counters since the fork(),
  then the child's final counts are not merged back.

This is a feature users have asked for repeatedly, and it's the only
embarrassing feature omission in the current code. It's not perfect,
since one cannot distinguish child 1 from child 2 or some grandchild,
but it's easy and cheap to implement.
The implementation is as follows:

- The per-process counters object is extended with "children counts".

- To determine if the control in parent and child are related, each
  new control setting gets a new 64-bit id. fork() copies control and
  id to the child. release_task() checks the ids of child and parent
  and only merges the final counts if the ids match.

- The copy_thread() callback is renamed to copy_task(), and also takes
  the "struct pt_regs *regs" as parameter. "regs" is needed to check
  whether the thread is created for a user-space fork()/clone() or for
  a kernel-level thread; in the latter case the perfctr state is _not_
  inherited.

- Adds a callback to release_task(), invoked at the point where the
  other child accounting values (time etc.) are propagated to the
  parent.

- The tsk->thread.perfctr locking rules are strengthened to always
  take task_lock(tsk). Previously it sufficed to disable preemption
  when HT P4s couldn't occur.

The updated perfctr-2.7.4 library and tools package is needed to
actually use the updated kernel code.

This patch:

- Bump driver version to 2.7.4

- Add children counts & control inheritance id to the per-process
  perfctr state

- Drop the vperfctr_task_lock() wrapper, always use task_lock() now

- Add copy_task() callback to inherit perfctr settings from parent to
  child

- Add release_task() callback to merge final counts back to the parent

- Extend sys_vperfctr_read() to allow reading children counts

Signed-off-by: Mikael Pettersson
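A condensed sketch of the reap-time merge described above; the field
and helper names (inheritance_id, sum, children) are illustrative, and
the real driver defers this work via schedule_work() when needed, as
the locking-fix entry below explains:

    /* At release_task(): fold the reaped child's final counts into
       the parent's "children counts", but only if neither side has
       reprogrammed its counters since the fork(). */
    static void vperfctr_merge_children_sketch(struct vperfctr *parent,
                                               const struct vperfctr *child)
    {
        unsigned int i;

        /* Each new control setting gets a fresh 64-bit id, and
           fork() copies it; equal ids mean "exact same control". */
        if (child->inheritance_id != parent->inheritance_id)
            return;

        parent->children.tsc += child->sum.tsc + child->children.tsc;
        for (i = 0; i < ARRAY_SIZE(parent->children.pmc); ++i)
            parent->children.pmc[i] += child->sum.pmc[i]
                                     + child->children.pmc[i];
    }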
DESC
perfctr inheritance: kernel updates
EDESC
From: Mikael Pettersson

- s/perfctr_copy_thread(&p->thread)/perfctr_copy_task(p, regs)/g
  Needed to access the task struct (for setting the owner in the new
  perfctr state) and for accessing regs (for checking user_mode(regs))

- Add perfctr_release_task() callback in kernel/exit.c

Signed-off-by: Mikael Pettersson

DESC
perfctr inheritance: documentation updates
EDESC
From: Mikael Pettersson

- Documentation changes for the new task event callbacks, updated
  locking rules, API update, and TODO list update

Signed-off-by: Mikael Pettersson

DESC
perfctr inheritance locking fix
EDESC
From: Mikael Pettersson

This patch eliminates the illegal task_lock() that perfctr's
inheritance feature introduced in release_task().

- Changed __vperfctr_release() to use schedule_work() to do the
  task_lock(parent) etc. in a different thread's context. This is
  because release_task() has a write lock on the task list lock, and
  task_lock() is forbidden in that case. When current == parent, this
  is bypassed and the merge work is done immediately without taking
  task_lock().

  Added children_lock to struct vperfctr, to synchronise accesses
  (release/update_control/read) to the children array.

Signed-off-by: Mikael Pettersson

DESC
perfctr API changes: first step
EDESC
From: Mikael Pettersson

This patch is the first step in the planned perfctr API changes. It
converts sys_vperfctr_read() to interpret a command token telling it
what to read, instead of getting a bunch of pointers to structs to
update. Right now the functionality is the same; the point of the
change is to allow new functionality to be added without breaking the
API. The "write control data" and "other state changes" APIs will
also be updated shortly.

Signed-off-by: Mikael Pettersson

DESC
perfctr virtual update
EDESC
From: Mikael Pettersson

Part 3/3 of perfctr control API changes:

- Changed sys_vperfctr_control() to handle variable-sized cpu_control
  data. Moved the cpu_control field to the end of struct
  vperfctr_control, added a size parameter to sys_vperfctr_control(),
  and changed do_vperfctr_control() to only copy as many bytes as
  user-space indicates. Together with the array-of-struct layout for
  per-counter control fields, this:

  * maintains API compatibility even if future processors add more
    counters, and

  * allows for reduced argument copying when user-space only uses a
    small subset of the available counters.

- Bump version.

Signed-off-by: Mikael Pettersson

DESC
perfctr x86-64 ia32 emulation fix
EDESC
From: Mikael Pettersson

The perfctr syscall numbers changed in the i386 kernel recently, but
the x86-64 kernel's ia32 emulation was not updated at the same time.
This patch fixes that.

Signed-off-by: Mikael Pettersson

DESC
perfctr sysfs update: core
EDESC
From: Mikael Pettersson

This patch set changes perfctr to publish its global information via a
textual sysfs interface instead of passing binary structs via
sys_perfctr_info(). We can now remove sys_perfctr_info().

Perfctr sysfs update part 1/4:

- Publish global information via text files in sysfs.

- Remove sys_perfctr_info() from arch-neutral code.

Signed-off-by: Mikael Pettersson

DESC
Perfctr sysfs update
EDESC
From: Mikael Pettersson

Perfctr sysfs update:

- Simplify the perfctr sysfs code.

Signed-off-by: Mikael Pettersson

DESC
perfctr sysfs update: x86
EDESC
From: Mikael Pettersson

Perfctr sysfs update part 2/4:

- Remove sys_perfctr_info() from x86.

Signed-off-by: Mikael Pettersson

DESC
perfctr sysfs update: x86-64
EDESC
From: Mikael Pettersson

Perfctr sysfs update part 3/4:

- Remove sys_perfctr_info() from x86-64.

Signed-off-by: Mikael Pettersson

DESC
perfctr: syscall numbers in x86-64 ia32-emulation
EDESC
From: Mikael Pettersson

2.6.10-mm3 reverted the perfctr syscall numbers in x86-64's
ia32-emulation to an older definition from when sys_perfctr_info()
still existed. This patch fixes that.

Signed-off-by: Mikael Pettersson

DESC
perfctr x86_64 native syscall numbers fix
EDESC
From: Mikael Pettersson

2.6.10-mm3 added some syscalls to x86-64, but the preliminary syscall
numbers for perfctr weren't adjusted, causing an overlap. Fix below.

Signed-off-by: Mikael Pettersson

DESC
perfctr sysfs update: ppc32
EDESC
From: Mikael Pettersson

Perfctr sysfs update part 4/4:

- Remove sys_perfctr_info() from ppc32.

Signed-off-by: Mikael Pettersson

DESC
perfctr-2.7.10 API update 1/4: common
EDESC
From: Mikael Pettersson

This set of patches forms the first half of a major perfctr API
update. The goal is to change the upload-new-control-data system call
to be much more generic and independent of struct layouts. To this
end the upload-new-control-data syscall will become

	ret = sys_vperfctr_write(fd, namespace, data, datalen);

where namespace determines how data is to be interpreted. Initially
there will probably be one namespace for perfctr's software state, and
one CPU-specific namespace for pure hardware state; the latter will
probably be expressed generically as an array of <register number,
register value> pairs.

This API change will however require that the write() operation
doesn't imply a (re)start of the context, since usually more than one
write will be needed to upload all control data. Therefore this first
set of patches alters the API so that control data uploads and
parameterless state changes are performed by different system calls.
The current control() call becomes a light-weight write() call, but
still using the old control data layout. A new unified control() call
is introduced for state changes, replacing and extending the current
unlink() and iresume() calls.

perfctr-2.7.10 update, 1/4:

- Added new sys_vperfctr_control(), with UNLINK, SUSPEND, RESUME, and
  CLEAR sub-commands. Deleted sys_vperfctr_unlink() and
  sys_vperfctr_iresume(). Changed sys_vperfctr_write() to only update
  control data and not reenable the context. RESUME now works both
  for resuming after overflow interrupts and for restarting after
  changing control data.

- Renamed old sys_vperfctr_control() to sys_vperfctr_write().

Signed-off-by: Mikael Pettersson
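In user-space terms, the write/control split above turns the old
single heavyweight call into a write-then-resume sequence. A minimal
sketch, assuming hypothetical syscall wrappers (only the
VPERFCTR_CONTROL_RESUME command is taken from the patch itself):

    extern long vperfctr_write(int fd, unsigned int domain,
                               const void *data, unsigned int datalen);
    extern long vperfctr_control(int fd, unsigned int cmd);

    static int program_and_start(int fd, unsigned int domain,
                                 const void *data, unsigned int datalen)
    {
        /* Possibly several writes: uploading control data no longer
           implies a (re)start of the context. */
        if (vperfctr_write(fd, domain, data, datalen) < 0)
            return -1;
        /* State changes are a separate, parameterless call; RESUME
           restarts counting with the new control data. */
        if (vperfctr_control(fd, VPERFCTR_CONTROL_RESUME) < 0)
            return -1;
        return 0;
    }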
DESC
perfctr-2.7.10 API update 2/4: i386
EDESC
From: Mikael Pettersson

perfctr-2.7.10 update, 2/4:

- Update the i386 syscall table for the perfctr-2.7.10 API changes.

Signed-off-by: Mikael Pettersson

DESC
perfctr-2.7.10 API update 3/4: x86_64
EDESC
From: Mikael Pettersson

perfctr-2.7.10 update, 3/4:

- Update the x86_64 syscall table for the perfctr-2.7.10 API changes.

Signed-off-by: Mikael Pettersson

DESC
perfctr-2.7.10 API update 4/4: ppc32
EDESC
From: Mikael Pettersson

perfctr-2.7.10 update, 4/4:

- Update the ppc32 syscall table for the perfctr-2.7.10 API changes.

Signed-off-by: Mikael Pettersson

DESC
perfctr API update 1/9: physical indexing, x86
EDESC
From: Mikael Pettersson

This is the last planned major perfctr API update. This set of
patches changes how control data is communicated between user-space
and the kernel. The main idea is that the control data is partitioned
into "domains", and each domain is given its own representation. The
design principles were:

- Data directly corresponding to CPU register contents is sent in
  variable-length arrays. This allows us to handle future CPUs with
  more control registers without breaking any binary structure
  layouts. The register numbers used are the natural numbers for that
  platform, i.e. MSR numbers on x86 and SPR numbers on PPC.

- Potentially variable-length arrays are not embedded in other
  API-visible structures, but are in separate domains. This allows
  for larger arrays in the future, and it also allows user-space to
  pass only as much data as is necessary. The virtual-to-physical
  counter mapping is handled this way.

- Simple purely software-defined structures are Ok, as long as they
  don't contain variable-length data or CPU register values.

- No embedded user-space pointers anywhere, to avoid having to
  special-case 32-bit binaries on 64-bit kernels.

The API write function takes a <domain, data, datalen> triple,
interprets the data given the domain, and updates the in-kernel
control structures accordingly. The API read function is similar.

Implementing this is done in a sequence of four logical steps:

1. The low-level drivers are adjusted to use physical register
   numbers, not virtual ones, when indexing their control structures.
   This is needed because with the new API, the user's control data
   will be a physically-indexed image of the CPU state, not a
   virtually-indexed image as before.

2. Common header fields in the low-level control structures are broken
   out into a separate structure. This is because those fields form a
   separate domain in the new API.

3. The low-level drivers are extended with an in-kernel API for
   writing control data in the new form to the control structure. A
   similar read API is also added.

4. sys_vperfctr_write() and sys_vperfctr_read() are converted to the
   new domain-based form.

These changes require an updated user-space library, which I'll
release tomorrow. After this there will be some minor cleanups, and
then I'll start merging David Gibson's ppc64 driver.
This patch: - Switch x86 driver to use physically-indexed control data. - Rearrange struct perfctr_cpu_control. Remove _reserved fields. - On P5 and P5 clones users must now format the two counters' control data into a single CESR image. - On P4 check ESCR value after retrieving the counter's ESCR number. Signed-off-by: Mikael Pettersson DESC perfctr API update 2/9: physical indexing, ppc32 EDESC From: Mikael Pettersson - Switch ppc32 driver to use physically-indexed control data. - Rearrange struct perfctr_cpu_control. Remove _reserved fields. - ppc_mmcr[] array in struct perfctr_cpu_state is no longer needed. - In perfctr_cpu_update_control, call check_ireset after check_control, since check_ireset now needs to use the virtual-to-physical map. - Users must now format the 2-6 event selector values into the MMCR0/MMCR1 images. - Verify that unused/non-existent parts of MMCR images are zero. Signed-off-by: Mikael Pettersson DESC perfctr API update 3/9: cpu_control_header, x86 EDESC From: Mikael Pettersson - Move tsc_on/nractrs/nrictrs control fields to new struct cpu_control_header. This depends on the physical-indexing patch for x86. Signed-off-by: Mikael Pettersson DESC perfctr API update 4/9: cpu_control_header, ppc32 EDESC From: Mikael Pettersson - Move tsc_on/nractrs/nrictrs control fields to new struct cpu_control_header. This depends on the physical-indexing patch for ppc32. Signed-off-by: Mikael Pettersson DESC perfctr API update 5/9: cpu_control_header, common EDESC From: Mikael Pettersson - Move tsc_on/nractrs/nrictrs control fields to new struct cpu_control_header. Signed-off-by: Mikael Pettersson DESC perfctr API update 6/9: cpu_control access, common EDESC From: Mikael Pettersson - Add declarations of common arch-specific domain numbers and corresponding data structures to . This just factors out common code, it does not impose any requirements on arch-specific code. Signed-off-by: Mikael Pettersson DESC perfctr API update 7/9: cpu_control access, x86 EDESC From: Mikael Pettersson - Implement perfctr_cpu_control_write()/read() in-kernel API. Only handle PERFCTR_DOMAIN_CPU_REGS, as the other domains will be handled in generic code. - Add per-CPU family reg_offset() functions, and have CPU family detection set up a pointer to the appropriate function. This depends on the physical-indexing patch for x86, and on the common part of the cpu_control access patch. Signed-off-by: Mikael Pettersson DESC perfctr API update 8/9: cpu_control access, ppc32 EDESC From: Mikael Pettersson - Implement perfctr_cpu_control_write()/read() in-kernel API. Only handle PERFCTR_DOMAIN_CPU_REGS, as the other domains will be handled in generic code. - Implement get_reg_offset() via a static table. The ppc32 SPR numbers we use don't form a nice dense range, alas. This depends on the physical-indexing patch for ppc32, and on the common part of the cpu_control access patch. Signed-off-by: Mikael Pettersson DESC perfctr API update 9/9: domain-based read/write syscalls EDESC From: Mikael Pettersson - Convert sys_vperfctr_write() to accept triples. Have it interpret codes for the common domains, and pass unknown domains to perfctr_cpu_control_write(). - In sys_vperfctr_read(), replace "cmd" by "domain" and complete conversion to fine-grained domains for control data. - Remove _reserved and cpu_control fields from struct vperfctr_control. This depends on the cpu_control header and the cpu_control access patches. 
DESC
perfctr ia32 syscalls on x86-64 fix
EDESC
From: Mikael Pettersson

The ia32 perfctr syscalls were moved due to the addition of the ioprio
syscalls, but the ia32 emulation code in x86-64 wasn't updated.
Simple fix below.

Signed-off-by: Mikael Pettersson

DESC
perfctr cleanups: common
EDESC
From: Mikael Pettersson

Common-code cleanups for perfctr:

- init.c: remove an unused #include, don't initialise perfctr_info,
  don't show the dummy cpu_type, show the driver version directly from
  VERSION.

- <linux/perfctr.h>: remove types & constants not used in the kernel
  any more, make perfctr_info kernel-only and remove unused fields,
  use explicitly-sized integers in user-visible types.

Signed-off-by: Mikael Pettersson

DESC
perfctr cleanups: ppc32
EDESC
From: Mikael Pettersson

ppc32-specific cleanups for perfctr:

- ppc.c: don't initialise the obsolete perfctr_info.cpu_type, use
  DEFINE_SPINLOCK().

- <asm-ppc/perfctr.h>: remove the cpu_type constants and
  PERFCTR_CPU_VERSION unused in the kernel, use explicitly-sized
  integers in user-visible types, make perfctr_cpu_control
  kernel-private.

Signed-off-by: Mikael Pettersson

DESC
perfctr cleanups: x86
EDESC
From: Mikael Pettersson

x86-specific cleanups for perfctr:

- x86.c: use DEFINE_SPINLOCK().

- <asm-i386/perfctr.h>: remove the cpu_type constants and
  PERFCTR_CPU_VERSION unused in the kernel, use explicitly-sized
  integers in user-visible types, make perfctr_cpu_control
  kernel-private.

Signed-off-by: Mikael Pettersson

DESC
perfctr: x86 fix and cleanups
EDESC
From: Mikael Pettersson

Some small fixes and cleanups. The ppc64 code should be next, but I'm
waiting for David Gibson to look over and ACK the API changes I've
inflicted on his code first.

x86 fix and cleanups:

- finalise_backpatching() now exercises all control flow paths, to
  ensure that calls in cloned control flows are backpatched properly.
  This is needed for gcc-4.0.

- Eliminate a power-of-two sizeof assumption in access_regs().

- Merge check_ireset() and setup_imode_start_values().

Signed-off-by: Mikael Pettersson

DESC
perfctr: ppc32 fix and cleanups
EDESC
From: Mikael Pettersson

ppc32 fix and cleanups:

- If check_ireset() fails, clear state->cstatus to undo any settings
  check_control() may have left there.

- Eliminate a power-of-two sizeof assumption in access_regs().

- Merge check_ireset() and setup_imode_start_values().

Signed-off-by: Mikael Pettersson

DESC
perfctr: 64-bit values in register descriptors
EDESC
From: Mikael Pettersson

- <linux/perfctr.h>: Change the value fields in the register
  descriptors to 64 bits. This will be needed for ppc64, and for
  ppc32 user-space on ppc64 kernels, and may eventually also be needed
  on x86. We could have different descriptor types for 32 and 64-bit
  registers, but that just complicates things for no real benefit.

Signed-off-by: Mikael Pettersson

DESC
perfctr-64-bit-values-in-register-descriptors fix
EDESC
From: Mikael Pettersson

- <linux/perfctr.h>: Change the number fields in the register
  descriptors to 64 bits as well. Otherwise i386 binaries break on
  x86_64 kernels, since the descriptors get larger alignment and sizes
  on x86_64 than on i386.

Signed-off-by: Mikael Pettersson

DESC
perfctr: mapped state cleanup: x86
EDESC
From: Mikael Pettersson

- Swap the cstatus and k1 fields in struct perfctr_cpu_state. Move
  the now contiguous user-visible fields to struct
  perfctr_cpu_state_user. Hide kernel-private stuff. Inline the now
  obsolete k1 struct. Cleanups.

Signed-off-by: Mikael Pettersson
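The reason for the descriptor-width fix above can be seen from the
layouts a mixed 32/64-bit descriptor would get; the struct names here
are hypothetical, the real one is struct perfctr_cpu_reg:

    struct reg_mixed {      /* what a 32-bit 'nr' field would give */
        __u32 nr;
        __u64 value;
    };
    /* i386:   alignof(__u64) == 4 => sizeof(struct reg_mixed) == 12
       x86_64: alignof(__u64) == 8 => sizeof(struct reg_mixed) == 16
       so an i386 binary and an x86_64 kernel disagree on the stride
       of a descriptor array. With both fields __u64 the layout is
       16 bytes everywhere, and i386 binaries work on x86_64 kernels. */
    struct reg_fixed {
        __u64 nr;
        __u64 value;
    };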
DESC
perfctr: mapped state cleanup: ppc32
EDESC
From: Mikael Pettersson

- Swap the cstatus and k1 fields in struct perfctr_cpu_state. Move
  the now contiguous user-visible fields to struct
  perfctr_cpu_state_user. Hide kernel-private stuff. Inline the now
  obsolete k1 struct. Cleanups.

Signed-off-by: Mikael Pettersson

DESC
perfctr: mapped state cleanup: common
EDESC
From: Mikael Pettersson

- Update virtual.c for the perfctr_cpu_state layout change.

- Add a perfctr sysfs attribute providing user-space with the offset
  in an mmap()ed perfctr object to the user-visible state.

Signed-off-by: Mikael Pettersson

DESC
perfctr: ppc64 arch hooks
EDESC
From: Mikael Pettersson

Here's a 3-part patch kit which adds a ppc64 driver to perfctr,
written by David Gibson. ppc64 is sufficiently different from ppc32
that this driver is kept separate from my ppc32 driver. This
shouldn't matter unless people actually want to run ppc32 kernels on
ppc64 processors.

ppc64 perfctr driver from David Gibson:

- ppc64 arch hooks: Kconfig, syscall numbers and tables, task struct,
  and process management ops (switch_to, exit, fork)

Signed-off-by: Mikael Pettersson

DESC
perfctr: common updates for ppc64
EDESC
From: Mikael Pettersson

ppc64 perfctr driver from David Gibson:

- perfctr common updates: Makefile, version

- perfctr virtual quirk: the ppc64 low-level driver is unable to
  prevent all stray overflow interrupts; on ppc64 (and only ppc64) the
  right action in this case is to ignore the interrupt and resume

Signed-off-by: Mikael Pettersson

DESC
perfctr: ppc64 driver core
EDESC
From: Mikael Pettersson

ppc64 perfctr driver from David Gibson:

- ppc64 perfctr driver core

Signed-off-by: Mikael Pettersson

DESC
perfctr: x86 ABI update
EDESC
From: Mikael Pettersson

This 3-part patch set widens the counter 'start' fields in the
mmap()-visible state from 32 to 64 bits, to prepare for future
processors that may need that extra precision. This would bump the
size of the pmc[] array elements to 24 bytes, of which only 20 would
be used, so the 'map' fields in that array are removed -- the kernel
can retrieve that data from the control structure instead, and
user-space can also maintain it itself. This brings the pmc[] array
elements down to 16 bytes again, with 100% utilisation.

The removal of the 'cstatus' field from the user-visible state that
David Gibson proposed has not been done. The problem is that cstatus
also exposes asynchronous state changes (in particular at overflow
interrupts), and I'm not yet convinced that user-space can handle its
removal without undue burden.

perfctr x86 ABI update:

- <asm-i386/perfctr.h>: In the user-visible state, make the start
  fields 64 bits (for future-proofing the ABI). Remove the map field
  from the pmc[] array to avoid underutilised cache lines.

- x86.c: retrieve the mapping from ->control.pmc_map[].

Signed-off-by: Mikael Pettersson

DESC
perfctr: ppc32 ABI update
EDESC
From: Mikael Pettersson

perfctr ppc32 ABI update:

- <asm-ppc/perfctr.h>: In the user-visible state, make the start
  fields 64 bits (for future-proofing the ABI). Remove the map field
  from the pmc[] array to avoid underutilised cache lines.

- ppc.c: retrieve the mapping from ->control.pmc_map[].

- ppc32: Add a sampling counter to the user-visible state, and
  increment it in perfctr_cpu_resume() and perfctr_cpu_sample().

Signed-off-by: Mikael Pettersson

DESC
perfctr: ppc64 ABI update
EDESC
From: Mikael Pettersson

perfctr ppc64 ABI update:

- <asm-ppc64/perfctr.h>: In the user-visible state, make the start
  fields 64 bits (for future-proofing the ABI). Remove the map field
  from the pmc[] array to avoid underutilised cache lines.

- ppc64.c: retrieve the mapping from ->control.pmc_map[].

- ppc64: Add a sampling counter to the user-visible state, and
  increment it in perfctr_cpu_resume() and perfctr_cpu_sample().

Signed-off-by: Mikael Pettersson
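On the user-space side of these ABI updates, a self-monitoring task
might now read a counter roughly as follows, keeping its own copy of
the virtual-to-physical map that the kernel no longer publishes in the
pmc[] array. The function and variable names are illustrative:

    static inline unsigned int rdpmc_low(unsigned int pmc)
    {
        unsigned int low;
        asm volatile("rdpmc" : "=a"(low) : "c"(pmc) : "edx");
        return low;
    }

    /* 'start' is the 64-bit per-counter start value from the
       mmap()ed state; 'map_i' mirrors control.pmc_map[i], now
       maintained by user-space itself. */
    static unsigned long long read_pmc_now(unsigned long long start,
                                           unsigned int map_i)
    {
        /* The hardware counter is read as 32 bits, so compute the
           delta in 32-bit arithmetic and widen afterwards. */
        unsigned int delta = rdpmc_low(map_i) - (unsigned int)start;
        return start + delta;
    }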
DESC
perfctr: ppc64 wraparound fixes
EDESC
From: Mikael Pettersson

Here's an update for perfctr's ppc64 low-level driver which fixes
counter wraparound issues in the driver. This patch was written by
David Gibson; here's his description:

  "Problem was that with the conversion of the perfctr state "start"
  values from 32 to 64 bits, my typing/sign-extension was no longer
  correct, leading to overflows if the hardware counter rolled over
  during a run. With this patch the PAPI testcase failures appear to
  be back down to the two that we know about."

Signed-off-by: Mikael Pettersson
Cc:

DESC
perfctr: x86 update with K8 multicore fixes, take 2
EDESC
From: Mikael Pettersson

Here's an update for perfctr's x86/x86-64 low-level driver which works
around issues with current K8 multicore chips. Following Andi's
comments about the original patch, this version uses cpu_core_map[]
instead of deriving that info manually.

- Added code to detect multicore K8s and prevent threads in the
  thread-centric API from using northbridge events. This avoids
  resource conflicts, and an erratum in Revision E chips.

Signed-off-by: Mikael Pettersson

DESC
perfctr: seqlocks for mmaped state: common
EDESC
From: Mikael Pettersson

This set of patches changes perfctr's low-level drivers to indicate
changes to the mmap:ed counter state via a unified seqlock mechanism.
This cleans up user-space, enables user-space fast sampling in some
previously impossible cases (x86 without TSC), and eliminates a highly
unlikely but not impossible failure case on x86 SMP. This is a
rewrite of a patch originally from David Gibson.

perfctr seqlocks 1/4: common changes

- define write_perfseq_begin/end in <linux/perfctr.h>

- bump the version and sync it with the current user-space package

Signed-off-by: Mikael Pettersson

DESC
perfctr: seqlocks for mmaped state: x86
EDESC
From: Mikael Pettersson

perfctr seqlocks 2/4: x86 changes

- use write_perfseq_begin/end in perfctr_cpu_suspend/resume/sample to
  indicate that the state has changed

- in the mmap:ed state, redefine the filler field as the sequence
  number

Signed-off-by: Mikael Pettersson

DESC
perfctr: seqlocks for mmaped state: ppc64
EDESC
From: Mikael Pettersson

perfctr seqlocks 3/4: ppc64 changes

- use write_perfseq_begin/end in perfctr_cpu_suspend/resume/sample to
  indicate that the state has changed

- in the mmap:ed state, redefine the samplecnt field as the sequence
  number

Signed-off-by: Mikael Pettersson

DESC
perfctr: seqlocks for mmaped state: ppc32
EDESC
From: Mikael Pettersson

perfctr seqlocks 4/4: ppc32 changes

- use write_perfseq_begin/end in perfctr_cpu_suspend/resume/sample to
  indicate that the state has changed

- in the mmap:ed state, redefine the samplecnt field as the sequence
  number

Signed-off-by: Mikael Pettersson
Cc:

Signed-off-by: Andrew Morton
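The user-space reader that these seqlocks enable looks roughly like
this (the reader side of write_perfseq_begin/end; only compiler
barriers are needed because, as the header comment below notes, only
self-monitoring tasks may sample the mmap:ed state). The struct name
is illustrative:

    struct state_user snapshot(const volatile __u32 *seq_p,
                               const struct state_user *shared)
    {
        struct state_user copy;
        __u32 seq;

        do {
            seq = *seq_p;       /* odd => an update is in progress */
            barrier();          /* compiler barrier only, no SMP */
            copy = *shared;     /* snapshot the mmap:ed state */
            barrier();
        } while ((seq & 1) || *seq_p != seq);   /* retry torn reads */
        return copy;
    }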
---

 CREDITS                                     |    1 
 Documentation/perfctr/low-level-api.txt     |  216 +++
 Documentation/perfctr/low-level-ppc32.txt   |  164 ++
 Documentation/perfctr/low-level-x86.txt     |  360 +++++
 Documentation/perfctr/overview.txt          |  129 ++
 Documentation/perfctr/virtual.txt           |  357 +++++
 MAINTAINERS                                 |    6 
 arch/i386/Kconfig                           |    2 
 arch/i386/kernel/entry.S                    |   10 
 arch/i386/kernel/i8259.c                    |    3 
 arch/i386/kernel/process.c                  |    5 
 arch/i386/kernel/syscall_table.S            |    4 
 arch/ppc/Kconfig                            |    2 
 arch/ppc/kernel/head.S                      |    4 
 arch/ppc/kernel/misc.S                      |    4 
 arch/ppc/kernel/process.c                   |    6 
 arch/ppc64/Kconfig                          |    1 
 arch/ppc64/kernel/misc.S                    |    4 
 arch/ppc64/kernel/process.c                 |    6 
 arch/x86_64/Kconfig                         |    2 
 arch/x86_64/ia32/ia32entry.S                |    6 
 arch/x86_64/kernel/entry.S                  |    5 
 arch/x86_64/kernel/i8259.c                  |    3 
 arch/x86_64/kernel/process.c                |    6 
 drivers/Makefile                            |    1 
 drivers/perfctr/Kconfig                     |   64 
 drivers/perfctr/Makefile                    |   19 
 drivers/perfctr/cpumask.h                   |   25 
 drivers/perfctr/init.c                      |  115 +
 drivers/perfctr/ppc.c                       | 1094 +++++++++++++++++
 drivers/perfctr/ppc64.c                     |  749 +++++++++++
 drivers/perfctr/ppc64_tests.c               |  322 +++++
 drivers/perfctr/ppc64_tests.h               |   12 
 drivers/perfctr/ppc_tests.c                 |  288 ++++
 drivers/perfctr/ppc_tests.h                 |   12 
 drivers/perfctr/version.h                   |    1 
 drivers/perfctr/virtual.c                   | 1253 +++++++++++++++++++
 drivers/perfctr/virtual.h                   |   13 
 drivers/perfctr/x86.c                       | 1800 ++++++++++++++++++++++++++++
 drivers/perfctr/x86_tests.c                 |  308 ++++
 drivers/perfctr/x86_tests.h                 |   30 
 include/asm-i386/mach-default/irq_vectors.h |    5 
 include/asm-i386/mach-visws/irq_vectors.h   |    5 
 include/asm-i386/perfctr.h                  |  200 +++
 include/asm-i386/processor.h                |    2 
 include/asm-i386/system.h                   |    1 
 include/asm-i386/unistd.h                   |    6 
 include/asm-ppc/perfctr.h                   |  174 ++
 include/asm-ppc/processor.h                 |    3 
 include/asm-ppc/reg.h                       |   86 +
 include/asm-ppc/unistd.h                    |    6 
 include/asm-ppc64/perfctr.h                 |  167 ++
 include/asm-ppc64/processor.h               |    2 
 include/asm-ppc64/unistd.h                  |    6 
 include/asm-x86_64/hw_irq.h                 |    5 
 include/asm-x86_64/ia32_unistd.h            |    6 
 include/asm-x86_64/irq.h                    |    2 
 include/asm-x86_64/perfctr.h                |    1 
 include/asm-x86_64/processor.h              |    2 
 include/asm-x86_64/system.h                 |    6 
 include/asm-x86_64/unistd.h                 |   10 
 include/linux/perfctr.h                     |  176 ++
 include/linux/sched.h                       |    3 
 kernel/exit.c                               |    2 
 kernel/sched.c                              |    3 
 kernel/sys_ni.c                             |    4 
 kernel/timer.c                              |    2 
 sys.c                                       |    0 
 68 files changed, 8267 insertions(+), 30 deletions(-)

diff -puN CREDITS~perfctr CREDITS
--- devel/CREDITS~perfctr	2005-07-08 23:11:41.000000000 -0700
+++ devel-akpm/CREDITS	2005-07-08 23:11:41.000000000 -0700
@@ -2628,6 +2628,7 @@ N: Mikael Pettersson
 E: mikpe@csd.uu.se
 W: http://www.csd.uu.se/~mikpe/
 D: Miscellaneous fixes
+D: Performance-monitoring counters driver
 
 N: Reed H. Petty
 E: rhp@draper.net
diff -puN drivers/Makefile~perfctr drivers/Makefile
--- devel/drivers/Makefile~perfctr	2005-07-08 23:11:41.000000000 -0700
+++ devel-akpm/drivers/Makefile	2005-07-08 23:11:41.000000000 -0700
@@ -63,6 +63,7 @@ obj-$(CONFIG_MCA) += mca/
 obj-$(CONFIG_EISA) += eisa/
 obj-$(CONFIG_CPU_FREQ) += cpufreq/
 obj-$(CONFIG_MMC) += mmc/
+obj-$(CONFIG_PERFCTR) += perfctr/
 obj-$(CONFIG_INFINIBAND) += infiniband/
 obj-$(CONFIG_SGI_IOC4) += sn/
 obj-y += firmware/
diff -puN /dev/null drivers/perfctr/cpumask.h
--- /dev/null	2003-09-15 06:40:47.000000000 -0700
+++ devel-akpm/drivers/perfctr/cpumask.h	2005-07-08 23:11:41.000000000 -0700
@@ -0,0 +1,25 @@
+/* $Id: cpumask.h,v 1.7 2004/05/12 19:59:01 mikpe Exp $
+ * Performance-monitoring counters driver.
+ * Partial simulation of cpumask_t on non-cpumask_t kernels.
+ * Extension to allow inspecting a cpumask_t as array of ulong.
+ * Appropriate definition of perfctr_cpus_forbidden_mask.
+ *
+ * Copyright (C) 2003-2004 Mikael Pettersson
+ */
+
+#ifdef CPU_ARRAY_SIZE
+#define PERFCTR_CPUMASK_NRLONGS CPU_ARRAY_SIZE
+#else
+#define PERFCTR_CPUMASK_NRLONGS 1
+#endif
+
+/* CPUs in `perfctr_cpus_forbidden_mask' must not use the
+   performance-monitoring counters. TSC use is unrestricted.
+   This is needed to prevent resource conflicts on hyper-threaded P4s.
+ */
+#ifdef CONFIG_PERFCTR_CPUS_FORBIDDEN_MASK
+extern cpumask_t perfctr_cpus_forbidden_mask;
+#define perfctr_cpu_is_forbidden(cpu) cpu_isset((cpu), perfctr_cpus_forbidden_mask)
+#else
+#define perfctr_cpus_forbidden_mask CPU_MASK_NONE
+#define perfctr_cpu_is_forbidden(cpu) 0 /* cpu_isset() needs an lvalue :-( */
+#endif
diff -puN /dev/null drivers/perfctr/init.c
--- /dev/null	2003-09-15 06:40:47.000000000 -0700
+++ devel-akpm/drivers/perfctr/init.c	2005-07-08 23:11:41.000000000 -0700
@@ -0,0 +1,115 @@
+/* $Id: init.c,v 1.81 2005/03/17 23:49:07 mikpe Exp $
+ * Performance-monitoring counters driver.
+ * Top-level initialisation code.
+ *
+ * Copyright (C) 1999-2005 Mikael Pettersson
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "cpumask.h"
+#include "virtual.h"
+#include "version.h"
+
+struct perfctr_info perfctr_info;
+
+static ssize_t
+driver_version_show(struct class *class, char *buf)
+{
+    return sprintf(buf, "%s\n", VERSION);
+}
+
+static ssize_t
+cpu_features_show(struct class *class, char *buf)
+{
+    return sprintf(buf, "%#x\n", perfctr_info.cpu_features);
+}
+
+static ssize_t
+cpu_khz_show(struct class *class, char *buf)
+{
+    return sprintf(buf, "%u\n", perfctr_info.cpu_khz);
+}
+
+static ssize_t
+tsc_to_cpu_mult_show(struct class *class, char *buf)
+{
+    return sprintf(buf, "%u\n", perfctr_info.tsc_to_cpu_mult);
+}
+
+static ssize_t
+state_user_offset_show(struct class *class, char *buf)
+{
+    return sprintf(buf, "%u\n", (unsigned int)offsetof(struct perfctr_cpu_state, user));
+}
+
+static ssize_t
+cpus_online_show(struct class *class, char *buf)
+{
+    int ret = cpumask_scnprintf(buf, PAGE_SIZE-1, cpu_online_map);
+    buf[ret++] = '\n';
+    return ret;
+}
+
+static ssize_t
+cpus_forbidden_show(struct class *class, char *buf)
+{
+    int ret = cpumask_scnprintf(buf, PAGE_SIZE-1, perfctr_cpus_forbidden_mask);
+    buf[ret++] = '\n';
+    return ret;
+}
+
+static struct class_attribute perfctr_class_attrs[] = {
+    __ATTR_RO(driver_version),
+    __ATTR_RO(cpu_features),
+    __ATTR_RO(cpu_khz),
+    __ATTR_RO(tsc_to_cpu_mult),
+    __ATTR_RO(state_user_offset),
+    __ATTR_RO(cpus_online),
+    __ATTR_RO(cpus_forbidden),
+    __ATTR_NULL
+};
+
+static struct class perfctr_class = {
+    .name        = "perfctr",
+    .class_attrs = perfctr_class_attrs,
+};
+
+char *perfctr_cpu_name __initdata;
+
+static int __init perfctr_init(void)
+{
+    int err;
+
+    err = perfctr_cpu_init();
+    if (err) {
+        printk(KERN_INFO "perfctr: not supported by this processor\n");
+        return err;
+    }
+    err = vperfctr_init();
+    if (err)
+        return err;
+    err = class_register(&perfctr_class);
+    if (err) {
+        printk(KERN_ERR "perfctr: class initialisation failed\n");
+        return err;
+    }
+    printk(KERN_INFO "perfctr: driver %s, cpu type %s at %u kHz\n",
+           VERSION,
+           perfctr_cpu_name,
+           perfctr_info.cpu_khz);
+    return 0;
+}
+
+static void __exit perfctr_exit(void)
+{
+    vperfctr_exit();
+    perfctr_cpu_exit();
+}
+
+module_init(perfctr_init)
+module_exit(perfctr_exit)
diff -puN /dev/null drivers/perfctr/Kconfig
--- /dev/null	2003-09-15 06:40:47.000000000 -0700
+++ devel-akpm/drivers/perfctr/Kconfig	2005-07-08 23:11:41.000000000 -0700
@@ -0,0 +1,64 @@
+# $Id: Kconfig,v 1.10 2004/05/24 11:00:55 mikpe Exp $
+# Performance-monitoring counters driver configuration
+#
+
+menu "Performance-monitoring counters support"
+
+config PERFCTR
+	bool "Performance monitoring counters support"
+	help
+	  This driver provides access to the performance-monitoring counter
+	  registers available in some (but not all) modern processors.
+	  These special-purpose registers can be programmed to count
+	  low-level performance-related events which occur during program
+	  execution, such as cache misses, pipeline stalls, etc.
+
+	  You can safely say Y here, even if you intend to run the kernel
+	  on a processor without performance-monitoring counters.
+
+	  At http://www.csd.uu.se/~mikpe/linux/perfctr/ you can find the
+	  corresponding user-space components, as well as other versions
+	  of this package. A mailing list is also available.
+
+config PERFCTR_INIT_TESTS
+	bool "Init-time hardware tests"
+	depends on PERFCTR
+	default n
+	help
+	  This option makes the driver perform additional hardware tests
+	  during initialisation, and log their results in the kernel's
+	  message buffer. For most supported processors, these tests simply
+	  measure the runtime overheads of performance counter operations.
+
+	  If you have a less well-known processor (one not listed in the
+	  etc/costs/ directory in the user-space package), you should enable
+	  this option and email the results to the perfctr developers.
+
+	  If unsure, say N.
+
+config PERFCTR_VIRTUAL
+	bool "Virtual performance counters support"
+	depends on PERFCTR
+	default y
+	help
+	  The processor's performance-monitoring counters are special-purpose
+	  global registers. This option adds support for virtual per-process
+	  performance-monitoring counters which only run when the process
+	  to which they belong is executing. This improves the accuracy of
+	  performance measurements by reducing "noise" from other processes.
+
+	  Say Y.
+
+config PERFCTR_INTERRUPT_SUPPORT
+	prompt "Performance counter overflow interrupt support" if PPC
+	bool
+	depends on PERFCTR
+	default y if X86_LOCAL_APIC
+
+config PERFCTR_CPUS_FORBIDDEN_MASK
+	bool
+	depends on PERFCTR
+	default y if X86 && SMP
+
+endmenu
diff -puN /dev/null drivers/perfctr/Makefile
--- /dev/null	2003-09-15 06:40:47.000000000 -0700
+++ devel-akpm/drivers/perfctr/Makefile	2005-07-08 23:11:41.000000000 -0700
@@ -0,0 +1,19 @@
+# $Id: Makefile,v 1.27 2005/03/23 01:29:34 mikpe Exp $
+# Makefile for the Performance-monitoring counters driver.
+
+# This also covers x86_64.
+perfctr-objs-$(CONFIG_X86) := x86.o
+tests-objs-$(CONFIG_X86) := x86_tests.o
+
+perfctr-objs-$(CONFIG_PPC32) := ppc.o
+tests-objs-$(CONFIG_PPC32) := ppc_tests.o
+
+perfctr-objs-$(CONFIG_PPC64) := ppc64.o
+tests-objs-$(CONFIG_PPC64) := ppc64_tests.o
+
+perfctr-objs-y += init.o
+perfctr-objs-$(CONFIG_PERFCTR_INIT_TESTS) += $(tests-objs-y)
+perfctr-objs-$(CONFIG_PERFCTR_VIRTUAL) += virtual.o
+
+perfctr-objs := $(perfctr-objs-y)
+obj-$(CONFIG_PERFCTR) := perfctr.o
diff -puN /dev/null drivers/perfctr/version.h
--- /dev/null	2003-09-15 06:40:47.000000000 -0700
+++ devel-akpm/drivers/perfctr/version.h	2005-07-08 23:11:41.000000000 -0700
@@ -0,0 +1 @@
+#define VERSION "2.7.17"
diff -puN /dev/null include/linux/perfctr.h
--- /dev/null	2003-09-15 06:40:47.000000000 -0700
+++ devel-akpm/include/linux/perfctr.h	2005-07-08 23:11:41.000000000 -0700
@@ -0,0 +1,176 @@
+/* $Id: perfctr.h,v 1.91 2005/03/18 00:10:53 mikpe Exp $
+ * Performance-Monitoring Counters driver
+ *
+ * Copyright (C) 1999-2005 Mikael Pettersson
+ */
+#ifndef _LINUX_PERFCTR_H
+#define _LINUX_PERFCTR_H
+
+#ifdef CONFIG_PERFCTR	/* don't break archs without <asm/perfctr.h> */
+
+#include <asm/perfctr.h>
+
+/* cpu_features flag bits */
+#define PERFCTR_FEATURE_RDPMC	0x01
+#define PERFCTR_FEATURE_RDTSC	0x02
+#define PERFCTR_FEATURE_PCINT	0x04
+
+/* virtual perfctr control object */
+struct vperfctr_control {
+	__s32 si_signo;
+	__u32 preserve;
+};
+
+/* commands for sys_vperfctr_control() */
+#define VPERFCTR_CONTROL_UNLINK		0x01
+#define VPERFCTR_CONTROL_SUSPEND	0x02
+#define VPERFCTR_CONTROL_RESUME		0x03
+#define VPERFCTR_CONTROL_CLEAR		0x04
+
+/* common description of an arch-specific control register */
+struct perfctr_cpu_reg {
+	__u64 nr;
+	__u64 value;
+};
+
+/* state and control domain numbers
+   0-127 are for architecture-neutral domains
+   128-255 are for architecture-specific domains */
+#define VPERFCTR_DOMAIN_SUM		1	/* struct perfctr_sum_ctrs */
+#define VPERFCTR_DOMAIN_CONTROL		2	/* struct vperfctr_control */
+#define VPERFCTR_DOMAIN_CHILDREN	3	/* struct perfctr_sum_ctrs */
+
+/* domain numbers for common arch-specific control data */
+#define PERFCTR_DOMAIN_CPU_CONTROL	128	/* struct perfctr_cpu_control_header */
+#define PERFCTR_DOMAIN_CPU_MAP		129	/* __u32[] */
+#define PERFCTR_DOMAIN_CPU_REGS		130	/* struct perfctr_cpu_reg[] */
+
+#endif	/* CONFIG_PERFCTR */
+
+#ifdef __KERNEL__
+
+/*
+ * The perfctr system calls.
+ */
+asmlinkage long sys_vperfctr_open(int tid, int creat);
+asmlinkage long sys_vperfctr_control(int fd, unsigned int cmd);
+asmlinkage long sys_vperfctr_write(int fd, unsigned int domain,
+				   const void __user *argp,
+				   unsigned int argbytes);
+asmlinkage long sys_vperfctr_read(int fd, unsigned int domain,
+				  void __user *argp,
+				  unsigned int argbytes);
+
+struct perfctr_info {
+	unsigned int cpu_features;
+	unsigned int cpu_khz;
+	unsigned int tsc_to_cpu_mult;
+};
+
+extern struct perfctr_info perfctr_info;
+
+#ifdef CONFIG_PERFCTR_VIRTUAL
+
+/*
+ * Virtual per-process performance-monitoring counters.
+ */
+struct vperfctr;	/* opaque */
+
+/* process management operations */
+extern void __vperfctr_copy(struct task_struct*, struct pt_regs*);
+extern void __vperfctr_release(struct task_struct*);
+extern void __vperfctr_exit(struct vperfctr*);
+extern void __vperfctr_suspend(struct vperfctr*);
+extern void __vperfctr_resume(struct vperfctr*);
+extern void __vperfctr_sample(struct vperfctr*);
+extern void __vperfctr_set_cpus_allowed(struct task_struct*, struct vperfctr*, cpumask_t);
+
+static inline void perfctr_copy_task(struct task_struct *tsk, struct pt_regs *regs)
+{
+	if (tsk->thread.perfctr)
+		__vperfctr_copy(tsk, regs);
+}
+
+static inline void perfctr_release_task(struct task_struct *tsk)
+{
+	if (tsk->thread.perfctr)
+		__vperfctr_release(tsk);
+}
+
+static inline void perfctr_exit_thread(struct thread_struct *thread)
+{
+	struct vperfctr *perfctr;
+	perfctr = thread->perfctr;
+	if (perfctr)
+		__vperfctr_exit(perfctr);
+}
+
+static inline void perfctr_suspend_thread(struct thread_struct *prev)
+{
+	struct vperfctr *perfctr;
+	perfctr = prev->perfctr;
+	if (perfctr)
+		__vperfctr_suspend(perfctr);
+}
+
+static inline void perfctr_resume_thread(struct thread_struct *next)
+{
+	struct vperfctr *perfctr;
+	perfctr = next->perfctr;
+	if (perfctr)
+		__vperfctr_resume(perfctr);
+}
+
+static inline void perfctr_sample_thread(struct thread_struct *thread)
+{
+	struct vperfctr *perfctr;
+	perfctr = thread->perfctr;
+	if (perfctr)
+		__vperfctr_sample(perfctr);
+}
+
+static inline void perfctr_set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
+{
+#ifdef CONFIG_PERFCTR_CPUS_FORBIDDEN_MASK
+	struct vperfctr *perfctr;
+
+	task_lock(p);
+	perfctr = p->thread.perfctr;
+	if (perfctr)
+		__vperfctr_set_cpus_allowed(p, perfctr, new_mask);
+	task_unlock(p);
+#endif
+}
+
+#else	/* !CONFIG_PERFCTR_VIRTUAL */
+
+static inline void perfctr_copy_task(struct task_struct *p, struct pt_regs *r) { }
+static inline void perfctr_release_task(struct task_struct *p) { }
+static inline void perfctr_exit_thread(struct thread_struct *t) { }
+static inline void perfctr_suspend_thread(struct thread_struct *t) { }
+static inline void perfctr_resume_thread(struct thread_struct *t) { }
+static inline void perfctr_sample_thread(struct thread_struct *t) { }
+static inline void perfctr_set_cpus_allowed(struct task_struct *p, cpumask_t m) { }
+
+#endif	/* CONFIG_PERFCTR_VIRTUAL */
+
+/* These routines are identical to write_seqcount_begin() and
+ * write_seqcount_end(), except they take an explicit __u32 rather
+ * than a seqcount_t. That's because this sequence lock is used from
+ * userspace, so we have to pin down the counter's type explicitly to
+ * have a clear ABI. They also omit the SMP write barriers since we
+ * only support mmap() based sampling for self-monitoring tasks.
+ */
+static inline void write_perfseq_begin(__u32 *seq)
+{
+	++*seq;
+}
+
+static inline void write_perfseq_end(__u32 *seq)
+{
+	++*seq;
+}
+
+#endif	/* __KERNEL__ */
+
+#endif	/* _LINUX_PERFCTR_H */
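How the low-level drivers use these helpers, per the seqlock changelog
entries earlier in this patch: every update of the mmap:ed state is
bracketed so user-space can detect torn reads. A sketch with
illustrative field names:

    void perfctr_cpu_suspend_sketch(struct perfctr_cpu_state *state)
    {
        write_perfseq_begin(&state->user.seq);  /* sequence goes odd */
        /* ... sample the hardware counters into state->user ... */
        write_perfseq_end(&state->user.seq);    /* sequence goes even */
    }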
diff -puN kernel/sched.c~perfctr kernel/sched.c
--- devel/kernel/sched.c~perfctr	2005-07-08 23:11:41.000000000 -0700
+++ devel-akpm/kernel/sched.c	2005-07-08 23:11:41.000000000 -0700
@@ -42,6 +42,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -4261,6 +4262,8 @@ int set_cpus_allowed(task_t *p, cpumask_
 	migration_req_t req;
 	runqueue_t *rq;
 
+	perfctr_set_cpus_allowed(p, new_mask);
+
 	rq = task_rq_lock(p, &flags);
 	if (!cpus_intersects(new_mask, cpu_online_map)) {
 		ret = -EINVAL;
diff -puN kernel/sys.c~perfctr kernel/sys.c
diff -puN kernel/timer.c~perfctr kernel/timer.c
--- devel/kernel/timer.c~perfctr	2005-07-08 23:11:41.000000000 -0700
+++ devel-akpm/kernel/timer.c	2005-07-08 23:11:41.000000000 -0700
@@ -32,6 +32,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -846,6 +847,7 @@ void update_process_times(int user_tick)
 		account_user_time(p, jiffies_to_cputime(1));
 	else
 		account_system_time(p, HARDIRQ_OFFSET, jiffies_to_cputime(1));
+	perfctr_sample_thread(&p->thread);
 	run_local_timers();
 	if (rcu_pending(cpu))
 		rcu_check_callbacks(cpu, user_tick);
diff -puN MAINTAINERS~perfctr MAINTAINERS
--- devel/MAINTAINERS~perfctr	2005-07-08 23:11:41.000000000 -0700
+++ devel-akpm/MAINTAINERS	2005-07-08 23:11:41.000000000 -0700
@@ -1849,6 +1849,12 @@ M: george@mvista.com
 L: netdev@vger.kernel.org
 S: Supported
 
+PERFORMANCE-MONITORING COUNTERS DRIVER
+P: Mikael Pettersson
+M: mikpe@csd.uu.se
+W: http://www.csd.uu.se/~mikpe/linux/perfctr/
+S: Maintained
+
 PNP SUPPORT
 P: Adam Belay
 M: ambx1@neo.rr.com
diff -puN kernel/sys_ni.c~perfctr kernel/sys_ni.c
--- devel/kernel/sys_ni.c~perfctr	2005-07-08 23:11:41.000000000 -0700
+++ devel-akpm/kernel/sys_ni.c	2005-07-08 23:11:41.000000000 -0700
@@ -68,6 +68,10 @@ cond_syscall(compat_sys_mq_timedsend);
 cond_syscall(compat_sys_mq_timedreceive);
 cond_syscall(compat_sys_mq_notify);
 cond_syscall(compat_sys_mq_getsetattr);
+cond_syscall(sys_vperfctr_open);
+cond_syscall(sys_vperfctr_control);
+cond_syscall(sys_vperfctr_write);
+cond_syscall(sys_vperfctr_read);
 cond_syscall(sys_mbind);
 cond_syscall(sys_get_mempolicy);
 cond_syscall(sys_set_mempolicy);
diff -puN arch/i386/Kconfig~perfctr arch/i386/Kconfig
--- devel/arch/i386/Kconfig~perfctr	2005-07-08 23:11:41.000000000 -0700
+++ devel-akpm/arch/i386/Kconfig	2005-07-08 23:11:41.000000000 -0700
@@ -1145,6 +1145,8 @@ config APM_REAL_MODE_POWER_OFF
 	  a work-around for a number of buggy BIOSes. Switch this option on if
 	  your computer crashes instead of powering off properly.
 
+source "drivers/perfctr/Kconfig"
+
 endmenu
 
 source "arch/i386/kernel/cpu/cpufreq/Kconfig"
diff -puN arch/i386/kernel/entry.S~perfctr arch/i386/kernel/entry.S
--- devel/arch/i386/kernel/entry.S~perfctr	2005-07-08 23:11:41.000000000 -0700
+++ devel-akpm/arch/i386/kernel/entry.S	2005-07-08 23:11:41.000000000 -0700
@@ -445,6 +445,16 @@ ENTRY(name) \
 /* The include is where all of the SMP etc. interrupts come from */
 #include "entry_arch.h"
 
+#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_PERFCTR)
+ENTRY(perfctr_interrupt)
+	pushl $LOCAL_PERFCTR_VECTOR-256
+	SAVE_ALL
+	pushl %esp
+	call smp_perfctr_interrupt
+	addl $4, %esp
+	jmp ret_from_intr
+#endif
+
 ENTRY(divide_error)
 	pushl $0		# no error code
 	pushl $do_divide_error
diff -puN arch/i386/kernel/i8259.c~perfctr arch/i386/kernel/i8259.c
--- devel/arch/i386/kernel/i8259.c~perfctr	2005-07-08 23:11:41.000000000 -0700
+++ devel-akpm/arch/i386/kernel/i8259.c	2005-07-08 23:11:41.000000000 -0700
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include
 
 #include
 
@@ -424,6 +425,8 @@ void __init init_IRQ(void)
 	 */
 	intr_init_hook();
 
+	perfctr_vector_init();
+
 	/*
 	 * Set the clock to HZ Hz, we already have a valid
 	 * vector now:
diff -puN arch/i386/kernel/process.c~perfctr arch/i386/kernel/process.c
--- devel/arch/i386/kernel/process.c~perfctr	2005-07-08 23:11:41.000000000 -0700
+++ devel-akpm/arch/i386/kernel/process.c	2005-07-08 23:11:41.000000000 -0700
@@ -33,6 +33,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -399,6 +400,7 @@ void exit_thread(void)
 		tss->io_bitmap_base = INVALID_IO_BITMAP_OFFSET;
 		put_cpu();
 	}
+	perfctr_exit_thread(&tsk->thread);
 }
 
 void flush_thread(void)
@@ -478,6 +480,8 @@ int copy_thread(int nr, unsigned long cl
 	savesegment(fs,p->thread.fs);
 	savesegment(gs,p->thread.gs);
 
+	perfctr_copy_task(p, regs);
+
 	tsk = current;
 	if (unlikely(NULL != tsk->thread.io_bitmap_ptr)) {
 		p->thread.io_bitmap_ptr = kmalloc(IO_BITMAP_BYTES, GFP_KERNEL);
@@ -724,6 +728,7 @@ struct task_struct fastcall * __switch_t
 
 	disable_tsc(prev_p, next_p);
 
+	perfctr_resume_thread(next);
 	return prev_p;
 }
diff -puN /dev/null drivers/perfctr/x86.c
--- /dev/null	2003-09-15 06:40:47.000000000 -0700
+++ devel-akpm/drivers/perfctr/x86.c	2005-07-08 23:11:41.000000000 -0700
@@ -0,0 +1,1800 @@
+/* $Id: x86.c,v 1.158 2005/04/08 14:36:49 mikpe Exp $
+ * x86/x86_64 performance-monitoring counters driver.
+ *
+ * Copyright (C) 1999-2005 Mikael Pettersson
+ */
+#include
+#include
+#include
+#include
+#include
+
+#include
+#undef MSR_P6_PERFCTR0
+#undef MSR_IA32_MISC_ENABLE
+#include
+#include
+struct hw_interrupt_type;
+#include
+#include	/* cpu_khz */
+
+#include "cpumask.h"
+#include "x86_tests.h"
+
+/* Support for lazy evntsel and perfctr MSR updates. */
+struct per_cpu_cache {	/* roughly a subset of perfctr_cpu_state */
+	unsigned int id;	/* cache owner id */
+#ifdef CONFIG_PERFCTR_INTERRUPT_SUPPORT
+	unsigned int interrupts_masked;
+#endif
+	struct {
+		/* NOTE: these caches have physical indices, not virtual */
+		unsigned int evntsel[18];
+		unsigned int escr[0x3E2-0x3A0];
+		unsigned int pebs_enable;
+		unsigned int pebs_matrix_vert;
+	} control;
+};
+static DEFINE_PER_CPU(struct per_cpu_cache, per_cpu_cache);
+#define __get_cpu_cache(cpu) (&per_cpu(per_cpu_cache, cpu))
+#define get_cpu_cache() (&__get_cpu_var(per_cpu_cache))
+
+/* Structure for counter snapshots, as 32-bit values. */
+struct perfctr_low_ctrs {
+	unsigned int tsc;
+	unsigned int pmc[18];
+};
+
+/* Intel P5, Cyrix 6x86MX/MII/III, Centaur WinChip C6/2/3 */
+#define MSR_P5_CESR		0x11
+#define MSR_P5_CTR0		0x12	/* .. 0x13 */
+#define P5_CESR_CPL		0x00C0
+#define P5_CESR_RESERVED	(~0x01FF)
+#define MII_CESR_RESERVED	(~0x05FF)
+#define C6_CESR_RESERVED	(~0x00FF)
+
+/* Intel P6, VIA C3 */
+#define MSR_P6_PERFCTR0		0xC1	/* .. 0xC2 */
+#define MSR_P6_EVNTSEL0		0x186	/* .. 0x187 */
+#define P6_EVNTSEL_ENABLE	0x00400000
+#define P6_EVNTSEL_INT		0x00100000
+#define P6_EVNTSEL_CPL		0x00030000
+#define P6_EVNTSEL_RESERVED	0x00280000
+#define VC3_EVNTSEL1_RESERVED	(~0x1FF)
+
+/* AMD K7 */
+#define MSR_K7_EVNTSEL0		0xC0010000	/* .. 0xC0010003 */
+#define MSR_K7_PERFCTR0		0xC0010004	/* .. 0xC0010007 */
+
+/* AMD K8 */
+#define IS_K8_NB_EVENT(EVNTSEL)	((((EVNTSEL) >> 5) & 0x7) == 0x7)
+
+/* Intel P4, Intel Pentium M */
+#define MSR_IA32_MISC_ENABLE	0x1A0
+#define MSR_IA32_MISC_ENABLE_PERF_AVAIL (1<<7)	/* read-only status bit */
+#define MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL (1<<12)	/* read-only status bit */
+
+/* Intel P4 */
+#define MSR_P4_PERFCTR0		0x300	/* .. 0x311 */
+#define MSR_P4_CCCR0		0x360	/* .. 0x371 */
+#define MSR_P4_ESCR0		0x3A0	/* .. 0x3E1, with some gaps */
+
+#define MSR_P4_PEBS_ENABLE	0x3F1
+#define P4_PE_REPLAY_TAG_BITS	0x00000607
+#define P4_PE_UOP_TAG		0x01000000
+#define P4_PE_RESERVED		0xFEFFF9F8	/* only allow ReplayTagging */
+
+#define MSR_P4_PEBS_MATRIX_VERT	0x3F2
+#define P4_PMV_REPLAY_TAG_BITS	0x00000003
+#define P4_PMV_RESERVED		0xFFFFFFFC
+
+#define P4_CCCR_OVF		0x80000000
+#define P4_CCCR_CASCADE		0x40000000
+#define P4_CCCR_OVF_PMI_T1	0x08000000
+#define P4_CCCR_OVF_PMI_T0	0x04000000
+#define P4_CCCR_FORCE_OVF	0x02000000
+#define P4_CCCR_ACTIVE_THREAD	0x00030000
+#define P4_CCCR_ENABLE		0x00001000
+#define P4_CCCR_ESCR_SELECT(X)	(((X) >> 13) & 0x7)
+#define P4_CCCR_EXTENDED_CASCADE	0x00000800
+#define P4_CCCR_RESERVED	(0x300007FF|P4_CCCR_OVF|P4_CCCR_OVF_PMI_T1)
+
+#define P4_ESCR_CPL_T1		0x00000003
+#define P4_ESCR_CPL_T0		0x0000000C
+#define P4_ESCR_TAG_ENABLE	0x00000010
+#define P4_ESCR_RESERVED	(0x80000000)
+
+#define P4_FAST_RDPMC		0x80000000
+#define P4_MASK_FAST_RDPMC	0x0000001F	/* we only need low 5 bits */
+
+/* missing from <asm/cpufeature.h> */
+#define cpu_has_msr	boot_cpu_has(X86_FEATURE_MSR)
+
+#define rdmsr_low(msr,low) \
+	__asm__ __volatile__("rdmsr" : "=a"(low) : "c"(msr) : "edx")
+#define rdpmc_low(ctr,low) \
+	__asm__ __volatile__("rdpmc" : "=a"(low) : "c"(ctr) : "edx")
+
+static void clear_msr_range(unsigned int base, unsigned int n)
+{
+	unsigned int i;
+
+	for(i = 0; i < n; ++i)
+		wrmsr(base+i, 0, 0);
+}
+
+static inline void set_in_cr4_local(unsigned int mask)
+{
+	write_cr4(read_cr4() | mask);
+}
+
+static inline void clear_in_cr4_local(unsigned int mask)
+{
+	write_cr4(read_cr4() & ~mask);
+}
+
+static unsigned int new_id(void)
+{
+	static DEFINE_SPINLOCK(lock);
+	static unsigned int counter;
+	int id;
+
+	spin_lock(&lock);
+	id = ++counter;
+	spin_unlock(&lock);
+	return id;
+}
+
+#ifdef CONFIG_X86_LOCAL_APIC
+static void perfctr_default_ihandler(unsigned long pc)
+{
+}
+
+static perfctr_ihandler_t perfctr_ihandler = perfctr_default_ihandler;
+
+asmlinkage void smp_perfctr_interrupt(struct pt_regs *regs)
+{
+	/* PREEMPT note: invoked via an interrupt gate, which
+	   masks interrupts. We're still on the originating CPU. */
+	/* XXX: recursive interrupts? delay the ACK, mask LVTPC, or queue? */
+	ack_APIC_irq();
+	if (get_cpu_cache()->interrupts_masked)
+		return;
+	irq_enter();
+	(*perfctr_ihandler)(instruction_pointer(regs));
+	irq_exit();
+}
+
+void perfctr_cpu_set_ihandler(perfctr_ihandler_t ihandler)
+{
+	perfctr_ihandler = ihandler ? ihandler : perfctr_default_ihandler;
+}
ihandler : perfctr_default_ihandler; +} + +static inline void perfctr_cpu_mask_interrupts(struct per_cpu_cache *cache) +{ + cache->interrupts_masked = 1; +} + +static inline void perfctr_cpu_unmask_interrupts(struct per_cpu_cache *cache) +{ + cache->interrupts_masked = 0; +} + +#else +#define perfctr_cstatus_has_ictrs(cstatus) 0 +#undef cpu_has_apic +#define cpu_has_apic 0 +#undef apic_write +#define apic_write(reg,vector) do{}while(0) +#endif + +#if defined(CONFIG_SMP) + +static inline void +set_isuspend_cpu(struct perfctr_cpu_state *state, int cpu) +{ + state->isuspend_cpu = cpu; +} + +static inline int +is_isuspend_cpu(const struct perfctr_cpu_state *state, int cpu) +{ + return state->isuspend_cpu == cpu; +} + +static inline void clear_isuspend_cpu(struct perfctr_cpu_state *state) +{ + state->isuspend_cpu = NR_CPUS; +} + +#else +static inline void set_isuspend_cpu(struct perfctr_cpu_state *state, int cpu) { } +static inline int is_isuspend_cpu(const struct perfctr_cpu_state *state, int cpu) { return 1; } +static inline void clear_isuspend_cpu(struct perfctr_cpu_state *state) { } +#endif + +/**************************************************************** + * * + * Driver procedures. * + * * + ****************************************************************/ + +/* + * Intel P5 family (Pentium, family code 5). + * - One TSC and two 40-bit PMCs. + * - A single 32-bit CESR (MSR 0x11) controls both PMCs. + * CESR has two halves, each controlling one PMC. + * - Overflow interrupts are not available. + * - Pentium MMX added the RDPMC instruction. RDPMC has lower + * overhead than RDMSR and it can be used in user-mode code. + * - The MMX events are not symmetric: some events are only available + * for some PMC, and some event codes denote different events + * depending on which PMCs they control. + */ + +/* shared with MII and C6 */ +static int p5_like_check_control(struct perfctr_cpu_state *state, + unsigned int reserved_bits, int is_c6) +{ + unsigned short cesr_half[2]; + unsigned int pmc, evntsel, i; + + if (state->control.header.nrictrs != 0 || state->control.header.nractrs > 2) + return -EINVAL; + cesr_half[0] = 0; + cesr_half[1] = 0; + for(i = 0; i < state->control.header.nractrs; ++i) { + pmc = state->control.pmc_map[i]; + if (pmc > 1 || cesr_half[pmc] != 0) + return -EINVAL; + evntsel = state->control.evntsel[0]; + if (pmc == 0) + evntsel &= 0xffff; + else + evntsel >>= 16; + /* protect reserved bits */ + if ((evntsel & reserved_bits) != 0) + return -EPERM; + /* the CPL field (if defined) must be non-zero */ + if (!is_c6 && !(evntsel & P5_CESR_CPL)) + return -EINVAL; + cesr_half[pmc] = evntsel; + } + state->id = (cesr_half[1] << 16) | cesr_half[0]; + return 0; +} + +static int p5_check_control(struct perfctr_cpu_state *state, int is_global) +{ + return p5_like_check_control(state, P5_CESR_RESERVED, 0); +} + +/* shared with MII but not C6 */ +static void p5_write_control(const struct perfctr_cpu_state *state) +{ + struct per_cpu_cache *cache; + unsigned int cesr; + + cesr = state->id; + if (!cesr) /* no PMC is on (this test doesn't work on C6) */ + return; + cache = get_cpu_cache(); + if (cache->id != cesr) { + cache->id = cesr; + wrmsr(MSR_P5_CESR, cesr, 0); + } +} + +static void p5_read_counters(const struct perfctr_cpu_state *state, + struct perfctr_low_ctrs *ctrs) +{ + unsigned int cstatus, nrctrs, i; + + /* The P5 doesn't allocate a cache line on a write miss, so do + a dummy read to avoid a write miss here _and_ a read miss + later in our caller. 
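+ * (The empty asm below is only a compiler-level hint: it forces GCC
+ * to materialise ctrs->tsc in a register, performing the read that
+ * fills the cache line, while emitting no instructions of its own.)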
*/ + asm("" : : "r"(ctrs->tsc)); + + cstatus = state->user.cstatus; + if (perfctr_cstatus_has_tsc(cstatus)) + rdtscl(ctrs->tsc); + nrctrs = perfctr_cstatus_nractrs(cstatus); + for(i = 0; i < nrctrs; ++i) { + unsigned int pmc = state->control.pmc_map[i]; + rdmsr_low(MSR_P5_CTR0+pmc, ctrs->pmc[i]); + } +} + +/* used by all except pre-MMX P5 */ +static void rdpmc_read_counters(const struct perfctr_cpu_state *state, + struct perfctr_low_ctrs *ctrs) +{ + unsigned int cstatus, nrctrs, i; + + cstatus = state->user.cstatus; + if (perfctr_cstatus_has_tsc(cstatus)) + rdtscl(ctrs->tsc); + nrctrs = perfctr_cstatus_nractrs(cstatus); + for(i = 0; i < nrctrs; ++i) { + unsigned int pmc = state->control.pmc_map[i]; + rdpmc_low(pmc, ctrs->pmc[i]); + } +} + +/* shared with MII and C6 */ +static void p5_clear_counters(void) +{ + clear_msr_range(MSR_P5_CESR, 1+2); +} + +/* + * Cyrix 6x86/MII/III. + * - Same MSR assignments as P5 MMX. Has RDPMC and two 48-bit PMCs. + * - Event codes and CESR formatting as in the plain P5 subset. + * - Many but not all P5 MMX event codes are implemented. + * - Cyrix adds a few more event codes. The event code is widened + * to 7 bits, and Cyrix puts the high bit in CESR bit 10 + * (and CESR bit 26 for PMC1). + */ + +static int mii_check_control(struct perfctr_cpu_state *state, int is_global) +{ + return p5_like_check_control(state, MII_CESR_RESERVED, 0); +} + +/* + * Centaur WinChip C6/2/3. + * - Same MSR assignments as P5 MMX. Has RDPMC and two 40-bit PMCs. + * - CESR is formatted with two halves, like P5. However, there + * are no defined control fields for e.g. CPL selection, and + * there is no defined method for stopping the counters. + * - Only a few event codes are defined. + * - The 64-bit TSC is synthesised from the low 32 bits of the + * two PMCs, and CESR has to be set up appropriately. + * Reprogramming CESR causes RDTSC to yield invalid results. + * (The C6 may also hang in this case, due to C6 erratum I-13.) + * Therefore, using the PMCs on any of these processors requires + * that the TSC is not accessed at all: + * 1. The kernel must be configured for a TSC-less processor, i.e. + * generic 586 or less. + * 2. The "notsc" boot parameter must be passed to the kernel. + * 3. User-space libraries and code must also be configured and + * compiled for a generic 586 or less. + */ + +#if !defined(CONFIG_X86_TSC) +static int c6_check_control(struct perfctr_cpu_state *state, int is_global) +{ + if (state->control.header.tsc_on) + return -EINVAL; + return p5_like_check_control(state, C6_CESR_RESERVED, 1); +} + +static void c6_write_control(const struct perfctr_cpu_state *state) +{ + struct per_cpu_cache *cache; + unsigned int cesr; + + if (perfctr_cstatus_nractrs(state->user.cstatus) == 0) /* no PMC is on */ + return; + cache = get_cpu_cache(); + cesr = state->id; + if (cache->id != cesr) { + cache->id = cesr; + wrmsr(MSR_P5_CESR, cesr, 0); + } +} +#endif + +/* + * Intel P6 family (Pentium Pro, Pentium II, and Pentium III cores, + * and Xeon and Celeron versions of Pentium II and III cores). + * - One TSC and two 40-bit PMCs. + * - One 32-bit EVNTSEL MSR for each PMC. + * - EVNTSEL0 contains a global enable/disable bit. + * That bit is reserved in EVNTSEL1. + * - Each EVNTSEL contains a CPL field. + * - Overflow interrupts are possible, but require that the + * local APIC is available. Some Mobile P6s have no local APIC. + * - The PMCs cannot be initialised with arbitrary values, since + * wrmsr fills the high bits by sign-extending from bit 31.
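+ * (For example, wrmsr(MSR_P6_PERFCTR0, 0x80000000, 0) would leave
+ * the 40-bit counter at 0xFF80000000; this is why i-mode ireset
+ * values are required to be negative 32-bit numbers, see
+ * check_ireset() below.)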
+ * - Most events are symmetric, but a few are not. + */ + +static int k8_is_multicore; /* affects northbridge events */ + +/* shared with K7 */ +static int p6_like_check_control(struct perfctr_cpu_state *state, int is_k7, int is_global) +{ + unsigned int evntsel, i, nractrs, nrctrs, pmc_mask, pmc; + + nractrs = state->control.header.nractrs; + nrctrs = nractrs + state->control.header.nrictrs; + if (nrctrs < nractrs || nrctrs > (is_k7 ? 4 : 2)) + return -EINVAL; + + pmc_mask = 0; + for(i = 0; i < nrctrs; ++i) { + pmc = state->control.pmc_map[i]; + if (pmc >= (is_k7 ? 4 : 2) || (pmc_mask & (1<<pmc))) + return -EINVAL; + pmc_mask |= (1<<pmc); + evntsel = state->control.evntsel[pmc]; + /* prevent the K8 multicore NB event clobber erratum */ + if (!is_global && k8_is_multicore && IS_K8_NB_EVENT(evntsel)) + return -EPERM; + /* protect reserved bits */ + if (evntsel & P6_EVNTSEL_RESERVED) + return -EPERM; + /* check ENable bit */ + if (is_k7) { + /* ENable bit must be set in each evntsel */ + if (!(evntsel & P6_EVNTSEL_ENABLE)) + return -EINVAL; + } else { + /* only evntsel[0] has the ENable bit */ + if (evntsel & P6_EVNTSEL_ENABLE) { + if (pmc > 0) + return -EPERM; + } else { + if (pmc == 0) + return -EINVAL; + } + } + /* the CPL field must be non-zero */ + if (!(evntsel & P6_EVNTSEL_CPL)) + return -EINVAL; + /* INT bit must be off for a-mode and on for i-mode counters */ + if (evntsel & P6_EVNTSEL_INT) { + if (i < nractrs) + return -EINVAL; + } else { + if (i >= nractrs) + return -EINVAL; + } + } + state->id = new_id(); + return 0; +} + +static int p6_check_control(struct perfctr_cpu_state *state, int is_global) +{ + return p6_like_check_control(state, 0, is_global); +} + +#ifdef CONFIG_X86_LOCAL_APIC +/* PRE: perfctr_cstatus_has_ictrs(state->cstatus) != 0 */ +/* shared with K7 and P4 */ +static void p6_like_isuspend(struct perfctr_cpu_state *state, + unsigned int msr_evntsel0) +{ + struct per_cpu_cache *cache; + unsigned int cstatus, nrctrs, i; + int cpu; + unsigned int pending = 0; + + cpu = smp_processor_id(); + set_isuspend_cpu(state, cpu); /* early to limit cpu's live range */ + cache = __get_cpu_cache(cpu); + perfctr_cpu_mask_interrupts(cache); + cstatus = state->user.cstatus; + nrctrs = perfctr_cstatus_nrctrs(cstatus); + for(i = perfctr_cstatus_nractrs(cstatus); i < nrctrs; ++i) { + unsigned int pmc_raw, pmc_idx, now; + pmc_raw = state->control.pmc_map[i]; + /* Note: P4_MASK_FAST_RDPMC is a no-op for P6 and K7. + We don't need to make it into a parameter. */ + pmc_idx = pmc_raw & P4_MASK_FAST_RDPMC; + cache->control.evntsel[pmc_idx] = 0; + /* On P4 this intentionally also clears the CCCR.OVF flag. */ + wrmsr(msr_evntsel0+pmc_idx, 0, 0); + /* P4 erratum N17 does not apply since we read only low 32 bits.
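+ * An i-mode counter is started at a negative ireset value and
+ * counts towards zero, so its low 32 bits read back as a negative
+ * int until it has overflowed; the "(int)now >= 0" test below
+ * relies on this when it marks the counter as pending.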
*/ + rdpmc_low(pmc_raw, now); + state->user.pmc[i].sum += now - state->user.pmc[i].start; + state->user.pmc[i].start = now; + if ((int)now >= 0) + ++pending; + } + state->pending_interrupt = pending; + /* cache->id is still == state->id */ +} + +/* PRE: perfctr_cstatus_has_ictrs(state->cstatus) != 0 */ +/* shared with K7 and P4 */ +static void p6_like_iresume(const struct perfctr_cpu_state *state, + unsigned int msr_evntsel0, + unsigned int msr_perfctr0) +{ + struct per_cpu_cache *cache; + unsigned int cstatus, nrctrs, i; + int cpu; + + cpu = smp_processor_id(); + cache = __get_cpu_cache(cpu); + perfctr_cpu_unmask_interrupts(cache); + if (cache->id == state->id) { + cache->id = 0; /* force reload of cleared EVNTSELs */ + if (is_isuspend_cpu(state, cpu)) + return; /* skip reload of PERFCTRs */ + } + cstatus = state->user.cstatus; + nrctrs = perfctr_cstatus_nrctrs(cstatus); + for(i = perfctr_cstatus_nractrs(cstatus); i < nrctrs; ++i) { + /* Note: P4_MASK_FAST_RDPMC is a no-op for P6 and K7. + We don't need to make it into a parameter. */ + unsigned int pmc = state->control.pmc_map[i] & P4_MASK_FAST_RDPMC; + /* If the control wasn't ours we must disable the evntsels + before reinitialising the counters, to prevent unexpected + counter increments and missed overflow interrupts. */ + if (cache->control.evntsel[pmc]) { + cache->control.evntsel[pmc] = 0; + wrmsr(msr_evntsel0+pmc, 0, 0); + } + /* P4 erratum N15 does not apply since the CCCR is disabled. */ + wrmsr(msr_perfctr0+pmc, (unsigned int)state->user.pmc[i].start, -1); + } + /* cache->id remains != state->id */ +} + +static void p6_isuspend(struct perfctr_cpu_state *state) +{ + p6_like_isuspend(state, MSR_P6_EVNTSEL0); +} + +static void p6_iresume(const struct perfctr_cpu_state *state) +{ + p6_like_iresume(state, MSR_P6_EVNTSEL0, MSR_P6_PERFCTR0); +} +#endif /* CONFIG_X86_LOCAL_APIC */ + +/* shared with K7 and VC3 */ +static void p6_like_write_control(const struct perfctr_cpu_state *state, + unsigned int msr_evntsel0) +{ + struct per_cpu_cache *cache; + unsigned int nrctrs, i; + + cache = get_cpu_cache(); + if (cache->id == state->id) + return; + nrctrs = perfctr_cstatus_nrctrs(state->user.cstatus); + for(i = 0; i < nrctrs; ++i) { + unsigned int pmc = state->control.pmc_map[i]; + unsigned int evntsel = state->control.evntsel[pmc]; + if (evntsel != cache->control.evntsel[pmc]) { + cache->control.evntsel[pmc] = evntsel; + wrmsr(msr_evntsel0+pmc, evntsel, 0); + } + } + cache->id = state->id; +} + +/* shared with VC3, Generic*/ +static void p6_write_control(const struct perfctr_cpu_state *state) +{ + p6_like_write_control(state, MSR_P6_EVNTSEL0); +} + +static void p6_clear_counters(void) +{ + clear_msr_range(MSR_P6_EVNTSEL0, 2); + clear_msr_range(MSR_P6_PERFCTR0, 2); +} + +/* + * AMD K7 family (Athlon, Duron). + * - Somewhat similar to the Intel P6 family. + * - Four 48-bit PMCs. + * - Four 32-bit EVNTSEL MSRs with similar layout as in P6. + * - Completely different MSR assignments :-( + * - Fewer countable events defined :-( + * - The events appear to be completely symmetric. + * - The EVNTSEL MSRs are symmetric since each has its own enable bit. + * - Publicly available documentation is incomplete. + * - K7 model 1 does not have a local APIC. AMD Document #22007 + * Revision J hints that it may use debug interrupts instead. + * + * The K8 has the same hardware layout as the K7. It also has + * better documentation and a different set of available events. 
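+ * As an illustration (not a line from this driver), an a-mode
+ * K7/K8 evntsel could be 0xC0 | (3<<16) | (1<<22): event 0xC0
+ * (retired instructions), CPL field selecting user+kernel, and the
+ * per-counter ENable bit. The init-time tests use this same
+ * encoding as K7_EVNTSEL0_VAL in x86_tests.c.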
+ */ + +static int k7_check_control(struct perfctr_cpu_state *state, int is_global) +{ + return p6_like_check_control(state, 1, is_global); +} + +#ifdef CONFIG_X86_LOCAL_APIC +static void k7_isuspend(struct perfctr_cpu_state *state) +{ + p6_like_isuspend(state, MSR_K7_EVNTSEL0); +} + +static void k7_iresume(const struct perfctr_cpu_state *state) +{ + p6_like_iresume(state, MSR_K7_EVNTSEL0, MSR_K7_PERFCTR0); +} +#endif /* CONFIG_X86_LOCAL_APIC */ + +static void k7_write_control(const struct perfctr_cpu_state *state) +{ + p6_like_write_control(state, MSR_K7_EVNTSEL0); +} + +static void k7_clear_counters(void) +{ + clear_msr_range(MSR_K7_EVNTSEL0, 4+4); +} + +/* + * VIA C3 family. + * - A Centaur design somewhat similar to the P6/Celeron. + * - PERFCTR0 is an alias for the TSC, and EVNTSEL0 is read-only. + * - PERFCTR1 is 32 bits wide. + * - EVNTSEL1 has no defined control fields, and there is no + * defined method for stopping the counter. + * - According to testing, the reserved fields in EVNTSEL1 have + * no function. We always fill them with zeroes. + * - Only a few event codes are defined. + * - No local APIC or interrupt-mode support. + * - pmc_map[0] must be 1, if nractrs == 1. + */ +static int vc3_check_control(struct perfctr_cpu_state *state, int is_global) +{ + if (state->control.header.nrictrs || state->control.header.nractrs > 1) + return -EINVAL; + if (state->control.header.nractrs == 1) { + if (state->control.pmc_map[0] != 1) + return -EINVAL; + if (state->control.evntsel[1] & VC3_EVNTSEL1_RESERVED) + return -EPERM; + state->id = state->control.evntsel[1]; + } else + state->id = 0; + return 0; +} + +static void vc3_clear_counters(void) +{ + /* Not documented, but seems to be default after boot. */ + wrmsr(MSR_P6_EVNTSEL0+1, 0x00070079, 0); +} + +/* + * Intel Pentium 4. + * Current implementation restrictions: + * - No DS/PEBS support. + * + * Known quirks: + * - OVF_PMI+FORCE_OVF counters must have an ireset value of -1. + * This allows the regular overflow check to also handle FORCE_OVF + * counters. Not having this restriction would lead to MAJOR + * complications in the driver's "detect overflow counters" code. + * There is no loss of functionality since the ireset value doesn't + * affect the counter's PMI rate for FORCE_OVF counters. + * - In experiments with FORCE_OVF counters, and regular OVF_PMI + * counters with small ireset values between -8 and -1, it appears + * that the faulting instruction is subjected to a new PMI before + * it can complete, ad infinitum. This occurs even though the driver + * clears the CCCR (and in testing also the ESCR) and invokes a + * user-space signal handler before restoring the CCCR and resuming + * the instruction. + */ + +/* + * Table 15-4 in the IA32 Volume 3 manual contains an 18x8 entry mapping + * from counter/CCCR number (0-17) and ESCR SELECT value (0-7) to the + * actual ESCR MSR number. This mapping contains some repeated patterns, + * so we can compact it to a 4x8 table of MSR offsets: + * + * 1. CCCRs 16 and 17 are mapped just like CCCRs 13 and 14, respectively. + * Thus, we only consider the 16 CCCRs 0-15. + * 2. The CCCRs are organised in pairs, and both CCCRs in a pair use the + * same mapping. Thus, we only consider the 8 pairs 0-7. + * 3. In each pair of pairs, the second odd-numbered pair has the same domain + * as the first even-numbered pair, and the range is 1+ the range of + * the first even-numbered pair. For example, CCCR(0) and (1) map ESCR + * SELECT(7) to 0x3A0, and CCCR(2) and (3) map it to 0x3A1.
+ * The only exception is that pair (7) [CCCRs 14 and 15] does not have + * ESCR SELECT(3) in its domain, like pair (6) [CCCRs 12 and 13] has. + * NOTE: Revisions of IA32 Volume 3 older than #245472-007 had an error + * in this table: CCCRs 12, 13, and 16 had their mappings for ESCR SELECT + * values 2 and 3 swapped. + * 4. All MSR numbers are of the form 0x3??. Instead of storing these as + * 16-bit numbers, the table only stores the 8-bit offsets from 0x300. + */ + +static const unsigned char p4_cccr_escr_map[4][8] = { + /* 0x00 and 0x01 as is, 0x02 and 0x03 are +1 */ + [0x00/4] { [7] 0xA0, + [6] 0xA2, + [2] 0xAA, + [4] 0xAC, + [0] 0xB2, + [1] 0xB4, + [3] 0xB6, + [5] 0xC8, }, + /* 0x04 and 0x05 as is, 0x06 and 0x07 are +1 */ + [0x04/4] { [0] 0xC0, + [2] 0xC2, + [1] 0xC4, }, + /* 0x08 and 0x09 as is, 0x0A and 0x0B are +1 */ + [0x08/4] { [1] 0xA4, + [0] 0xA6, + [5] 0xA8, + [2] 0xAE, + [3] 0xB0, }, + /* 0x0C, 0x0D, and 0x10 as is, + 0x0E, 0x0F, and 0x11 are +1 except [3] is not in the domain */ + [0x0C/4] { [4] 0xB8, + [5] 0xCC, + [6] 0xE0, + [0] 0xBA, + [2] 0xBC, + [3] 0xBE, + [1] 0xCA, }, +}; + +static unsigned int p4_escr_addr(unsigned int pmc, unsigned int cccr_val) +{ + unsigned int escr_select, pair, escr_offset; + + escr_select = P4_CCCR_ESCR_SELECT(cccr_val); + if (pmc > 0x11) + return 0; /* pmc range error */ + if (pmc > 0x0F) + pmc -= 3; /* 0 <= pmc <= 0x0F */ + pair = pmc / 2; /* 0 <= pair <= 7 */ + escr_offset = p4_cccr_escr_map[pair / 2][escr_select]; + if (!escr_offset || (pair == 7 && escr_select == 3)) + return 0; /* ESCR SELECT range error */ + return escr_offset + (pair & 1) + 0x300; +} + +static int p4_IQ_ESCR_ok; /* only models <= 2 can use IQ_ESCR{0,1} */ +static int p4_is_ht; /* affects several CCCR & ESCR fields */ +static int p4_extended_cascade_ok; /* only models >= 2 can use extended cascading */ + +static int p4_check_control(struct perfctr_cpu_state *state, int is_global) +{ + unsigned int i, nractrs, nrctrs, pmc_mask; + + nractrs = state->control.header.nractrs; + nrctrs = nractrs + state->control.header.nrictrs; + if (nrctrs < nractrs || nrctrs > 18) + return -EINVAL; + + pmc_mask = 0; + for(i = 0; i < nrctrs; ++i) { + unsigned int pmc, cccr_val, escr_val, escr_addr; + /* check that pmc_map[] is well-defined; + pmc_map[i] is what we pass to RDPMC, the PMC itself + is extracted by masking off the FAST_RDPMC flag */ + pmc = state->control.pmc_map[i] & ~P4_FAST_RDPMC; + if (pmc >= 18 || (pmc_mask & (1<<pmc))) + return -EINVAL; + pmc_mask |= (1<<pmc); + cccr_val = state->control.evntsel[pmc]; + if (cccr_val & P4_CCCR_RESERVED) + return -EPERM; + if (cccr_val & P4_CCCR_EXTENDED_CASCADE) { + if (!p4_extended_cascade_ok) + return -EPERM; + if (!(pmc == 12 || pmc >= 15)) + return -EPERM; + } + if ((cccr_val & P4_CCCR_ACTIVE_THREAD) != P4_CCCR_ACTIVE_THREAD && !p4_is_ht) + return -EINVAL; + if (!(cccr_val & (P4_CCCR_ENABLE | P4_CCCR_CASCADE | P4_CCCR_EXTENDED_CASCADE))) + return -EINVAL; + if (cccr_val & P4_CCCR_OVF_PMI_T0) { + if (i < nractrs) + return -EINVAL; + if ((cccr_val & P4_CCCR_FORCE_OVF) && + state->control.ireset[pmc] != -1) + return -EINVAL; + } else { + if (i >= nractrs) + return -EINVAL; + } + /* compute and cache ESCR address */ + escr_addr = p4_escr_addr(pmc, cccr_val); + if (!escr_addr) + return -EINVAL; /* ESCR SELECT range error */ + /* IQ_ESCR0 and IQ_ESCR1 only exist in models <= 2 */ + if ((escr_addr & ~0x001) == 0x3BA && !p4_IQ_ESCR_ok) + return -EINVAL; + /* XXX: Two counters could map to the same ESCR. Should we + check that they use the same ESCR value?
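+ * As a worked example of p4_escr_addr() above: pmc 3 with ESCR
+ * SELECT 7 gives pair 1 and escr_offset p4_cccr_escr_map[0][7],
+ * i.e. 0xA0, so the ESCR address is 0x300 + 0xA0 + (1 & 1) = 0x3A1,
+ * matching the table description earlier.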
*/ + state->p4_escr_map[i] = escr_addr - MSR_P4_ESCR0; + /* check ESCR contents */ + escr_val = state->control.p4.escr[escr_addr - MSR_P4_ESCR0]; + if (escr_val & P4_ESCR_RESERVED) + return -EPERM; + if ((escr_val & P4_ESCR_CPL_T1) && (!p4_is_ht || !is_global)) + return -EINVAL; + } + /* check ReplayTagging control (PEBS_ENABLE and PEBS_MATRIX_VERT) */ + if (state->control.p4.pebs_enable) { + if (!nrctrs) + return -EPERM; + if (state->control.p4.pebs_enable & P4_PE_RESERVED) + return -EPERM; + if (!(state->control.p4.pebs_enable & P4_PE_UOP_TAG)) + return -EINVAL; + if (!(state->control.p4.pebs_enable & P4_PE_REPLAY_TAG_BITS)) + return -EINVAL; + if (state->control.p4.pebs_matrix_vert & P4_PMV_RESERVED) + return -EPERM; + if (!(state->control.p4.pebs_matrix_vert & P4_PMV_REPLAY_TAG_BITS)) + return -EINVAL; + } else if (state->control.p4.pebs_matrix_vert) + return -EPERM; + state->id = new_id(); + return 0; +} + +#ifdef CONFIG_X86_LOCAL_APIC +static void p4_isuspend(struct perfctr_cpu_state *state) +{ + return p6_like_isuspend(state, MSR_P4_CCCR0); +} + +static void p4_iresume(const struct perfctr_cpu_state *state) +{ + return p6_like_iresume(state, MSR_P4_CCCR0, MSR_P4_PERFCTR0); +} +#endif /* CONFIG_X86_LOCAL_APIC */ + +static void p4_write_control(const struct perfctr_cpu_state *state) +{ + struct per_cpu_cache *cache; + unsigned int nrctrs, i; + + cache = get_cpu_cache(); + if (cache->id == state->id) + return; + nrctrs = perfctr_cstatus_nrctrs(state->user.cstatus); + for(i = 0; i < nrctrs; ++i) { + unsigned int escr_val, escr_off, cccr_val, pmc; + escr_off = state->p4_escr_map[i]; + escr_val = state->control.p4.escr[escr_off]; + if (escr_val != cache->control.escr[escr_off]) { + cache->control.escr[escr_off] = escr_val; + wrmsr(MSR_P4_ESCR0+escr_off, escr_val, 0); + } + pmc = state->control.pmc_map[i] & P4_MASK_FAST_RDPMC; + cccr_val = state->control.evntsel[pmc]; + if (cccr_val != cache->control.evntsel[pmc]) { + cache->control.evntsel[pmc] = cccr_val; + wrmsr(MSR_P4_CCCR0+pmc, cccr_val, 0); + } + } + if (state->control.p4.pebs_enable != cache->control.pebs_enable) { + cache->control.pebs_enable = state->control.p4.pebs_enable; + wrmsr(MSR_P4_PEBS_ENABLE, state->control.p4.pebs_enable, 0); + } + if (state->control.p4.pebs_matrix_vert != cache->control.pebs_matrix_vert) { + cache->control.pebs_matrix_vert = state->control.p4.pebs_matrix_vert; + wrmsr(MSR_P4_PEBS_MATRIX_VERT, state->control.p4.pebs_matrix_vert, 0); + } + cache->id = state->id; +} + +static void p4_clear_counters(void) +{ + /* MSR 0x3F0 seems to have a default value of 0xFC00, but current + docs don't fully define it, so leave it alone for now. */ + /* clear PEBS_ENABLE and PEBS_MATRIX_VERT; they handle both PEBS + and ReplayTagging, and should exist even if PEBS is disabled */ + clear_msr_range(0x3F1, 2); + clear_msr_range(0x3A0, 26); + if (p4_IQ_ESCR_ok) + clear_msr_range(0x3BA, 2); + clear_msr_range(0x3BC, 3); + clear_msr_range(0x3C0, 6); + clear_msr_range(0x3C8, 6); + clear_msr_range(0x3E0, 2); + clear_msr_range(MSR_P4_CCCR0, 18); + clear_msr_range(MSR_P4_PERFCTR0, 18); +} + +/* + * Generic driver for any x86 with a working TSC. + */ + +static int generic_check_control(struct perfctr_cpu_state *state, int is_global) +{ + if (state->control.header.nractrs || state->control.header.nrictrs) + return -EINVAL; + return 0; +} + +static void generic_clear_counters(void) +{ +} + +/* + * Driver methods, internal and exported.
+ * + * Frequently called functions (write_control, read_counters, + * isuspend and iresume) are back-patched to invoke the correct + * processor-specific methods directly, thereby saving the + * overheads of indirect function calls. + * + * Backpatchable call sites must have been "finalised" after + * initialisation. The reason for this is that unsynchronised code + * modification doesn't work in multiprocessor systems, due to + * Intel P6 errata. Consequently, all backpatchable call sites + * must be known and local to this file. + * + * Backpatchable calls must initially be to 'noinline' stubs. + * Otherwise the compiler may inline the stubs, which breaks + * redirect_call() and finalise_backpatching(). + */ + +static int redirect_call_disable; + +static noinline void redirect_call(void *ra, void *to) +{ + /* XXX: make this function __init later */ + if (redirect_call_disable) + printk(KERN_ERR __FILE__ ":%s: unresolved call to %p at %p\n", + __FUNCTION__, to, ra); + /* we can only redirect `call near relative' instructions */ + if (*((unsigned char*)ra - 5) != 0xE8) { + printk(KERN_WARNING __FILE__ ":%s: unable to redirect caller %p to %p\n", + __FUNCTION__, ra, to); + return; + } + *(int*)((char*)ra - 4) = (char*)to - (char*)ra; +} + +static void (*write_control)(const struct perfctr_cpu_state*); +static noinline void perfctr_cpu_write_control(const struct perfctr_cpu_state *state) +{ + redirect_call(__builtin_return_address(0), write_control); + return write_control(state); +} + +static void (*read_counters)(const struct perfctr_cpu_state*, + struct perfctr_low_ctrs*); +static noinline void perfctr_cpu_read_counters(const struct perfctr_cpu_state *state, + struct perfctr_low_ctrs *ctrs) +{ + redirect_call(__builtin_return_address(0), read_counters); + return read_counters(state, ctrs); +} + +#ifdef CONFIG_X86_LOCAL_APIC +static void (*cpu_isuspend)(struct perfctr_cpu_state*); +static noinline void perfctr_cpu_isuspend(struct perfctr_cpu_state *state) +{ + redirect_call(__builtin_return_address(0), cpu_isuspend); + return cpu_isuspend(state); +} + +static void (*cpu_iresume)(const struct perfctr_cpu_state*); +static noinline void perfctr_cpu_iresume(const struct perfctr_cpu_state *state) +{ + redirect_call(__builtin_return_address(0), cpu_iresume); + return cpu_iresume(state); +} + +/* Call perfctr_cpu_ireload() just before perfctr_cpu_resume() to + bypass internal caching and force a reload of the I-mode PMCs. */ +void perfctr_cpu_ireload(struct perfctr_cpu_state *state) +{ +#ifdef CONFIG_SMP + clear_isuspend_cpu(state); +#else + get_cpu_cache()->id = 0; +#endif +} + +/* PRE: the counters have been suspended and sampled by perfctr_cpu_suspend() */ +static int lvtpc_reinit_needed; +unsigned int perfctr_cpu_identify_overflow(struct perfctr_cpu_state *state) +{ + unsigned int cstatus, nrctrs, i, pmc_mask; + + cstatus = state->user.cstatus; + nrctrs = perfctr_cstatus_nrctrs(cstatus); + state->pending_interrupt = 0; + pmc_mask = 0; + for(i = perfctr_cstatus_nractrs(cstatus); i < nrctrs; ++i) { + if ((int)state->user.pmc[i].start >= 0) { /* XXX: ">" ? */ + unsigned int pmc = state->control.pmc_map[i] & P4_MASK_FAST_RDPMC; + /* XXX: "+=" to correct for overshoots */ + state->user.pmc[i].start = state->control.ireset[pmc]; + pmc_mask |= (1 << i); + /* On a P4 we should now clear the OVF flag in the + counter's CCCR. However, p4_isuspend() already + did that as a side-effect of clearing the CCCR + in order to stop the i-mode counters.
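+ * Note that the returned pmc_mask is indexed by counter index i,
+ * not by physical PMC number: bit i set means the i'th counter in
+ * the user's pmc[] array overflowed. This is the bitmask convention
+ * described for si_pmc_ovf_mask in <asm-i386/perfctr.h>.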
*/ + } + } + if (lvtpc_reinit_needed) + apic_write(APIC_LVTPC, LOCAL_PERFCTR_VECTOR); + return pmc_mask; +} + +static inline int check_ireset(struct perfctr_cpu_state *state) +{ + unsigned int nrctrs, i; + + i = state->control.header.nractrs; + nrctrs = i + state->control.header.nrictrs; + for(; i < nrctrs; ++i) { + unsigned int pmc = state->control.pmc_map[i] & P4_MASK_FAST_RDPMC; + if ((int)state->control.ireset[pmc] >= 0) + return -EINVAL; + state->user.pmc[i].start = state->control.ireset[pmc]; + } + return 0; +} + +#else /* CONFIG_X86_LOCAL_APIC */ +static inline void perfctr_cpu_isuspend(struct perfctr_cpu_state *state) { } +static inline void perfctr_cpu_iresume(const struct perfctr_cpu_state *state) { } +static inline int check_ireset(struct perfctr_cpu_state *state) { return 0; } +#endif /* CONFIG_X86_LOCAL_APIC */ + +static int (*check_control)(struct perfctr_cpu_state*, int); +int perfctr_cpu_update_control(struct perfctr_cpu_state *state, int is_global) +{ + int err; + + clear_isuspend_cpu(state); + state->user.cstatus = 0; + + /* disallow i-mode counters if we cannot catch the interrupts */ + if (!(perfctr_info.cpu_features & PERFCTR_FEATURE_PCINT) + && state->control.header.nrictrs) + return -EPERM; + + err = check_control(state, is_global); + if (err < 0) + return err; + err = check_ireset(state); + if (err < 0) + return err; + state->user.cstatus = perfctr_mk_cstatus(state->control.header.tsc_on, + state->control.header.nractrs, + state->control.header.nrictrs); + return 0; +} + +/* + * get_reg_offset() maps MSR numbers to offsets into struct perfctr_cpu_control, + * suitable for accessing control data of type unsigned int. + */ +static int p5_reg_offset(unsigned int msr) +{ + if (msr == MSR_P5_CESR) + return offsetof(struct perfctr_cpu_control, evntsel[0]); + return -1; +} + +static int p6_reg_offset(unsigned int msr) +{ + if (msr - MSR_P6_EVNTSEL0 < 2) + return offsetof(struct perfctr_cpu_control, evntsel[msr - MSR_P6_EVNTSEL0]); + if (msr - MSR_P6_PERFCTR0 < 2) + return offsetof(struct perfctr_cpu_control, ireset[msr - MSR_P6_PERFCTR0]); + return -1; +} + +static int k7_reg_offset(unsigned int msr) +{ + if (msr - MSR_K7_EVNTSEL0 < 4) + return offsetof(struct perfctr_cpu_control, evntsel[msr - MSR_K7_EVNTSEL0]); + if (msr - MSR_K7_PERFCTR0 < 4) + return offsetof(struct perfctr_cpu_control, ireset[msr - MSR_K7_PERFCTR0]); + return -1; +} + +static int p4_reg_offset(unsigned int msr) +{ + if (msr - MSR_P4_CCCR0 < 18) + return offsetof(struct perfctr_cpu_control, evntsel[msr - MSR_P4_CCCR0]); + if (msr - MSR_P4_PERFCTR0 < 18) + return offsetof(struct perfctr_cpu_control, ireset[msr - MSR_P4_PERFCTR0]); + if (msr - MSR_P4_ESCR0 < 0x3E2 - 0x3A0) + return offsetof(struct perfctr_cpu_control, p4.escr[msr - MSR_P4_ESCR0]); + if (msr == MSR_P4_PEBS_ENABLE) + return offsetof(struct perfctr_cpu_control, p4.pebs_enable); + if (msr == MSR_P4_PEBS_MATRIX_VERT) + return offsetof(struct perfctr_cpu_control, p4.pebs_matrix_vert); + return -1; +} + +static int generic_reg_offset(unsigned int msr) +{ + return -1; +} + +static int (*get_reg_offset)(unsigned int); + +static int access_regs(struct perfctr_cpu_control *control, + void *argp, unsigned int argbytes, int do_write) +{ + struct perfctr_cpu_reg *regs; + unsigned int i, nr_regs, *where; + int offset; + + nr_regs = argbytes / sizeof(struct perfctr_cpu_reg); + if (nr_regs * sizeof(struct perfctr_cpu_reg) != argbytes) + return -EINVAL; + regs = (struct perfctr_cpu_reg*)argp; + + for(i = 0; i < nr_regs; ++i) { + offset = 
get_reg_offset(regs[i].nr); + if (offset < 0) + return -EINVAL; + where = (unsigned int*)((char*)control + offset); + if (do_write) + *where = regs[i].value; + else + regs[i].value = *where; + } + return argbytes; +} + +int perfctr_cpu_control_write(struct perfctr_cpu_control *control, unsigned int domain, + const void *srcp, unsigned int srcbytes) +{ + if (domain != PERFCTR_DOMAIN_CPU_REGS) + return -EINVAL; + return access_regs(control, (void*)srcp, srcbytes, 1); +} + +int perfctr_cpu_control_read(const struct perfctr_cpu_control *control, unsigned int domain, + void *dstp, unsigned int dstbytes) +{ + if (domain != PERFCTR_DOMAIN_CPU_REGS) + return -EINVAL; + return access_regs((struct perfctr_cpu_control*)control, dstp, dstbytes, 0); +} + +void perfctr_cpu_suspend(struct perfctr_cpu_state *state) +{ + unsigned int i, cstatus, nractrs; + struct perfctr_low_ctrs now; + + write_perfseq_begin(&state->user.sequence); + if (perfctr_cstatus_has_ictrs(state->user.cstatus)) + perfctr_cpu_isuspend(state); + perfctr_cpu_read_counters(state, &now); + cstatus = state->user.cstatus; + if (perfctr_cstatus_has_tsc(cstatus)) + state->user.tsc_sum += now.tsc - state->user.tsc_start; + nractrs = perfctr_cstatus_nractrs(cstatus); + for(i = 0; i < nractrs; ++i) + state->user.pmc[i].sum += now.pmc[i] - state->user.pmc[i].start; + /* perfctr_cpu_disable_rdpmc(); */ /* not for x86 */ + write_perfseq_end(&state->user.sequence); +} + +void perfctr_cpu_resume(struct perfctr_cpu_state *state) +{ + write_perfseq_begin(&state->user.sequence); + if (perfctr_cstatus_has_ictrs(state->user.cstatus)) + perfctr_cpu_iresume(state); + /* perfctr_cpu_enable_rdpmc(); */ /* not for x86 or global-mode */ + perfctr_cpu_write_control(state); + //perfctr_cpu_read_counters(state, &state->start); + { + struct perfctr_low_ctrs now; + unsigned int i, cstatus, nrctrs; + perfctr_cpu_read_counters(state, &now); + cstatus = state->user.cstatus; + if (perfctr_cstatus_has_tsc(cstatus)) + state->user.tsc_start = now.tsc; + nrctrs = perfctr_cstatus_nractrs(cstatus); + for(i = 0; i < nrctrs; ++i) + state->user.pmc[i].start = now.pmc[i]; + } + write_perfseq_end(&state->user.sequence); +} + +void perfctr_cpu_sample(struct perfctr_cpu_state *state) +{ + unsigned int i, cstatus, nractrs; + struct perfctr_low_ctrs now; + + write_perfseq_begin(&state->user.sequence); + perfctr_cpu_read_counters(state, &now); + cstatus = state->user.cstatus; + if (perfctr_cstatus_has_tsc(cstatus)) { + state->user.tsc_sum += now.tsc - state->user.tsc_start; + state->user.tsc_start = now.tsc; + } + nractrs = perfctr_cstatus_nractrs(cstatus); + for(i = 0; i < nractrs; ++i) { + state->user.pmc[i].sum += now.pmc[i] - state->user.pmc[i].start; + state->user.pmc[i].start = now.pmc[i]; + } + write_perfseq_end(&state->user.sequence); +} + +static void (*clear_counters)(void); +static void perfctr_cpu_clear_counters(void) +{ + return clear_counters(); +} + +/**************************************************************** + * * + * Processor detection and initialisation procedures. 
* + * * + ****************************************************************/ + +static inline void clear_perfctr_cpus_forbidden_mask(void) +{ +#if !defined(perfctr_cpus_forbidden_mask) + cpus_clear(perfctr_cpus_forbidden_mask); +#endif +} + +static inline void set_perfctr_cpus_forbidden_mask(cpumask_t mask) +{ +#if !defined(perfctr_cpus_forbidden_mask) + perfctr_cpus_forbidden_mask = mask; +#endif +} + +/* see comment above at redirect_call() */ +static void __init finalise_backpatching(void) +{ + struct per_cpu_cache *cache; + struct perfctr_cpu_state state; + cpumask_t old_mask; + + old_mask = perfctr_cpus_forbidden_mask; + clear_perfctr_cpus_forbidden_mask(); + + cache = get_cpu_cache(); + memset(cache, 0, sizeof *cache); + memset(&state, 0, sizeof state); + if (perfctr_info.cpu_features & PERFCTR_FEATURE_PCINT) { + state.user.cstatus = __perfctr_mk_cstatus(0, 1, 0, 0); + perfctr_cpu_sample(&state); + perfctr_cpu_resume(&state); + perfctr_cpu_suspend(&state); + } + state.user.cstatus = 0; + perfctr_cpu_sample(&state); + perfctr_cpu_resume(&state); + perfctr_cpu_suspend(&state); + + set_perfctr_cpus_forbidden_mask(old_mask); + + redirect_call_disable = 1; +} + +#ifdef CONFIG_SMP + +cpumask_t perfctr_cpus_forbidden_mask; + +static void __init p4_ht_mask_setup_cpu(void *forbidden) +{ + unsigned int local_apic_physical_id = cpuid_ebx(1) >> 24; + unsigned int logical_processor_id = local_apic_physical_id & 1; + if (logical_processor_id != 0) + /* We rely on cpu_set() being atomic! */ + cpu_set(smp_processor_id(), *(cpumask_t*)forbidden); +} + +static int __init p4_ht_smp_init(void) +{ + cpumask_t forbidden; + unsigned int cpu; + + cpus_clear(forbidden); + smp_call_function(p4_ht_mask_setup_cpu, &forbidden, 1, 1); + p4_ht_mask_setup_cpu(&forbidden); + if (cpus_empty(forbidden)) + return 0; + perfctr_cpus_forbidden_mask = forbidden; + printk(KERN_INFO "perfctr/x86.c: hyper-threaded P4s detected:" + " restricting access for CPUs"); + for(cpu = 0; cpu < NR_CPUS; ++cpu) + if (cpu_isset(cpu, forbidden)) + printk(" %u", cpu); + printk("\n"); + return 0; +} +#else /* SMP */ +#define p4_ht_smp_init() (0) +#endif /* SMP */ + +static int __init p4_ht_init(void) +{ + unsigned int nr_siblings; + + if (!cpu_has_ht) + return 0; + nr_siblings = (cpuid_ebx(1) >> 16) & 0xFF; + if (nr_siblings > 2) { + printk(KERN_WARNING "perfctr/x86.c: hyper-threaded P4s detected:" + " unsupported number of siblings: %u -- bailing out\n", + nr_siblings); + return -ENODEV; + } + if (nr_siblings < 2) + return 0; + p4_is_ht = 1; /* needed even in a UP kernel */ + return p4_ht_smp_init(); +} + +static int __init intel_init(void) +{ + static char p5_name[] __initdata = "Intel P5"; + static char p6_name[] __initdata = "Intel P6"; + static char p4_name[] __initdata = "Intel P4"; + unsigned int misc_enable; + + if (!cpu_has_tsc) + return -ENODEV; + switch (current_cpu_data.x86) { + case 5: + if (cpu_has_mmx) { + read_counters = rdpmc_read_counters; + + /* Avoid Pentium Erratum 74. 
*/ + if (current_cpu_data.x86_model == 4 && + (current_cpu_data.x86_mask == 4 || + (current_cpu_data.x86_mask == 3 && + ((cpuid_eax(1) >> 12) & 0x3) == 1))) + perfctr_info.cpu_features &= ~PERFCTR_FEATURE_RDPMC; + } else { + perfctr_info.cpu_features &= ~PERFCTR_FEATURE_RDPMC; + read_counters = p5_read_counters; + } + perfctr_set_tests_type(PTT_P5); + perfctr_cpu_name = p5_name; + write_control = p5_write_control; + check_control = p5_check_control; + clear_counters = p5_clear_counters; + get_reg_offset = p5_reg_offset; + return 0; + case 6: + if (current_cpu_data.x86_model == 9 || + current_cpu_data.x86_model == 13) { /* Pentium M */ + /* Pentium M added the MISC_ENABLE MSR from P4. */ + rdmsr_low(MSR_IA32_MISC_ENABLE, misc_enable); + if (!(misc_enable & MSR_IA32_MISC_ENABLE_PERF_AVAIL)) + break; + /* Erratum Y3 probably does not apply since we + read only the low 32 bits. */ + } else if (current_cpu_data.x86_model < 3) { /* Pentium Pro */ + /* Avoid Pentium Pro Erratum 26. */ + if (current_cpu_data.x86_mask < 9) + perfctr_info.cpu_features &= ~PERFCTR_FEATURE_RDPMC; + } + perfctr_set_tests_type(PTT_P6); + perfctr_cpu_name = p6_name; + read_counters = rdpmc_read_counters; + write_control = p6_write_control; + check_control = p6_check_control; + clear_counters = p6_clear_counters; + get_reg_offset = p6_reg_offset; +#ifdef CONFIG_X86_LOCAL_APIC + if (cpu_has_apic) { + perfctr_info.cpu_features |= PERFCTR_FEATURE_PCINT; + cpu_isuspend = p6_isuspend; + cpu_iresume = p6_iresume; + /* P-M apparently inherited P4's LVTPC auto-masking :-( */ + if (current_cpu_data.x86_model == 9 || + current_cpu_data.x86_model == 13) + lvtpc_reinit_needed = 1; + } +#endif + return 0; + case 15: /* Pentium 4 */ + rdmsr_low(MSR_IA32_MISC_ENABLE, misc_enable); + if (!(misc_enable & MSR_IA32_MISC_ENABLE_PERF_AVAIL)) + break; + if (p4_ht_init() != 0) + break; + if (current_cpu_data.x86_model <= 2) + p4_IQ_ESCR_ok = 1; + if (current_cpu_data.x86_model >= 2) + p4_extended_cascade_ok = 1; + perfctr_set_tests_type(PTT_P4); + perfctr_cpu_name = p4_name; + read_counters = rdpmc_read_counters; + write_control = p4_write_control; + check_control = p4_check_control; + clear_counters = p4_clear_counters; + get_reg_offset = p4_reg_offset; +#ifdef CONFIG_X86_LOCAL_APIC + if (cpu_has_apic) { + perfctr_info.cpu_features |= PERFCTR_FEATURE_PCINT; + cpu_isuspend = p4_isuspend; + cpu_iresume = p4_iresume; + lvtpc_reinit_needed = 1; + } +#endif + return 0; + } + return -ENODEV; +} + +/* + * Multicore K8s have issues with northbridge events: + * 1. The NB is shared between the cores, so two different cores + * in the same node cannot count NB events simultaneously. + * This can be handled by using perfctr_cpus_forbidden_mask to + * restrict NB-using threads to core0 of all nodes. + * 2. The initial multicore chips (Revision E) have an erratum + * which causes the NB counters to be reset when either core + * reprograms its evntsels (even for non-NB events). + * This is only an issue because of scheduling of threads, so + * we restrict NB events to the non thread-centric API. + * + * For now we only implement the workaround for issue 2, as this + * also handles issue 1. + * + * TODO: Detect post Revision E chips and implement a weaker + * workaround for them. 
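+ * (NB events are the ones IS_K8_NB_EVENT() matches: evntsel event
+ * codes 0xE0-0xFF, i.e. bits 7:5 of the event code all set.)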
+ */ +#ifdef CONFIG_SMP +static void __init k8_multicore_init(void) +{ + cpumask_t non0cores; + int i; + + cpus_clear(non0cores); + for(i = 0; i < NR_CPUS; ++i) { + cpumask_t cores = cpu_core_map[i]; + int core0 = first_cpu(cores); + if (core0 >= NR_CPUS) + continue; + cpu_clear(core0, cores); + cpus_or(non0cores, non0cores, cores); + } + if (cpus_empty(non0cores)) + return; + k8_is_multicore = 1; + printk(KERN_INFO "perfctr/x86.c: multi-core K8s detected:" + " restricting access to northbridge events\n"); +} +#else +#define k8_multicore_init() do{}while(0) +#endif + +static int __init amd_init(void) +{ + static char amd_name[] __initdata = "AMD K7/K8"; + + if (!cpu_has_tsc) + return -ENODEV; + switch (current_cpu_data.x86) { + case 6: /* K7 */ + break; + case 15: /* K8. Like a K7 with a different event set. */ + k8_multicore_init(); + break; + default: + return -ENODEV; + } + perfctr_set_tests_type(PTT_AMD); + perfctr_cpu_name = amd_name; + read_counters = rdpmc_read_counters; + write_control = k7_write_control; + check_control = k7_check_control; + clear_counters = k7_clear_counters; + get_reg_offset = k7_reg_offset; +#ifdef CONFIG_X86_LOCAL_APIC + if (cpu_has_apic) { + perfctr_info.cpu_features |= PERFCTR_FEATURE_PCINT; + cpu_isuspend = k7_isuspend; + cpu_iresume = k7_iresume; + } +#endif + return 0; +} + +static int __init cyrix_init(void) +{ + static char mii_name[] __initdata = "Cyrix 6x86MX/MII/III"; + if (!cpu_has_tsc) + return -ENODEV; + switch (current_cpu_data.x86) { + case 6: /* 6x86MX, MII, or III */ + perfctr_set_tests_type(PTT_P5); + perfctr_cpu_name = mii_name; + read_counters = rdpmc_read_counters; + write_control = p5_write_control; + check_control = mii_check_control; + clear_counters = p5_clear_counters; + get_reg_offset = p5_reg_offset; + return 0; + } + return -ENODEV; +} + +static int __init centaur_init(void) +{ +#if !defined(CONFIG_X86_TSC) + static char winchip_name[] __initdata = "WinChip C6/2/3"; +#endif + static char vc3_name[] __initdata = "VIA C3"; + switch (current_cpu_data.x86) { +#if !defined(CONFIG_X86_TSC) + case 5: + switch (current_cpu_data.x86_model) { + case 4: /* WinChip C6 */ + case 8: /* WinChip 2, 2A, or 2B */ + case 9: /* WinChip 3, a 2A with larger cache and lower voltage */ + break; + default: + return -ENODEV; + } + perfctr_set_tests_type(PTT_WINCHIP); + perfctr_cpu_name = winchip_name; + /* + * TSC must be inaccessible for perfctrs to work. 
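+ * The check below enforces this: CR4.TSD must already be set, so
+ * user-mode RDTSC faults, and the kernel itself must not be using
+ * the TSC (cpu_has_tsc clear, e.g. a non-TSC kernel booted with
+ * "notsc").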
+ */ + if (!(read_cr4() & X86_CR4_TSD) || cpu_has_tsc) + return -ENODEV; + perfctr_info.cpu_features &= ~PERFCTR_FEATURE_RDTSC; + read_counters = rdpmc_read_counters; + write_control = c6_write_control; + check_control = c6_check_control; + clear_counters = p5_clear_counters; + get_reg_offset = p5_reg_offset; + return 0; +#endif + case 6: /* VIA C3 */ + if (!cpu_has_tsc) + return -ENODEV; + switch (current_cpu_data.x86_model) { + case 6: /* Cyrix III */ + case 7: /* Samuel 2, Ezra (steppings >= 8) */ + case 8: /* Ezra-T */ + case 9: /* Antaur/Nehemiah */ + break; + default: + return -ENODEV; + } + perfctr_set_tests_type(PTT_VC3); + perfctr_cpu_name = vc3_name; + read_counters = rdpmc_read_counters; + write_control = p6_write_control; + check_control = vc3_check_control; + clear_counters = vc3_clear_counters; + get_reg_offset = p6_reg_offset; + return 0; + } + return -ENODEV; +} + +static int __init generic_init(void) +{ + static char generic_name[] __initdata = "Generic x86 with TSC"; + if (!cpu_has_tsc) + return -ENODEV; + perfctr_info.cpu_features &= ~PERFCTR_FEATURE_RDPMC; + perfctr_set_tests_type(PTT_GENERIC); + perfctr_cpu_name = generic_name; + check_control = generic_check_control; + write_control = p6_write_control; + read_counters = rdpmc_read_counters; + clear_counters = generic_clear_counters; + get_reg_offset = generic_reg_offset; + return 0; +} + +static void perfctr_cpu_invalidate_cache(void) +{ + /* + * per_cpu_cache[] is initialised to contain "impossible" + * evntsel values guaranteed to differ from anything accepted + * by perfctr_cpu_update_control(). + * All-bits-one works for all currently supported processors. + * The memset also sets the ids to -1, which is intentional. + */ + memset(get_cpu_cache(), ~0, sizeof(struct per_cpu_cache)); +} + +static void perfctr_cpu_init_one(void *ignore) +{ + /* PREEMPT note: when called via smp_call_function(), + this is in IRQ context with preemption disabled. */ + perfctr_cpu_clear_counters(); + perfctr_cpu_invalidate_cache(); + if (cpu_has_apic) + apic_write(APIC_LVTPC, LOCAL_PERFCTR_VECTOR); + if (perfctr_info.cpu_features & PERFCTR_FEATURE_RDPMC) + set_in_cr4_local(X86_CR4_PCE); +} + +static void perfctr_cpu_exit_one(void *ignore) +{ + /* PREEMPT note: when called via smp_call_function(), + this is in IRQ context with preemption disabled. 
*/ + perfctr_cpu_clear_counters(); + perfctr_cpu_invalidate_cache(); + if (cpu_has_apic) + apic_write(APIC_LVTPC, APIC_DM_NMI | APIC_LVT_MASKED); + if (perfctr_info.cpu_features & PERFCTR_FEATURE_RDPMC) + clear_in_cr4_local(X86_CR4_PCE); +} + +#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_PM) + +static void perfctr_pm_suspend(void) +{ + /* XXX: clear control registers */ + printk("perfctr/x86: PM suspend\n"); +} + +static void perfctr_pm_resume(void) +{ + /* XXX: reload control registers */ + printk("perfctr/x86: PM resume\n"); +} + +#include + +static int perfctr_device_suspend(struct sys_device *dev, u32 state) +{ + perfctr_pm_suspend(); + return 0; +} + +static int perfctr_device_resume(struct sys_device *dev) +{ + perfctr_pm_resume(); + return 0; +} + +static struct sysdev_class perfctr_sysclass = { + set_kset_name("perfctr"), + .resume = perfctr_device_resume, + .suspend = perfctr_device_suspend, +}; + +static struct sys_device device_perfctr = { + .id = 0, + .cls = &perfctr_sysclass, +}; + +static void x86_pm_init(void) +{ + if (sysdev_class_register(&perfctr_sysclass) == 0) + sysdev_register(&device_perfctr); +} + +static void x86_pm_exit(void) +{ + sysdev_unregister(&device_perfctr); + sysdev_class_unregister(&perfctr_sysclass); +} + +#else + +static inline void x86_pm_init(void) { } +static inline void x86_pm_exit(void) { } + +#endif /* CONFIG_X86_LOCAL_APIC && CONFIG_PM */ + +#if !defined(CONFIG_X86_LOCAL_APIC) +static inline int reserve_lapic_nmi(void) { return 0; } +static inline void release_lapic_nmi(void) { } +#endif + +static void do_init_tests(void) +{ +#ifdef CONFIG_PERFCTR_INIT_TESTS + if (reserve_lapic_nmi() >= 0) { + perfctr_x86_init_tests(); + release_lapic_nmi(); + } +#endif +} + +static int init_done; + +int __init perfctr_cpu_init(void) +{ + int err = -ENODEV; + + preempt_disable(); + + /* RDPMC and RDTSC are on by default. They will be disabled + by the init procedures if necessary. */ + perfctr_info.cpu_features = PERFCTR_FEATURE_RDPMC | PERFCTR_FEATURE_RDTSC; + + if (cpu_has_msr) { + switch (current_cpu_data.x86_vendor) { + case X86_VENDOR_INTEL: + err = intel_init(); + break; + case X86_VENDOR_AMD: + err = amd_init(); + break; + case X86_VENDOR_CYRIX: + err = cyrix_init(); + break; + case X86_VENDOR_CENTAUR: + err = centaur_init(); + } + } + if (err) { + err = generic_init(); /* last resort */ + if (err) + goto out; + } + do_init_tests(); + finalise_backpatching(); + + perfctr_info.cpu_khz = cpu_khz; + perfctr_info.tsc_to_cpu_mult = 1; + init_done = 1; + + out: + preempt_enable(); + return err; +} + +void __exit perfctr_cpu_exit(void) +{ +} + +/**************************************************************** + * * + * Hardware reservation. 
* + * * + ****************************************************************/ + +static DECLARE_MUTEX(mutex); +static const char *current_service = 0; + +const char *perfctr_cpu_reserve(const char *service) +{ + const char *ret; + + if (!init_done) + return "unsupported hardware"; + down(&mutex); + ret = current_service; + if (ret) + goto out_up; + ret = "unknown driver (oprofile?)"; + if (reserve_lapic_nmi() < 0) + goto out_up; + current_service = service; + if (perfctr_info.cpu_features & PERFCTR_FEATURE_RDPMC) + mmu_cr4_features |= X86_CR4_PCE; + on_each_cpu(perfctr_cpu_init_one, NULL, 1, 1); + perfctr_cpu_set_ihandler(NULL); + x86_pm_init(); + ret = NULL; + out_up: + up(&mutex); + return ret; +} + +void perfctr_cpu_release(const char *service) +{ + down(&mutex); + if (service != current_service) { + printk(KERN_ERR "%s: attempt by %s to release while reserved by %s\n", + __FUNCTION__, service, current_service); + goto out_up; + } + /* power down the counters */ + if (perfctr_info.cpu_features & PERFCTR_FEATURE_RDPMC) + mmu_cr4_features &= ~X86_CR4_PCE; + on_each_cpu(perfctr_cpu_exit_one, NULL, 1, 1); + perfctr_cpu_set_ihandler(NULL); + x86_pm_exit(); + current_service = 0; + release_lapic_nmi(); + out_up: + up(&mutex); +} diff -puN /dev/null drivers/perfctr/x86_tests.c --- /dev/null 2003-09-15 06:40:47.000000000 -0700 +++ devel-akpm/drivers/perfctr/x86_tests.c 2005-07-08 23:11:41.000000000 -0700 @@ -0,0 +1,308 @@ +/* $Id: x86_tests.c,v 1.34 2004/08/08 19:54:40 mikpe Exp $ + * Performance-monitoring counters driver. + * Optional x86/x86_64-specific init-time tests. + * + * Copyright (C) 1999-2004 Mikael Pettersson + */ +#include +#include +#include +#include +#include +#include +#undef MSR_P6_PERFCTR0 +#undef MSR_P4_IQ_CCCR0 +#undef MSR_P4_CRU_ESCR0 +#include +#include +#include /* cpu_khz */ +#include "x86_tests.h" + +#define MSR_P5_CESR 0x11 +#define MSR_P5_CTR0 0x12 +#define P5_CESR_VAL (0x16 | (3<<6)) +#define MSR_P6_PERFCTR0 0xC1 +#define MSR_P6_EVNTSEL0 0x186 +#define P6_EVNTSEL0_VAL (0xC0 | (3<<16) | (1<<22)) +#define MSR_K7_EVNTSEL0 0xC0010000 +#define MSR_K7_PERFCTR0 0xC0010004 +#define K7_EVNTSEL0_VAL (0xC0 | (3<<16) | (1<<22)) +#define VC3_EVNTSEL1_VAL 0xC0 +#define MSR_P4_IQ_COUNTER0 0x30C +#define MSR_P4_IQ_CCCR0 0x36C +#define MSR_P4_CRU_ESCR0 0x3B8 +#define P4_CRU_ESCR0_VAL ((2<<25) | (1<<9) | (0x3<<2)) +#define P4_IQ_CCCR0_VAL ((0x3<<16) | (4<<13) | (1<<12)) + +#define NITER 64 +#define X2(S) S";"S +#define X8(S) X2(X2(X2(S))) + +#ifdef __x86_64__ +#define CR4MOV "movq" +#else +#define CR4MOV "movl" +#endif + +#ifndef CONFIG_X86_LOCAL_APIC +#undef apic_write +#define apic_write(reg,vector) do{}while(0) +#endif + +#if !defined(__x86_64__) +/* Avoid speculative execution by the CPU */ +extern inline void sync_core(void) +{ + int tmp; + asm volatile("cpuid" : "=a" (tmp) : "0" (1) : "ebx","ecx","edx","memory"); +} +#endif + +static void __init do_rdpmc(unsigned pmc, unsigned unused2) +{ + unsigned i; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__(X8("rdpmc") : : "c"(pmc) : "eax", "edx"); +} + +static void __init do_rdmsr(unsigned msr, unsigned unused2) +{ + unsigned i; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__(X8("rdmsr") : : "c"(msr) : "eax", "edx"); +} + +static void __init do_wrmsr(unsigned msr, unsigned data) +{ + unsigned i; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__(X8("wrmsr") : : "c"(msr), "a"(data), "d"(0)); +} + +static void __init do_rdcr4(unsigned unused1, unsigned unused2) +{ + unsigned i; + unsigned long dummy; + for(i = 0; i < 
NITER/8; ++i) + __asm__ __volatile__(X8(CR4MOV" %%cr4,%0") : "=r"(dummy)); +} + +static void __init do_wrcr4(unsigned cr4, unsigned unused2) +{ + unsigned i; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__(X8(CR4MOV" %0,%%cr4") : : "r"((long)cr4)); +} + +static void __init do_rdtsc(unsigned unused1, unsigned unused2) +{ + unsigned i; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__(X8("rdtsc") : : : "eax", "edx"); +} + +static void __init do_wrlvtpc(unsigned val, unsigned unused2) +{ + unsigned i; + for(i = 0; i < NITER/8; ++i) { + apic_write(APIC_LVTPC, val); + apic_write(APIC_LVTPC, val); + apic_write(APIC_LVTPC, val); + apic_write(APIC_LVTPC, val); + apic_write(APIC_LVTPC, val); + apic_write(APIC_LVTPC, val); + apic_write(APIC_LVTPC, val); + apic_write(APIC_LVTPC, val); + } +} + +static void __init do_sync_core(unsigned unused1, unsigned unused2) +{ + unsigned i; + for(i = 0; i < NITER/8; ++i) { + sync_core(); + sync_core(); + sync_core(); + sync_core(); + sync_core(); + sync_core(); + sync_core(); + sync_core(); + } +} + +static void __init do_empty_loop(unsigned unused1, unsigned unused2) +{ + unsigned i; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__("" : : "c"(0)); +} + +static unsigned __init run(void (*doit)(unsigned, unsigned), + unsigned arg1, unsigned arg2) +{ + unsigned start, dummy, stop; + sync_core(); + rdtsc(start, dummy); + (*doit)(arg1, arg2); /* should take < 2^32 cycles to complete */ + sync_core(); + rdtsc(stop, dummy); + return stop - start; +} + +static void __init init_tests_message(void) +{ + printk(KERN_INFO "Please email the following PERFCTR INIT lines " + "to mikpe@csd.uu.se\n" + KERN_INFO "To remove this message, rebuild the driver " + "with CONFIG_PERFCTR_INIT_TESTS=n\n"); + printk(KERN_INFO "PERFCTR INIT: vendor %u, family %u, model %u, stepping %u, clock %u kHz\n", + current_cpu_data.x86_vendor, + current_cpu_data.x86, + current_cpu_data.x86_model, + current_cpu_data.x86_mask, + (unsigned int)cpu_khz); +} + +static void __init +measure_overheads(unsigned msr_evntsel0, unsigned evntsel0, unsigned msr_perfctr0, + unsigned msr_cccr, unsigned cccr_val) +{ + int i; + unsigned int loop, ticks[13]; + const char *name[13]; + + if (msr_evntsel0) + wrmsr(msr_evntsel0, 0, 0); + if (msr_cccr) + wrmsr(msr_cccr, 0, 0); + + name[0] = "rdtsc"; + ticks[0] = run(do_rdtsc, 0, 0); + name[1] = "rdpmc"; + ticks[1] = (perfctr_info.cpu_features & PERFCTR_FEATURE_RDPMC) + ? run(do_rdpmc,1,0) : 0; + name[2] = "rdmsr (counter)"; + ticks[2] = msr_perfctr0 ? run(do_rdmsr, msr_perfctr0, 0) : 0; + name[3] = msr_cccr ? "rdmsr (escr)" : "rdmsr (evntsel)"; + ticks[3] = msr_evntsel0 ? run(do_rdmsr, msr_evntsel0, 0) : 0; + name[4] = "wrmsr (counter)"; + ticks[4] = msr_perfctr0 ? run(do_wrmsr, msr_perfctr0, 0) : 0; + name[5] = msr_cccr ? "wrmsr (escr)" : "wrmsr (evntsel)"; + ticks[5] = msr_evntsel0 ? run(do_wrmsr, msr_evntsel0, evntsel0) : 0; + name[6] = "read cr4"; + ticks[6] = run(do_rdcr4, 0, 0); + name[7] = "write cr4"; + ticks[7] = run(do_wrcr4, read_cr4(), 0); + name[8] = "rdpmc (fast)"; + ticks[8] = msr_cccr ? run(do_rdpmc, 0x80000001, 0) : 0; + name[9] = "rdmsr (cccr)"; + ticks[9] = msr_cccr ? run(do_rdmsr, msr_cccr, 0) : 0; + name[10] = "wrmsr (cccr)"; + ticks[10] = msr_cccr ? run(do_wrmsr, msr_cccr, cccr_val) : 0; + name[11] = "write LVTPC"; + ticks[11] = (perfctr_info.cpu_features & PERFCTR_FEATURE_PCINT) + ? 
run(do_wrlvtpc, APIC_DM_NMI|APIC_LVT_MASKED, 0) : 0; + name[12] = "sync_core"; + ticks[12] = run(do_sync_core, 0, 0); + + loop = run(do_empty_loop, 0, 0); + + if (msr_evntsel0) + wrmsr(msr_evntsel0, 0, 0); + if (msr_cccr) + wrmsr(msr_cccr, 0, 0); + + init_tests_message(); + printk(KERN_INFO "PERFCTR INIT: NITER == %u\n", NITER); + printk(KERN_INFO "PERFCTR INIT: loop overhead is %u cycles\n", loop); + for(i = 0; i < ARRAY_SIZE(ticks); ++i) { + unsigned int x; + if (!ticks[i]) + continue; + x = ((ticks[i] - loop) * 10) / NITER; + printk(KERN_INFO "PERFCTR INIT: %s cost is %u.%u cycles (%u total)\n", + name[i], x/10, x%10, ticks[i]); + } +} + +#ifndef __x86_64__ +static inline void perfctr_p5_init_tests(void) +{ + measure_overheads(MSR_P5_CESR, P5_CESR_VAL, MSR_P5_CTR0, 0, 0); +} + +static inline void perfctr_p6_init_tests(void) +{ + measure_overheads(MSR_P6_EVNTSEL0, P6_EVNTSEL0_VAL, MSR_P6_PERFCTR0, 0, 0); +} + +#if !defined(CONFIG_X86_TSC) +static inline void perfctr_c6_init_tests(void) +{ + unsigned int cesr, dummy; + + rdmsr(MSR_P5_CESR, cesr, dummy); + init_tests_message(); + printk(KERN_INFO "PERFCTR INIT: boot CESR == %#08x\n", cesr); +} +#endif + +static inline void perfctr_vc3_init_tests(void) +{ + measure_overheads(MSR_P6_EVNTSEL0+1, VC3_EVNTSEL1_VAL, MSR_P6_PERFCTR0+1, 0, 0); +} +#endif /* !__x86_64__ */ + +static inline void perfctr_p4_init_tests(void) +{ + measure_overheads(MSR_P4_CRU_ESCR0, P4_CRU_ESCR0_VAL, MSR_P4_IQ_COUNTER0, + MSR_P4_IQ_CCCR0, P4_IQ_CCCR0_VAL); +} + +static inline void perfctr_k7_init_tests(void) +{ + measure_overheads(MSR_K7_EVNTSEL0, K7_EVNTSEL0_VAL, MSR_K7_PERFCTR0, 0, 0); +} + +static inline void perfctr_generic_init_tests(void) +{ + measure_overheads(0, 0, 0, 0, 0); +} + +enum perfctr_x86_tests_type perfctr_x86_tests_type __initdata = PTT_UNKNOWN; + +void __init perfctr_x86_init_tests(void) +{ + switch (perfctr_x86_tests_type) { +#ifndef __x86_64__ + case PTT_P5: /* Intel P5, P5MMX; Cyrix 6x86MX, MII, III */ + perfctr_p5_init_tests(); + break; + case PTT_P6: /* Intel PPro, PII, PIII, PENTM */ + perfctr_p6_init_tests(); + break; +#if !defined(CONFIG_X86_TSC) + case PTT_WINCHIP: /* WinChip C6, 2, 3 */ + perfctr_c6_init_tests(); + break; +#endif + case PTT_VC3: /* VIA C3 */ + perfctr_vc3_init_tests(); + break; +#endif /* !__x86_64__ */ + case PTT_P4: /* Intel P4 */ + perfctr_p4_init_tests(); + break; + case PTT_AMD: /* AMD K7, K8 */ + perfctr_k7_init_tests(); + break; + case PTT_GENERIC: + perfctr_generic_init_tests(); + break; + default: + printk(KERN_INFO "%s: unknown CPU type %u\n", + __FUNCTION__, perfctr_x86_tests_type); + break; + } +} diff -puN /dev/null drivers/perfctr/x86_tests.h --- /dev/null 2003-09-15 06:40:47.000000000 -0700 +++ devel-akpm/drivers/perfctr/x86_tests.h 2005-07-08 23:11:41.000000000 -0700 @@ -0,0 +1,30 @@ +/* $Id: x86_tests.h,v 1.10 2004/05/22 20:48:57 mikpe Exp $ + * Performance-monitoring counters driver. + * Optional x86/x86_64-specific init-time tests. + * + * Copyright (C) 1999-2004 Mikael Pettersson + */ + +/* 'enum perfctr_x86_tests_type' classifies CPUs according + to relevance for perfctr_x86_init_tests(). 
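+ * Typical use, as in x86.c: the CPU probe records the type via
+ * perfctr_set_tests_type(PTT_xxx), and perfctr_cpu_init() later runs
+ * perfctr_x86_init_tests() with the local APIC NMI watchdog
+ * reservation held (see do_init_tests()).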
*/ +enum perfctr_x86_tests_type { + PTT_UNKNOWN, + PTT_GENERIC, + PTT_P5, + PTT_P6, + PTT_P4, + PTT_AMD, + PTT_WINCHIP, + PTT_VC3, +}; + +extern enum perfctr_x86_tests_type perfctr_x86_tests_type; + +static inline void perfctr_set_tests_type(enum perfctr_x86_tests_type t) +{ +#ifdef CONFIG_PERFCTR_INIT_TESTS + perfctr_x86_tests_type = t; +#endif +} + +extern void perfctr_x86_init_tests(void); diff -puN include/asm-i386/mach-default/irq_vectors.h~perfctr include/asm-i386/mach-default/irq_vectors.h --- devel/include/asm-i386/mach-default/irq_vectors.h~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/include/asm-i386/mach-default/irq_vectors.h 2005-07-08 23:11:41.000000000 -0700 @@ -56,14 +56,15 @@ * sources per level' errata. */ #define LOCAL_TIMER_VECTOR 0xef +#define LOCAL_PERFCTR_VECTOR 0xee /* - * First APIC vector available to drivers: (vectors 0x30-0xee) + * First APIC vector available to drivers: (vectors 0x30-0xed) * we start at 0x31 to spread out vectors evenly between priority * levels. (0x80 is the syscall vector) */ #define FIRST_DEVICE_VECTOR 0x31 -#define FIRST_SYSTEM_VECTOR 0xef +#define FIRST_SYSTEM_VECTOR 0xee #define TIMER_IRQ 0 diff -puN include/asm-i386/mach-visws/irq_vectors.h~perfctr include/asm-i386/mach-visws/irq_vectors.h --- devel/include/asm-i386/mach-visws/irq_vectors.h~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/include/asm-i386/mach-visws/irq_vectors.h 2005-07-08 23:11:41.000000000 -0700 @@ -35,14 +35,15 @@ * sources per level' errata. */ #define LOCAL_TIMER_VECTOR 0xef +#define LOCAL_PERFCTR_VECTOR 0xee /* - * First APIC vector available to drivers: (vectors 0x30-0xee) + * First APIC vector available to drivers: (vectors 0x30-0xed) * we start at 0x31 to spread out vectors evenly between priority * levels. (0x80 is the syscall vector) */ #define FIRST_DEVICE_VECTOR 0x31 -#define FIRST_SYSTEM_VECTOR 0xef +#define FIRST_SYSTEM_VECTOR 0xee #define TIMER_IRQ 0 diff -puN /dev/null include/asm-i386/perfctr.h --- /dev/null 2003-09-15 06:40:47.000000000 -0700 +++ devel-akpm/include/asm-i386/perfctr.h 2005-07-08 23:11:41.000000000 -0700 @@ -0,0 +1,200 @@ +/* $Id: perfctr.h,v 1.63 2005/04/08 14:36:49 mikpe Exp $ + * x86/x86_64 Performance-Monitoring Counters driver + * + * Copyright (C) 1999-2005 Mikael Pettersson + */ +#ifndef _ASM_I386_PERFCTR_H +#define _ASM_I386_PERFCTR_H + +#include + +struct perfctr_sum_ctrs { + __u64 tsc; + __u64 pmc[18]; /* the size is not part of the user ABI */ +}; + +struct perfctr_cpu_control_header { + __u32 tsc_on; + __u32 nractrs; /* number of accumulation-mode counters */ + __u32 nrictrs; /* number of interrupt-mode counters */ +}; + +struct perfctr_cpu_state_user { + __u32 cstatus; + /* This is a sequence counter to ensure atomic reads by + * userspace. The mechanism is identical to that used for + * seqcount_t in include/linux/seqlock.h. 
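+ *
+ * A user-space reader is expected to follow the usual seqlock
+ * protocol: retry while the count is odd (writer active) or has
+ * changed. A hedged sketch, with rdtsc() standing in for any
+ * user-mode counter read and memory barriers omitted:
+ *
+ *	do {
+ *		seq = p->sequence;
+ *		sum = p->tsc_sum + (rdtsc() - p->tsc_start);
+ *	} while ((seq & 1) || p->sequence != seq);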
*/ + __u32 sequence; + __u64 tsc_start; + __u64 tsc_sum; + struct { + __u64 start; + __u64 sum; + } pmc[18]; /* the size is not part of the user ABI */ +}; + +/* cstatus is a re-encoding of control.tsc_on/nractrs/nrictrs + which should have less overhead in most cases */ + +static inline +unsigned int __perfctr_mk_cstatus(unsigned int tsc_on, unsigned int have_ictrs, + unsigned int nrictrs, unsigned int nractrs) +{ + return (tsc_on<<31) | (have_ictrs<<16) | ((nractrs+nrictrs)<<8) | nractrs; +} + +static inline +unsigned int perfctr_mk_cstatus(unsigned int tsc_on, unsigned int nractrs, + unsigned int nrictrs) +{ + return __perfctr_mk_cstatus(tsc_on, nrictrs, nrictrs, nractrs); +} + +static inline unsigned int perfctr_cstatus_enabled(unsigned int cstatus) +{ + return cstatus; +} + +static inline int perfctr_cstatus_has_tsc(unsigned int cstatus) +{ + return (int)cstatus < 0; /* test and jump on sign */ +} + +static inline unsigned int perfctr_cstatus_nractrs(unsigned int cstatus) +{ + return cstatus & 0x7F; /* and with imm8 */ +} + +static inline unsigned int perfctr_cstatus_nrctrs(unsigned int cstatus) +{ + return (cstatus >> 8) & 0x7F; +} + +static inline unsigned int perfctr_cstatus_has_ictrs(unsigned int cstatus) +{ + return cstatus & (0x7F << 16); +} + +/* + * 'struct siginfo' support for perfctr overflow signals. + * In unbuffered mode, si_code is set to SI_PMC_OVF and a bitmask + * describing which perfctrs overflowed is put in si_pmc_ovf_mask. + * A bitmask is used since more than one perfctr can have overflowed + * by the time the interrupt handler runs. + * + * glibc's <signal.h> doesn't seem to define __SI_FAULT or __SI_CODE(), + * and including <asm/siginfo.h> as well may cause redefinition errors, + * so the user and kernel values are different #defines here. + */ +#ifdef __KERNEL__ +#define SI_PMC_OVF (__SI_FAULT|'P') +#else +#define SI_PMC_OVF ('P') +#endif +#define si_pmc_ovf_mask _sifields._pad[0] /* XXX: use an unsigned field later */ + +#ifdef __KERNEL__ + +#if defined(CONFIG_PERFCTR) + +struct perfctr_cpu_control { + struct perfctr_cpu_control_header header; + unsigned int evntsel[18]; /* primary control registers, physical indices */ + unsigned int ireset[18]; /* >= 2^31, for i-mode counters, physical indices */ + struct { + unsigned int escr[0x3E2-0x3A0]; /* secondary controls, physical indices */ + unsigned int pebs_enable; /* for replay tagging */ + unsigned int pebs_matrix_vert; /* for replay tagging */ + } p4; + unsigned int pmc_map[18]; /* virtual to physical (rdpmc) index map */ +}; + +struct perfctr_cpu_state { + /* Don't change field order here without first considering the number + of cache lines touched during sampling and context switching. */ + unsigned int id; + int isuspend_cpu; + struct perfctr_cpu_state_user user; + struct perfctr_cpu_control control; + unsigned int p4_escr_map[18]; +#ifdef CONFIG_PERFCTR_INTERRUPT_SUPPORT + unsigned int pending_interrupt; +#endif +}; + +/* Driver init/exit. */ +extern int perfctr_cpu_init(void); +extern void perfctr_cpu_exit(void); + +/* CPU type name. */ +extern char *perfctr_cpu_name; + +/* Hardware reservation. */ +extern const char *perfctr_cpu_reserve(const char *service); +extern void perfctr_cpu_release(const char *service); + +/* PRE: state has no running interrupt-mode counters. + Check that the new control data is valid. + Update the driver's private control data. + is_global should be zero for per-process counters and non-zero + for global-mode counters. This matters for HT P4s, alas.
+ Returns a negative error code if the control data is invalid. */ +extern int perfctr_cpu_update_control(struct perfctr_cpu_state *state, int is_global); + +/* Parse and update control for the given domain. */ +extern int perfctr_cpu_control_write(struct perfctr_cpu_control *control, + unsigned int domain, + const void *srcp, unsigned int srcbytes); + +/* Retrieve and format control for the given domain. + Returns number of bytes written. */ +extern int perfctr_cpu_control_read(const struct perfctr_cpu_control *control, + unsigned int domain, + void *dstp, unsigned int dstbytes); + +/* Read a-mode counters. Subtract from start and accumulate into sums. + Must be called with preemption disabled. */ +extern void perfctr_cpu_suspend(struct perfctr_cpu_state *state); + +/* Write control registers. Read a-mode counters into start. + Must be called with preemption disabled. */ +extern void perfctr_cpu_resume(struct perfctr_cpu_state *state); + +/* Perform an efficient combined suspend/resume operation. + Must be called with preemption disabled. */ +extern void perfctr_cpu_sample(struct perfctr_cpu_state *state); + +/* The type of a perfctr overflow interrupt handler. + It will be called in IRQ context, with preemption disabled. */ +typedef void (*perfctr_ihandler_t)(unsigned long pc); + +/* Operations related to overflow interrupt handling. */ +#ifdef CONFIG_X86_LOCAL_APIC +extern void perfctr_cpu_set_ihandler(perfctr_ihandler_t); +extern void perfctr_cpu_ireload(struct perfctr_cpu_state*); +extern unsigned int perfctr_cpu_identify_overflow(struct perfctr_cpu_state*); +static inline int perfctr_cpu_has_pending_interrupt(const struct perfctr_cpu_state *state) +{ + return state->pending_interrupt; +} +#else +static inline void perfctr_cpu_set_ihandler(perfctr_ihandler_t x) { } +static inline int perfctr_cpu_has_pending_interrupt(const struct perfctr_cpu_state *state) +{ + return 0; +} +#endif + +#endif /* CONFIG_PERFCTR */ + +#if defined(CONFIG_PERFCTR) && defined(CONFIG_X86_LOCAL_APIC) +asmlinkage void perfctr_interrupt(struct pt_regs*); +#define perfctr_vector_init() \ + set_intr_gate(LOCAL_PERFCTR_VECTOR, perfctr_interrupt) +#else +#define perfctr_vector_init() do{}while(0) +#endif + +#endif /* __KERNEL__ */ + +#endif /* _ASM_I386_PERFCTR_H */ diff -puN include/asm-i386/processor.h~perfctr include/asm-i386/processor.h --- devel/include/asm-i386/processor.h~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/include/asm-i386/processor.h 2005-07-08 23:11:41.000000000 -0700 @@ -456,6 +456,8 @@ struct thread_struct { unsigned long *io_bitmap_ptr; /* max allowed port in the bitmap, in bytes: */ unsigned long io_bitmap_max; +/* performance counters */ + struct vperfctr *perfctr; }; #define INIT_THREAD { \ diff -puN include/asm-i386/unistd.h~perfctr include/asm-i386/unistd.h --- devel/include/asm-i386/unistd.h~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/include/asm-i386/unistd.h 2005-07-08 23:11:41.000000000 -0700 @@ -298,8 +298,12 @@ #define __NR_ioprio_get 290 #define __NR_pselect6 291 #define __NR_ppoll 292 +#define __NR_vperfctr_open 293 +#define __NR_vperfctr_control (__NR_vperfctr_open+1) +#define __NR_vperfctr_write (__NR_vperfctr_open+2) +#define __NR_vperfctr_read (__NR_vperfctr_open+3) -#define NR_syscalls 293 +#define NR_syscalls 297 /* * user-visible error numbers are in the range -1 - -128: see diff -puN include/asm-i386/system.h~perfctr include/asm-i386/system.h --- devel/include/asm-i386/system.h~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ 
devel-akpm/include/asm-i386/system.h 2005-07-08 23:11:41.000000000 -0700 @@ -14,6 +14,7 @@ extern struct task_struct * FASTCALL(__s #define switch_to(prev,next,last) do { \ unsigned long esi,edi; \ + perfctr_suspend_thread(&(prev)->thread); \ asm volatile("pushfl\n\t" \ "pushl %%ebp\n\t" \ "movl %%esp,%0\n\t" /* save ESP */ \ diff -puN arch/x86_64/ia32/ia32entry.S~perfctr arch/x86_64/ia32/ia32entry.S --- devel/arch/x86_64/ia32/ia32entry.S~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/arch/x86_64/ia32/ia32entry.S 2005-07-08 23:11:41.000000000 -0700 @@ -586,6 +586,12 @@ ia32_sys_call_table: .quad sys_add_key .quad sys_request_key .quad sys_keyctl + .quad quiet_ni_syscall /* sys_ioprio_set */ + .quad quiet_ni_syscall /* sys_ioprio_get */ /* 290 */ + .quad sys_vperfctr_open + .quad sys_vperfctr_control + .quad sys_vperfctr_write + .quad sys_vperfctr_read /* don't forget to change IA32_NR_syscalls */ ia32_syscall_end: .rept IA32_NR_syscalls-(ia32_syscall_end-ia32_sys_call_table)/8 diff -puN arch/x86_64/Kconfig~perfctr arch/x86_64/Kconfig --- devel/arch/x86_64/Kconfig~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/arch/x86_64/Kconfig 2005-07-08 23:11:41.000000000 -0700 @@ -528,6 +528,8 @@ config UID16 depends on IA32_EMULATION default y +source "drivers/perfctr/Kconfig" + endmenu source drivers/Kconfig diff -puN arch/x86_64/kernel/entry.S~perfctr arch/x86_64/kernel/entry.S --- devel/arch/x86_64/kernel/entry.S~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/arch/x86_64/kernel/entry.S 2005-07-08 23:11:41.000000000 -0700 @@ -554,6 +554,11 @@ ENTRY(spurious_interrupt) apicinterrupt SPURIOUS_APIC_VECTOR,smp_spurious_interrupt #endif +#if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_PERFCTR) +ENTRY(perfctr_interrupt) + apicinterrupt LOCAL_PERFCTR_VECTOR,smp_perfctr_interrupt +#endif + /* * Exception entry points. 
*/ diff -puN arch/x86_64/kernel/i8259.c~perfctr arch/x86_64/kernel/i8259.c --- devel/arch/x86_64/kernel/i8259.c~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/arch/x86_64/kernel/i8259.c 2005-07-08 23:11:41.000000000 -0700 @@ -24,6 +24,7 @@ #include #include #include +#include #include @@ -579,6 +580,8 @@ void __init init_IRQ(void) set_intr_gate(ERROR_APIC_VECTOR, error_interrupt); #endif + perfctr_vector_init(); + /* * Set the clock to HZ Hz, we already have a valid * vector now: diff -puN arch/x86_64/kernel/process.c~perfctr arch/x86_64/kernel/process.c --- devel/arch/x86_64/kernel/process.c~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/arch/x86_64/kernel/process.c 2005-07-08 23:11:41.000000000 -0700 @@ -35,6 +35,7 @@ #include #include #include +#include #include #include @@ -341,6 +342,7 @@ void exit_thread(void) t->io_bitmap_max = 0; put_cpu(); } + perfctr_exit_thread(&me->thread); } void flush_thread(void) @@ -450,6 +452,8 @@ int copy_thread(int nr, unsigned long cl asm("mov %%es,%0" : "=m" (p->thread.es)); asm("mov %%ds,%0" : "=m" (p->thread.ds)); + perfctr_copy_task(p, regs); + if (unlikely(me->thread.io_bitmap_ptr != NULL)) { p->thread.io_bitmap_ptr = kmalloc(IO_BITMAP_BYTES, GFP_KERNEL); if (!p->thread.io_bitmap_ptr) { @@ -628,6 +632,8 @@ struct task_struct *__switch_to(struct t disable_tsc(prev_p, next_p); + perfctr_resume_thread(next); + return prev_p; } diff -puN include/asm-x86_64/hw_irq.h~perfctr include/asm-x86_64/hw_irq.h --- devel/include/asm-x86_64/hw_irq.h~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/include/asm-x86_64/hw_irq.h 2005-07-08 23:11:41.000000000 -0700 @@ -65,14 +65,15 @@ struct hw_interrupt_type; * sources per level' errata. */ #define LOCAL_TIMER_VECTOR 0xef +#define LOCAL_PERFCTR_VECTOR 0xee /* - * First APIC vector available to drivers: (vectors 0x30-0xee) + * First APIC vector available to drivers: (vectors 0x30-0xed) * we start at 0x31 to spread out vectors evenly between priority * levels. (0x80 is the syscall vector) */ #define FIRST_DEVICE_VECTOR 0x31 -#define FIRST_SYSTEM_VECTOR 0xef /* duplicated in irq.h */ +#define FIRST_SYSTEM_VECTOR 0xee /* duplicated in irq.h */ #ifndef __ASSEMBLY__ diff -puN include/asm-x86_64/ia32_unistd.h~perfctr include/asm-x86_64/ia32_unistd.h --- devel/include/asm-x86_64/ia32_unistd.h~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/include/asm-x86_64/ia32_unistd.h 2005-07-08 23:11:41.000000000 -0700 @@ -294,7 +294,11 @@ #define __NR_ia32_add_key 286 #define __NR_ia32_request_key 287 #define __NR_ia32_keyctl 288 +#define __NR_ia32_vperfctr_open 291 +#define __NR_ia32_vperfctr_control (__NR_ia32_vperfctr_open+1) +#define __NR_ia32_vperfctr_write (__NR_ia32_vperfctr_open+2) +#define __NR_ia32_vperfctr_read (__NR_ia32_vperfctr_open+3) -#define IA32_NR_syscalls 290 /* must be > than biggest syscall! */ +#define IA32_NR_syscalls 295 /* must be > than biggest syscall! 
*/ #endif /* _ASM_X86_64_IA32_UNISTD_H_ */ diff -puN include/asm-x86_64/irq.h~perfctr include/asm-x86_64/irq.h --- devel/include/asm-x86_64/irq.h~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/include/asm-x86_64/irq.h 2005-07-08 23:11:41.000000000 -0700 @@ -29,7 +29,7 @@ */ #define NR_VECTORS 256 -#define FIRST_SYSTEM_VECTOR 0xef /* duplicated in hw_irq.h */ +#define FIRST_SYSTEM_VECTOR 0xee /* duplicated in hw_irq.h */ #ifdef CONFIG_PCI_MSI #define NR_IRQS FIRST_SYSTEM_VECTOR diff -puN /dev/null include/asm-x86_64/perfctr.h --- /dev/null 2003-09-15 06:40:47.000000000 -0700 +++ devel-akpm/include/asm-x86_64/perfctr.h 2005-07-08 23:11:41.000000000 -0700 @@ -0,0 +1 @@ +#include diff -puN include/asm-x86_64/processor.h~perfctr include/asm-x86_64/processor.h --- devel/include/asm-x86_64/processor.h~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/include/asm-x86_64/processor.h 2005-07-08 23:11:41.000000000 -0700 @@ -252,6 +252,8 @@ struct thread_struct { unsigned io_bitmap_max; /* cached TLS descriptors. */ u64 tls_array[GDT_ENTRY_TLS_ENTRIES]; +/* performance counters */ + struct vperfctr *perfctr; } __attribute__((aligned(16))); #define INIT_THREAD {} diff -puN include/asm-x86_64/unistd.h~perfctr include/asm-x86_64/unistd.h --- devel/include/asm-x86_64/unistd.h~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/include/asm-x86_64/unistd.h 2005-07-08 23:11:41.000000000 -0700 @@ -565,8 +565,16 @@ __SYSCALL(__NR_keyctl, sys_keyctl) __SYSCALL(__NR_ioprio_set, sys_ioprio_set) #define __NR_ioprio_get 252 __SYSCALL(__NR_ioprio_get, sys_ioprio_get) +#define __NR_vperfctr_open 253 +__SYSCALL(__NR_vperfctr_open, sys_vperfctr_open) +#define __NR_vperfctr_control (__NR_vperfctr_open+1) +__SYSCALL(__NR_vperfctr_control, sys_vperfctr_control) +#define __NR_vperfctr_write (__NR_vperfctr_open+2) +__SYSCALL(__NR_vperfctr_write, sys_vperfctr_write) +#define __NR_vperfctr_read (__NR_vperfctr_open+3) +__SYSCALL(__NR_vperfctr_read, sys_vperfctr_read) -#define __NR_syscall_max __NR_ioprio_get +#define __NR_syscall_max __NR_vperfctr_read #ifndef __NO_STUBS /* user-visible error numbers are in the range -1 - -4095 */ diff -puN include/asm-x86_64/system.h~perfctr include/asm-x86_64/system.h --- devel/include/asm-x86_64/system.h~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/include/asm-x86_64/system.h 2005-07-08 23:11:41.000000000 -0700 @@ -26,7 +26,8 @@ #define __EXTRA_CLOBBER \ ,"rcx","rbx","rdx","r8","r9","r10","r11","r12","r13","r14","r15" -#define switch_to(prev,next,last) \ +#define switch_to(prev,next,last) do { \ + perfctr_suspend_thread(&(prev)->thread); \ asm volatile(SAVE_CONTEXT \ "movq %%rsp,%P[threadrsp](%[prev])\n\t" /* save RSP */ \ "movq %P[threadrsp](%[next]),%%rsp\n\t" /* restore RSP */ \ @@ -46,7 +47,8 @@ [tif_fork] "i" (TIF_FORK), \ [thread_info] "i" (offsetof(struct task_struct, thread_info)), \ [pda_pcurrent] "i" (offsetof(struct x8664_pda, pcurrent)) \ - : "memory", "cc" __EXTRA_CLOBBER) + : "memory", "cc" __EXTRA_CLOBBER); \ +} while (0) extern void load_gs_index(unsigned); diff -puN arch/ppc/Kconfig~perfctr arch/ppc/Kconfig --- devel/arch/ppc/Kconfig~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/arch/ppc/Kconfig 2005-07-08 23:11:41.000000000 -0700 @@ -280,6 +280,8 @@ config NOT_COHERENT_CACHE depends on 4xx || 8xx || E200 default y +source "drivers/perfctr/Kconfig" + endmenu menu "Platform options" diff -puN arch/ppc/kernel/misc.S~perfctr arch/ppc/kernel/misc.S --- devel/arch/ppc/kernel/misc.S~perfctr 2005-07-08 23:11:41.000000000 -0700 
+++ devel-akpm/arch/ppc/kernel/misc.S 2005-07-08 23:11:41.000000000 -0700 @@ -1453,3 +1453,7 @@ _GLOBAL(sys_call_table) .long sys_ioprio_get .long sys_pselect6 /* 275 */ .long sys_ppoll + .long sys_vperfctr_open + .long sys_vperfctr_control + .long sys_vperfctr_write + .long sys_vperfctr_read /* 280 */ diff -puN arch/ppc/kernel/process.c~perfctr arch/ppc/kernel/process.c --- devel/arch/ppc/kernel/process.c~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/arch/ppc/kernel/process.c 2005-07-08 23:11:41.000000000 -0700 @@ -35,6 +35,7 @@ #include #include #include +#include #include #include @@ -301,7 +302,9 @@ struct task_struct *__switch_to(struct t #endif /* CONFIG_SPE */ new_thread = &new->thread; old_thread = ¤t->thread; + perfctr_suspend_thread(&prev->thread); last = _switch(old_thread, new_thread); + perfctr_resume_thread(¤t->thread); local_irq_restore(s); return last; } @@ -363,6 +366,7 @@ void exit_thread(void) if (last_task_used_spe == current) last_task_used_spe = NULL; #endif + perfctr_exit_thread(¤t->thread); } void flush_thread(void) @@ -455,6 +459,8 @@ copy_thread(int nr, unsigned long clone_ p->thread.last_syscall = -1; + perfctr_copy_task(p, regs); + return 0; } diff -puN /dev/null drivers/perfctr/ppc.c --- /dev/null 2003-09-15 06:40:47.000000000 -0700 +++ devel-akpm/drivers/perfctr/ppc.c 2005-07-08 23:11:41.000000000 -0700 @@ -0,0 +1,1094 @@ +/* $Id: ppc.c,v 1.39 2005/04/08 14:36:49 mikpe Exp $ + * PPC32 performance-monitoring counters driver. + * + * Copyright (C) 2004-2005 Mikael Pettersson + */ +#include +#include +#include +#include +#include +#include +#include /* tb_ticks_per_jiffy, get_tbl() */ + +#include "ppc_tests.h" + +/* Support for lazy evntsel and perfctr SPR updates. */ +struct per_cpu_cache { /* roughly a subset of perfctr_cpu_state */ + unsigned int id; /* cache owner id */ + /* Physically indexed cache of the MMCRs. */ + unsigned int ppc_mmcr[3]; +}; +static DEFINE_PER_CPU(struct per_cpu_cache, per_cpu_cache); +#define __get_cpu_cache(cpu) (&per_cpu(per_cpu_cache, cpu)) +#define get_cpu_cache() (&__get_cpu_var(per_cpu_cache)) + +/* Structure for counter snapshots, as 32-bit values. */ +struct perfctr_low_ctrs { + unsigned int tsc; + unsigned int pmc[6]; +}; + +enum pm_type { + PM_NONE, + PM_604, + PM_604e, + PM_750, /* XXX: Minor event set diffs between IBM and Moto. */ + PM_7400, + PM_7450, +}; +static enum pm_type pm_type; + +static unsigned int new_id(void) +{ + static DEFINE_SPINLOCK(lock); + static unsigned int counter; + int id; + + spin_lock(&lock); + id = ++counter; + spin_unlock(&lock); + return id; +} + +#ifdef CONFIG_PERFCTR_INTERRUPT_SUPPORT +static void perfctr_default_ihandler(unsigned long pc) +{ +} + +static perfctr_ihandler_t perfctr_ihandler = perfctr_default_ihandler; + +void do_perfctr_interrupt(struct pt_regs *regs) +{ + preempt_disable(); + (*perfctr_ihandler)(instruction_pointer(regs)); + preempt_enable_no_resched(); +} + +void perfctr_cpu_set_ihandler(perfctr_ihandler_t ihandler) +{ + perfctr_ihandler = ihandler ? 
ihandler : perfctr_default_ihandler; +} + +#else +#define perfctr_cstatus_has_ictrs(cstatus) 0 +#endif + +#if defined(CONFIG_SMP) && defined(CONFIG_PERFCTR_INTERRUPT_SUPPORT) + +static inline void +set_isuspend_cpu(struct perfctr_cpu_state *state, int cpu) +{ + state->isuspend_cpu = cpu; +} + +static inline int +is_isuspend_cpu(const struct perfctr_cpu_state *state, int cpu) +{ + return state->isuspend_cpu == cpu; +} + +static inline void clear_isuspend_cpu(struct perfctr_cpu_state *state) +{ + state->isuspend_cpu = NR_CPUS; +} + +#else +static inline void set_isuspend_cpu(struct perfctr_cpu_state *state, int cpu) { } +static inline int is_isuspend_cpu(const struct perfctr_cpu_state *state, int cpu) { return 1; } +static inline void clear_isuspend_cpu(struct perfctr_cpu_state *state) { } +#endif + +/* The ppc driver internally uses cstatus & (1<<30) to record that + a context has an asynchronously changing MMCR0. */ +static inline unsigned int perfctr_cstatus_set_mmcr0_quirk(unsigned int cstatus) +{ + return cstatus | (1 << 30); +} + +static inline int perfctr_cstatus_has_mmcr0_quirk(unsigned int cstatus) +{ + return cstatus & (1 << 30); +} + +/**************************************************************** + * * + * Driver procedures. * + * * + ****************************************************************/ + +/* + * The PowerPC 604/750/74xx family. + * + * Common features + * --------------- + * - Per counter event selection data in subfields of control registers. + * MMCR0 contains both global control and PMC1/PMC2 event selectors. + * - Overflow interrupt support is present in all processors, but an + * erratum makes it difficult to use in 750/7400/7410 processors. + * - There is no concept of per-counter qualifiers: + * - User-mode/supervisor-mode restrictions are global. + * - Two groups of counters, PMC1 and PMC2-PMC. Each group + * has a single overflow interrupt/event enable/disable flag. + * - The instructions used to read (mfspr) and write (mtspr) the control + * and counter registers (SPRs) only support hardcoded register numbers. + * There is no support for accessing an SPR via a runtime value. + * - Each counter supports its own unique set of events. However, events + * 0-1 are common for PMC1-PMC4, and events 2-4 are common for PMC1-PMC4. + * - There is no separate high-resolution core clock counter. + * The time-base counter is available, but it typically runs an order of + * magnitude slower than the core clock. + * Any performance counter can be programmed to count core clocks, but + * doing this (a) reserves one PMC, and (b) needs indirect accesses + * since the SPR number in general isn't known at compile-time. + * + * 604 + * --- + * 604 has MMCR0, PMC1, PMC2, SIA, and SDA. + * + * MMCR0[THRESHOLD] is not automatically multiplied. + * + * On the 604, software must always reset MMCR0[ENINT] after + * taking a PMI. This is not the case for the 604e. + * + * 604e + * ---- + * 604e adds MMCR1, PMC3, and PMC4. + * Bus-to-core multiplier is available via HID1[PLL_CFG]. + * + * MMCR0[THRESHOLD] is automatically multiplied by 4. + * + * When the 604e vectors to the PMI handler, it automatically + * clears any pending PMIs. Unlike the 604, the 604e does not + * require MMCR0[ENINT] to be cleared (and possibly reset) + * before external interrupts can be re-enabled. + * + * 750 + * --- + * 750 adds user-readable MMCRn/PMCn/SIA registers, and removes SDA. + * + * MMCR0[THRESHOLD] is not automatically multiplied. 
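+ *
+ * For reference, the PMC1/PMC2 event selectors mentioned under
+ * "Common features" live in MMCR0 itself; ppc_check_control()
+ * below extracts them as
+ *
+ *	pmc1sel = (mmcr0 >> (31-25)) & 0x7F;
+ *	pmc2sel = (mmcr0 >> (31-31)) & 0x3F;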
+ * + * Motorola MPC750UM.pdf, page C-78, states: "The performance monitor + * of the MPC755 functions the same as that of the MPC750, (...), except + * that for both the MPC750 and MPC755, no combination of the thermal + * assist unit, the decrementer register, and the performance monitor + * can be used at any one time. If exceptions for any two of these + * functional blocks are enabled together, multiple exceptions caused + * by any of these three blocks cause unpredictable results." + * + * IBM 750CXe_Err_DD2X.pdf, Erratum #13, states that a PMI which + * occurs immediately after a delayed decrementer exception can + * corrupt SRR0, causing the processor to hang. It also states that + * PMIs via TB bit transitions can be used to simulate the decrementer. + * + * 750FX adds dual-PLL support and programmable core frequency switching. + * + * 750FX DD2.3 fixed the DEC/PMI SRR0 corruption erratum. + * + * 74xx + * ---- + * 7400 adds MMCR2 and BAMR. + * + * MMCR0[THRESHOLD] is multiplied by 2 or 32, as specified + * by MMCR2[THRESHMULT]. + * + * 74xx changes the semantics of several MMCR0 control bits, + * compared to 604/750. + * + * PPC7410 Erratum No. 10: Like the MPC750 TAU/DECR/PMI erratum. + * Erratum No. 14 marks TAU as unsupported in 7410, but this leaves + * perfmon and decrementer interrupts as being mutually exclusive. + * Affects PPC7410 1.0-1.2 (PVR 0x800C1100-0x800C1102). 1.3 and up + * (PVR 0x800C1103 up) are Ok. + * + * 7450 adds PMC5 and PMC6. + * + * 7455/7445 V3.3 (PVR 80010303) and later use the 7457 PLL table, + * earlier revisions use the 7450 PLL table + */ + +static inline unsigned int read_pmc(unsigned int pmc) +{ + switch (pmc) { + default: /* impossible, but silences gcc warning */ + case 0: + return mfspr(SPRN_PMC1); + case 1: + return mfspr(SPRN_PMC2); + case 2: + return mfspr(SPRN_PMC3); + case 3: + return mfspr(SPRN_PMC4); + case 4: + return mfspr(SPRN_PMC5); + case 5: + return mfspr(SPRN_PMC6); + } +} + +static void ppc_read_counters(struct perfctr_cpu_state *state, + struct perfctr_low_ctrs *ctrs) +{ + unsigned int cstatus, nrctrs, i; + + cstatus = state->user.cstatus; + if (perfctr_cstatus_has_tsc(cstatus)) + ctrs->tsc = get_tbl(); + nrctrs = perfctr_cstatus_nractrs(cstatus); + for(i = 0; i < nrctrs; ++i) { + unsigned int pmc = state->control.pmc_map[i]; + ctrs->pmc[i] = read_pmc(pmc); + } +} + +static unsigned int pmc_max_event(unsigned int pmc) +{ + switch (pmc) { + default: /* impossible, but silences gcc warning */ + case 0: + return 127; + case 1: + return 63; + case 2: + return 31; + case 3: + return 31; + case 4: + return 31; + case 5: + return 63; + } +} + +static unsigned int get_nr_pmcs(void) +{ + switch (pm_type) { + case PM_7450: + return 6; + case PM_7400: + case PM_750: + case PM_604e: + return 4; + case PM_604: + return 2; + default: /* PM_NONE, but silences gcc warning */ + return 0; + } +} + +static int ppc_check_control(struct perfctr_cpu_state *state) +{ + unsigned int i, nractrs, nrctrs, pmc_mask, pmi_mask, pmc; + unsigned int nr_pmcs, evntsel[6]; + + nr_pmcs = get_nr_pmcs(); + nractrs = state->control.header.nractrs; + nrctrs = nractrs + state->control.header.nrictrs; + if (nrctrs < nractrs || nrctrs > nr_pmcs) + return -EINVAL; + + pmc_mask = 0; + pmi_mask = 0; + evntsel[1-1] = (state->control.mmcr0 >> (31-25)) & 0x7F; + evntsel[2-1] = (state->control.mmcr0 >> (31-31)) & 0x3F; + evntsel[3-1] = (state->control.mmcr1 >> (31- 4)) & 0x1F; + evntsel[4-1] = (state->control.mmcr1 >> (31- 9)) & 0x1F; + evntsel[5-1] = (state->control.mmcr1 >> 
(31-14)) & 0x1F; + evntsel[6-1] = (state->control.mmcr1 >> (31-20)) & 0x3F; + + for(i = 0; i < nrctrs; ++i) { + pmc = state->control.pmc_map[i]; + if (pmc >= nr_pmcs || (pmc_mask & (1<<pmc))) + return -EINVAL; + pmc_mask |= (1<<pmc); + if (i >= nractrs) + pmi_mask |= (1<<pmc); + if (evntsel[pmc] > pmc_max_event(pmc)) + return -EINVAL; + } + + /* unused event selectors must be zero */ + for(i = 0; i < ARRAY_SIZE(evntsel); ++i) + if (!(pmc_mask & (1<<i)) && evntsel[i] != 0) + return -EINVAL; + + switch (pm_type) { + case PM_7450: + case PM_7400: + if (state->control.mmcr2 & MMCR2_RESERVED) + return -EINVAL; + break; + default: + if (state->control.mmcr2) + return -EINVAL; + } + + /* check MMCR1; non-existent event selectors are taken care of + by the "unused event selectors must be zero" check above */ + if (state->control.mmcr1 & MMCR1__RESERVED) + return -EINVAL; + + /* We do not yet handle TBEE as the only exception cause, + so PMXE requires at least one interrupt-mode counter. */ + if ((state->control.mmcr0 & MMCR0_PMXE) && !state->control.header.nrictrs) + return -EINVAL; + + state->id = new_id(); + + /* + * MMCR0[FC] and MMCR0[TRIGGER] may change on 74xx if FCECE or + * TRIGGER is set. At suspends we must read MMCR0 back into + * the state and the cache and then freeze the counters, and + * at resumes we must unfreeze the counters and reload MMCR0. + */ + switch (pm_type) { + case PM_7450: + case PM_7400: + if (state->control.mmcr0 & (MMCR0_FCECE | MMCR0_TRIGGER)) + state->user.cstatus = perfctr_cstatus_set_mmcr0_quirk(state->user.cstatus); + default: + ; + } + + /* The MMCR0 handling for FCECE and TRIGGER is also needed for PMXE. */ + if (state->control.mmcr0 & (MMCR0_PMXE | MMCR0_FCECE | MMCR0_TRIGGER)) + state->user.cstatus = perfctr_cstatus_set_mmcr0_quirk(state->user.cstatus); + + return 0; +} + +#ifdef CONFIG_PERFCTR_INTERRUPT_SUPPORT +/* PRE: perfctr_cstatus_has_ictrs(state->cstatus) != 0 */ +/* PRE: counters frozen */ +static void ppc_isuspend(struct perfctr_cpu_state *state) +{ + struct per_cpu_cache *cache; + unsigned int cstatus, nrctrs, i; + int cpu; + + cpu = smp_processor_id(); + set_isuspend_cpu(state, cpu); /* early to limit cpu's live range */ + cache = __get_cpu_cache(cpu); + cstatus = state->user.cstatus; + nrctrs = perfctr_cstatus_nrctrs(cstatus); + for(i = perfctr_cstatus_nractrs(cstatus); i < nrctrs; ++i) { + unsigned int pmc = state->control.pmc_map[i]; + unsigned int now = read_pmc(pmc); + state->user.pmc[i].sum += now - state->user.pmc[i].start; + state->user.pmc[i].start = now; + } + /* cache->id is still == state->id */ +} + +static void ppc_iresume(const struct perfctr_cpu_state *state) +{ + struct per_cpu_cache *cache; + unsigned int cstatus, nrctrs, i; + int cpu; + unsigned int pmc[6]; + + cpu = smp_processor_id(); + cache = __get_cpu_cache(cpu); + if (cache->id == state->id) { + /* Clearing cache->id to force write_control() + to unfreeze MMCR0 would be done here, but it + is subsumed by resume()'s MMCR0 reload logic. */ + if (is_isuspend_cpu(state, cpu)) + return; /* skip reload of PMCs */ + } + /* + * The CPU state wasn't ours. + * + * The counters must be frozen before being reinitialised, + * to prevent unexpected increments and missed overflows. + * + * All unused counters must be reset to a non-overflow state.
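+ *
+ * (On this hardware an overflowed counter is one whose MSB is
+ * set, see perfctr_cpu_identify_overflow() below, so a value
+ * with bit 31 clear, such as zero, is a non-overflow state.)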
+ */ + if (!(cache->ppc_mmcr[0] & MMCR0_FC)) { + cache->ppc_mmcr[0] |= MMCR0_FC; + mtspr(SPRN_MMCR0, cache->ppc_mmcr[0]); + } + memset(&pmc[0], 0, sizeof pmc); + cstatus = state->user.cstatus; + nrctrs = perfctr_cstatus_nrctrs(cstatus); + for(i = perfctr_cstatus_nractrs(cstatus); i < nrctrs; ++i) + pmc[state->control.pmc_map[i]] = state->user.pmc[i].start; + + switch (pm_type) { + case PM_7450: + mtspr(SPRN_PMC6, pmc[6-1]); + mtspr(SPRN_PMC5, pmc[5-1]); + case PM_7400: + case PM_750: + case PM_604e: + mtspr(SPRN_PMC4, pmc[4-1]); + mtspr(SPRN_PMC3, pmc[3-1]); + case PM_604: + mtspr(SPRN_PMC2, pmc[2-1]); + mtspr(SPRN_PMC1, pmc[1-1]); + case PM_NONE: + ; + } + /* cache->id remains != state->id */ +} +#endif + +static void ppc_write_control(const struct perfctr_cpu_state *state) +{ + struct per_cpu_cache *cache; + unsigned int value; + + cache = get_cpu_cache(); + if (cache->id == state->id) + return; + /* + * Order matters here: update threshmult and event + * selectors before updating global control, which + * potentially enables PMIs. + * + * Since mtspr doesn't accept a runtime value for the + * SPR number, unroll the loop so each mtspr targets + * a constant SPR. + * + * For processors without MMCR2, we ensure that the + * cache and the state indicate the same value for it, + * preventing any actual mtspr to it. Ditto for MMCR1. + */ + value = state->control.mmcr2; + if (value != cache->ppc_mmcr[2]) { + cache->ppc_mmcr[2] = value; + mtspr(SPRN_MMCR2, value); + } + value = state->control.mmcr1; + if (value != cache->ppc_mmcr[1]) { + cache->ppc_mmcr[1] = value; + mtspr(SPRN_MMCR1, value); + } + value = state->control.mmcr0; + if (value != cache->ppc_mmcr[0]) { + cache->ppc_mmcr[0] = value; + mtspr(SPRN_MMCR0, value); + } + cache->id = state->id; +} + +static void ppc_clear_counters(void) +{ + switch (pm_type) { + case PM_7450: + case PM_7400: + mtspr(SPRN_MMCR2, 0); + mtspr(SPRN_BAMR, 0); + case PM_750: + case PM_604e: + mtspr(SPRN_MMCR1, 0); + case PM_604: + mtspr(SPRN_MMCR0, 0); + case PM_NONE: + ; + } + switch (pm_type) { + case PM_7450: + mtspr(SPRN_PMC6, 0); + mtspr(SPRN_PMC5, 0); + case PM_7400: + case PM_750: + case PM_604e: + mtspr(SPRN_PMC4, 0); + mtspr(SPRN_PMC3, 0); + case PM_604: + mtspr(SPRN_PMC2, 0); + mtspr(SPRN_PMC1, 0); + case PM_NONE: + ; + } +} + +/* + * Driver methods, internal and exported. + */ + +static void perfctr_cpu_write_control(const struct perfctr_cpu_state *state) +{ + return ppc_write_control(state); +} + +static void perfctr_cpu_read_counters(struct perfctr_cpu_state *state, + struct perfctr_low_ctrs *ctrs) +{ + return ppc_read_counters(state, ctrs); +} + +#ifdef CONFIG_PERFCTR_INTERRUPT_SUPPORT +static void perfctr_cpu_isuspend(struct perfctr_cpu_state *state) +{ + return ppc_isuspend(state); +} + +static void perfctr_cpu_iresume(const struct perfctr_cpu_state *state) +{ + return ppc_iresume(state); +} + +/* Call perfctr_cpu_ireload() just before perfctr_cpu_resume() to + bypass internal caching and force a reload of the I-mode PMCs.
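+
+   The intended caller pattern is roughly (an illustrative sketch
+   only; in this patch the caller is the per-process layer, which
+   records pending overflows at suspend):
+
+	if (overflow_was_pending)
+		perfctr_cpu_ireload(state);
+	perfctr_cpu_resume(state);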
*/ +void perfctr_cpu_ireload(struct perfctr_cpu_state *state) +{ + state->control.mmcr0 |= MMCR0_PMXE; +#ifdef CONFIG_SMP + clear_isuspend_cpu(state); +#else + get_cpu_cache()->id = 0; +#endif +} + +/* PRE: the counters have been suspended and sampled by perfctr_cpu_suspend() */ +unsigned int perfctr_cpu_identify_overflow(struct perfctr_cpu_state *state) +{ + unsigned int cstatus, nrctrs, i, pmc_mask; + + cstatus = state->user.cstatus; + nrctrs = perfctr_cstatus_nrctrs(cstatus); + pmc_mask = 0; + for(i = perfctr_cstatus_nractrs(cstatus); i < nrctrs; ++i) { + if ((int)state->user.pmc[i].start < 0) { /* PPC-specific */ + unsigned int pmc = state->control.pmc_map[i]; + /* XXX: "+=" to correct for overshots */ + state->user.pmc[i].start = state->control.ireset[pmc]; + pmc_mask |= (1 << i); + } + } + if (!pmc_mask && (state->control.mmcr0 & MMCR0_TBEE)) + pmc_mask = (1<<8); /* fake TB bit flip indicator */ + return pmc_mask; +} + +static inline int check_ireset(struct perfctr_cpu_state *state) +{ + unsigned int nrctrs, i; + + i = state->control.header.nractrs; + nrctrs = i + state->control.header.nrictrs; + for(; i < nrctrs; ++i) { + unsigned int pmc = state->control.pmc_map[i]; + if ((int)state->control.ireset[pmc] < 0) /* PPC-specific */ + return -EINVAL; + state->user.pmc[i].start = state->control.ireset[pmc]; + } + return 0; +} + +#else /* CONFIG_PERFCTR_INTERRUPT_SUPPORT */ +static inline void perfctr_cpu_isuspend(struct perfctr_cpu_state *state) { } +static inline void perfctr_cpu_iresume(const struct perfctr_cpu_state *state) { } +static inline int check_ireset(struct perfctr_cpu_state *state) { return 0; } +#endif /* CONFIG_PERFCTR_INTERRUPT_SUPPORT */ + +static int check_control(struct perfctr_cpu_state *state) +{ + return ppc_check_control(state); +} + +int perfctr_cpu_update_control(struct perfctr_cpu_state *state, int is_global) +{ + int err; + + clear_isuspend_cpu(state); + state->user.cstatus = 0; + + /* disallow i-mode counters if we cannot catch the interrupts */ + if (!(perfctr_info.cpu_features & PERFCTR_FEATURE_PCINT) + && state->control.header.nrictrs) + return -EPERM; + + err = check_control(state); /* may initialise state->cstatus */ + if (err < 0) + return err; + err = check_ireset(state); + if (err < 0) { + state->user.cstatus = 0; + return err; + } + state->user.cstatus |= perfctr_mk_cstatus(state->control.header.tsc_on, + state->control.header.nractrs, + state->control.header.nrictrs); + return 0; +} + +/* + * get_reg_offset() maps SPR numbers to offsets into struct perfctr_cpu_control, + * suitable for accessing control data of type unsigned int. 
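+ *
+ * This is what lets user space name control registers by SPR number;
+ * an illustrative use (not a new API, the types are from this file):
+ *
+ *	struct perfctr_cpu_reg reg = { .nr = SPRN_MMCR0, .value = mmcr0 };
+ *	perfctr_cpu_control_write(control, PERFCTR_DOMAIN_CPU_REGS,
+ *				  &reg, sizeof reg);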
+ */ +static const struct { + unsigned int spr; + unsigned int offset; +} reg_offsets[] = { + { SPRN_MMCR0, offsetof(struct perfctr_cpu_control, mmcr0) }, + { SPRN_MMCR1, offsetof(struct perfctr_cpu_control, mmcr1) }, + { SPRN_MMCR2, offsetof(struct perfctr_cpu_control, mmcr2) }, + { SPRN_PMC1, offsetof(struct perfctr_cpu_control, ireset[1-1]) }, + { SPRN_PMC2, offsetof(struct perfctr_cpu_control, ireset[2-1]) }, + { SPRN_PMC3, offsetof(struct perfctr_cpu_control, ireset[3-1]) }, + { SPRN_PMC4, offsetof(struct perfctr_cpu_control, ireset[4-1]) }, + { SPRN_PMC5, offsetof(struct perfctr_cpu_control, ireset[5-1]) }, + { SPRN_PMC6, offsetof(struct perfctr_cpu_control, ireset[6-1]) }, +}; + +static int get_reg_offset(unsigned int spr) +{ + unsigned int i; + + for(i = 0; i < ARRAY_SIZE(reg_offsets); ++i) + if (spr == reg_offsets[i].spr) + return reg_offsets[i].offset; + return -1; +} + +static int access_regs(struct perfctr_cpu_control *control, + void *argp, unsigned int argbytes, int do_write) +{ + struct perfctr_cpu_reg *regs; + unsigned int i, nr_regs, *where; + int offset; + + nr_regs = argbytes / sizeof(struct perfctr_cpu_reg); + if (nr_regs * sizeof(struct perfctr_cpu_reg) != argbytes) + return -EINVAL; + regs = (struct perfctr_cpu_reg*)argp; + + for(i = 0; i < nr_regs; ++i) { + offset = get_reg_offset(regs[i].nr); + if (offset < 0) + return -EINVAL; + where = (unsigned int*)((char*)control + offset); + if (do_write) + *where = regs[i].value; + else + regs[i].value = *where; + } + return argbytes; +} + +int perfctr_cpu_control_write(struct perfctr_cpu_control *control, unsigned int domain, + const void *srcp, unsigned int srcbytes) +{ + if (domain != PERFCTR_DOMAIN_CPU_REGS) + return -EINVAL; + return access_regs(control, (void*)srcp, srcbytes, 1); +} + +int perfctr_cpu_control_read(const struct perfctr_cpu_control *control, unsigned int domain, + void *dstp, unsigned int dstbytes) +{ + if (domain != PERFCTR_DOMAIN_CPU_REGS) + return -EINVAL; + return access_regs((struct perfctr_cpu_control*)control, dstp, dstbytes, 0); +} + +void perfctr_cpu_suspend(struct perfctr_cpu_state *state) +{ + unsigned int i, cstatus, nractrs; + struct perfctr_low_ctrs now; + + write_perfseq_begin(&state->user.sequence); + if (perfctr_cstatus_has_mmcr0_quirk(state->user.cstatus)) { + unsigned int mmcr0 = mfspr(SPRN_MMCR0); + mtspr(SPRN_MMCR0, mmcr0 | MMCR0_FC); + get_cpu_cache()->ppc_mmcr[0] = mmcr0 | MMCR0_FC; + state->control.mmcr0 = mmcr0; + } + if (perfctr_cstatus_has_ictrs(state->user.cstatus)) + perfctr_cpu_isuspend(state); + perfctr_cpu_read_counters(state, &now); + cstatus = state->user.cstatus; + if (perfctr_cstatus_has_tsc(cstatus)) + state->user.tsc_sum += now.tsc - state->user.tsc_start; + nractrs = perfctr_cstatus_nractrs(cstatus); + for(i = 0; i < nractrs; ++i) + state->user.pmc[i].sum += now.pmc[i] - state->user.pmc[i].start; + write_perfseq_end(&state->user.sequence); +} + +void perfctr_cpu_resume(struct perfctr_cpu_state *state) +{ + write_perfseq_begin(&state->user.sequence); + if (perfctr_cstatus_has_ictrs(state->user.cstatus)) + perfctr_cpu_iresume(state); + if (perfctr_cstatus_has_mmcr0_quirk(state->user.cstatus)) + get_cpu_cache()->id = 0; /* force reload of MMCR0 */ + perfctr_cpu_write_control(state); + //perfctr_cpu_read_counters(state, &state->start); + { + struct perfctr_low_ctrs now; + unsigned int i, cstatus, nrctrs; + perfctr_cpu_read_counters(state, &now); + cstatus = state->user.cstatus; + if (perfctr_cstatus_has_tsc(cstatus)) + state->user.tsc_start = now.tsc; + nrctrs = 
perfctr_cstatus_nractrs(cstatus); + for(i = 0; i < nrctrs; ++i) + state->user.pmc[i].start = now.pmc[i]; + } + write_perfseq_end(&state->user.sequence); +} + +void perfctr_cpu_sample(struct perfctr_cpu_state *state) +{ + unsigned int i, cstatus, nractrs; + struct perfctr_low_ctrs now; + + write_perfseq_begin(&state->user.sequence); + perfctr_cpu_read_counters(state, &now); + cstatus = state->user.cstatus; + if (perfctr_cstatus_has_tsc(cstatus)) { + state->user.tsc_sum += now.tsc - state->user.tsc_start; + state->user.tsc_start = now.tsc; + } + nractrs = perfctr_cstatus_nractrs(cstatus); + for(i = 0; i < nractrs; ++i) { + state->user.pmc[i].sum += now.pmc[i] - state->user.pmc[i].start; + state->user.pmc[i].start = now.pmc[i]; + } + write_perfseq_end(&state->user.sequence); +} + +static void perfctr_cpu_clear_counters(void) +{ + struct per_cpu_cache *cache; + + cache = get_cpu_cache(); + memset(cache, 0, sizeof *cache); + cache->id = -1; + + ppc_clear_counters(); +} + +/**************************************************************** + * * + * Processor detection and initialisation procedures. * + * * + ****************************************************************/ + +/* Derive CPU core frequency from TB frequency and PLL_CFG. */ + +enum pll_type { + PLL_NONE, /* for e.g. 604 which has no HID1[PLL_CFG] */ + PLL_604e, + PLL_750, + PLL_750FX, + PLL_7400, + PLL_7450, + PLL_7457, +}; + +/* These are the known bus-to-core ratios, indexed by PLL_CFG. + Multiplied by 2 since half-multiplier steps are present. */ + +static unsigned char cfg_ratio_604e[16] __initdata = { // *2 + 2, 2, 14, 2, 4, 13, 5, 9, + 6, 11, 8, 10, 3, 12, 7, 0 +}; + +static unsigned char cfg_ratio_750[16] __initdata = { // *2 + 5, 15, 14, 2, 4, 13, 20, 9, // 0b0110 is 18 if L1_TSTCLK=0, but that is abnormal + 6, 11, 8, 10, 16, 12, 7, 0 +}; + +static unsigned char cfg_ratio_750FX[32] __initdata = { // *2 + 0, 0, 2, 2, 4, 5, 6, 7, + 8, 9, 10, 11, 12, 13, 14, 15, + 16, 17, 18, 19, 20, 22, 24, 26, + 28, 30, 32, 34, 36, 38, 40, 0 +}; + +static unsigned char cfg_ratio_7400[16] __initdata = { // *2 + 18, 15, 14, 2, 4, 13, 5, 9, + 6, 11, 8, 10, 16, 12, 7, 0 +}; + +static unsigned char cfg_ratio_7450[32] __initdata = { // *2 + 1, 0, 15, 30, 14, 0, 2, 0, + 4, 0, 13, 26, 5, 0, 9, 18, + 6, 0, 11, 22, 8, 20, 10, 24, + 16, 28, 12, 32, 7, 0, 0, 0 +}; + +static unsigned char cfg_ratio_7457[32] __initdata = { // *2 + 23, 34, 15, 30, 14, 36, 2, 40, + 4, 42, 13, 26, 17, 48, 19, 18, + 6, 21, 11, 22, 8, 20, 10, 24, + 16, 28, 12, 32, 27, 56, 0, 25 +}; + +static unsigned int __init tb_to_core_ratio(enum pll_type pll_type) +{ + unsigned char *cfg_ratio; + unsigned int shift = 28, mask = 0xF, hid1, pll_cfg, ratio; + + switch (pll_type) { + case PLL_604e: + cfg_ratio = cfg_ratio_604e; + break; + case PLL_750: + cfg_ratio = cfg_ratio_750; + break; + case PLL_750FX: + cfg_ratio = cfg_ratio_750FX; + hid1 = mfspr(SPRN_HID1); + switch ((hid1 >> 16) & 0x3) { /* HID1[PI0,PS] */ + case 0: /* PLL0 with external config */ + shift = 31-4; /* access HID1[PCE] */ + break; + case 2: /* PLL0 with internal config */ + shift = 31-20; /* access HID1[PC0] */ + break; + case 1: case 3: /* PLL1 */ + shift = 31-28; /* access HID1[PC1] */ + break; + } + mask = 0x1F; + break; + case PLL_7400: + cfg_ratio = cfg_ratio_7400; + break; + case PLL_7450: + cfg_ratio = cfg_ratio_7450; + shift = 12; + mask = 0x1F; + break; + case PLL_7457: + cfg_ratio = cfg_ratio_7457; + shift = 12; + mask = 0x1F; + break; + default: + return 0; + } + hid1 = mfspr(SPRN_HID1); + pll_cfg = (hid1 >> 
shift) & mask; + ratio = cfg_ratio[pll_cfg]; + if (!ratio) + printk(KERN_WARNING "perfctr: unknown PLL_CFG 0x%x\n", pll_cfg); + return (4/2) * ratio; +} + +static unsigned int __init pll_to_core_khz(enum pll_type pll_type) +{ + unsigned int tb_to_core = tb_to_core_ratio(pll_type); + perfctr_info.tsc_to_cpu_mult = tb_to_core; + return tb_ticks_per_jiffy * tb_to_core * (HZ/10) / (1000/10); +} + +/* Extract core and timebase frequencies from Open Firmware. */ + +static unsigned int __init of_to_core_khz(void) +{ + struct device_node *cpu; + unsigned int *fp, core, tb; + + cpu = find_type_devices("cpu"); + if (!cpu) + return 0; + fp = (unsigned int*)get_property(cpu, "clock-frequency", NULL); + if (!fp || !(core = *fp)) + return 0; + fp = (unsigned int*)get_property(cpu, "timebase-frequency", NULL); + if (!fp || !(tb = *fp)) + return 0; + perfctr_info.tsc_to_cpu_mult = core / tb; + return core / 1000; +} + +static unsigned int __init detect_cpu_khz(enum pll_type pll_type) +{ + unsigned int khz; + + khz = pll_to_core_khz(pll_type); + if (khz) + return khz; + + khz = of_to_core_khz(); + if (khz) + return khz; + + printk(KERN_WARNING "perfctr: unable to determine CPU speed\n"); + return 0; +} + +static int __init known_init(void) +{ + static char known_name[] __initdata = "PowerPC 60x/7xx/74xx"; + unsigned int features; + enum pll_type pll_type; + unsigned int pvr; + int have_mmcr1; + + features = PERFCTR_FEATURE_RDTSC | PERFCTR_FEATURE_RDPMC; + have_mmcr1 = 1; + pvr = mfspr(SPRN_PVR); + switch (PVR_VER(pvr)) { + case 0x0004: /* 604 */ + pm_type = PM_604; + pll_type = PLL_NONE; + features = PERFCTR_FEATURE_RDTSC; + have_mmcr1 = 0; + break; + case 0x0009: /* 604e; */ + case 0x000A: /* 604ev */ + pm_type = PM_604e; + pll_type = PLL_604e; + features = PERFCTR_FEATURE_RDTSC; + break; + case 0x0008: /* 750/740 */ + pm_type = PM_750; + pll_type = PLL_750; + break; + case 0x7000: case 0x7001: /* IBM750FX */ + if ((pvr & 0xFF0F) >= 0x0203) + features |= PERFCTR_FEATURE_PCINT; + pm_type = PM_750; + pll_type = PLL_750FX; + break; + case 0x7002: /* IBM750GX */ + features |= PERFCTR_FEATURE_PCINT; + pm_type = PM_750; + pll_type = PLL_750FX; + break; + case 0x000C: /* 7400 */ + pm_type = PM_7400; + pll_type = PLL_7400; + break; + case 0x800C: /* 7410 */ + if ((pvr & 0xFFFF) >= 0x1103) + features |= PERFCTR_FEATURE_PCINT; + pm_type = PM_7400; + pll_type = PLL_7400; + break; + case 0x8000: /* 7451/7441 */ + features |= PERFCTR_FEATURE_PCINT; + pm_type = PM_7450; + pll_type = PLL_7450; + break; + case 0x8001: /* 7455/7445 */ + features |= PERFCTR_FEATURE_PCINT; + pm_type = PM_7450; + pll_type = ((pvr & 0xFFFF) < 0x0303) ? 
PLL_7450 : PLL_7457; + break; + case 0x8002: /* 7457/7447 */ + case 0x8003: /* 7447A */ + features |= PERFCTR_FEATURE_PCINT; + pm_type = PM_7450; + pll_type = PLL_7457; + break; + case 0x8004: /* 7448 */ + features |= PERFCTR_FEATURE_PCINT; + pm_type = PM_7450; + pll_type = PLL_NONE; /* known to differ from 7447A, no details yet */ + break; + default: + return -ENODEV; + } + perfctr_info.cpu_features = features; + perfctr_cpu_name = known_name; + perfctr_info.cpu_khz = detect_cpu_khz(pll_type); + perfctr_ppc_init_tests(have_mmcr1); + return 0; +} + +static int __init unknown_init(void) +{ + static char unknown_name[] __initdata = "Generic PowerPC with TB"; + unsigned int khz; + + khz = detect_cpu_khz(PLL_NONE); + if (!khz) + return -ENODEV; + perfctr_info.cpu_features = PERFCTR_FEATURE_RDTSC; + perfctr_cpu_name = unknown_name; + perfctr_info.cpu_khz = khz; + pm_type = PM_NONE; + return 0; +} + +static void perfctr_cpu_clear_one(void *ignore) +{ + /* PREEMPT note: when called via on_each_cpu(), + this is in IRQ context with preemption disabled. */ + perfctr_cpu_clear_counters(); +} + +static void perfctr_cpu_reset(void) +{ + on_each_cpu(perfctr_cpu_clear_one, NULL, 1, 1); + perfctr_cpu_set_ihandler(NULL); +} + +static int init_done; + +int __init perfctr_cpu_init(void) +{ + int err; + + perfctr_info.cpu_features = 0; + + err = known_init(); + if (err) { + err = unknown_init(); + if (err) + goto out; + } + + perfctr_cpu_reset(); + init_done = 1; + out: + return err; +} + +void __exit perfctr_cpu_exit(void) +{ + perfctr_cpu_reset(); +} + +/**************************************************************** + * * + * Hardware reservation. * + * * + ****************************************************************/ + +static DECLARE_MUTEX(mutex); +static const char *current_service = 0; + +const char *perfctr_cpu_reserve(const char *service) +{ + const char *ret; + + if (!init_done) + return "unsupported hardware"; + down(&mutex); + ret = current_service; + if (!ret) + current_service = service; + up(&mutex); + return ret; +} + +void perfctr_cpu_release(const char *service) +{ + down(&mutex); + if (service != current_service) { + printk(KERN_ERR "%s: attempt by %s to release while reserved by %s\n", + __FUNCTION__, service, current_service); + } else { + /* power down the counters */ + perfctr_cpu_reset(); + current_service = 0; + } + up(&mutex); +} diff -puN /dev/null drivers/perfctr/ppc_tests.c --- /dev/null 2003-09-15 06:40:47.000000000 -0700 +++ devel-akpm/drivers/perfctr/ppc_tests.c 2005-07-08 23:11:41.000000000 -0700 @@ -0,0 +1,288 @@ +/* $Id: ppc_tests.c,v 1.4 2004/05/21 16:57:53 mikpe Exp $ + * Performance-monitoring counters driver. + * Optional PPC32-specific init-time tests. 
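+ *
+ * The measurement method, as implemented below: PMC1 is programmed
+ * to count core cycles, each SPR access runs NITER times in an
+ * 8-way unrolled loop, and the reported cost is
+ * (ticks - loop_overhead) / NITER.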
+ * + * Copyright (C) 2004 Mikael Pettersson + */ +#include +#include +#include +#include +#include +#include +#include /* for tb_ticks_per_jiffy */ +#include "ppc_tests.h" + +#define NITER 256 +#define X2(S) S"; "S +#define X8(S) X2(X2(X2(S))) + +static void __init do_read_tbl(unsigned int unused) +{ + unsigned int i, dummy; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__(X8("mftbl %0") : "=r"(dummy)); +} + +static void __init do_read_pmc1(unsigned int unused) +{ + unsigned int i, dummy; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__(X8("mfspr %0," __stringify(SPRN_PMC1)) : "=r"(dummy)); +} + +static void __init do_read_pmc2(unsigned int unused) +{ + unsigned int i, dummy; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__(X8("mfspr %0," __stringify(SPRN_PMC2)) : "=r"(dummy)); +} + +static void __init do_read_pmc3(unsigned int unused) +{ + unsigned int i, dummy; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__(X8("mfspr %0," __stringify(SPRN_PMC3)) : "=r"(dummy)); +} + +static void __init do_read_pmc4(unsigned int unused) +{ + unsigned int i, dummy; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__(X8("mfspr %0," __stringify(SPRN_PMC4)) : "=r"(dummy)); +} + +static void __init do_read_mmcr0(unsigned int unused) +{ + unsigned int i, dummy; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__(X8("mfspr %0," __stringify(SPRN_MMCR0)) : "=r"(dummy)); +} + +static void __init do_read_mmcr1(unsigned int unused) +{ + unsigned int i, dummy; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__(X8("mfspr %0," __stringify(SPRN_MMCR1)) : "=r"(dummy)); +} + +static void __init do_write_pmc2(unsigned int arg) +{ + unsigned int i; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__(X8("mtspr " __stringify(SPRN_PMC2) ",%0") : : "r"(arg)); +} + +static void __init do_write_pmc3(unsigned int arg) +{ + unsigned int i; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__(X8("mtspr " __stringify(SPRN_PMC3) ",%0") : : "r"(arg)); +} + +static void __init do_write_pmc4(unsigned int arg) +{ + unsigned int i; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__(X8("mtspr " __stringify(SPRN_PMC4) ",%0") : : "r"(arg)); +} + +static void __init do_write_mmcr1(unsigned int arg) +{ + unsigned int i; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__(X8("mtspr " __stringify(SPRN_MMCR1) ",%0") : : "r"(arg)); +} + +static void __init do_write_mmcr0(unsigned int arg) +{ + unsigned int i; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__(X8("mtspr " __stringify(SPRN_MMCR0) ",%0") : : "r"(arg)); +} + +static void __init do_empty_loop(unsigned int unused) +{ + unsigned i; + for(i = 0; i < NITER/8; ++i) + __asm__ __volatile__("" : : ); +} + +static unsigned __init run(void (*doit)(unsigned int), unsigned int arg) +{ + unsigned int start, stop; + start = mfspr(SPRN_PMC1); + (*doit)(arg); /* should take < 2^32 cycles to complete */ + stop = mfspr(SPRN_PMC1); + return stop - start; +} + +static void __init init_tests_message(void) +{ + unsigned int pvr = mfspr(SPRN_PVR); + printk(KERN_INFO "Please email the following PERFCTR INIT lines " + "to mikpe@csd.uu.se\n" + KERN_INFO "To remove this message, rebuild the driver " + "with CONFIG_PERFCTR_INIT_TESTS=n\n"); + printk(KERN_INFO "PERFCTR INIT: PVR 0x%08x, CPU clock %u kHz, TB clock %u kHz\n", + pvr, + perfctr_info.cpu_khz, + tb_ticks_per_jiffy*(HZ/10)/(1000/10)); +} + +static void __init clear(int have_mmcr1) +{ + mtspr(SPRN_MMCR0, 0); + mtspr(SPRN_PMC1, 0); + mtspr(SPRN_PMC2, 0); + if (have_mmcr1) { + mtspr(SPRN_MMCR1, 0); + mtspr(SPRN_PMC3, 
0); + mtspr(SPRN_PMC4, 0); + } +} + +static void __init check_fcece(unsigned int pmc1ce) +{ + unsigned int mmcr0; + + /* + * This test checks if MMCR0[FC] is set after PMC1 overflows + * when MMCR0[FCECE] is set. + * 74xx documentation states this behaviour, while documentation + * for 604/750 processors doesn't mention this at all. + * + * Also output the value of PMC1 shortly after the overflow. + * This tells us if PMC1 really was frozen. On 604/750, it may not + * freeze since we don't enable PMIs. [No freeze confirmed on 750.] + * + * When pmc1ce == 0, MMCR0[PMC1CE] is zero. It's unclear whether + * this masks all PMC1 overflow events or just PMC1 PMIs. + * + * PMC1 counts processor cycles, with 100 to go before overflowing. + * FCECE is set. + * PMC1CE is clear if !pmc1ce, otherwise set. + */ + mtspr(SPRN_PMC1, 0x80000000-100); + mmcr0 = (1<<(31-6)) | (0x01 << 6); + if (pmc1ce) + mmcr0 |= (1<<(31-16)); + mtspr(SPRN_MMCR0, mmcr0); + do { + do_empty_loop(0); + } while (!(mfspr(SPRN_PMC1) & 0x80000000)); + do_empty_loop(0); + printk(KERN_INFO "PERFCTR INIT: %s(%u): MMCR0[FC] is %u, PMC1 is %#x\n", + __FUNCTION__, pmc1ce, + !!(mfspr(SPRN_MMCR0) & (1<<(31-0))), mfspr(SPRN_PMC1)); + mtspr(SPRN_MMCR0, 0); + mtspr(SPRN_PMC1, 0); +} + +static void __init check_trigger(unsigned int pmc1ce) +{ + unsigned int mmcr0; + + /* + * This test checks if MMCR0[TRIGGER] is reset after PMC1 overflows. + * 74xx documentation states this behaviour, while documentation + * for 604/750 processors doesn't mention this at all. + * [No reset confirmed on 750.] + * + * Also output the values of PMC1 and PMC2 shortly after the overflow. + * PMC2 should be equal to PMC1-0x80000000. + * + * When pmc1ce == 0, MMCR0[PMC1CE] is zero. It's unclear whether + * this masks all PMC1 overflow events or just PMC1 PMIs. + * + * PMC1 counts processor cycles, with 100 to go before overflowing. + * PMC2 counts processor cycles, starting from 0. + * TRIGGER is set, so PMC2 doesn't start until PMC1 overflows. + * PMC1CE is clear if !pmc1ce, otherwise set. + */ + mtspr(SPRN_PMC2, 0); + mtspr(SPRN_PMC1, 0x80000000-100); + mmcr0 = (1<<(31-18)) | (0x01 << 6) | (0x01 << 0); + if (pmc1ce) + mmcr0 |= (1<<(31-16)); + mtspr(SPRN_MMCR0, mmcr0); + do { + do_empty_loop(0); + } while (!(mfspr(SPRN_PMC1) & 0x80000000)); + do_empty_loop(0); + printk(KERN_INFO "PERFCTR INIT: %s(%u): MMCR0[TRIGGER] is %u, PMC1 is %#x, PMC2 is %#x\n", + __FUNCTION__, pmc1ce, + !!(mfspr(SPRN_MMCR0) & (1<<(31-18))), mfspr(SPRN_PMC1), mfspr(SPRN_PMC2)); + mtspr(SPRN_MMCR0, 0); + mtspr(SPRN_PMC1, 0); + mtspr(SPRN_PMC2, 0); +} + +static void __init +measure_overheads(int have_mmcr1) +{ + int i; + unsigned int mmcr0, loop, ticks[12]; + const char *name[12]; + + clear(have_mmcr1); + + /* PMC1 = "processor cycles", + PMC2 = "completed instructions", + not disabled in any mode, + no interrupts */ + mmcr0 = (0x01 << 6) | (0x02 << 0); + mtspr(SPRN_MMCR0, mmcr0); + + name[0] = "mftbl"; + ticks[0] = run(do_read_tbl, 0); + name[1] = "mfspr (pmc1)"; + ticks[1] = run(do_read_pmc1, 0); + name[2] = "mfspr (pmc2)"; + ticks[2] = run(do_read_pmc2, 0); + name[3] = "mfspr (pmc3)"; + ticks[3] = have_mmcr1 ? run(do_read_pmc3, 0) : 0; + name[4] = "mfspr (pmc4)"; + ticks[4] = have_mmcr1 ? run(do_read_pmc4, 0) : 0; + name[5] = "mfspr (mmcr0)"; + ticks[5] = run(do_read_mmcr0, 0); + name[6] = "mfspr (mmcr1)"; + ticks[6] = have_mmcr1 ? run(do_read_mmcr1, 0) : 0; + name[7] = "mtspr (pmc2)"; + ticks[7] = run(do_write_pmc2, 0); + name[8] = "mtspr (pmc3)"; + ticks[8] = have_mmcr1 ? 
run(do_write_pmc3, 0) : 0; + name[9] = "mtspr (pmc4)"; + ticks[9] = have_mmcr1 ? run(do_write_pmc4, 0) : 0; + name[10] = "mtspr (mmcr1)"; + ticks[10] = have_mmcr1 ? run(do_write_mmcr1, 0) : 0; + name[11] = "mtspr (mmcr0)"; + ticks[11] = run(do_write_mmcr0, mmcr0); + + loop = run(do_empty_loop, 0); + + clear(have_mmcr1); + + init_tests_message(); + printk(KERN_INFO "PERFCTR INIT: NITER == %u\n", NITER); + printk(KERN_INFO "PERFCTR INIT: loop overhead is %u cycles\n", loop); + for(i = 0; i < ARRAY_SIZE(ticks); ++i) { + unsigned int x; + if (!ticks[i]) + continue; + x = ((ticks[i] - loop) * 10) / NITER; + printk(KERN_INFO "PERFCTR INIT: %s cost is %u.%u cycles (%u total)\n", + name[i], x/10, x%10, ticks[i]); + } + check_fcece(0); + check_fcece(1); + check_trigger(0); + check_trigger(1); +} + +void __init perfctr_ppc_init_tests(int have_mmcr1) +{ + preempt_disable(); + measure_overheads(have_mmcr1); + preempt_enable(); +} diff -puN /dev/null drivers/perfctr/ppc_tests.h --- /dev/null 2003-09-15 06:40:47.000000000 -0700 +++ devel-akpm/drivers/perfctr/ppc_tests.h 2005-07-08 23:11:41.000000000 -0700 @@ -0,0 +1,12 @@ +/* $Id: ppc_tests.h,v 1.1 2004/01/12 01:59:11 mikpe Exp $ + * Performance-monitoring counters driver. + * Optional PPC32-specific init-time tests. + * + * Copyright (C) 2004 Mikael Pettersson + */ + +#ifdef CONFIG_PERFCTR_INIT_TESTS +extern void perfctr_ppc_init_tests(int have_mmcr1); +#else +static inline void perfctr_ppc_init_tests(int have_mmcr1) { } +#endif diff -puN /dev/null include/asm-ppc/perfctr.h --- /dev/null 2003-09-15 06:40:47.000000000 -0700 +++ devel-akpm/include/asm-ppc/perfctr.h 2005-07-08 23:11:41.000000000 -0700 @@ -0,0 +1,174 @@ +/* $Id: perfctr.h,v 1.19 2005/04/08 14:36:49 mikpe Exp $ + * PPC32 Performance-Monitoring Counters driver + * + * Copyright (C) 2004-2005 Mikael Pettersson + */ +#ifndef _ASM_PPC_PERFCTR_H +#define _ASM_PPC_PERFCTR_H + +#include + +struct perfctr_sum_ctrs { + __u64 tsc; + __u64 pmc[8]; /* the size is not part of the user ABI */ +}; + +struct perfctr_cpu_control_header { + __u32 tsc_on; + __u32 nractrs; /* number of accumulation-mode counters */ + __u32 nrictrs; /* number of interrupt-mode counters */ +}; + +struct perfctr_cpu_state_user { + __u32 cstatus; + /* This is a sequence counter to ensure atomic reads by + * userspace. The mechanism is identical to that used for + * seqcount_t in include/linux/seqlock.h. 
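+ *
+ * As in the x86 variant of this header, a user-space reader should
+ * retry whenever this count is odd or changes across its reads of
+ * the start/sum pairs.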
*/ + __u32 sequence; + __u64 tsc_start; + __u64 tsc_sum; + struct { + __u64 start; + __u64 sum; + } pmc[8]; /* the size is not part of the user ABI */ +}; + +/* cstatus is a re-encoding of control.tsc_on/nractrs/nrictrs + which should have less overhead in most cases */ +/* XXX: ppc driver internally also uses cstatus&(1<<30) */ + +static inline +unsigned int perfctr_mk_cstatus(unsigned int tsc_on, unsigned int nractrs, + unsigned int nrictrs) +{ + return (tsc_on<<31) | (nrictrs<<16) | ((nractrs+nrictrs)<<8) | nractrs; +} + +static inline unsigned int perfctr_cstatus_enabled(unsigned int cstatus) +{ + return cstatus; +} + +static inline int perfctr_cstatus_has_tsc(unsigned int cstatus) +{ + return (int)cstatus < 0; /* test and jump on sign */ +} + +static inline unsigned int perfctr_cstatus_nractrs(unsigned int cstatus) +{ + return cstatus & 0x7F; /* and with imm8 */ +} + +static inline unsigned int perfctr_cstatus_nrctrs(unsigned int cstatus) +{ + return (cstatus >> 8) & 0x7F; +} + +static inline unsigned int perfctr_cstatus_has_ictrs(unsigned int cstatus) +{ + return cstatus & (0x7F << 16); +} + +/* + * 'struct siginfo' support for perfctr overflow signals. + * In unbuffered mode, si_code is set to SI_PMC_OVF and a bitmask + * describing which perfctrs overflowed is put in si_pmc_ovf_mask. + * A bitmask is used since more than one perfctr can have overflowed + * by the time the interrupt handler runs. + * + * glibc's <signal.h> doesn't seem to define __SI_FAULT or __SI_CODE(), + * and including <asm/siginfo.h> as well may cause redefinition errors, + * so the user and kernel values are different #defines here. + */ +#ifdef __KERNEL__ +#define SI_PMC_OVF (__SI_FAULT|'P') +#else +#define SI_PMC_OVF ('P') +#endif +#define si_pmc_ovf_mask _sifields._pad[0] /* XXX: use an unsigned field later */ + +#ifdef __KERNEL__ + +#if defined(CONFIG_PERFCTR) + +struct perfctr_cpu_control { + struct perfctr_cpu_control_header header; + unsigned int mmcr0; + unsigned int mmcr1; + unsigned int mmcr2; + /* IABR/DABR/BAMR not supported */ + unsigned int ireset[8]; /* [0,0x7fffffff], for i-mode counters, physical indices */ + unsigned int pmc_map[8]; /* virtual to physical index map */ +}; + +struct perfctr_cpu_state { + /* Don't change field order here without first considering the number + of cache lines touched during sampling and context switching. */ + unsigned int id; + int isuspend_cpu; + struct perfctr_cpu_state_user user; + struct perfctr_cpu_control control; +}; + +/* Driver init/exit. */ +extern int perfctr_cpu_init(void); +extern void perfctr_cpu_exit(void); + +/* CPU type name. */ +extern char *perfctr_cpu_name; + +/* Hardware reservation. */ +extern const char *perfctr_cpu_reserve(const char *service); +extern void perfctr_cpu_release(const char *service); + +/* PRE: state has no running interrupt-mode counters. + Check that the new control data is valid. + Update the driver's private control data. + Returns a negative error code if the control data is invalid. */ +extern int perfctr_cpu_update_control(struct perfctr_cpu_state *state, int is_global); + +/* Parse and update control for the given domain. */ +extern int perfctr_cpu_control_write(struct perfctr_cpu_control *control, + unsigned int domain, + const void *srcp, unsigned int srcbytes); + +/* Retrieve and format control for the given domain. + Returns number of bytes written. */ +extern int perfctr_cpu_control_read(const struct perfctr_cpu_control *control, + unsigned int domain, + void *dstp, unsigned int dstbytes); + +/* Read a-mode counters.
Subtract from start and accumulate into sums. + Must be called with preemption disabled. */ +extern void perfctr_cpu_suspend(struct perfctr_cpu_state *state); + +/* Write control registers. Read a-mode counters into start. + Must be called with preemption disabled. */ +extern void perfctr_cpu_resume(struct perfctr_cpu_state *state); + +/* Perform an efficient combined suspend/resume operation. + Must be called with preemption disabled. */ +extern void perfctr_cpu_sample(struct perfctr_cpu_state *state); + +/* The type of a perfctr overflow interrupt handler. + It will be called in IRQ context, with preemption disabled. */ +typedef void (*perfctr_ihandler_t)(unsigned long pc); + +/* Operations related to overflow interrupt handling. */ +#ifdef CONFIG_PERFCTR_INTERRUPT_SUPPORT +extern void perfctr_cpu_set_ihandler(perfctr_ihandler_t); +extern void perfctr_cpu_ireload(struct perfctr_cpu_state*); +extern unsigned int perfctr_cpu_identify_overflow(struct perfctr_cpu_state*); +#else +static inline void perfctr_cpu_set_ihandler(perfctr_ihandler_t x) { } +#endif +static inline int perfctr_cpu_has_pending_interrupt(const struct perfctr_cpu_state *state) +{ + return 0; +} + +#endif /* CONFIG_PERFCTR */ + +#endif /* __KERNEL__ */ + +#endif /* _ASM_PPC_PERFCTR_H */ diff -puN include/asm-ppc/processor.h~perfctr include/asm-ppc/processor.h --- devel/include/asm-ppc/processor.h~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/include/asm-ppc/processor.h 2005-07-08 23:11:41.000000000 -0700 @@ -122,6 +122,9 @@ struct thread_struct { unsigned long spefscr; /* SPE & eFP status */ int used_spe; /* set if process has used spe */ #endif /* CONFIG_SPE */ +#ifdef CONFIG_PERFCTR_VIRTUAL + struct vperfctr *perfctr; /* performance counters */ +#endif }; #define ARCH_MIN_TASKALIGN 16 diff -puN include/asm-ppc/reg.h~perfctr include/asm-ppc/reg.h --- devel/include/asm-ppc/reg.h~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/include/asm-ppc/reg.h 2005-07-08 23:11:41.000000000 -0700 @@ -275,22 +275,14 @@ #define SPRN_LDSTCR 0x3f8 /* Load/Store control register */ #define SPRN_LDSTDB 0x3f4 /* */ #define SPRN_LR 0x008 /* Link Register */ -#define SPRN_MMCR0 0x3B8 /* Monitor Mode Control Register 0 */ -#define SPRN_MMCR1 0x3BC /* Monitor Mode Control Register 1 */ #ifndef SPRN_PIR #define SPRN_PIR 0x3FF /* Processor Identification Register */ #endif -#define SPRN_PMC1 0x3B9 /* Performance Counter Register 1 */ -#define SPRN_PMC2 0x3BA /* Performance Counter Register 2 */ -#define SPRN_PMC3 0x3BD /* Performance Counter Register 3 */ -#define SPRN_PMC4 0x3BE /* Performance Counter Register 4 */ #define SPRN_PTEHI 0x3D5 /* 981 7450 PTE HI word (S/W TLB load) */ #define SPRN_PTELO 0x3D6 /* 982 7450 PTE LO word (S/W TLB load) */ #define SPRN_PVR 0x11F /* Processor Version Register */ #define SPRN_RPA 0x3D6 /* Required Physical Address Register */ -#define SPRN_SDA 0x3BF /* Sampled Data Address Register */ #define SPRN_SDR1 0x019 /* MMU Hash Base Register */ -#define SPRN_SIA 0x3BB /* Sampled Instruction Address Register */ #define SPRN_SPRG0 0x110 /* Special Purpose Register General 0 */ #define SPRN_SPRG1 0x111 /* Special Purpose Register General 1 */ #define SPRN_SPRG2 0x112 /* Special Purpose Register General 2 */ @@ -317,16 +309,79 @@ #define SPRN_THRM3 0x3FE /* Thermal Management Register 3 */ #define THRM3_E (1<<0) #define SPRN_TLBMISS 0x3D4 /* 980 7450 TLB Miss Register */ -#define SPRN_UMMCR0 0x3A8 /* User Monitor Mode Control Register 0 */ -#define SPRN_UMMCR1 0x3AC /* User Monitor Mode Control 
Register 0 */ -#define SPRN_UPMC1 0x3A9 /* User Performance Counter Register 1 */ -#define SPRN_UPMC2 0x3AA /* User Performance Counter Register 2 */ -#define SPRN_UPMC3 0x3AD /* User Performance Counter Register 3 */ -#define SPRN_UPMC4 0x3AE /* User Performance Counter Register 4 */ -#define SPRN_USIA 0x3AB /* User Sampled Instruction Address Register */ #define SPRN_VRSAVE 0x100 /* Vector Register Save Register */ #define SPRN_XER 0x001 /* Fixed Point Exception Register */ +/* Performance-monitoring control and counter registers */ +#define SPRN_MMCR0 0x3B8 /* Monitor Mode Control Register 0 (604 and up) */ +#define SPRN_MMCR1 0x3BC /* Monitor Mode Control Register 1 (604e and up) */ +#define SPRN_MMCR2 0x3B0 /* Monitor Mode Control Register 2 (7400 and up) */ +#define SPRN_PMC1 0x3B9 /* Performance Counter Register 1 (604 and up) */ +#define SPRN_PMC2 0x3BA /* Performance Counter Register 2 (604 and up) */ +#define SPRN_PMC3 0x3BD /* Performance Counter Register 3 (604e and up) */ +#define SPRN_PMC4 0x3BE /* Performance Counter Register 4 (604e and up) */ +#define SPRN_PMC5 0x3B1 /* Performance Counter Register 5 (7450 and up) */ +#define SPRN_PMC6 0x3B2 /* Performance Counter Register 6 (7450 and up) */ +#define SPRN_SIA 0x3BB /* Sampled Instruction Address Register (604 and up) */ +#define SPRN_SDA 0x3BF /* Sampled Data Address Register (604/604e only) */ +#define SPRN_BAMR 0x3B7 /* Breakpoint Address Mask Register (7400 and up) */ + +#define SPRN_UMMCR0 0x3A8 /* User Monitor Mode Control Register 0 (750 and up) */ +#define SPRN_UMMCR1 0x3AC /* User Monitor Mode Control Register 1 (750 and up) */ +#define SPRN_UMMCR2 0x3A0 /* User Monitor Mode Control Register 2 (7400 and up) */ +#define SPRN_UPMC1 0x3A9 /* User Performance Counter Register 1 (750 and up) */ +#define SPRN_UPMC2 0x3AA /* User Performance Counter Register 2 (750 and up) */ +#define SPRN_UPMC3 0x3AD /* User Performance Counter Register 3 (750 and up) */ +#define SPRN_UPMC4 0x3AE /* User Performance Counter Register 4 (750 and up) */ +#define SPRN_UPMC5 0x3A1 /* User Performance Counter Register 5 (7450 and up) */ +#define SPRN_UPMC6 0x3A2 /* User Performance Counter Register 6 (7450 and up) */ +#define SPRN_USIA 0x3AB /* User Sampled Instruction Address Register (750 and up) */ +#define SPRN_UBAMR 0x3A7 /* User Breakpoint Address Mask Register (7400 and up) */ + +/* MMCR0 layout (74xx terminology) */ +#define MMCR0_FC 0x80000000 /* Freeze counters unconditionally. */ +#define MMCR0_FCS 0x40000000 /* Freeze counters while MSR[PR]=0 (supervisor mode). */ +#define MMCR0_FCP 0x20000000 /* Freeze counters while MSR[PR]=1 (user mode). */ +#define MMCR0_FCM1 0x10000000 /* Freeze counters while MSR[PM]=1. */ +#define MMCR0_FCM0 0x08000000 /* Freeze counters while MSR[PM]=0. */ +#define MMCR0_PMXE 0x04000000 /* Enable performance monitor exceptions. + * Cleared by hardware when a PM exception occurs. + * 604: PMXE is not cleared by hardware. + */ +#define MMCR0_FCECE 0x02000000 /* Freeze counters on enabled condition or event. + * FCECE is treated as 0 if TRIGGER is 1. + * 74xx: FC is set when the event occurs. + * 604/750: ineffective when PMXE=0. + */ +#define MMCR0_TBSEL 0x01800000 /* Time base lower (TBL) bit selector. + * 00: bit 31, 01: bit 23, 10: bit 19, 11: bit 15. + */ +#define MMCR0_TBEE 0x00400000 /* Enable event on TBL bit transition from 0 to 1. */ +#define MMCR0_THRESHOLD 0x003F0000 /* Threshold value for certain events. */ +#define MMCR0_PMC1CE 0x00008000 /* Enable event on PMC1 overflow.
*/ +#define MMCR0_PMCjCE 0x00004000 /* Enable event on PMC2-PMC6 overflow. + * 604/750: Overrides FCECE (DISCOUNT). + */ +#define MMCR0_TRIGGER 0x00002000 /* Disable PMC2-PMC6 until PMC1 overflow or other event. + * 74xx: cleared by hardware when the event occurs. + */ +#define MMCR0_PMC1SEL 0x00001FC0 /* PMC1 event selector, 7 bits. */ +#define MMCR0_PMC2SEL 0x0000003F /* PMC2 event selector, 6 bits. */ + +/* MMCR1 layout (604e-7457) */ +#define MMCR1_PMC3SEL 0xF8000000 /* PMC3 event selector, 5 bits. */ +#define MMCR1_PMC4SEL 0x07C00000 /* PMC4 event selector, 5 bits. */ +#define MMCR1_PMC5SEL 0x003E0000 /* PMC5 event selector, 5 bits. (745x only) */ +#define MMCR1_PMC6SEL 0x0001F800 /* PMC6 event selector, 6 bits. (745x only) */ +#define MMCR1__RESERVED 0x000007FF /* should be zero */ + +/* MMCR2 layout (7400-7457) */ +#define MMCR2_THRESHMULT 0x80000000 /* MMCR0[THRESHOLD] multiplier. */ +#define MMCR2_SMCNTEN 0x40000000 /* 7400/7410 only, should be zero. */ +#define MMCR2_SMINTEN 0x20000000 /* 7400/7410 only, should be zero. */ +#define MMCR2__RESERVED 0x1FFFFFFF /* should be zero */ +#define MMCR2_RESERVED (MMCR2_SMCNTEN | MMCR2_SMINTEN | MMCR2__RESERVED) + /* Bit definitions for MMCR0 and PMC1 / PMC2. */ #define MMCR0_PMC1_CYCLES (1 << 7) #define MMCR0_PMC1_ICACHEMISS (5 << 7) @@ -335,7 +390,6 @@ #define MMCR0_PMC2_CYCLES 0x1 #define MMCR0_PMC2_ITLB 0x7 #define MMCR0_PMC2_LOADMISSTIME 0x5 -#define MMCR0_PMXE (1 << 26) /* Processor Version Register */ diff -puN include/asm-ppc/unistd.h~perfctr include/asm-ppc/unistd.h --- devel/include/asm-ppc/unistd.h~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/include/asm-ppc/unistd.h 2005-07-08 23:11:41.000000000 -0700 @@ -281,8 +281,12 @@ #define __NR_ioprio_get 274 #define __NR_pselect6 275 #define __NR_ppoll 276 +#define __NR_vperfctr_open 277 +#define __NR_vperfctr_control (__NR_vperfctr_open+1) +#define __NR_vperfctr_write (__NR_vperfctr_open+2) +#define __NR_vperfctr_read (__NR_vperfctr_open+3) -#define __NR_syscalls 277 +#define __NR_syscalls 281 #define __NR(n) #n diff -puN /dev/null drivers/perfctr/virtual.c --- /dev/null 2003-09-15 06:40:47.000000000 -0700 +++ devel-akpm/drivers/perfctr/virtual.c 2005-07-08 23:11:41.000000000 -0700 @@ -0,0 +1,1253 @@ +/* $Id: virtual.c,v 1.115 2005/03/28 22:39:02 mikpe Exp $ + * Virtual per-process performance counters. + * + * Copyright (C) 1999-2005 Mikael Pettersson + */ +#include +#include +#include /* for unlikely() in 2.4.18 and older */ +#include +#include +#include +#include +#include +#include + +#include +#include + +#include "cpumask.h" +#include "virtual.h" + +/**************************************************************** + * * + * Data types and macros.
* + * * + ****************************************************************/ + +struct vperfctr { +/* User-visible fields: (must be first for mmap()) */ + struct perfctr_cpu_state cpu_state; +/* Kernel-private fields: */ + int si_signo; + atomic_t count; + spinlock_t owner_lock; + struct task_struct *owner; + /* sampling_timer and bad_cpus_allowed are frequently + accessed, so they get to share a cache line */ + unsigned int sampling_timer ____cacheline_aligned; +#ifdef CONFIG_PERFCTR_CPUS_FORBIDDEN_MASK + atomic_t bad_cpus_allowed; +#endif + unsigned int preserve; + unsigned int resume_cstatus; +#ifdef CONFIG_PERFCTR_INTERRUPT_SUPPORT + unsigned int ireload_needed; /* only valid if resume_cstatus != 0 */ +#endif + /* children_lock protects inheritance_id and children, + when parent is not the one doing release_task() */ + spinlock_t children_lock; + unsigned long long inheritance_id; + struct perfctr_sum_ctrs children; + /* schedule_work() data for when an operation cannot be + done in the current context due to locking rules */ + struct work_struct work; + struct task_struct *parent_tsk; +}; +#define IS_RUNNING(perfctr) perfctr_cstatus_enabled((perfctr)->cpu_state.user.cstatus) + +#ifdef CONFIG_PERFCTR_INTERRUPT_SUPPORT + +static void vperfctr_ihandler(unsigned long pc); +static void vperfctr_handle_overflow(struct task_struct*, struct vperfctr*); + +static inline void vperfctr_set_ihandler(void) +{ + perfctr_cpu_set_ihandler(vperfctr_ihandler); +} + +#else +static inline void vperfctr_set_ihandler(void) { } +#endif + +#ifdef CONFIG_PERFCTR_CPUS_FORBIDDEN_MASK + +static inline void vperfctr_init_bad_cpus_allowed(struct vperfctr *perfctr) +{ + atomic_set(&perfctr->bad_cpus_allowed, 0); +} + +#else /* !CONFIG_PERFCTR_CPUS_FORBIDDEN_MASK */ +static inline void vperfctr_init_bad_cpus_allowed(struct vperfctr *perfctr) { } +#endif /* !CONFIG_PERFCTR_CPUS_FORBIDDEN_MASK */ + +/**************************************************************** + * * + * Resource management. * + * * + ****************************************************************/ + +/* XXX: perhaps relax this to number of _live_ perfctrs */ +static DECLARE_MUTEX(nrctrs_mutex); +static int nrctrs; +static const char this_service[] = __FILE__; + +static int inc_nrctrs(void) +{ + const char *other; + + other = NULL; + down(&nrctrs_mutex); + if (++nrctrs == 1) { + other = perfctr_cpu_reserve(this_service); + if (other) + nrctrs = 0; + } + up(&nrctrs_mutex); + if (other) { + printk(KERN_ERR __FILE__ + ": cannot operate, perfctr hardware taken by '%s'\n", + other); + return -EBUSY; + } + vperfctr_set_ihandler(); + return 0; +} + +static void dec_nrctrs(void) +{ + down(&nrctrs_mutex); + if (--nrctrs == 0) + perfctr_cpu_release(this_service); + up(&nrctrs_mutex); +} + +/* Allocate a `struct vperfctr'. Claim and reserve + an entire page so that it can be mmap():ed. 
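+   (Marking the page PG_reserved keeps the VM from treating it as an
+   ordinary pageable page, which lets vperfctr_mmap() below map it
+   read-only into user space with remap_pfn_range().)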
*/ +static struct vperfctr *vperfctr_alloc(void) +{ + unsigned long page; + + if (inc_nrctrs() != 0) + return ERR_PTR(-EBUSY); + page = get_zeroed_page(GFP_KERNEL); + if (!page) { + dec_nrctrs(); + return ERR_PTR(-ENOMEM); + } + SetPageReserved(virt_to_page(page)); + return (struct vperfctr*) page; +} + +static void vperfctr_free(struct vperfctr *perfctr) +{ + ClearPageReserved(virt_to_page(perfctr)); + free_page((unsigned long)perfctr); + dec_nrctrs(); +} + +static struct vperfctr *get_empty_vperfctr(void) +{ + struct vperfctr *perfctr = vperfctr_alloc(); + if (!IS_ERR(perfctr)) { + atomic_set(&perfctr->count, 1); + vperfctr_init_bad_cpus_allowed(perfctr); + spin_lock_init(&perfctr->owner_lock); + spin_lock_init(&perfctr->children_lock); + } + return perfctr; +} + +static void put_vperfctr(struct vperfctr *perfctr) +{ + if (atomic_dec_and_test(&perfctr->count)) + vperfctr_free(perfctr); +} + +static void scheduled_vperfctr_free(void *perfctr) +{ + vperfctr_free((struct vperfctr*)perfctr); +} + +static void schedule_put_vperfctr(struct vperfctr *perfctr) +{ + if (!atomic_dec_and_test(&perfctr->count)) + return; + INIT_WORK(&perfctr->work, scheduled_vperfctr_free, perfctr); + schedule_work(&perfctr->work); +} + +static unsigned long long new_inheritance_id(void) +{ + static spinlock_t lock = SPIN_LOCK_UNLOCKED; + static unsigned long long counter; + unsigned long long id; + + spin_lock(&lock); + id = ++counter; + spin_unlock(&lock); + return id; +} + +/**************************************************************** + * * + * Basic counter operations. * + * These must all be called by the owner process only. * + * These must all be called with preemption disabled. * + * * + ****************************************************************/ + +/* PRE: IS_RUNNING(perfctr) + * Suspend the counters. + */ +static inline void vperfctr_suspend(struct vperfctr *perfctr) +{ + perfctr_cpu_suspend(&perfctr->cpu_state); +} + +static inline void vperfctr_reset_sampling_timer(struct vperfctr *perfctr) +{ + /* XXX: base the value on perfctr_info.cpu_khz instead! */ + perfctr->sampling_timer = HZ/2; +} + +/* PRE: perfctr == current->thread.perfctr && IS_RUNNING(perfctr) + * Restart the counters. + */ +static inline void vperfctr_resume(struct vperfctr *perfctr) +{ + perfctr_cpu_resume(&perfctr->cpu_state); + vperfctr_reset_sampling_timer(perfctr); +} + +static inline void vperfctr_resume_with_overflow_check(struct vperfctr *perfctr) +{ +#ifdef CONFIG_PERFCTR_INTERRUPT_SUPPORT + if (perfctr_cpu_has_pending_interrupt(&perfctr->cpu_state)) { + vperfctr_handle_overflow(current, perfctr); + return; + } +#endif + vperfctr_resume(perfctr); +} + +/* Sample the counters but do not suspend them. */ +static void vperfctr_sample(struct vperfctr *perfctr) +{ + if (IS_RUNNING(perfctr)) { + perfctr_cpu_sample(&perfctr->cpu_state); + vperfctr_reset_sampling_timer(perfctr); + } +} + +#ifdef CONFIG_PERFCTR_INTERRUPT_SUPPORT +/* vperfctr interrupt handler (XXX: add buffering support) */ +/* PREEMPT note: called in IRQ context with preemption disabled. */ +static void vperfctr_ihandler(unsigned long pc) +{ + struct task_struct *tsk = current; + struct vperfctr *perfctr; + + perfctr = tsk->thread.perfctr; + if (!perfctr) { + printk(KERN_ERR "%s: BUG! pid %d has no vperfctr\n", + __FUNCTION__, tsk->pid); + return; + } + if (!perfctr_cstatus_has_ictrs(perfctr->cpu_state.user.cstatus)) { + printk(KERN_ERR "%s: BUG! 
vperfctr has cstatus %#x (pid %d, comm %s)\n", + __FUNCTION__, perfctr->cpu_state.user.cstatus, tsk->pid, tsk->comm); + return; + } + vperfctr_suspend(perfctr); + vperfctr_handle_overflow(tsk, perfctr); +} + +static void vperfctr_handle_overflow(struct task_struct *tsk, + struct vperfctr *perfctr) +{ + unsigned int pmc_mask; + siginfo_t si; + sigset_t old_blocked; + + pmc_mask = perfctr_cpu_identify_overflow(&perfctr->cpu_state); + if (!pmc_mask) { +#ifdef CONFIG_PPC64 + /* On some hardware (ppc64, in particular) it's + * impossible to control interrupts finely enough to + * eliminate overflows on counters we don't care + * about. So in this case just restart the counters + * and keep going. */ + vperfctr_resume(perfctr); +#else + printk(KERN_ERR "%s: BUG! pid %d has unidentifiable overflow source\n", + __FUNCTION__, tsk->pid); +#endif + return; + } + perfctr->ireload_needed = 1; + /* suspend a-mode and i-mode PMCs, leaving only TSC on */ + /* XXX: some people also want to suspend the TSC */ + perfctr->resume_cstatus = perfctr->cpu_state.user.cstatus; + if (perfctr_cstatus_has_tsc(perfctr->resume_cstatus)) { + perfctr->cpu_state.user.cstatus = perfctr_mk_cstatus(1, 0, 0); + vperfctr_resume(perfctr); + } else + perfctr->cpu_state.user.cstatus = 0; + si.si_signo = perfctr->si_signo; + si.si_errno = 0; + si.si_code = SI_PMC_OVF; + si.si_pmc_ovf_mask = pmc_mask; + + /* deliver signal without waking up the receiver */ + spin_lock_irq(&tsk->sighand->siglock); + old_blocked = tsk->blocked; + sigaddset(&tsk->blocked, si.si_signo); + spin_unlock_irq(&tsk->sighand->siglock); + + if (!send_sig_info(si.si_signo, &si, tsk)) + send_sig(si.si_signo, tsk, 1); + + spin_lock_irq(&tsk->sighand->siglock); + tsk->blocked = old_blocked; + recalc_sigpending(); + spin_unlock_irq(&tsk->sighand->siglock); +} +#endif + +/**************************************************************** + * * + * Process management operations. * + * These must all, with the exception of vperfctr_unlink() * + * and __vperfctr_set_cpus_allowed(), be called by the owner * + * process only. * + * * + ****************************************************************/ + +/* do_fork() -> copy_process() -> copy_thread() -> __vperfctr_copy(). + * Inherit the parent's perfctr settings to the child. + * PREEMPT note: do_fork() etc do not run with preemption disabled. +*/ +void __vperfctr_copy(struct task_struct *child_tsk, struct pt_regs *regs) +{ + struct vperfctr *parent_perfctr; + struct vperfctr *child_perfctr; + + /* Do not inherit perfctr settings to kernel-generated + threads, like those created by kmod. */ + child_perfctr = NULL; + if (!user_mode(regs)) + goto out; + + /* Allocation may sleep. Do it before the critical region. */ + child_perfctr = get_empty_vperfctr(); + if (IS_ERR(child_perfctr)) { + child_perfctr = NULL; + goto out; + } + + /* Although we're executing in the parent, if it is scheduled + then a remote monitor may attach and change the perfctr + pointer or the object it points to. This may already have + occurred when we get here, so the old copy of the pointer + in the child cannot be trusted. 
*/ + preempt_disable(); + parent_perfctr = current->thread.perfctr; + if (parent_perfctr) { + child_perfctr->cpu_state.control = parent_perfctr->cpu_state.control; + child_perfctr->si_signo = parent_perfctr->si_signo; + child_perfctr->inheritance_id = parent_perfctr->inheritance_id; + } + preempt_enable(); + if (!parent_perfctr) { + put_vperfctr(child_perfctr); + child_perfctr = NULL; + goto out; + } + (void)perfctr_cpu_update_control(&child_perfctr->cpu_state, 0); + child_perfctr->owner = child_tsk; + out: + child_tsk->thread.perfctr = child_perfctr; +} + +/* Called from exit_thread() or do_vperfctr_unlink(). + * If the counters are running, stop them and sample their final values. + * Mark the vperfctr object as dead. + * Optionally detach the vperfctr object from its owner task. + * PREEMPT note: exit_thread() does not run with preemption disabled. + */ +static void vperfctr_unlink(struct task_struct *owner, struct vperfctr *perfctr, int do_unlink) +{ + /* this synchronises with sys_vperfctr() */ + spin_lock(&perfctr->owner_lock); + perfctr->owner = NULL; + spin_unlock(&perfctr->owner_lock); + + /* perfctr suspend+detach must be atomic wrt process suspend */ + /* this also synchronises with perfctr_set_cpus_allowed() */ + task_lock(owner); + if (IS_RUNNING(perfctr) && owner == current) + vperfctr_suspend(perfctr); + if (do_unlink) + owner->thread.perfctr = NULL; + task_unlock(owner); + + perfctr->cpu_state.user.cstatus = 0; + perfctr->resume_cstatus = 0; + if (do_unlink) + put_vperfctr(perfctr); +} + +void __vperfctr_exit(struct vperfctr *perfctr) +{ + vperfctr_unlink(current, perfctr, 0); +} + +/* release_task() -> perfctr_release_task() -> __vperfctr_release(). + * A task is being released. If it inherited its perfctr settings + * from its parent, then merge its final counts back into the parent. + * Then unlink the child's perfctr. + * PRE: caller has write_lock_irq(&tasklist_lock). + * PREEMPT note: preemption is disabled due to tasklist_lock. + * + * When current == parent_tsk, the child's counts can be merged + * into the parent's immediately. This is the common case. + * + * When current != parent_tsk, the parent must be task_lock()ed + * before its perfctr state can be accessed. task_lock() is illegal + * here due to the write_lock_irq(&tasklist_lock) in release_task(), + * so the operation is done via schedule_work(). 
+ */ +static void do_vperfctr_release(struct vperfctr *child_perfctr, struct task_struct *parent_tsk) +{ + struct vperfctr *parent_perfctr; + unsigned int cstatus, nrctrs, i; + + parent_perfctr = parent_tsk->thread.perfctr; + if (parent_perfctr && child_perfctr) { + spin_lock(&parent_perfctr->children_lock); + if (parent_perfctr->inheritance_id == child_perfctr->inheritance_id) { + cstatus = parent_perfctr->cpu_state.user.cstatus; + if (perfctr_cstatus_has_tsc(cstatus)) + parent_perfctr->children.tsc += + child_perfctr->cpu_state.user.tsc_sum + + child_perfctr->children.tsc; + nrctrs = perfctr_cstatus_nrctrs(cstatus); + for(i = 0; i < nrctrs; ++i) + parent_perfctr->children.pmc[i] += + child_perfctr->cpu_state.user.pmc[i].sum + + child_perfctr->children.pmc[i]; + } + spin_unlock(&parent_perfctr->children_lock); + } + schedule_put_vperfctr(child_perfctr); +} + +static void scheduled_release(void *data) +{ + struct vperfctr *child_perfctr = data; + struct task_struct *parent_tsk = child_perfctr->parent_tsk; + + task_lock(parent_tsk); + do_vperfctr_release(child_perfctr, parent_tsk); + task_unlock(parent_tsk); + put_task_struct(parent_tsk); +} + +void __vperfctr_release(struct task_struct *child_tsk) +{ + struct task_struct *parent_tsk = child_tsk->parent; + struct vperfctr *child_perfctr = child_tsk->thread.perfctr; + + child_tsk->thread.perfctr = NULL; + if (parent_tsk == current) + do_vperfctr_release(child_perfctr, parent_tsk); + else { + get_task_struct(parent_tsk); + INIT_WORK(&child_perfctr->work, scheduled_release, child_perfctr); + child_perfctr->parent_tsk = parent_tsk; + schedule_work(&child_perfctr->work); + } +} + +/* schedule() --> switch_to() --> .. --> __vperfctr_suspend(). + * If the counters are running, suspend them. + * PREEMPT note: switch_to() runs with preemption disabled. + */ +void __vperfctr_suspend(struct vperfctr *perfctr) +{ + if (IS_RUNNING(perfctr)) + vperfctr_suspend(perfctr); +} + +/* schedule() --> switch_to() --> .. --> __vperfctr_resume(). + * PRE: perfctr == current->thread.perfctr + * If the counters are runnable, resume them. + * PREEMPT note: switch_to() runs with preemption disabled. + */ +void __vperfctr_resume(struct vperfctr *perfctr) +{ + if (IS_RUNNING(perfctr)) { +#ifdef CONFIG_PERFCTR_CPUS_FORBIDDEN_MASK + if (unlikely(atomic_read(&perfctr->bad_cpus_allowed)) && + perfctr_cstatus_nrctrs(perfctr->cpu_state.user.cstatus)) { + perfctr->cpu_state.user.cstatus = 0; + perfctr->resume_cstatus = 0; + BUG_ON(current->state != TASK_RUNNING); + send_sig(SIGILL, current, 1); + return; + } +#endif + vperfctr_resume_with_overflow_check(perfctr); + } +} + +/* Called from update_one_process() [triggered by timer interrupt]. + * PRE: perfctr == current->thread.perfctr. + * Sample the counters but do not suspend them. + * Needed to avoid precision loss due to multiple counter + * wraparounds between resume/suspend for CPU-bound processes. + * PREEMPT note: called in IRQ context with preemption disabled. + */ +void __vperfctr_sample(struct vperfctr *perfctr) +{ + if (--perfctr->sampling_timer == 0) + vperfctr_sample(perfctr); +} + +#ifdef CONFIG_PERFCTR_CPUS_FORBIDDEN_MASK +/* Called from set_cpus_allowed(). 
+ * PRE: current holds task_lock(owner) + * PRE: owner->thread.perfctr == perfctr + */ +void __vperfctr_set_cpus_allowed(struct task_struct *owner, + struct vperfctr *perfctr, + cpumask_t new_mask) +{ + if (cpus_intersects(new_mask, perfctr_cpus_forbidden_mask)) { + atomic_set(&perfctr->bad_cpus_allowed, 1); + if (printk_ratelimit()) + printk(KERN_WARNING "perfctr: process %d (comm %s) issued unsafe" + " set_cpus_allowed() on process %d (comm %s)\n", + current->pid, current->comm, owner->pid, owner->comm); + } else + atomic_set(&perfctr->bad_cpus_allowed, 0); +} +#endif + +/**************************************************************** + * * + * Virtual perfctr system calls implementation. * + * These can be called by the owner process (tsk == current), * + * a monitor process which has the owner under ptrace ATTACH * + * control (tsk && tsk != current), or anyone with a handle to * + * an unlinked perfctr (!tsk). * + * * + ****************************************************************/ + +static int do_vperfctr_write(struct vperfctr *perfctr, + unsigned int domain, + const void __user *srcp, + unsigned int srcbytes, + struct task_struct *tsk) +{ + void *tmp; + int err; + + if (!tsk) + return -ESRCH; /* attempt to update unlinked perfctr */ + + if (srcbytes > PAGE_SIZE) /* primitive sanity check */ + return -EINVAL; + tmp = kmalloc(srcbytes, GFP_USER); + if (!tmp) + return -ENOMEM; + err = -EFAULT; + if (copy_from_user(tmp, srcp, srcbytes)) + goto out_kfree; + + /* PREEMPT note: preemption is disabled over the entire + region since we're updating an active perfctr. */ + preempt_disable(); + if (IS_RUNNING(perfctr)) { + if (tsk == current) + vperfctr_suspend(perfctr); + perfctr->cpu_state.user.cstatus = 0; + perfctr->resume_cstatus = 0; + } + + switch (domain) { + case VPERFCTR_DOMAIN_CONTROL: { + struct vperfctr_control control; + + err = -EINVAL; + if (srcbytes > sizeof(control)) + break; + control.si_signo = perfctr->si_signo; + control.preserve = perfctr->preserve; + memcpy(&control, tmp, srcbytes); + /* XXX: validate si_signo? 
*/ + perfctr->si_signo = control.si_signo; + perfctr->preserve = control.preserve; + err = 0; + break; + } + case PERFCTR_DOMAIN_CPU_CONTROL: + err = -EINVAL; + if (srcbytes > sizeof(perfctr->cpu_state.control.header)) + break; + memcpy(&perfctr->cpu_state.control.header, tmp, srcbytes); + err = 0; + break; + case PERFCTR_DOMAIN_CPU_MAP: + err = -EINVAL; + if (srcbytes > sizeof(perfctr->cpu_state.control.pmc_map)) + break; + memcpy(perfctr->cpu_state.control.pmc_map, tmp, srcbytes); + err = 0; + break; + default: + err = perfctr_cpu_control_write(&perfctr->cpu_state.control, + domain, tmp, srcbytes); + } + + preempt_enable(); + out_kfree: + kfree(tmp); + return err; +} + +static int vperfctr_enable_control(struct vperfctr *perfctr, struct task_struct *tsk) +{ + int err; + unsigned int next_cstatus; + unsigned int nrctrs, i; + + if (perfctr->cpu_state.control.header.nractrs || + perfctr->cpu_state.control.header.nrictrs) { + cpumask_t old_mask, new_mask; + + old_mask = tsk->cpus_allowed; + cpus_andnot(new_mask, old_mask, perfctr_cpus_forbidden_mask); + + if (cpus_empty(new_mask)) + return -EINVAL; + if (!cpus_equal(new_mask, old_mask)) + set_cpus_allowed(tsk, new_mask); + } + + perfctr->cpu_state.user.cstatus = 0; + perfctr->resume_cstatus = 0; + + /* remote access note: perfctr_cpu_update_control() is ok */ + err = perfctr_cpu_update_control(&perfctr->cpu_state, 0); + if (err < 0) + return err; + next_cstatus = perfctr->cpu_state.user.cstatus; + if (!perfctr_cstatus_enabled(next_cstatus)) + return 0; + + if (!perfctr_cstatus_has_tsc(next_cstatus)) + perfctr->cpu_state.user.tsc_sum = 0; + + nrctrs = perfctr_cstatus_nrctrs(next_cstatus); + for(i = 0; i < nrctrs; ++i) + if (!(perfctr->preserve & (1<<i))) + perfctr->cpu_state.user.pmc[i].sum = 0; + + spin_lock(&perfctr->children_lock); + perfctr->inheritance_id = new_inheritance_id(); + memset(&perfctr->children, 0, sizeof perfctr->children); + spin_unlock(&perfctr->children_lock); + + return 0; +} + +static inline void vperfctr_ireload(struct vperfctr *perfctr) +{ +#ifdef CONFIG_PERFCTR_INTERRUPT_SUPPORT + if (perfctr->ireload_needed) { + perfctr->ireload_needed = 0; + /* remote access note: perfctr_cpu_ireload() is ok */ + perfctr_cpu_ireload(&perfctr->cpu_state); + } +#endif +} + +static int do_vperfctr_resume(struct vperfctr *perfctr, struct task_struct *tsk) +{ + unsigned int resume_cstatus; + int ret; + + if (!tsk) + return -ESRCH; /* attempt to update unlinked perfctr */ + + /* PREEMPT note: preemption is disabled over the entire + region because we're updating an active perfctr. */ + preempt_disable(); + + if (IS_RUNNING(perfctr) && tsk == current) + vperfctr_suspend(perfctr); + + resume_cstatus = perfctr->resume_cstatus; + if (perfctr_cstatus_enabled(resume_cstatus)) { + perfctr->cpu_state.user.cstatus = resume_cstatus; + perfctr->resume_cstatus = 0; + vperfctr_ireload(perfctr); + ret = 0; + } else { + ret = vperfctr_enable_control(perfctr, tsk); + resume_cstatus = perfctr->cpu_state.user.cstatus; + } + + if (ret >= 0 && perfctr_cstatus_enabled(resume_cstatus) && tsk == current) + vperfctr_resume(perfctr); + + preempt_enable(); + + return ret; +} + +static int do_vperfctr_suspend(struct vperfctr *perfctr, struct task_struct *tsk) +{ + if (!tsk) + return -ESRCH; /* attempt to update unlinked perfctr */ + + /* PREEMPT note: preemption is disabled over the entire + region since we're updating an active perfctr.
*/ + preempt_disable(); + + if (IS_RUNNING(perfctr)) { + if (tsk == current) + vperfctr_suspend(perfctr); + perfctr->resume_cstatus = perfctr->cpu_state.user.cstatus; + perfctr->cpu_state.user.cstatus = 0; + } + + preempt_enable(); + + return 0; +} + +static int do_vperfctr_unlink(struct vperfctr *perfctr, struct task_struct *tsk) +{ + if (tsk) + vperfctr_unlink(tsk, perfctr, 1); + return 0; +} + +static int do_vperfctr_clear(struct vperfctr *perfctr, struct task_struct *tsk) +{ + if (!tsk) + return -ESRCH; /* attempt to update unlinked perfctr */ + + /* PREEMPT note: preemption is disabled over the entire + region because we're updating an active perfctr. */ + preempt_disable(); + + if (IS_RUNNING(perfctr) && tsk == current) + vperfctr_suspend(perfctr); + + memset(&perfctr->cpu_state, 0, sizeof perfctr->cpu_state); + perfctr->resume_cstatus = 0; + + spin_lock(&perfctr->children_lock); + perfctr->inheritance_id = 0; + memset(&perfctr->children, 0, sizeof perfctr->children); + spin_unlock(&perfctr->children_lock); + + preempt_enable(); + + return 0; +} + +static int do_vperfctr_control(struct vperfctr *perfctr, + unsigned int cmd, + struct task_struct *tsk) +{ + switch (cmd) { + case VPERFCTR_CONTROL_UNLINK: + return do_vperfctr_unlink(perfctr, tsk); + case VPERFCTR_CONTROL_SUSPEND: + return do_vperfctr_suspend(perfctr, tsk); + case VPERFCTR_CONTROL_RESUME: + return do_vperfctr_resume(perfctr, tsk); + case VPERFCTR_CONTROL_CLEAR: + return do_vperfctr_clear(perfctr, tsk); + default: + return -EINVAL; + } +} + +static int do_vperfctr_read(struct vperfctr *perfctr, + unsigned int domain, + void __user *dstp, + unsigned int dstbytes, + struct task_struct *tsk) +{ + union { + struct perfctr_sum_ctrs sum; + struct vperfctr_control control; + struct perfctr_sum_ctrs children; + } *tmp; + unsigned int tmpbytes; + int ret; + + tmpbytes = dstbytes; + if (tmpbytes > PAGE_SIZE) /* primitive sanity check */ + return -EINVAL; + if (tmpbytes < sizeof(*tmp)) + tmpbytes = sizeof(*tmp); + tmp = kmalloc(tmpbytes, GFP_USER); + if (!tmp) + return -ENOMEM; + + /* PREEMPT note: While we're reading our own control, another + process may ptrace ATTACH to us and update our control. + Disable preemption to ensure we get a consistent copy. + Not needed for other cases since the perfctr is either + unlinked or its owner is ptrace ATTACH suspended by us. 
*/ + if (tsk == current) + preempt_disable(); + + switch (domain) { + case VPERFCTR_DOMAIN_SUM: { + int j; + + vperfctr_sample(perfctr); + tmp->sum.tsc = perfctr->cpu_state.user.tsc_sum; + for(j = 0; j < ARRAY_SIZE(tmp->sum.pmc); ++j) + tmp->sum.pmc[j] = perfctr->cpu_state.user.pmc[j].sum; + ret = sizeof(tmp->sum); + break; + } + case VPERFCTR_DOMAIN_CONTROL: + tmp->control.si_signo = perfctr->si_signo; + tmp->control.preserve = perfctr->preserve; + ret = sizeof(tmp->control); + break; + case VPERFCTR_DOMAIN_CHILDREN: + if (tsk) + spin_lock(&perfctr->children_lock); + tmp->children = perfctr->children; + if (tsk) + spin_unlock(&perfctr->children_lock); + ret = sizeof(tmp->children); + break; + case PERFCTR_DOMAIN_CPU_CONTROL: + if (tmpbytes > sizeof(perfctr->cpu_state.control.header)) + tmpbytes = sizeof(perfctr->cpu_state.control.header); + memcpy(tmp, &perfctr->cpu_state.control.header, tmpbytes); + ret = tmpbytes; + break; + case PERFCTR_DOMAIN_CPU_MAP: + if (tmpbytes > sizeof(perfctr->cpu_state.control.pmc_map)) + tmpbytes = sizeof(perfctr->cpu_state.control.pmc_map); + memcpy(tmp, perfctr->cpu_state.control.pmc_map, tmpbytes); + ret = tmpbytes; + break; + default: + ret = -EFAULT; + if (copy_from_user(tmp, dstp, dstbytes) == 0) + ret = perfctr_cpu_control_read(&perfctr->cpu_state.control, + domain, tmp, dstbytes); + } + + if (tsk == current) + preempt_enable(); + + if (ret > 0) { + if (ret > dstbytes) + ret = dstbytes; + if (ret > 0 && copy_to_user(dstp, tmp, ret)) + ret = -EFAULT; + } + kfree(tmp); + return ret; +} + +/**************************************************************** + * * + * Virtual perfctr file operations. * + * * + ****************************************************************/ + +static int vperfctr_mmap(struct file *filp, struct vm_area_struct *vma) +{ + struct vperfctr *perfctr; + + /* Only allow read-only mapping of first page. */ + if ((vma->vm_end - vma->vm_start) != PAGE_SIZE || + vma->vm_pgoff != 0 || + (pgprot_val(vma->vm_page_prot) & _PAGE_RW) || + (vma->vm_flags & (VM_WRITE | VM_MAYWRITE))) + return -EPERM; + perfctr = filp->private_data; + if (!perfctr) + return -EPERM; + return remap_pfn_range(vma, vma->vm_start, + virt_to_phys(perfctr) >> PAGE_SHIFT, + PAGE_SIZE, vma->vm_page_prot); +} + +static int vperfctr_release(struct inode *inode, struct file *filp) +{ + struct vperfctr *perfctr = filp->private_data; + filp->private_data = NULL; + if (perfctr) + put_vperfctr(perfctr); + return 0; +} + +static struct file_operations vperfctr_file_ops = { + .mmap = vperfctr_mmap, + .release = vperfctr_release, +}; + +/**************************************************************** + * * + * File system for virtual perfctrs. Based on pipefs. * + * * + ****************************************************************/ + +#define VPERFCTRFS_MAGIC (('V'<<24)|('P'<<16)|('M'<<8)|('C')) + +/* The code to set up a `struct file_system_type' for a pseudo fs + is unfortunately not the same in 2.4 and 2.6. 
*/ +#include /* needed for 2.6, included by fs.h in 2.4 */ + +static struct super_block * +vperfctrfs_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, void *data) +{ + return get_sb_pseudo(fs_type, "vperfctr:", NULL, VPERFCTRFS_MAGIC); +} + +static struct file_system_type vperfctrfs_type = { + .name = "vperfctrfs", + .get_sb = vperfctrfs_get_sb, + .kill_sb = kill_anon_super, +}; + +/* XXX: check if s/vperfctr_mnt/vperfctrfs_type.kern_mnt/ would work */ +static struct vfsmount *vperfctr_mnt; +#define vperfctr_fs_init_done() (vperfctr_mnt != NULL) + +static int __init vperfctrfs_init(void) +{ + int err = register_filesystem(&vperfctrfs_type); + if (!err) { + vperfctr_mnt = kern_mount(&vperfctrfs_type); + if (!IS_ERR(vperfctr_mnt)) + return 0; + err = PTR_ERR(vperfctr_mnt); + unregister_filesystem(&vperfctrfs_type); + vperfctr_mnt = NULL; + } + return err; +} + +static void __exit vperfctrfs_exit(void) +{ + unregister_filesystem(&vperfctrfs_type); + mntput(vperfctr_mnt); +} + +static struct inode *vperfctr_get_inode(void) +{ + struct inode *inode; + + inode = new_inode(vperfctr_mnt->mnt_sb); + if (!inode) + return NULL; + inode->i_fop = &vperfctr_file_ops; + inode->i_state = I_DIRTY; + inode->i_mode = S_IFCHR | S_IRUSR | S_IWUSR; + inode->i_uid = current->fsuid; + inode->i_gid = current->fsgid; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; + inode->i_blksize = 0; + return inode; +} + +static int vperfctrfs_delete_dentry(struct dentry *dentry) +{ + return 1; +} + +static struct dentry_operations vperfctrfs_dentry_operations = { + .d_delete = vperfctrfs_delete_dentry, +}; + +static struct dentry *vperfctr_d_alloc_root(struct inode *inode) +{ + struct qstr this; + char name[32]; + struct dentry *dentry; + + sprintf(name, "[%lu]", inode->i_ino); + this.name = name; + this.len = strlen(name); + this.hash = inode->i_ino; /* will go */ + dentry = d_alloc(vperfctr_mnt->mnt_sb->s_root, &this); + if (dentry) { + dentry->d_op = &vperfctrfs_dentry_operations; + d_add(dentry, inode); + } + return dentry; +} + +static struct file *vperfctr_get_filp(void) +{ + struct file *filp; + struct inode *inode; + struct dentry *dentry; + + filp = get_empty_filp(); + if (!filp) + goto out; + inode = vperfctr_get_inode(); + if (!inode) + goto out_filp; + dentry = vperfctr_d_alloc_root(inode); + if (!dentry) + goto out_inode; + + filp->f_vfsmnt = mntget(vperfctr_mnt); + filp->f_dentry = dentry; + filp->f_mapping = dentry->d_inode->i_mapping; + + filp->f_pos = 0; + filp->f_flags = 0; + filp->f_op = &vperfctr_file_ops; /* fops_get() if MODULE */ + filp->f_mode = FMODE_READ; + filp->f_version = 0; + + return filp; + + out_inode: + iput(inode); + out_filp: + put_filp(filp); /* doesn't run ->release() like fput() does */ + out: + return NULL; +} + +/**************************************************************** + * * + * Virtual perfctr actual system calls. 
* + * * + ****************************************************************/ + +/* tid is the actual task/thread id (née pid, stored as ->pid), + pid/tgid is that 2.6 thread group id crap (stored as ->tgid) */ +asmlinkage long sys_vperfctr_open(int tid, int creat) +{ + struct file *filp; + struct task_struct *tsk; + struct vperfctr *perfctr; + int err; + int fd; + + if (!vperfctr_fs_init_done()) + return -ENODEV; + filp = vperfctr_get_filp(); + if (!filp) + return -ENOMEM; + err = fd = get_unused_fd(); + if (err < 0) + goto err_filp; + perfctr = NULL; + if (creat) { + perfctr = get_empty_vperfctr(); /* may sleep */ + if (IS_ERR(perfctr)) { + err = PTR_ERR(perfctr); + goto err_fd; + } + } + tsk = current; + if (tid != 0 && tid != tsk->pid) { /* remote? */ + read_lock(&tasklist_lock); + tsk = find_task_by_pid(tid); + if (tsk) + get_task_struct(tsk); + read_unlock(&tasklist_lock); + err = -ESRCH; + if (!tsk) + goto err_perfctr; + err = ptrace_check_attach(tsk, 0); + if (err < 0) + goto err_tsk; + } + if (creat) { + /* check+install must be atomic to prevent remote-control races */ + task_lock(tsk); + if (!tsk->thread.perfctr) { + perfctr->owner = tsk; + tsk->thread.perfctr = perfctr; + err = 0; + } else + err = -EEXIST; + task_unlock(tsk); + if (err) + goto err_tsk; + } else { + perfctr = tsk->thread.perfctr; + /* XXX: Old API needed to allow NULL perfctr here. + Do we want to keep or change that rule? */ + } + filp->private_data = perfctr; + if (perfctr) + atomic_inc(&perfctr->count); + if (tsk != current) + put_task_struct(tsk); + fd_install(fd, filp); + return fd; + err_tsk: + if (tsk != current) + put_task_struct(tsk); + err_perfctr: + if (perfctr) /* can only occur if creat != 0 */ + put_vperfctr(perfctr); + err_fd: + put_unused_fd(fd); + err_filp: + fput(filp); + return err; +} + +static struct vperfctr *fd_get_vperfctr(int fd) +{ + struct vperfctr *perfctr; + struct file *filp; + int err; + + err = -EBADF; + filp = fget(fd); + if (!filp) + goto out; + err = -EINVAL; + if (filp->f_op != &vperfctr_file_ops) + goto out_filp; + perfctr = filp->private_data; + if (!perfctr) + goto out_filp; + atomic_inc(&perfctr->count); + fput(filp); + return perfctr; + out_filp: + fput(filp); + out: + return ERR_PTR(err); +} + +static struct task_struct *vperfctr_get_tsk(struct vperfctr *perfctr) +{ + struct task_struct *tsk; + + tsk = current; + if (perfctr != current->thread.perfctr) { + /* this synchronises with vperfctr_unlink() and itself */ + spin_lock(&perfctr->owner_lock); + tsk = perfctr->owner; + if (tsk) + get_task_struct(tsk); + spin_unlock(&perfctr->owner_lock); + if (tsk) { + int ret = ptrace_check_attach(tsk, 0); + if (ret < 0) { + put_task_struct(tsk); + return ERR_PTR(ret); + } + } + } + return tsk; +} + +static void vperfctr_put_tsk(struct task_struct *tsk) +{ + if (tsk && tsk != current) + put_task_struct(tsk); +} + +asmlinkage long sys_vperfctr_write(int fd, unsigned int domain, + const void __user *argp, + unsigned int argbytes) +{ + struct vperfctr *perfctr; + struct task_struct *tsk; + int ret; + + perfctr = fd_get_vperfctr(fd); + if (IS_ERR(perfctr)) + return PTR_ERR(perfctr); + tsk = vperfctr_get_tsk(perfctr); + if (IS_ERR(tsk)) { + ret = PTR_ERR(tsk); + goto out; + } + ret = do_vperfctr_write(perfctr, domain, argp, argbytes, tsk); + vperfctr_put_tsk(tsk); + out: + put_vperfctr(perfctr); + return ret; +} + +asmlinkage long sys_vperfctr_control(int fd, unsigned int cmd) +{ + struct vperfctr *perfctr; + struct task_struct *tsk; + int ret; + + perfctr = fd_get_vperfctr(fd); + if 
(IS_ERR(perfctr)) + return PTR_ERR(perfctr); + tsk = vperfctr_get_tsk(perfctr); + if (IS_ERR(tsk)) { + ret = PTR_ERR(tsk); + goto out; + } + ret = do_vperfctr_control(perfctr, cmd, tsk); + vperfctr_put_tsk(tsk); + out: + put_vperfctr(perfctr); + return ret; +} + +asmlinkage long sys_vperfctr_read(int fd, unsigned int domain, + void __user *argp, unsigned int argbytes) +{ + struct vperfctr *perfctr; + struct task_struct *tsk; + int ret; + + perfctr = fd_get_vperfctr(fd); + if (IS_ERR(perfctr)) + return PTR_ERR(perfctr); + tsk = vperfctr_get_tsk(perfctr); + if (IS_ERR(tsk)) { + ret = PTR_ERR(tsk); + goto out; + } + ret = do_vperfctr_read(perfctr, domain, argp, argbytes, tsk); + vperfctr_put_tsk(tsk); + out: + put_vperfctr(perfctr); + return ret; +} + +/**************************************************************** + * * + * module_init/exit * + * * + ****************************************************************/ + +int __init vperfctr_init(void) +{ + return vperfctrfs_init(); +} + +void __exit vperfctr_exit(void) +{ + vperfctrfs_exit(); +} diff -puN /dev/null drivers/perfctr/virtual.h --- /dev/null 2003-09-15 06:40:47.000000000 -0700 +++ devel-akpm/drivers/perfctr/virtual.h 2005-07-08 23:11:41.000000000 -0700 @@ -0,0 +1,13 @@ +/* $Id: virtual.h,v 1.13 2004/05/31 18:18:55 mikpe Exp $ + * Virtual per-process performance counters. + * + * Copyright (C) 1999-2004 Mikael Pettersson + */ + +#ifdef CONFIG_PERFCTR_VIRTUAL +extern int vperfctr_init(void); +extern void vperfctr_exit(void); +#else +static inline int vperfctr_init(void) { return 0; } +static inline void vperfctr_exit(void) { } +#endif diff -puN include/linux/sched.h~perfctr include/linux/sched.h --- devel/include/linux/sched.h~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/include/linux/sched.h 2005-07-08 23:11:41.000000000 -0700 @@ -1140,6 +1140,9 @@ extern void unhash_process(struct task_s * subscriptions and synchronises with wait4(). Also used in procfs. Also * pins the final release of task.io_context. * + * Synchronises set_cpus_allowed(), unlink, and creat of ->thread.perfctr. + * [if CONFIG_PERFCTR_VIRTUAL] + * * Nests both inside and outside of read_lock(&tasklist_lock). * It must not be nested with write_lock_irq(&tasklist_lock), * neither inside nor outside. diff -puN arch/ppc/kernel/head.S~perfctr arch/ppc/kernel/head.S --- devel/arch/ppc/kernel/head.S~perfctr 2005-07-08 23:11:41.000000000 -0700 +++ devel-akpm/arch/ppc/kernel/head.S 2005-07-08 23:11:41.000000000 -0700 @@ -502,7 +502,11 @@ SystemCall: Trap_0f: EXCEPTION_PROLOG addi r3,r1,STACK_FRAME_OVERHEAD +#ifdef CONFIG_PERFCTR_INTERRUPT_SUPPORT + EXC_XFER_EE(0xf00, do_perfctr_interrupt) +#else EXC_XFER_EE(0xf00, UnknownException) +#endif /* * Handle TLB miss for instruction on 603/603e. diff -puN /dev/null Documentation/perfctr/low-level-api.txt --- /dev/null 2003-09-15 06:40:47.000000000 -0700 +++ devel-akpm/Documentation/perfctr/low-level-api.txt 2005-07-08 23:11:41.000000000 -0700 @@ -0,0 +1,216 @@ +$Id: low-level-api.txt,v 1.1 2004/07/02 18:57:05 mikpe Exp $ + +PERFCTR LOW-LEVEL DRIVERS API +============================= + +This document describes the common low-level API. +See low-level-$ARCH.txt for architecture-specific documentation. + +General Model +============= +The model is that of a processor with: +- A non-programmable clock-like counter, the "TSC". + The TSC frequency is assumed to be constant, but it is not + assumed to be identical to the core frequency. + The TSC may be absent. 
+- A set of programmable counters, the "perfctrs" or "pmcs". + Control data may be per-counter, global, or both. + The counters are not assumed to be interchangeable. + + A normal counter that simply counts events is referred to + as an "accumulation-mode" or "a-mode" counter. Its total + count is computed by adding the counts for the individual + periods during which the counter is active. Two per-counter + state variables are used for this: "sum", which is the + total count up to but not including the current period, + and "start", which records the value of the hardware counter + at the start of the current period. At the end of a period, + the hardware counter's value is read again, and the increment + relative to the start value is added to the sum. This strategy + is used because it avoids a number of hardware problems. + + A counter that has been programmed to generate an interrupt + on overflow is referred to as an "interrupt-mode" or "i-mode" + counter. I-mode counters are initialised to specific values, + and after overflowing are reset to their (re)start values. + The total event count is available just as for a-mode counters. + + The set of counters may be empty, in which case only the + TSC (which must be present) can be sampled. + +Contents of <asm-$ARCH/perfctr.h> +================================= + +"struct perfctr_sum_ctrs" +------------------------- +struct perfctr_sum_ctrs { + unsigned long long tsc; + unsigned long long pmc[..]; /* one per counter */ +}; + +Architecture-specific container for counter values. +Used in the kernel/user API, but not by the low-level drivers. + +"struct perfctr_cpu_control" +---------------------------- +This struct includes at least the following fields: + + unsigned int tsc_on; + unsigned int nractrs; /* # of a-mode counters */ + unsigned int nrictrs; /* # of i-mode counters */ + unsigned int pmc_map[..]; /* one per counter: virt-to-phys mapping */ + unsigned int evntsel[..]; /* one per counter: hw control data */ + int ireset[..]; /* one per counter: i-mode (re)start value */ + +Architecture-specific container for control data. +Used both in the kernel/user API and by the low-level drivers +(embedded in "struct perfctr_cpu_state"). + +"tsc_on" is non-zero if the TSC should be sampled. + +"nractrs" is the number of a-mode counters, corresponding to +elements 0..nractrs-1 in the per-counter arrays. + +"nrictrs" is the number of i-mode counters, corresponding to +elements nractrs..nractrs+nrictrs-1 in the per-counter arrays. + +"nractrs+nrictrs" is the total number of counters to program +and sample. A-mode and i-mode counters are separated in order +to allow quick enumeration of either set, which is needed in +some low-level driver operations. + +"pmc_map[]" maps each counter to its corresponding hardware counter +identification. No two counters may map to the same hardware counter. +This mapping is present because the hardware may have asymmetric +counters or other addressing quirks, which means that a counter's index +may not suffice to address its hardware counter. + +"evntsel[]" contains the per-counter control data. Architecture-specific +global control data, if any, is placed in architecture-specific fields. + +"ireset[]" contains the (re)start values for the i-mode counters. +Only indices nractrs..nractrs+nrictrs-1 are used.
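+
+As an illustration, the following sketch (not itself part of the API;
+the event codes are placeholders, and the ireset sign convention is
+architecture-specific: negative on x86, non-negative on ppc32) programs
+one a-mode counter plus one i-mode counter:
+
+	struct perfctr_cpu_control control;
+
+	memset(&control, 0, sizeof control);
+	control.tsc_on = 1;
+	control.nractrs = 1;		/* counter 0 is a-mode */
+	control.nrictrs = 1;		/* counter 1 is i-mode */
+	control.pmc_map[0] = 0;		/* counter 0 -> hw counter 0 */
+	control.pmc_map[1] = 1;		/* counter 1 -> hw counter 1 */
+	control.evntsel[0] = 0x1;	/* placeholder event code */
+	control.evntsel[1] = 0x2;	/* placeholder event code */
+	control.ireset[1] = -10000;	/* x86-style: interrupt after 10000 events */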
+ +"struct perfctr_cpu_state" +-------------------------- +This struct includes at least the following fields: + + unsigned int cstatus; + unsigned int tsc_start; + unsigned long long tsc_sum; + struct { + unsigned int map; + unsigned int start; + unsigned long long sum; + } pmc[..]; /* one per counter; the size is not part of the user ABI */ +#ifdef __KERNEL__ + struct perfctr_cpu_control control; +#endif + +This type records the state and control data for a collection +of counters. It is used by many low-level operations, and may +be exported to user-space via mmap(). + +"cstatus" is a re-encoding of control.tsc_on/nractrs/nrictrs, +used because it reduces overheads in key low-level operations. +Operations on cstatus values include: +- unsigned int perfctr_mk_cstatus(unsigned int tsc_on, unsigned int nractrs, unsigned int nrictrs); + Construct a cstatus value. +- unsigned int perfctr_cstatus_enabled(unsigned int cstatus); + Check if any part (tsc_on, nractrs, nrictrs) of the cstatus is non-zero. +- int perfctr_cstatus_has_tsc(unsigned int cstatus); + Check if the tsc_on part of the cstatus is non-zero. +- unsigned int perfctr_cstatus_nrctrs(unsigned int cstatus); + Retrieve nractrs+nrictrs from the cstatus. +- unsigned int perfctr_cstatus_has_ictrs(unsigned int cstatus); + Check if the nrictrs part of cstatus is non-zero. + +"tsc_start" and "tsc_sum" record the state of the TSC. + +"pmc[]" contains the per-counter state, in the "start" and "sum" +fields. The "map" field contains the corresponding hardware counter +identification, from the counter's entry in "control.pmc_map[]"; +it is copied into pmc[] to reduce overheads in key low-level operations. + +"control" contains the control data which determines the +behaviour of the counters. + +User-space overflow signal handler items +---------------------------------------- +After a counter has overflowed, a user-space signal handler may +be invoked with a "struct siginfo" identifying the source of the +signal and the set of overflown counters. + +#define SI_PMC_OVF .. + +Value to be stored in "si.si_code". + +#define si_pmc_ovf_mask .. + +Field in which to store a bit-mask of the overflown counters. + +Kernel-internal API +------------------- + +/* Driver init/exit. + perfctr_cpu_init() performs hardware detection and may fail. */ +extern int perfctr_cpu_init(void); +extern void perfctr_cpu_exit(void); + +/* CPU type name. Set if perfctr_cpu_init() was successful. */ +extern char *perfctr_cpu_name; + +/* Hardware reservation. A high-level driver must reserve the + hardware before it may use it, and release it afterwards. + "service" is a unique string identifying the high-level driver. + perfctr_cpu_reserve() returns NULL on success; if another + high-level driver has reserved the hardware, then that + driver's "service" string is returned. */ +extern const char *perfctr_cpu_reserve(const char *service); +extern void perfctr_cpu_release(const char *service); + +/* PRE: state has no running interrupt-mode counters. + Check that the new control data is valid. + Update the low-level driver's private control data. + is_global should be zero for per-process counters and non-zero + for global-mode counters. + Returns a negative error code if the control data is invalid. */ +extern int perfctr_cpu_update_control(struct perfctr_cpu_state *state, int is_global); + +/* Stop i-mode counters. Update sums and start values. + Read a-mode counters. Subtract from start and accumulate into sums. + Must be called with preemption disabled. 
*/ +extern void perfctr_cpu_suspend(struct perfctr_cpu_state *state); + +/* Reset i-mode counters to their start values. + Write control registers. + Read a-mode counters and update their start values. + Must be called with preemption disabled. */ +extern void perfctr_cpu_resume(struct perfctr_cpu_state *state); + +/* Perform an efficient combined suspend/resume operation. + Must be called with preemption disabled. */ +extern void perfctr_cpu_sample(struct perfctr_cpu_state *state); + +/* The type of a perfctr overflow interrupt handler. + It will be called in IRQ context, with preemption disabled. */ +typedef void (*perfctr_ihandler_t)(unsigned long pc); + +/* Install a perfctr overflow interrupt handler. + Should be called after perfctr_cpu_reserve() but before + any counter state has been activated. */ +extern void perfctr_cpu_set_ihandler(perfctr_ihandler_t); + +/* PRE: The state has been suspended and sampled by perfctr_cpu_suspend(). + Should be called from the high-level driver's perfctr_ihandler_t, + and preemption must not have been enabled. + Identify which counters have overflown, reset their start values + from ireset[], and perform any necessary hardware cleanup. + Returns a bit-mask of the overflown counters. */ +extern unsigned int perfctr_cpu_identify_overflow(struct perfctr_cpu_state*); + +/* Call perfctr_cpu_ireload() just before perfctr_cpu_resume() to + bypass internal caching and force a reload of the i-mode pmcs. + This ensures that perfctr_cpu_identify_overflow()'s state changes + are propagated to the hardware. */ +extern void perfctr_cpu_ireload(struct perfctr_cpu_state*); diff -puN /dev/null Documentation/perfctr/low-level-ppc32.txt --- /dev/null 2003-09-15 06:40:47.000000000 -0700 +++ devel-akpm/Documentation/perfctr/low-level-ppc32.txt 2005-07-08 23:11:41.000000000 -0700 @@ -0,0 +1,164 @@ +$Id: low-level-ppc32.txt,v 1.1 2004/07/02 18:57:05 mikpe Exp $ + +PERFCTRS PPC32 LOW-LEVEL API +============================ + +See low-level-api.txt for the common low-level API. +This document only describes ppc32-specific behaviour. +For detailed hardware control register layouts, see +the manufacturers' documentation. + +Supported processors +==================== +- PowerPC 604, 604e, 604ev. +- PowerPC 750/740, 750CX, 750FX, 750GX. +- PowerPC 7400, 7410, 7451/7441, 7457/7447. +- Any generic PowerPC with a timebase register. + +Contents of <asm-ppc/perfctr.h> +================================= + +"struct perfctr_sum_ctrs" +------------------------- +struct perfctr_sum_ctrs { + unsigned long long tsc; + unsigned long long pmc[8]; +}; + +The pmc[] array has room for 8 counters. + +"struct perfctr_cpu_control" +---------------------------- +struct perfctr_cpu_control { + unsigned int tsc_on; + unsigned int nractrs; /* # of a-mode counters */ + unsigned int nrictrs; /* # of i-mode counters */ + unsigned int pmc_map[8]; + unsigned int evntsel[8]; /* one per counter */ + int ireset[8]; /* [0,0x7fffffff], for i-mode counters */ + struct { + unsigned int mmcr0; /* sans PMC{1,2}SEL */ + unsigned int mmcr2; /* only THRESHMULT */ + /* IABR/DABR/BAMR not supported */ + } ppc; + unsigned int _reserved1; + unsigned int _reserved2; + unsigned int _reserved3; + unsigned int _reserved4; +}; + +The per-counter arrays have room for 8 elements. + +ireset[] values must be non-negative, since overflow occurs on +the non-negative-to-negative transition.
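+
+For example (a sketch, not from the driver itself): since overflow is
+the non-negative-to-negative transition, an i-mode counter that should
+interrupt after "period" events is started at
+
+	/* period in [1, 0x80000000] keeps the result in [0, 0x7fffffff] */
+	control.ireset[i] = 0x80000000 - period;
+
+which matches the init-time tests above, where PMC1 is preloaded with
+0x80000000-100 so that it overflows after 100 cycles.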
+
+The ppc sub-struct contains PowerPC-specific control data:
+- mmcr0: global control data for the MMCR0 SPR; the event
+  selectors for PMC1 and PMC2 are in evntsel[], not in mmcr0
+- mmcr2: global control data for the MMCR2 SPR; only the
+  THRESHMULT field can be specified
+
+"struct perfctr_cpu_state"
+--------------------------
+struct perfctr_cpu_state {
+	unsigned int cstatus;
+	struct {	/* k1 is opaque in the user ABI */
+		unsigned int id;
+		int isuspend_cpu;
+	} k1;
+	/* The two tsc fields must be inlined. Placing them in a
+	   sub-struct causes unwanted internal padding on x86-64. */
+	unsigned int tsc_start;
+	unsigned long long tsc_sum;
+	struct {
+		unsigned int map;
+		unsigned int start;
+		unsigned long long sum;
+	} pmc[8];	/* the size is not part of the user ABI */
+#ifdef __KERNEL__
+	unsigned int ppc_mmcr[3];
+	struct perfctr_cpu_control control;
+#endif
+};
+
+The k1 sub-struct is used by the low-level driver for
+caching purposes. "id" identifies the control data, and
+"isuspend_cpu" identifies the CPU on which the i-mode
+counters were last suspended.
+
+The pmc[] array has room for 8 elements.
+
+ppc_mmcr[] is computed from control by the low-level driver,
+and provides the data for the MMCR0, MMCR1, and MMCR2 SPRs.
+
+User-space overflow signal handler items
+----------------------------------------
+#ifdef __KERNEL__
+#define SI_PMC_OVF	(__SI_FAULT|'P')
+#else
+#define SI_PMC_OVF	('P')
+#endif
+#define si_pmc_ovf_mask	_sifields._pad[0]
+
+Kernel-internal API
+-------------------
+
+In perfctr_cpu_update_control(), the is_global parameter
+is ignored. (It is only relevant for x86.)
+
+CONFIG_PERFCTR_CPUS_FORBIDDEN_MASK is never defined.
+(It is only relevant for x86.)
+
+Overflow interrupt handling is not yet implemented.
+
+Processor-specific Notes
+========================
+
+General
+-------
+pmc_map[] contains a counter number, an integer between 0 and 5.
+It never contains an SPR number.
+
+Basic operation (the strategy for a-mode counters, caching
+control register contents, recording "suspend CPU" for i-mode
+counters) is the same as in the x86 driver.
+
+PowerPC 604/750/74xx
+--------------------
+These processors use similar hardware layouts, differing
+mainly in the number of counter and control registers.
+The set of available events differs greatly, but that only
+affects users, not the low-level driver itself.
+
+The hardware has 2 (604), 4 (604e/750/7400/7410), or 6
+(745x) counters (PMC1 to PMC6), and 1 (604), 2 (604e/750),
+or 3 (74xx) control registers (MMCR0 to MMCR2).
+
+MMCR0 contains global control bits, and the event selection
+fields for PMC1 and PMC2. MMCR1 contains event selection fields
+for PMC3-PMC6. MMCR2 contains the THRESHMULT flag, which
+specifies how MMCR0[THRESHOLD] should be scaled.
+
+In control.ppc.mmcr0, the PMC1SEL and PMC2SEL fields (0x00001FFF)
+are reserved. The PMXE flag (0x04000000) may only be set when
+the driver supports overflow interrupts.
+
+If FCECE or TRIGGER is set in MMCR0 on a 74xx processor, then
+MMCR0 can change asynchronously. The driver handles this, at
+the cost of some additional work in perfctr_cpu_suspend().
+Not setting these flags avoids that overhead.
+
+In control.ppc.mmcr2, only the THRESHMULT flag (0x80000000)
+may be set, and only on 74xx processors.
+
+The SIA (sampled instruction address) register is not used.
+The SDA (sampled data address) register is 604/604e-only,
+and is not used. The BAMR (breakpoint address mask) register
+is not used, but it is cleared by the driver.
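+
+To make the MMCR0/evntsel split concrete, here is a sketch of a-mode
+control data for a 74xx processor, following the rules above. The
+event codes are hypothetical, not taken from any manual:
+
+	struct perfctr_cpu_control control;
+
+	memset(&control, 0, sizeof control);
+	control.tsc_on = 1;
+	control.nractrs = 2;		/* two a-mode counters, no i-mode */
+	control.pmc_map[0] = 0;		/* PMC1 */
+	control.evntsel[0] = 0x02;	/* hypothetical event code */
+	control.pmc_map[1] = 1;		/* PMC2 */
+	control.evntsel[1] = 0x05;	/* hypothetical event code */
+	/* control.ppc.mmcr0 is left zero: no FCECE/TRIGGER overhead,
+	   and PMXE must not be set unless the driver supports
+	   overflow interrupts */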
+
+Generic PowerPC with timebase
+-----------------------------
+The driver supports any PowerPC as long as it has a timebase
+register, and the TB frequency is available via Open Firmware.
+In this case, the only valid usage mode is with tsc_on == 1
+and nractrs == nrictrs == 0 in the control data.
diff -puN /dev/null Documentation/perfctr/low-level-x86.txt
--- /dev/null	2003-09-15 06:40:47.000000000 -0700
+++ devel-akpm/Documentation/perfctr/low-level-x86.txt	2005-07-08 23:11:41.000000000 -0700
@@ -0,0 +1,360 @@
+$Id: low-level-x86.txt,v 1.1 2004/07/02 18:57:05 mikpe Exp $
+
+PERFCTRS X86 LOW-LEVEL API
+==========================
+
+See low-level-api.txt for the common low-level API.
+This document only describes x86-specific behaviour.
+For detailed hardware control register layouts, see
+the manufacturers' documentation.
+
+Contents
+========
+- Supported processors
+- Contents of <asm-i386/perfctr.h>
+- Processor-specific Notes
+- Implementation Notes
+
+Supported processors
+====================
+- Intel P5, P5MMX, P6, P4.
+- AMD K7, K8. (P6 clones, with some changes)
+- Cyrix 6x86MX, MII, and III. (good P5 clones)
+- Centaur WinChip C6, 2, and 3. (bad P5 clones)
+- VIA C3. (bad P6 clone)
+- Any generic x86 with a TSC.
+
+Contents of <asm-i386/perfctr.h>
+================================
+
+"struct perfctr_sum_ctrs"
+-------------------------
+struct perfctr_sum_ctrs {
+	unsigned long long tsc;
+	unsigned long long pmc[18];
+};
+
+The pmc[] array has room for 18 counters.
+
+"struct perfctr_cpu_control"
+----------------------------
+struct perfctr_cpu_control {
+	unsigned int tsc_on;
+	unsigned int nractrs;		/* # of a-mode counters */
+	unsigned int nrictrs;		/* # of i-mode counters */
+	unsigned int pmc_map[18];
+	unsigned int evntsel[18];	/* one per counter, even on P5 */
+	struct {
+		unsigned int escr[18];
+		unsigned int pebs_enable;	/* for replay tagging */
+		unsigned int pebs_matrix_vert;	/* for replay tagging */
+	} p4;
+	int ireset[18];			/* < 0, for i-mode counters */
+	unsigned int _reserved1;
+	unsigned int _reserved2;
+	unsigned int _reserved3;
+	unsigned int _reserved4;
+};
+
+The per-counter arrays have room for 18 elements.
+
+ireset[] values must be negative, since overflow occurs on
+the negative-to-non-negative transition.
+
+The p4 sub-struct contains P4-specific control data:
+- escr[]: the control data to write to the ESCR register
+  associated with the counter
+- pebs_enable: the control data to write to the PEBS_ENABLE MSR
+- pebs_matrix_vert: the control data to write to the
+  PEBS_MATRIX_VERT MSR
+
+"struct perfctr_cpu_state"
+--------------------------
+struct perfctr_cpu_state {
+	unsigned int cstatus;
+	struct {	/* k1 is opaque in the user ABI */
+		unsigned int id;
+		int isuspend_cpu;
+	} k1;
+	/* The two tsc fields must be inlined. Placing them in a
+	   sub-struct causes unwanted internal padding on x86-64. */
+	unsigned int tsc_start;
+	unsigned long long tsc_sum;
+	struct {
+		unsigned int map;
+		unsigned int start;
+		unsigned long long sum;
+	} pmc[18];	/* the size is not part of the user ABI */
+#ifdef __KERNEL__
+	struct perfctr_cpu_control control;
+	unsigned int p4_escr_map[18];
+#endif
+};
+
+The k1 sub-struct is used by the low-level driver for
+caching purposes. "id" identifies the control data, and
+"isuspend_cpu" identifies the CPU on which the i-mode
+counters were last suspended.
+
+The pmc[] array has room for 18 elements.
+
+p4_escr_map[] is computed from control by the low-level driver,
+and provides the MSR number for the counter's associated ESCR.
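+
+To make the layout concrete, here is a sketch of a-mode control data
+for a P6-family processor. The event code and CPL choice are
+illustrative only; the governing bit rules are given in the P6 notes
+below:
+
+	struct perfctr_cpu_control control;
+
+	memset(&control, 0, sizeof control);
+	control.tsc_on = 1;
+	control.nractrs = 1;
+	control.pmc_map[0] = 0;			/* PERFCTR0 */
+	control.evntsel[0] = 0x00400000		/* global enable, EVNTSEL0 only */
+			   | 0x00030000		/* USR+OS: count at all CPLs */
+			   | 0x79;		/* hypothetical event code */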
+
+User-space overflow signal handler items
+----------------------------------------
+#ifdef __KERNEL__
+#define SI_PMC_OVF	(__SI_FAULT|'P')
+#else
+#define SI_PMC_OVF	('P')
+#endif
+#define si_pmc_ovf_mask	_sifields._pad[0]
+
+Kernel-internal API
+-------------------
+
+In perfctr_cpu_update_control(), the is_global parameter controls
+whether monitoring of the other thread (T1) on HT P4s is permitted.
+On other processors the parameter is ignored.
+
+SMP kernels define CONFIG_PERFCTR_CPUS_FORBIDDEN_MASK and
+"extern cpumask_t perfctr_cpus_forbidden_mask;".
+On HT P4s, resource conflicts can occur because both threads
+(T0 and T1) in a processor share the same perfctr registers.
+To prevent conflicts, only thread 0 in each processor is allowed
+to access the counters. perfctr_cpus_forbidden_mask contains the
+smp_processor_id()s of each processor's thread 1, and it is the
+responsibility of the high-level driver to ensure that it never
+accesses the perfctr state from a forbidden thread.
+
+Overflow interrupt handling requires local APIC support in the kernel.
+
+Processor-specific Notes
+========================
+
+General
+-------
+pmc_map[] contains a counter number, as used by the RDPMC instruction.
+It never contains an MSR number.
+
+Counters are 32, 40, or 48 bits wide. The driver only ever
+reads the low 32 bits. This avoids performance issues and
+errata on some processors.
+
+Writing to counters or their control registers tends to be
+very expensive. This is why a-mode counters only use read
+operations on the counter registers. Caching of control
+register contents is done to avoid writing them. "Suspend CPU"
+is recorded for i-mode counters to avoid writing the counter
+registers when the counters are resumed (their control
+registers must be written at both suspend and resume, however).
+
+Some processors are unable to stop the counters (Centaur/VIA),
+and some are unable to reinitialise them to arbitrary values (P6).
+Storing the counters' total counts in the hardware counters
+would break as soon as context switches occur. This is another
+reason why the accumulate-differences method for maintaining the
+counter values is used.
+
+Intel P5
+--------
+The hardware stores both counters' control data in a single
+control register, the CESR MSR. The evntsel values are
+limited to 16 bits each, and are combined by the low-level
+driver to form the value for the CESR. Apart from that,
+the evntsel values are direct images of the CESR.
+
+Bits 0xFE00 in an evntsel value are reserved.
+At least one evntsel CPL bit (0x00C0) must be set.
+
+For Cyrix' P5 clones, evntsel bits 0xFA00 are reserved.
+
+For Centaur's P5 clones, evntsel bits 0xFF00 are reserved.
+They have no CPL bits to set. The TSC is broken and cannot be used.
+
+Intel P6
+--------
+The evntsel values are mapped directly onto the counters'
+EVNTSEL control registers.
+
+The global enable bit (22) in EVNTSEL0 must be set. That bit is
+reserved in EVNTSEL1.
+
+Bits 21 and 19 (0x00280000) in each evntsel are reserved.
+
+For an i-mode counter, bit 20 (0x00100000) of its evntsel must be
+set. For a-mode counters, that bit must not be set.
+
+Hardware quirk: Counters are 40 bits wide, but writing to a
+counter only writes the low 32 bits: the remaining bits are
+sign-extended from bit 31.
+
+AMD K7/K8
+---------
+Similar to Intel P6. The main difference is that each evntsel has
+its own enable bit, which must be set.
+
+VIA C3
+------
+Superficially similar to Intel P6, but only PERFCTR1/EVNTSEL1
+are programmable.
pmc_map[0] must be 1 if nractrs == 1.
+
+Bits 0xFFFFFE00 in the evntsel are reserved. There are no auxiliary
+control bits to set.
+
+Generic
+-------
+Only permits TSC sampling, with tsc_on == 1 and nractrs == nrictrs == 0
+in the control data.
+
+Intel P4
+--------
+For each counter, its evntsel[] value is mapped onto its CCCR
+control register, and its p4.escr[] value is mapped onto its
+associated ESCR control register.
+
+The ESCR register number is computed from the hardware counter
+number (from pmc_map[]) and the ESCR SELECT field in the CCCR,
+and is cached in p4_escr_map[].
+
+pmc_map[] contains the value to pass to RDPMC when reading the
+counter. It is strongly recommended to set bit 31 (fast rdpmc).
+
+In each evntsel/CCCR value:
+- the OVF, OVF_PMI_T1 and hardware-reserved bits (0xB80007FF)
+  are reserved and must not be set
+- bit 11 (EXTENDED_CASCADE) is only permitted on P4 models >= 2,
+  and for counters 12 and 15-17
+- bits 16 and 17 (ACTIVE_THREAD) must both be set on non-HT processors
+- at least one of bits 12 (ENABLE), 30 (CASCADE), or 11 (EXTENDED_CASCADE)
+  must be set
+- bit 26 (OVF_PMI_T0) must be clear for a-mode counters, and set
+  for i-mode counters; if bit 25 (FORCE_OVF) is also set, then
+  the corresponding ireset[] value must be exactly -1
+
+In each p4.escr[] value:
+- bit 32 is reserved and must not be set
+- the CPL_T1 field (bits 0 and 1) must be zero except on HT processors
+  when global-mode counters are used
+- IQ_ESCR0 and IQ_ESCR1 can only be used on P4 models <= 2
+
+PEBS is not supported, but the replay tagging bits in PEBS_ENABLE
+and PEBS_MATRIX_VERT may be used.
+
+If p4.pebs_enable is zero, then p4.pebs_matrix_vert must also be zero.
+
+If p4.pebs_enable is non-zero:
+- only bits 24, 10, 9, 2, 1, and 0 may be set; note that in contrast
+  to Intel's documentation, bit 25 (ENABLE_PEBS_MY_THR) is not needed
+  and must not be set
+- bit 24 (UOP_TAG) must be set
+- at least one of bits 10, 9, 2, 1, or 0 must be set
+- in p4.pebs_matrix_vert, all bits except 1 and 0 must be clear,
+  and at least one of bits 1 and 0 must be set
+
+Implementation Notes
+====================
+
+Caching
+-------
+Each 'struct perfctr_cpu_state' contains two cache-related fields:
+- 'id': a unique identifier for the control data contents
+- 'isuspend_cpu': the identity of the CPU on which a state containing
+  interrupt-mode counters was last suspended
+
+To this the driver adds a per-CPU cache, recording:
+- the 'id' of the control data currently in that CPU
+- the current contents of each control register
+
+When perfctr_cpu_update_control() has validated the new control data,
+it also updates the id field.
+
+The driver's internal 'write_control' function, called from the
+perfctr_cpu_resume() API function, first checks if the state's id
+matches that of the CPU's cache, and if so, returns. Otherwise
+it checks each control register in the state and updates those
+that do not match the cache. Finally, it writes the state's id
+to the cache. Tests on various x86 processor types have shown that
+MSR writes are very expensive: the purpose of these cache checks
+is to avoid MSR writes whenever possible.
+
+Unlike accumulation-mode counters, interrupt-mode counters must be
+physically stopped when suspended, primarily to avoid overflow
+interrupts in contexts not expecting them, and secondarily to avoid
+increments to the counters themselves (see below).
+
+When suspending interrupt-mode counters, the driver:
+- records the CPU identity in the state's 'isuspend_cpu' field
+- stops each interrupt-mode counter by disabling its control register
+- lets the cache and state id values remain the same
+
+Later, when resuming interrupt-mode counters, the driver:
+- if the state and cache id values match:
+  * the cache id is cleared, to force a reload of the control
+    registers stopped at suspend (see below)
+  * if the state's "suspend" CPU identity matches the current CPU,
+    the counter registers are still valid, and the procedure returns
+- if the procedure did not return above, it then loops over each
+  interrupt-mode counter:
+  * the counter's control register is physically disabled, unless
+    the cache indicates that it is already disabled; this is necessary
+    to prevent premature events and overflow interrupts if the CPU's
+    registers previously belonged to some other state
+  * then the counter register itself is restored
+After this interrupt-mode-specific resume code is complete, the
+driver continues by calling 'write_control' as described above.
+The state and cache ids will not match, forcing write_control to
+reload the disabled interrupt-mode control registers.
+
+Call-site Backpatching
+----------------------
+The x86 family of processors is quite diverse in how its
+performance counters work and are accessed. There are three
+main designs (P5, P6, and P4) with several variations.
+To handle this, the processor type detection and initialisation
+code sets up a number of function pointers to point to the
+correct procedures for the actual CPU type.
+
+Calls via function pointers are more expensive than direct calls,
+so the driver instead performs direct calls to wrappers that
+backpatch the original call sites, so that future calls go
+directly to the CPU-specific functions.
+
+Unsynchronised code backpatching in SMP systems doesn't work
+on Intel P6 processors due to an erratum, so the driver performs
+a "finalise backpatching" step after the CPU-specific function
+pointers have been set up. This step invokes the API procedures
+on a temporary state object, set up to force every backpatchable
+call site to be invoked and adjusted.
+
+Several low-level API procedures are called in the context-switch
+path by the per-process perfctrs kernel extension, which motivates
+the efforts to reduce runtime overheads as much as possible.
+
+Overflow Interrupts
+-------------------
+The x86 hardware enables overflow interrupts via the local
+APIC's LVTPC entry, which is only present in P6/K7/K8/P4.
+
+The low-level driver supports overflow interrupts as follows:
+- It reserves a local APIC vector, 0xee, as LOCAL_PERFCTR_VECTOR.
+- It adds a local APIC exception handler to entry.S, which
+  invokes the driver's smp_perfctr_interrupt() procedure.
+- It adds code to i8259.c to bind the LOCAL_PERFCTR_VECTOR
+  interrupt gate to the exception handler in entry.S.
+- During processor type detection, it records whether the
+  processor supports the local APIC, and sets up function pointers
+  for the suspend and resume operations on interrupt-mode counters.
+- When the low-level driver is activated, it enables overflow
+  interrupts by writing LOCAL_PERFCTR_VECTOR to each CPU's APIC_LVTPC.
+- Overflow interrupts now end up in smp_perfctr_interrupt(), which
+  ACKs the interrupt and invokes the interrupt handler installed
+  by the high-level service/driver.
+- When the low-level driver is deactivated, it disables overflow
+  interrupts by masking APIC_LVTPC in each CPU.
It then releases
+  the local APIC back to the NMI watchdog.
+
+At compile-time, the low-level driver indicates overflow interrupt
+support by enabling CONFIG_PERFCTR_INTERRUPT_SUPPORT. If the feature
+is also available at runtime, it sets the PERFCTR_FEATURE_PCINT flag
+in the perfctr_info object.
diff -puN /dev/null Documentation/perfctr/overview.txt
--- /dev/null	2003-09-15 06:40:47.000000000 -0700
+++ devel-akpm/Documentation/perfctr/overview.txt	2005-07-08 23:11:41.000000000 -0700
@@ -0,0 +1,129 @@
+$Id: perfctr-documentation-update.patch,v 1.1 2004/07/12 05:41:57 akpm Exp $
+
+AN OVERVIEW OF PERFCTR
+======================
+The perfctr package adds support to the Linux kernel for using
+the performance-monitoring counters found in many processors.
+
+Perfctr is internally organised in three layers:
+
+- The low-level drivers, one for each supported architecture.
+  Currently there are two, one for 32 and 64-bit x86 processors,
+  and one for 32-bit PowerPC processors.
+
+  low-level-api.txt documents the model of the performance counters
+  used in this package, and the internal API to the low-level drivers.
+
+  low-level-{x86,ppc}.txt provide documentation specific to those
+  architectures and their low-level drivers.
+
+- The high-level services.
+  There is currently one, a kernel extension adding support for
+  virtualised per-process performance counters.
+  See virtual.txt for documentation on this kernel extension.
+
+  [There used to be a second high-level service, a simple driver
+  to control and access all performance counters in all processors.
+  This driver is currently removed, pending an acceptable new API.]
+
+- The top-level, which performs initialisation and implements
+  common procedures and system calls.
+
+Rationale
+---------
+The perfctr package solves three problems:
+
+- Hardware invariably restricts programming of the performance
+  counter registers to kernel-level code, and sometimes also
+  restricts reading the counters to kernel-level code.
+
+  Perfctr adds APIs allowing user-space code to access the counters.
+  In the case of the per-process counters kernel extension,
+  even non-privileged processes are allowed access.
+
+- Hardware often limits the precision of the hardware counters,
+  making them unsuitable for storing total event counts.
+
+  The counts are instead maintained as 64-bit values in software,
+  with the hardware counters used to derive increments over given
+  time periods.
+
+- In an unmodified kernel, the thread state does not include the
+  performance monitoring counters, and the context switch code
+  does not save and restore them. In this situation the counters
+  are system-wide, making them unreliable and inaccurate when used
+  for monitoring specific processes or specific segments of code.
+
+  The per-process counters kernel extension treats the counter state as
+  part of the thread state, solving the reliability and accuracy problems.
+
+Non-goals
+---------
+Providing high-level interfaces that abstract and hide the
+underlying hardware is a non-goal. Such abstractions can
+and should be implemented in user-space, for several reasons:
+
+- The complexity and variability of the hardware means that
+  any abstraction would be inaccurate. There would be both loss
+  of functionality and the presence of functionality that isn't
+  supportable on any given processor. User-space tools
+  and libraries can implement this, on top of the processor-
+  specific interfaces provided by the kernel.
+
+- The implementation of such an abstraction would be large
+  and complex. (Consider ESCR register assignment on P4.)
+  Performing complex actions in user-space simplifies the
+  kernel, allowing it to concentrate on validating control
+  data, managing processes, and driving the hardware.
+  (Cf. the role of compilers.)
+
+- The abstraction is purely a convenience for users. The
+  kernel-level components have no need for it.
+
+Common System Calls
+===================
+This lists those system calls that are not tied to
+a specific high-level service/driver.
+
+Querying CPU and Driver Information
+-----------------------------------
+int err = sys_perfctr_info(struct perfctr_info *info,
+			   struct perfctr_cpu_mask *cpus,
+			   struct perfctr_cpu_mask *forbidden);
+
+This operation retrieves information from the kernel about
+the processors in the system.
+
+If non-NULL, '*info' will be updated with information about the
+capabilities of the processor and the low-level driver.
+
+If non-NULL, '*cpus' will be updated with a bitmask listing the
+set of processors in the system. The size of this bitmask is not
+statically known, so the protocol is:
+
+1. User-space initialises cpus->nrwords to the number of elements
+   allocated for cpus->mask[].
+2. The kernel reads cpus->nrwords, and then writes the required
+   number of words to cpus->nrwords.
+3. If the required number of words is greater than the original value
+   of cpus->nrwords, then an EOVERFLOW error is signalled.
+4. Otherwise, the kernel converts its internal cpumask_t value
+   to the external format and writes that to cpus->mask[].
+
+If non-NULL, '*forbidden' will be updated with a bitmask listing
+the set of processors in the system on which users must not try
+to use performance counters. This is currently only relevant for
+hyper-threaded Pentium 4/Xeon systems. The protocol is the same
+as for '*cpus'.
+
+Notes:
+- The internal representation of a cpumask_t is as an array of
+  unsigned long. This representation is unsuitable for user-space,
+  because it is not binary-compatible between 32 and 64-bit
+  variants of a big-endian processor. The 'struct perfctr_cpu_mask'
+  type uses an array of unsigned 32-bit integers.
+- The protocol for retrieving a 'struct perfctr_cpu_mask' was
+  designed to allow user-space to quickly determine the correct
+  size of the 'mask[]' array. Other system calls use weaker protocols,
+  which force user-space to guess increasingly larger values in a
+  loop, until an acceptable value is found.
diff -puN /dev/null Documentation/perfctr/virtual.txt
--- /dev/null	2003-09-15 06:40:47.000000000 -0700
+++ devel-akpm/Documentation/perfctr/virtual.txt	2005-07-08 23:11:41.000000000 -0700
@@ -0,0 +1,357 @@
+$Id: virtual.txt,v 1.3 2004/08/09 09:42:22 mikpe Exp $
+
+VIRTUAL PER-PROCESS PERFORMANCE COUNTERS
+========================================
+This document describes the virtualised per-process performance
+counters kernel extension. See "General Model" in low-level-api.txt
+for the model of the processor's performance counters.
+
+Contents
+========
+- Summary
+- Design & Implementation Notes
+  * State
+  * Thread Management Hooks
+  * Synchronisation Rules
+  * The Pseudo File System
+- API For User-Space
+  * Opening/Creating the State
+  * Updating the Control
+  * Unlinking the State
+  * Reading the State
+  * Resuming After Handling Overflow Signal
+  * Reading the Counter Values
+- Limitations / TODO List
+
+Summary
+=======
+The virtualised per-process performance counters facility
+(virtual perfctrs) is a kernel extension which extends the
+thread state to record perfctr settings and values, and augments
+the context-switch code to save perfctr values at suspends and
+restore them at resumes. This "virtualises" the performance
+counters in much the same way as the kernel already virtualises
+general-purpose and floating-point registers.
+
+Virtual perfctrs also adds an API allowing non-privileged
+user-space processes to set up and access their perfctrs.
+
+As this facility is primarily intended to support developers
+of user-space code, both virtualisation and allowing access
+from non-privileged code are essential features.
+
+Design & Implementation Notes
+=============================
+
+State
+-----
+The state of a thread's perfctrs is packaged up in an object of
+type 'struct vperfctr'. It consists of CPU-dependent state, a
+sampling timer, and some auxiliary administrative data. This is
+an independent object, with its own lifetime and access rules.
+
+The state object is attached to the thread via a pointer in its
+thread_struct. While attached, the object records the identity
+of its owner thread: this is used for user-space API accesses
+from threads other than the owner.
+
+The state is separate from the thread_struct for several reasons:
+- It's potentially large, hence it's allocated only when needed.
+- It can outlive its owner thread. The state can be opened as
+  a pseudo file: as long as that file is live, so is the object.
+- It can be mapped, via mmap() on the pseudo file's descriptor.
+  To facilitate this, a full page is allocated and reserved.
+
+Thread Management Hooks
+-----------------------
+Virtual perfctrs hooks into several thread management events:
+
+- exit_thread(): Calls perfctr_exit_thread() to stop the counters
+  and mark the vperfctr object as dead.
+
+- copy_thread(): Calls perfctr_copy_thread() to initialise
+  the child's vperfctr pointer. The child gets a new vperfctr
+  object containing the same control data as its parent.
+  Kernel-generated threads do not inherit any vperfctr state.
+
+- release_task(): Calls perfctr_release_task() to detach the
+  vperfctr object from the thread. If the child and its parent
+  still have the same perfctr control settings, then the child's
+  final counts are propagated back into its parent.
+
+- switch_to():
+  * Calls perfctr_suspend_thread() on the previous thread, to
+    suspend its counters.
+  * Calls perfctr_resume_thread() on the next thread, to resume
+    its counters. Also resets the sampling timer (see below).
+
+- update_process_times(): Calls perfctr_sample_thread(), which
+  decrements the sampling timer and samples the counters if the
+  timer reaches zero.
+
+  Sampling is normally only done at switch_to(), but if too much
+  time passes before the next switch_to(), a hardware counter may
+  increment by more than its range (usually 2^32). If this occurs,
+  the difference from its start value will be incorrect, causing
+  its updated sum to also be incorrect.
The sampling timer is used + to prevent this problem, which has been observed on SMP machines, + and on high clock frequency UP machines. + +- set_cpus_allowed(): Calls perfctr_set_cpus_allowed() to detect + attempts to migrate the thread to a "forbidden" CPU, in which + case a flag in the vperfctr object is set. perfctr_resume_thread() + checks this flag, and if set, marks the counters as stopped and + sends a SIGILL to the thread. + + The notion of forbidden CPUs is a workaround for a design flaw + in hyper-threaded Pentium 4s and Xeons. See low-level-x86.txt + for details. + +To reduce overheads, these hooks are implemented as inline functions +that check if the thread is using perfctrs before calling the code +that implements the behaviour. The hooks also reduce to no-ops if +CONFIG_PERFCTR_VIRTUAL is disabled. + +Synchronisation Rules +--------------------- +There are five types of accesses to a thread's perfctr state: + +1. Thread management events (see above) done by the thread itself. + Suspend, resume, and sample are lock-less. + +2. API operations done by the thread itself. + These are lock-less, except when an individual operation + has specific synchronisation needs. For instance, preemption + is often disabled to prevent accesses due to context switches. + +3. API operations done by a different thread ("monitor thread"). + The owner thread must be suspended for the duration of the operation. + This is ensured by requiring that the monitor thread is ptrace()ing + the owner thread, and that the owner thread is in TASK_STOPPED state. + +4. set_cpus_allowed(). + The kernel does not lock the target during set_cpus_allowed(), + so it can execute concurrently with the owner thread or with + some monitor thread. In particular, the state may be deallocated. + + To solve this problem, both perfctr_set_cpus_allowed() and the + operations that can change the owner thread's perfctr pointer + (creat, unlink, exit) perform a task_lock() on the owner thread + before accessing the perfctr pointer. + +5. release_task(). + Reaping a child may or may not be done by the parent of that child. + When done by the parent, no lock is taken. Otherwise, a task_lock() + on the parent is done before accessing its thread's perfctr pointer. + +The Pseudo File System +---------------------- +The perfctr state is accessed from user-space via a file descriptor. + +The main reason for this is to enable mmap() on the file descriptor, +which gives read-only access to the state. + +The file descriptor is a handle to the perfctr state object. This +allows a very simple implementation of the user-space 'perfex' +program, which runs another program with given perfctr settings +and reports their final values. Without this handle, monitoring +applications like perfex would have to be implemented like debuggers +in order to catch the target thread's exit and retrieve the counter +values before the exit completes and the state disappears. + +The file for a perfctr state object belongs to the vperfctrs pseudo +file system. Files in this file system support only a few operations: +- mmap() +- release() decrements the perfctr object's reference count and + deallocates the object when no references remain +- the listing of a thread's open file descriptors identifies + perfctr state file descriptors as belonging to "vperfctrfs" +The implementation is based on the code for pipefs. + +In previous versions of the perfctr package, the file descriptors +for perfctr state objects also supported the API's ioctl() method. 
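+
+As a preview of the API specified in the next section, a minimal
+self-monitoring sequence might look as follows. This is a sketch
+only: error handling is omitted, the sys_* names are used as in the
+rest of this document, and it assumes that 'struct vperfctr_control'
+embeds the low-level control data in a 'cpu_control' field:
+
+	struct vperfctr_control control;
+	struct perfctr_sum_ctrs sums;
+	int fd;
+
+	fd = sys_vperfctr_open(0, 1);		/* create state for self */
+	memset(&control, 0, sizeof control);
+	control.cpu_control.tsc_on = 1;		/* TSC only, no PMCs */
+	sys_vperfctr_control(fd, &control);	/* start counting */
+	/* ... code region to be measured ... */
+	sys_vperfctr_read(fd, &sums, NULL, NULL);	/* sample the sums */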
+
+API For User-Space
+==================
+
+Opening/Creating the State
+--------------------------
+int fd = sys_vperfctr_open(int tid, int creat);
+
+'tid' must be the id of a thread, or 0, which is interpreted as an
+alias for the current thread.
+
+This operation returns an open file descriptor which is a handle
+on the thread's perfctr state object.
+
+If 'creat' is non-zero and the object did not exist, then it is
+created and attached to the thread. The newly created state object
+is inactive, with all control fields disabled and all counters
+having the value zero. If 'creat' is non-zero and the object
+already existed, then an EEXIST error is signalled.
+
+If 'tid' does not denote the current thread, then it must denote a
+thread that is stopped and under ptrace control by the current thread.
+
+Notes:
+- The access rule in the non-self case is the same as for the
+  ptrace() system call. It ensures that no other thread, including
+  the target thread itself, can access or change the target thread's
+  perfctr state during the operation.
+- An open file descriptor for a perfctr state object counts as a
+  reference to that object; even if detached from its thread the
+  object will not be deallocated until the last reference is gone.
+- The file descriptor can be passed to mmap(), for low-overhead
+  counter sampling. See "READING THE COUNTER VALUES" for details.
+- The file descriptor can be passed to another thread. Accesses
+  from threads other than the owner are permitted as long as they
+  possess the file descriptor and use ptrace() for synchronisation.
+
+Updating the Control
+--------------------
+int err = sys_vperfctr_control(int fd, const struct vperfctr_control *control);
+
+'fd' must be the return value from a call to sys_vperfctr_open().
+The perfctr object must still be attached to its owner thread.
+
+This operation stops and samples any currently running counters in
+the thread, and then updates the control settings. If the resulting
+state has any enabled counters, then the counters are restarted.
+
+Before restarting, the counter sums are reset to zero. However,
+if a counter's bit is set in the control object's 'preserve'
+bitmask field, then that counter's sum is not reset. The TSC's
+sum is only reset if the TSC is disabled in the new state.
+
+If any of the programmable counters are enabled, then the thread's
+CPU affinity mask is adjusted to exclude the set of forbidden CPUs.
+
+If the control data activates any interrupt-mode counters, then
+a signal (specified by the 'si_signo' control field) will be sent
+to the owner thread after an overflow interrupt. The documentation
+for sys_vperfctr_iresume() describes this mechanism.
+
+If 'fd' does not denote the current thread, then it must denote a
+thread that is stopped and under ptrace control by the current thread.
+The perfctr state object denoted by 'fd' must still be attached
+to its owner thread.
+
+Notes:
+- It is strongly recommended to memset() the vperfctr_control object
+  to all-bits-zero before setting the fields of interest.
+- Stopping the counters is done by invoking the control operation
+  with a control object that activates neither the TSC nor any PMCs.
+
+Unlinking the State
+-------------------
+int err = sys_vperfctr_unlink(int fd);
+
+'fd' must be the return value from a call to sys_vperfctr_open().
+
+This operation stops and samples the thread's counters, and then
+detaches the perfctr state object from the thread. If the object
+had already been detached, then no action is performed.
+
+If 'fd' does not denote the current thread, then it must denote a
+thread that is stopped and under ptrace control by the current thread.
+
+Reading the State
+-----------------
+int err = sys_vperfctr_read(int fd, struct perfctr_sum_ctrs *sum,
+			    struct vperfctr_control *control,
+			    struct perfctr_sum_ctrs *children);
+
+'fd' must be the return value from a call to sys_vperfctr_open().
+
+This operation copies data from the perfctr state object to
+user-space. If 'sum' is non-NULL, then the counter sums are
+written to it. If 'control' is non-NULL, then the control data
+is written to it. If 'children' is non-NULL, then the sums of
+exited children's counters are written to it.
+
+If the perfctr state object is attached to the current thread,
+then the counters are sampled and updated first.
+
+If 'fd' does not denote the current thread, then it must denote a
+thread that is stopped and under ptrace control by the current thread.
+
+Notes:
+- An alternative and faster way to retrieve the counter sums is described
+  below. This system call can be used if the hardware does not permit
+  user-space reads of the counters.
+
+Resuming After Handling Overflow Signal
+---------------------------------------
+int err = sys_vperfctr_iresume(int fd);
+
+'fd' must be the return value from a call to sys_vperfctr_open().
+The perfctr object must still be attached to its owner thread.
+
+When an interrupt-mode counter has overflowed, the counters
+are sampled and suspended (TSC remains active). Then a signal,
+as specified by the 'si_signo' control field, is sent to the
+owner thread: the associated 'struct siginfo' has 'si_code'
+equal to 'SI_PMC_OVF', and 'si_pmc_ovf_mask' equal to the set
+of overflown counters.
+
+The counters are suspended to avoid generating new performance
+counter events during the execution of the signal handler, but
+the previous settings are saved. Calling sys_vperfctr_iresume()
+restores the previous settings and resumes the counters. Doing
+this is optional.
+
+If 'fd' does not denote the current thread, then it must denote a
+thread that is stopped and under ptrace control by the current thread.
+
+Reading the Counter Values
+--------------------------
+The value of a counter is computed from three components:
+
+	value = sum + (now - start);
+
+Two of these (sum and start) reside in the kernel's state object,
+and the third (now) is the contents of the hardware counter.
+To perform this computation in user-space requires access to
+the state object. This is achieved by passing the file descriptor
+from sys_vperfctr_open() to mmap():
+
+	volatile const struct vperfctr_state *kstate;
+	kstate = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, 0);
+
+Reading the three components is a non-atomic operation. If the
+thread is scheduled during the operation, the three values will
+not be consistent and the wrong result will be computed.
+To detect this situation, user-space should check the kernel
+state's TSC start value before and after the operation, and
+retry the operation in case of a mismatch.
+
+The algorithm for retrieving the value of counter 'i' is:
+
+	tsc0 = kstate->cpu_state.tsc_start;
+	for(;;) {
+		rdpmcl(kstate->cpu_state.pmc[i].map, now);
+		start = kstate->cpu_state.pmc[i].start;
+		sum = kstate->cpu_state.pmc[i].sum;
+		tsc1 = kstate->cpu_state.tsc_start;
+		if (likely(tsc1 == tsc0))
+			break;
+		tsc0 = tsc1;
+	}
+	return sum + (now - start);
+
+The algorithm for retrieving the value of the TSC is similar,
+as is the algorithm for retrieving the values of all counters.
+
+Notes:
+- Since the state's TSC time-stamps are used, the algorithm requires
+  that user-space enables TSC sampling.
+- The algorithm requires that the hardware allows user-space reads
+  of the counter registers. If this property isn't statically known
+  for the architecture, user-space should retrieve the kernel's
+  'struct perfctr_info' object and check that the PERFCTR_FEATURE_RDPMC
+  flag is set.
+
+Limitations / TODO List
+=======================
+- Buffering of overflow samples is not implemented. So far, not a
+  single user has requested it.
diff -puN kernel/exit.c~perfctr kernel/exit.c
--- devel/kernel/exit.c~perfctr	2005-07-08 23:11:41.000000000 -0700
+++ devel-akpm/kernel/exit.c	2005-07-08 23:11:41.000000000 -0700
@@ -26,6 +26,7 @@
 #include
 #include
 #include
+#include <linux/perfctr.h>
 #include
 #include
 #include
@@ -101,6 +102,7 @@ repeat:
 		zap_leader = (leader->exit_signal == -1);
 	}
 
+	perfctr_release_task(p);
 	sched_exit(p);
 	write_unlock_irq(&tasklist_lock);
 	spin_unlock(&p->proc_lock);
diff -puN arch/ppc64/Kconfig~perfctr arch/ppc64/Kconfig
--- devel/arch/ppc64/Kconfig~perfctr	2005-07-08 23:11:41.000000000 -0700
+++ devel-akpm/arch/ppc64/Kconfig	2005-07-08 23:11:41.000000000 -0700
@@ -306,6 +306,7 @@ config MSCHUNKS
 	depends on PPC_ISERIES
 	default y
 
+source "drivers/perfctr/Kconfig"
 
 config PPC_RTAS
 	bool
diff -puN arch/ppc64/kernel/misc.S~perfctr arch/ppc64/kernel/misc.S
--- devel/arch/ppc64/kernel/misc.S~perfctr	2005-07-08 23:11:41.000000000 -0700
+++ devel-akpm/arch/ppc64/kernel/misc.S	2005-07-08 23:11:41.000000000 -0700
@@ -1411,3 +1411,7 @@ _GLOBAL(sys_call_table)
 	.llong .sys_ioprio_get
 	.llong .sys_pselect6		/* 275 */
 	.llong .sys_ppoll
+	.llong .sys_vperfctr_open
+	.llong .sys_vperfctr_control
+	.llong .sys_vperfctr_write
+	.llong .sys_vperfctr_read	/* 280 */
diff -puN arch/ppc64/kernel/process.c~perfctr arch/ppc64/kernel/process.c
--- devel/arch/ppc64/kernel/process.c~perfctr	2005-07-08 23:11:41.000000000 -0700
+++ devel-akpm/arch/ppc64/kernel/process.c	2005-07-08 23:11:41.000000000 -0700
@@ -37,6 +37,7 @@
 #include
 #include
 #include
+#include <linux/perfctr.h>
 #include
 #include
@@ -218,7 +219,9 @@ struct task_struct *__switch_to(struct t
 
 	local_irq_save(flags);
+	perfctr_suspend_thread(&prev->thread);
 	last = _switch(old_thread, new_thread);
+	perfctr_resume_thread(&current->thread);
 	local_irq_restore(flags);
 
@@ -318,6 +321,7 @@ void exit_thread(void)
 	last_task_used_altivec = NULL;
 #endif /* CONFIG_ALTIVEC */
 #endif /* CONFIG_SMP */
+	perfctr_exit_thread(&current->thread);
 }
 
 void flush_thread(void)
@@ -418,6 +422,8 @@ copy_thread(int nr, unsigned long clone_
 	 */
 	kregs->nip = *((unsigned long *)ret_from_fork);
 
+	perfctr_copy_task(p, regs);
+
 	return 0;
 }
diff -puN include/asm-ppc64/processor.h~perfctr include/asm-ppc64/processor.h
--- devel/include/asm-ppc64/processor.h~perfctr	2005-07-08 23:11:41.000000000 -0700
+++ devel-akpm/include/asm-ppc64/processor.h	2005-07-08 23:11:41.000000000 -0700
@@ -426,6 +426,8 @@ struct thread_struct {
 	unsigned long	vrsave;
 	int		used_vr;	/* set if process has used altivec */
 #endif /* CONFIG_ALTIVEC */
+	/* performance counters */
+	struct vperfctr *perfctr;
 };
 
 #define ARCH_MIN_TASKALIGN	16
diff -puN include/asm-ppc64/unistd.h~perfctr include/asm-ppc64/unistd.h
--- devel/include/asm-ppc64/unistd.h~perfctr	2005-07-08 23:11:41.000000000 -0700
+++ devel-akpm/include/asm-ppc64/unistd.h	2005-07-08 23:11:41.000000000 -0700
@@ -287,8 +287,12 @@
 #define __NR_ioprio_get		274
 #define __NR_pselect6		275
 #define __NR_ppoll		276
+#define __NR_vperfctr_open	277
+#define __NR_vperfctr_control	(__NR_vperfctr_open+1)
+#define __NR_vperfctr_write	(__NR_vperfctr_open+2)
+#define __NR_vperfctr_read	(__NR_vperfctr_open+3)
 
-#define __NR_syscalls		277
+#define __NR_syscalls		281
 #ifdef __KERNEL__
 #define NR_syscalls	__NR_syscalls
 #endif
diff -puN /dev/null drivers/perfctr/ppc64.c
--- /dev/null	2003-09-15 06:40:47.000000000 -0700
+++ devel-akpm/drivers/perfctr/ppc64.c	2005-07-08 23:11:41.000000000 -0700
@@ -0,0 +1,749 @@
+/*
+ * PPC64 performance-monitoring counters driver.
+ *
+ * based on Mikael Pettersson's 32 bit ppc code
+ * Copyright (C) 2004 David Gibson, IBM Corporation.
+ * Copyright (C) 2004 Mikael Pettersson
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include <asm/time.h>	/* tb_ticks_per_jiffy */
+#include
+#include
+
+#include "ppc64_tests.h"
+
+extern void ppc64_enable_pmcs(void);
+
+/* Support for lazy perfctr SPR updates. */
+struct per_cpu_cache {	/* roughly a subset of perfctr_cpu_state */
+	unsigned int id;	/* cache owner id */
+	/* Physically indexed cache of the MMCRs. */
+	unsigned long ppc64_mmcr0, ppc64_mmcr1, ppc64_mmcra;
+};
+static DEFINE_PER_CPU(struct per_cpu_cache, per_cpu_cache);
+#define __get_cpu_cache(cpu) (&per_cpu(per_cpu_cache, cpu))
+#define get_cpu_cache() (&__get_cpu_var(per_cpu_cache))
+
+/* Structure for counter snapshots, as 32-bit values.
*/ +struct perfctr_low_ctrs { + u64 tsc; + u32 pmc[8]; +}; + +static unsigned int new_id(void) +{ + static DEFINE_SPINLOCK(lock); + static unsigned int counter; + int id; + + spin_lock(&lock); + id = ++counter; + spin_unlock(&lock); + return id; +} + +static inline u32 read_pmc(int pmc) +{ + switch (pmc) { + case 0: + return mfspr(SPRN_PMC1); + break; + case 1: + return mfspr(SPRN_PMC2); + break; + case 2: + return mfspr(SPRN_PMC3); + break; + case 3: + return mfspr(SPRN_PMC4); + break; + case 4: + return mfspr(SPRN_PMC5); + break; + case 5: + return mfspr(SPRN_PMC6); + break; + case 6: + return mfspr(SPRN_PMC7); + break; + case 7: + return mfspr(SPRN_PMC8); + break; + + default: + return -EINVAL; + } +} + +static inline void write_pmc(int pmc, u32 val) +{ + switch (pmc) { + case 0: + mtspr(SPRN_PMC1, val); + break; + case 1: + mtspr(SPRN_PMC2, val); + break; + case 2: + mtspr(SPRN_PMC3, val); + break; + case 3: + mtspr(SPRN_PMC4, val); + break; + case 4: + mtspr(SPRN_PMC5, val); + break; + case 5: + mtspr(SPRN_PMC6, val); + break; + case 6: + mtspr(SPRN_PMC7, val); + break; + case 7: + mtspr(SPRN_PMC8, val); + break; + } +} + +#ifdef CONFIG_PERFCTR_INTERRUPT_SUPPORT +static void perfctr_default_ihandler(unsigned long pc) +{ + unsigned int mmcr0 = mfspr(SPRN_MMCR0); + + mmcr0 &= ~MMCR0_PMXE; + mtspr(SPRN_MMCR0, mmcr0); +} + +static perfctr_ihandler_t perfctr_ihandler = perfctr_default_ihandler; + +void do_perfctr_interrupt(struct pt_regs *regs) +{ + unsigned long mmcr0; + + /* interrupts are disabled here, so we don't need to + * preempt_disable() */ + + (*perfctr_ihandler)(instruction_pointer(regs)); + + /* clear PMAO so the interrupt doesn't reassert immediately */ + mmcr0 = mfspr(SPRN_MMCR0) & ~MMCR0_PMAO; + mtspr(SPRN_MMCR0, mmcr0); +} + +void perfctr_cpu_set_ihandler(perfctr_ihandler_t ihandler) +{ + perfctr_ihandler = ihandler ? ihandler : perfctr_default_ihandler; +} + +#else +#define perfctr_cstatus_has_ictrs(cstatus) 0 +#endif + + +#if defined(CONFIG_SMP) && defined(CONFIG_PERFCTR_INTERRUPT_SUPPORT) + +static inline void +set_isuspend_cpu(struct perfctr_cpu_state *state, int cpu) +{ + state->isuspend_cpu = cpu; +} + +static inline int +is_isuspend_cpu(const struct perfctr_cpu_state *state, int cpu) +{ + return state->isuspend_cpu == cpu; +} + +static inline void clear_isuspend_cpu(struct perfctr_cpu_state *state) +{ + state->isuspend_cpu = NR_CPUS; +} + +#else +static inline void set_isuspend_cpu(struct perfctr_cpu_state *state, int cpu) { } +static inline int is_isuspend_cpu(const struct perfctr_cpu_state *state, int cpu) { return 1; } +static inline void clear_isuspend_cpu(struct perfctr_cpu_state *state) { } +#endif + + +static void ppc64_clear_counters(void) +{ + mtspr(SPRN_MMCR0, 0); + mtspr(SPRN_MMCR1, 0); + mtspr(SPRN_MMCRA, 0); + + mtspr(SPRN_PMC1, 0); + mtspr(SPRN_PMC2, 0); + mtspr(SPRN_PMC3, 0); + mtspr(SPRN_PMC4, 0); + mtspr(SPRN_PMC5, 0); + mtspr(SPRN_PMC6, 0); + + if (cpu_has_feature(CPU_FTR_PMC8)) { + mtspr(SPRN_PMC7, 0); + mtspr(SPRN_PMC8, 0); + } +} + +/* + * Driver methods, internal and exported. + */ + +static void perfctr_cpu_write_control(const struct perfctr_cpu_state *state) +{ + struct per_cpu_cache *cache; + unsigned long long value; + + cache = get_cpu_cache(); + /* + * Order matters here: update threshmult and event + * selectors before updating global control, which + * potentially enables PMIs. + * + * Since mtspr doesn't accept a runtime value for the + * SPR number, unroll the loop so each mtspr targets + * a constant SPR. 
+	 *
+	 * For processors without MMCR2, we ensure that the
+	 * cache and the state indicate the same value for it,
+	 * preventing any actual mtspr to it. Ditto for MMCR1.
+	 */
+	value = state->control.mmcra;
+	if (value != cache->ppc64_mmcra) {
+		cache->ppc64_mmcra = value;
+		mtspr(SPRN_MMCRA, value);
+	}
+	value = state->control.mmcr1;
+	if (value != cache->ppc64_mmcr1) {
+		cache->ppc64_mmcr1 = value;
+		mtspr(SPRN_MMCR1, value);
+	}
+	value = state->control.mmcr0;
+	if (perfctr_cstatus_has_ictrs(state->user.cstatus))
+		value |= MMCR0_PMXE;
+	if (value != cache->ppc64_mmcr0) {
+		cache->ppc64_mmcr0 = value;
+		mtspr(SPRN_MMCR0, value);
+	}
+	cache->id = state->id;
+}
+
+static void perfctr_cpu_read_counters(struct perfctr_cpu_state *state,
+				      struct perfctr_low_ctrs *ctrs)
+{
+	unsigned int cstatus, i, pmc;
+
+	cstatus = state->user.cstatus;
+	if (perfctr_cstatus_has_tsc(cstatus))
+		ctrs->tsc = mftb();
+
+	for (i = 0; i < perfctr_cstatus_nractrs(cstatus); ++i) {
+		pmc = state->control.pmc_map[i];
+		ctrs->pmc[i] = read_pmc(pmc);
+	}
+}
+
+#ifdef CONFIG_PERFCTR_INTERRUPT_SUPPORT
+static void perfctr_cpu_isuspend(struct perfctr_cpu_state *state)
+{
+	unsigned int cstatus, nrctrs, i;
+	int cpu;
+
+	cpu = smp_processor_id();
+	set_isuspend_cpu(state, cpu); /* early to limit cpu's live range */
+	cstatus = state->user.cstatus;
+	nrctrs = perfctr_cstatus_nrctrs(cstatus);
+	for (i = perfctr_cstatus_nractrs(cstatus); i < nrctrs; ++i) {
+		int pmc = state->control.pmc_map[i];
+		u32 now = read_pmc(pmc);
+
+		state->user.pmc[i].sum += (u32)(now-state->user.pmc[i].start);
+		state->user.pmc[i].start = now;
+	}
+}
+
+static void perfctr_cpu_iresume(const struct perfctr_cpu_state *state)
+{
+	struct per_cpu_cache *cache;
+	unsigned int cstatus, nrctrs, i;
+	int cpu;
+
+	cpu = smp_processor_id();
+	cache = __get_cpu_cache(cpu);
+	if (cache->id == state->id) {
+		/* Clearing cache->id to force write_control()
+		   to unfreeze MMCR0 would be done here, but it
+		   is subsumed by resume()'s MMCR0 reload logic. */
+		if (is_isuspend_cpu(state, cpu)) {
+			return; /* skip reload of PMCs */
+		}
+	}
+	/*
+	 * The CPU state wasn't ours.
+	 *
+	 * The counters must be frozen before being reinitialised,
+	 * to prevent unexpected increments and missed overflows.
+	 *
+	 * All unused counters must be reset to a non-overflow state.
+	 */
+	if (!(cache->ppc64_mmcr0 & MMCR0_FC)) {
+		cache->ppc64_mmcr0 |= MMCR0_FC;
+		mtspr(SPRN_MMCR0, cache->ppc64_mmcr0);
+	}
+	cstatus = state->user.cstatus;
+	nrctrs = perfctr_cstatus_nrctrs(cstatus);
+	for (i = perfctr_cstatus_nractrs(cstatus); i < nrctrs; ++i) {
+		write_pmc(state->control.pmc_map[i], state->user.pmc[i].start);
+	}
+}
+
+/* Call perfctr_cpu_ireload() just before perfctr_cpu_resume() to
+   bypass internal caching and force a reload of the i-mode PMCs. */
+void perfctr_cpu_ireload(struct perfctr_cpu_state *state)
+{
+#ifdef CONFIG_SMP
+	clear_isuspend_cpu(state);
+#else
+	get_cpu_cache()->id = 0;
+#endif
+}
+
+/* PRE: the counters have been suspended and sampled by perfctr_cpu_suspend() */
+unsigned int perfctr_cpu_identify_overflow(struct perfctr_cpu_state *state)
+{
+	unsigned int cstatus, nractrs, nrctrs, i;
+	unsigned int pmc_mask = 0;
+	int nr_pmcs = 6;
+
+	if (cpu_has_feature(CPU_FTR_PMC8))
+		nr_pmcs = 8;
+
+	cstatus = state->user.cstatus;
+	nractrs = perfctr_cstatus_nractrs(cstatus);
+	nrctrs = perfctr_cstatus_nrctrs(cstatus);
+
+	/* Ickity, ickity, ick. We don't have fine enough interrupt
+	 * control to disable interrupts on all the counters we're not
+	 * interested in.
	 * So, we have to deal with overflows on actrs
+	 * and unused PMCs as well as the ones we actually care
+	 * about. */
+	for (i = 0; i < nractrs; ++i) {
+		int pmc = state->control.pmc_map[i];
+		u32 val = read_pmc(pmc);
+
+		/* For actrs, force a sample if they overflowed */
+
+		if ((s32)val < 0) {
+			state->user.pmc[i].sum += (u32)(val - state->user.pmc[i].start);
+			state->user.pmc[i].start = 0;
+			write_pmc(pmc, 0);
+		}
+	}
+	for (; i < nrctrs; ++i) {
+		if ((s32)state->user.pmc[i].start < 0) { /* PPC64-specific */
+			int pmc = state->control.pmc_map[i];
+			/* XXX: "+=" to correct for overshots */
+			state->user.pmc[i].start = state->control.ireset[pmc];
+			pmc_mask |= (1 << i);
+		}
+	}
+
+	/* Clear any unused overflowed counters, so we don't loop on
+	 * the interrupt */
+	for (i = 0; i < nr_pmcs; ++i) {
+		if (! (state->unused_pmcs & (1<<i)))
+			continue;
+
+		if ((s32)read_pmc(i) < 0)
+			write_pmc(i, 0);
+	}
+
+	return pmc_mask;
+}
+
+static int check_ireset(struct perfctr_cpu_state *state)
+{
+	unsigned int nrctrs, i;
+
+	i = state->control.header.nractrs;
+	nrctrs = i + state->control.header.nrictrs;
+	for(; i < nrctrs; ++i) {
+		unsigned int pmc = state->control.pmc_map[i];
+		if ((int)state->control.ireset[pmc] < 0) /* PPC64-specific */
+			return -EINVAL;
+		state->user.pmc[i].start = state->control.ireset[pmc];
+	}
+	return 0;
+}
+
+#else	/* CONFIG_PERFCTR_INTERRUPT_SUPPORT */
+static inline void perfctr_cpu_isuspend(struct perfctr_cpu_state *state) { }
+static inline void perfctr_cpu_iresume(const struct perfctr_cpu_state *state) { }
+static inline int check_ireset(struct perfctr_cpu_state *state) { return 0; }
+#endif	/* CONFIG_PERFCTR_INTERRUPT_SUPPORT */
+
+static int check_control(struct perfctr_cpu_state *state)
+{
+	unsigned int i, nractrs, nrctrs, pmc_mask, pmc;
+	unsigned int nr_pmcs = 6;
+
+	if (cpu_has_feature(CPU_FTR_PMC8))
+		nr_pmcs = 8;
+
+	nractrs = state->control.header.nractrs;
+	nrctrs = nractrs + state->control.header.nrictrs;
+	if (nrctrs < nractrs || nrctrs > nr_pmcs)
+		return -EINVAL;
+
+	pmc_mask = 0;
+	for (i = 0; i < nrctrs; ++i) {
+		pmc = state->control.pmc_map[i];
+		if (pmc >= nr_pmcs || (pmc_mask & (1<<pmc)))
+			return -EINVAL;
+		pmc_mask |= (1<<pmc);
+	}
+
+	if ( (state->control.mmcr0 & MMCR0_PMXE)
+	     || (state->control.mmcr0 & MMCR0_PMAO)
+	     || (state->control.mmcr0 & MMCR0_TBEE) )
+		return -EINVAL;
+
+	state->unused_pmcs = ((1 << nr_pmcs)-1) & ~pmc_mask;
+
+	state->id = new_id();
+
+	return 0;
+}
+
+int perfctr_cpu_update_control(struct perfctr_cpu_state *state, int is_global)
+{
+	int err;
+
+	clear_isuspend_cpu(state);
+	state->user.cstatus = 0;
+
+	/* disallow i-mode counters if we cannot catch the interrupts */
+	if (!(perfctr_info.cpu_features & PERFCTR_FEATURE_PCINT)
+	    && state->control.header.nrictrs)
+		return -EPERM;
+
+	err = check_control(state); /* may initialise state->cstatus */
+	if (err < 0)
+		return err;
+	err = check_ireset(state);
+	if (err < 0)
+		return err;
+	state->user.cstatus |= perfctr_mk_cstatus(state->control.header.tsc_on,
+						  state->control.header.nractrs,
+						  state->control.header.nrictrs);
+	return 0;
+}
+
+/*
+ * get_reg_offset() maps SPR numbers to offsets into struct perfctr_cpu_control.
+ */ +static const struct { + unsigned int spr; + unsigned int offset; + unsigned int size; +} reg_offsets[] = { + { SPRN_MMCR0, offsetof(struct perfctr_cpu_control, mmcr0), sizeof(long) }, + { SPRN_MMCR1, offsetof(struct perfctr_cpu_control, mmcr1), sizeof(long) }, + { SPRN_MMCRA, offsetof(struct perfctr_cpu_control, mmcra), sizeof(long) }, + { SPRN_PMC1, offsetof(struct perfctr_cpu_control, ireset[1-1]), sizeof(int) }, + { SPRN_PMC2, offsetof(struct perfctr_cpu_control, ireset[2-1]), sizeof(int) }, + { SPRN_PMC3, offsetof(struct perfctr_cpu_control, ireset[3-1]), sizeof(int) }, + { SPRN_PMC4, offsetof(struct perfctr_cpu_control, ireset[4-1]), sizeof(int) }, + { SPRN_PMC5, offsetof(struct perfctr_cpu_control, ireset[5-1]), sizeof(int) }, + { SPRN_PMC6, offsetof(struct perfctr_cpu_control, ireset[6-1]), sizeof(int) }, + { SPRN_PMC7, offsetof(struct perfctr_cpu_control, ireset[7-1]), sizeof(int) }, + { SPRN_PMC8, offsetof(struct perfctr_cpu_control, ireset[8-1]), sizeof(int) }, +}; + +static int get_reg_offset(unsigned int spr, unsigned int *size) +{ + unsigned int i; + + for(i = 0; i < ARRAY_SIZE(reg_offsets); ++i) + if (spr == reg_offsets[i].spr) { + *size = reg_offsets[i].size; + return reg_offsets[i].offset; + } + return -1; +} + +static int access_regs(struct perfctr_cpu_control *control, + void *argp, unsigned int argbytes, int do_write) +{ + struct perfctr_cpu_reg *regs; + unsigned int i, nr_regs, size; + int offset; + + nr_regs = argbytes / sizeof(struct perfctr_cpu_reg); + if (nr_regs * sizeof(struct perfctr_cpu_reg) != argbytes) + return -EINVAL; + regs = (struct perfctr_cpu_reg*)argp; + + for(i = 0; i < nr_regs; ++i) { + offset = get_reg_offset(regs[i].nr, &size); + if (offset < 0) + return -EINVAL; + if (size == sizeof(long)) { + unsigned long *where = (unsigned long*)((char*)control + offset); + if (do_write) + *where = regs[i].value; + else + regs[i].value = *where; + } else { + unsigned int *where = (unsigned int*)((char*)control + offset); + if (do_write) + *where = regs[i].value; + else + regs[i].value = *where; + } + } + return argbytes; +} + +int perfctr_cpu_control_write(struct perfctr_cpu_control *control, unsigned int domain, + const void *srcp, unsigned int srcbytes) +{ + if (domain != PERFCTR_DOMAIN_CPU_REGS) + return -EINVAL; + return access_regs(control, (void*)srcp, srcbytes, 1); +} + +int perfctr_cpu_control_read(const struct perfctr_cpu_control *control, unsigned int domain, + void *dstp, unsigned int dstbytes) +{ + if (domain != PERFCTR_DOMAIN_CPU_REGS) + return -EINVAL; + return access_regs((struct perfctr_cpu_control*)control, dstp, dstbytes, 0); +} + +void perfctr_cpu_suspend(struct perfctr_cpu_state *state) +{ + unsigned int i, cstatus; + struct perfctr_low_ctrs now; + + write_perfseq_begin(&state->user.sequence); + + /* quiesce the counters */ + mtspr(SPRN_MMCR0, MMCR0_FC); + get_cpu_cache()->ppc64_mmcr0 = MMCR0_FC; + + if (perfctr_cstatus_has_ictrs(state->user.cstatus)) + perfctr_cpu_isuspend(state); + + perfctr_cpu_read_counters(state, &now); + cstatus = state->user.cstatus; + if (perfctr_cstatus_has_tsc(cstatus)) + state->user.tsc_sum += now.tsc - state->user.tsc_start; + + for (i = 0; i < perfctr_cstatus_nractrs(cstatus); ++i) + state->user.pmc[i].sum += (u32)(now.pmc[i]-state->user.pmc[i].start); + + write_perfseq_end(&state->user.sequence); +} + +void perfctr_cpu_resume(struct perfctr_cpu_state *state) +{ + struct perfctr_low_ctrs now; + unsigned int i, cstatus; + + write_perfseq_begin(&state->user.sequence); + if 
(perfctr_cstatus_has_ictrs(state->user.cstatus)) + perfctr_cpu_iresume(state); + perfctr_cpu_write_control(state); + + perfctr_cpu_read_counters(state, &now); + cstatus = state->user.cstatus; + if (perfctr_cstatus_has_tsc(cstatus)) + state->user.tsc_start = now.tsc; + + for (i = 0; i < perfctr_cstatus_nractrs(cstatus); ++i) + state->user.pmc[i].start = now.pmc[i]; + + write_perfseq_end(&state->user.sequence); +} + +void perfctr_cpu_sample(struct perfctr_cpu_state *state) +{ + unsigned int i, cstatus, nractrs; + struct perfctr_low_ctrs now; + + write_perfseq_begin(&state->user.sequence); + perfctr_cpu_read_counters(state, &now); + cstatus = state->user.cstatus; + if (perfctr_cstatus_has_tsc(cstatus)) { + state->user.tsc_sum += now.tsc - state->user.tsc_start; + state->user.tsc_start = now.tsc; + } + nractrs = perfctr_cstatus_nractrs(cstatus); + for(i = 0; i < nractrs; ++i) { + state->user.pmc[i].sum += (u32)(now.pmc[i]-state->user.pmc[i].start); + state->user.pmc[i].start = now.pmc[i]; + } + write_perfseq_end(&state->user.sequence); +} + +static void perfctr_cpu_clear_counters(void) +{ + struct per_cpu_cache *cache; + + cache = get_cpu_cache(); + memset(cache, 0, sizeof *cache); + cache->id = 0; + + ppc64_clear_counters(); +} + +/**************************************************************** + * * + * Processor detection and initialisation procedures. * + * * + ****************************************************************/ + +static void ppc64_cpu_setup(void) +{ + /* allow user to initialize these???? */ + + unsigned long long mmcr0 = mfspr(SPRN_MMCR0); + unsigned long long mmcra = mfspr(SPRN_MMCRA); + + + ppc64_enable_pmcs(); + + mmcr0 |= MMCR0_FC; + mtspr(SPRN_MMCR0, mmcr0); + + mmcr0 |= MMCR0_FCM1|MMCR0_PMXE|MMCR0_FCECE; + mmcr0 |= MMCR0_PMC1CE|MMCR0_PMCjCE; + mtspr(SPRN_MMCR0, mmcr0); + + mmcra |= MMCRA_SAMPLE_ENABLE; + mtspr(SPRN_MMCRA, mmcra); + + printk("setup on cpu %d, mmcr0 %lx\n", smp_processor_id(), + mfspr(SPRN_MMCR0)); + printk("setup on cpu %d, mmcr1 %lx\n", smp_processor_id(), + mfspr(SPRN_MMCR1)); + printk("setup on cpu %d, mmcra %lx\n", smp_processor_id(), + mfspr(SPRN_MMCRA)); + +/* mtmsrd(mfmsr() | MSR_PMM); */ + + ppc64_clear_counters(); + + mmcr0 = mfspr(SPRN_MMCR0); + mmcr0 &= ~MMCR0_PMAO; + mmcr0 &= ~MMCR0_FC; + mtspr(SPRN_MMCR0, mmcr0); + + printk("start on cpu %d, mmcr0 %llx\n", smp_processor_id(), mmcr0); +} + + +static void perfctr_cpu_clear_one(void *ignore) +{ + /* PREEMPT note: when called via on_each_cpu(), + this is in IRQ context with preemption disabled. */ + perfctr_cpu_clear_counters(); +} + +static void perfctr_cpu_reset(void) +{ + on_each_cpu(perfctr_cpu_clear_one, NULL, 1, 1); + perfctr_cpu_set_ihandler(NULL); +} + +int __init perfctr_cpu_init(void) +{ + extern unsigned long ppc_proc_freq; + extern unsigned long ppc_tb_freq; + + perfctr_info.cpu_features = PERFCTR_FEATURE_RDTSC + | PERFCTR_FEATURE_RDPMC | PERFCTR_FEATURE_PCINT; + + perfctr_cpu_name = "PowerPC64"; + + perfctr_info.cpu_khz = ppc_proc_freq / 1000; + /* We need to round here rather than truncating, because in a + * few cases the raw ratio can end up being 7.9999 or + * suchlike */ + perfctr_info.tsc_to_cpu_mult = + (ppc_proc_freq + ppc_tb_freq - 1) / ppc_tb_freq; + + on_each_cpu((void *)ppc64_cpu_setup, NULL, 0, 1); + + perfctr_ppc64_init_tests(); + + perfctr_cpu_reset(); + return 0; +} + +void __exit perfctr_cpu_exit(void) +{ + perfctr_cpu_reset(); +} + +/**************************************************************** + * * + * Hardware reservation. 
+
+static spinlock_t service_lock = SPIN_LOCK_UNLOCKED;
+static const char *current_service = NULL;
+
+const char *perfctr_cpu_reserve(const char *service)
+{
+	const char *ret;
+
+	spin_lock(&service_lock);
+
+	ret = current_service;
+	if (ret)
+		goto out;
+
+	ret = "unknown driver (oprofile?)";
+	if (reserve_pmc_hardware(do_perfctr_interrupt) != 0)
+		goto out;
+
+	current_service = service;
+	ret = NULL;
+
+ out:
+	spin_unlock(&service_lock);
+	return ret;
+}
+
+void perfctr_cpu_release(const char *service)
+{
+	spin_lock(&service_lock);
+
+	if (service != current_service) {
+		printk(KERN_ERR "%s: attempt by %s to release while reserved by %s\n",
+		       __FUNCTION__, service, current_service);
+		goto out;
+	}
+
+	/* power down the counters */
+	perfctr_cpu_reset();
+	current_service = NULL;
+	release_pmc_hardware();
+
+ out:
+	spin_unlock(&service_lock);
+}
diff -puN /dev/null drivers/perfctr/ppc64_tests.c
--- /dev/null	2003-09-15 06:40:47.000000000 -0700
+++ devel-akpm/drivers/perfctr/ppc64_tests.c	2005-07-08 23:11:41.000000000 -0700
@@ -0,0 +1,322 @@
+/*
+ * Performance-monitoring counters driver.
+ * Optional PPC64-specific init-time tests.
+ *
+ * Copyright (C) 2004 David Gibson, IBM Corporation.
+ * Copyright (C) 2004 Mikael Pettersson
+ */
+#include <linux/config.h>
+#include <linux/init.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/perfctr.h>
+#include <asm/processor.h>
+#include <asm/time.h>	/* for tb_ticks_per_jiffy */
+#include "ppc64_tests.h"
+
+#define NITER	256
+#define X2(S)	S"; "S			/* repeat asm template twice */
+#define X8(S)	X2(X2(X2(S)))		/* repeat asm template 8 times */
+
+static void __init do_read_tbl(unsigned int unused)
+{
+	unsigned int i, dummy;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mftbl %0") : "=r"(dummy));
+}
+
+static void __init do_read_pmc1(unsigned int unused)
+{
+	unsigned int i, dummy;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mfspr %0," __stringify(SPRN_PMC1)) : "=r"(dummy));
+}
+
+static void __init do_read_pmc2(unsigned int unused)
+{
+	unsigned int i, dummy;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mfspr %0," __stringify(SPRN_PMC2)) : "=r"(dummy));
+}
+
+static void __init do_read_pmc3(unsigned int unused)
+{
+	unsigned int i, dummy;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mfspr %0," __stringify(SPRN_PMC3)) : "=r"(dummy));
+}
+
+static void __init do_read_pmc4(unsigned int unused)
+{
+	unsigned int i, dummy;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mfspr %0," __stringify(SPRN_PMC4)) : "=r"(dummy));
+}
+
+static void __init do_read_mmcr0(unsigned int unused)
+{
+	unsigned int i, dummy;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mfspr %0," __stringify(SPRN_MMCR0)) : "=r"(dummy));
+}
+
+static void __init do_read_mmcr1(unsigned int unused)
+{
+	unsigned int i, dummy;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mfspr %0," __stringify(SPRN_MMCR1)) : "=r"(dummy));
+}
+
+static void __init do_write_pmc2(unsigned int arg)
+{
+	unsigned int i;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mtspr " __stringify(SPRN_PMC2) ",%0") : : "r"(arg));
+}
+
+static void __init do_write_pmc3(unsigned int arg)
+{
+	unsigned int i;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mtspr " __stringify(SPRN_PMC3) ",%0") : : "r"(arg));
+}
+
+static void __init do_write_pmc4(unsigned int arg)
+{
+	unsigned int i;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mtspr " __stringify(SPRN_PMC4) ",%0") : : "r"(arg));
+}
+
+static void __init do_write_mmcr1(unsigned int arg)
+{
+	unsigned int i;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mtspr " __stringify(SPRN_MMCR1) ",%0") : : "r"(arg));
+}
+
+static void __init do_write_mmcr0(unsigned int arg)
+{
+	unsigned int i;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__(X8("mtspr " __stringify(SPRN_MMCR0) ",%0") : : "r"(arg));
+}
+
+static void __init do_empty_loop(unsigned int unused)
+{
+	unsigned int i;
+	for(i = 0; i < NITER/8; ++i)
+		__asm__ __volatile__("" : : );
+}
+
+static unsigned __init run(void (*doit)(unsigned int), unsigned int arg)
+{
+	unsigned int start, stop;
+	start = mfspr(SPRN_PMC1);
+	(*doit)(arg);	/* should take < 2^32 cycles to complete */
+	stop = mfspr(SPRN_PMC1);
+	return stop - start;
+}
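+
+/*
+ * Illustrative note (not part of this patch): each do_*() helper above
+ * performs NITER operations (NITER/8 iterations of an 8-way unrolled
+ * asm block), and run() brackets a helper with two reads of PMC1.
+ * With PMC1 counting processor cycles, the per-operation cost in
+ * tenths of a cycle is therefore
+ *
+ *	((run(do_op, arg) - run(do_empty_loop, 0)) * 10) / NITER
+ *
+ * which is exactly the figure measure_overheads() below prints as
+ * "cost is X.Y cycles".
+ */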
+
+static void __init init_tests_message(void)
+{
+#if 0	/* XXX: disabled; 'pvr' is not defined in this file */
+	printk(KERN_INFO "Please email the following PERFCTR INIT lines "
+	       "to mikpe@csd.uu.se\n"
+	       KERN_INFO "To remove this message, rebuild the driver "
+	       "with CONFIG_PERFCTR_INIT_TESTS=n\n");
+	printk(KERN_INFO "PERFCTR INIT: PVR 0x%08x, CPU clock %u kHz, TB clock %lu kHz\n",
+	       pvr,
+	       perfctr_info.cpu_khz,
+	       tb_ticks_per_jiffy*(HZ/10)/(1000/10));
+#endif
+}
+
+static void __init clear(void)
+{
+	mtspr(SPRN_MMCR0, 0);
+	mtspr(SPRN_MMCR1, 0);
+	mtspr(SPRN_MMCRA, 0);
+	mtspr(SPRN_PMC1, 0);
+	mtspr(SPRN_PMC2, 0);
+	mtspr(SPRN_PMC3, 0);
+	mtspr(SPRN_PMC4, 0);
+	mtspr(SPRN_PMC5, 0);
+	mtspr(SPRN_PMC6, 0);
+	mtspr(SPRN_PMC7, 0);
+	mtspr(SPRN_PMC8, 0);
+}
+
+static void __init check_fcece(unsigned int pmc1ce)
+{
+	unsigned int mmcr0;
+	int x = 0;
+
+	/* TODO (JHE): check section 1.6.6.2 of the POWER5 pdf */
+
+	/*
+	 * This test checks if MMCR0[FC] is set after PMC1 overflows
+	 * when MMCR0[FCECE] is set.
+	 * 74xx documentation states this behaviour, while documentation
+	 * for 604/750 processors doesn't mention this at all.
+	 *
+	 * Also output the value of PMC1 shortly after the overflow.
+	 * This tells us if PMC1 really was frozen. On 604/750, it may not
+	 * freeze since we don't enable PMIs. [No freeze confirmed on 750.]
+	 *
+	 * When pmc1ce == 0, MMCR0[PMC1CE] is zero. It's unclear whether
+	 * this masks all PMC1 overflow events or just PMC1 PMIs.
+	 *
+	 * PMC1 counts processor cycles, with 100 to go before overflowing.
+	 * FCECE is set.
+	 * PMC1CE is clear if !pmc1ce, otherwise set.
+	 */
+	mtspr(SPRN_PMC1, 0x80000000-100);
+	mmcr0 = MMCR0_FCECE | MMCR0_SHRFC;
+	if (pmc1ce)
+		mmcr0 |= MMCR0_PMC1CE;
+	mtspr(SPRN_MMCR0, mmcr0);
+
+	do {
+		do_empty_loop(0);
+		if (x++ > 20000000)
+			break;	/* timeout, in case PMC1 never overflows */
+	} while (!(mfspr(SPRN_PMC1) & 0x80000000));
+	do_empty_loop(0);
+
+	printk(KERN_INFO "PERFCTR INIT: %s(%u): MMCR0[FC] is %u, PMC1 is %#lx\n",
+	       __FUNCTION__, pmc1ce,
+	       !!(mfspr(SPRN_MMCR0) & MMCR0_FC), mfspr(SPRN_PMC1));
+	mtspr(SPRN_MMCR0, 0);
+	mtspr(SPRN_PMC1, 0);
+}
+
+static void __init check_trigger(unsigned int pmc1ce)
+{
+	unsigned int mmcr0;
+	int x = 0;
+
+	/*
+	 * This test checks if MMCR0[TRIGGER] is reset after PMC1 overflows.
+	 * 74xx documentation states this behaviour, while documentation
+	 * for 604/750 processors doesn't mention this at all.
+	 * [No reset confirmed on 750.]
+	 *
+	 * Also output the values of PMC1 and PMC2 shortly after the overflow.
+	 * PMC2 should be equal to PMC1-0x80000000.
+	 *
+	 * When pmc1ce == 0, MMCR0[PMC1CE] is zero. It's unclear whether
+	 * this masks all PMC1 overflow events or just PMC1 PMIs.
+	 *
+	 * PMC1 counts processor cycles, with 100 to go before overflowing.
+	 * PMC2 counts processor cycles, starting from 0.
+	 * TRIGGER is set, so PMC2 doesn't start until PMC1 overflows.
+	 * PMC1CE is clear if !pmc1ce, otherwise set.
+	 */
+	mtspr(SPRN_PMC2, 0);
+	mtspr(SPRN_PMC1, 0x80000000-100);
+	mmcr0 = MMCR0_TRIGGER | MMCR0_SHRFC | MMCR0_FCHV;
+	if (pmc1ce)
+		mmcr0 |= MMCR0_PMC1CE;
+	mtspr(SPRN_MMCR0, mmcr0);
+
+	do {
+		do_empty_loop(0);
+		if (x++ > 20000000)
+			break;	/* timeout, in case PMC1 never overflows */
+	} while (!(mfspr(SPRN_PMC1) & 0x80000000));
+	do_empty_loop(0);
+
+	printk(KERN_INFO "PERFCTR INIT: %s(%u): MMCR0[TRIGGER] is %u, PMC1 is %#lx, PMC2 is %#lx\n",
+	       __FUNCTION__, pmc1ce,
+	       !!(mfspr(SPRN_MMCR0) & MMCR0_TRIGGER), mfspr(SPRN_PMC1), mfspr(SPRN_PMC2));
+	mtspr(SPRN_MMCR0, 0);
+	mtspr(SPRN_PMC1, 0);
+	mtspr(SPRN_PMC2, 0);
+}
+
+static void __init measure_overheads(void)
+{
+	int i;
+	unsigned int mmcr0, loop, ticks[12];
+	const char *name[12];
+
+	clear();
+
+	/* PMC1 = "processor cycles",
+	   PMC2 = "completed instructions",
+	   not disabled in any mode,
+	   no interrupts */
+	/* mmcr0 = (0x01 << 6) | (0x02 << 0); */
+	mmcr0 = MMCR0_SHRFC | MMCR0_FCWAIT;
+	mtspr(SPRN_MMCR0, mmcr0);
+
+	name[0] = "mftbl";
+	ticks[0] = run(do_read_tbl, 0);
+	name[1] = "mfspr (pmc1)";
+	ticks[1] = run(do_read_pmc1, 0);
+	name[2] = "mfspr (pmc2)";
+	ticks[2] = run(do_read_pmc2, 0);
+	name[3] = "mfspr (pmc3)";
+	ticks[3] = run(do_read_pmc3, 0);
+	name[4] = "mfspr (pmc4)";
+	ticks[4] = run(do_read_pmc4, 0);
+	name[5] = "mfspr (mmcr0)";
+	ticks[5] = run(do_read_mmcr0, 0);
+	name[6] = "mfspr (mmcr1)";
+	ticks[6] = run(do_read_mmcr1, 0);
+	name[7] = "mtspr (pmc2)";
+	ticks[7] = run(do_write_pmc2, 0);
+	name[8] = "mtspr (pmc3)";
+	ticks[8] = run(do_write_pmc3, 0);
+	name[9] = "mtspr (pmc4)";
+	ticks[9] = run(do_write_pmc4, 0);
+	name[10] = "mtspr (mmcr1)";
+	ticks[10] = run(do_write_mmcr1, 0);
+	name[11] = "mtspr (mmcr0)";
+	ticks[11] = run(do_write_mmcr0, mmcr0);
+
+	loop = run(do_empty_loop, 0);
+
+	clear();
+
+	init_tests_message();
+	printk(KERN_INFO "PERFCTR INIT: NITER == %u\n", NITER);
+	printk(KERN_INFO "PERFCTR INIT: loop overhead is %u cycles\n", loop);
+	for(i = 0; i < ARRAY_SIZE(ticks); ++i) {
+		unsigned int x;
+		if (!ticks[i])
+			continue;
+		x = ((ticks[i] - loop) * 10) / NITER;
+		printk(KERN_INFO "PERFCTR INIT: %s cost is %u.%u cycles (%u total)\n",
+		       name[i], x/10, x%10, ticks[i]);
+	}
+
+	check_fcece(0);
+#if 0
+	check_fcece(1);
+	check_trigger(0);
+	check_trigger(1);
+#endif
+}
+
+void __init perfctr_ppc64_init_tests(void)
+{
+	preempt_disable();
+	measure_overheads();
+	preempt_enable();
+}
diff -puN /dev/null drivers/perfctr/ppc64_tests.h
--- /dev/null	2003-09-15 06:40:47.000000000 -0700
+++ devel-akpm/drivers/perfctr/ppc64_tests.h	2005-07-08 23:11:41.000000000 -0700
@@ -0,0 +1,12 @@
+/*
+ * Performance-monitoring counters driver.
+ * Optional PPC64-specific init-time tests.
+ *
+ * Copyright (C) 2004 Mikael Pettersson
+ */
+
+#ifdef CONFIG_PERFCTR_INIT_TESTS
+extern void perfctr_ppc64_init_tests(void);
+#else
+static inline void perfctr_ppc64_init_tests(void) { }
+#endif
diff -puN /dev/null include/asm-ppc64/perfctr.h
--- /dev/null	2003-09-15 06:40:47.000000000 -0700
+++ devel-akpm/include/asm-ppc64/perfctr.h	2005-07-08 23:11:41.000000000 -0700
@@ -0,0 +1,167 @@
+/*
+ * PPC64 Performance-Monitoring Counters driver
+ *
+ * Copyright (C) 2004 David Gibson, IBM Corporation.
+ * Copyright (C) 2004 Mikael Pettersson
+ */
+#ifndef _ASM_PPC64_PERFCTR_H
+#define _ASM_PPC64_PERFCTR_H
+
+#include <asm/types.h>
+
+struct perfctr_sum_ctrs {
+	__u64 tsc;
+	__u64 pmc[8];	/* the size is not part of the user ABI */
+};
+
+struct perfctr_cpu_control_header {
+	__u32 tsc_on;
+	__u32 nractrs;	/* number of accumulation-mode counters */
+	__u32 nrictrs;	/* number of interrupt-mode counters */
+};
+
+struct perfctr_cpu_state_user {
+	__u32 cstatus;
+	/* This is a sequence counter to ensure atomic reads by
+	 * userspace. The mechanism is identical to that used for
+	 * seqcount_t in include/linux/seqlock.h. */
+	__u32 sequence;
+	__u64 tsc_start;
+	__u64 tsc_sum;
+	struct {
+		__u64 start;
+		__u64 sum;
+	} pmc[8];	/* the size is not part of the user ABI */
+};
+
+/* cstatus is a re-encoding of control.tsc_on/nractrs/nrictrs
+   which should have less overhead in most cases */
+/* XXX: ppc driver internally also uses cstatus&(1<<30) */
+
+static inline
+unsigned int perfctr_mk_cstatus(unsigned int tsc_on, unsigned int nractrs,
+				unsigned int nrictrs)
+{
+	return (tsc_on<<31) | (nrictrs<<16) | ((nractrs+nrictrs)<<8) | nractrs;
+}
+
+static inline unsigned int perfctr_cstatus_enabled(unsigned int cstatus)
+{
+	return cstatus;
+}
+
+static inline int perfctr_cstatus_has_tsc(unsigned int cstatus)
+{
+	return (int)cstatus < 0;	/* test and jump on sign */
+}
+
+static inline unsigned int perfctr_cstatus_nractrs(unsigned int cstatus)
+{
+	return cstatus & 0x7F;	/* and with imm8 */
+}
+
+static inline unsigned int perfctr_cstatus_nrctrs(unsigned int cstatus)
+{
+	return (cstatus >> 8) & 0x7F;
+}
+
+static inline unsigned int perfctr_cstatus_has_ictrs(unsigned int cstatus)
+{
+	return cstatus & (0x7F << 16);
+}
+
+/*
+ * 'struct siginfo' support for perfctr overflow signals.
+ * In unbuffered mode, si_code is set to SI_PMC_OVF and a bitmask
+ * describing which perfctrs overflowed is put in si_pmc_ovf_mask.
+ * A bitmask is used since more than one perfctr can have overflowed
+ * by the time the interrupt handler runs.
+ */
+#define SI_PMC_OVF	-8
+#define si_pmc_ovf_mask	_sifields._pad[0]	/* XXX: use an unsigned field later */
+
+#ifdef __KERNEL__
+
+#if defined(CONFIG_PERFCTR)
+
+struct perfctr_cpu_control {
+	struct perfctr_cpu_control_header header;
+	u64 mmcr0;
+	u64 mmcr1;
+	u64 mmcra;
+	unsigned int ireset[8];		/* [0,0x7fffffff], for i-mode counters, physical indices */
+	unsigned int pmc_map[8];	/* virtual to physical index map */
+};
+
+struct perfctr_cpu_state {
+	/* Don't change field order here without first considering the number
+	   of cache lines touched during sampling and context switching. */
+	unsigned int id;
+	int isuspend_cpu;
+	struct perfctr_cpu_state_user user;
+	unsigned int unused_pmcs;
+	struct perfctr_cpu_control control;
+};
+
+/* Driver init/exit. */
+extern int perfctr_cpu_init(void);
+extern void perfctr_cpu_exit(void);
+
+/* CPU type name. */
+extern char *perfctr_cpu_name;
+
+/* Hardware reservation. */
+extern const char *perfctr_cpu_reserve(const char *service);
+extern void perfctr_cpu_release(const char *service);
+
+/* PRE: state has no running interrupt-mode counters.
+   Check that the new control data is valid.
+   Update the driver's private control data.
+   Returns a negative error code if the control data is invalid. */
+extern int perfctr_cpu_update_control(struct perfctr_cpu_state *state, int is_global);
+
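+/*
+ * Illustrative sketch (not part of this patch): filling in a control
+ * block for one a-mode counter plus one i-mode counter before calling
+ * perfctr_cpu_update_control().  All field values below are made up;
+ * mmcr0/mmcr1 must select real events on a real CPU:
+ *
+ *	struct perfctr_cpu_state state;
+ *
+ *	memset(&state, 0, sizeof state);
+ *	state.control.header.tsc_on = 1;
+ *	state.control.header.nractrs = 1;	(a-mode counters come first)
+ *	state.control.header.nrictrs = 1;
+ *	state.control.pmc_map[0] = 0;		(a-mode counter on PMC1)
+ *	state.control.pmc_map[1] = 1;		(i-mode counter on PMC2)
+ *	state.control.ireset[1] = 0x80000000 - 256;	(overflow after 256 events)
+ *	err = perfctr_cpu_update_control(&state, 0);
+ *
+ * Once counters run, a userspace sampler reads perfctr_cpu_state_user
+ * with a seqcount_t-style retry loop, assuming the state has been made
+ * visible to user mode (e.g. via mmap()):
+ *
+ *	do {
+ *		seq = state->sequence;
+ *		sum = state->tsc_sum;
+ *	} while (state->sequence != seq || (seq & 1));
+ */
+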
+/* Parse and update control for the given domain.
+   Returns the number of bytes consumed. */
+extern int perfctr_cpu_control_write(struct perfctr_cpu_control *control,
+				     unsigned int domain,
+				     const void *srcp, unsigned int srcbytes);
+
+/* Retrieve and format control for the given domain.
+   Returns number of bytes written. */
+extern int perfctr_cpu_control_read(const struct perfctr_cpu_control *control,
+				    unsigned int domain,
+				    void *dstp, unsigned int dstbytes);
+
+/* Read a-mode counters. Subtract from start and accumulate into sums.
+   Must be called with preemption disabled. */
+extern void perfctr_cpu_suspend(struct perfctr_cpu_state *state);
+
+/* Write control registers. Read a-mode counters into start.
+   Must be called with preemption disabled. */
+extern void perfctr_cpu_resume(struct perfctr_cpu_state *state);
+
+/* Perform an efficient combined suspend/resume operation.
+   Must be called with preemption disabled. */
+extern void perfctr_cpu_sample(struct perfctr_cpu_state *state);
+
+/* The type of a perfctr overflow interrupt handler.
+   It will be called in IRQ context, with preemption disabled. */
+typedef void (*perfctr_ihandler_t)(unsigned long pc);
+
+/* Operations related to overflow interrupt handling. */
+#ifdef CONFIG_PERFCTR_INTERRUPT_SUPPORT
+extern void perfctr_cpu_set_ihandler(perfctr_ihandler_t);
+extern void perfctr_cpu_ireload(struct perfctr_cpu_state*);
+extern unsigned int perfctr_cpu_identify_overflow(struct perfctr_cpu_state*);
+#else
+static inline void perfctr_cpu_set_ihandler(perfctr_ihandler_t x) { }
+#endif
+static inline int perfctr_cpu_has_pending_interrupt(const struct perfctr_cpu_state *state)
+{
+	return 0;
+}
+
+#endif	/* CONFIG_PERFCTR */
+
+#endif	/* __KERNEL__ */
+
+#endif	/* _ASM_PPC64_PERFCTR_H */
diff -puN arch/i386/kernel/syscall_table.S~perfctr arch/i386/kernel/syscall_table.S
--- devel/arch/i386/kernel/syscall_table.S~perfctr	2005-07-08 23:11:41.000000000 -0700
+++ devel-akpm/arch/i386/kernel/syscall_table.S	2005-07-08 23:11:41.000000000 -0700
@@ -293,3 +293,7 @@ ENTRY(sys_call_table)
 	.long sys_ioprio_get	/* 290 */
 	.long sys_pselect6
 	.long sys_ppoll
+	.long sys_vperfctr_open
+	.long sys_vperfctr_control
+	.long sys_vperfctr_write	/* 295 */
+	.long sys_vperfctr_read
_