From: Mikael Pettersson

This patch adds documentation for perfctr's low-level drivers
in Documentation/perfctr/. The internal API between perfctr's
low-level and high-level drivers is described, as are the
architecture-specific data structures users use to control and
inspect the counters.

Signed-off-by: Mikael Pettersson
DESC
perfctr documentation update
EDESC
From: Mikael Pettersson

This patch updates perfctr's documentation:
- adds new Implementation Notes section to the x86 documentation
- some minor fixes in the x86 documentation
- adds new documentation on the per-process perfctrs
- adds new overview documentation

Signed-off-by: Mikael Pettersson
Signed-off-by: Andrew Morton
---

 25-akpm/Documentation/perfctr/low-level-api.txt   |  216 +++++++++++++
 25-akpm/Documentation/perfctr/low-level-ppc32.txt |  164 ++++++++++
 25-akpm/Documentation/perfctr/low-level-x86.txt   |  360 ++++++++++++++++++++++
 25-akpm/Documentation/perfctr/overview.txt        |  129 +++++++
 25-akpm/Documentation/perfctr/virtual.txt         |  355 +++++++++++++++++++++
 5 files changed, 1224 insertions(+)

diff -puN /dev/null Documentation/perfctr/low-level-api.txt
--- /dev/null	Thu Apr 11 07:25:15 2002
+++ 25-akpm/Documentation/perfctr/low-level-api.txt	Mon Aug 16 15:57:30 2004
@@ -0,0 +1,216 @@
+$Id: low-level-api.txt,v 1.1 2004/07/02 18:57:05 mikpe Exp $
+
+PERFCTR LOW-LEVEL DRIVERS API
+=============================
+
+This document describes the common low-level API.
+See low-level-$ARCH.txt for architecture-specific documentation.
+
+General Model
+=============
+The model is that of a processor with:
+- A non-programmable clock-like counter, the "TSC".
+  The TSC frequency is assumed to be constant, but it is not
+  assumed to be identical to the core frequency.
+  The TSC may be absent.
+- A set of programmable counters, the "perfctrs" or "pmcs".
+  Control data may be per-counter, global, or both.
+  The counters are not assumed to be interchangeable.
+
+  A normal counter that simply counts events is referred to
+  as an "accumulation-mode" or "a-mode" counter. Its total
+  count is computed by adding the counts for the individual
+  periods during which the counter is active. Two per-counter
+  state variables are used for this: "sum", which is the
+  total count up to but not including the current period,
+  and "start", which records the value of the hardware counter
+  at the start of the current period. At the end of a period,
+  the hardware counter's value is read again, and the increment
+  relative to the start value is added to the sum. This strategy
+  is used because it avoids a number of hardware problems.
+
+  A counter that has been programmed to generate an interrupt
+  on overflow is referred to as an "interrupt-mode" or "i-mode"
+  counter. I-mode counters are initialised to specific values,
+  and after overflowing are reset to their (re)start values.
+  The total event count is available just as for a-mode counters.
+
+  The set of counters may be empty, in which case only the
+  TSC (which must be present) can be sampled.
+
+Contents of
+=================================
+
+"struct perfctr_sum_ctrs"
+-------------------------
+struct perfctr_sum_ctrs {
+	unsigned long long tsc;
+	unsigned long long pmc[..];	/* one per counter */
+};
+
+Architecture-specific container for counter values.
+Used in the kernel/user API, but not by the low-level drivers.
+
+"struct perfctr_cpu_control"
+----------------------------
+This struct includes at least the following fields:
+
+	unsigned int tsc_on;
+	unsigned int nractrs;	/* # of a-mode counters */
+	unsigned int nrictrs;	/* # of i-mode counters */
+	unsigned int pmc_map[..]; /* one per counter: virt-to-phys mapping */
+	unsigned int evntsel[..]; /* one per counter: hw control data */
+	int ireset[..];		/* one per counter: i-mode (re)start value */
+
+Architecture-specific container for control data.
+Used both in the kernel/user API and by the low-level drivers
+(embedded in "struct perfctr_cpu_state").
+
+"tsc_on" is non-zero if the TSC should be sampled.
+
+"nractrs" is the number of a-mode counters, corresponding to
+elements 0..nractrs-1 in the per-counter arrays.
+
+"nrictrs" is the number of i-mode counters, corresponding to
+elements nractrs..nractrs+nrictrs-1 in the per-counter arrays.
+
+"nractrs+nrictrs" is the total number of counters to program
+and sample. A-mode and i-mode counters are separated in order
+to allow quick enumeration of either set, which is needed in
+some low-level driver operations.
+
+"pmc_map[]" maps each counter to its corresponding hardware counter
+identification. No two counters may map to the same hardware counter.
+This mapping is present because the hardware may have asymmetric
+counters or other addressing quirks, which means that a counter's index
+may not suffice to address its hardware counter.
+
+"evntsel[]" contains the per-counter control data. Architecture-specific
+global control data, if any, is placed in architecture-specific fields.
+
+"ireset[]" contains the (re)start values for the i-mode counters.
+Only indices nractrs..nractrs+nrictrs-1 are used.
+
+"struct perfctr_cpu_state"
+--------------------------
+This struct includes at least the following fields:
+
+	unsigned int cstatus;
+	unsigned int tsc_start;
+	unsigned long long tsc_sum;
+	struct {
+		unsigned int map;
+		unsigned int start;
+		unsigned long long sum;
+	} pmc[..];	/* one per counter; the size is not part of the user ABI */
+#ifdef __KERNEL__
+	struct perfctr_cpu_control control;
+#endif
+
+This type records the state and control data for a collection
+of counters. It is used by many low-level operations, and may
+be exported to user-space via mmap().
+
+"cstatus" is a re-encoding of control.tsc_on/nractrs/nrictrs,
+used because it reduces overheads in key low-level operations.
+Operations on cstatus values include:
+- unsigned int perfctr_mk_cstatus(unsigned int tsc_on, unsigned int nractrs, unsigned int nrictrs);
+  Construct a cstatus value.
+- unsigned int perfctr_cstatus_enabled(unsigned int cstatus);
+  Check if any part (tsc_on, nractrs, nrictrs) of the cstatus is non-zero.
+- int perfctr_cstatus_has_tsc(unsigned int cstatus);
+  Check if the tsc_on part of the cstatus is non-zero.
+- unsigned int perfctr_cstatus_nrctrs(unsigned int cstatus);
+  Retrieve nractrs+nrictrs from the cstatus.
+- unsigned int perfctr_cstatus_has_ictrs(unsigned int cstatus);
+  Check if the nrictrs part of cstatus is non-zero.
+
+"tsc_start" and "tsc_sum" record the state of the TSC.
+
+"pmc[]" contains the per-counter state, in the "start" and "sum"
+fields. The "map" field contains the corresponding hardware counter
+identification, from the counter's entry in "control.pmc_map[]";
+it is copied into pmc[] to reduce overheads in key low-level operations.
+
+"control" contains the control data which determines the
+behaviour of the counters.
+
+User-space overflow signal handler items
+----------------------------------------
+After a counter has overflowed, a user-space signal handler may
+be invoked with a "struct siginfo" identifying the source of the
+signal and the set of overflown counters.
+
+#define SI_PMC_OVF	..
+
+Value to be stored in "si.si_code".
+
+#define si_pmc_ovf_mask	..
+
+Field in which to store a bit-mask of the overflown counters.
+
+Kernel-internal API
+-------------------
+
+/* Driver init/exit.
+   perfctr_cpu_init() performs hardware detection and may fail. */
+extern int perfctr_cpu_init(void);
+extern void perfctr_cpu_exit(void);
+
+/* CPU type name. Set if perfctr_cpu_init() was successful. */
+extern char *perfctr_cpu_name;
+
+/* Hardware reservation. A high-level driver must reserve the
+   hardware before it may use it, and release it afterwards.
+   "service" is a unique string identifying the high-level driver.
+   perfctr_cpu_reserve() returns NULL on success; if another
+   high-level driver has reserved the hardware, then that
+   driver's "service" string is returned. */
+extern const char *perfctr_cpu_reserve(const char *service);
+extern void perfctr_cpu_release(const char *service);
+
+/* PRE: state has no running interrupt-mode counters.
+   Check that the new control data is valid.
+   Update the low-level driver's private control data.
+   is_global should be zero for per-process counters and non-zero
+   for global-mode counters.
+   Returns a negative error code if the control data is invalid. */
+extern int perfctr_cpu_update_control(struct perfctr_cpu_state *state, int is_global);
+
+/* Stop i-mode counters. Update sums and start values.
+   Read a-mode counters. Subtract their start values and accumulate
+   the differences into the sums.
+   Must be called with preemption disabled. */
+extern void perfctr_cpu_suspend(struct perfctr_cpu_state *state);
+
+/* Reset i-mode counters to their start values.
+   Write control registers.
+   Read a-mode counters and update their start values.
+   Must be called with preemption disabled. */
+extern void perfctr_cpu_resume(struct perfctr_cpu_state *state);
+
+/* Perform an efficient combined suspend/resume operation.
+   Must be called with preemption disabled. */
+extern void perfctr_cpu_sample(struct perfctr_cpu_state *state);
+
+/* The type of a perfctr overflow interrupt handler.
+   It will be called in IRQ context, with preemption disabled. */
+typedef void (*perfctr_ihandler_t)(unsigned long pc);
+
+/* Install a perfctr overflow interrupt handler.
+   Should be called after perfctr_cpu_reserve() but before
+   any counter state has been activated. */
+extern void perfctr_cpu_set_ihandler(perfctr_ihandler_t);
+
+/* PRE: The state has been suspended and sampled by perfctr_cpu_suspend().
+   Should be called from the high-level driver's perfctr_ihandler_t,
+   and preemption must not have been enabled.
+   Identify which counters have overflown, reset their start values
+   from ireset[], and perform any necessary hardware cleanup.
+   Returns a bit-mask of the overflown counters. */
+extern unsigned int perfctr_cpu_identify_overflow(struct perfctr_cpu_state*);
+
+/* Call perfctr_cpu_ireload() just before perfctr_cpu_resume() to
+   bypass internal caching and force a reload of the i-mode pmcs.
+   This ensures that perfctr_cpu_identify_overflow()'s state changes
+   are propagated to the hardware. */
+extern void perfctr_cpu_ireload(struct perfctr_cpu_state*);
diff -puN /dev/null Documentation/perfctr/low-level-ppc32.txt
--- /dev/null	Thu Apr 11 07:25:15 2002
+++ 25-akpm/Documentation/perfctr/low-level-ppc32.txt	Mon Aug 16 15:57:30 2004
@@ -0,0 +1,164 @@
+$Id: low-level-ppc32.txt,v 1.1 2004/07/02 18:57:05 mikpe Exp $
+
+PERFCTRS PPC32 LOW-LEVEL API
+============================
+
+See low-level-api.txt for the common low-level API.
+This document only describes ppc32-specific behaviour.
+For detailed hardware control register layouts, see
+the manufacturers' documentation.
+
+Supported processors
+====================
+- PowerPC 604, 604e, 604ev.
+- PowerPC 750/740, 750CX, 750FX, 750GX.
+- PowerPC 7400, 7410, 7451/7441, 7457/7447.
+- Any generic PowerPC with a timebase register.
+
+Contents of
+=================================
+
+"struct perfctr_sum_ctrs"
+-------------------------
+struct perfctr_sum_ctrs {
+	unsigned long long tsc;
+	unsigned long long pmc[8];
+};
+
+The pmc[] array has room for 8 counters.
+
+"struct perfctr_cpu_control"
+----------------------------
+struct perfctr_cpu_control {
+	unsigned int tsc_on;
+	unsigned int nractrs;	/* # of a-mode counters */
+	unsigned int nrictrs;	/* # of i-mode counters */
+	unsigned int pmc_map[8];
+	unsigned int evntsel[8];	/* one per counter, even on P5 */
+	int ireset[8];		/* [0,0x7fffffff], for i-mode counters */
+	struct {
+		unsigned int mmcr0;	/* sans PMC{1,2}SEL */
+		unsigned int mmcr2;	/* only THRESHMULT */
+		/* IABR/DABR/BAMR not supported */
+	} ppc;
+	unsigned int _reserved1;
+	unsigned int _reserved2;
+	unsigned int _reserved3;
+	unsigned int _reserved4;
+};
+
+The per-counter arrays have room for 8 elements.
+
+ireset[] values must be non-negative, since overflow occurs on
+the non-negative-to-negative transition.
+
+The ppc sub-struct contains PowerPC-specific control data:
+- mmcr0: global control data for the MMCR0 SPR; the event
+  selectors for PMC1 and PMC2 are in evntsel[], not in mmcr0
+- mmcr2: global control data for the MMCR2 SPR; only the
+  THRESHMULT field can be specified
+
+"struct perfctr_cpu_state"
+--------------------------
+struct perfctr_cpu_state {
+	unsigned int cstatus;
+	struct {	/* k1 is opaque in the user ABI */
+		unsigned int id;
+		int isuspend_cpu;
+	} k1;
+	/* The two tsc fields must be inlined. Placing them in a
+	   sub-struct causes unwanted internal padding on x86-64. */
+	unsigned int tsc_start;
+	unsigned long long tsc_sum;
+	struct {
+		unsigned int map;
+		unsigned int start;
+		unsigned long long sum;
+	} pmc[8];	/* the size is not part of the user ABI */
+#ifdef __KERNEL__
+	unsigned int ppc_mmcr[3];
+	struct perfctr_cpu_control control;
+#endif
+};
+
+The k1 sub-struct is used by the low-level driver for
+caching purposes. "id" identifies the control data, and
+"isuspend_cpu" identifies the CPU on which the i-mode
+counters were last suspended.
+
+The pmc[] array has room for 8 elements.
+
+ppc_mmcr[] is computed from control by the low-level driver,
+and provides the data for the MMCR0, MMCR1, and MMCR2 SPRs.
+
+User-space overflow signal handler items
+----------------------------------------
+#ifdef __KERNEL__
+#define SI_PMC_OVF	(__SI_FAULT|'P')
+#else
+#define SI_PMC_OVF	('P')
+#endif
+#define si_pmc_ovf_mask	_sifields._pad[0]
+
+Kernel-internal API
+-------------------
+
+In perfctr_cpu_update_control(), the is_global parameter
+is ignored. (It is only relevant for x86.)
+
+CONFIG_PERFCTR_CPUS_FORBIDDEN_MASK is never defined.
+(It is only relevant for x86.)
+
+Overflow interrupt handling is not yet implemented.
+
+Processor-specific Notes
+========================
+
+General
+-------
+pmc_map[] contains a counter number, an integer between 0 and 5.
+It never contains an SPR number.
+
+Basic operation (the strategy for a-mode counters, caching
+control register contents, recording "suspend CPU" for i-mode
+counters) is the same as in the x86 driver.
+
+PowerPC 604/750/74xx
+--------------------
+These processors use similar hardware layouts, differing
+mainly in the number of counter and control registers.
+The set of available events differs greatly, but that only
+affects users, not the low-level driver itself.
+
+The hardware has 2 (604), 4 (604e/750/7400/7410), or 6
+(745x) counters (PMC1 to PMC6), and 1 (604), 2 (604e/750),
+or 3 (74xx) control registers (MMCR0 to MMCR2).
+
+MMCR0 contains global control bits, and the event selection
+fields for PMC1 and PMC2. MMCR1 contains event selection fields
+for PMC3-PMC6. MMCR2 contains the THRESHMULT flag, which
+specifies how MMCR0[THRESHOLD] should be scaled.
+
+In control.ppc.mmcr0, the PMC1SEL and PMC2SEL fields (0x00001FFF)
+are reserved. The PMXE flag (0x04000000) may only be set when
+the driver supports overflow interrupts.
+
+If FCECE or TRIGGER is set in MMCR0 on a 74xx processor, then
+MMCR0 can change asynchronously. The driver handles this, at
+the cost of some additional work in perfctr_cpu_suspend().
+Not setting these flags avoids that overhead.
+
+In control.ppc.mmcr2, only the THRESHMULT flag (0x80000000)
+may be set, and only on 74xx processors.
+
+The SIA (sampled instruction address) register is not used.
+The SDA (sampled data address) register is 604/604e-only,
+and is not used. The BAMR (breakpoint address mask) register
+is not used, but it is cleared by the driver.
+
+Generic PowerPC with timebase
+-----------------------------
+The driver supports any PowerPC as long as it has a timebase
+register, and the TB frequency is available via Open Firmware.
+In this case, the only valid usage mode is with tsc_on == 1
+and nractrs == nrictrs == 0 in the control data.
diff -puN /dev/null Documentation/perfctr/low-level-x86.txt
--- /dev/null	Thu Apr 11 07:25:15 2002
+++ 25-akpm/Documentation/perfctr/low-level-x86.txt	Mon Aug 16 15:57:33 2004
@@ -0,0 +1,360 @@
+$Id: low-level-x86.txt,v 1.1 2004/07/02 18:57:05 mikpe Exp $
+
+PERFCTRS X86 LOW-LEVEL API
+==========================
+
+See low-level-api.txt for the common low-level API.
+This document only describes x86-specific behaviour.
+For detailed hardware control register layouts, see
+the manufacturers' documentation.
+
+Contents
+========
+- Supported processors
+- Contents of
+- Processor-specific Notes
+- Implementation Notes
+
+Supported processors
+====================
+- Intel P5, P5MMX, P6, P4.
+- AMD K7, K8. (P6 clones, with some changes)
+- Cyrix 6x86MX, MII, and III. (good P5 clones)
+- Centaur WinChip C6, 2, and 3. (bad P5 clones)
+- VIA C3. (bad P6 clone)
+- Any generic x86 with a TSC.
+
+Contents of
+================================
+
+"struct perfctr_sum_ctrs"
+-------------------------
+struct perfctr_sum_ctrs {
+	unsigned long long tsc;
+	unsigned long long pmc[18];
+};
+
+The pmc[] array has room for 18 counters.
+
+"struct perfctr_cpu_control"
+----------------------------
+struct perfctr_cpu_control {
+	unsigned int tsc_on;
+	unsigned int nractrs;	/* # of a-mode counters */
+	unsigned int nrictrs;	/* # of i-mode counters */
+	unsigned int pmc_map[18];
+	unsigned int evntsel[18];	/* one per counter, even on P5 */
+	struct {
+		unsigned int escr[18];
+		unsigned int pebs_enable;	/* for replay tagging */
+		unsigned int pebs_matrix_vert;	/* for replay tagging */
+	} p4;
+	int ireset[18];		/* < 0, for i-mode counters */
+	unsigned int _reserved1;
+	unsigned int _reserved2;
+	unsigned int _reserved3;
+	unsigned int _reserved4;
+};
+
+The per-counter arrays have room for 18 elements.
+
+ireset[] values must be negative, since overflow occurs on
+the negative-to-non-negative transition.
+
+The p4 sub-struct contains P4-specific control data:
+- escr[]: the control data to write to the ESCR register
+  associated with the counter
+- pebs_enable: the control data to write to the PEBS_ENABLE MSR
+- pebs_matrix_vert: the control data to write to the
+  PEBS_MATRIX_VERT MSR
+
+"struct perfctr_cpu_state"
+--------------------------
+struct perfctr_cpu_state {
+	unsigned int cstatus;
+	struct {	/* k1 is opaque in the user ABI */
+		unsigned int id;
+		int isuspend_cpu;
+	} k1;
+	/* The two tsc fields must be inlined. Placing them in a
+	   sub-struct causes unwanted internal padding on x86-64. */
+	unsigned int tsc_start;
+	unsigned long long tsc_sum;
+	struct {
+		unsigned int map;
+		unsigned int start;
+		unsigned long long sum;
+	} pmc[18];	/* the size is not part of the user ABI */
+#ifdef __KERNEL__
+	struct perfctr_cpu_control control;
+	unsigned int p4_escr_map[18];
+#endif
+};
+
+The k1 sub-struct is used by the low-level driver for
+caching purposes. "id" identifies the control data, and
+"isuspend_cpu" identifies the CPU on which the i-mode
+counters were last suspended.
+
+The pmc[] array has room for 18 elements.
+
+p4_escr_map[] is computed from control by the low-level driver,
+and provides the MSR number for the counter's associated ESCR.
+
+User-space overflow signal handler items
+----------------------------------------
+#ifdef __KERNEL__
+#define SI_PMC_OVF	(__SI_FAULT|'P')
+#else
+#define SI_PMC_OVF	('P')
+#endif
+#define si_pmc_ovf_mask	_sifields._pad[0]
+
+Kernel-internal API
+-------------------
+
+In perfctr_cpu_update_control(), the is_global parameter controls
+whether monitoring the other thread (T1) on HT P4s is permitted
+or not. On other processors the parameter is ignored.
+
+SMP kernels define CONFIG_PERFCTR_CPUS_FORBIDDEN_MASK and
+"extern cpumask_t perfctr_cpus_forbidden_mask;".
+On HT P4s, resource conflicts can occur because both threads
+(T0 and T1) in a processor share the same perfctr registers.
+To prevent conflicts, only thread 0 in each processor is allowed
+to access the counters. perfctr_cpus_forbidden_mask contains the
+smp_processor_id()s of each processor's thread 1, and it is the
+responsibility of the high-level driver to ensure that it never
+accesses the perfctr state from a forbidden thread.
+
+Overflow interrupt handling requires local APIC support in the kernel.
+
+Processor-specific Notes
+========================
+
+General
+-------
+pmc_map[] contains a counter number, as used by the RDPMC instruction.
+It never contains an MSR number.
+
+Counters are 32, 40, or 48 bits wide. The driver only ever
+reads the low 32 bits. This avoids performance issues, and
+errata on some processors.
+
+Writing to counters or their control registers tends to be
+very expensive. This is why a-mode counters only use read
+operations on the counter registers. Caching of control
+register contents is done to avoid writing them. "Suspend CPU"
+is recorded for i-mode counters to avoid writing the counter
+registers when the counters are resumed (their control
+registers must be written at both suspend and resume, however).
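
As an illustration of why reading only the low 32 bits is sufficient
for a-mode counters, here is a minimal sketch of the "start"/"sum"
bookkeeping. It is not the driver's code: struct ctr_state and the
ctr_resume()/ctr_suspend() helpers are hypothetical names, and the
hardware counter value is passed in as a parameter instead of being
read with RDPMC.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-counter state, modelled on the "start" and "sum"
   fields described in low-level-api.txt. */
struct ctr_state {
	uint32_t start;	/* low 32 bits of the hw counter at period start */
	uint64_t sum;	/* total count up to, not including, this period */
};

/* Begin a period: only a read of the hardware counter is needed. */
static void ctr_resume(struct ctr_state *c, uint32_t hw_now)
{
	assert(c);
	c->start = hw_now;
}

/* End a period: add the increment since "start" to "sum".
   The subtraction is done modulo 2^32, so a wrap of the low
   32 bits during the period is handled correctly. */
static void ctr_suspend(struct ctr_state *c, uint32_t hw_now)
{
	assert(c);
	c->sum += (uint32_t)(hw_now - c->start);
	c->start = hw_now;
}
```

The sketch relies on fewer than 2^32 events occurring between two
consecutive samples of a counter; that is an assumption of the
illustration, not a statement about the driver.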
+
+Some processors are unable to stop the counters (Centaur/VIA),
+and some are unable to reinitialise them to arbitrary values (P6).
+Storing the counters' total counts in the hardware counters
+would break as soon as context-switches occur. This is another
+reason why the accumulate-differences method for maintaining the
+counter values is used.
+
+Intel P5
+--------
+The hardware stores both counters' control data in a single
+control register, the CESR MSR. The evntsel values are
+limited to 16 bits each, and are combined by the low-level
+driver to form the value for the CESR. Apart from that,
+the evntsel values are direct images of the CESR.
+
+Bits 0xFE00 in an evntsel value are reserved.
+At least one evntsel CPL bit (0x00C0) must be set.
+
+For Cyrix' P5 clones, evntsel bits 0xFA00 are reserved.
+
+For Centaur's P5 clones, evntsel bits 0xFF00 are reserved.
+It has no CPL bits to set. The TSC is broken and cannot be used.
+
+Intel P6
+--------
+The evntsel values are mapped directly onto the counters'
+EVNTSEL control registers.
+
+The global enable bit (22) in EVNTSEL0 must be set. That bit is
+reserved in EVNTSEL1.
+
+Bits 21 and 19 (0x00280000) in each evntsel are reserved.
+
+For an i-mode counter, bit 20 (0x00100000) of its evntsel must be
+set. For a-mode counters, that bit must not be set.
+
+Hardware quirk: Counters are 40 bits wide, but writing to a
+counter only writes the low 32 bits: remaining bits are
+sign-extended from bit 31.
+
+AMD K7/K8
+---------
+Similar to Intel P6. The main difference is that each evntsel has
+its own enable bit, which must be set.
+
+VIA C3
+------
+Superficially similar to Intel P6, but only PERFCTR1/EVNTSEL1
+are programmable. pmc_map[0] must be 1, if nractrs == 1.
+
+Bits 0xFFFFFE00 in the evntsel are reserved. There are no auxiliary
+control bits to set.
+
+Generic
+-------
+Only permits TSC sampling, with tsc_on == 1 and nractrs == nrictrs == 0
+in the control data.
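
The Intel P6 evntsel rules listed above can be written down as a
small validity check. This is only a sketch of those documented rules,
not the driver's validation code: the macro names and p6_evntsel_ok()
are hypothetical, and "pmc" is assumed to be the hardware counter
number (0 or 1) taken from pmc_map[].

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the bits mentioned in the Intel P6 notes. */
#define P6_EVNTSEL_ENABLE	0x00400000u	/* bit 22 */
#define P6_EVNTSEL_INT		0x00100000u	/* bit 20 */
#define P6_EVNTSEL_RESERVED	0x00280000u	/* bits 21 and 19 */

/* Return 1 if evntsel satisfies the documented P6 rules, else 0. */
static int p6_evntsel_ok(uint32_t evntsel, unsigned int pmc, int is_imode)
{
	assert(pmc < 2);
	if (evntsel & P6_EVNTSEL_RESERVED)
		return 0;
	if (pmc == 0) {
		/* the global enable bit must be set in EVNTSEL0 */
		if (!(evntsel & P6_EVNTSEL_ENABLE))
			return 0;
	} else if (evntsel & P6_EVNTSEL_ENABLE) {
		/* ... and is reserved in EVNTSEL1 */
		return 0;
	}
	/* bit 20 must be set for i-mode and clear for a-mode */
	if (is_imode)
		return (evntsel & P6_EVNTSEL_INT) != 0;
	return (evntsel & P6_EVNTSEL_INT) == 0;
}
```

For example, p6_evntsel_ok(0x00400079, 0, 0) accepts an a-mode
counter on PERFCTR0, while the same value is rejected for PERFCTR1
because of the enable bit.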
+
+Intel P4
+--------
+For each counter, its evntsel[] value is mapped onto its CCCR
+control register, and its p4.escr[] value is mapped onto its
+associated ESCR control register.
+
+The ESCR register number is computed from the hardware counter
+number (from pmc_map[]) and the ESCR SELECT field in the CCCR,
+and is cached in p4_escr_map[].
+
+pmc_map[] contains the value to pass to RDPMC when reading the
+counter. It is strongly recommended to set bit 31 (fast rdpmc).
+
+In each evntsel/CCCR value:
+- the OVF, OVF_PMI_T1 and hardware-reserved bits (0xB80007FF)
+  are reserved and must not be set
+- bit 11 (EXTENDED_CASCADE) is only permitted on P4 models >= 2,
+  and for counters 12 and 15-17
+- bits 16 and 17 (ACTIVE_THREAD) must both be set on non-HT processors
+- at least one of bits 12 (ENABLE), 30 (CASCADE), or 11 (EXTENDED_CASCADE)
+  must be set
+- bit 26 (OVF_PMI_T0) must be clear for a-mode counters, and set
+  for i-mode counters; if bit 25 (FORCE_OVF) also is set, then
+  the corresponding ireset[] value must be exactly -1
+
+In each p4.escr[] value:
+- bit 31 is reserved and must not be set
+- the CPL_T1 field (bits 0 and 1) must be zero except on HT processors
+  when global-mode counters are used
+- IQ_ESCR0 and IQ_ESCR1 can only be used on P4 models <= 2
+
+PEBS is not supported, but the replay tagging bits in PEBS_ENABLE
+and PEBS_MATRIX_VERT may be used.
+
+If p4.pebs_enable is zero, then p4.pebs_matrix_vert must also be zero.
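
The per-counter CCCR rules listed above (under "In each evntsel/CCCR
value") can be sketched as a check function. This only restates the
documented rules; it is not the driver's code. The macro names,
p4_cccr_ok(), and its parameters are hypothetical: "pmc" is assumed
to be the hardware counter number 0..17 (without the fast rdpmc bit),
"model" the P4 model number, and "is_ht" non-zero on HT processors.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the CCCR bits mentioned in the text. */
#define CCCR_RESERVED		0xB80007FFu	/* OVF, OVF_PMI_T1, hw-reserved */
#define CCCR_EXT_CASCADE	(1u << 11)
#define CCCR_ENABLE		(1u << 12)
#define CCCR_ACTIVE_THREAD	(3u << 16)	/* bits 16 and 17 */
#define CCCR_FORCE_OVF		(1u << 25)
#define CCCR_OVF_PMI_T0		(1u << 26)
#define CCCR_CASCADE		(1u << 30)

/* Return 1 if the CCCR value satisfies the documented rules, else 0. */
static int p4_cccr_ok(uint32_t cccr, unsigned int pmc, int is_imode,
		      int32_t ireset, unsigned int model, int is_ht)
{
	assert(pmc < 18);
	if (cccr & CCCR_RESERVED)
		return 0;
	if (cccr & CCCR_EXT_CASCADE) {
		if (model < 2)
			return 0;
		if (pmc != 12 && pmc < 15)	/* only counters 12 and 15-17 */
			return 0;
	}
	if (!is_ht && (cccr & CCCR_ACTIVE_THREAD) != CCCR_ACTIVE_THREAD)
		return 0;
	if (!(cccr & (CCCR_ENABLE | CCCR_CASCADE | CCCR_EXT_CASCADE)))
		return 0;
	if (is_imode) {
		if (!(cccr & CCCR_OVF_PMI_T0))
			return 0;
		if ((cccr & CCCR_FORCE_OVF) && ireset != -1)
			return 0;
	} else if (cccr & CCCR_OVF_PMI_T0) {
		return 0;
	}
	return 1;
}
```

The ESCR and PEBS rules are not covered by this sketch.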
+
+If p4.pebs_enable is non-zero:
+- only bits 24, 10, 9, 2, 1, and 0 may be set; note that in contrast
+  to Intel's documentation, bit 25 (ENABLE_PEBS_MY_THR) is not needed
+  and must not be set
+- bit 24 (UOP_TAG) must be set
+- at least one of bits 10, 9, 2, 1, or 0 must be set
+- in p4.pebs_matrix_vert, all bits except 1 and 0 must be clear,
+  and at least one of bits 1 and 0 must be set
+
+Implementation Notes
+====================
+
+Caching
+-------
+Each 'struct perfctr_cpu_state' contains two cache-related fields:
+- 'id': a unique identifier for the control data contents
+- 'isuspend_cpu': the identity of the CPU on which a state containing
+  interrupt-mode counters was last suspended
+
+To this the driver adds a per-CPU cache, recording:
+- the 'id' of the control data currently in that CPU
+- the current contents of each control register
+
+When perfctr_cpu_update_control() has validated the new control data,
+it also updates the id field.
+
+The driver's internal 'write_control' function, called from the
+perfctr_cpu_resume() API function, first checks if the state's id
+matches that of the CPU's cache, and if so, returns. Otherwise
+it checks each control register in the state and updates those
+that do not match the cache. Finally, it writes the state's id
+to the cache. Tests on various x86 processor types have shown that
+MSR writes are very expensive: the purpose of these cache checks
+is to avoid MSR writes whenever possible.
+
+Unlike accumulation-mode counters, interrupt-mode counters must be
+physically stopped when suspended, primarily to avoid overflow
+interrupts in contexts not expecting them, and secondarily to avoid
+increments to the counters themselves (see below).
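
The id-based cache check performed by 'write_control', as described
earlier in this section, can be sketched as follows. This is a
simplified model, not the driver's implementation: the struct and
function names are hypothetical, the number of control registers is
fixed at an arbitrary 4, and MSR writes are replaced by a stub that
only counts how often it is called.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NREGS 4	/* arbitrary, for illustration only */

/* Hypothetical per-CPU cache: the id of the control data currently
   loaded, and the current contents of each control register. */
struct sketch_cpu_cache {
	unsigned int id;
	uint32_t reg[SKETCH_NREGS];
};

/* Hypothetical state: its id and the control values it wants. */
struct sketch_state {
	unsigned int id;
	uint32_t reg[SKETCH_NREGS];
};

static unsigned int sketch_nr_writes;	/* counts simulated MSR writes */

/* Stand-in for an expensive MSR write. */
static void sketch_wrmsr(unsigned int idx, uint32_t val)
{
	(void)idx;
	(void)val;
	++sketch_nr_writes;
}

static void sketch_write_control(struct sketch_cpu_cache *cache,
				 const struct sketch_state *st)
{
	unsigned int i;

	assert(cache && st);
	if (cache->id == st->id)	/* same control data: nothing to do */
		return;
	for (i = 0; i < SKETCH_NREGS; ++i) {
		if (cache->reg[i] != st->reg[i]) {	/* write only on mismatch */
			cache->reg[i] = st->reg[i];
			sketch_wrmsr(i, st->reg[i]);
		}
	}
	cache->id = st->id;
}
```

Resuming the same state twice in a row on one CPU causes no writes the
second time, and switching to a state that differs in one register
causes a single write.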
+
+When suspending interrupt-mode counters, the driver:
+- records the CPU identity in the per-CPU cache
+- stops each interrupt-mode counter by disabling its control register
+- lets the cache and state id values remain the same
+
+Later, when resuming interrupt-mode counters, the driver:
+- if the state and cache id values match:
+  * the cache id is cleared, to force a reload of the control
+    registers stopped at suspend (see below)
+  * if the state's "suspend" CPU identity matches the current CPU,
+    the counter registers are still valid, and the procedure returns
+- if the procedure did not return above, it then loops over each
+  interrupt-mode counter:
+  * the counter's control register is physically disabled, unless
+    the cache indicates that it already is disabled; this is necessary
+    to prevent premature events and overflow interrupts if the CPU's
+    registers previously belonged to some other state
+  * then the counter register itself is restored
+After this interrupt-mode specific resume code is complete, the
+driver continues by calling 'write_control' as described above.
+The state and cache ids will not match, forcing write_control to
+reload the disabled interrupt-mode control registers.
+
+Call-site Backpatching
+----------------------
+The x86 family of processors is quite diverse in how their
+performance counters work and are accessed. There are three
+main designs (P5, P6, and P4) with several variations.
+To handle this the processor type detection and initialisation
+code sets up a number of function pointers to point to the
+correct procedures for the actual CPU type.
+
+Calls via function pointers are more expensive than direct calls,
+so the driver actually performs direct calls to wrappers that
+backpatch the original call sites to instead call the actual
+CPU-specific functions in the future.
+
+Unsynchronised code backpatching in SMP systems doesn't work
+on Intel P6 processors due to an erratum, so the driver performs
+a "finalise backpatching" step after the CPU-specific function
+pointers have been set up. This step invokes the API procedures
+on a temporary state object, set up to force every backpatchable
+call site to be invoked and adjusted.
+
+Several low-level API procedures are called in the context-switch
+path by the per-process perfctrs kernel extension, which motivates
+the efforts to reduce runtime overheads as much as possible.
+
+Overflow Interrupts
+-------------------
+The x86 hardware enables overflow interrupts via the local
+APIC's LVTPC entry, which is only present in P6/K7/K8/P4.
+
+The low-level driver supports overflow interrupts as follows:
+- It reserves a local APIC vector, 0xee, as LOCAL_PERFCTR_VECTOR.
+- It adds a local APIC exception handler to entry.S, which
+  invokes the driver's smp_perfctr_interrupt() procedure.
+- It adds code to i8259.c to bind the LOCAL_PERFCTR_VECTOR
+  interrupt gate to the exception handler in entry.S.
+- During processor type detection, it records whether the
+  processor supports the local APIC, and sets up function pointers
+  for the suspend and resume operations on interrupt-mode counters.
+- When the low-level driver is activated, it enables overflow
+  interrupts by writing LOCAL_PERFCTR_VECTOR to each CPU's APIC_LVTPC.
+- Overflow interrupts now end up in smp_perfctr_interrupt(), which
+  ACKs the interrupt and invokes the interrupt handler installed
+  by the high-level service/driver.
+- When the low-level driver is deactivated, it disables overflow
+  interrupts by masking APIC_LVTPC in each CPU. It then releases
+  the local APIC back to the NMI watchdog.
+
+At compile-time, the low-level driver indicates overflow interrupt
+support by enabling CONFIG_PERFCTR_INTERRUPT_SUPPORT. If the feature
+is also available at runtime, it sets the PERFCTR_FEATURE_PCINT flag
+in the perfctr_info object.
diff -puN /dev/null Documentation/perfctr/overview.txt
--- /dev/null	Thu Apr 11 07:25:15 2002
+++ 25-akpm/Documentation/perfctr/overview.txt	Mon Aug 16 15:57:33 2004
@@ -0,0 +1,129 @@
+$Id: perfctr-documentation-update.patch,v 1.1 2004/07/12 05:41:57 akpm Exp $
+
+AN OVERVIEW OF PERFCTR
+======================
+The perfctr package adds support to the Linux kernel for using
+the performance-monitoring counters found in many processors.
+
+Perfctr is internally organised in three layers:
+
+- The low-level drivers, one for each supported architecture.
+  Currently there are two, one for 32 and 64-bit x86 processors,
+  and one for 32-bit PowerPC processors.
+
+  low-level-api.txt documents the model of the performance counters
+  used in this package, and the internal API to the low-level drivers.
+
+  low-level-{x86,ppc32}.txt provide documentation specific for those
+  architectures and their low-level drivers.
+
+- The high-level services.
+  There is currently one, a kernel extension adding support for
+  virtualised per-process performance counters.
+  See virtual.txt for documentation on this kernel extension.
+
+  [There used to be a second high-level service, a simple driver
+  to control and access all performance counters in all processors.
+  This driver is currently removed, pending an acceptable new API.]
+
+- The top-level, which performs initialisation and implements
+  common procedures and system calls.
+
+Rationale
+---------
+The perfctr package solves three problems:
+
+- Hardware invariably restricts programming of the performance
+  counter registers to kernel-level code, and sometimes also
+  restricts reading the counters to kernel-level code.
+
+  Perfctr adds APIs allowing user-space code to access the counters.
+  In the case of the per-process counters kernel extension,
+  even non-privileged processes are allowed access.
+
+- Hardware often limits the precision of the hardware counters,
+  making them unsuitable for storing total event counts.
+
+  The counts are instead maintained as 64-bit values in software,
+  with the hardware counters used to derive increments over given
+  time periods.
+
+- In a non-modified kernel, the thread state does not include the
+  performance monitoring counters, and the context switch code
+  does not save and restore them. In this situation the counters
+  are system-wide, making them unreliable and inaccurate when used
+  for monitoring specific processes or specific segments of code.
+
+  The per-process counters kernel extension treats the counter state as
+  part of the thread state, solving the reliability and accuracy problems.
+
+Non-goals
+---------
+Providing high-level interfaces that abstract and hide the
+underlying hardware is a non-goal. Such abstractions can
+and should be implemented in user-space, for several reasons:
+
+- The complexity and variability of the hardware means that
+  any abstraction would be inaccurate. There would be both
+  loss of functionality, and presence of functionality which
+  isn't supportable on any given processor. User-space tools
+  and libraries can implement this, on top of the processor-
+  specific interfaces provided by the kernel.
+
+- The implementation of such an abstraction would be large
+  and complex. (Consider ESCR register assignment on P4.)
+  Performing complex actions in user-space simplifies the
+  kernel, allowing it to concentrate on validating control
+  data, managing processes, and driving the hardware.
+  (Cf. the role of compilers.)
+
+- The abstraction is purely a user-convenience thing. The
+  kernel-level components have no need for it.
+
+Common System Calls
+===================
+This lists those system calls that are not tied to
+a specific high-level service/driver.
+ +Querying CPU and Driver Information +----------------------------------- +int err = sys_perfctr_info(struct perfctr_info *info, + struct perfctr_cpu_mask *cpus, + struct perfctr_cpu_mask *forbidden); + +This operation retrieves information from the kernel about +the processors in the system. + +If non-NULL, '*info' will be updated with information about the +capabilities of the processor and the low-level driver. + +If non-NULL, '*cpus' will be updated with a bitmask listing the +set of processors in the system. The size of this bitmask is not +statically known, so the protocol is: + +1. User-space initialises cpus->nrwords to the number of elements + allocated for cpus->mask[]. +2. The kernel reads cpus->nrwords, and then writes the required + number of words to cpus->nrwords. +3. If the required number of words is greater than the original value + of cpus->nrwords, then an EOVERFLOW error is signalled. +4. Otherwise, the kernel converts its internal cpumask_t value + to the external format and writes that to cpus->mask[]. + +If non-NULL, '*forbidden' will be updated with a bitmask listing +the set of processors in the system on which users must not try +to use performance counters. This is currently only relevant for +hyper-threaded Pentium 4/Xeon systems. The protocol is the same +as for '*cpus'. + +Notes: +- The internal representation of a cpumask_t is as an array of + unsigned long. This representation is unsuitable for user-space, + because it is not binary-compatible between 32 and 64-bit + variants of a big-endian processor. The 'struct perfctr_cpu_mask' + type uses an array of unsigned 32-bit integers. +- The protocol for retrieving a 'struct perfctr_cpu_mask' was + designed to allow user-space to quickly determine the correct + size of the 'mask[]' array. Other system calls use weaker protocols, + which force user-space to guess increasingly larger values in a + loop, until finally an acceptable value is guessed.
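The size negotiation for 'struct perfctr_cpu_mask' described above can be sketched in user-space C. This is only a model under stated assumptions: 'struct cpu_mask_model', model_fill_cpu_mask() and query_cpu_mask() are hypothetical names, the struct layout (a word count followed by 32-bit words) is assumed from the field names used in the text, and the kernel side is simulated rather than reached through sys_perfctr_info(). EOVERFLOW is signalled when the caller's array is too small to hold the mask.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed user-space view of the mask: a word count, then 32-bit words. */
struct cpu_mask_model {
	unsigned int nrwords;
	uint32_t mask[];	/* nrwords elements */
};

/* Hypothetical stand-in for the kernel side of steps 2-4. */
static int model_fill_cpu_mask(struct cpu_mask_model *cpus,
			       const uint32_t *kwords, unsigned int k_nrwords)
{
	unsigned int u_nrwords;

	assert(cpus && kwords);
	u_nrwords = cpus->nrwords;	/* step 2: read the caller's size */
	cpus->nrwords = k_nrwords;	/* step 2: report the required size */
	if (u_nrwords < k_nrwords)	/* step 3: caller's array too small */
		return -EOVERFLOW;
	memcpy(cpus->mask, kwords, k_nrwords * sizeof(uint32_t)); /* step 4 */
	return 0;
}

/*
 * Caller side: probe with an empty mask to learn the required size,
 * then allocate exactly that much and ask again. No guessing loop.
 */
static struct cpu_mask_model *query_cpu_mask(const uint32_t *kwords,
					     unsigned int k_nrwords)
{
	struct cpu_mask_model probe = { .nrwords = 0 };
	struct cpu_mask_model *cpus;
	int err;

	err = model_fill_cpu_mask(&probe, kwords, k_nrwords);
	if (err != 0 && err != -EOVERFLOW)
		return NULL;

	cpus = malloc(sizeof(*cpus) + probe.nrwords * sizeof(uint32_t));
	if (!cpus)
		return NULL;
	cpus->nrwords = probe.nrwords;
	if (model_fill_cpu_mask(cpus, kwords, k_nrwords) != 0) {
		free(cpus);
		return NULL;
	}
	return cpus;
}
```

Because the required word count is written back even when the call fails, the caller needs at most two calls: one to learn the size and one to fetch the mask.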
diff -puN /dev/null Documentation/perfctr/virtual.txt --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/Documentation/perfctr/virtual.txt Mon Aug 16 15:57:33 2004 @@ -0,0 +1,355 @@ +$Id: perfctr-documentation-update.patch,v 1.1 2004/07/12 05:41:57 akpm Exp $ + +VIRTUAL PER-PROCESS PERFORMANCE COUNTERS +======================================== +This document describes the virtualised per-process performance +counters kernel extension. See "General Model" in low-level-api.txt +for the model of the processor's performance counters. + +Contents +======== +- Summary +- Design & Implementation Notes + * State + * Thread Management Hooks + * Synchronisation Rules + * The Pseudo File System +- API For User-Space + * Opening/Creating the State + * Updating the Control + * Unlinking the State + * Reading the State + * Resuming After Handling Overflow Signal + * Reading the Counter Values +- Limitations / TODO List + +Summary +======= +The virtualised per-process performance counters facility +(virtual perfctrs) is a kernel extension which extends the +thread state to record perfctr settings and values, and augments +the context-switch code to save perfctr values at suspends and +restore them at resumes. This "virtualises" the performance +counters in much the same way as the kernel already virtualises +general-purpose and floating-point registers. + +Virtual perfctrs also adds an API allowing non-privileged +user-space processes to set up and access their perfctrs. + +As this facility is primarily intended to support developers +of user-space code, both virtualisation and allowing access +from non-privileged code are essential features. + +Design & Implementation Notes +============================= + +State +----- +The state of a thread's perfctrs is packaged up in an object of +type 'struct vperfctr'. It consists of CPU-dependent state, a +sampling timer, and some auxiliary administrative data. This is +an independent object, with its own lifetime and access rules. 
+ +The state object is attached to the thread via a pointer in its +thread_struct. While attached, the object records the identity +of its owner thread: this is used for user-space API accesses +from threads other than the owner. + +The state is separate from the thread_struct for several reasons: +- It's potentially large, hence it's allocated only when needed. +- It can outlive its owner thread. The state can be opened as + a pseudo file: as long as that file is live, so is the object. +- It can be mapped, via mmap() on the pseudo file's descriptor. + To facilitate this, a full page is allocated and reserved. + +Thread Management Hooks +----------------------- +Virtual perfctrs hooks into several thread management events: + +- exit_thread(): Calls perfctr_exit_thread() to stop the counters + and detach the thread's vperfctr object. + +- copy_thread(): Calls perfctr_copy_thread() to initialise + the child's vperfctr pointer. Currently the settings are + not inherited from parent to child, so the pointer is set + to NULL in the child's thread_struct. + +- switch_to(): + * Calls perfctr_suspend_thread() on the previous thread, to + suspend its counters. + * Calls perfctr_resume_thread() on the next thread, to resume + its counters. Also resets the sampling timer (see below). + +- update_process_times(): Calls perfctr_sample_thread(), which + decrements the sampling timer and samples the counters if the + timer reaches zero. + + Sampling is normally only done at switch_to(), but if too much + time passes before the next switch_to(), a hardware counter may + increment by more than its range (usually 2^32). If this occurs, + the difference from its start value will be incorrect, causing + its updated sum to also be incorrect. The sampling timer is used + to prevent this problem, which has been observed on SMP machines, + and on high clock frequency UP machines.
+ +- set_cpus_allowed(): Calls perfctr_set_cpus_allowed() to detect + attempts to migrate the thread to a "forbidden" CPU, in which + case a flag in the vperfctr object is set. perfctr_resume_thread() + checks this flag, and if set, marks the counters as stopped and + sends a SIGILL to the thread. + + The notion of forbidden CPUs is a workaround for a design flaw + in hyper-threaded Pentium 4s and Xeons. See low-level-x86.txt + for details. + +To reduce overheads, these hooks are implemented as inline functions +that check if the thread is using perfctrs before calling the code +that implements the behaviour. The hooks also reduce to no-ops if +CONFIG_PERFCTR_VIRTUAL is disabled. + +Synchronisation Rules +--------------------- +There are four types of accesses to a thread's perfctr state: + +1. Thread management events (see above) done by the thread itself. + Suspend, resume, and sample are lock-less. + +2. API operations done by the thread itself. + These are lock-less, except when an individual operation + has specific synchronisation needs. For instance, preemption + is often disabled to prevent accesses due to context switches. + +3. API operations done by a different thread ("monitor thread"). + The owner thread must be suspended for the duration of the operation. + This is ensured by requiring that the monitor thread is ptrace()ing + the owner thread, and that the owner thread is in TASK_STOPPED state. + +4. set_cpus_allowed(). + The kernel does not lock the target during set_cpus_allowed(), + so it can execute concurrently with the owner thread or with + some monitor thread. In particular, the state may be deallocated. + + To solve this problem, both perfctr_set_cpus_allowed() and the + operations that can change the owner thread's perfctr pointer + (creat, unlink, exit) perform a task_lock() on the owner thread + before accessing the perfctr pointer. 
+ + When concurrent set_cpus_allowed() isn't a problem (because the + architecture doesn't have a notion of forbidden CPUs), atomicity + of updates to the thread's perfctr pointer is ensured by disabling + preemption. + +The Pseudo File System +---------------------- +The perfctr state is accessed from user-space via a file descriptor. + +The main reason for this is to enable mmap() on the file descriptor, +which gives read-only access to the state. + +The file descriptor is a handle to the perfctr state object. This +allows a very simple implementation of the user-space 'perfex' +program, which runs another program with given perfctr settings +and reports their final values. Without this handle, monitoring +applications like perfex would have to be implemented like debuggers +in order to catch the target thread's exit and retrieve the counter +values before the exit completes and the state disappears. + +The file for a perfctr state object belongs to the vperfctrs pseudo +file system. Files in this file system support only a few operations: +- mmap() +- release() decrements the perfctr object's reference count and + deallocates the object when no references remain +- the listing of a thread's open file descriptors identifies + perfctr state file descriptors as belonging to "vperfctrfs" +The implementation is based on the code for pipefs. + +In previous versions of the perfctr package, the file descriptors +for perfctr state objects also supported the API's ioctl() method. + +API For User-Space +================== + +Opening/Creating the State +-------------------------- +int fd = sys_vperfctr_open(int tid, int creat); + +'tid' must be the id of a thread, or 0 which is interpreted as an +alias for the current thread. + +This operation returns an open file descriptor which is a handle +on the thread's perfctr state object. + +If 'creat' is non-zero and the object did not exist, then it is +created and attached to the thread. 
The newly created state object +is inactive, with all control fields disabled and all counters +having the value zero. If 'creat' is non-zero and the object +already existed, then an EEXIST error is signalled. + +If 'tid' does not denote the current thread, then it must denote a +thread that is stopped and under ptrace control by the current thread. + +Notes: +- The access rule in the non-self case is the same as for the + ptrace() system call. It ensures that no other thread, including + the target thread itself, can access or change the target thread's + perfctr state during the operation. +- An open file descriptor for a perfctr state object counts as a + reference to that object; even if detached from its thread the + object will not be deallocated until the last reference is gone. +- The file descriptor can be passed to mmap(), for low-overhead + counter sampling. See "READING THE COUNTER VALUES" for details. +- The file descriptor can be passed to another thread. Accesses + from threads other than the owner are permitted as long as they + possess the file descriptor and use ptrace() for synchronisation. + +Updating the Control +-------------------- +int err = sys_vperfctr_control(int fd, const struct vperfctr_control *control); + +'fd' must be the return value from a call to sys_vperfctr_open(). +The perfctr object must still be attached to its owner thread. + +This operation stops and samples any currently running counters in +the thread, and then updates the control settings. If the resulting +state has any enabled counters, then the counters are restarted. + +Before restarting, the counter sums are reset to zero. However, +if a counter's bit is set in the control object's 'preserve' +bitmask field, then that counter's sum is not reset. The TSC's +sum is only reset if the TSC is disabled in the new state. + +If any of the programmable counters are enabled, then the thread's +CPU affinity mask is adjusted to exclude the set of forbidden CPUs.
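The reset and affinity rules above can be modelled in a few lines of C. This is a sketch only, not the driver's code: 'struct sums_model', reset_sums_model() and exclude_forbidden_model() are hypothetical names, NR_PMCS_MODEL is an arbitrary size, and the mapping of bit i in 'preserve' to counter i is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define NR_PMCS_MODEL 4	/* arbitrary size for this sketch */

struct sums_model {
	uint64_t tsc;
	uint64_t pmc[NR_PMCS_MODEL];
};

/*
 * Apply the documented reset rule before the counters restart.
 * Assumption: bit i of 'preserve' corresponds to counter i.
 */
static void reset_sums_model(struct sums_model *sums, unsigned int preserve,
			     int new_tsc_on)
{
	unsigned int i;

	assert(sums);
	for (i = 0; i < NR_PMCS_MODEL; i++)
		if (!(preserve & (1u << i)))
			sums->pmc[i] = 0;	/* not preserved: reset */
	if (!new_tsc_on)	/* TSC sum survives while the TSC stays on */
		sums->tsc = 0;
}

/* Remove forbidden CPUs from one 32-bit word of an affinity mask. */
static uint32_t exclude_forbidden_model(uint32_t allowed, uint32_t forbidden)
{
	return allowed & ~forbidden;
}
```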
+ +If the control data activates any interrupt-mode counters, then +a signal (specified by the 'si_signo' control field) will be sent +to the owner thread after an overflow interrupt. The documentation +for sys_vperfctr_iresume() describes this mechanism. + +If 'fd' does not denote the current thread, then it must denote a +thread that is stopped and under ptrace control by the current thread. +The perfctr state object denoted by 'fd' must still be attached +to its owner thread. + +Notes: +- It is strongly recommended to memset() the vperfctr_control object + to all-bits-zero before setting the fields of interest. +- Stopping the counters is done by invoking the control operation + with a control object that activates neither the TSC nor any PMCs. + +Unlinking the State +------------------- +int err = sys_vperfctr_unlink(int fd); + +'fd' must be the return value from a call to sys_vperfctr_open(). + +This operation stops and samples the thread's counters, and then +detaches the perfctr state object from the thread. If the object +already had been detached, then no action is performed. + +If 'fd' does not denote the current thread, then it must denote a +thread that is stopped and under ptrace control by the current thread. + +Reading the State +----------------- +int err = sys_vperfctr_read(int fd, struct perfctr_sum_ctrs *sum, + struct vperfctr_control *control); + +'fd' must be the return value from a call to sys_vperfctr_open(). + +This operation copies data from the perfctr state object to +user-space. If 'sum' is non-NULL, then the counter sums are +written to it. If 'control' is non-NULL, then the control data +is written to it. + +If the perfctr state object is attached to the current thread, +then the counters are sampled and updated first. + +If 'fd' does not denote the current thread, then it must denote a +thread that is stopped and under ptrace control by the current thread. 
+ +Notes: +- An alternate and faster way to retrieve the counter sums is described + below. This system call can be used if the hardware does not permit + user-space reads of the counters. + +Resuming After Handling Overflow Signal +--------------------------------------- +int err = sys_vperfctr_iresume(int fd); + +'fd' must be the return value from a call to sys_vperfctr_open(). +The perfctr object must still be attached to its owner thread. + +When an interrupt-mode counter has overflowed, the counters +are sampled and suspended (TSC remains active). Then a signal, +as specified by the 'si_signo' control field, is sent to the +owner thread: the associated 'struct siginfo' has 'si_code' +equal to 'SI_PMC_OVF', and 'si_pmc_ovf_mask' equal to the set +of overflown counters. + +The counters are suspended to avoid generating new performance +counter events during the execution of the signal handler, but +the previous settings are saved. Calling sys_vperfctr_iresume() +restores the previous settings and resumes the counters. Doing +this is optional. + +If 'fd' does not denote the current thread, then it must denote a +thread that is stopped and under ptrace control by the current thread. + +Reading the Counter Values +-------------------------- +The value of a counter is computed from three components: + + value = sum + (now - start); + +Two of these (sum and start) reside in the kernel's state object, +and the third (now) is the contents of the hardware counter. +To perform this computation in user-space requires access to +the state object. This is achieved by passing the file descriptor +from sys_vperfctr_open() to mmap(): + + volatile const struct vperfctr_state *kstate; + kstate = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, 0); + +Reading the three components is a non-atomic operation. If the +thread is scheduled during the operation, the three values will +not be consistent and the wrong result will be computed. 
+To detect this situation, user-space should check the kernel +state's TSC start value before and after the operation, and +retry the operation in case of a mismatch. + +The algorithm for retrieving the value of counter 'i' is: + + tsc0 = kstate->cpu_state.tsc_start; + for(;;) { + rdpmcl(kstate->cpu_state.pmc[i].map, now); + start = kstate->cpu_state.pmc[i].start; + sum = kstate->cpu_state.pmc[i].sum; + tsc1 = kstate->cpu_state.tsc_start; + if (likely(tsc1 == tsc0)) + break; + tsc0 = tsc1; + } + return sum + (now - start); + +The algorithm for retrieving the value of the TSC is similar, +as is the algorithm for retrieving the values of all counters. + +Notes: +- Since the state's TSC time-stamps are used, the algorithm requires + that user-space enables TSC sampling. +- The algorithm requires that the hardware allows user-space reads + of the counter registers. If this property isn't statically known + for the architecture, user-space should retrieve the kernel's + 'struct perfctr_info' object and check that the PERFCTR_FEATURE_RDPMC + flag is set. + +Limitations / TODO List +======================= +- Perfctr settings are not inherited from parent to child at fork(). + The issue is not fork() but propagating final counts from children + to parents, and allowing user-space to distinguish "self" counts + from "children" counts. + An implementation of this feature is being planned. +- Buffering of overflow samples is not implemented. So far, not a + single user has requested it.
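The retry loop from "Reading the Counter Values" can be exercised without the real hardware or the mmap()ed kernel page. The following is a self-contained sketch under stated assumptions: the '*_model' types and functions are hypothetical, the field widths (32-bit 'start', 64-bit 'sum') are assumed, and the hardware read is replaced by a callback so that a reschedule can be simulated. It is not the layout of 'struct vperfctr_state'.

```c
#include <assert.h>
#include <stdint.h>

#define NR_PMCS_MODEL 4	/* arbitrary size for this sketch */

struct pmc_model {
	uint32_t map;	/* hardware counter number */
	uint32_t start;	/* hardware value at start of current period */
	uint64_t sum;	/* total up to, not including, current period */
};

struct cpu_state_model {
	uint32_t tsc_start;	/* changes whenever the thread is resumed */
	struct pmc_model pmc[NR_PMCS_MODEL];
};

/* Stand-in for rdpmcl(): low 32 bits of hardware counter 'map'. */
typedef uint32_t (*read_hw_fn)(uint32_t map);

static uint64_t read_counter_model(const volatile struct cpu_state_model *st,
				   unsigned int i, read_hw_fn read_hw)
{
	uint32_t tsc0, tsc1, now, start;
	uint64_t sum;

	assert(i < NR_PMCS_MODEL);
	tsc0 = st->tsc_start;
	for (;;) {
		now = read_hw(st->pmc[i].map);
		start = st->pmc[i].start;
		sum = st->pmc[i].sum;
		tsc1 = st->tsc_start;
		if (tsc1 == tsc0)
			break;		/* state was stable during the read */
		tsc0 = tsc1;		/* state changed underneath us: retry */
	}
	/* 32-bit subtraction stays correct across a single wrap-around. */
	return sum + (uint32_t)(now - start);
}

/* Demo state and fake hardware read used to provoke exactly one retry. */
static struct cpu_state_model demo_state;
static unsigned int demo_calls;

static uint32_t demo_read_hw(uint32_t map)
{
	(void)map;
	if (demo_calls++ == 0) {
		/* Simulate a suspend/resume in the middle of the first read. */
		demo_state.tsc_start += 1000;
		demo_state.pmc[0].sum = 100;
		demo_state.pmc[0].start = 0xfffffff0u;
		return 0x5u;	/* stale value, discarded by the retry */
	}
	return 0x10u;
}
```

Calling read_counter_model(&demo_state, 0, demo_read_hw) on a zeroed demo_state takes two passes through the loop: the first is discarded because tsc_start changed, and the second combines the new sum and start with a hardware value that has wrapped past 2^32.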