kprobes

kprobes tutorial

This tutorial was developed for the 2006 Ottawa Linux Symposium. I'm hoping it will be useful as a general resource.

There is documentation in Documentation/kprobes.txt in the kernel source.

Why?

kprobes is both useful and cool. Unfortunately most of the kernel developers I've talked to are confused both about why its useful and how to use it. I'm hoping to answer both of those here. I should add that I have no skin in the game personally, I'm not a kprobes developer. I should probably turn this into a HOWTO...

What is kprobes?

Simply, kprobes allows you to write modules that can add debug information to the kernel. It is an alternative to building custom kernels or custom modules. I think the most useful case is when you are dealing with some remote machine (your grandmother or a tester) who hits some bug that you can't figure out by looking at /var/log/messages. Build a kprobes module and have them insmod it on their kernel.

This tutorial deals with kernel kprobing. There is additional work around user-land kprobing that will not be discussed here.

There are three kinds of kprobes:

jprobes
Call a function on the entry to a routine. All the arguments to the routine are passed.
kretprobes
Call a function on the exit from a routine. The registers are passed
kprobe
The guts of kprobes. Any arbitrary kernel instruction can be probed. A function is called passing the registers.

kprobes prerequisites

kprobes has been in mainline since 2.6.9. There are some kernel configuration flags that need to be set to use kprobes. Current enterprise kernels (such as SLES 10) have these turned on, and so does FC5. Some others (cough, Ubuntu, cough) do not. The flags that are required are:

CONFIG_KPROBES
duh
CONFIG_MODULES and CONFIG_MODULE_UNLOAD
kprobes do not require any code changes to the source kernel (thats kind of the idea.) They are loaded into existing kernels as modules. Obviously you need modules configured. You don't actually require MODULE_UNLOAD, but it makes life easier.
CONFIG_KALLSYMS and CONFIG_KALSYMS_ALL
You can set a kprobe by using an address from System.map, but it is easier to code kallsyms_lookup_name().

To build a module (any module) you need to have access to the kernel headers and a suitable compiler.

Simple Case #1

The simplest case, useful in 99% of cases is the jprobe case, where your function gains control on the entry to some arbitrary routine in the kernel.

In this example we will trace do_execve in the kernel. Start from the makefile and source in Documentation/kprobes.txt.

  1: /* Trace do_execv.  Taken basically from Documentation/kprobes.txt */
  2: #include <linux/kernel.h>
  3: #include <linux/module.h>
  4: #include <linux/sched.h>
  5: #include <linux/kprobes.h>
  6: #include <linux/kallsyms.h>
  7: 
  8: /*
  9:  * Pre-entry point for do_execve.
 10:  */
 11: static int my_do_execve(char * filename,
 12:                         char __user *__user *argv,
 13:                         char __user *__user *envp,
 14:                         struct pt_regs * regs)
 15: {
 16:         printk("do_execve for %s from %s\n", filename, current->comm);
 17:         /* Always end with a call to jprobe_return(). */
 18:         jprobe_return();
 19:         /*NOTREACHED*/
 20:         return 0;
 21: }
 22: 
 23: static struct jprobe my_jprobe = {
 24:         .entry = (kprobe_opcode_t *) my_do_execve
 25: };
 26: 
 27: int init_module(void)
 28: {
 29:         int ret;
 30:         my_jprobe.kp.addr = 
 31:                 (kprobe_opcode_t *) kallsyms_lookup_name("do_execve");
 32:         if (!my_jprobe.kp.addr) {
 33:                 printk("Couldn't find %s to plant jprobe\n", "do_execve");
 34:                 return -1;
 35:         }
 36: 
 37:         if ((ret = register_jprobe(&my_jprobe)) <0) {
 38:                 printk("register_jprobe failed, returned %d\n", ret);
 39:                 return -1;
 40:         }
 41:         printk("Planted jprobe at %p, handler addr %p\n",
 42:                my_jprobe.kp.addr, my_jprobe.entry);
 43:         return 0;
 44: }
 45: 
 46: void cleanup_module(void)
 47: {
 48:         unregister_jprobe(&my_jprobe);
 49:         printk("jprobe unregistered\n");
 50: }
 51: 
 52: MODULE_LICENSE("GPL");

Note line 11 Give YOUR routine a different name than the one you are tracing, otherwise kallsyms_lookup_name will get confused.

Note line 18 where the jprobe handler calls jprobe_return(). I mean REALLY note that (don't just return from a jprobe.)

The Makefile (also straight out of Documentation/kprobes.txt) is

# This is taken straight from Documentation/kprobes.txt

obj-m := trace-exec.o
KDIR := /lib/modules/$(shell uname -r)/build
PWD := $(shell pwd)
default:
	$(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules
clean:
	rm -f *.mod.c *.ko *.o

Easy. insmod the module and away you go! Here's the output on my thinkad

Jul 16 19:20:46 hound kernel: [17213292.188000] do_execve for /bin/sh from wcstatusd
Jul 16 19:20:46 hound kernel: [17213292.204000] do_execve for /sbin/ifconfig from sh
Jul 16 19:20:46 hound kernel: [17213292.216000] do_execve for /bin/sh from wcstatusd
Jul 16 19:20:46 hound kernel: [17213292.232000] do_execve for /sbin/ifconfig from sh
Jul 16 19:20:46 hound kernel: [17213292.248000] do_execve for /bin/sh from wcstatusd
Jul 16 19:20:46 hound kernel: [17213292.264000] do_execve for /sbin/ifconfig from sh
Jul 16 19:20:46 hound kernel: [17213292.296000] do_execve for /bin/sh from wcstatusd
Jul 16 19:20:46 hound kernel: [17213292.376000] do_execve for /sbin/ifconfig from sh
Jul 16 19:20:46 hound kernel: [17213292.392000] do_execve for /bin/sh from wcstatusd
Jul 16 19:20:46 hound kernel: [17213292.456000] do_execve for /sbin/ifconfig from sh
Jul 16 19:20:46 hound kernel: [17213292.472000] do_execve for /bin/sh from wcstatusd

Hmm...what the heck is wcstatusd doing all the time...

powerpc and ia64 problems

If you try the above on powerpc and (I think) ia64 it will fail. And you will bang your head. The reason is that when you take the address of a routine you get the address in a jump table, not the actual code address.

You can fix this by something like the following:

 38:         if ((ret = register_jprobe(&my_jprobe)) <0) {
 39:                 printk("register_jprobe failed, returned %d. "
 40:                        "trying dereferenced address\n", ret);
 41:                 my_jprobe.kp.addr = *((kprobe_opcode_t **)my_jprobe.kp.addr);
 42:                 if ((ret = register_jprobe(&my_jprobe)) <0) {
 43:                         printk("register_jprobe failed again, returned %d\n", 
 44:                                ret);
 45:                         return -1;
 46:                 }
 47:         }

Also beware of the 32-bit compat_* functions. In the above example, do_execve is actually never called on powerpc unless you are running 64-bit processes. compat_do_execve is called instead.

When good traces go bad

Limiting your output

I usually do something like the following to limit the amount of data that comes out (depending on what I am tracing)

...
  8: static atomic_t trace_limit;
...
 15:         if (atomic_inc_return(&trace_limit) > 100)
 16:                 jprobe_return();
...

Handling Faults

It can help to have a fault handler in your kprobes. Unless you plan to always write bug-less kprobes.

...
 10: /* fault_handler: this is called if an exception is generated for any
 11:  * instruction within the pre- or post-handler, or when Kprobes
 12:  * single-steps the probed instruction.
 13:  */
 14: int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
 15: {
 16:         printk("fault_handler: p->addr=0x%p, trap #%d\n",
 17:                 p->addr, trapnr);
 18:         /* Return 0 because we don't handle the fault. */
 19:         return 0;
 20: }
...
 54:         my_jprobe.kp.fault_handler = handler_fault;
...

Simple Case #2

My Thinkpad has a Centrino processor with speedstep CPU frequency controls. CPU frequency is controlled by the speedstep_centrino module. Lets trace all the calls to centrino_target in that module.

First, here is the routine we want to trace from arch/i386/kernel/cpu/cpufreq/speedstep-centrino.c.

...
584: /**
585:  * centrino_setpolicy - set a new CPUFreq policy
586:  * @policy: new policy
587:  * @target_freq: the target frequency
588:  * @relation: how that frequency relates to achieved frequency (CPUFREQ_RELATION_L or CPUFREQ_RELATION_H)
589:  *
590:  * Sets a new CPUFreq policy.
591:  */
592: static int centrino_target (struct cpufreq_policy *policy,
593:                             unsigned int target_freq,
594:                             unsigned int relation)
595: {
596:         unsigned int    newstate = 0;
597:         unsigned int        msr, oldmsr, h, cpu = policy->cpu;
598:         struct cpufreq_freqs        freqs;
599:         cpumask_t                saved_mask;
600:         int                        retval;
601: 
602:         if (centrino_model[cpu] == NULL)
603:                 return -ENODEV;
604: 
...

Some things to note. First, this routine is in a module, not in the kernel proper. Secondly, the routine is marked static. No problem. if KALLSYMS_ALL is configured, we will still get the address for it.

  1: /* Trace do_execv.  Taken basically from Documentation/kprobes.txt */
  2: #include <linux/kernel.h>
  3: #include <linux/module.h>
  4: #include <linux/cpufreq.h>
  5: #include <linux/kprobes.h>
  6: #include <linux/kallsyms.h>
  7: 
  8: static atomic_t trace_limit;
  9: 
 10: /* fault_handler: this is called if an exception is generated for any
 11:  * instruction within the pre- or post-handler, or when Kprobes
 12:  * single-steps the probed instruction.
 13:  */
 14: int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
 15: {
 16:         printk("fault_handler: p->addr=0x%p, trap #%d\n",
 17:                 p->addr, trapnr);
 18:         /* Return 0 because we don't handle the fault. */
 19:         return 0;
 20: }
 21: 
 22: /*
 23:  * Pre-entry point for centrino_target.
 24:  */
 25: static int my_centrino_target (struct cpufreq_policy *policy,
 26:                                unsigned int target_freq,
 27:                                unsigned int relation)
 28: {
 29:         if (atomic_inc_return(&trace_limit) > 100)
 30:                 jprobe_return();
 31: 
 32:         printk("centrino_target %u relation %u\n", target_freq, relation);
 33:         dump_stack();
 34: 
 35:         /* Always end with a call to jprobe_return(). */
 36:         jprobe_return();
 37:         /*NOTREACHED*/
 38:         return 0;
 39: }
 40: 
 41: static struct jprobe my_jprobe = {
 42:         .entry = (kprobe_opcode_t *) my_centrino_target,
 43: };
 44: 
 45: int init_module(void)
 46: {
 47:         int ret;
 48:         my_jprobe.kp.addr = 
 49:                 (kprobe_opcode_t *) kallsyms_lookup_name("centrino_target");
 50:         if (!my_jprobe.kp.addr) {
 51:                 printk("Couldn't find %s to plant jprobe\n", "centrino_target");
 52:                 return -1;
 53:         }
 54: 
 55:         my_jprobe.kp.fault_handler = handler_fault;
 56: 
 57:         if ((ret = register_jprobe(&my_jprobe)) <0) {
 58:                 printk("register_jprobe failed, returned %d\n", ret);
 59:                 return -1;
 60:         }
 61:         printk("Planted jprobe at %p, handler addr %p\n",
 62:                my_jprobe.kp.addr, my_jprobe.entry);
 63:         return 0;
 64: }
 65: 
 66: void cleanup_module(void)
 67: {
 68:         unregister_jprobe(&my_jprobe);
 69:         printk("jprobe unregistered\n");
 70: }
 71: 
 72: MODULE_LICENSE("GPL");

Note the use of dump_stack(). The output in /var/log/messages

Jul 16 19:43:01 hound kernel: [17179786.852000] centrino_target 1500000 relation 0
Jul 16 19:43:01 hound kernel: [17179786.852000]  [pg0+949055590/1069523968] my_centrino_target+0x36/0x50 [trace_speedstep]
Jul 16 19:43:01 hound kernel: [17179786.852000]  [pg0+946753796/1069523968] store_speed+0xb4/0xd0 [cpufreq_userspace]
Jul 16 19:43:01 hound kernel: [17179786.852000]  [neigh_get_next+135/160] store+0x47/0x60
Jul 16 19:43:01 hound kernel: [17179786.852000]  [ldm_partition+4123/4304] sysfs_write_file+0x9b/0xf0
Jul 16 19:43:01 hound kernel: [17179786.852000]  [vfs_read+206/432] vfs_write+0xde/0x1b0
Jul 16 19:43:01 hound kernel: [17179786.852000]  [do_sendfile+699/752] sys_write+0x4b/0x80
Jul 16 19:43:01 hound kernel: [17179786.852000]  [need_resched+12/33] sysenter_past_esp+0x54/0x75
Jul 16 19:43:11 hound kernel: [17179797.328000] centrino_target 1400000 relation 0
Jul 16 19:43:11 hound kernel: [17179797.328000]  [pg0+949055590/1069523968] my_centrino_target+0x36/0x50 [trace_speedstep]
Jul 16 19:43:11 hound kernel: [17179797.328000]  [pg0+946753796/1069523968] store_speed+0xb4/0xd0 [cpufreq_userspace]
Jul 16 19:43:11 hound kernel: [17179797.328000]  [neigh_get_next+135/160] store+0x47/0x60
Jul 16 19:43:11 hound kernel: [17179797.328000]  [ldm_partition+4123/4304] sysfs_write_file+0x9b/0xf0
Jul 16 19:43:11 hound kernel: [17179797.328000]  [vfs_read+206/432] vfs_write+0xde/0x1b0
Jul 16 19:43:11 hound kernel: [17179797.328000]  [do_sendfile+699/752] sys_write+0x4b/0x80
Jul 16 19:43:11 hound kernel: [17179797.328000]  [need_resched+12/33] sysenter_past_esp+0x54/0x75

Inlines

kprobes patches a single address (the address used above in kallsyms_lookup_name().) This doesn't handle inlines. There is no automatic way to handle inlines.

Return values (kretprobes)

The second most useful thing to do is track the return value of a routine. kprobes uses kretprobes to handle that. A classic case is tracking memory allocations (kmalloc.)

kprobes is smart enough to trap on any return from a routine (it actually places a trap at the place the routine was called from rather than on every return instruction

kprobes can miss kretprobes. You can check my_kretprobe.nmissed to see if you missed any retprobes.

  1: /* Trace kmalloc.  Taken basically from Documentation/kprobes.txt */
  2: #include <linux/kernel.h>
  3: #include <linux/module.h>
  4: #include <linux/cpufreq.h>
  5: #include <linux/kprobes.h>
  6: #include <linux/kallsyms.h>
  7: 
  8: static atomic_t trace_limit;
  9: 
 10: /*
 11:  * Pre-entry point for kmalloc_target.
 12:  */
 13: static void *my__kmalloc(size_t size, gfp_t flags)
 14: {
 15:         if (atomic_inc_return(&trace_limit) > 100)
 16:                 jprobe_return();
 17: 
 18:         printk("kmalloc call: %u in %s\n", size, current->comm);
 19: 
 20:         /* Always end with a call to jprobe_return(). */
 21:         jprobe_return();
 22:         /*NOTREACHED*/
 23:         return 0;
 24: }
 25: 
 26: static struct jprobe my_jprobe = {
 27:         .entry = (kprobe_opcode_t *) my__kmalloc,
 28: };
 29: 
 30: /* Return-probe handler: If the probed function fails, log the return value. */
 31: static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
 32: {
 33:         if (atomic_inc_return(&trace_limit) > 100)
 34:                 return 0;
 35:         printk("kmalloc returns %lx\n", regs->eax);
 36:         return 0;
 37: }
 38: 
 39: static struct kretprobe my_kretprobe = {
 40:         .handler = ret_handler,
 41:         /* Probe up to 20 instances concurrently. */
 42:         .maxactive = 20
 43: };
 44: 
 45: int init_module(void)
 46: {
 47:         int ret;
 48:         my_jprobe.kp.addr = my_kretprobe.kp.addr =
 49:                 (kprobe_opcode_t *) kallsyms_lookup_name("__kmalloc");
 50:         if (!my_jprobe.kp.addr) {
 51:                 printk("Couldn't find %s to plant jprobe\n", "__kmalloc");
 52:                 return -1;
 53:         }
 54: 
 55:         if ((ret = register_jprobe(&my_jprobe)) <0) {
 56:                 printk("register_jprobe failed, returned %d\n", ret);
 57:                 return -1;
 58:         }
 59:         if ((ret = register_kretprobe(&my_kretprobe)) < 0) {
 60:                 printk("register_kretprobe failed, returned %d\n", ret);
 61:                 unregister_jprobe(&my_jprobe);
 62:                 return -1;
 63:         }
 64:         printk("Planted jprobe at %p, handler addr %p\n",
 65:                my_jprobe.kp.addr, my_jprobe.entry);
 66:         return 0;
 67: }
 68: 
 69: void cleanup_module(void)
 70: {
 71:         unregister_jprobe(&my_jprobe);
 72:         unregister_kretprobe(&my_kretprobe);
 73:         printk("jprobe unregistered\n");
 74: }
 75: 
 76: MODULE_LICENSE("GPL");

Note that you just return() from kretprobes.

Given SMP, preemption, etc. there isn't an easy way to correlate a jprobe and a kretprobe.

Raw kprobes

jprobes and kretprobes are based on raw kprobes. A raw kprobe just takes an arbitrary address and sets a breakpoint there. Frequently useful in debugging a panic (where you know the faulting address) and inassociation with a disassembled kernel (objdump -D)

kprobes provide "pre" handlers tha run before the specific instruction, and "post" handlers that run afterwards. The following is a generic kprobe that dumps state when an arbitrary address is hit.

  1: /* generic kprobe */
  2: #include <linux/kernel.h>
  3: #include <linux/module.h>
  4: #include <linux/kprobes.h>
  5: #include <linux/kallsyms.h>
  6: #include <linux/sched.h>
  7: 
  8: static unsigned long probe_addr;
  9: 
 10: module_param_named(probe_addr, probe_addr, ulong, S_IRUGO | S_IWUSR);
 11: 
 12: /*For each probe you need to allocate a kprobe structure*/
 13: static struct kprobe kp;
 14: 
 15: static void dump_state(struct pt_regs *regs)
 16: {
 17:         print_symbol(KERN_INFO "EIP is at %s\n", regs->eip);
 18:         printk(KERN_INFO "eax: %08lx   ebx: %08lx   ecx: %08lx   edx: %08lx\n",
 19:                regs->eax, regs->ebx, regs->ecx, regs->edx);
 20:         printk(KERN_INFO "esi: %08lx   edi: %08lx   ebp: %08lx   esp: %08lx\n",
 21:                regs->esi, regs->edi, regs->ebp, regs->esp);
 22:         printk(KERN_INFO "ds: %04x   es: %04x\n",
 23:                 regs->xds & 0xffff, regs->xes & 0xffff);
 24:         printk(KERN_INFO "Process %s (pid: %d, threadinfo=%p task=%p)",
 25:                 current->comm, current->pid, current_thread_info(), current);
 26: 
 27: }
 28: 
 29: /*kprobe pre_handler: called just before the probed instruction is executed*/
 30: int handler_pre(struct kprobe *p, struct pt_regs *regs)
 31: {
 32:         
 33:         printk("pre_handler: p->addr=0x%p\n",
 34:                 p->addr);
 35:         dump_state(regs);
 36:         dump_stack();
 37:         return 0;
 38: }
 39: 
 40: /*kprobe post_handler: called after the probed instruction is executed*/
 41: void handler_post(struct kprobe *p, struct pt_regs *regs, unsigned long flags)
 42: {
 43:         printk("post_handler: p->addr=0x%p\n",
 44:                 p->addr);
 45:         dump_state(regs);
 46:         return;
 47: }
 48: 
 49: /* fault_handler: this is called if an exception is generated for any
 50:  * instruction within the pre- or post-handler, or when Kprobes
 51:  * single-steps the probed instruction.
 52:  */
 53: int handler_fault(struct kprobe *p, struct pt_regs *regs, int trapnr)
 54: {
 55:         printk("fault_handler: p->addr=0x%p, trap #%dn",
 56:                 p->addr, trapnr);
 57:         /* Return 0 because we don't handle the fault. */
 58:         return 0;
 59: }
 60: 
 61: int init_module(void)
 62: {
 63:         int ret;
 64: 
 65:         if (!probe_addr) {
 66:                 printk("trace-generic: provide probe_addr paramter\n");
 67:                 return -1;
 68:         }
 69: 
 70:         kp.pre_handler = handler_pre;
 71:         kp.post_handler = handler_post;
 72:         kp.fault_handler = handler_fault;
 73:         kp.addr = (kprobe_opcode_t*) probe_addr;
 74: 
 75:         printk("setting probe at address %p\n",kp.addr);
 76:         
 77:         /* register the kprobe now */
 78:         if (!kp.addr) {
 79:                 printk("Couldn't find %s to plant kprobe\n", "do_fork");
 80:                 return -1;
 81:         }
 82:         if ((ret = register_kprobe(&kp) < 0)) {
 83:                 printk("register_kprobe failed, returned %d\n", ret);
 84:                 return -1;
 85:         }
 86:         printk("kprobe registered\n");
 87:         return 0;
 88: }
 89: 
 90: void cleanup_module(void)
 91: {
 92:         unregister_kprobe(&kp);
 93:         printk("kprobe unregistered\n");
 94: }
 95: 
 96: MODULE_LICENSE("GPL");

initrd

There's no reason you can't add a kprobes module to your initrds to trace things happening reasonably early during the boot process. Similarly, you can actually add a kprobe to a distro installation if the installer allows you to load extra modules before doing the install.

relayfs

relayfs (http://relayfs.sourceforge.net) provides examples of tying kprobes and relayfs together. Rather than dumping all your trace data into dmesg.