Stupid RCU Tricks: Bug in rcu_init_levelspread()

(If you just want the answer without the story of how I found and fixed the bug, scroll down to the bottom of this page.)

To recap, I recently refactored my rcutorture tests, getting that item off of my todo list after “only” three or four years. To my surprise, one (but only one) of the new rcutorture test scenarios failed, namely the one creatively named TREE08:

CONFIG_SMP=y
CONFIG_NR_CPUS=16
CONFIG_PREEMPT_NONE=n
CONFIG_PREEMPT_VOLUNTARY=n
CONFIG_PREEMPT=y
CONFIG_HZ_PERIODIC=n
CONFIG_NO_HZ_IDLE=y
CONFIG_NO_HZ_FULL=n
CONFIG_RCU_FAST_NO_HZ=n
CONFIG_RCU_TRACE=n
CONFIG_HOTPLUG_CPU=n
CONFIG_SUSPEND=n
CONFIG_HIBERNATION=n
CONFIG_RCU_FANOUT=3
CONFIG_RCU_FANOUT_EXACT=y
CONFIG_RCU_FANOUT_LEAF=2
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_ALL=y
CONFIG_DEBUG_LOCK_ALLOC=n
CONFIG_PROVE_RCU_DELAY=n
CONFIG_RCU_CPU_STALL_INFO=n
CONFIG_RCU_CPU_STALL_VERBOSE=n
CONFIG_RCU_BOOST=n
CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
CONFIG_PRINTK_TIME=y

One reason for my surprise was that I had been running similar rcutorture scenarios for some years. In sharp (and happy) contrast to previous efforts featuring but a handful of failures in a given ten-hour run, the effect was not subtle. In fact, TREE08 often failed more than 20 times per minute.

My first thought was that I had blown a recent commit, but testing earlier kernel versions gave similar failure rates.

Additional testing and analysis showed that the bug was quite sensitive to the exact values of the Kconfig parameters. For example, setting CONFIG_RCU_FANOUT=2 or CONFIG_RCU_FANOUT=4 resulted in rock-solid behavior, despite CONFIG_RCU_FANOUT=3 failing miserably. Similarly, setting CONFIG_RCU_FANOUT_EXACT=n (which is thankfully the default) also resulted in rock-solid behavior. This fortunately provides not only a workaround for the bug, but, even better, a workaround that is enabled by default. This enabled-by-default workaround did much to explain the total lack of bug reports. This sensitivity to a few Kconfig parameters implicated the boot-time code that sets up the rcu_node structures. I therefore ran this boot-time setup code in usermode scaffolding, resulting in the following output:

NR_CPUS = 16, CONFIG_RCU_FANOUT = 3, CONFIG_RCU_FANOUT_LEAF = 2, MAX_RCU_LVLS = 4
rcu_fanout_leaf = 2, nr_cpu_ids = 16
NR_CPUS: 16, RCU_NUM_LVLS: 3, rcu_num_lvls: 3, rcu_num_nodes: 12
num_rcu_lvl[0]: 1
num_rcu_lvl[1]: 3
num_rcu_lvl[2]: 8
num_rcu_lvl[3]: 16
num_rcu_lvl[4]: 0
rsp->levelspread[0]: 2
rsp->levelspread[1]: 3
rsp->levelspread[2]: 3
CPU   0: rdp=0x804c420 rnp=0x804b180 (0-2) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU   1: rdp=0x804c4cc rnp=0x804b180 (0-2) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU   2: rdp=0x804c578 rnp=0x804b180 (0-2) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU   3: rdp=0x804c624 rnp=0x804b1c8 (3-5) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU   4: rdp=0x804c6d0 rnp=0x804b1c8 (3-5) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU   5: rdp=0x804c77c rnp=0x804b1c8 (3-5) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU   6: rdp=0x804c828 rnp=0x804b210 (6-8) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU   7: rdp=0x804c8d4 rnp=0x804b210 (6-8) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU   8: rdp=0x804c980 rnp=0x804b210 (6-8) rnp=0x804b0a8 (0-8) rnp=0x804b060 (0-15)
CPU   9: rdp=0x804ca2c rnp=0x804b258 (9-11) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU  10: rdp=0x804cad8 rnp=0x804b258 (9-11) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU  11: rdp=0x804cb84 rnp=0x804b258 (9-11) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU  12: rdp=0x804cc30 rnp=0x804b2a0 (12-14) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU  13: rdp=0x804ccdc rnp=0x804b2a0 (12-14) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU  14: rdp=0x804cd88 rnp=0x804b2a0 (12-14) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)
CPU  15: rdp=0x804ce34 rnp=0x804b2e8 (15-15) rnp=0x804b0f0 (9-15) rnp=0x804b060 (0-15)

The rdp= numbers show each CPU's rcu_data structure addresses, and the sequence of rnp= numbers show the rcu_node structures when traversing upwards from the corresponding CPU to the root of the rcu_node tree. The numbers in parentheses show which CPUs are handled by each rcu_node structure. The initial lines simply show the values of a few key boot-time macros, variables, and arrays.

Had I been paying attention, I would have noticed that this output shows two very odd things, with one of them called out twice. However, some of my assumptions blinded me to these warning signs.

I therefore enabled RCU event tracing, with the usual false start in which I failed to build the kernel with CONFIG_RCU_TRACE=y. This resulted in a 20MB trace, roughly 15MB of which was required to show the entire grace-period dance, starting from the callback being registered and ending with rcutorture detecting the RCU bug. The following entries caught my eye:

rcu_pree-7       0d..2 5852674733: rcu_grace_period_init: rcu_preempt 18446744073709551553 0 0 15 3
rcu_pree-7       0d..2 5852674946: rcu_grace_period_init: rcu_preempt 18446744073709551553 1 0 8 7
rcu_pree-7       0d..2 5852675138: rcu_grace_period_init: rcu_preempt 18446744073709551553 1 9 15 7
rcu_pree-7       0d..2 5852675313: rcu_grace_period_init: rcu_preempt 18446744073709551553 1 18 15 0
rcu_pree-7       0d..2 5852675631: rcu_grace_period: rcu_preempt 18446744073709551553 cpustart
rcu_pree-7       0d..2 5852675802: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 0 2 7
rcu_pree-7       0d..2 5852675974: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 3 5 7
rcu_pree-7       0d..2 5852676148: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 6 8 7
rcu_pree-7       0d..2 5852676321: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 9 11 7
rcu_pree-7       0d..2 5852676519: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 12 14 7
rcu_pree-7       0d..2 5852676691: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 15 15 1
rcu_pree-7       0d..2 5852676862: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 18 15 0
rcu_pree-7       0d..2 5852677034: rcu_grace_period_init: rcu_preempt 18446744073709551553 2 21 15 0

The smoking gun is the fourth line, the one containing the “18 15”. This indicates that the rcu_node structure corresponding to this trace event covers all CPUs whose numbers (as returned by smp_processor_id()) are greater than or equal to 18 and also less than or equal to 15. There are of course no such CPUs. The last two lines show similar brokenness.

These trace events narrowed the bug down to the following code fragment in kernel/rcu/tree.c:

 1 #ifdef CONFIG_RCU_FANOUT_EXACT
 2 static void __init rcu_init_levelspread(struct rcu_state *rsp)
 3 {
 4   int i;
 5 
 6   for (i = rcu_num_lvls - 1; i > 0; i--)
 7     rsp->levelspread[i] = CONFIG_RCU_FANOUT;
 8   rsp->levelspread[0] = rcu_fanout_leaf;
 9 }
10 #else /* #ifdef CONFIG_RCU_FANOUT_EXACT */
11 static void __init rcu_init_levelspread(struct rcu_state *rsp)
12 {
13   int ccur;
14   int cprv;
15   int i;
16 
17   cprv = nr_cpu_ids;
18   for (i = rcu_num_lvls - 1; i >= 0; i--) {
19     ccur = rsp->levelcnt[i];
20     rsp->levelspread[i] = (cprv + ccur - 1) / ccur;
21     cprv = ccur;
22   }
23 }
24 #endif /* #else #ifdef CONFIG_RCU_FANOUT_EXACT */

The problem is in lines 6-8, which incorrectly place rcu_fanout_leaf in the zero-th element of the array, the element corresponding to the root of the tree, and CONFIG_RCU_FANOUT everywhere else. Instead, the last element of the array, which corresponds to the leaf level, should be set from rcu_fanout_leaf, with the rest of the array set from CONFIG_RCU_FANOUT.

The required patch is as follows:

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e0a58eca0092..13d1a1a0d60a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3246,9 +3246,9 @@ static void __init rcu_init_levelspread(struct rcu_state *rsp)
 {
   int i;
 
-  for (i = rcu_num_lvls - 1; i > 0; i--)
+  rsp->levelspread[rcu_num_lvls - 1] = rcu_fanout_leaf;
+  for (i = rcu_num_lvls - 2; i >= 0; i--)
     rsp->levelspread[i] = CONFIG_RCU_FANOUT;
-  rsp->levelspread[0] = rcu_fanout_leaf;
 }
 #else /* #ifdef CONFIG_RCU_FANOUT_EXACT */
 static void __init rcu_init_levelspread(struct rcu_state *rsp)
One could argue that this bug is so obscure that it can be safely ignored. However, my experience is that amazing depths of obscurity can vanish quite suddenly, with no advance warning.

So one moral of this story is that it is really important to tend your test cases with just as much care as you tend your test code, which in turn needs at least as much attention as does the code making up the actual implementation.

Another moral of this story is that, Star Wars and the benefits of a well-trained intuition notwithstanding, you should not trust your own feelings all that much. In fact, I had been considering dropping the more-complex CONFIG_RCU_FANOUT_EXACT=n case, making the simpler CONFIG_RCU_FANOUT_EXACT=y case not just the default, but the only possible choice. Needless to say, before this bug fix, that would have been a really bad decision!

In short, you should verify your code early, often, and as intensively as you possibly can. This is not simply a case of “trust but verify,” but rather a case of “verify first and trust later, if at all.”

Perhaps this should be termed the Indiana Jones philosophy of validation.