Theory of operation

Author:

Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Preface

PREEMPT_RT transforms the Linux kernel into a real-time kernel. It achieves this by replacing locking primitives, such as spinlock_t, with a preemptible and priority-inheritance aware implementation known as rtmutex, and by enforcing the use of threaded interrupts. As a result, the kernel becomes fully preemptible, with the exception of a few critical code paths, including entry code, the scheduler, and low-level interrupt handling routines.

This transformation places the majority of kernel execution contexts under the control of the scheduler and significantly increasing the number of preemption points. Consequently, it reduces the latency between a high-priority task becoming runnable and its actual execution on the CPU.

Scheduling

The core principles of Linux scheduling and the associated user-space API are documented in the man page sched(7) sched(7). By default, the Linux kernel uses the SCHED_OTHER scheduling policy. Under this policy, a task is preempted when the scheduler determines that it has consumed a fair share of CPU time relative to other runnable tasks. However, the policy does not guarantee immediate preemption when a new SCHED_OTHER task becomes runnable. The currently running task may continue executing.

This behavior differs from that of real-time scheduling policies such as SCHED_FIFO. When a task with a real-time policy becomes runnable, the scheduler immediately selects it for execution if it has a higher priority than the currently running task. The task continues to run until it voluntarily yields the CPU, typically by blocking on an event.

Sleeping spin locks

The various lock types and their behavior under real-time configurations are described in detail in Lock types and their rules. In a non-PREEMPT_RT configuration, a spinlock_t is acquired by first disabling preemption and then actively spinning until the lock becomes available. Once the lock is released, preemption is enabled. From a real-time perspective, this approach is undesirable because disabling preemption prevents the scheduler from switching to a higher-priority task, potentially increasing latency.

To address this, PREEMPT_RT replaces spinning locks with sleeping spin locks that do not disable preemption. On PREEMPT_RT, spinlock_t is implemented using rtmutex. Instead of spinning, a task attempting to acquire a contended lock disables CPU migration, donates its priority to the lock owner (priority inheritance), and voluntarily schedules out while waiting for the lock to become available.

Disabling CPU migration provides the same effect as disabling preemption, while still allowing preemption and ensuring that the task continues to run on the same CPU while holding a sleeping lock.

Priority inheritance

Lock types such as spinlock_t and mutex_t in a PREEMPT_RT enabled kernel are implemented on top of rtmutex, which provides support for priority inheritance (PI). When a task blocks on such a lock, the PI mechanism temporarily propagates the blocked task’s scheduling parameters to the lock owner.

For example, if a SCHED_FIFO task A blocks on a lock currently held by a SCHED_OTHER task B, task A’s scheduling policy and priority are temporarily inherited by task B. After this inheritance, task A is put to sleep while waiting for the lock, and task B effectively becomes the highest-priority task in the system. This allows B to continue executing, make progress, and eventually release the lock.

Once B releases the lock, it reverts to its original scheduling parameters, and task A can resume execution.

Threaded interrupts

Interrupt handlers are another source of code that executes with preemption disabled and outside the control of the scheduler. To bring interrupt handling under scheduler control, PREEMPT_RT enforces threaded interrupt handlers.

With forced threading, interrupt handling is split into two stages. The first stage, the primary handler, is executed in IRQ context with interrupts disabled. Its sole responsibility is to wake the associated threaded handler. The second stage, the threaded handler, is the function passed to request_irq() as the interrupt handler. It runs in process context, scheduled by the kernel.

From waking the interrupt thread until threaded handling is completed, the interrupt source is masked in the interrupt controller. This ensures that the device interrupt remains pending but does not retrigger the CPU, allowing the system to exit IRQ context and handle the interrupt in a scheduled thread.

By default, the threaded handler executes with the SCHED_FIFO scheduling policy and a priority of 50 (MAX_RT_PRIO / 2), which is midway between the minimum and maximum real-time priorities.

If the threaded interrupt handler raises any soft interrupts during its execution, those soft interrupt routines are invoked after the threaded handler completes, within the same thread. Preemption remains enabled during the execution of the soft interrupt handler.

Summary

By using sleeping locks and forced-threaded interrupts, PREEMPT_RT significantly reduces sections of code where interrupts or preemption is disabled, allowing the scheduler to preempt the current execution context and switch to a higher-priority task.