author:    Paul E. McKenney <paulmck@kernel.org>  2021-02-26 14:15:27 -0800
committer: Paul E. McKenney <paulmck@kernel.org>  2021-02-26 14:15:27 -0800
commit:    4fb8e23579545f7e4adbbbd989b9c6487bf99ec1 (patch)
tree:      d24bff50f0ff45d43e52ca6e7be530f4b34311be
parent:    5bd724847a60932aae509fb66eecdc4b5eedee05 (diff)
advsync: Updates and wordsmithing, take five
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
-rw-r--r--  advsync/rt.tex   | 246
-rw-r--r--  bib/realtime.bib |  20
2 files changed, 180 insertions, 86 deletions
diff --git a/advsync/rt.tex b/advsync/rt.tex index e4e0d5a4..8a2c42c5 100644 --- a/advsync/rt.tex +++ b/advsync/rt.tex @@ -220,9 +220,8 @@ that portion of the outside world that is to be monitored or controlled. But what about battery-powered systems? They don't require energy flowing into the system as a whole. }\QuickQuizAnswer{ - Sooner or later, either the battery must be recharged, which - requires energy to flow into the system, or the system will - stop operating. + Sooner or later, the battery must be recharged, which requires + energy to flow into the system. }\QuickQuizEnd A number of systems are intended to operate in environments with impressive @@ -293,9 +292,9 @@ real-time systems. Those queueing-theory results assume infinite ``calling populations'', which in the Linux kernel might correspond to an infinite number of tasks. - As of mid-2016, no real system supports an infinite number of tasks, - so results assuming infinite calling populations should be - expected to have less-than-infinite applicability. + As of early 2021, no real system supports an infinite number of + tasks, so results assuming infinite calling populations should + be expected to have less-than-infinite applicability. Other queueing-theory results have \emph{finite} calling populations, which feature sharply bounded response @@ -429,9 +428,9 @@ large advances are required. \QuickQuiz{ Formal verification is already quite capable, benefiting from decades of intensive study. - Are additional advances \emph{really} required, or is this - just a practitioner's excuse to continue to be lazy and ignore - the awesome power of formal verification? + Are additional advances \emph{really} required, or is this just + a practitioner's excuse to continue to lazily ignore the awesome + power of formal verification? }\QuickQuizAnswer{ Perhaps this situation is just a theoretician's excuse to avoid diving into the messy world of real software? @@ -457,6 +456,11 @@ large advances are required. 
All that said, there is hope, given recent work formalizing the memory models of real computer systems~\cite{JadeAlglave2011ppcmem,Alglave:2013:SVW:2450268.2450306}. + On the other hand, formal verification has just as much trouble + as does testing with the astronomical number of variants of the + Linux kernel that can be constructed from different combinations + of its tens of thousands of Kconfig options. + Sometimes life is hard! }\QuickQuizEnd In addition to latency requirements for the real-time portions of the @@ -504,9 +508,9 @@ to shift, but such progress is by no means a bad thing. easily developed using standard non-real-time approaches, or whether the more difficult and expensive real-time approaches are required. - In other words, theory is quite important, however, for those - of us who like to get things done, theory supports practice, - never the other way around. + In other words, although theory is quite important, for those of + us called upon to complete practical projects, theory supports + practice, never the other way around. }\QuickQuizEnd Real-time computing is used in industrial-control applications, ranging from @@ -971,9 +975,17 @@ time-critical, so that a timer wheel's millisecond-level granularity is good and sufficient. Another key observation is that error-handling timeouts are normally canceled very early, often before they can be cascaded. -A final observation is that systems commonly have many more error-handling -timeouts than they do timer events, so that an $\O{\log n}$ -data structure should provide acceptable performance for timer events. +In addition, systems commonly have many more error-handling timeouts +than they do timer events, so that an $\O{\log n}$ data structure should +provide acceptable performance for timer events. + +However, it is possible to do better, namely by simply refusing to +cascade timers. 
+Instead of cascading, the timers that would otherwise have been cascaded +all the way down the calendar queue are handled in place. +This does result in up to a few percent error for the time duration, +but the few situations where this is a problem can instead use tree-based +high-resolution timers (hrtimers). In short, the Linux kernel's \rt\ patchset uses timer wheels for error-handling timeouts and a tree for timer events, providing each @@ -1054,7 +1066,7 @@ while holding the lock, which would prevent the medium-priority threads from preempting it. In the priority-inheritance solution, the high-priority thread attempting -to acquire the lock donate its priority to the low-priority thread holding +to acquire the lock donates its priority to the low-priority thread holding the lock until such time as the lock is released, thus preventing long-term priority inversion. @@ -1132,7 +1144,7 @@ priority-inversion conundrum: However, this approach clearly and severely limits read-side scalability. - The Linux kernel's \rt\ patchset has been able to live with this + The Linux kernel's \rt\ patchset was long able to live with this limitation for several reasons: (1)~Real-time systems have traditionally been relatively small, (2)~Real-time systems have generally focused on process control, thus being unaffected @@ -1140,12 +1152,45 @@ priority-inversion conundrum: (3)~Many of the Linux kernel's reader-writer locks have been converted to RCU. - All that aside, it is quite possible that the Linux kernel - will some day permit limited read-side parallelism for - reader-writer locks subject to priority boosting. + However, the day came when it was absolutely necessary to + permit concurrent readers, as described in the text following + this quiz. 
 }\QuickQuizEnd
 
-In some cases, reader-writer lock priority inversion can be avoided by
+The no-concurrent-readers restriction eventually became intolerable, so
+the \rt\ developers looked more carefully at how the Linux kernel uses
+reader-writer spinlocks.
+They learned that time-critical code rarely uses those parts of the
+kernel that write-acquire reader-writer locks, so that the prospect
+of writer starvation was not a show-stopper.
+They therefore constructed a real-time reader-writer lock in which
+write-side acquisitions use priority inheritance among each other,
+but where read-side acquisitions take absolute priority over
+write-side acquisitions.
+This approach appears to be working well in practice, and is another
+lesson in the importance of clearly understanding what your users
+really need.
+
+One interesting detail of this implementation is that both the
+\co{rt_read_lock()} and the \co{rt_write_lock()} functions enter an RCU
+read-side critical section and both the \co{rt_read_unlock()} and the
+\co{rt_write_unlock()} functions exit that critical section.
+This is necessary because the vanilla kernel's reader-writer locking
+functions disable preemption across their critical sections, and
+there really are reader-writer locking use cases that rely on the fact
+that \co{synchronize_rcu()} will therefore wait for all pre-existing
+reader-writer-lock critical sections to complete.
+Let this be a lesson to you:
+Understanding what your users really need is critically important to
+correct operation, not just to performance.
+Not only that, but what your users really need changes over time.
+
+This has the side-effect that all of a \rt\ kernel's reader-writer locking
+critical sections are subject to RCU priority boosting.
+This provides at least a partial solution to the problem of reader-writer
+lock readers being preempted for extended periods of time.
+
+It is also possible to avoid reader-writer lock priority inversion by
 converting the reader-writer lock to RCU, as briefly discussed in
 the next section.

@@ -1173,19 +1218,12 @@
 void __rcu_read_lock(void) \lnlbl[lock:b]

 void __rcu_read_unlock(void) \lnlbl[unl:b]
 {
-	struct task_struct *t = current;
-
-	if (t->rcu_read_lock_nesting != 1) { \lnlbl[unl:chkn]
-		--t->rcu_read_lock_nesting; \lnlbl[unl:dec]
-	} else { \lnlbl[unl:els:b]
-		barrier(); \lnlbl[unl:bar1]
-		t->rcu_read_lock_nesting = INT_MIN; \lnlbl[unl:neg]
+	barrier(); \lnlbl[unl:bar1]
+	if (!--current->rcu_read_lock_nesting) \lnlbl[unl:decchk]
 		barrier(); \lnlbl[unl:bar2]
-	if (READ_ONCE(t->rcu_read_unlock_special.s)) \lnlbl[unl:chks]
-		rcu_read_unlock_special(t); \lnlbl[unl:unls]
-	barrier(); \lnlbl[unl:bar3]
-	t->rcu_read_lock_nesting = 0; \lnlbl[unl:zero]
-	} \lnlbl[unl:els:e]
+	if (READ_ONCE(current->rcu_read_unlock_special.s)) { \lnlbl[unl:chks]
+		rcu_read_unlock_special(current); \lnlbl[unl:unls]
+	} \lnlbl[unl:if:e]
 } \lnlbl[unl:e]
 \end{VerbatimL}
 \end{fcvlabel}

@@ -1218,23 +1256,16 @@
 RCU read-side critical section to precede the \co{rcu_read_lock()}.
 \end{fcvref}

 \begin{fcvref}[ln:advsync:Preemptible Linux-Kernel RCU:unl]
-\Clnref{chkn} of \co{__rcu_read_unlock()} checks to see if the nesting
-level count is one, in other words, if this corresponds to the outermost
+\Clnref{bar1} of \co{__rcu_read_unlock()} prevents the compiler from
+reordering the code in the critical section with the remainder of
+this function.
+\Clnref{decchk} decrements the nesting count and checks to see if it
+has become zero, in other words, if this corresponds to the outermost
 \co{rcu_read_unlock()} of a nested set.
-If not, \clnref{dec} decrements this count, and control returns to the caller.
-Otherwise, this is the outermost \co{rcu_read_unlock()}, which requires
-the end-of-critical-section handling carried out by \clnrefrange{els:b}{els:e}.
- -\Clnref{bar1} prevents the compiler from reordering the code in the critical -section with the code comprising the \co{rcu_read_unlock()}. -\Clnref{neg} sets the nesting count to a large negative number -in order to prevent -destructive races with RCU read-side critical sections contained within -interrupt handlers~\cite{PaulEMcKenney2011RCU3.0trainwreck}, -and \clnref{bar2} prevents the compiler from reordering this assignment with -\clnref{chks}'s check for special handling. -If \clnref{chks} determines that special handling is required, \clnref{unls} -invokes \co{rcu_read_unlock_special()} to carry out that special handling. +If so, \clnref{bar2} prevents the compiler from reordering this nesting +update with \clnref{chks}'s check for special handling. +If special handling is required, then the call to +\co{rcu_read_unlock_special()} on \lnref{unls} carries it out. There are several types of special handling that can be required, but we will focus on that required when the RCU read-side critical section @@ -1249,10 +1280,6 @@ for those highest-priority threads, \co{rcu_read_unlock()} will never attempt to acquire any locks. In addition, if implemented carefully, locking can be used to synchronize real-time software~\cite{BjoernBrandenburgPhD,DipankarSarma2004OLSscalability}. - -Whether or not special handling is required, \clnref{bar3} prevents the compiler -from reordering the check on \clnref{chks} with the zeroing of the nesting -count on \clnref{zero}. \end{fcvref} \QuickQuiz{ @@ -1273,9 +1300,21 @@ count on \clnref{zero}. the context switch to complete. }\QuickQuizEnd -This preemptible RCU implementation enables real-time response for +Another important real-time feature of RCU, whether preemptible or +not, is the ability to offload RCU callback execution to a kernel +thread. +To use this, your kernel must be built with \co{CONFIG_RCU_NOCB_CPU=y} +and booted with the \co{rcu_nocbs=} kernel boot parameter specifying +which CPUs are to be offloaded. 
+Alternatively, any CPU specified by the \co{nohz_full=} kernel boot parameter +described in +\cref{sec:advsync:Polling-Loop Real-Time Support} +will also have its RCU callbacks offloaded. + +In short, this preemptible RCU implementation enables real-time response for read-mostly data structures without the delays inherent to priority -boosting of large numbers of readers. +boosting of large numbers of readers, and also without delays due to +callback invocation. \paragraph{Preemptible spinlocks} are an important part of the \rt\ patchset due to the long-duration @@ -1292,9 +1331,43 @@ achieving real-time latencies down in the tens of microseconds. Fortunately, Linux Foundation organized an effort to fund moving the remaining code from the \rt\ patchset to mainline. +\paragraph{Per-CPU variables}\ are used heavily in the Linux kernel +for performance reasons. +Unfortunately for real-time applications, many use cases for per-CPU +variables require coordinated update of multiple such variables, +which is normally provided by disabling preemption, which in turn +degrades real-time latencies. +Real-time applications clearly need some other way of coordinating +per-CPU variable updates. + +One alternative is to supply per-CPU spinlocks, which as noted above +are actually sleeplocks, so that their critical sections can be +preempted and so that priority inheritance is provided. +In this approach, code updating groups of per-CPU variables must +acquire the current CPU's spinlock, carry out the update, then +release whichever lock is acquired, keeping in mind that a preemption +might have resulted in a migration to some other CPU. +However, this approach introduces both overhead and deadlocks. + +Another alternative, which is used in the \rt\ patchset as of early 2021, +is to convert preemption disabling to migration disabling. 
+This ensures that a given kernel thread remains on its CPU through
+the duration of the per-CPU-variable update, but could also allow some
+other kernel thread to intersperse its own update of those same variables,
+courtesy of preemption.
+There are cases such as statistics gathering where this is not a problem.
+In the surprisingly rare case where such mid-update preemption is a problem,
+the use case at hand must properly synchronize the updates, perhaps through
+a set of per-CPU locks specific to that use case.
+Although this once again introduces the possibility of deadlock,
+the per-use-case nature of these locks makes any such deadlocks easier
+to manage and avoid.
+
+\paragraph{Closing event-driven remarks.}
 There are of course any number of other Linux-kernel components that
 are critically important to achieving world-class real-time latencies,
-most recently deadline scheduling,
+for example, deadline
+scheduling~\cite{DanielBristot2018deadlinesched-1,DanielBristot2018deadlinesched-2},
 however, those listed in this section give a good feeling for the
 workings of the Linux kernel augmented by the \rt\ patchset.

@@ -1316,8 +1389,8 @@ given CPU, avoiding all causes of interference.
 Although the reality is of course more complex, it is becoming possible
 to do just that, courtesy of the \co{NO_HZ_FULL} implementation led by
-Frederic Weisbecker~\cite{JonCorbet2013NO-HZ-FULL,FredericWeisbecker2013nohz} that has been
-accepted into version 3.10 of the Linux kernel.
+Frederic Weisbecker~\cite{JonCorbet2013NO-HZ-FULL,FredericWeisbecker2013nohz}
+that was accepted into version 3.10 of the Linux kernel.
 Nevertheless, considerable care is required to properly set up such
 an environment, as it is necessary to control a number of possible
 sources of OS jitter.

@@ -1336,7 +1409,7 @@
 contact with the outside world is to reserve a small number of
 housekeeping CPUs, and to force all interrupts to these CPUs.
The \path{Documentation/IRQ-affinity.txt} file in the Linux source tree describes how to direct device interrupts to specified CPU, -which as of early 2015 involves something like the following: +which as of early 2021 involves something like the following: \begin{VerbatimU} $ echo 0f > /proc/irq/44/smp_affinity @@ -1378,17 +1451,16 @@ $ echo -1 > /proc/sys/kernel/sched_rt_runtime_us \end{VerbatimU} You will of course need to be running as root to execute this command, -and you will also need to carefully consider the Spiderman principle. +and you will also need to carefully consider the aforementioned Spiderman +principle. One way to minimize the risks is to offload interrupts and kernel threads/daemons from all CPUs running CPU-bound real-time threads, as described in the paragraphs above. In addition, you should carefully read the material in the \path{Documentation/scheduler} directory. -The material in the \path{sched-rt-group.txt} file is particularly +The material in the \path{sched-rt-group.rst} file is particularly important, especially if you are using the \co{cgroups} real-time features -enabled by the \co{CONFIG_RT_GROUP_SCHED} Kconfig parameter, in which -case you should also read the material in the -\path{Documentation/cgroups} directory. +enabled by the \co{CONFIG_RT_GROUP_SCHED} Kconfig parameter. A fourth source of OS jitter comes from timers. In most cases, keeping a given CPU out of the kernel will prevent @@ -1417,7 +1489,7 @@ A third way is to test the device's ability to support real-time workloads and fix any real-time bugs.\footnote{ If you take this approach, please submit your fixes upstream so that others can benefit. - Keep in mind that when you need to port your application to + After all, when you need to port your application to a later version of the Linux kernel, \emph{you} will be one of those ``others''.} @@ -1426,7 +1498,7 @@ full-system synchronization algorithms, perhaps most notably the global TLB-flush algorithm. 
This can be avoided by avoiding memory-unmapping operations, and especially avoiding unmapping operations within the kernel. -As of early 2015, the way to avoid in-kernel +As of early 2021, the way to avoid in-kernel unmapping operations is to avoid unloading kernel modules. A seventh source of OS jitter is provided by @@ -1535,7 +1607,7 @@ briefly covers event-based applications. As in all areas of engineering, a robust set of components is essential to productivity and reliability. This section is not a full catalog of real-time software components---such -a catalog would fill an entire book---but rather a brief overview of the +a catalog would fill multiple books---but rather a brief overview of the types of components available. A natural place to look for real-time software components would be @@ -1543,10 +1615,9 @@ algorithms offering wait-free synchronization~\cite{Herlihy91}, and in fact lockless algorithms are very important to real-time computing. However, wait-free synchronization only guarantees forward progress in -finite time, and real-time computing requires algorithms that meet the -far more stringent guarantee of forward progress in bounded time. -After all, a century is finite, but unhelpful when your deadlines are -measured in milliseconds. +finite time. +Although a century is finite, this is unhelpful when your deadlines are +measured in microseconds, let alone milliseconds. 
Nevertheless, there are some important wait-free algorithms that do provide bounded response time, including atomic test and set, @@ -1558,15 +1629,13 @@ In addition, recent research has confirmed the observation that algorithms with lock-free guarantees\footnote{ Wait-free algorithms guarantee that all threads make progress in finite time, while lock-free algorithms only guarantee that at - least one thread will make progress in finite time.} -provide the same latencies in practice assuming a stochastically -fair scheduler and freedom from fail-stop -bugs~\cite{DanAlitarh2013PracticalProgress}.\footnote{ - This paper also introduces the notion of \emph{bounded minimal - progress}, which is a welcome step on the part of theory - towards real-time practice.} -This means that lock-free stacks and queues are appropriate -for real-time use. + least one thread will make progress in finite time. + See \cref{sec:advsync:Non-Blocking Synchronization} for more details.} +also provide the same latencies in practice (in the wait-free sense), +assuming a stochastically fair scheduler and absence of fail-stop +bugs~\cite{DanAlitarh2013PracticalProgress}. +This means that many non-wait-free stacks and queues are nevertheless +appropriate for real-time use. \QuickQuiz{ But isn't correct operation despite fail-stop bugs @@ -1670,12 +1739,11 @@ still having access to the full features and functions of a general-purpose operating system is to use the Linux kernel's \co{NO_HZ_FULL} capability, described in \cref{sec:advsync:Polling-Loop Real-Time Support}. -This support first became available in version 3.10 of the Linux kernel. \subsubsection{Streaming Applications} \label{sec:advsync:Streaming Applications} -A popular sort of big-data real-time application takes input from numerous +One type of big-data real-time application takes input from numerous sources, processes it internally, and outputs alerts and summaries. 
These \emph{streaming applications} are often highly parallel, processing different information sources concurrently. @@ -1862,9 +1930,12 @@ on \clnrefrange{upd:b}{upd:e}. the call to \co{malloc()}??? }\QuickQuizAnswerB{ In early 2016, situations forbidding runtime memory were - also not so excited with multithreaded computing. + also not at all interested in multithreaded computing. So the runtime memory allocation is not an additional obstacle to safety criticality. + + However, by 2020 runtime memory allocation in multi-core + real-time systems was gaining some traction. }\QuickQuizEndB % \QuickQuizE{ @@ -1873,6 +1944,9 @@ on \clnrefrange{upd:b}{upd:e}. }\QuickQuizAnswerE{ Indeed you do, and you could use any of a number of techniques discussed earlier in this book. + One of those techniques is use of a single updater thread, + which would result in exactly the code shown in \co{update_cal()} + in \cref{lst:advsync:Real-Time Calibration Using RCU}. }\QuickQuizEndE } @@ -1897,9 +1971,9 @@ data-structure access to real-time programs. \end{figure} The choice between real-time and real-fast computing can be a difficult one. -Because real-time systems often inflict a throughput penalty on non-real-time -computing, using real-time when it is not required can cause problems, -as fancifully depicted by +Because real-time systems often inflict a throughput penalty on +non-real-time computing, using real-time when it is not required is +unwise, as fancifully depicted by \cref{fig:advsync:The Dark Side of Real-Time Computing}. 
On the other hand, failing to use real-time when it \emph{is} required can also cause problems, as fancifully depicted by diff --git a/bib/realtime.bib b/bib/realtime.bib index eff92791..5abcaa5e 100644 --- a/bib/realtime.bib +++ b/bib/realtime.bib @@ -589,6 +589,26 @@ lastchecked="November 4, 2016", } +@unpublished{DanielBristot2018deadlinesched-1, + Author="Daniel Bristot de Oliveira", + Title="Deadline scheduling part 1 -- overview and theory", + month="January", + day="16", + year="2018", + note="URL: \url{https://lwn.net/Articles/743740/}", + lastchecked="February 26, 2021", +} + +@unpublished{DanielBristot2018deadlinesched-2, + Author="Daniel Bristot de Oliveira", + Title="Deadline scheduler part 2 -- details and usage", + month="January", + day="19", + year="2018", + note="URL: \url{https://lwn.net/Articles/743946/}", + lastchecked="February 26, 2021", +} + @article{Reghenzani:2019:RLK:3309872.3297714, author = {Reghenzani, Federico and Massari, Giuseppe and Fornaciari, William}, title = {The Real-Time {Linux} Kernel: A Survey on {PREEMPT\_RT}}, |