author     Paul E. McKenney <paulmck@kernel.org>  2021-02-26 14:15:27 -0800
committer  Paul E. McKenney <paulmck@kernel.org>  2021-02-26 14:15:27 -0800
commit     4fb8e23579545f7e4adbbbd989b9c6487bf99ec1 (patch)
tree       d24bff50f0ff45d43e52ca6e7be530f4b34311be
parent     5bd724847a60932aae509fb66eecdc4b5eedee05 (diff)
download   perfbook-4fb8e23579545f7e4adbbbd989b9c6487bf99ec1.tar.gz
advsync: Updates and wordsmithing, take five
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
-rw-r--r--  advsync/rt.tex    246
-rw-r--r--  bib/realtime.bib   20
2 files changed, 180 insertions, 86 deletions
diff --git a/advsync/rt.tex b/advsync/rt.tex
index e4e0d5a4..8a2c42c5 100644
--- a/advsync/rt.tex
+++ b/advsync/rt.tex
@@ -220,9 +220,8 @@ that portion of the outside world that is to be monitored or controlled.
But what about battery-powered systems?
They don't require energy flowing into the system as a whole.
}\QuickQuizAnswer{
- Sooner or later, either the battery must be recharged, which
- requires energy to flow into the system, or the system will
- stop operating.
+ Sooner or later, the battery must be recharged, which requires
+ energy to flow into the system.
}\QuickQuizEnd
A number of systems are intended to operate in environments with impressive
@@ -293,9 +292,9 @@ real-time systems.
Those queueing-theory results assume infinite ``calling populations'',
which in the Linux kernel might correspond to an infinite number
of tasks.
- As of mid-2016, no real system supports an infinite number of tasks,
- so results assuming infinite calling populations should be
- expected to have less-than-infinite applicability.
+ As of early 2021, no real system supports an infinite number of
+ tasks, so results assuming infinite calling populations should
+ be expected to have less-than-infinite applicability.
Other queueing-theory results have \emph{finite}
calling populations, which feature sharply bounded response
@@ -429,9 +428,9 @@ large advances are required.
\QuickQuiz{
Formal verification is already quite capable, benefiting from
decades of intensive study.
- Are additional advances \emph{really} required, or is this
- just a practitioner's excuse to continue to be lazy and ignore
- the awesome power of formal verification?
+ Are additional advances \emph{really} required, or is this just
+ a practitioner's excuse to continue to lazily ignore the awesome
+ power of formal verification?
}\QuickQuizAnswer{
Perhaps this situation is just a theoretician's excuse to avoid
diving into the messy world of real software?
@@ -457,6 +456,11 @@ large advances are required.
All that said, there is hope, given recent work formalizing
the memory models of real computer
systems~\cite{JadeAlglave2011ppcmem,Alglave:2013:SVW:2450268.2450306}.
+ On the other hand, formal verification has just as much trouble
+ as does testing with the astronomical number of variants of the
+ Linux kernel that can be constructed from different combinations
+ of its tens of thousands of Kconfig options.
+ Sometimes life is hard!
}\QuickQuizEnd
In addition to latency requirements for the real-time portions of the
@@ -504,9 +508,9 @@ to shift, but such progress is by no means a bad thing.
easily developed using standard non-real-time approaches, or
whether the more difficult and expensive real-time approaches
are required.
- In other words, theory is quite important, however, for those
- of us who like to get things done, theory supports practice,
- never the other way around.
+ In other words, although theory is quite important, for those of
+ us called upon to complete practical projects, theory supports
+ practice, never the other way around.
}\QuickQuizEnd
Real-time computing is used in industrial-control applications, ranging from
@@ -971,9 +975,17 @@ time-critical, so that a timer wheel's millisecond-level granularity
is good and sufficient.
Another key observation is that error-handling timeouts are normally
canceled very early, often before they can be cascaded.
-A final observation is that systems commonly have many more error-handling
-timeouts than they do timer events, so that an $\O{\log n}$
-data structure should provide acceptable performance for timer events.
+In addition, systems commonly have many more error-handling timeouts
+than they do timer events, so that an $\O{\log n}$ data structure should
+provide acceptable performance for timer events.
+
+However, it is possible to do better, namely by simply refusing to
+cascade timers.
+Instead of cascading, the timers that would otherwise have been cascaded
+all the way down the calendar queue are handled in place.
+This does result in up to a few percent error in the timer duration,
+but the few situations where this is a problem can instead use tree-based
+high-resolution timers (hrtimers).
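The bounded-error idea can be sketched in a few lines of C. This is purely an illustration, loosely patterned on the mainline timer wheel's level constants (64 slots per level, each level eight times coarser than the last); it is not kernel code:

```c
/* Toy model of handling timers in place: instead of cascading, a
 * timeout is rounded up to the granularity of the wheel level that
 * covers it, bounding the *relative* expiry error.  Constants are
 * loosely patterned on the Linux kernel's timer wheel; this is an
 * illustration, not kernel code. */
#define LVL_BITS      6  /* 64 slots per level */
#define LVL_CLK_SHIFT 3  /* each level is 8x coarser than the last */

/* Return the expiry delay actually used for a timer due in "delta"
 * jiffies: exact at level 0, rounded up at coarser levels. */
unsigned long wheel_round(unsigned long delta)
{
	unsigned long shift = 0;

	while ((delta >> shift) >= (1UL << LVL_BITS))
		shift += LVL_CLK_SHIFT;  /* find the covering level */
	return ((delta + (1UL << shift) - 1) >> shift) << shift;
}
```

For example, a 100-jiffy timeout lands in the second level (granularity of 8 jiffies) and is rounded up to 104 jiffies, a 4\,\% error, while short timeouts in the first level remain exact.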
In short, the Linux kernel's \rt\ patchset uses timer wheels for
error-handling timeouts and a tree for timer events, providing each
@@ -1054,7 +1066,7 @@ while holding the lock, which would prevent the medium-priority
threads from preempting it.
In the priority-inheritance solution, the high-priority thread attempting
-to acquire the lock donate its priority to the low-priority thread holding
+to acquire the lock donates its priority to the low-priority thread holding
the lock until such time as the lock is released, thus preventing long-term
priority inversion.
@@ -1132,7 +1144,7 @@ priority-inversion conundrum:
However, this approach clearly and severely limits read-side
scalability.
- The Linux kernel's \rt\ patchset has been able to live with this
+ The Linux kernel's \rt\ patchset was long able to live with this
limitation for several reasons: (1)~Real-time systems have
traditionally been relatively small, (2)~Real-time systems
have generally focused on process control, thus being unaffected
@@ -1140,12 +1152,45 @@ priority-inversion conundrum:
(3)~Many of the Linux kernel's reader-writer locks have been
converted to RCU.
- All that aside, it is quite possible that the Linux kernel
- will some day permit limited read-side parallelism for
- reader-writer locks subject to priority boosting.
+ However, the day came when it was absolutely necessary to
+ permit concurrent readers, as described in the text following
+ this quiz.
}\QuickQuizEnd
-In some cases, reader-writer lock priority inversion can be avoided by
+The no-concurrent-readers restriction eventually became intolerable, so
+the \rt\ developers looked more carefully at how the Linux kernel uses
+reader-writer spinlocks.
+They learned that time-critical code rarely uses those parts of the
+kernel that write-acquire reader-writer locks, so that the prospect
+of writer starvation was not a show-stopper.
+They therefore constructed a real-time reader-writer lock in which
+write-side acquisitions use priority inheritance among each other,
+but where read-side acquisitions take absolute priority over
+write-side acquisitions.
+This approach appears to be working well in practice, and is another
+lesson in the importance of clearly understanding what your users
+really need.
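The reader-preference policy can be illustrated by the following toy spinning lock, an invented userspace sketch using C11 atomics and emphatically *not* the \rt\ patchset's sleeplock (which also provides priority inheritance among writers). Readers wait only for a writer that has already claimed the lock, so writers can starve exactly as discussed above:

```c
#include <stdatomic.h>
#include <stdbool.h>

struct rp_rwlock {
	atomic_int readers;  /* number of active readers */
	atomic_bool writer;  /* true while a writer holds the lock */
};

static struct rp_rwlock rwl;  /* example instance */

void rp_read_lock(struct rp_rwlock *l)
{
	for (;;) {
		atomic_fetch_add(&l->readers, 1);
		if (!atomic_load(&l->writer))
			return;  /* readers overlap freely */
		/* An active writer: back out and wait it out. */
		atomic_fetch_sub(&l->readers, 1);
		while (atomic_load(&l->writer))
			continue;
	}
}

void rp_read_unlock(struct rp_rwlock *l)
{
	atomic_fetch_sub(&l->readers, 1);
}

void rp_write_lock(struct rp_rwlock *l)
{
	bool f = false;

	while (!atomic_compare_exchange_weak(&l->writer, &f, true))
		f = false;  /* one writer at a time */
	while (atomic_load(&l->readers))
		continue;   /* readers take priority: writers can starve */
}

void rp_write_unlock(struct rp_rwlock *l)
{
	atomic_store(&l->writer, false);
}
```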
+
+One interesting detail of this implementation is that both the
+\co{rt_read_lock()} and the \co{rt_write_lock()} functions enter an RCU
+read-side critical section and both the \co{rt_read_unlock()} and the
+\co{rt_write_unlock()} functions exit that critical section.
+This is necessary because the vanilla kernel's reader-writer locking
+functions disable preemption across their critical sections, and
+there really are reader-writer locking use cases that rely on the fact
+that \co{synchronize_rcu()} will therefore wait for all pre-existing
+reader-writer-lock critical sections to complete.
+Let this be a lesson to you:
+Understanding what your users really need is critically important to
+correct operation, not just to performance.
+Not only that, but what your users really need changes over time.
+
+This implementation choice has the side effect that all of a
+\rt\ kernel's reader-writer locking critical sections are subject
+to RCU priority boosting.
+This provides at least a partial solution to the problem of reader-writer
+lock readers being preempted for extended periods of time.
+
+It is also possible to avoid reader-writer lock priority inversion by
converting the reader-writer lock to RCU, as briefly discussed in the
next section.
@@ -1173,19 +1218,12 @@ void __rcu_read_lock(void) \lnlbl[lock:b]
void __rcu_read_unlock(void) \lnlbl[unl:b]
{
- struct task_struct *t = current;
-
- if (t->rcu_read_lock_nesting != 1) { \lnlbl[unl:chkn]
- --t->rcu_read_lock_nesting; \lnlbl[unl:dec]
- } else { \lnlbl[unl:els:b]
- barrier(); \lnlbl[unl:bar1]
- t->rcu_read_lock_nesting = INT_MIN; \lnlbl[unl:neg]
+	barrier(); \lnlbl[unl:bar1]
+	if (!--current->rcu_read_lock_nesting) { \lnlbl[unl:decchk]
 		barrier(); \lnlbl[unl:bar2]
-		if (READ_ONCE(t->rcu_read_unlock_special.s)) \lnlbl[unl:chks]
-			rcu_read_unlock_special(t); \lnlbl[unl:unls]
-		barrier(); \lnlbl[unl:bar3]
-		t->rcu_read_lock_nesting = 0; \lnlbl[unl:zero]
-	} \lnlbl[unl:els:e]
+		if (READ_ONCE(current->rcu_read_unlock_special.s)) \lnlbl[unl:chks]
+			rcu_read_unlock_special(current); \lnlbl[unl:unls]
+	} \lnlbl[unl:if:e]
 } \lnlbl[unl:e]
\end{VerbatimL}
\end{fcvlabel}
@@ -1218,23 +1256,16 @@ RCU read-side critical section to precede the \co{rcu_read_lock()}.
\end{fcvref}
\begin{fcvref}[ln:advsync:Preemptible Linux-Kernel RCU:unl]
-\Clnref{chkn} of \co{__rcu_read_unlock()} checks to see if the nesting
-level count is one, in other words, if this corresponds to the outermost
+\Clnref{bar1} of \co{__rcu_read_unlock()} prevents the compiler from
+reordering the code in the critical section with the remainder of
+this function.
+\Clnref{decchk} decrements the nesting count and checks to see if it
+has become zero, in other words, if this corresponds to the outermost
\co{rcu_read_unlock()} of a nested set.
-If not, \clnref{dec} decrements this count, and control returns to the caller.
-Otherwise, this is the outermost \co{rcu_read_unlock()}, which requires
-the end-of-critical-section handling carried out by \clnrefrange{els:b}{els:e}.
-
-\Clnref{bar1} prevents the compiler from reordering the code in the critical
-section with the code comprising the \co{rcu_read_unlock()}.
-\Clnref{neg} sets the nesting count to a large negative number
-in order to prevent
-destructive races with RCU read-side critical sections contained within
-interrupt handlers~\cite{PaulEMcKenney2011RCU3.0trainwreck},
-and \clnref{bar2} prevents the compiler from reordering this assignment with
-\clnref{chks}'s check for special handling.
-If \clnref{chks} determines that special handling is required, \clnref{unls}
-invokes \co{rcu_read_unlock_special()} to carry out that special handling.
+If so, \clnref{bar2} prevents the compiler from reordering this nesting
+update with \clnref{chks}'s check for special handling.
+If special handling is required, then the call to
+\co{rcu_read_unlock_special()} on \clnref{unls} carries it out.
There are several types of special handling that can be required, but
we will focus on that required when the RCU read-side critical section
@@ -1249,10 +1280,6 @@ for those highest-priority threads, \co{rcu_read_unlock()} will never
attempt to acquire any locks.
In addition, if implemented carefully, locking can be used to synchronize
real-time software~\cite{BjoernBrandenburgPhD,DipankarSarma2004OLSscalability}.
-
-Whether or not special handling is required, \clnref{bar3} prevents the compiler
-from reordering the check on \clnref{chks} with the zeroing of the nesting
-count on \clnref{zero}.
\end{fcvref}
\QuickQuiz{
@@ -1273,9 +1300,21 @@ count on \clnref{zero}.
the context switch to complete.
}\QuickQuizEnd
-This preemptible RCU implementation enables real-time response for
+Another important real-time feature of RCU, whether preemptible or
+not, is the ability to offload RCU callback execution to a kernel
+thread.
+To use this, your kernel must be built with \co{CONFIG_RCU_NOCB_CPU=y}
+and booted with the \co{rcu_nocbs=} kernel boot parameter specifying
+which CPUs are to be offloaded.
+Alternatively, any CPU specified by the \co{nohz_full=} kernel boot parameter
+described in
+\cref{sec:advsync:Polling-Loop Real-Time Support}
+will also have its RCU callbacks offloaded.
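For concreteness, a boot command line enabling both mechanisms might contain something like the following (the CPU list is purely illustrative):

```
rcu_nocbs=3-7 nohz_full=3-7
```

The offloaded CPUs' callbacks are then invoked from rcuo kthreads, which can in turn be bound to housekeeping CPUs.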
+
+In short, this preemptible RCU implementation enables real-time response for
read-mostly data structures without the delays inherent to priority
-boosting of large numbers of readers.
+boosting of large numbers of readers, and also without delays due to
+callback invocation.
\paragraph{Preemptible spinlocks}
are an important part of the \rt\ patchset due to the long-duration
@@ -1292,9 +1331,43 @@ achieving real-time latencies down in the tens of microseconds.
Fortunately, Linux Foundation organized an effort to fund moving the
remaining code from the \rt\ patchset to mainline.
+\paragraph{Per-CPU variables}\ are used heavily in the Linux kernel
+for performance reasons.
+Unfortunately for real-time applications, many use cases for per-CPU
+variables require coordinated update of multiple such variables,
+coordination that is normally provided by disabling preemption,
+which in turn degrades real-time latencies.
+Real-time applications clearly need some other way of coordinating
+per-CPU variable updates.
+
+One alternative is to supply per-CPU spinlocks, which as noted above
+are actually sleeplocks, so that their critical sections can be
+preempted and so that priority inheritance is provided.
+In this approach, code updating groups of per-CPU variables must
+acquire the current CPU's spinlock, carry out the update, then
+release whichever lock was acquired, keeping in mind that a preemption
+might have resulted in a migration to some other CPU.
+However, this approach introduces both overhead and the possibility
+of deadlock.
+
+Another alternative, which is used in the \rt\ patchset as of early 2021,
+is to convert preemption disabling to migration disabling.
+This ensures that a given kernel thread remains on its CPU through
+the duration of the per-CPU-variable update, but could also allow some
+other kernel thread to intersperse its own update of those same variables,
+courtesy of preemption.
+There are cases such as statistics gathering where this is not a problem.
+In the surprisingly rare case where such mid-update preemption is a problem,
+the use case at hand must properly synchronize the updates, perhaps through
+a set of per-CPU locks specific to that use case.
+Although introducing locks again introduces the possibility of deadlock,
+the per-use-case nature of these locks makes any such deadlocks easier
+to manage and avoid.
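A userspace analog of the per-use-case locking approach might look as follows (all names invented for illustration): each slot carries its own lock, so a multi-variable update remains atomic with respect to other updaters even if the updating thread is preempted, and in the kernel case migrated, mid-update.

```c
#include <pthread.h>

#define NR_CPUS 4  /* illustrative; a real system discovers this */

/* Hypothetical analog of per-CPU data: each slot carries its own
 * use-case-specific lock, so updating nevents and nbytes together
 * stays consistent despite preemption mid-update. */
struct cpu_stats {
	pthread_mutex_t lock;
	unsigned long nevents;
	unsigned long nbytes;  /* must stay in step with nevents */
};

static struct cpu_stats stats[NR_CPUS];

void stats_init(void)
{
	for (int i = 0; i < NR_CPUS; i++)
		pthread_mutex_init(&stats[i].lock, NULL);
}

void record_event(int cpu, unsigned long bytes)
{
	/* Lock the slot we started on, not "the current CPU", so a
	 * migration between these lines cannot split the update. */
	pthread_mutex_lock(&stats[cpu].lock);
	stats[cpu].nevents++;
	stats[cpu].nbytes += bytes;
	pthread_mutex_unlock(&stats[cpu].lock);
}
```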
+
+\paragraph{Closing event-driven remarks.}
There are of course any number of other Linux-kernel components that are
critically important to achieving world-class real-time latencies,
-most recently deadline scheduling,
+for example, deadline
+scheduling~\cite{DanielBristot2018deadlinesched-1,DanielBristot2018deadlinesched-2};
however, those listed in this section give a good feeling for the workings
of the Linux kernel augmented by the \rt\ patchset.
@@ -1316,8 +1389,8 @@ given CPU, avoiding all causes of interference.
Although the reality is of course more complex, it is becoming
possible to do just that,
courtesy of the \co{NO_HZ_FULL} implementation led by
-Frederic Weisbecker~\cite{JonCorbet2013NO-HZ-FULL,FredericWeisbecker2013nohz} that has been
-accepted into version 3.10 of the Linux kernel.
+Frederic Weisbecker~\cite{JonCorbet2013NO-HZ-FULL,FredericWeisbecker2013nohz}
+that was accepted into version 3.10 of the Linux kernel.
Nevertheless, considerable care is required to properly set up such
an environment, as it is necessary to control a number of possible
sources of OS jitter.
@@ -1336,7 +1409,7 @@ contact with the outside world is to reserve a small number of
housekeeping CPUs, and to force all interrupts to these CPUs.
The \path{Documentation/IRQ-affinity.txt} file in the Linux source tree
describes how to direct device interrupts to specified CPU,
-which as of early 2015 involves something like the following:
+which as of early 2021 involves something like the following:
\begin{VerbatimU}
$ echo 0f > /proc/irq/44/smp_affinity
@@ -1378,17 +1451,16 @@ $ echo -1 > /proc/sys/kernel/sched_rt_runtime_us
\end{VerbatimU}
You will of course need to be running as root to execute this command,
-and you will also need to carefully consider the Spiderman principle.
+and you will also need to carefully consider the aforementioned Spiderman
+principle.
One way to minimize the risks is to offload interrupts and
kernel threads/daemons from all CPUs running CPU-bound real-time
threads, as described in the paragraphs above.
In addition, you should carefully read the material in the
\path{Documentation/scheduler} directory.
-The material in the \path{sched-rt-group.txt} file is particularly
+The material in the \path{sched-rt-group.rst} file is particularly
important, especially if you are using the \co{cgroups} real-time features
-enabled by the \co{CONFIG_RT_GROUP_SCHED} Kconfig parameter, in which
-case you should also read the material in the
-\path{Documentation/cgroups} directory.
+enabled by the \co{CONFIG_RT_GROUP_SCHED} Kconfig parameter.
A fourth source of OS jitter comes from timers.
In most cases, keeping a given CPU out of the kernel will prevent
@@ -1417,7 +1489,7 @@ A third way is to test the device's ability to support real-time
workloads and fix any real-time bugs.\footnote{
If you take this approach, please submit your fixes upstream
so that others can benefit.
- Keep in mind that when you need to port your application to
+ After all, when you need to port your application to
a later version of the Linux kernel, \emph{you} will be one of those
``others''.}
@@ -1426,7 +1498,7 @@ full-system synchronization algorithms, perhaps most notably
the global TLB-flush algorithm.
This can be avoided by avoiding memory-unmapping operations, and especially
avoiding unmapping operations within the kernel.
-As of early 2015, the way to avoid in-kernel
+As of early 2021, the way to avoid in-kernel
unmapping operations is to avoid unloading kernel modules.
A seventh source of OS jitter is provided by
@@ -1535,7 +1607,7 @@ briefly covers event-based applications.
As in all areas of engineering, a robust set of components is essential
to productivity and reliability.
This section is not a full catalog of real-time software components---such
-a catalog would fill an entire book---but rather a brief overview of the
+a catalog would fill multiple books---but rather a brief overview of the
types of components available.
A natural place to look for real-time software components would be
@@ -1543,10 +1615,9 @@ algorithms offering wait-free
synchronization~\cite{Herlihy91}, and in fact lockless
algorithms are very important to real-time computing.
However, wait-free synchronization only guarantees forward progress in
-finite time, and real-time computing requires algorithms that meet the
-far more stringent guarantee of forward progress in bounded time.
-After all, a century is finite, but unhelpful when your deadlines are
-measured in milliseconds.
+finite time.
+Although a century is finite, this is unhelpful when your deadlines
+are measured in milliseconds, let alone microseconds.
Nevertheless, there are some important wait-free algorithms that do
provide bounded response time, including atomic test and set,
@@ -1558,15 +1629,13 @@ In addition, recent research has confirmed the observation that
algorithms with lock-free guarantees\footnote{
Wait-free algorithms guarantee that all threads make progress in
finite time, while lock-free algorithms only guarantee that at
- least one thread will make progress in finite time.}
-provide the same latencies in practice assuming a stochastically
-fair scheduler and freedom from fail-stop
-bugs~\cite{DanAlitarh2013PracticalProgress}.\footnote{
- This paper also introduces the notion of \emph{bounded minimal
- progress}, which is a welcome step on the part of theory
- towards real-time practice.}
-This means that lock-free stacks and queues are appropriate
-for real-time use.
+ least one thread will make progress in finite time.
+ See \cref{sec:advsync:Non-Blocking Synchronization} for more details.}
+also provide the same latencies in practice (in the wait-free sense),
+assuming a stochastically fair scheduler and absence of fail-stop
+bugs~\cite{DanAlitarh2013PracticalProgress}.
+This means that many non-wait-free stacks and queues are nevertheless
+appropriate for real-time use.
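As a concrete example of such a non-wait-free but real-time-friendly structure, here is a sketch of the classic Treiber lock-free stack in C11 (an illustration, not code from any particular project): an individual push or pop may be forced to retry, yet some thread always completes in finite time, and given a stochastically fair scheduler each thread's latency is bounded in practice.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct node {
	int value;
	struct node *next;
};

static _Atomic(struct node *) top;  /* stack head */

void lf_push(int value)
{
	struct node *n = malloc(sizeof(*n));

	n->value = value;
	n->next = atomic_load(&top);
	/* On failure, n->next is refreshed to the current top; retry. */
	while (!atomic_compare_exchange_weak(&top, &n->next, n))
		continue;
}

int lf_pop(int *value)
{
	struct node *n = atomic_load(&top);

	/* On failure, n is refreshed to the current top; retry. */
	while (n && !atomic_compare_exchange_weak(&top, &n, n->next))
		continue;
	if (!n)
		return 0;  /* stack empty */
	*value = n->value;
	free(n);  /* real code needs safe reclamation (e.g., RCU)
		   * to avoid use-after-free and ABA problems */
	return 1;
}
```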
\QuickQuiz{
But isn't correct operation despite fail-stop bugs
@@ -1670,12 +1739,11 @@ still having access to the full features and functions of a
general-purpose operating system is to use the Linux kernel's
\co{NO_HZ_FULL} capability, described in
\cref{sec:advsync:Polling-Loop Real-Time Support}.
-This support first became available in version 3.10 of the Linux kernel.
\subsubsection{Streaming Applications}
\label{sec:advsync:Streaming Applications}
-A popular sort of big-data real-time application takes input from numerous
+One type of big-data real-time application takes input from numerous
sources, processes it internally, and outputs alerts and summaries.
These \emph{streaming applications} are often highly parallel, processing
different information sources concurrently.
@@ -1862,9 +1930,12 @@ on \clnrefrange{upd:b}{upd:e}.
the call to \co{malloc()}???
}\QuickQuizAnswerB{
In early 2016, situations forbidding runtime memory were
- also not so excited with multithreaded computing.
+ also not at all interested in multithreaded computing.
So the runtime memory allocation is not an additional
obstacle to safety criticality.
+
+ However, by 2020 runtime memory allocation in multi-core
+ real-time systems was gaining some traction.
}\QuickQuizEndB
%
\QuickQuizE{
@@ -1873,6 +1944,9 @@ on \clnrefrange{upd:b}{upd:e}.
}\QuickQuizAnswerE{
Indeed you do, and you could use any of a number of techniques
discussed earlier in this book.
+ One of those techniques is use of a single updater thread,
+ which would result in exactly the code shown in \co{update_cal()}
+ in \cref{lst:advsync:Real-Time Calibration Using RCU}.
}\QuickQuizEndE
}
@@ -1897,9 +1971,9 @@ data-structure access to real-time programs.
\end{figure}
The choice between real-time and real-fast computing can be a difficult one.
-Because real-time systems often inflict a throughput penalty on non-real-time
-computing, using real-time when it is not required can cause problems,
-as fancifully depicted by
+Because real-time systems often inflict a throughput penalty on
+non-real-time computing, using real-time when it is not required is
+unwise, as fancifully depicted by
\cref{fig:advsync:The Dark Side of Real-Time Computing}.
On the other hand, failing to use real-time when it \emph{is} required
can also cause problems, as fancifully depicted by
diff --git a/bib/realtime.bib b/bib/realtime.bib
index eff92791..5abcaa5e 100644
--- a/bib/realtime.bib
+++ b/bib/realtime.bib
@@ -589,6 +589,26 @@
lastchecked="November 4, 2016",
}
+@unpublished{DanielBristot2018deadlinesched-1,
+ Author="Daniel Bristot de Oliveira",
+ Title="Deadline scheduling part 1 -- overview and theory",
+ month="January",
+ day="16",
+ year="2018",
+ note="URL: \url{https://lwn.net/Articles/743740/}",
+ lastchecked="February 26, 2021",
+}
+
+@unpublished{DanielBristot2018deadlinesched-2,
+ Author="Daniel Bristot de Oliveira",
+ Title="Deadline scheduler part 2 -- details and usage",
+ month="January",
+ day="19",
+ year="2018",
+ note="URL: \url{https://lwn.net/Articles/743946/}",
+ lastchecked="February 26, 2021",
+}
+
@article{Reghenzani:2019:RLK:3309872.3297714,
author = {Reghenzani, Federico and Massari, Giuseppe and Fornaciari, William},
title = {The Real-Time {Linux} Kernel: A Survey on {PREEMPT\_RT}},