together/microopt: Move micro-optimization section

This move is in preparation for adding a new data-structure chapter. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
author: Paul E. McKenney <paulmck@kernel.org> 2023-10-01 16:24:24 -0700
committer: Paul E. McKenney <paulmck@kernel.org> 2023-10-01 16:24:24 -0700
commit: 36a782e82f66210f213349d15308c8f3e38b9de8 (patch)
tree: 943099761602232fa9aedfef54169dda2e5ed9a1
parent: 3860345ab8d75a321b64062ea013e33591e69586 (diff)
download: perfbook-36a782e82f66210f213349d15308c8f3e38b9de8.tar.gz
3 files changed, 271 insertions, 259 deletions
diff --git a/datastruct/datastruct.tex b/datastruct/datastruct.tex
index 0101f34b..07f99fa4 100644
--- a/datastruct/datastruct.tex
+++ b/datastruct/datastruct.tex
@@ -61,11 +61,6 @@ Because this chapter cannot delve into the details of every concurrent
 data structure,
 \cref{sec:datastruct:Other Data Structures}
 surveys a few of the important ones.
-Although the best performance and scalability results from design rather
-than after-the-fact micro-optimization, micro-optimization is nevertheless
-necessary for the absolute best possible performance and scalability,
-as described in
-\cref{sec:datastruct:Micro-Optimization}.
 Finally, \cref{sec:datastruct:Summary}
 presents a summary of this chapter.
 
@@ -2032,259 +2027,6 @@ queues~\cite{AndreasHaas2012FIFOisnt,ChristophMKirsch2012FIFOisntTR,AndreasHaas2
 It seems likely that continued work with concurrent data structures will
 produce novel algorithms with surprising properties.
 
-\section{Micro-Optimization}
-\label{sec:datastruct:Micro-Optimization}
-%
-\epigraph{The devil is in the details.}{Unknown}
-
-The data structures shown in this chapter were coded straightforwardly,
-with no adaptation to the underlying system's cache hierarchy.
-In addition, many of the implementations used pointers to functions
-for key-to-hash conversions and other frequent operations.
-Although this approach provides simplicity and portability, in many
-cases it does give up some performance.
-
-The following sections touch on specialization, memory conservation,
-and hardware considerations.
-Please do not mistake these short sections for a definitive treatise
-on this subject.
-Whole books have been written on optimizing for a specific CPU, let
-alone to the wide variety of CPU families in common use today.
-
-\subsection{Specialization}
-\label{sec:datastruct:Specialization}
-
-The resizable hash table presented in
-\cref{sec:datastruct:Non-Partitionable Data Structures}
-used an opaque type for the key.
-This allows great flexibility, permitting any sort of key to be
-used, but it also incurs significant overhead due to the calls via
-of pointers to functions.
-Now, modern hardware uses sophisticated branch-prediction techniques
-to minimize this overhead, but on the other hand, real-world software
-is often larger than can be accommodated even by today's large
-hardware branch-prediction tables.
-This is especially the case for calls via pointers, in which case
-the branch prediction hardware must record a pointer in addition
-to branch-taken/branch-not-taken information.
-
-This overhead can be eliminated by specializing a hash-table implementation
-to a given key type and hash function, for example, by using C++ templates.
-Doing so eliminates the \co{->ht_cmp()}, \co{->ht_gethash()}, and
-\co{->ht_getkey()} function pointers in the \co{ht} structure shown in
-\cref{lst:datastruct:Resizable Hash-Table Data Structures} on
-\cpageref{lst:datastruct:Resizable Hash-Table Data Structures}.
-It also eliminates the corresponding calls through these pointers,
-which could allow the compiler to inline the resulting fixed functions,
-eliminating not only the overhead of the call instruction, but the
-argument marshalling as well.
-
-\QuickQuiz{
-	How much do these specializations really save?
-	Are they really worth it?
-}\QuickQuizAnswer{
-	The answer to the first question is left as an exercise to
-	the reader.
-	Try specializing the resizable hash table and see how much
-	performance improvement results.
-	The second question cannot be answered in general, but must
-	instead be answered with respect to a specific use case.
-	Some use cases are extremely sensitive to performance and
-	scalability, while others are less so.
-}\QuickQuizEnd
-
-All that aside, one of the great benefits of modern hardware compared
-to that available when I first started learning to program back in
-the early 1970s is that much less specialization is required.
-This allows much greater productivity than was possible back in the
-days of four-kilobyte address spaces.
-
-\subsection{Bits and Bytes}
-\label{sec:datastruct:Bits and Bytes}
-
-The hash tables discussed in this chapter made almost no attempt to conserve
-memory.
-For example, the \co{->ht_idx} field in the \co{ht} structure in
-\cref{lst:datastruct:Resizable Hash-Table Data Structures} on
-\cpageref{lst:datastruct:Resizable Hash-Table Data Structures}
-always has a value of either zero or one, yet takes up a full 32 bits
-of memory.
-It could be eliminated, for example, by stealing a bit from the
-\co{->ht_resize_key} field.
-This works because the \co{->ht_resize_key} field is large enough to
-address every byte of memory and the \co{ht_bucket} structure
-is more than one byte long, so that
-the \co{->ht_resize_key} field must have several bits to spare.
-
-This sort of bit-packing trick is frequently used in data structures
-that are highly replicated, as is the \co{page} structure in the Linux
-kernel.
-However, the resizable hash table's \co{ht} structure is not all that
-highly replicated.
-It is instead the \co{ht_bucket} structures we should focus on.
-There are two major opportunities for shrinking the \co{ht_bucket} structure:
-\begin{enumerate*}[(1)]
-\item Placing the \tco{->htb_lock} field in a low-order bit of one of the
-\tco{->htb_head} pointers and
-\item Reducing the number of pointers required.
-\end{enumerate*}
-
-The first opportunity might make use of bit-spinlocks in the Linux
-kernel, which are provided by the \path{include/linux/bit_spinlock.h}
-header file.
-These are used in space-critical data structures in the Linux kernel,
-but are not without their disadvantages:
-
-\begin{enumerate}
-\item	They are significantly slower than the traditional spinlock
-	primitives.
-\item	They cannot participate in the lockdep deadlock detection
-	tooling in the Linux kernel~\cite{JonathanCorbet2006lockdep}.
-\item	They do not record lock ownership, further complicating
-	debugging.
-\item	They do not participate in priority boosting in \rt\ kernels,
-	which means that preemption must be disabled when holding
-	bit spinlocks, which can degrade real-time latency.
-\end{enumerate}
-
-Despite these disadvantages, bit-spinlocks are extremely useful when
-memory is at a premium.
-
-One aspect of the second opportunity was covered in
-\cref{sec:datastruct:Other Resizable Hash Tables},
-which presented resizable hash tables that require only one
-set of bucket-list pointers in place of the pair of sets required
-by the resizable hash table presented in
-\cref{sec:datastruct:Non-Partitionable Data Structures}.
-Another approach would be to use singly linked bucket lists in
-place of the doubly linked lists used in this chapter.
-One downside of this approach is that deletion would then require
-additional overhead, either by marking the outgoing pointer
-for later removal
-or by searching the bucket list for the element being deleted.
-
-In short, there is a tradeoff between minimal memory overhead on
-the one hand, and performance and simplicity on the other.
-Fortunately, the relatively large memories available on modern
-systems have allowed us to prioritize performance and simplicity
-over memory overhead.
-However, even though the year~2022's pocket-sized smartphones sport
-many gigabytes of memory and its mid-range servers sport terabytes, it
-is sometimes necessary to take extreme measures to reduce memory overhead.
-
-\subsection{Hardware Considerations}
-\label{sec:datastruct:Hardware Considerations}
-
-Modern computers typically move data between CPUs and main memory in
-fixed-sized blocks that range in size from 32 bytes to 256 bytes.
-These blocks are called \emph{\IXpl{cache line}}, and are extremely important
-to high performance and scalability, as was discussed in
-\cref{sec:cpu:Overheads}.
-One timeworn way to kill both performance and scalability is to
-place incompatible variables into the same cacheline.
-For example, suppose that a resizable hash table data element had
-the \co{ht_elem} structure in the same cacheline as a frequently incremented
-counter.
-The frequent incrementing would cause the cacheline to be present at
-the CPU doing the incrementing, but nowhere else.
-If other CPUs attempted to traverse the hash bucket list containing
-that element, they would incur expensive cache misses, degrading both
-performance and scalability.
-
-\begin{listing}
-\begin{VerbatimL}
-struct hash_elem {
-	struct ht_elem e;
-	long __attribute__ ((aligned(64))) counter;
-};
-\end{VerbatimL}
-\caption{Alignment for 64-Byte Cache Lines}
-\label{lst:datastruct:Alignment for 64-Byte Cache Lines}
-\end{listing}
-
-One way to solve this problem on systems with 64-byte cache line is shown in
-\cref{lst:datastruct:Alignment for 64-Byte Cache Lines}.
-Here \GCC's \co{aligned} attribute is used to force the \co{->counter}
-and the \co{ht_elem} structure into separate cache lines.
-This would allow CPUs to traverse the hash bucket list at full speed
-despite the frequent incrementing.
-
-Of course, this raises the question ``How did we know that cache lines
-are 64 bytes in size?''
-On a Linux system, this information may be obtained from the
-\path{/sys/devices/system/cpu/cpu*/cache/} directories, and it is even
-possible to make the installation process rebuild the application to
-accommodate the system's hardware structure.
-However, this would be more difficult if you wanted your application to
-also run on a variety of environments, including non-Linux systems.
-Furthermore, even if you were content to run only on Linux, such a
-self-modifying installation poses validation challenges.
-For example, systems with 32-byte cachelines might work well, but
-performance might suffer on systems with 64-byte cachelines due
-to \IX{false sharing}.
-
-Fortunately, there are some rules of thumb that work reasonably well in
-practice, which were gathered into a 1995
-paper~\cite{BenjaminGamsa95a}.\footnote{
-	A number of these rules are paraphrased and expanded on here
-	with permission from Orran Krieger.}
-The first group of rules involve rearranging structures to accommodate
-cache geometry:
-
-\begin{enumerate}
-\item	Place read-mostly data far from frequently updated data.
-	For example, place read-mostly data at the beginning of the
-	structure and frequently updated data at the end.
-	Place data that is rarely accessed in between.
-\item	If the structure has groups of fields such that each group is
-	updated by an independent code path, separate these groups
-	from each other.
-	Again, it can be helpful to place rarely accessed data
-	between the groups.
-	In some cases, it might also make sense to place each such group
-	into a separate structure referenced by the original structure.
-\item	Where possible, associate update-mostly data with a CPU, thread,
-	or task.
-	We saw several very effective examples of this rule of thumb
-	in the counter implementations in
-	\cref{chp:Counting}.
-\item	Going one step further, partition your data on a per-CPU,
-	per-thread, or per-task basis, as was discussed in
-	\cref{chp:Data Ownership}.
-\end{enumerate}
-
-There has been some work towards automated trace-based rearrangement
-of structure
-fields~\cite{Golovanevsky:2010:TDL:2174824.2174835}.
-This work might well ease one of the more painstaking tasks
-required to get excellent performance and scalability from
-multithreaded software.
-
-An additional set of rules of thumb deal with locks:
-
-\begin{enumerate}
-\item	Given a heavily contended lock protecting data that is
-	frequently modified, take one of the following approaches:
-	\begin{enumerate}
-	\item	Place the lock in a different cacheline than the data
-		that it protects.
-	\item	Use a lock that is adapted for high contention, such
-		as a queued lock.
-	\item	Redesign to reduce lock contention.
-		(This approach is best, but is not always trivial.)
-	\end{enumerate}
-\item	Place uncontended locks into the same cache line as the data
-	that they protect.
-	This approach means that the cache miss that brings the
-	lock to the current CPU also brings its data.
-\item	Protect read-mostly data with \IXpl{hazard pointer}, RCU, or, for
-	long-duration critical sections, reader-writer locks.
-\end{enumerate}
-
-Of course, these are rules of thumb rather than absolute rules.
-Some experimentation is required to work out which are most applicable
-to a given situation.
-
 \section{Summary}
 \label{sec:datastruct:Summary}
 %
diff --git a/together/microopt.tex b/together/microopt.tex
new file mode 100644
index 00000000..2d5434a0
--- /dev/null
+++ b/together/microopt.tex
@@ -0,0 +1,262 @@
+% together/microopt.tex
+% mainfile: ../perfbook.tex
+% SPDX-License-Identifier: CC-BY-SA-3.0
+
+\section{Micro-Optimization}
+\label{sec:together:Micro-Optimization}
+%
+\epigraph{The devil is in the details.}{Unknown}
+
+The data structures shown in this book were coded straightforwardly,
+with no adaptation to the underlying system's cache hierarchy.
+In addition, many of the implementations used pointers to functions
+for key-to-hash conversions and other frequent operations.
+Although this approach provides simplicity and portability, in many
+cases it does give up some performance.
+
+The following sections touch on specialization, memory conservation,
+and hardware considerations.
+Please do not mistake these short sections for a definitive treatise
+on this subject.
+Whole books have been written on optimizing for a specific CPU, let
+alone to the wide variety of CPU families in common use today.
+
+\subsection{Specialization}
+\label{sec:together:Specialization}
+
+The resizable hash table presented in
+\cref{sec:datastruct:Non-Partitionable Data Structures}
+on
+\cpageref{sec:datastruct:Non-Partitionable Data Structures}
+used an opaque type for the key.
+This allows great flexibility, permitting any sort of key to be
+used, but it also incurs significant overhead due to the calls via
+of pointers to functions.
+Now, modern hardware uses sophisticated branch-prediction techniques
+to minimize this overhead, but on the other hand, real-world software
+is often larger than can be accommodated even by today's large
+hardware branch-prediction tables.
+This is especially the case for calls via pointers, in which case
+the branch prediction hardware must record a pointer in addition
+to branch-taken/branch-not-taken information.
+
+This overhead can be eliminated by specializing a hash-table implementation
+to a given key type and hash function, for example, by using C++ templates.
+Doing so eliminates the \co{->ht_cmp()}, \co{->ht_gethash()}, and
+\co{->ht_getkey()} function pointers in the \co{ht} structure shown in
+\cref{lst:datastruct:Resizable Hash-Table Data Structures} on
+\cpageref{lst:datastruct:Resizable Hash-Table Data Structures}.
+It also eliminates the corresponding calls through these pointers,
+which could allow the compiler to inline the resulting fixed functions,
+eliminating not only the overhead of the call instruction, but the
+argument marshalling as well.
+
+\QuickQuiz{
+	How much do these specializations really save?
+	Are they really worth it?
+}\QuickQuizAnswer{
+	The answer to the first question is left as an exercise to
+	the reader.
+	Try specializing the resizable hash table and see how much
+	performance improvement results.
+	The second question cannot be answered in general, but must
+	instead be answered with respect to a specific use case.
+	Some use cases are extremely sensitive to performance and
+	scalability, while others are less so.
+}\QuickQuizEnd
+
+All that aside, one of the great benefits of modern hardware compared
+to that available when I first started learning to program back in
+the early 1970s is that much less specialization is required.
+This allows much greater productivity than was possible back in the
+days of four-kilobyte address spaces.
+
+\subsection{Bits and Bytes}
+\label{sec:together:Bits and Bytes}
+
+The hash tables discussed in this chapter made almost no attempt to conserve
+memory.
+For example, the \co{->ht_idx} field in the \co{ht} structure in
+\cref{lst:datastruct:Resizable Hash-Table Data Structures} on
+\cpageref{lst:datastruct:Resizable Hash-Table Data Structures}
+always has a value of either zero or one, yet takes up a full 32 bits
+of memory.
+It could be eliminated, for example, by stealing a bit from the
+\co{->ht_resize_key} field.
+This works because the \co{->ht_resize_key} field is large enough to
+address every byte of memory and the \co{ht_bucket} structure
+is more than one byte long, so that
+the \co{->ht_resize_key} field must have several bits to spare.
+
+This sort of bit-packing trick is frequently used in data structures
+that are highly replicated, as is the \co{page} structure in the Linux
+kernel.
+However, the resizable hash table's \co{ht} structure is not all that
+highly replicated.
+It is instead the \co{ht_bucket} structures we should focus on.
+There are two major opportunities for shrinking the \co{ht_bucket} structure:
+\begin{enumerate*}[(1)]
+\item Placing the \tco{->htb_lock} field in a low-order bit of one of the
+\tco{->htb_head} pointers and
+\item Reducing the number of pointers required.
+\end{enumerate*}
+
+The first opportunity might make use of bit-spinlocks in the Linux
+kernel, which are provided by the \path{include/linux/bit_spinlock.h}
+header file.
+These are used in space-critical data structures in the Linux kernel,
+but are not without their disadvantages:
+
+\begin{enumerate}
+\item	They are significantly slower than the traditional spinlock
+	primitives.
+\item	They cannot participate in the lockdep deadlock detection
+	tooling in the Linux kernel~\cite{JonathanCorbet2006lockdep}.
+\item	They do not record lock ownership, further complicating
+	debugging.
+\item	They do not participate in priority boosting in \rt\ kernels,
+	which means that preemption must be disabled when holding
+	bit spinlocks, which can degrade real-time latency.
+\end{enumerate}
+
+Despite these disadvantages, bit-spinlocks are extremely useful when
+memory is at a premium.
+
+One aspect of the second opportunity was covered in
+\cref{sec:datastruct:Other Resizable Hash Tables}
+on
+\cpageref{sec:datastruct:Other Resizable Hash Tables},
+which presented resizable hash tables that require only one
+set of bucket-list pointers in place of the pair of sets required
+by the resizable hash table presented in
+\cref{sec:datastruct:Non-Partitionable Data Structures}
+on
+\cpageref{sec:datastruct:Non-Partitionable Data Structures}.
+Another approach would be to use singly linked bucket lists in
+place of the doubly linked lists used in this chapter.
+One downside of this approach is that deletion would then require
+additional overhead, either by marking the outgoing pointer
+for later removal
+or by searching the bucket list for the element being deleted.
+
+In short, there is a tradeoff between minimal memory overhead on
+the one hand, and performance and simplicity on the other.
+Fortunately, the relatively large memories available on modern
+systems have allowed us to prioritize performance and simplicity
+over memory overhead.
+However, even though the year~2022's pocket-sized smartphones sport
+many gigabytes of memory and its mid-range servers sport terabytes, it
+is sometimes necessary to take extreme measures to reduce memory overhead.
+
+\subsection{Hardware Considerations}
+\label{sec:together:Hardware Considerations}
+
+Modern computers typically move data between CPUs and main memory in
+fixed-sized blocks that range in size from 32 bytes to 256 bytes.
+These blocks are called \emph{\IXpl{cache line}}, and are extremely important
+to high performance and scalability, as was discussed in
+\cref{sec:cpu:Overheads}.
+One timeworn way to kill both performance and scalability is to
+place incompatible variables into the same cacheline.
+For example, suppose that a resizable hash table data element had
+the \co{ht_elem} structure in the same cacheline as a frequently incremented
+counter.
+The frequent incrementing would cause the cacheline to be present at
+the CPU doing the incrementing, but nowhere else.
+If other CPUs attempted to traverse the hash bucket list containing
+that element, they would incur expensive cache misses, degrading both
+performance and scalability.
+
+\begin{listing}
+\begin{VerbatimL}
+struct hash_elem {
+	struct ht_elem e;
+	long __attribute__ ((aligned(64))) counter;
+};
+\end{VerbatimL}
+\caption{Alignment for 64-Byte Cache Lines}
+\label{lst:together:Alignment for 64-Byte Cache Lines}
+\end{listing}
+
+One way to solve this problem on systems with 64-byte cache line is shown in
+\cref{lst:together:Alignment for 64-Byte Cache Lines}.
+Here \GCC's \co{aligned} attribute is used to force the \co{->counter}
+and the \co{ht_elem} structure into separate cache lines.
+This would allow CPUs to traverse the hash bucket list at full speed
+despite the frequent incrementing.
+
+Of course, this raises the question ``How did we know that cache lines
+are 64 bytes in size?''
+On a Linux system, this information may be obtained from the
+\path{/sys/devices/system/cpu/cpu*/cache/} directories, and it is even
+possible to make the installation process rebuild the application to
+accommodate the system's hardware structure.
+However, this would be more difficult if you wanted your application to
+also run on a variety of environments, including non-Linux systems.
+Furthermore, even if you were content to run only on Linux, such a
+self-modifying installation poses validation challenges.
+For example, systems with 32-byte cachelines might work well, but
+performance might suffer on systems with 64-byte cachelines due
+to \IX{false sharing}.
+
+Fortunately, there are some rules of thumb that work reasonably well in
+practice, which were gathered into a 1995
+paper~\cite{BenjaminGamsa95a}.\footnote{
+	A number of these rules are paraphrased and expanded on here
+	with permission from Orran Krieger.}
+The first group of rules involve rearranging structures to accommodate
+cache geometry:
+
+\begin{enumerate}
+\item	Place read-mostly data far from frequently updated data.
+	For example, place read-mostly data at the beginning of the
+	structure and frequently updated data at the end.
+	Place data that is rarely accessed in between.
+\item	If the structure has groups of fields such that each group is
+	updated by an independent code path, separate these groups
+	from each other.
+	Again, it can be helpful to place rarely accessed data
+	between the groups.
+	In some cases, it might also make sense to place each such group
+	into a separate structure referenced by the original structure.
+\item	Where possible, associate update-mostly data with a CPU, thread,
+	or task.
+	We saw several very effective examples of this rule of thumb
+	in the counter implementations in
+	\cref{chp:Counting}.
+\item	Going one step further, partition your data on a per-CPU,
+	per-thread, or per-task basis, as was discussed in
+	\cref{chp:Data Ownership}.
+\end{enumerate}
+
+There has been some work towards automated trace-based rearrangement
+of structure
+fields~\cite{Golovanevsky:2010:TDL:2174824.2174835}.
+This work might well ease one of the more painstaking tasks
+required to get excellent performance and scalability from
+multithreaded software.
+
+An additional set of rules of thumb deal with locks:
+
+\begin{enumerate}
+\item	Given a heavily contended lock protecting data that is
+	frequently modified, take one of the following approaches:
+	\begin{enumerate}
+	\item	Place the lock in a different cacheline than the data
+		that it protects.
+	\item	Use a lock that is adapted for high contention, such
+		as a queued lock.
+	\item	Redesign to reduce lock contention.
+		(This approach is best, but is not always trivial.)
+	\end{enumerate}
+\item	Place uncontended locks into the same cache line as the data
+	that they protect.
+	This approach means that the cache miss that brings the
+	lock to the current CPU also brings its data.
+\item	Protect read-mostly data with \IXpl{hazard pointer}, RCU, or, for
+	long-duration critical sections, reader-writer locks.
+\end{enumerate}
+
+Of course, these are rules of thumb rather than absolute rules.
+Some experimentation is required to work out which are most applicable
+to a given situation.
diff --git a/together/together.tex b/together/together.tex
index e54aa141..7cce31e6 100644
--- a/together/together.tex
+++ b/together/together.tex
@@ -21,15 +21,23 @@ refurbishes reference counting,
 helps with hazard pointers,
 \cref{sec:together:Sequence-Locking Specials}
 surmises on sequence-locking specials,
-and finally
 \cref{sec:together:RCU Rescues}
+and
 reflects on RCU rescues.
+Finally,
+although the best performance and scalability results from design rather
+than after-the-fact micro-optimization, micro-optimization is nevertheless
+necessary for the absolute best possible performance and scalability.
+Therefore,
+\cref{sec:together:Micro-Optimization}
+meanders through a few micro-optimizations.
 
 \input{together/count.tex}
 \input{together/refcnt.tex}
 \input{together/hazptr.tex}
 \input{together/seqlock.tex}
 \input{together/applyrcu.tex}
+\input{together/microopt.tex}
 % batching to trade off latency for perf/scale.
 
 \QuickQuizAnswersChp{qqztogether}
author	Paul E. McKenney <paulmck@kernel.org>	2023-10-01 16:24:24 -0700
committer	Paul E. McKenney <paulmck@kernel.org>	2023-10-01 16:24:24 -0700
commit	36a782e82f66210f213349d15308c8f3e38b9de8 (patch)
tree	943099761602232fa9aedfef54169dda2e5ed9a1
parent	3860345ab8d75a321b64062ea013e33591e69586 (diff)
download	perfbook-36a782e82f66210f213349d15308c8f3e38b9de8.tar.gz