o Julia language as parallel panacea?
    https://julialang.org/ and
    https://en.wikipedia.org/wiki/Julia_(programming_language)

@@@

o Move the "Why Memory Barriers?" appendix to precede the software
  memory-ordering chapter.
    o Store buffers before cache-coherency protocols.
    o C.1 Cache Structure: Update to 2022.  Check up on recent multi-socket
      systems.
        o Example cache layout for capacity miss?
        o Better illustration of the miss examples in the text starting with
          the paragraph beginning "The situation depicted in...".  Maybe use
          a smaller number of sets?  But hexadecimal simplifies things.
    o C.1 Cache Structure: Add subsections on store buffers and misordering.
      Pull examples from slidesets.
    o C.2 Cache-Coherence Protocols: Change "(now IBM)" to "(then IBM and now
      obsolete)" or similar.
    o C.3 Stores Result in Unnecessary Stalls: Move the portions independent
      of MESI to C.1.x.
    o Maybe leave the ordering-hostile-architecture section in the appendix,
      and see whether a shared-store-buffer architecture justifies LKMM.
    o Move 15.1 "Ordering: Why and How?" to the new hardware section, but
      leave the cheat sheet in Chapter 15.  Ditto 15.2.1 "Variables With
      Multiple Values".  Maybe just fold parts of Appendix C into the
      beginning of Chapter 15?
        o Old single-CPU cacheless systems.
        o Caches.
        o Store buffers.  QQ: Didn't store buffers come first?
          A: In some cases, yes.
        o Invalidation queues.
o Add a section on RCU design tradeoffs, perhaps extending it to cover hazard
  pointers and sequence locking.  Work with Joel on this item.
    o QSBR vs. rcu_read_lock() actually doing something.
    o Counter-based methods.
    o Time-based methods.
    o Encourage holdout readers?
    o resched_cpu()
    o Preemption disabling.
    o Priority boosting.
    o Enable help from preemption points and might-sleep calls.
    o Special cases for single-CPU non-preemptible situations.
    o Complex data structures:
        o Just wrap the update in sequence locking.
        o Issaquah Challenge.
        o Jon Walpole's and my algorithms (and their recent rediscovery).
o Consider adding kfifo to the NBS section.  Perhaps also atomic queues and
  stacks.  Reference hazard pointers.  LIFO push stack, definitely!!!
  (See the lock-free push sketch below.)  [DONE]
o Single-writer principle.  Mechanical sympathy.  ("Disruptor" presentation
  at LCA2013.)
    Martin Thompson: mechanical-sympathy.blogspot.com
    Mike Barker: bad-concurrency.blogspot.com
    github.com/LMAX-Exchange/disruptor
o Add gcc padding and alignment directives to toolsoftrade.
o Add RACES paper material, perhaps as another "future" section.  Also add
  notes from the RACES workshop.
o In cpu/hwfreelunch.tex, add an asynchronous-hardware citation from Ivan
  Sutherland or some such.
o In intro/intro.tex, in the "Use Existing Parallel Software" section, add
  citations for web-application serving and map-reduce.
o Consider refactoring the test code in the CodeSamples directory tree.
  There is probably some opportunity for common measurement code and also
  common argument-parsing code.
o Add a discussion of reader-writer locking to the locking chapter's
  implementation discussion: ticket rwlocks, brlock/lglock, unfair locks
  (and the reasons for them), a forward reference to seqlock, RCU lock, and
  barrier lock.
o Consider an HPC-style barrier section in the defer chapter.
o Consider expanding the dequeue discussion, especially on queues, in the
  data-structures chapter (e.g., limits of ordering guarantees).  Need hash
  tables there, of course!
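For the LIFO-push note above, here is a minimal lock-free push sketch using
the GCC/Clang __atomic builtins.  The names (struct node, list_push) are
illustrative only, not the book's eventual code, and pop is deliberately
omitted because safe reclamation needs hazard pointers, RCU, or similar, as
the item itself notes.

    struct node {
        struct node *next;
        int data;                       /* Illustrative payload. */
    };

    /* Push newnode onto the lock-free LIFO stack headed by *top.  Push-only
     * use is immune to ABA; a matching pop would need hazard pointers or
     * some other deferred-reclamation scheme. */
    static void list_push(struct node **top, struct node *newnode)
    {
        struct node *old = __atomic_load_n(top, __ATOMIC_RELAXED);

        do {
            newnode->next = old;        /* "old" is updated on CAS failure. */
        } while (!__atomic_compare_exchange_n(top, &old, newnode, 1,
                                              __ATOMIC_RELEASE,
                                              __ATOMIC_RELAXED));
    }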
o Data-race detection.  [Bart Van Assche January 2010]
    KCSAN, Helgrind, DRD, ThreadSanitizer, Intel Thread Checker, Intel
    Parallel Studio, DIOTA, ...
    Valgrind-based tools allow suppression patterns that span multiple call
    stack frames; other tools reportedly do not.
    Citation: J. Maebe, M. Ronsse, and K. De Bosschere.  "DIOTA: Dynamic
    instrumentation, optimization and transformation of applications."  In
    Proceedings of WBT-2002, Charlottesville, Virginia, USA, September 2002.
o Consider adding discussion of Amdahl's Law and Gustafson's Law to the
  introduction.  [Kay Muller]  (Standard forms are sketched below.)
o Consider adding something about VLIW and SIMD to hwhabits?  Maybe something
  more about the memory hierarchy?  [Kay Muller]
o Need a list somewhere of environments that allow you to avoid worrying
  about locking and threading: BOINC, map-reduce, SQL, sequential-program
  optimizations, ...
o Update "SMP Synchronization Design" to cover parallelism in general.
  Discuss partitioning up front.  Describe the two major cases, pipelining
  and data parallelism.  Code locking can be analogous to pipelining, but
  probably needs a separate section.
    o What is the form of the need for added performance?
        o Throughput in the face of many relatively small units of work.
          (Multiple instances.  Data parallelism.  DBMS.  Web serving.)
        o Throughput of large block(s) of work.  (Pipelining, or "series".
          Parallel decomposition, or "parallel".)
        o Latency of units of work that are large relative to the
          communications overhead.  (Overlapped pipelining, parallel
          decomposition.)
        o Latency of an aggregate of many units of work that are small
          relative to the communications overhead.  (Batch the units of work
          into larger units, then re-ask the question relative to the
          batches.)
        o Latency of units of work that are small relative to the
          communications overhead.  (Sequential-programming optimizations.
          Full-up redesign.  Approximations/heuristic techniques.
          Overclocking.  Implementation in hardware, e.g., FPGAs.)
    o Large programs or systems might have different needs in different
      places, and thus might need a combination of techniques.
    o Move the data-structures discussion to this chapter.
o Add the quote "weak ordering requires strong minds" and a dissection
  thereof to the memory-ordering discussion.
o Add a section to intro/hwhabits.tex giving an intuitive notion of how the
  cache-coherence protocol and the write buffer make atomic instructions
  expensive.  Draw from the URochester slide set (in
  paper/scalability/TR-TMeval or thereabouts).
o Find Herlihy's topology-analogy material and add hints to the
  formal-methods section.
o Fill out material.
o GPGPUs in the Data Ownership chapter.
o Fill out the "defer" section, starting with fork-join examples.  Relate to
  state-space complexity.
o Add material from the "is parallel programming hard" position paper.
  (After publication.)
x Pull in material from the LJ2003 RCU article and the dissertation
  demonstrating locking performance issues.
o Need a citation for Matlab*p.
x Add Steve's and my dyntick paper to the analysis section as an example of
  simpler examples.
x Add material from the differential-profiling paper.
x Memory barriers -- get the principles lined out, then show "bookend"
  implementations.
x Add an appendix with the mathematics for hard-realtime response using
  locks.  Or just cite Bjoern Brandenberg's dissertation.
x Create code examples for the SMPdesign material; fill out Hierarchical
  Locking, Resource Allocation Caches, and Performance Summary.
  MOSTLY DONE...
x Refocus the SMPdesign chapter on partitioning.
x Fill out RCU patterns, along with corresponding code and a performance
  summary.
    x RCU and atomic list movement (Josh?)
    x RCU and reference counts
    x Converting rwlocks to RCU
x Consider adding the history of RCU from the OSR paper (low priority).
o Other material Paul McKenney has from previous publications.
o Other material that must be generated or harvested from elsewhere.
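In case it helps when the Amdahl/Gustafson item above is taken up, the
standard textbook forms are as follows, with p the parallelizable fraction
and n the number of CPUs; how the introduction should actually present them
is of course still an open question.

    % Amdahl's Law: speedup of a fixed-size problem on n CPUs.
    S_{Amdahl}(n) = \frac{1}{(1 - p) + p/n}

    % Gustafson's Law: scaled speedup when the parallel part grows with n.
    S_{Gustafson}(n) = (1 - p) + p n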
o Note that pthread_mutex permits static initialization.  (Windows apparently
  does not.)
x Finish whymemorybarriers.tex
o Come up with a suitable API for the example code, so that a single program
  can be used for each example, covering the pthreads, Linux-kernel-module,
  and other C-language environments:

    double dgettimeofday(void);

    void spin_lock_init(spinlock_t *sp);  /* Must be called. */
    void spin_lock(spinlock_t *sp);
    int spin_trylock(spinlock_t *sp);
    void spin_unlock(spinlock_t *sp);

    thread_id_t create_thread(void *(*func)(void *), void *arg);
    void *wait_thread(thread_id_t tid);

  See CodeSamples/pthreads/api-pthreads.h for a start; a pthreads-only sketch
  also appears below.  Clearly Java will need its own code.

------------------------------------------------------------------------

x Add the various livejournal posts on memory ordering, especially the one on
  transitivity.  Consider also adding Will Deacon's January 6, 2011 QoS
  story.
x Fix the references in defer/rcu*.tex to "green" and "red" cells in the
  various tables.  Change them to refer to the symbols actually used.
x Verify versions of cartoons.  (Need whippersnapper in intro.)
x Upgrade the Quick Quiz scripts to remove material in the second parameter
  when aggregating into the "answers" section.  Trivialize by changing the
  required format to put the closing "}" before the end tag.  Then get rid of
  the special cases for code formatting in Quick Quiz answers.
x Move the API description to an appendix.
x Fill out material.
x Finish converting the WhyRCU LWN series.
x Add cautions to the analysis section.  Steve's and my "Integrating and
  Validating dynticks and Preemptable RCU" could be a good start, comparing
  to the rcutree.c dynticks implementation.
x Also add a "what to do instead of parallel programming" section.
x Move the formal-verification sections to an appendix.
x Add the dyntick verification.  (~/paper/KernelJanitor/RCU/dyntick/LWN)
x Fix rcutorture.h to correctly compute from time.  Allow the test time as a
  third argument.
x Reference counting.  Basic problem: make the data structure stick around
  while you are incrementing the reference count.  Strategies: lock,
  pre-existing reference, atomic check for zero reference count coupled with
  GC/RCU, various non-blocking-synchronization tricks.
x Copy the rcutree full data-structure diagram to the section describing the
  force_quiescent_state() callees' traversal of the leaf rcu_node structures.
  Decorate it with a right-facing arrow to show the search direction.  Or
  perhaps copy the diagram that shows the initialization state.
x Add \RevisedFrom to origpub.tex to allow something like: "The ideas
  discussed in Section ... were originally presented in XXX\cite{}."  Allow
  ranges of sections for OriginallyFrom.
x Discuss other textbooks in the intro.
x Figure out some way to get section headers between "Answers to Quick
  Quizzes" for different chapters.  Want to avoid a section header for any
  chapter that has no Quick Quizzes.  Not a big deal, but nice to have.
x Add a "Tools of the Trade" chapter before "SMP Synchronization Design"
  giving an overview of how locking and atomic operations work.  Need simple
  fork/join, along with foreshadowing of how good design can simplify
  implementation.
x Add variables to the rest of the examples in toyrcu.tex.
x In the double-ended-queue code, consider renaming the "enqueue" operations
  to "push" and the "dequeue" operations to "pop".
x Need a chapter on counting.
x Add the separate-thread approach suggested unwittingly by Christoph Lameter
  to the count.tex chapter.
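A minimal pthreads-only sketch of the proposed API above, assuming POSIX
spinlocks are available; CodeSamples/pthreads/api-pthreads.h is the stated
starting point and will differ in detail, and the kernel-module flavor would
instead map these names onto the in-kernel primitives.

    #include <pthread.h>
    #include <stdlib.h>
    #include <sys/time.h>

    typedef pthread_spinlock_t spinlock_t;
    typedef pthread_t thread_id_t;

    double dgettimeofday(void)          /* Seconds since the epoch. */
    {
        struct timeval tv;

        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1000000.0;
    }

    void spin_lock_init(spinlock_t *sp) /* Must be called before first use. */
    {
        if (pthread_spin_init(sp, PTHREAD_PROCESS_PRIVATE) != 0)
            abort();
    }

    void spin_lock(spinlock_t *sp)      { pthread_spin_lock(sp); }
    int  spin_trylock(spinlock_t *sp)   { return !pthread_spin_trylock(sp); }
    void spin_unlock(spinlock_t *sp)    { pthread_spin_unlock(sp); }

    thread_id_t create_thread(void *(*func)(void *), void *arg)
    {
        thread_id_t tid;

        if (pthread_create(&tid, NULL, func, arg) != 0)
            abort();
        return tid;
    }

    void *wait_thread(thread_id_t tid)  /* Returns the thread's exit value. */
    {
        void *retval;

        if (pthread_join(tid, &retval) != 0)
            abort();
        return retval;
    }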
x Embed fonts from .eps files: http://www.wkiri.com/today/?p=60
  See the sed script:

    cat original_graphics.eps | sed 's+Times-Bold+NimbusSanL-Bold+g' |\
        sed 's+Times-Roman+NimbusSanL-Regu+g' |\
        sed 's+Times+NimbusSanL-Regu+g' |\
        sed 's+Helvetica-BoldOblique+NimbusSanL-BoldItal+g' |\
        sed 's+Helvetica-Oblique+NimbusSanL-ReguItal+g' |\
        sed 's+Helvetica-Bold+NimbusSanL-Bold+g' |\
        sed 's+Helvetica-Bold-iso+NimbusSanL-Bold+g' |\
        sed 's+Helvetica+NimbusSanL-Regu+g' |\
        sed 's+Helvetica-iso+NimbusSanL-Regu+g' |\
        sed 's+Symbol+StandardSymL+g' > new_graphics.eps

  And the additional conversions:

    Courier              NimbusMonL-Regu
    Courier-Bold         NimbusMonL-Bold
    Courier-Oblique      NimbusMonL-ReguObli
    Courier-BoldOblique  NimbusMonL-BoldObli

x Change calls to bzero() to memset().  [Horst]
x dvipdf is quite slow, especially on 1-up setups:

    time dvips perfbook
        real 0m3.405s   user 0m3.184s   sys 0m0.224s
    time psnup -2 perfbook.ps perfbook-2up.ps
        real 0m0.852s   user 0m0.248s   sys 0m0.308s
    time ps2pdf perfbook-2up.ps perfbook-2up.pdf
        real 0m25.383s  user 0m24.646s  sys 0m0.276s
    time dvipdf perfbook
        real 1m31.456s  user 0m48.347s  sys 0m45.599s

  Could parallelism be profitably applied here?  ;-)  Or perhaps just careful
  avoidance of quadratic algorithms?  Use pdflatex instead, much faster!
x Explicitly permit reformatted copies (single-column, ebook, etc.).
x Change from "preemptable" to "preemptible".  [Horst]
x Produce an HTML version.  One possibility is latex2html plus scripting, and
  tamasrepus suggests http://johnmacfarlane.net/pandoc/.  Challenges include
  the scripting and the special latex macros that handle the quick quizzes
  and the illustration credits.
x Add a discussion of seqlock to the defer chapter between refcnt and RCU.
  Use a trivial example showing consistency.  Note its inapplicability to
  pointer-based data structures.  (A user-space sketch appears below.)
x Update the chain of RCU implementations to add http://lttng.org/urcu.
x Line up reviewers.  Need a range of viewpoints, and especially a range of
  familiarity with SMP coding.  [Now that the book is public, just let them
  line themselves up!]
x Work out who else to invite to contribute.  Possibilities include:  [Now
  that the book is public, just let them line themselves up!]
    o Real-life use of RCU: Dipankar Sarma, Steve Hemminger, Maneesh Soni,
      David Howells, ...
    o Nick Piggin: radix tree.
    o Robert Ohlsson: fib trie.
    o Steve Hemminger: brlock replacement.
    o SELinux guys.
    o Extreme scalability: Christoph Lameter, Jesse Barnes, other SGI and
      ex-SGI folks.
    o Various architecture maintainers for issues on abstract CPUs and other
      hardware issues.
    o Chinese angle: guys from Tsinghua U. in Beijing.
    o Stuff from Rusty's Unreliable Guides.
    o Rusty Commentary version.
    o Tom Hart on RCU performance measurement.  Perhaps also on hazard
      pointers (maybe Maged Michael as well or instead).
    o Josh Triplett on RCU error checking in sparse.
    o Val Henson on academic research on parallel SMP.
    o http://globaltext.org/ -- mailto:rwatson@terry.uga.edu,
      +1-706-542-3706.  "Wiki-based" textbooks.  This is the Terry College of
      Business at the U. of Georgia, but what is wrong with a little
      cross-pollination?
  The usual experience is that somewhere between 20% and 50% of the people
  who get excited about contributing really will do so.
x Need to make a chapter on locking (as opposed to the current one on
  synchronization).  [IN PROGRESS]
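For the seqlock item above, a user-space sketch of the idea using C11 atomics
and fences; the names (struct pair, pair_write, pair_read) are illustrative,
a single writer is assumed, and the Linux kernel's seqlock_t differs in
detail.

    #include <stdatomic.h>

    /* Sequence-locked pair of values; an odd ->seq means a write is in
     * progress.  The data fields are accessed with relaxed atomics to avoid
     * C11 data races. */
    struct pair {
        atomic_uint seq;
        atomic_int a;
        atomic_int b;
    };

    static void pair_write(struct pair *p, int a, int b)  /* Single writer. */
    {
        unsigned int s = atomic_load_explicit(&p->seq, memory_order_relaxed);

        atomic_store_explicit(&p->seq, s + 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);
        atomic_store_explicit(&p->a, a, memory_order_relaxed);
        atomic_store_explicit(&p->b, b, memory_order_relaxed);
        atomic_store_explicit(&p->seq, s + 2, memory_order_release);
    }

    static void pair_read(struct pair *p, int *a, int *b)
    {
        unsigned int s1, s2;

        do {                            /* Retry if a write intervened. */
            s1 = atomic_load_explicit(&p->seq, memory_order_acquire);
            *a = atomic_load_explicit(&p->a, memory_order_relaxed);
            *b = atomic_load_explicit(&p->b, memory_order_relaxed);
            atomic_thread_fence(memory_order_acquire);
            s2 = atomic_load_explicit(&p->seq, memory_order_relaxed);
        } while (s1 != s2 || (s1 & 1));
    }

As the item says, this works for copyable values but not for pointer-based
structures, where a reader might follow a pointer that the writer has since
freed.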
x Consider using parallel maze solving as an example.  See:
  http://www.futurechips.org/tips-for-power-coders/parallel-programming-tutorial-parallelizing-somewhat-complex-code-part-1.html
  A better approach might assign one processor to each end of the maze.
  Chopping the maze up into blocks might help, but this would likely
  encounter pathological cases with lots of boundary crossing.  Another
  approach is to start each CPU at a random point in the maze (in addition to
  the one each at the start and the end).  Give each CPU a color, and when
  there is a path from beginning to end, you are done.  Possibly an
  application for parallel union-find.  ;-)
x Update the TM section based on the OSR "grass is greener" paper and the
  relevant paragraph from the urcu paper.
x Move the OSR RCU history paper to an appendix.
x Add a mechanism to automatically display authorhood.
x My local view adds a modified book.cls that adds the bibliography to the
  table of contents.  However, the license on book.cls forbids distributing
  it without also distributing latex.  Need to come up with a better way!  In
  the meantime, edit your local copy of book.cls and remove the "*" following
  the "chapter" directive in the "thebibliography" macro.
x Add support for real 64-bit x86 builds.
x Use "inkscape --export-pdf=fn.pdf fn.svg" to create PDFs from inkscape .svg
  files.  Alternatively, "inkscape --export-pdf=fn.pdf --export-latex fn.svg"
  produces an fn.pdf_tex file that allows text in the diagram to be typeset
  by Latex, allowing things like references and citations to be automatically
  generated within inkscape drawings.
x Add a Hazard-Pointer section between the refcnt and seqlock sections of the
  Deferred Processing chapter.
x Add SNZI to the counting chapter.
x In the intro/intro.tex quick-quiz answer citing Herb Sutter's "Effective
  Concurrency" column, also cite Anthony Williams's book.
x Put RCUApplicability.eps into the defer chapter.
x Change "non-idempotent" to "irrevocable" and "idempotent" to "revocable" in
  the TM discussion.
x Add a chapter on performance analysis and benchmarking.
X Move the memory-allocation code from "Partitioning and Synchronization
  Design" to "Data Ownership"?
x Cite http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf as a
  reference on CUDA, MPI, etc.
x Change ASCII double quotes to TeX quotes.  [Horst]
x Use the existence problem as motivation for RCU.  Show some solutions in
  the locking chapter.  See the day-2 ACACES slideset.  See also the RCU
  tutorial and the SC22WG14 email.  One issue is the forward-progress
  guarantee made by each locking primitive (an irq-handler sketch appears
  below):
    o spin_lock_irqsave(): Critical sections must guarantee forward progress
      against everything except NMI handlers.
    o raw_spin_lock(): Critical sections must guarantee forward progress
      against everything except IRQ (including softirq) and NMI handlers.
    o spin_lock(): Critical sections must guarantee forward progress against
      everything except IRQ (again including softirq) and NMI handlers and
      (given CONFIG_PREEMPT_RT) higher-priority realtime tasks.
    o mutex_lock(): Critical sections need not guarantee forward progress, as
      general blocking is permitted.
  The other issue is the scope of the lock.  The Linux kernel has the
  following:
    o BKL: global scope.
    o Everything else: scope defined by the use of the underlying lock
      variable.
  One of the many reasons that we are trying to get rid of the BKL is that it
  combines global scope with relatively weak forward-progress guarantees.
x First edition changes:
    X Move "Data Structures" to precede "Applying RCU".  Add a tree or two.
    x Change the "Applying RCU" name to "Putting It All Together".  Add
      examples from dcache, semaphore mapping, ...
  Publish v0.9-yyyy.mm.dd, call for comments.  Best 3? sets of comments get a
  dead-tree signed first edition, which will be v1.0-yyyy.mm.dd.
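Relating to the forward-progress notes on the locking primitives above, a
Linux-kernel-module-style fragment sketching why spin_lock_irqsave() is the
variant needed when a lock is shared with an interrupt handler; my_lock,
my_count, my_update, and my_isr are illustrative names, not code from the
kernel or from the book.

    #include <linux/spinlock.h>
    #include <linux/interrupt.h>

    static DEFINE_SPINLOCK(my_lock);
    static unsigned long my_count;

    /* Process context: interrupts must be masked while the lock is held,
     * lest my_isr() on this CPU spin forever on a lock we already hold. */
    static void my_update(void)
    {
        unsigned long flags;

        spin_lock_irqsave(&my_lock, flags);
        my_count++;
        spin_unlock_irqrestore(&my_lock, flags);
    }

    /* Hard-IRQ context: interrupts are already masked on this CPU, so plain
     * spin_lock() suffices here. */
    static irqreturn_t my_isr(int irq, void *dev)
    {
        spin_lock(&my_lock);
        my_count += 2;
        spin_unlock(&my_lock);
        return IRQ_HANDLED;
    }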
x Consider linking the quick quizzes to their answers.
X Convert bloatwatch RCU and add it to appendix/rcuimpl.
X RCU implementations appendix:
    x RCU requirements/desiderata.  (See the paper on the desk at home for a
      start, along with the usual papers.)  Also look in the advsync/rcu.tex
      section on RCU, which needs to be folded into the sections in defer/.
    o Uniprocessor RCU implementation.  Simplify the current one so that the
      rcu and rcu_bh implementations are identical, as appropriate for a
      device not subject to networking DoS attacks.
    x Toy implementations.  Follow wikipedia, but add the reference-count
      approach.
    x Performance and scalability for the toy versions.
    x User-level RCU implementations (after linux.conf.au 2008).
    o Non-realtime kernel implementations.  Take from the dissertation and
      from the LWN articles.  Or, given treercu, just cite the dissertation.
    x Realtime kernel implementations.
    o RCU priority boosting.  I suppose getting a good boosting
      implementation would help...
    x Hierarchical implementation (after November 20, 2008, due to the LWN
      subscriber-only period).  Also propagate references into
      appendix/formal/dyntickrcu.tex.  Also add a code walkthrough.  You will
      need to do the walkthrough to find rcutree bugs anyway...
    o Derive RCU requirements from the list of examples, and propagate the
      list up to the defer chapter.  (rcufundamental.tex)
    o Crisply present the advantages and disadvantages of RCU in the defer
      chapter.
x https://lwn.net/Articles/667593/ to QC futures.  Plus
  https://lwn.net/Articles/667720/.
x Expand the SAT-solver section with recent experience.
x Move intro/hwhabits.tex to cpu/.  Add sections with increasingly detailed
  pictures of what a CPU does, including the speed-of-light round-trip time
  (~3 cm, or a bit more than an inch) at 5 GHz (back-of-the-envelope check
  below).  3D fabrication might help, but watch the fabrication cost and the
  heat dissipation!!!  Also, electrons travel from 3 to 30 times more slowly
  in silicon than light does in a vacuum, and even more slowly if there are
  levels of clocked logic in the way.  Include actual numbers/graphs from
  real machines.  Show the effects of various forms of misbehavior.
x Add Makefiles (and comments) to the CodeSamples directories.  (defer and
  SMPdesign done.)
x Update the Test/perf.gpl set of tests for atomic operations and cache-line
  movement and add it to CodeSamples/cpu, then rerun it to create a new table
  of atomic operations.
X Reorganize the memory-ordering introduction with Michel Dagenais's feedback
  in mind.  See the email dated August 20, 2012.
x Add more terms to the vocabulary.  More TM stuff, for example.  Also RCU
  nomenclature [done].
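A back-of-the-envelope check of the speed-of-light figure in the hwhabits
item above, using the vacuum speed of light and ignoring the slower
propagation in silicon that the item also mentions:

    T = \frac{1}{5\,\mathrm{GHz}} = 0.2\,\mathrm{ns}
    c \cdot T \approx 3 \times 10^{8}\,\mathrm{m/s} \times 0.2\,\mathrm{ns}
              = 6\,\mathrm{cm}
    \text{round trip: } 6\,\mathrm{cm} / 2 = 3\,\mathrm{cm}
              \approx 1.2\,\mathrm{in}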
------------------------------------------------------------------------

PROGRESS

August 5, 2006:      76 pages
August 12, 2006:     84 pages
August 20, 2006:     90 pages
September 3, 2006:  113 pages
September 17, 2006: 123 pages
    -- LJ paper, discussions on memory barriers --
October 11, 2006:   111 pages
November 5, 2006:   135 pages
October 20, 2008:   176 pages
October 30, 2008:   199 pages
November 7, 2008:   205 pages
    Change page-layout parameters...
November 7, 2008:   168 pages
November 9, 2008:   200 pages
November 16, 2008:  220 pages
November 30, 2008:  232 pages
December 8, 2008:   250 pages
December 23, 2008:  266 pages
January 13, 2009:   276 pages
January 31, 2009:   288 pages
February 28, 2009:  292 pages
May 24, 2009:       302 pages
July 3, 2009:       308 pages
February 21, 2010:  342 pages
February 21, 2010:  354 pages
    Change page-layout parameters...
February 20, 2011:  379 pages
March 23, 2011:     Test edition posted on lulu.com:
    http://www.lulu.com/content/paperback-book/is-parallel-programming-hard-and-if-so-what-can-you-do-about-it/10357070