How Can Dismal Designs Be Dealt With?

The good news is that there are a number of techniques that may be applied to inherently sequential designs. The bad news is that there are a number of techniques that may be applied to inherently sequential designs — and none of them are guaranteed to work in all cases, at least given the ever-present constraints on time, effort, and cost.

Here are a few such techniques:

  1. In theory, one could design hardware such that the overhead of all CPUs concurrently incrementing a single global variable rises only as the logarithm of the number of CPUs rather than linearly with the number of CPUs. If transistor density continues to rise in accordance with Moore's Law, we might yet see such systems. That said, while logarithmic complexity is more attractive than linear complexity, we really want constant complexity.
  2. Each thread can atomically increment the global variable by a large constant (for example, 256), and then use all of the numbers in the corresponding range (a sketch of this approach appears just after this list). This is an example of batching. The idea behind batching is that if you are going to undertake an expensive operation such as an atomic increment, you should get your money's worth, for example, by processing multiple transactions with a single increment, or by passing convoys of messages with a single classification operation.
  3. Ensure that the individual units of work (for example, transactions or messages) are so large that the overhead of the atomic increment is insignificant by comparison. This approach is similar to batching, but at a different level of abstraction.
  4. Run multiple instances of your sequential program, perhaps using any of a number of virtualization techniques.
  5. Replace the problematic portions of your program with optimized parallel library functions or applications.
  6. If feasible, replace your synchronization with something lighter weight and more scalable.
  7. Re-design and rewrite the application from scratch to be more parallel-friendly.

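To make item 2 concrete, here is a minimal sketch of batched ID allocation using C11 atomics. The batch size of 256 comes from the example above; the function and variable names are illustrative rather than taken from any particular codebase.

    #include <stdatomic.h>
    #include <stdint.h>

    #define ID_BATCH_SIZE 256

    /* The only shared variable, updated just once per batch of 256 IDs. */
    static atomic_uint_fast64_t global_next_id;

    /* Per-thread range of IDs already claimed from the global counter. */
    static _Thread_local uint64_t local_next_id;
    static _Thread_local uint64_t local_id_limit;

    /* Return a unique ID, touching the global counter only when this
     * thread's current batch has been used up. */
    static uint64_t get_unique_id(void)
    {
        if (local_next_id == local_id_limit) {
            /* One expensive atomic operation claims 256 IDs at once. */
            local_next_id = atomic_fetch_add(&global_next_id, ID_BATCH_SIZE);
            local_id_limit = local_next_id + ID_BATCH_SIZE;
        }
        return local_next_id++;
    }

Each call to get_unique_id() is thus ordinary thread-local arithmetic, and the cost of the atomic_fetch_add() is amortized over 256 IDs.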
Most of all, be patient and plan ahead. If your program contains decades of sequential assumptions in its design, APIs, and code, you probably aren't going to be able to fix them all in a single weekend, no matter how much coffee (or, in Europe, beer) you drink. Measure the performance and scalability, locate the bottlenecks, and fix the worst first. Always keep in mind the possibility of redesigning based on existing scalable code. For example, perhaps you can replace much of your application with a database, perhaps you can use a parallel sort library, perhaps you can simply run multiple single-threaded instances of your application, and so on.

But this still leaves an important question unanswered: are existing single-threaded designs, APIs, and code really a problem?

The answer is, as always, “it depends.” What it depends on is how much work is done per global shared-memory update and how many CPUs are concurrently running in the application. The following plot shows data from a crude queueing-model analysis of the situation:

[Figure: synchronization efficiency versus number of CPUs; more CPUs means less synchronization efficiency.]

Synchronization efficiency (the y axis) is the ratio of the time required for the work at hand to the time required for both the work and the synchronization. If your application is to scale well, its synchronization efficiency must be close to 1.0. Putting it another way, if your application's synchronization efficiency is 0.1 (or, if you prefer, 10%), then your application is wasting 90% of the CPU time it is using, requiring no fewer than ten CPUs to achieve the same throughput that a single CPU would achieve given a single-threaded version of your application.
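In symbols (with T_work and T_sync introduced here purely for illustration, denoting the per-unit work time and the synchronization time), this definition and the ten-CPU claim can be written as:

    % Definition of synchronization efficiency, as stated above.
    \[
      e = \frac{T_{\mathrm{work}}}{T_{\mathrm{work}} + T_{\mathrm{sync}}}
    \]
    % Each CPU delivers a fraction e of its time as useful work, so N CPUs
    % deliver N e CPUs' worth of throughput.  Matching a single CPU running
    % a single-threaded version of the application therefore requires:
    \[
      N e \ge 1 \quad\Longrightarrow\quad
      N \ge \frac{1}{e} = \frac{1}{0.1} = 10.
    \]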

Each trace corresponds to a unit of work whose average overhead is the labeled multiple of the synchronization overhead. So, if a unit of work takes about ten times as long as an atomic increment, synchronization efficiency drops off dramatically as the number of CPUs approaches ten.

Looking at the trace labeled “100”, we see that if the application is to scale to anywhere near 100 CPUs, each unit of work must consume more than 100 times as much time as a single synchronization operation. On modern CPUs, this means that each unit of work must entail many thousands of instructions.

In short, if you were thinking of scaling your application by handing a single “add” instruction out to each thread while using an atomic increment of a global variable for synchronization, you really need to think again. You will instead need about 100,000 “add” instructions per unit of work to scale to 100 CPUs. If you want to scale to more than 100 CPUs, you will need more than 100,000 “add” instructions.

Alternatively, you will need to use something considerably more scalable for your synchronization than that atomic increment of a global variable, for example, one of the techniques listed above.
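As one illustration of such lighter-weight synchronization (item 6 in the list above), and for uses where an exact instantaneous value is not required (for example, event counting), the single global counter can be split into per-thread counters so that the update fast path never contends for a shared variable. The following is a minimal sketch assuming a fixed maximum number of threads and caller-supplied thread IDs; the names and sizes are illustrative.

    #include <stdatomic.h>
    #include <stdint.h>

    #define MAX_THREADS 128

    /* One slot per thread.  A production version would pad each slot out to
     * its own cache line to avoid false sharing between neighboring threads. */
    static _Atomic uint64_t per_thread_count[MAX_THREADS];

    /* Fast path, called by thread 'tid': updates only that thread's slot,
     * so there is no contention on a single global variable. */
    static inline void count_event(int tid)
    {
        atomic_fetch_add_explicit(&per_thread_count[tid], 1,
                                  memory_order_relaxed);
    }

    /* Slow path: sum the slots for an approximate total.  The result may
     * miss updates that are in flight while the sum is being taken. */
    static uint64_t read_count(int nthreads)
    {
        uint64_t sum = 0;

        for (int i = 0; i < nthreads; i++)
            sum += atomic_load_explicit(&per_thread_count[i],
                                        memory_order_relaxed);
        return sum;
    }

Because the update path never touches a variable shared with other threads, its cost stays roughly constant as CPUs are added, at the price of a total that is only approximate while updates are in flight.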