IOMMUs, "IO Memory Management Units", are hardware devices that translate device DMA addresses to machine addresses. Isolation-capable IOMMUs perform a valuable system service, preventing rogue devices from performing errant or malicious DMAs, thereby substantially increasing the system's reliability and availability. Without an IOMMU, a peripheral device could be programmed to overwrite any part of the system's memory. An isolation-capable IOMMU restricts a device so that it can only access parts of memory it has been explicitly granted access to. Operating systems utilize IOMMUs to isolate device drivers; hypervisors utilize IOMMUs to grant secure direct hardware access to virtual machines. With the imminent publication of the PCI-SIG's IO Virtualization standard, as well as Intel and AMD's introduction of isolation-capable IOMMUs in all new servers, IOMMUs will become ubiquitous.
IOMMUs can impose a performance penalty due to the extra memory accesses required to perform DMA operations. The exact performance degradation depends on the IOMMU design, its caching architecture, the way it is programmed and the workload. In this paper, we present the performance characteristics of the Calgary and DART IOMMUs in Linux, both on bare metal and hypervisors. We measure the throughput and CPU utilization of several IO workloads with and without an IOMMU and analyze the results. We then discuss potential strategies for mitigating the IOMMU's costs. We conclude by presenting a set of optimizations we have implemented and the resulting performance improvements.
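The translation described above can be illustrated with a toy model: a per-device table maps device DMA pages to machine pages, and any access outside the table is rejected. This is a purely illustrative sketch with invented names (`ToyIOMMU`, etc.), not the Calgary or DART programming interface.

```python
PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT

class ToyIOMMU:
    """Toy model of an isolation-capable IOMMU (illustrative only)."""

    def __init__(self):
        self.table = {}  # device DMA page number -> machine page number

    def map(self, dma_addr, machine_addr):
        # Grant the device access to one page of machine memory.
        self.table[dma_addr >> PAGE_SHIFT] = machine_addr >> PAGE_SHIFT

    def translate(self, dma_addr):
        # The extra lookup here is the source of the overhead the paper
        # measures; real IOMMUs cache translations to mitigate it.
        page = self.table.get(dma_addr >> PAGE_SHIFT)
        if page is None:
            raise PermissionError("DMA to unmapped address blocked")
        return (page << PAGE_SHIFT) | (dma_addr & (PAGE_SIZE - 1))
```

A DMA to a mapped page is translated; a DMA anywhere else is rejected rather than silently overwriting memory, which is the isolation property the abstract describes.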
With Linux support for the Sony PS3, the IBM QS2x blades, and the Toshiba Celleb platform having hit mainstream Linux distributions, programming for the Cell BE is becoming increasingly interesting for developers of high-performance computing applications. This talk is about the concepts of the architecture and how to develop applications for it.
Most importantly, there will be an overview of new feature additions and latest developments, including:
Preemptive scheduling on SPUs (finally!): While it has been possible to run concurrent SPU programs for some time, only a very limited version of the scheduler was implemented. Now we have a full time-slicing scheduler with normal and real-time priorities, SPU affinity and gang scheduling.
Using SPUs for offloading kernel tasks: There are a few compute-intensive tasks like RAID-6 or IPsec processing that can benefit from running partially on an SPU. Interesting aspects of the implementation are how to balance kernel SPU threads against user processing, how to efficiently communicate with the SPU from the kernel, and measurements to see if it is actually worthwhile.
Overlay programming: One significant limitation of the SPU is the size of the local memory that is used for both its code and data. Recent compilers support overlays of code segments, a technique widely known in the previous century but mostly forgotten in Linux programming nowadays.
This paper will discuss the difficulties and methods involved in debugging the Linux kernel. Intermittent errors that occur once every few years on a single machine are hard to debug ... but become a constant problem when running across thousands of machines simultaneously. The more we scale to very large clusters, the more critical reliability becomes. In such environments, many of the normal debugging luxuries are gone (like a serial console, or any physical access), and we're forced to change to a different strategy to solve thorny intermittent race conditions.
We need (and have created) powerful but lightweight kernel tracing tools that are critical for cluster debugging, but also make powerful weapons in a smaller scale environment, where they can help debug issues more quickly and less intrusively. Real world usage examples will be included.
Cache memory compression (or compressed caching) was originally developed for desktop and server platforms but has also attracted interest on embedded systems, where memory is generally a scarce resource and hardware changes bring more costs and energy consumption. Cache memory compression brings a considerable advantage in input-output intensive applications by means of using a virtually larger cache for the local file system through compression algorithms. As a result, it increases the probability of fetching the necessary data in RAM itself, avoiding the need to make slow calls to local storage. This work evaluates an Open Source implementation of cache memory compression applied to Linux on an embedded platform, dealing with the unavoidable processor and memory resource limitations as well as with existing architectural differences.
We will describe the Compressed Cache (CCache) design, the compression algorithm used, memory behavior tests, performance and power consumption overheads, and CCache tuning for embedded Linux.
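The "virtually larger cache" argument above can be sketched in a few lines: compressing pages lets more of them fit in the same RAM budget, raising the probability that a request is served from memory. The function below is a hypothetical illustration, with zlib standing in for whatever algorithm CCache actually uses; it is not the CCache implementation.

```python
import zlib

def pages_fitting(pages, budget_bytes, page_size=4096):
    """Return (uncompressed_count, compressed_count): how many of the given
    pages fit in budget_bytes without and with compression."""
    uncompressed = budget_bytes // page_size
    used = 0
    compressed = 0
    for page in pages:
        c = len(zlib.compress(page))  # stand-in for the CCache algorithm
        if used + c > budget_bytes:
            break
        used += c
        compressed += 1
    return uncompressed, compressed
```

For highly compressible data (such as zero-filled or text-heavy pages) the compressed count can be several times the uncompressed one, which is exactly the effect that reduces slow calls to local storage; the cost, measured in the paper, is the CPU time spent compressing and decompressing.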
Major Linux distributors have been shipping ACPI in Linux for several years, yet mis-perceptions about ACPI persist in the Linux Community. This paper addresses the top 10 myths about ACPI in Linux.
A broad range of Linux users, administrators, and developers will understand and benefit from this presentation without any in-depth knowledge of ACPI.
Subtitle: Genesis and Status
Subtitle: A Short Overview
Subtitle: How Fast is it Going, Who is Doing It, What They are Doing, and Who is Sponsoring It
Subtitle: a large scale cross-platform desktop application
Per-task delay accounting is a new function of the Linux kernel which measures where Linux tasks spend time waiting (for CPU time, completion of submitted I/O, resolving page faults, etc).
Subtitle: Documenting and Automating Collateral Evolutions in Linux Device Drivers
Until now, most of the focus in Linux CPU power management has been on active CPU power management: cpufreq changes the processor frequency and/or voltage, managing CPU performance levels and power consumption based on CPU load. Another dimension of CPU power management is CPU idle power. In general, more focus is now shifting towards idle power (Energy Star), and new platforms/processors support multiple idle states with different power and wakeup latency characteristics. Today most mobile processors support multiple idle states with varying amounts of power consumed in those idle states, and each state has an associated entry/exit latency. This emphasis on idle power necessitates a generic Linux kernel framework to manage idle CPUs.
This paper covers 'cpuidle' - an effort toward a generic idle framework in the Linux kernel. The goal is to have a clean interface for any platform to make use of different CPU idle levels and also to provide abstraction between idle-drivers and idle-governors allowing for independent development. The target audience includes those who have a general interest in idle processor power management and its impact on battery life, developers who would like to create new and better governors, and developers interested in utilizing the cpuidle infrastructure on new platforms.
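The core decision such a framework delegates to an idle-governor can be sketched as: pick the deepest idle state whose exit latency still fits both the predicted idle duration and any system latency requirement. The following is a hypothetical sketch of that logic (function name and tuple layout invented, not the cpuidle API):

```python
def select_idle_state(states, predicted_idle_us, latency_req_us):
    """states: list of (name, power_mw, exit_latency_us) tuples, ordered
    from shallowest to deepest. Returns the index of the chosen state."""
    best = 0  # the shallowest state is always safe
    for i, (_name, _power_mw, exit_latency_us) in enumerate(states):
        # A deeper state saves more power, but only pays off if we expect
        # to stay idle longer than its wakeup cost, and only if waking up
        # that slowly is tolerable for the workload.
        if (exit_latency_us <= predicted_idle_us
                and exit_latency_us <= latency_req_us):
            best = i
    return best
```

Splitting this policy (the governor) from the platform code that actually enters the state (the driver) is exactly the abstraction the abstract describes: governors can be improved independently of the platforms they run on.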
Subtitle: Resource Efficient OS-Level Virtualization
Note: paper not in proceedings
2007 looks like being the year when Free Software and Open Source finally make their differences (which have been bubbling away under the surface for over a decade) manifest. For all of its differences with "free software", Linux consistently maintains the greatest amount of innovation of any of the Open Source Operating Systems. We'll take a whimsical and offbeat tour of the reasons why this might be so from the point of view of the maintainer of possibly the least popular (certainly the least used) kernel architecture.