#%include "default.mgp"
%tab 1 size 6, vgap 40, prefix " ", icon box "green" 50
%tab 2 size 5, vgap 40, prefix " ", icon arc "yellow" 50
%tab 3 size 4, vgap 40, prefix " --", icon delta3 "white" 40
%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%tfont "/usr/share/fonts/default/TrueType/timr____.ttf"
%fore "white"
%center
Asynchronous IO for Linux 2.5
%fore "darkorange"
%size 4
Suparna Bhattacharya
Badari Pulavarty
Steven Pratt
Janet Morgan
%size 4
{suparna@in, pbadari@us, slpratt@us, janetmor@us}.ibm.com
%left
%size 8
%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
What and Why ?
%fore "skyblue"
	AIO overlaps processing with I/O operations
		Submit I/O without waiting for completion
		Submit multiple I/O requests together (batching)
%pause
	Can improve application and system performance
		Web servers, databases, I/O-intensive apps
		Avoid need for lots of threads
		Adapt to varying loads (event-driven model)
		Improved throughput and utilization of CPU & devices
	Optimize disk activity
		Combining/reordering of individual requests
%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Linux AIO Evolution
%fore "skyblue"
	AIO implementations for Linux
%pause
		User-level POSIX AIO Library (current glibc AIO)
%pause
		SGI KAIO
%pause
		Red Hat's 2.4 kernel AIO patches (Benjamin LaHaise)
%pause
		2.5 kernel AIO infrastructure and add-on patchsets
%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
AIO Architecture Decisions
%fore "skyblue"
	External interface choices
		Common interface for sync & async mode
%fore "orange"
		Unique set of interfaces for AIO (e.g., POSIX AIO)
			Can address specific requirements, e.g., batch submission
%fore "skyblue"
%pause
	Alternative system design principles
		Sync and async share a common code path
			sync = async + wait
%fore "orange"
		Sync and async paths may diverge as needed
			May be tuned for different performance characteristics
%fore "skyblue"
%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
AIO Design Models
%fore "skyblue"
	Approaches for providing asynchrony
		Offload entire sync I/O to thread pools
			User-level threads, e.g., glibc
		Hybrid split-phase I/O approach (e.g., SGI KAIO)
			Async submission, pool of threads to wait for completion
%fore "orange"
		True async state machine for every operation
			Series of non-blocking steps; context retained across steps
			Event-driven execution
%fore "skyblue"
%pause
	Handling user-context dependencies
		Maintain dedicated per-address-space service threads
			Execute context-dependent steps in caller's context
%fore "orange"
		Convert parameters to context-independent form
			e.g., mapping user-space buffers into kernel address space
%fore "skyblue"
%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
2.4 AIO Design Critique
%fore "skyblue"
	Advantages
		Sync I/O performance unaffected by AIO code
			Allows AIO to be tuned optimally for AIO patterns
		Work-to-do state machine => high asynchrony
			Smooth (exact) continuation of fully non-blocking steps
		Avoids extra threads to complete I/O in caller's context
%pause
	Disadvantages
		Duplication of logic between sync and async paths
			Difficult to maintain
		Async state machine complex
			Harder to debug, prone to certain types of races
		Inefficient utilization of TLB for small buffers
%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
2.5 AIO Design Principles
%fore "skyblue"
	AIO now a first-class citizen of the kernel
		Common code path for AIO and regular sync I/O
		sync = async + wait (not exactly, but mostly !)
	Retry-based model
		Convert blocking points to retry exits
		Retries triggered through async notification when ready
			Schedules a restart of the operation for the remaining transfer
	I/O cancellation
		Just disables future retries
	Worker threads take on caller's address space
		Takes care of user-context dependencies
		No need to map buffers or pin memory
%%%%%%%%%%%%%%%%%%%%%%
%page
%center
%image "fig1.gif"
%leftfill
%%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
AIO Support for Direct I/O
%fore "skyblue"
	2.5 O_DIRECT implementation
		Streams entire I/O direct to BIOs
		Pipelines pinning of pages and submission of I/O
		FS implements direct_IO address-space method
%pause
	Asynchronous DIO
		I/O completion async (return -EIOCBQUEUED)
		BIO callback completes iocb from interrupt context when entire DIO is over
%pause
	Multiple potential blocking points in submission path not yet converted to be asynchronous
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
AIO Support for Buffered I/O
%fore "skyblue"
	Low-level retry primitives (async wait queue based)
		wait_on_page_bit() -> wait_on_page_bit_wq()
		wait_on_buffer() -> wait_on_buffer_wq()
		blk_congestion_wait() -> blk_congestion_wait_wq()
%fore "green"
%pause
    prepare_to_wait(wqh, wait, state); // won't alter task state if async wait
    while (!cond) {
        if (!is_sync_wait(wait))
            return -EIOCBRETRY;
        io_schedule();
    }
%fore "skyblue"
%pause
	Graceful error propagation with indication of how much is done
	Completion of iocbs driven by caller of AIO fops
	Only one retry instance active at a time for a given iocb
%%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Buffered AIO Read
%fore "skyblue"
	Buffered file system read
		for each page in the range
			page_cache_readahead
%fore "red"
			lock page
%fore "skyblue"
			read page if not uptodate
%fore "red"
			wait till page is unlocked (indicates I/O completion)
%fore "skyblue"
			copy data to user buffer
%pause
	Changes for buffered AIO read
		Convert blocking points to retry exits, return bytes done
		Handling short reads
			High-level retry handler tracks bytes left in kiocb
		Readahead changes
%%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Buffered AIO Write
%fore "skyblue"
	Buffered file system write
%size 4
		for each page in the range
%fore "red"
			filesystem get block to map buffers (meta-data I/O)
			read in block(s) that are to be partially written
%fore "skyblue"
			copy data from user buffer
			mark buffers dirty
		if (O_SYNC)
			writeout dirty mapping pages
			sync meta data updates
%fore "red"
			wait for writeback to complete on pages just locked down for I/O
%size 8
%fore "skyblue"
%pause
	Changes for buffered AIO write
		Convert blocking points to retry exits
		Rewrite O_SYNC handling: sync_page_range
			Retry remaining range as writeback finishes in parts
%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Observations on the 2.5 AIO Design
%fore "skyblue"
	Async I/O patterns expose new optimization requirements on code originally designed for sync I/O
	Not all code paths are suitable for retries without a bit of restructuring
	Retries repeat some initial processing steps each time
		Saving state across retries may help at the cost of reduced generality
	Switching address spaces in retry thread has its costs
	Tuning of AIO workqueues and degree of asynchrony of retry instances affect overall system performance
%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%size 6
%fore "white"
Our Thanks to ...
%fore "skyblue"
%size 4
	Benjamin LaHaise
	Andrew Morton
	Philip Copeland
%pause
	Also Stephen Tweedie, Christoph Hellwig, Hugh Dickins, Daniel McNeil, William Lee Irwin, Matthew Wilcox and other people on the linux-aio and the linux-kernel mailing lists for various suggestions and fixes
%pause
%size 6
%fore "white"
Downloads:
%fore "orange"
%size 4
	Filesystem AIO patches are available as part of the 2.6-mm series maintained by Andrew Morton
	Rawiobench benchmark available at http://www-124.ibm.com/developerworks/opensource/performance/rawread
%%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Disclaimers and Trademarks
%fore "skyblue"
%size 4
This work represents the view of the authors and does not necessarily represent the view of IBM.
IBM is a registered trademark of International Business Machines Corporation in the United States and/or other countries.
Linux is a registered trademark of Linus Torvalds.
Other company, product, and service names may be trademarks or service marks of others.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
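%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Backup: Retry Model Sketch
%fore "skyblue"
%size 4
The retry-based model from the "2.5 AIO Design Principles" and "Buffered AIO Read" slides can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions, not kernel code: every name here (sim_file, sim_kiocb, sim_file_read, sim_retry, sim_demo, SIM_EIOCBRETRY) and the tiny 4-byte "page" size are hypothetical. A low-level read takes a retry exit where the real code would block on a page that is not uptodate, and a high-level handler tracks the bytes left across retries.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All names and values below are hypothetical, for illustration only. */
#define SIM_EIOCBRETRY (-1)   /* arbitrary stand-in for the kernel's -EIOCBRETRY */
#define SIM_PAGE_SIZE  4      /* tiny "page" so the demo needs several retries */
#define SIM_NPAGES     4

/* Simulated page cache: data plus a per-page "uptodate" flag. */
struct sim_file {
    char data[SIM_NPAGES * SIM_PAGE_SIZE];
    int  uptodate[SIM_NPAGES];
};

/* Simulated iocb: the retry handler keeps its progress here. */
struct sim_kiocb {
    struct sim_file *file;
    char  *buf;      /* next destination byte */
    size_t pos;      /* next file offset */
    size_t left;     /* bytes still to transfer */
    size_t nbytes;   /* size of the original request */
};

/* One non-blocking pass. Where a sync read would wait for a page, this
 * returns the bytes done so far, or SIM_EIOCBRETRY if none were done. */
static long sim_file_read(const struct sim_file *f, char *buf,
                          size_t pos, size_t len)
{
    size_t done = 0;

    assert(pos + len <= sizeof f->data);
    while (done < len) {
        size_t page = (pos + done) / SIM_PAGE_SIZE;
        size_t n = SIM_PAGE_SIZE - (pos + done) % SIM_PAGE_SIZE;

        if (n > len - done)
            n = len - done;
        if (!f->uptodate[page])            /* blocking point -> retry exit */
            return done ? (long)done : SIM_EIOCBRETRY;
        memcpy(buf + done, f->data + pos + done, n);
        done += n;
    }
    return (long)done;
}

/* High-level retry handler: advances the iocb by whatever the pass
 * transferred and asks for another retry while bytes remain. */
static long sim_retry(struct sim_kiocb *iocb)
{
    long ret = sim_file_read(iocb->file, iocb->buf, iocb->pos, iocb->left);

    if (ret > 0) {
        iocb->buf  += ret;
        iocb->pos  += (size_t)ret;
        iocb->left -= (size_t)ret;
    }
    if (iocb->left > 0)
        return SIM_EIOCBRETRY;
    return (long)(iocb->nbytes - iocb->left);
}

/* Demo driver: pages become uptodate one at a time (simulated I/O
 * completion) and each completion triggers exactly one retry. */
long sim_demo(char *out, size_t len, int *passes)
{
    struct sim_file f;
    struct sim_kiocb iocb = {
        .file = &f, .buf = out, .pos = 0, .left = len, .nbytes = len
    };
    long ret;
    size_t page;

    memcpy(f.data, "abcdefghijklmnop", sizeof f.data);
    memset(f.uptodate, 0, sizeof f.uptodate);

    ret = sim_retry(&iocb);                /* initial submission */
    *passes = 1;
    for (page = 0; ret == SIM_EIOCBRETRY && page < SIM_NPAGES; page++) {
        f.uptodate[page] = 1;              /* "async notification when ready" */
        ret = sim_retry(&iocb);
        (*passes)++;
    }
    return ret;
}
```

With a 10-byte request, sim_demo() returns 10 after four passes: the initial submission exits immediately, and each of the three page completions restarts the operation for the remaining transfer. Keeping the position bookkeeping in the handler rather than in the low-level read mirrors the slide's note that the high-level retry handler tracks the bytes left.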