#%include "default.mgp"
%tab 1 size 6, vgap 40, prefix " ", icon box "green" 50
%tab 2 size 5, vgap 40, prefix " ", icon arc "yellow" 50
%tab 3 size 4, vgap 40, prefix " --", icon delta3 "white" 40
%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%tfont "/usr/share/fonts/default/TrueType/timr____.ttf"
%fore "white"
%center
%size 8
Linux AIO Performance and Robustness for Enterprise Workloads
%fore "darkorange"
%size 4
Suparna Bhattacharya, John Tran, Mike Sullivan, Chris Mason
%size 4
{suparna@in, jbtran@us, mksully@us}.ibm.com, mason@suse.com
%left
%size 8
%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Overview of kernel AIO in Linux(tm) 2.6
%fore "skyblue"
	AIO overlaps processing with I/O operations
		improved utilization of CPU and devices
		especially under variable loads
%pause
	AIO syscalls
		io_setup(max_events, &ctx)
		io_submit(ctx, nr, iocbs[])
			IO_CMD_PREAD, IO_CMD_PWRITE, IO_CMD_POLL
		io_getevents(ctx, min_nr, nr, events[], timeout)
		io_destroy(ctx)
		io_cancel(ctx, iocb, &result)
%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Retry-based AIO Recap
%fore "skyblue"
	Sync I/O and AIO share a common code path
	Async io_wait context for AIO
		Allocated as part of kiocb
		Replaces default on-stack wait queue entry for sync I/O
	Retry exit for AIO instead of blocking
		Return number of bytes completed or -EIOCBRETRY
%pause
	AIO proceeds as a series of non-blocking iterations
		Retry kicked via async wait queue callback on wakeup
		Reissues fop->aio_read/write
			Modified arguments representing the remaining I/O
		Worker threads use caller's address space during retries
%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Impact of Readahead on AIO Reads
%fore "skyblue"
%size 4
%fore "green"
     Current               Ahead
----|-------|--------|----------------|-----
    ^start  ^start+size
                     ^ahead_start     ^ahead_start+ahead_size
    ^preoffset
%fore "skyblue"
%pause
%size 6
	Impact on Streaming Random AIO Reads
		fop->aio_read(fd, o1, 16384) = -EIOCBRETRY
			Readahead o1 to o1+64KB, pre = o1
%pause
		fop->aio_read(fd, o2, 16384) = -EIOCBRETRY
			Readahead o2 to o2+8KB, pre = o2
%pause
		fop->aio_read(fd, o3, 16384) = -EIOCBRETRY
			No readahead, slow read 4KB
%pause
		fop->aio_read(fd, o1, 16384) = 16384
%pause
		fop->aio_read(fd, o2, 16384) = 8192
%pause
		fop->aio_read(fd, o3, 16384) = 4096
%size 8
%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Upfront Readahead for AIO
%fore "skyblue"
	Tells readahead logic about all pages in a request
	Doesn't repeat readahead on retries
%pause
	I/O Pattern with Streaming AIO Reads
		fop->aio_read(fd, o1, 16384) = -EIOCBRETRY
			Readahead o1 to 64KB, pre = o1+12KB
		fop->aio_read(fd, o2, 16384) = -EIOCBRETRY
			Readahead o2 to 20KB, pre = o2+16KB
		fop->aio_read(fd, o1, 16384) = 16384
		fop->aio_read(fd, o2, 16384) = 16384
%pause
	Addressing Sendfile regression
		Upfront only within max readahead pages
		Restrict to AIO case
%fore "skyblue"
%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Random AIO Read Throughput
%fore "skyblue"
%center
%newimage -scrzoom 70 "aioread-perf.eps"
%leftfill
%size 4
Filesystem: ext3 4KB blocksize, 1GB file
AIC7896 Ultra2 SCSI
4-way Pentium(tm) III 700MHz, 512MB
%size 8
%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
AIO DIO vs Cached I/O Integrity
%fore "skyblue"
	DIO vs Buffered I/O races (2.4 & 2.6) - sct
		Meta-data mods vs actual disk block instantiation
		Atomicity of flush & DIO read - i_sem
		Truncate vs DIO write/read - r/w i_alloc_sem
		DIO fallback to buffered I/O for writes to sparse regions
%pause
	AIO DIO specific races (2.6)
		AIO-DIO read/write vs truncate
			hold i_alloc_sem till I/O completion
		AIO-DIO file extends
			block instantiation atomicity with i_size updates
			force synchronous behaviour
		AIO-DIO writes to sparse regions
			request spanning allocated & sparse regions
			force synchronous behaviour
%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Concurrent Synchronized I/O
%fore "skyblue"
	Synchronized Writes (O_SYNC, fsync)
		I/O committed to disk before request completes
%pause
	Concurrent DIO writes
		i_sem serializes parallel DIO writes
		could be released after blocks looked up and I/O started
		(helps streaming AIO-DIO writes)
%pause
	Concurrent O_SYNC buffered writes
		Per-address space page lists (dirty & writeback)
			i_sem held across traversal to writeback completion
			serializes parallel O_SYNC writes & difficult to retry-enable for AIO
			races between sync and background writes
		Radix-tree lookup of range to O_SYNC
			avoids i_sem across I/O waits & AIO retry friendly
%%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Tagged Radix-tree based Writeback
%fore "skyblue"
	Lookup dirty or writeback pages in O(log64(n))
		Adds tag bits for each slot to each radix-tree node
		To search, keep descending sub-trees under slots with the tag set
		Tagged gang-lookup for in-order searches in a range
		Replaces per-address space page list traversals
%pause
	To synchronize writes to disk ...
		Radix-tree walk and writeout dirty pages in range
		Radix-tree walk and wait on writeback pages in range
	Repeatable logic for wait on writeback for AIO O_SYNC
		Issue all writeouts for the range during first iteration
		Wait for writeback completion converted to a retry exit
		Retries don't re-dirty pages, fall through to wait on writeback step
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Random AIO Write Throughput
%fore "skyblue"
%center
%newimage -scrzoom 70 "aiowrite-perf.eps"
%leftfill
%size 4
Filesystem: ext3 4KB blocksize, 1GB file
AIC7896 Ultra2 SCSI
4-way PIII 700MHz, 512MB
%size 8
%%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Retry Storms & Filtered Waitqueues
%fore "skyblue"
%size 4
Hashed wait queues & overloaded wakeups can interact with AIO retry-logic in tricky ways
%pause
%fore "green"
CPU1 CPU2 lock_page(px) ... unlock_page(px) lock_page(py) wait_on_page_writeback_wq(px) ... unlock_page(py) wakes up px triggering a retry <----------------------------------- lock_page(px) wait_on_page_writeback_wq(py) ... unlock_page(py) ---- wakes up py --- causes retry ---->
%pause
%fore "orange"
Filtered wakeups tie each wakeup to a specific object and reason for wakeup. This eliminates redundant wakeups, so situations like the above no longer arise.
%fore "skyblue"
%size 8
%%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
AIO and Database Workloads
%fore "skyblue"
	DB2(R) page cleaners
		Flush dirty buffer pool pages to disk
		Number and behaviour can be tuned according to demand
		Frees agent processes to be dedicated to processing transactions
	AIO reduces the number of page cleaner processes
		Helps OLTP workloads
		Stream random synchronized AIO writes to preallocated blocks
		Individual request size = Database page size (e.g. 8KB)
		Keep disk queues maximally utilized and limit contention
%pause
	AIO reads for prefetching data
		Expected to help decision support workloads
		Streaming large random AIO reads
		Individual request size = Database extent size (e.g. 256KB)
		Need to tune readahead setting for device for buffered AIO
%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
AIO OLTP Performance - Raw Devices
%fore "skyblue"
%size 3
%area 50 20 0 20, leftfill, fore "green"
Update-intensive OLTP database workload
Derived from a TPC benchmark, but in no way comparable to any TPC results
%area 50 20 50 20, leftfill, fore "green"
DB2 V8, Linux 2.6.1, 2-way AMD Opteron, QLogic 2342 FC,
2 storage servers x 8 disk enclosures x 14 disks each
RAID-0 configuration, stripe size 256KB
%pause
%area 100 50 0 40, leftfill, fore "skyblue"
%center
%size 8
______________________________________________________________
Configuration                        Relative Throughput
______________________________________________________________
1 page cleaner with AIO              133
55 page cleaners without AIO         122
______________________________________________________________
%pause
______________________________________________________________
Configuration                        Page Cleaner Writes (%)
______________________________________________________________
1 page cleaner with AIO              100
55 page cleaners without AIO         70
______________________________________________________________
%leftfill
%%%%%%%%%%%%%%%%%%%%%
%page
%size 8
%fore "white"
AIO OLTP Performance - Filesystems
%fore "skyblue"
%size 3
%area 50 20 0 20, leftfill, fore "green"
Update-intensive OLTP database workload
Derived from a TPC benchmark, but in no way comparable to any TPC results
%area 50 20 50 20, leftfill, fore "green"
DB2 V8, Linux 2.6.0+mm1, 2-way AMD Opteron
QLogic 2310 FC, 4 disk enclosures,
each with 2 disk RAID-0 arrays, stripe size 256KB
%pause
%area 100 50 0 40, leftfill, fore "skyblue"
%center
%size 8
______________________________________________________________
Configuration                        Relative Throughput
______________________________________________________________
5 page cleaners with AIO (buffered)  113.7
5 page cleaners without AIO          100
______________________________________________________________
%pause
______________________________________________________________
Configuration                        Page Cleaner Writes (%)
______________________________________________________________
5 page cleaners with AIO (buffered)  100
5 page cleaners without AIO          37
______________________________________________________________
%leftfill
%%%%%%%%%%%%%%%%%%%%%
%page
%size 8
%fore "white"
Combining AIO & Communications I/O
%fore "skyblue"
	Different API models
		Communications I/O uses epoll + O_NONBLOCK
		Disk-based AIO uses native AIO API
%pause
	Combining into a single event loop
		Extend epoll for notification of AIO completion
		AIO poll
			IO_CMD_POLL (fd, event mask)
			retry-based implementation
		Native AIO support for communications I/O
			e.g. AIO pipe (retry-based implementation)
			read immediate added to retry-model
			context switch reduction vs I/O stalls
			AIO on sockets not implemented yet
%%%%%%%%%%%%%%%%%%%%%
%page
%fore "white"
Conclusions
%fore "skyblue"
	In retrospect ...
		Real challenges are beyond sync->async conversion
		AIO exposed I/O patterns less likely with sync I/O alone
		AIO appeared to magnify some problems early
			hashed waitqueues -> filtered wakeups
			readahead window collapse with large random reads
		Enhancements improved concurrency for sync I/O paths
%pause
	Room for future work
		More benchmarking, analysis & optimizations
		AIO fsync
		More widely used AIO applications
		Study to determine if network AIO is worth it
		Enabling efficient POSIX AIO implementations
%%%%%%%%%%%%%%%%%%%%%
%page
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%size 6
%fore "white"
Our Thanks to ...
%fore "skyblue"
%size 4
Andrew Morton
Daniel McNeil
Janet Morgan
Badari Pulavarty
Stephen Tweedie
William Lee Irwin
%pause
and several other people on the linux-aio and the linux-kernel mailing lists for various suggestions and fixes
%pause
%size 6
%fore "white"
Downloads:
%fore "orange"
%size 4
AIO patches discussed here are available at
www.kernel.org/pub/linux/kernel/people/suparna/aio
AIO web page: http://lse.sf.net/io/aio.html
%%%%%%%%%%%%%%%%%%%%%%
%page
%size 8
%bgrad 25 25 256 45 1 "darkgreen" "black" "black" "black" "black" "black" "black" "darkgreen"
%fore "white"
Disclaimers and Trademarks
%fore "skyblue"
%size 4
This work represents the view of the authors and does not necessarily represent the view of IBM.
IBM and DB2 are registered trademarks of International Business Machines Corporation in the United States and/or other countries.
Linux is a registered trademark of Linus Torvalds.
Pentium is a trademark of Intel Corporation in the United States, other countries or both.
Other company, product, and service names may be trademarks or service marks of others.
The benchmarks discussed in this presentation were conducted for research purposes only, under laboratory conditions. Results will not be realised in all computing environments.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
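%page
%fore "white"
Backup: Toy Model of Retry-based AIO Reads
%fore "skyblue"
%size 4
The retry model from the Retry-based AIO Recap slide, and the effect of upfront readahead on the number of retries, can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions, not kernel code: ToyFile, Iocb, aio_read, run_to_completion, the upfront flag and the RETRY sentinel are hypothetical names, RETRY merely stands in for -EIOCBRETRY, and "disk I/O" is simulated by moving page numbers between two sets.

```python
PAGE = 4096
RETRY = -1  # stand-in for -EIOCBRETRY; not the kernel's numeric value


class ToyFile:
    """Hypothetical file backed by a simulated page cache."""

    def __init__(self, data):
        self.data = data
        self.cached = set()    # page numbers present in the page cache
        self.inflight = set()  # page numbers with simulated disk I/O issued

    def complete_io(self):
        """Simulated wakeup: all in-flight pages land in the page cache."""
        self.cached |= self.inflight
        self.inflight.clear()


class Iocb:
    """Hypothetical request state, updated between retries."""

    def __init__(self, offset, length):
        self.offset = offset
        self.remaining = length
        self.buf = bytearray()
        self.attempts = 0      # number of aio_read() calls made


def aio_read(f, iocb, upfront):
    """One non-blocking iteration.

    Returns the number of bytes copied, 0 at end of file, or RETRY when the
    first page needed is not cached yet.
    """
    iocb.attempts += 1
    pos = iocb.offset
    end = min(iocb.offset + iocb.remaining, len(f.data))
    copied = 0
    while pos < end:
        page = pos // PAGE
        if page not in f.cached:
            # Issue I/O instead of blocking: either the one missing page,
            # or every remaining page of the request (upfront readahead).
            last = (end - 1) // PAGE if upfront else page
            f.inflight.update(p for p in range(page, last + 1)
                              if p not in f.cached)
            break
        chunk_end = min(end, (page + 1) * PAGE)
        iocb.buf += f.data[pos:chunk_end]
        copied += chunk_end - pos
        pos = chunk_end
    if copied == 0 and pos < end:
        return RETRY
    return copied


def run_to_completion(f, iocb, upfront=False):
    """Reissue aio_read() with updated arguments until the request is done."""
    while iocb.remaining > 0:
        ret = aio_read(f, iocb, upfront)
        if ret == RETRY:
            f.complete_io()    # stands in for the async wait queue callback
            continue
        if ret == 0:           # end of file
            break
        iocb.offset += ret     # arguments now describe the remaining I/O
        iocb.remaining -= ret
    return bytes(iocb.buf)
```

In this toy, a 16KB read on a cold cache takes 8 aio_read() calls when I/O is issued one page at a time, and 2 calls when I/O for the whole request is issued upfront. The numbers describe the simulation only and say nothing about real kernel behaviour.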