Background
----------

The PCI DMA mapping functions work by taking a virtual address and
mapping it to a bus address. This works fine for I/O to pages that have
a virtual address mapping. On architectures with HIGHMEM, however, pages
in the highmem zone have no such mapping (because of the limited address
space). Highmem pages can still be mapped directly to a bus address, so
if we can collapse the page -> virt -> bus translation into a simple
page -> bus translation, that's all we need.

Usually block drivers use the pci_map_sg function to do this mapping,
with sg->address set to the virtual address of the buffer. The approach
I took is to add a sg->page too, and have pci_map_sg map sg->page (if
set) or sg->address accordingly. Of course we then also need a
sg->offset, so that was added to struct scatterlist as well. For single
page mappings, there is pci_map_page.

Why
---

Data in the page cache can reside in highmem. When we want to write out
one of these pages, the following happens:

- WRITE buffer_head, with b_page in highmem
- Allocate new buffer_head and page in low memory
- Copy data from highmem page to newly allocated low mem page
- Continue writing out new page
- On end I/O, free bounce buffer_head and page

or for a read:

- READ buffer_head, with b_page in highmem
- Allocate new buffer_head and page in low memory
- Continue reading in data
- On end I/O, copy back data to high memory page and free bounce

To access data in highmem pages, Linux plays TLB tricks by temporarily
mapping them into reserved low memory. The READ copy-back is extra
expensive, since the non-caching atomic mapping function has to be used:
it can (depending on the low level driver) be called from IRQ context.
Thus the entire page copy also happens with interrupts disabled! This is
what Linux 2.2/2.4 currently does, and it doesn't take a genius to note
that this is less than optimal.
Simple kernel profiling when doing dbench disk testing shows the
bouncing eating significant amounts of CPU. Requiring a full page and
buffer_head allocation for every such I/O is very awkward too, and has
led to numerous deadlocks in the past.

Memory zoning
-------------

Status of bio patches
---------------------

Functional and stable for IDE and SCSI. The various RAID controllers
(cpqarray, cciss, DAC960) have not been tested as I don't have the
hardware, but they should work too. The cciss/cpqarray maintainer is
currently testing cciss.

Highmem I/O should work on:

- IDE (DMA of course; various PIO modes checked for correct mapping,
  and it seems good)
- cpqarray and cciss (untested)
- SCSI

SCSI and IDE both pass cerberus stress testing (aic7xxx and piix,
respectively), so I think the patch can be considered fairly stable.
SCSI support is currently enabled for aic7xxx and sym53c8xx. IDE highmem
has been tested the most, and I'd be surprised if there are any
show-stopper bugs left there.

ll_rw_kio works, and brw_kiovec was greatly simplified by this. Not just
that, but with bio we can remove the embedded buffer_head and blocks
array in struct kiobuf, which leads to a ~8kB reduction in size for each
kiobuf.

No real performance testing on this yet, just a simple dd:

(clean kernel, 2.4.5-pre4)
bart:~ # time dd if=/dev/raw1 of=/dev/null bs=4k
128008+0 records in
128008+0 records out

real    0m43.130s
user    0m0.290s
sys     0m7.060s

(2.4.5-pre4 + bio)
bart:~ # time dd if=/dev/raw1 of=/dev/null bs=4k
128008+0 records in
128008+0 records out

real    0m38.478s
user    0m0.204s
sys     0m5.091s

Pending / TODO
--------------

- Convert more drivers that support I/O to higher memory pages to do so
  instead of using bounce buffers.
- Drop plug_tq and convert to per-queue unplugging.
- Merge my queue-barrier patch
- Look into moving more of the per-major arrays into the request_queue
- Fix md and lvm
- Fix ide-scsi
- Fix drivers/s390 block stuff
- Probably still some drivers that aren't converted and not listed; look
- Lose kdev_t from the block layer completely

Feel free to jump in and start hacking on the TODO list :-)

Jens Axboe