====================================================
PCI Express I/O Virtualization Resource on PowerNV
====================================================

Wei Yang <weiyang@linux.vnet.ibm.com>

Benjamin Herrenschmidt <benh@au1.ibm.com>

Bjorn Helgaas <bhelgaas@google.com>

26 Aug 2014

This document describes the hardware requirements for PCI MMIO resource
sizing and assignment on PowerKVM and how the generic PCI code handles
those requirements.  The first two sections describe the concept of
Partitionable Endpoints and their implementation on P8 (IODA2).  The last
two sections talk about the considerations for enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints
==========================================

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs etc.) and to provide a mechanism
to freeze a device that is causing errors in order to limit the
possibility of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of
"frozen" state bits (one for MMIO and one for DMA; they get set together
but can be cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all
loads return all 1's.  MSIs are also blocked.  There is a bit more state
that captures things like the details of the error that caused the
freeze, but that is not critical.

The interesting part is how the various PCIe transactions (MMIO, DMA,
...) are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2).  Keep in mind that this is all per PHB (PCI host bridge).  Each
PHB is a completely separate HW entity that replicates the entire logic,
so it has its own set of PEs, etc.

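
The PE state table mentioned above can be pictured, very roughly, as an
array indexed by PE number.  The following is a purely illustrative
sketch; the names and layout are made up and do not correspond to the
actual hardware registers or to any kernel structure::

   /* Illustrative sketch only -- not the HW layout or a kernel API. */
   #define NUM_PES 256     /* per PHB on IODA2, see the next section */

   struct pe_state {
           unsigned int mmio_frozen : 1;   /* set together with dma_frozen, */
           unsigned int dma_frozen  : 1;   /* but each can be cleared alone */
   };

   static struct pe_state pe_state_table[NUM_PES];
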
2. Implementation of Partitionable Endpoints on P8 (IODA2)
============================================================

P8 supports up to 256 Partitionable Endpoints per PHB.

  * Inbound

    For DMA, MSIs and inbound PCIe error messages, we have a table (in
    memory but accessed in HW by the chip) that provides a direct
    correspondence between a PCIe RID (bus/dev/fn) and a PE number.  We
    call this the RTT.

    - For DMA we then provide an entire address space for each PE that
      can contain two "windows", depending on the value of PCI address
      bit 59.  Each window can be configured to be remapped via a "TCE
      table" (IOMMU translation table), which has various configurable
      characteristics not described here.

    - For MSIs, we have two windows in the address space (one at the top
      of the 32-bit space and one much higher) which, via a combination
      of the address and MSI value, will result in one of the 2048
      interrupts per bridge being triggered.  There is a PE# in the
      interrupt controller descriptor table as well, which is compared
      with the PE# obtained from the RTT to "authorize" the device to
      emit that specific interrupt.

    - Error messages just use the RTT.

  * Outbound.  That's where the tricky part is.

    Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
    from the CPU address space to the PCI address space.  There is one
    M32 window and sixteen M64 windows.  They have different
    characteristics.  First what they have in common: they forward a
    configurable portion of the CPU address space to the PCIe bus, and
    they must be naturally aligned and a power of two in size.  The rest
    is different:

    - The M32 window:

      * Is limited to 4GB in size.

      * Drops the top bits of the address (above the size) and replaces
        them with a configurable value.  This is typically used to
        generate 32-bit PCIe accesses.  We configure that window at boot
        from FW and don't touch it from Linux; it's usually set to
        forward a 2GB portion of address space from the CPU to PCIe
        0x8000_0000..0xffff_ffff.  (Note: the top 64KB are actually
        reserved for MSIs, but this is not a problem at this point; we
        just need to ensure Linux doesn't assign anything there.  The M32
        logic ignores that, however, and will forward in that space if we
        try.)

      * Is divided into 256 segments of equal size.  A table in the chip
        maps each segment to a PE#.  That allows portions of the MMIO
        space to be assigned to PEs on a segment granularity.  For a 2GB
        window, the segment granularity is 2GB/256 = 8MB (see the sketch
        at the end of this section).

      Now, this is the "main" window we use in Linux today (excluding
      SR-IOV).  We basically use the trick of forcing the bridge MMIO
      windows onto a segment alignment/granularity so that the space
      behind a bridge can be assigned to a PE.

      Ideally we would like to be able to have individual functions in
      PEs, but that would mean using a completely different address
      allocation scheme where individual function BARs can be "grouped"
      to fit in one or more segments.

    - The M64 windows:

      * Must be at least 256MB in size.

      * Do not translate addresses (the address on PCIe is the same as
        the address on the PowerBus).  There is a way to also set the top
        14 bits, which are not conveyed by PowerBus, but we don't use
        this.

      * Can be configured to be segmented.  When not segmented, we can
        specify the PE# for the entire window.  When segmented, a window
        has 256 segments; however, there is no table for mapping a
        segment to a PE#.  The segment number *is* the PE#.

      * Support overlaps.  If an address is covered by multiple windows,
        there's a defined ordering for which window applies.

    We have code (fairly new compared to the M32 stuff) that exploits
    that for large BARs in 64-bit space:

    We configure an M64 window to cover the entire region of address
    space that has been assigned by FW for the PHB (about 64GB; ignore
    the space for the M32, it comes out of a different "reserve").  We
    configure it as segmented.

    Then we do the same thing as with M32, using the bridge alignment
    trick, to match to those giant segments.

    Since we cannot remap, we have two additional constraints:

    - We do the PE# allocation *after* the 64-bit space has been
      assigned, because the addresses we use directly determine the PE#.
      We then update the M32 PE# for the devices that use both 32-bit and
      64-bit spaces, or assign the remaining PE#s to 32-bit-only devices.

    - We cannot "group" segments in HW, so if a device ends up using more
      than one segment, we end up with more than one PE#.  There is a HW
      mechanism to make the freeze state cascade to "companion" PEs, but
      that only works for PCIe error messages (typically used so that if
      you freeze a switch, it freezes all its children).  So we do it in
      SW.  We lose a bit of effectiveness of EEH in that case, but that's
      the best we found.  So when any of the PEs freezes, we freeze the
      other ones for that "domain".  We thus introduce the concept of a
      "master PE", which is the one used for DMA, MSIs, etc., and
      "secondary PEs" that are used for the remaining M64 segments.

    We would like to investigate using additional M64 windows in "single
    PE" mode to overlay over specific BARs to work around some of that,
    for example for devices with very large BARs, e.g., GPUs.  It would
    make sense, but we haven't done it yet.

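
To make the mapping rules above concrete, here is a sketch of how a PE#
could be derived on the inbound and outbound paths described in this
section.  This only illustrates the arithmetic; the names, types and
window parameters are made-up assumptions, not the powernv
implementation::

   #include <stdint.h>

   #define NUM_SEGMENTS 256

   /* Inbound: the RTT maps a RID (bus/dev/fn) directly to a PE#. */
   static uint16_t rtt[65536];               /* indexed by (bus << 8) | devfn */

   static int pe_for_rid(uint16_t rid)
   {
           return rtt[rid];
   }

   /* Outbound, M32: the segment index goes through a lookup table. */
   static uint8_t m32_segment_to_pe[NUM_SEGMENTS];

   static int m32_pe_for_addr(uint64_t addr, uint64_t base, uint64_t size)
   {
           uint64_t seg_size = size / NUM_SEGMENTS;  /* 2GB / 256 = 8MB */

           return m32_segment_to_pe[(addr - base) / seg_size];
   }

   /* Outbound, segmented M64: no table, the segment number is the PE#. */
   static int m64_pe_for_addr(uint64_t addr, uint64_t base, uint64_t size)
   {
           uint64_t seg_size = size / NUM_SEGMENTS;  /* e.g. 64GB / 256 = 256MB */

           return (int)((addr - base) / seg_size);
   }
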
3. Considerations for SR-IOV on PowerKVM
==========================================

  * SR-IOV Background

    The PCIe SR-IOV feature allows a single Physical Function (PF) to
    support several Virtual Functions (VFs).  Registers in the PF's
    SR-IOV Capability control the number of VFs and whether they are
    enabled.

    When VFs are enabled, they appear in Configuration Space like normal
    PCI devices, but the BARs in VF config space headers are unusual.
    For a non-VF device, software uses BARs in the config space header to
    discover the BAR sizes and assign addresses for them.  For VF
    devices, software uses VF BAR registers in the *PF* SR-IOV Capability
    to discover sizes and assign addresses.  The BARs in the VF's config
    space header are read-only zeros.

    When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
    base address for all the corresponding VF(n) BARs.  For example, if
    the PF SR-IOV Capability is programmed to enable eight VFs, and it
    has a 1MB VF BAR0, the address in that VF BAR sets the base of an 8MB
    region.  This region is divided into eight contiguous 1MB regions,
    each of which is a BAR0 for one of the VFs.  Note that even though
    the VF BAR describes an 8MB region, the alignment requirement is for
    a single VF, i.e., 1MB in this example.

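
    The arithmetic in this example can be sketched as follows; this is
    only an illustration of what the SR-IOV spec implies, with made-up
    names rather than a kernel interface::

       #include <stdint.h>

       /* Sketch of the example above: eight VFs with a 1MB VF BAR0. */
       static const uint64_t vf_bar0_size = 1ULL << 20;  /* 1MB per VF       */
       static const unsigned int num_vfs  = 8;           /* 8MB region total */

       /* BAR0 of VF(n) sits at a fixed offset from the address programmed
        * into the PF's VF BAR0 register. */
       static uint64_t vf_n_bar0(uint64_t vf_bar0_base, unsigned int n)
       {
               return vf_bar0_base + (uint64_t)n * vf_bar0_size;
       }

       /* Only vf_bar0_base itself must be 1MB (vf_bar0_size) aligned,
        * not 8MB (num_vfs * vf_bar0_size) aligned. */
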
  There are several strategies for isolating VFs in PEs:

  - M32 window: There's one M32 window, and it is split into 256
    equally-sized segments.  The finest granularity possible is a 256MB
    window with 1MB segments.

    VF BARs that are 1MB or larger could be mapped to separate PEs in
    this window.  Each segment can be individually mapped to a PE via the
    lookup table, so this is quite flexible, but it works best when all
    the VF BARs are the same size.  If they are different sizes, the
    entire window has to be small enough that the segment size matches
    the smallest VF BAR, which means larger VF BARs span several
    segments.

  - Non-segmented M64 window: A non-segmented M64 window is mapped
    entirely to a single PE, so it could only isolate one VF.

  - Single segmented M64 windows: A segmented M64 window could be used
    just like the M32 window, but the segments can't be individually
    mapped to PEs (the segment number is the PE#), so there isn't as much
    flexibility.  A VF with multiple BARs would have to be in a "domain"
    of multiple PEs, which is not as well isolated as a single PE.

  - Multiple segmented M64 windows: As usual, each window is split into
    256 equally-sized segments, and the segment number is the PE#.  But
    if we use several M64 windows, they can be set to different base
    addresses and different segment sizes.  If we have VFs that each have
    a 1MB BAR and a 32MB BAR, we could use one M64 window to assign 1MB
    segments and another M64 window to assign 32MB segments (see the
    sketch after this list).

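
  As a rough sketch of the multiple-window strategy just described (the
  numbers and names below are illustrative assumptions, not how the
  powernv code is structured): each VF BAR gets its own segmented M64
  window whose segment size equals that BAR's size, and the window bases
  are chosen so that VF(i) decodes to the same segment number, and thus
  the same PE#, in both windows::

     #include <stdint.h>

     /* Two VF BARs per VF (illustrative): a 1MB one and a 32MB one. */
     static const uint64_t small_bar_size = 1ULL << 20;   /* window 0, 1MB segments  */
     static const uint64_t large_bar_size = 32ULL << 20;  /* window 1, 32MB segments */

     /*
      * If the VF(n) spaces start at segment 'first_pe' of each window, an
      * access to VF(i) falls in segment (first_pe + i) of either window,
      * i.e., both BARs of VF(i) map to the single PE# (first_pe + i).
      */
     static uint64_t vf_small_bar(uint64_t win0_base, unsigned int first_pe,
                                  unsigned int i)
     {
             return win0_base + (uint64_t)(first_pe + i) * small_bar_size;
     }

     static uint64_t vf_large_bar(uint64_t win1_base, unsigned int first_pe,
                                  unsigned int i)
     {
             return win1_base + (uint64_t)(first_pe + i) * large_bar_size;
     }
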
  Finally, we plan to use M64 windows for SR-IOV, which will be described
  in more detail in the next two sections.  For a given VF BAR, we need
  to effectively reserve the entire 256 segments (256 * VF BAR size) and
  position the VF BAR to start at the beginning of a free range of
  segments/PEs inside that M64 window.

  The goal is of course to be able to give a separate PE for each VF.

  The IODA2 platform has 16 M64 windows, which are used to map MMIO
  ranges to PE#s.  Each M64 window defines one MMIO range, and this range
  is divided into 256 segments, with each segment corresponding to one
  PE.

  We decided to leverage this M64 window to map VFs to individual PEs,
  since SR-IOV VF BARs are all the same size.

  But doing so introduces another problem: total_VFs is usually smaller
  than the number of M64 window segments, so if we map one VF BAR
  directly to one M64 window, some part of the M64 window will map to
  another device's MMIO range.

  IODA supports 256 PEs, so segmented windows contain 256 segments, which
  means that if total_VFs is less than 256, we have the situation shown
  in Figure 1.0, where segments [total_VFs, 255] of the M64 window may
  map to some MMIO range on other devices::

     0      1                     total_VFs - 1
     +------+------+-     -+------+------+
     |      |      |  ...  |      |      |
     +------+------+-     -+------+------+

                           VF(n) BAR space

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.0 Direct map VF(n) BAR space

  Our current solution is to allocate 256 segments even if the VF(n) BAR
  space doesn't need that much, as shown in Figure 1.1::

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           VF(n) BAR space + extra

     0      1                     total_VFs - 1                255
     +------+------+-     -+------+------+-      -+------+------+
     |      |      |  ...  |      |      |   ...  |      |      |
     +------+------+-     -+------+------+-      -+------+------+

                           M64 window

                Figure 1.1 Map VF(n) BAR space + extra

  Allocating the extra space ensures that the entire M64 window will be
  assigned to this one SR-IOV device and none of the space will be
  available for other devices.  Note that this only expands the space
  reserved in software; there are still only total_VFs VFs, and they only
  respond to segments [0, total_VFs - 1].  There's nothing in hardware
  that responds to segments [total_VFs, 255].

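
  A minimal sketch of the reservation arithmetic above, assuming one
  segment per VF (the names are made up, not a kernel interface)::

     #include <stdint.h>

     #define NUM_SEGMENTS 256

     /* Space the VFs actually use: one segment per VF. */
     static uint64_t vf_bar_space(uint64_t vf_bar_size, unsigned int total_vfs)
     {
             return vf_bar_size * total_vfs;
     }

     /* Space we reserve so that the whole M64 window belongs to this one
      * device; segments [total_vfs, 255] are covered by the reservation,
      * but nothing in hardware responds there. */
     static uint64_t vf_bar_reservation(uint64_t vf_bar_size)
     {
             return vf_bar_size * NUM_SEGMENTS;
     }
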
4. Implications for the Generic PCI Code
==========================================

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#.  If the address is in an
M32 window, we can set the PE# by updating the table that translates
segments to PE#s.  Similarly, if the address is in an unsegmented M64
window, we can set the PE# for the window.  But if it's in a segmented
M64 window, the segment number is the PE#.

Therefore, the only way to control the PE# for a VF is to change the base
of the VF(n) BAR space in the VF BAR.  If the PCI core allocates the
exact amount of space required for the VF(n) BAR space, the VF BAR value
is fixed and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF BAR
value can be changed as long as the entire VF(n) BAR space remains inside
the space allocated by the core.

Ideally the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE.  The VF BARs (and therefore the PE#s)
are contiguous.  If VF0 is in PE(x), then VF(n) is in PE(x+n).  If we
allocate 256 segments, there are (256 - numVFs) choices for the PE# of
VF0.

If the segment size is smaller than the VF BAR size, it will take several
segments to cover a VF BAR, and a VF will be in several PEs.  This is
possible, but the isolation isn't as good, and it reduces the number of
PE# choices: instead of consuming only numVFs segments, the VF(n) BAR
space will consume (numVFs * n) segments, where n is the number of
segments needed to cover one VF BAR.  That means there aren't as many
available segments for adjusting the base of the VF(n) BAR space.

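
As a rough illustration of the ideal case above (a sketch assuming the
segment size equals the VF BAR size; the names are made up and this is
not the generic PCI API), choosing the PE# of VF0 amounts to choosing
where the VF(n) BAR space starts inside the 256-segment reservation::

   #include <assert.h>
   #include <stdint.h>

   #define NUM_SEGMENTS 256

   /*
    * Ideal case: segment size == VF BAR size, so segment number == PE#.
    * Placing the VF(n) BAR space at offset pe_of_vf0 * seg_size puts VF0
    * in PE(pe_of_vf0) and VF(n) in PE(pe_of_vf0 + n).
    */
   static uint64_t vf_space_offset(uint64_t seg_size, unsigned int pe_of_vf0,
                                   unsigned int num_vfs)
   {
           /* The whole VF(n) BAR space must stay inside the window. */
           assert(pe_of_vf0 + num_vfs <= NUM_SEGMENTS);

           return (uint64_t)pe_of_vf0 * seg_size;
   }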