path: root/virt
Age Commit message Author Files Lines
2024-04-11KVM: Drop unused @may_block param from gfn_to_pfn_cache_invalidate_start()Sean Christopherson3-7/+4
Remove gfn_to_pfn_cache_invalidate_start()'s unused @may_block parameter, which was leftover from KVM's abandoned (for now) attempt to support guest usage of gfn_to_pfn caches. Fixes: a4bff3df5147 ("KVM: pfncache: remove KVM_GUEST_USES_PFN usage") Reported-by: Like Xu <like.xu.linux@gmail.com> Cc: Paul Durrant <paul@xen.org> Cc: David Woodhouse <dwmw2@infradead.org> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240305003742.245767-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-08KVM: Explicitly disallow activating a gfn_to_pfn_cache with INVALID_GPASean Christopherson1-0/+7
Explicitly disallow activating a gfn_to_pfn_cache with an error gpa, i.e. INVALID_GPA, to ensure that KVM doesn't mistake a GPA-based cache for an HVA-based cache (KVM uses INVALID_GPA as a magic value to differentiate between GPA-based and HVA-based caches). WARN if KVM attempts to activate a cache with INVALID_GPA, purely so that new caches need to at least consider what to do with a "bad" GPA, as all existing usage of kvm_gpc_activate() guarantees gpa != INVALID_GPA. I.e. removing the WARN in the future is completely reasonable if doing so would yield cleaner/better code overall. Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240320001542.3203871-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
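The magic-value convention described above can be sketched in userspace C. This is only an illustration under stated assumptions: the `sketch_` names, the struct layout, and the `-EINVAL` return are hypothetical and are not taken from the kernel source; only the idea (a reserved gpa value marks a cache as HVA-based, and activation with that value is rejected) comes from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; the kernel's real types and values may differ. */
typedef uint64_t sketch_gpa_t;
#define SKETCH_INVALID_GPA (~(sketch_gpa_t)0)

struct sketch_gpc {
	sketch_gpa_t gpa;   /* SKETCH_INVALID_GPA marks an HVA-based cache */
	uint64_t uhva;
	bool active;
};

/* A cache is treated as GPA-based iff its gpa is not the magic value. */
static bool sketch_gpc_is_gpa_based(const struct sketch_gpc *gpc)
{
	return gpc->gpa != SKETCH_INVALID_GPA;
}

/* Refuse to activate a GPA-based cache with the magic value. */
static int sketch_gpc_activate_gpa(struct sketch_gpc *gpc, sketch_gpa_t gpa)
{
	/* The commit message says KVM WARNs here; the error code is an assumption. */
	if (gpa == SKETCH_INVALID_GPA)
		return -EINVAL;

	gpc->gpa = gpa;
	gpc->active = true;
	return 0;
}
```

Rejecting the value at activation time keeps the "GPA-based vs. HVA-based" test unambiguous for every later user of the cache.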
2024-04-08KVM: Check validity of offset+length of gfn_to_pfn_cache prior to activationSean Christopherson1-7/+9
When activating a gfn_to_pfn_cache, verify that the offset+length is sane and usable before marking the cache active. Letting __kvm_gpc_refresh() detect the problem results in a cache being marked active without setting the GPA (or any other fields), which in turn results in KVM trying to refresh a cache with INVALID_GPA. Attempting to refresh a cache with INVALID_GPA isn't functionally problematic, but it runs afoul of the sanity check that exactly one of GPA or userspace HVA is valid, i.e. that a cache is either GPA-based or HVA-based. Reported-by: syzbot+106a4f72b0474e1d1b33@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/0000000000005fa5cc0613f1cebd@google.com Fixes: 721f5b0dda78 ("KVM: pfncache: allow a cache to be activated with a fixed (userspace) HVA") Cc: David Woodhouse <dwmw2@infradead.org> Cc: Paul Durrant <paul@xen.org> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240320001542.3203871-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-04-08KVM: Add helpers to consolidate gfn_to_pfn_cache's page split checkSean Christopherson1-8/+19
Add a helper to check that the incoming length for a gfn_to_pfn_cache is valid with respect to the cache's GPA and/or HVA. To avoid activating a cache with a bogus GPA, a future fix will fork the page split check in the inner refresh path into activate() and the public refresh() APIs, at which point KVM will check the length in three separate places. Deliberately keep the "page offset" logic open coded, as the only other path that consumes the offset, __kvm_gpc_refresh(), already needs to differentiate between GPA-based and HVA-based caches, and it's not obvious that using a helper is a net positive in overall code readability. Note, for GPA-based caches, this has a subtle side effect of using the GPA instead of the resolved HVA in the check() path, but that should be a nop as the HVA offset is derived from the GPA, i.e. the two offsets are identical, barring a KVM bug. Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240320001542.3203871-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
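A page split check of the kind described above might look roughly like the following userspace sketch. The helper name, the 4 KiB page size, and the magic invalid-GPA value are assumptions made for illustration; the actual kernel helper may be shaped differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define SKETCH_PAGE_SIZE   4096ULL
#define SKETCH_INVALID_GPA (~(uint64_t)0)

static uint64_t sketch_offset_in_page(uint64_t addr)
{
	return addr & (SKETCH_PAGE_SIZE - 1);
}

/*
 * Return true if [addr, addr + len) stays within a single page, where addr
 * is the gpa for GPA-based caches and the uhva for HVA-based caches.
 */
static bool sketch_gpc_is_valid_len(uint64_t gpa, uint64_t uhva, uint64_t len)
{
	uint64_t offset = (gpa != SKETCH_INVALID_GPA)
		? sketch_offset_in_page(gpa)
		: sketch_offset_in_page(uhva);

	/* offset < PAGE_SIZE, so the subtraction cannot underflow. */
	return len <= SKETCH_PAGE_SIZE - offset;
}
```

Writing the comparison as `len <= PAGE_SIZE - offset` rather than `offset + len <= PAGE_SIZE` is a choice made in this sketch so that a very large `len` cannot wrap around.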
2024-03-11Merge tag 'kvm-x86-xen-6.9' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-108/+143
KVM Xen and pfncache changes for 6.9: - Rip out the half-baked support for using gfn_to_pfn caches to manage pages that are "mapped" into guests via physical addresses. - Add support for using gfn_to_pfn caches with only a host virtual address, i.e. to bypass the "gfn" stage of the cache. The primary use case is overlay pages, where the guest may change the gfn used to reference the overlay page, but the backing hva+pfn remains the same. - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead of a gpa, so that userspace doesn't need to reconfigure and invalidate the cache/mapping if the guest changes the gpa (but userspace keeps the resolved hva the same). - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation. - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior). - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs. - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to refresh), and drop a now-redundant acquisition of xen_lock (that was protecting the shared_info cache) to fix a deadlock due to recursively acquiring xen_lock.
2024-03-11Merge tag 'kvm-x86-misc-6.9' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-1/+20
KVM x86 misc changes for 6.9: - Explicitly initialize a variety of on-stack variables in the emulator that triggered KMSAN false positives (though in fairness to KMSAN, it's comically difficult to see that the uninitialized memory is never truly consumed). - Fix the debugregs ABI for 32-bit KVM, and clean up code related to reading DR6 and DR7. - Rework the "force immediate exit" code so that vendor code ultimately decides how and when to force the exit. This allows VMX to further optimize handling preemption timer exits, and allows SVM to avoid sending a duplicate IPI (SVM also has a need to force an exit). - Fix a long-standing bug where kvm_has_noapic_vcpu could be left elevated if vCPU creation ultimately failed, and add WARN to guard against similar bugs. - Provide a dedicated arch hook for checking if a different vCPU was in-kernel (for directed yield), and simplify the logic for checking if the currently loaded vCPU is in-kernel. - Misc cleanups and fixes.
2024-03-11Merge tag 'kvm-x86-generic-6.9' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-2/+4
KVM common MMU changes for 6.9: - Harden KVM against underflowing the active mmu_notifier invalidation count, so that "bad" invalidations (usually due to bugs elsewhere in the kernel) are detected earlier and are less likely to hang the kernel. - Fix a benign bug in __kvm_mmu_topup_memory_cache() where the object size and number of objects parameters to kvmalloc_array() were swapped.
2024-03-11Merge tag 'kvm-x86-asyncpf-6.9' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-24/+49
KVM async page fault changes for 6.9: - Always flush the async page fault workqueue when a work item is being removed, especially during vCPU destruction, to ensure that there are no workers running in KVM code when all references to KVM-the-module are gone, i.e. to prevent a use-after-free if kvm.ko is unloaded. - Grab a reference to the VM's mm_struct in the async #PF worker itself instead of gifting the worker a reference, e.g. so that there's no need to remember to *conditionally* clean up after the worker.
2024-03-11Merge tag 'kvmarm-6.9' of https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEADPaolo Bonzini1-6/+2
KVM/arm64 updates for 6.9 - Infrastructure for building KVM's trap configuration based on the architectural features (or lack thereof) advertised in the VM's ID registers - Support for mapping vfio-pci BARs as Normal-NC (vaguely similar to x86's WC) at stage-2, improving the performance of interacting with assigned devices that can tolerate it - Conversion of KVM's representation of LPIs to an xarray, utilized to address some of the serialization on the LPI injection path - Support for _architectural_ VHE-only systems, advertised through the absence of FEAT_E2H0 in the CPU's ID register - Miscellaneous cleanups, fixes, and spelling corrections to KVM and selftests
2024-03-09Merge tag 'kvm-x86-guest_memfd_fixes-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-1/+7
KVM GUEST_MEMFD fixes for 6.8: - Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLY to avoid creating ABI that KVM can't sanely support. - Update documentation for KVM_SW_PROTECTED_VM to make it abundantly clear that such VMs are purely a development and testing vehicle, and come with zero guarantees. - Limit KVM_SW_PROTECTED_VM guests to the TDP MMU, as the long term plan is to support confidential VMs with deterministic private memory (SNP and TDX) only in the TDP MMU. - Fix a bug in a GUEST_MEMFD negative test that resulted in false passes when verifying that KVM_MEM_GUEST_MEMFD memslots can't be dirty logged.
2024-03-04KVM: pfncache: simplify locking and make more self-containedDavid Woodhouse1-10/+11
The locking on the gfn_to_pfn_cache is... interesting. And awful. There is a rwlock in ->lock which readers take to ensure protection against concurrent changes. But __kvm_gpc_refresh() makes assumptions that certain fields will not change even while it drops the write lock and performs MM operations to revalidate the target PFN and kernel mapping. Commit 93984f19e7bc ("KVM: Fully serialize gfn=>pfn cache refresh via mutex") partly addressed that — not by fixing it, but by adding a new mutex, ->refresh_lock. This prevented concurrent __kvm_gpc_refresh() calls on a given gfn_to_pfn_cache, but is still only a partial solution. There is still a theoretical race where __kvm_gpc_refresh() runs in parallel with kvm_gpc_deactivate(). While __kvm_gpc_refresh() has dropped the write lock, kvm_gpc_deactivate() clears the ->active flag and unmaps ->khva. Then __kvm_gpc_refresh() determines that the previous ->pfn and ->khva are still valid, and reinstalls those values into the structure. This leaves the gfn_to_pfn_cache with the ->valid bit set, but ->active clear. And a ->khva which looks like a reasonable kernel address but is actually unmapped. All it takes is a subsequent reactivation to cause that ->khva to be dereferenced. This would theoretically cause an oops which would look something like this: [1724749.564994] BUG: unable to handle page fault for address: ffffaa3540ace0e0 [1724749.565039] RIP: 0010:__kvm_xen_has_interrupt+0x8b/0xb0 I say "theoretically" because theoretically, that oops that was seen in production cannot happen. The code which uses the gfn_to_pfn_cache is supposed to have its *own* locking, to further paper over the fact that the gfn_to_pfn_cache's own papering-over (->refresh_lock) of its own rwlock abuse is not sufficient. For the Xen vcpu_info that external lock is the vcpu->mutex, and for the shared info it's kvm->arch.xen.xen_lock. Those locks ought to protect the gfn_to_pfn_cache against concurrent deactivation vs. 
refresh in all but the cases where the vcpu or kvm object is being *destroyed*, in which case the subsequent reactivation should never happen. Theoretically. Nevertheless, this locking abuse is awful and should be fixed, even if no clear explanation can be found for how the oops happened. So expand the use of the ->refresh_lock mutex to ensure serialization of activate/deactivate vs. refresh and make the pfncache locking entirely self-sufficient. This means that a future commit can simplify the locking in the callers, such as the Xen emulation code which has an outstanding problem with recursive locking of kvm->arch.xen.xen_lock, which will no longer be necessary. The rwlock abuse described above is still not best practice, although it's harmless now that the ->refresh_lock is held for the entire duration while the offending code drops the write lock, does some other stuff, then takes the write lock again and assumes nothing changed. That can also be fixed^W cleaned up in a subsequent commit, but this commit is a simpler basis for the Xen deadlock fix mentioned above. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240227115648.3104-5-dwmw2@infradead.org [sean: use guard(mutex) to fix a missed unlock] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-23KVM: Get rid of return value from kvm_arch_create_vm_debugfs()Oliver Upton1-6/+2
The general expectation with debugfs is that any initialization failure is nonfatal. Nevertheless, kvm_arch_create_vm_debugfs() allows implementations to return an error and kvm_create_vm_debugfs() allows that to fail VM creation. Change to a void return to discourage architectures from making debugfs failures fatal for the VM. Seems like everyone already had the right idea, as all implementations already return 0 unconditionally. Acked-by: Marc Zyngier <maz@kernel.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20240216155941.2029458-1-oliver.upton@linux.dev Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2024-02-22KVM: Make KVM_MEM_GUEST_MEMFD mutually exclusive with KVM_MEM_READONLYSean Christopherson1-1/+7
Disallow creating read-only memslots that support GUEST_MEMFD, as GUEST_MEMFD is fundamentally incompatible with KVM's semantics for read-only memslots. Read-only memslots allow the userspace VMM to emulate option ROMs by filling the backing memory with readable, executable code and data, while triggering emulated MMIO on writes. GUEST_MEMFD doesn't currently support writes from userspace and KVM doesn't support emulated MMIO on private accesses, i.e. the guest can only ever read zeros, and writes will always be treated as errors. Cc: Fuad Tabba <tabba@google.com> Cc: Michael Roth <michael.roth@amd.com> Cc: Isaku Yamahata <isaku.yamahata@gmail.com> Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Cc: Chao Peng <chao.p.peng@linux.intel.com> Fixes: a7800aa80ea4 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory") Link: https://lore.kernel.org/r/20240222190612.2942589-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
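The mutual-exclusion rule above amounts to a flag validation step. A minimal sketch follows, assuming illustrative flag bit positions and a hypothetical helper name; the authoritative flag values live in the KVM uapi header, and the real check sits inside KVM's memslot flag validation rather than in a standalone function like this one.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative bit positions; consult the KVM uapi header for real values. */
#define SKETCH_MEM_READONLY    (1u << 1)
#define SKETCH_MEM_GUEST_MEMFD (1u << 2)

/* Reject memslots that request both read-only and guest_memfd semantics. */
static int sketch_check_memslot_flags(uint32_t flags)
{
	if ((flags & SKETCH_MEM_GUEST_MEMFD) && (flags & SKETCH_MEM_READONLY))
		return -EINVAL;

	return 0;
}
```

Either flag on its own is accepted by this sketch; only the combination is refused.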
2024-02-22KVM: fix kvm_mmu_memory_cache allocation warningArnd Bergmann1-1/+1
gcc-14 notices that the arguments to kvmalloc_array() are mixed up: arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function '__kvm_mmu_topup_memory_cache': arch/x86/kvm/../../../virt/kvm/kvm_main.c:424:53: error: 'kvmalloc_array' sizes specified with 'sizeof' in the earlier argument and not in the later argument [-Werror=calloc-transposed-args] 424 | mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp); | ^~~~ arch/x86/kvm/../../../virt/kvm/kvm_main.c:424:53: note: earlier argument should specify number of elements, later size of each element The code still works correctly, but the incorrect order prevents the compiler from properly tracking the object sizes. Fixes: 837f66c71207 ("KVM: Allow for different capacities in kvm_mmu_memory_cache structs") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20240212112419.1186065-1-arnd@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
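The same argument-order convention applies to userspace calloc(), which gcc-14's -Wcalloc-transposed-args also covers. The sketch below is a userspace analogue, not the kernel fix itself: kvmalloc_array() is kernel-only, and `sketch_alloc_objects` is a hypothetical name used only here.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Allocate a zeroed array of `capacity` pointers. The element count goes
 * first and the element size second, matching the order the compiler
 * expects for array allocators.
 */
static void **sketch_alloc_objects(size_t capacity)
{
	return calloc(capacity, sizeof(void *));
}
```

Swapping the two arguments would request the same number of bytes, which is why the commit message calls the original code functionally correct, but it hides the element size from the compiler's object-size tracking.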
2024-02-22KVM: Add a comment explaining the directed yield pending interrupt logicSean Christopherson1-0/+7
Add a comment to explain why KVM treats vCPUs with pending interrupts as in-kernel when a vCPU wants to yield to a vCPU that was preempted while running in kernel mode. Link: https://lore.kernel.org/r/20240110003938.490206-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22KVM: Add dedicated arch hook for querying if vCPU was preempted in-kernelSean Christopherson1-1/+13
Plumb in a dedicated hook for querying whether or not a vCPU was preempted in-kernel. Unlike literally every other architecture, x86's VMX can check if a vCPU is in kernel context if and only if the vCPU is loaded on the current pCPU. x86's kvm_arch_vcpu_in_kernel() works around the limitation by querying kvm_get_running_vcpu() and redirecting to vcpu->arch.preempted_in_kernel as needed. But that's unnecessary, confusing, and fragile, e.g. x86 has had at least one bug where KVM incorrectly used a stale preempted_in_kernel. No functional change intended. Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/r/20240110003938.490206-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-22KVM: pfncache: check the need for invalidation under read lock firstPaul Durrant1-3/+19
When processing mmu_notifier invalidations for gpc caches, pre-check for overlap with the invalidation event while holding gpc->lock for read, and only take gpc->lock for write if the cache needs to be invalidated. Doing a pre-check without taking gpc->lock for write avoids unnecessarily contending the lock for unrelated invalidations, which is very beneficial for caches that are heavily used (but rarely subjected to mmu_notifier invalidations). Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-20-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
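The pre-check pattern described above can be sketched single-threaded with a toy lock that only counts acquisitions. Everything here is an assumption for illustration: the `sketch_` names, the struct layout, the overlap test, and the re-check under the write lock are this sketch's choices, and the toy lock is a stand-in for a real rwlock, not a working synchronization primitive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy lock that only counts acquisitions; a stand-in for a real rwlock. */
struct sketch_rwlock { unsigned long reads, writes; };

static void sketch_read_lock(struct sketch_rwlock *l)    { l->reads++; }
static void sketch_read_unlock(struct sketch_rwlock *l)  { (void)l; }
static void sketch_write_lock(struct sketch_rwlock *l)   { l->writes++; }
static void sketch_write_unlock(struct sketch_rwlock *l) { (void)l; }

struct sketch_gpc {
	struct sketch_rwlock lock;
	uint64_t uhva;
	bool valid;
};

static bool sketch_overlaps(const struct sketch_gpc *gpc, uint64_t start, uint64_t end)
{
	return gpc->valid && gpc->uhva >= start && gpc->uhva < end;
}

/* Invalidate the cache only if [start, end) covers its address. */
static void sketch_invalidate(struct sketch_gpc *gpc, uint64_t start, uint64_t end)
{
	bool hit;

	/* Cheap pre-check under the read lock. */
	sketch_read_lock(&gpc->lock);
	hit = sketch_overlaps(gpc, start, end);
	sketch_read_unlock(&gpc->lock);
	if (!hit)
		return;

	/* Re-check under the write lock, as state may have changed in between. */
	sketch_write_lock(&gpc->lock);
	if (sketch_overlaps(gpc, start, end))
		gpc->valid = false;
	sketch_write_unlock(&gpc->lock);
}
```

With this shape, an invalidation for an unrelated range never touches the write side of the lock, which is the contention the commit message aims to avoid.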
2024-02-20KVM: pfncache: allow a cache to be activated with a fixed (userspace) HVAPaul Durrant1-28/+70
Some pfncache pages may actually be overlays on guest memory that have a fixed HVA within the VMM. It's pointless to invalidate such cached mappings if the overlay is moved so allow a cache to be activated directly with the HVA to cater for such cases. A subsequent patch will make use of this facility. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-10-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-20KVM: pfncache: include page offset in uhva and use it consistentlyPaul Durrant1-8/+21
Currently the pfncache page offset is sometimes determined using the gpa and sometimes the khva, whilst the uhva is always page-aligned. After a subsequent patch is applied the gpa will not always be valid so adjust the code to include the page offset in the uhva and use it consistently as the source of truth. Also, where a page-aligned address is required, use PAGE_ALIGN_DOWN() for clarity. No functional change intended. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-8-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
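The idea of carrying the page offset in the uhva, and aligning down only where a page-aligned address is needed, can be illustrated with a small userspace sketch. The helper names and the 4 KiB page size are assumptions; in the kernel the corresponding macros are offset_in_page() and PAGE_ALIGN_DOWN().

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL   /* assumed page size for illustration */

static uint64_t sketch_offset_in_page(uint64_t addr)
{
	return addr & (SKETCH_PAGE_SIZE - 1);
}

static uint64_t sketch_page_align_down(uint64_t addr)
{
	return addr & ~(SKETCH_PAGE_SIZE - 1);
}

/* Build a uhva that carries the page offset taken from the gpa. */
static uint64_t sketch_make_uhva(uint64_t page_hva, uint64_t gpa)
{
	return sketch_page_align_down(page_hva) + sketch_offset_in_page(gpa);
}
```

Once the uhva includes the offset, later code can derive both the page base and the offset from that one field, without consulting the gpa.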
2024-02-20KVM: pfncache: stop open-coding offset_in_page()Paul Durrant1-3/+3
Some code in pfncache uses offset_in_page() but in other places it is open-coded. Use offset_in_page() consistently everywhere. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-7-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-20KVM: pfncache: remove KVM_GUEST_USES_PFN usagePaul Durrant1-52/+9
As noted in [1] the KVM_GUEST_USES_PFN usage flag is never set by any callers of kvm_gpc_init(), and for good reason: the implementation is incomplete/broken. And it's not clear that there will ever be a user of KVM_GUEST_USES_PFN, as coordinating vCPUs with mmu_notifier events is non-trivial. Remove KVM_GUEST_USES_PFN and all related code, e.g. dropping KVM_GUEST_USES_PFN also makes the 'vcpu' argument redundant, to avoid having to reason about broken code as __kvm_gpc_refresh() evolves. Moreover, all existing callers specify KVM_HOST_USES_PFN so the usage check in hva_to_pfn_retry() and hence the 'usage' argument to kvm_gpc_init() are also redundant. [1] https://lore.kernel.org/all/ZQiR8IpqOZrOpzHC@google.com Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-6-paul@xen.org [sean: explicitly call out that guest usage is incomplete] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-20KVM: pfncache: remove unnecessary exportsPaul Durrant1-5/+0
There is no need for the existing kvm_gpc_XXX() functions to be exported. Clean up now before additional functions are added in subsequent patches. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-3-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-20KVM: pfncache: Add a map helper functionPaul Durrant1-18/+29
There is a pfncache unmap helper but mapping is open-coded. Arguably this is fine because mapping is done in only one place, hva_to_pfn_retry(), but adding the helper does make that function more readable. No functional change intended. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-2-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-08Merge branch 'kvm-kconfig'Paolo Bonzini1-3/+1
Cleanups to Kconfig definitions for KVM * replace HAVE_KVM with an architecture-dependent symbol, when CONFIG_KVM may or may not be available depending on CPU capabilities (MIPS) * replace HAVE_KVM with IS_ENABLED(CONFIG_KVM) for host-side code that is not part of the KVM module, so that it is completely compiled out * factor common "select" statements in common code instead of requiring each architecture to specify it
2024-02-08treewide: remove CONFIG_HAVE_KVMPaolo Bonzini1-3/+0
It has no users anymore. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08kvm: move "select IRQ_BYPASS_MANAGER" to common codePaolo Bonzini1-0/+1
CONFIG_IRQ_BYPASS_MANAGER is a dependency of the common code included by CONFIG_HAVE_KVM_IRQ_BYPASS. There is no advantage in adding the corresponding "select" directive to each architecture. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-08kvm: replace __KVM_HAVE_READONLY_MEM with Kconfig symbolPaolo Bonzini2-1/+4
KVM uses __KVM_HAVE_* symbols in the architecture-dependent uapi/asm/kvm.h to mask unused definitions in include/uapi/linux/kvm.h. __KVM_HAVE_READONLY_MEM however was nothing but a misguided attempt to define KVM_CAP_READONLY_MEM only on architectures where KVM_CHECK_EXTENSION(KVM_CAP_READONLY_MEM) could possibly return nonzero. This however does not make sense, and it prevented userspace from supporting this architecture-independent feature without recompilation. Therefore, these days __KVM_HAVE_READONLY_MEM does not mask anything and is only used in virt/kvm/kvm_main.c. Userspace does not need to test it and there should be no need for it to exist. Remove it and replace it with a Kconfig symbol within Linux source code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-02-06KVM: Nullify async #PF worker's "apf" pointer as soon as it might be freedSean Christopherson1-5/+6
Nullify the async #PF worker's local "apf" pointer immediately after the point where the structure can be freed by the vCPU. The existing comment is helpful, but easy to overlook as there is no associated code. Update the comment to clarify that it can be freed as soon as the lock is dropped, as "after this point" isn't strictly accurate, nor does it help understand what prevents the structure from being freed earlier. Reviewed-by: Xu Yilun <yilun.xu@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20240110011533.503302-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-06KVM: Get reference to VM's address space in the async #PF workerSean Christopherson1-14/+18
Get a reference to the target VM's address space in async_pf_execute() instead of gifting a reference from kvm_setup_async_pf(). Keeping the address space alive just to service an async #PF is counter-productive, i.e. if the process is exiting and all vCPUs are dead, then NOT doing get_user_pages_remote() and freeing the address space asap is desirable. Handling the mm reference entirely within async_pf_execute() also simplifies the async #PF flows as a whole, e.g. it's not immediately obvious when the worker task vs. the vCPU task is responsible for putting the gifted mm reference. Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Xu Yilun <yilun.xu@intel.com> Link: https://lore.kernel.org/r/20240110011533.503302-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-06KVM: Put mm immediately after async #PF worker completes remote gup()Sean Christopherson1-2/+1
Put the async #PF worker's reference to the VM's address space as soon as the worker is done with the mm. This will allow deferring getting a reference to the worker itself without having to track whether or not getting a reference succeeded. Note, if the vCPU is still alive, there is no danger of the worker getting stuck with tearing down the host page tables, as userspace also holds a reference (obviously), i.e. there is no risk of delaying the page-present notification due to triggering the slow path in mmput(). Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Xu Yilun <yilun.xu@intel.com> Link: https://lore.kernel.org/r/20240110011533.503302-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-02-06KVM: Always flush async #PF workqueue when vCPU is being destroyedSean Christopherson1-5/+26
Always flush the per-vCPU async #PF workqueue when a vCPU is clearing its completion queue, e.g. when a VM and all its vCPUs is being destroyed. KVM must ensure that none of its workqueue callbacks is running when the last reference to the KVM _module_ is put. Gifting a reference to the associated VM prevents the workqueue callback from dereferencing freed vCPU/VM memory, but does not prevent the KVM module from being unloaded before the callback completes. Drop the misguided VM refcount gifting, as calling kvm_put_kvm() from async_pf_execute() if kvm_put_kvm() flushes the async #PF workqueue will result in deadlock. async_pf_execute() can't return until kvm_put_kvm() finishes, and kvm_put_kvm() can't return until async_pf_execute() finishes: WARNING: CPU: 8 PID: 251 at virt/kvm/kvm_main.c:1435 kvm_put_kvm+0x2d/0x320 [kvm] Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm irqbypass CPU: 8 PID: 251 Comm: kworker/8:1 Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Workqueue: events async_pf_execute [kvm] RIP: 0010:kvm_put_kvm+0x2d/0x320 [kvm] Call Trace: <TASK> async_pf_execute+0x198/0x260 [kvm] process_one_work+0x145/0x2d0 worker_thread+0x27e/0x3a0 kthread+0xba/0xe0 ret_from_fork+0x2d/0x50 ret_from_fork_asm+0x11/0x20 </TASK> ---[ end trace 0000000000000000 ]--- INFO: task kworker/8:1:251 blocked for more than 120 seconds. Tainted: G W 6.6.0-rc1-e7af8d17224a-x86/gmem-vm #119 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
task:kworker/8:1 state:D stack:0 pid:251 ppid:2 flags:0x00004000 Workqueue: events async_pf_execute [kvm] Call Trace: <TASK> __schedule+0x33f/0xa40 schedule+0x53/0xc0 schedule_timeout+0x12a/0x140 __wait_for_common+0x8d/0x1d0 __flush_work.isra.0+0x19f/0x2c0 kvm_clear_async_pf_completion_queue+0x129/0x190 [kvm] kvm_arch_destroy_vm+0x78/0x1b0 [kvm] kvm_put_kvm+0x1c1/0x320 [kvm] async_pf_execute+0x198/0x260 [kvm] process_one_work+0x145/0x2d0 worker_thread+0x27e/0x3a0 kthread+0xba/0xe0 ret_from_fork+0x2d/0x50 ret_from_fork_asm+0x11/0x20 </TASK> If kvm_clear_async_pf_completion_queue() actually flushes the workqueue, then there's no need to gift async_pf_execute() a reference because all invocations of async_pf_execute() will be forced to complete before the vCPU and its VM are destroyed/freed. And that in turn fixes the module unloading bug as __fput() won't do module_put() on the last vCPU reference until the vCPU has been freed, e.g. if closing the vCPU file also puts the last reference to the KVM module. Note that kvm_check_async_pf_completion() may also take the work item off the completion queue and so also needs to flush the work queue, as the work will not be seen by kvm_clear_async_pf_completion_queue(). Waiting on the workqueue could theoretically delay a vCPU due to waiting for the work to complete, but that's a very, very small chance, and likely a very small delay. kvm_arch_async_page_present_queued() unconditionally makes a new request, i.e. will effectively delay entering the guest, so the remaining work is really just: trace_kvm_async_pf_completed(addr, cr2_or_gpa); __kvm_vcpu_wake_up(vcpu); mmput(mm); and mmput() can't drop the last reference to the page tables if the vCPU is still alive, i.e. the vCPU won't get stuck tearing down page tables. Add a helper to do the flushing, specifically to deal with "wakeup all" work items, as they aren't actually work items, i.e. are never placed in a workqueue. 
Trying to flush a bogus workqueue entry rightly makes __flush_work() complain (kudos to whoever added that sanity check). Note, commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed") *tried* to fix the module refcounting issue by having VMs grab a reference to the module, but that only made the bug slightly harder to hit as it gave async_pf_execute() a bit more time to complete before the KVM module could be unloaded. Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out") Cc: stable@vger.kernel.org Cc: David Matlack <dmatlack@google.com> Reviewed-by: Xu Yilun <yilun.xu@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20240110011533.503302-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-01-29KVM: Harden against unpaired kvm_mmu_notifier_invalidate_range_end() callsSean Christopherson1-1/+3
When handling the end of an mmu_notifier invalidation, WARN if mn_active_invalidate_count is already 0 and do not decrement it further, i.e. avoid causing mn_active_invalidate_count to underflow/wrap. In the worst case scenario, effectively corrupting mn_active_invalidate_count could cause kvm_swap_active_memslots() to hang indefinitely. end() calls are *supposed* to be paired with start(), i.e. underflow can only happen if there is a bug elsewhere in the kernel, but due to lack of lockdep assertions in the mmu_notifier helpers, it's all too easy for a bug to go unnoticed for some time, e.g. see the recently introduced PAGEMAP_SCAN ioctl(). Ideally, mmu_notifiers would incorporate lockdep assertions, but users of mmu_notifiers aren't required to hold any one specific lock, i.e. adding the necessary annotations to make lockdep aware of all locks that are mutually exclusive with mm_take_all_locks() isn't trivial. Link: https://lore.kernel.org/all/000000000000f6d051060c6785bc@google.com Link: https://lore.kernel.org/r/20240110004239.491290-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
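The underflow guard described above can be sketched as a counter that refuses to decrement past zero. The struct, the function names, and the `warnings` field (a stand-in for the kernel's WARN machinery) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_kvm {
	unsigned long active_invalidate_count;
	unsigned long warnings;   /* stand-in for a kernel WARN */
};

static void sketch_invalidate_start(struct sketch_kvm *kvm)
{
	kvm->active_invalidate_count++;
}

/* Return true on a paired end(); false if an unpaired end() was detected. */
static bool sketch_invalidate_end(struct sketch_kvm *kvm)
{
	if (kvm->active_invalidate_count == 0) {
		kvm->warnings++;   /* report the bug, but leave the count at 0 */
		return false;
	}

	kvm->active_invalidate_count--;
	return true;
}
```

Without the guard, an unpaired end() would wrap the unsigned counter to a huge value, and anything waiting for the count to reach zero could wait indefinitely.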
2024-01-17Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds7-109/+1032
Pull kvm updates from Paolo Bonzini: "Generic: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures. - Clean up Kconfigs that all KVM architectures were selecting - New functionality around "guest_memfd", a new userspace API that creates an anonymous file and returns a file descriptor that refers to it. guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to switch a memory area between guest_memfd and regular anonymous memory. - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify per-page attributes for a given page of guest memory; right now the only attribute is whether the guest expects to access memory via guest_memfd or not, which in Confidential SVMs backed by SEV-SNP, TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM). x86: - Support for "software-protected VMs" that can use the new guest_memfd and page attributes infrastructure. This is mostly useful for testing, since there is no pKVM-like infrastructure to provide a meaningfully reduced TCB. - Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Use more generic lockdep assertions in paths that don't actually care about whether the caller is a reader or a writer. - let Xen guests opt out of having PV clock reported as "based on a stable TSC", because some of them don't expect the "TSC stable" bit (added to the pvclock ABI by KVM, but never set by Xen) to be set. - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. 
- Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - On AMD machines with vNMI, always rely on hardware instead of intercepting IRET in some cases to detect unmasking of NMIs - Support for virtualizing Linear Address Masking (LAM) - Fix a variety of vPMU bugs where KVM fails to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow. - Turn off KVM_WERROR by default for all configs so that it's not inadvertently enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds. - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code. - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time. ARM64: - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base granule sizes. 
Branch shared with the arm64 tree. - Large Fine-Grained Trap rework, bringing some sanity to the feature, although there is more to come. This comes with a prefix branch shared with the arm64 tree. - Some additional Nested Virtualization groundwork, mostly introducing the NV2 VNCR support and retargeting the NV support to that version of the architecture. - A small set of vgic fixes and associated cleanups. Loongarch: - Optimization for memslot hugepage checking - Cleanup and fix some HW/SW timer issues - Add LSX/LASX (128bit/256bit SIMD) support RISC-V: - KVM_GET_REG_LIST improvement for vector registers - Generate ISA extension reg_list using macros in get-reg-list selftest - Support for reporting steal time along with selftest s390: - Bugfixes Selftests: - Fix an annoying goof where the NX hugepage test prints out garbage instead of the magic token needed to run the test. - Fix build errors when a header is deleted/moved due to a missing flag in the Makefile. - Detect if KVM bugged/killed a selftest's VM and print out a helpful message instead of complaining that a random ioctl() failed. 
- Annotate the guest printf/assert helpers with __printf(), and fix the various bugs that were lurking due to lack of said annotation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits) x86/kvm: Do not try to disable kvmclock if it was not enabled KVM: x86: add missing "depends on KVM" KVM: fix direction of dependency on MMU notifiers KVM: introduce CONFIG_KVM_COMMON KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache RISC-V: KVM: selftests: Add get-reg-list test for STA registers RISC-V: KVM: selftests: Add steal_time test support RISC-V: KVM: selftests: Add guest_sbi_probe_extension RISC-V: KVM: selftests: Move sbi_ecall to processor.c RISC-V: KVM: Implement SBI STA extension RISC-V: KVM: Add support for SBI STA registers RISC-V: KVM: Add support for SBI extension registers RISC-V: KVM: Add SBI STA info to vcpu_arch RISC-V: KVM: Add steal-update vcpu request RISC-V: KVM: Add SBI STA extension skeleton RISC-V: paravirt: Implement steal-time support RISC-V: Add SBI STA extension definitions RISC-V: paravirt: Add skeleton for pv-time support RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr() ...
2024-01-08Merge tag 'vfs-6.8.misc' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Add Jan Kara as VFS reviewer - Show correct device and inode numbers in proc/<pid>/maps for vma files on stacked filesystems. This is now easily doable thanks to the backing file work from the last cycles. This comes with selftests Cleanups: - Remove a redundant might_sleep() from wait_on_inode() - Initialize pointer with NULL, not 0 - Clarify comment on access_override_creds() - Rework and simplify eventfd_signal() and eventfd_signal_mask() helpers - Process aio completions in batches to avoid needless wakeups - Completely decouple struct mnt_idmap from namespaces. We now only keep the actual idmapping around and don't stash references to namespaces - Reformat maintainer entries to indicate that a given subsystem belongs to fs/ - Simplify fput() for files that were never opened - Get rid of various pointless file helpers - Rename various file helpers - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from last cycle - Make relatime_need_update() return bool - Use GFP_KERNEL instead of GFP_USER when allocating superblocks - Replace deprecated ida_simple_*() calls with their current ida_*() counterparts Fixes: - Fix comments on user namespace id mapping helpers. 
They aren't kernel doc comments so they shouldn't be using /** - s/Retuns/Returns/g in various places - Add missing parameter documentation on can_move_mount_beneath() - Rename i_mapping->private_data to i_mapping->i_private_data - Fix a false-positive lockdep warning in pipe_write() for watch queues - Improve __fget_files_rcu() code generation to improve performance - Only notify writer that pipe resizing has finished after setting pipe->max_usage otherwise writers are never notified that the pipe has been resized and hang - Fix some kernel docs in hfsplus - s/passs/pass/g in various places - Fix kernel docs in ntfs - Fix kcalloc() arguments order reported by gcc 14 - Fix uninitialized value in reiserfs" * tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits) reiserfs: fix uninit-value in comp_keys watch_queue: fix kcalloc() arguments order ntfs: dir.c: fix kernel-doc function parameter warnings fs: fix doc comment typo fs tree wide selftests/overlayfs: verify device and inode numbers in /proc/pid/maps fs/proc: show correct device and inode numbers in /proc/pid/maps eventfd: Remove usage of the deprecated ida_simple_xx() API fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation fs/hfsplus: wrapper.c: fix kernel-doc warnings fs: add Jan Kara as reviewer fs/inode: Make relatime_need_update return bool pipe: wakeup wr_wait after setting max_usage file: remove __receive_fd() file: stop exposing receive_fd_user() fs: replace f_rcuhead with f_task_work file: remove pointless wrapper file: s/close_fd_get_file()/file_close_fd()/g Improve __fget_files_rcu() code generation (and thus __fget_light()) file: massage cleanup of files that failed to open fs/pipe: Fix lockdep false-positive in watchqueue pipe_write() ...
2024-01-08Merge tag 'kvm-x86-generic-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-3/+3
Common KVM changes for 6.8: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures.
2024-01-08KVM: fix direction of dependency on MMU notifiersPaolo Bonzini1-1/+1
KVM_GENERIC_MEMORY_ATTRIBUTES requires the generic MMU notifier code, because it uses kvm_mmu_invalidate_begin/end. However, it would not work with a bespoke implementation of MMU notifiers that does not use KVM_GENERIC_MMU_NOTIFIER, because most likely it would not synchronize correctly on invalidation. So the right thing to do is to note the problematic configuration if the architecture does not itself select KVM_GENERIC_MMU_NOTIFIER, not to enable it blindly. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-01-08KVM: introduce CONFIG_KVM_COMMONPaolo Bonzini1-0/+5
CONFIG_HAVE_KVM is currently used by some architectures to either enable the KVM config proper, or to enable host-side code that is not part of the KVM module. However, CONFIG_KVM's "select" statement in virt/kvm/Kconfig corresponds to a third meaning, namely to enable common Kconfigs required by all architectures that support KVM. These three meanings can be replaced respectively by an architecture-specific Kconfig, by IS_ENABLED(CONFIG_KVM), or by a new Kconfig symbol that is in turn selected by the architecture-specific "config KVM". Start by introducing such a new Kconfig symbol, CONFIG_KVM_COMMON. Unlike CONFIG_HAVE_KVM, it is selected by CONFIG_KVM, not by architecture code, and it brings in all dependencies of common KVM code. In particular, INTERVAL_TREE was missing in loongarch and riscv, so that is another thing that is fixed. Fixes: 8132d887a702 ("KVM: remove CONFIG_HAVE_KVM_EVENTFD", 2023-12-08) Reported-by: Randy Dunlap <rdunlap@infradead.org> Closes: https://lore.kernel.org/all/44907c6b-c5bd-4e4a-a921-e4d3825539d8@infradead.org/ Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-01-02Merge tag 'kvm-riscv-6.8-1' of https://github.com/kvm-riscv/linux into HEADPaolo Bonzini1-1/+2
KVM/riscv changes for 6.8 part #1 - KVM_GET_REG_LIST improvement for vector registers - Generate ISA extension reg_list using macros in get-reg-list selftest - Steal time account support along with selftest
2024-01-02Merge tag 'loongarch-kvm-6.8' of ↵Paolo Bonzini1-10/+8
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.8 1. Optimization for memslot hugepage checking. 2. Cleanup and fix some HW/SW timer issues. 3. Add LSX/LASX (128bit/256bit SIMD) support.
2023-12-22Merge tag 'kvmarm-fixes-6.7-2' of ↵Paolo Bonzini1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master KVM/arm64 fixes for 6.7, part #2 - Ensure a vCPU's redistributor is unregistered from the MMIO bus if vCPU creation fails - Fix building KVM selftests for arm64 from the top-level Makefile
2023-12-12KVM: Convert comment into an assertion in kvm_io_bus_register_dev()Marc Zyngier1-1/+2
Instead of having a comment indicating the need to hold slots_lock when calling kvm_io_bus_register_dev(), make it explicit with a lockdep assertion. Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20231207151201.3028710-6-maz@kernel.org Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-12-08KVM: clean up directives to compile out irqfdsPaolo Bonzini1-14/+12
Keep all #ifdef CONFIG_HAVE_KVM_IRQCHIP parts of eventfd.c together, and compile out the irqfds field of struct kvm if the symbol is not defined. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-08KVM: remove deprecated UAPIsPaolo Bonzini1-5/+0
The deprecated interfaces were removed 15 years ago. KVM's device assignment was deprecated in 4.2 and removed 6.5 years ago; the only interest might be in compiling ancient versions of QEMU, but QEMU has been using its own imported copy of the kernel headers since June 2011. So again we go into archaeology territory; just remove the cruft. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-08KVM: remove CONFIG_HAVE_KVM_IRQFDPaolo Bonzini3-8/+5
All platforms with a kernel irqchip have support for irqfd. Unify the two configuration items so that userspace can expect to use irqfd to inject interrupts into the irqchip. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-08KVM: remove CONFIG_HAVE_KVM_EVENTFDPaolo Bonzini1-4/+1
virt/kvm/eventfd.c is compiled unconditionally, meaning that the ioeventfds member of struct kvm is accessed unconditionally. CONFIG_HAVE_KVM_EVENTFD therefore must always be defined for KVM common code to compile successfully, which makes it pointless as a configuration option; remove it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-08KVM: guest-memfd: fix unused-function warningPaolo Bonzini1-2/+0
With migration disabled, one function becomes unused: virt/kvm/guest_memfd.c:262:12: error: 'kvm_gmem_migrate_folio' defined but not used [-Werror=unused-function] 262 | static int kvm_gmem_migrate_folio(struct address_space *mapping, | ^~~~~~~~~~~~~~~~~~~~~~ Remove the #ifdef around the reference so that fallback_migrate_folio() is never used. The gmem implementation of the hook is trivial; since the gmem mapping is unmovable, the pages should not be migrated anyway. Fixes: a7800aa80ea4 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory") Reported-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-12-01Revert "KVM: Prevent module exit until all VMs are freed"Sean Christopherson1-7/+0
Revert KVM's misguided attempt to "fix" a use-after-module-unload bug that was actually due to failure to flush a workqueue, not a lack of module refcounting. Pinning the KVM module until kvm_vm_destroy() doesn't prevent use-after-free due to the module being unloaded, as userspace can invoke delete_module() the instant the last reference to KVM is put, i.e. can cause all KVM code to be unmapped while KVM is actively executing said code. Generally speaking, the many instances of module_put(THIS_MODULE) notwithstanding, outside of a few special paths, a module can never safely put the last reference to itself without creating deadlock, i.e. something external to the module *must* put the last reference. In other words, having VMs grab a reference to the KVM module is futile, pointless, and as evidenced by the now-reverted commit 70375c2d8fa3 ("Revert "KVM: set owner of cpu and vm file operations""), actively dangerous. This reverts commit 405294f29faee5de8c10cb9d4a90e229c2835279 and commit 5f6de5cbebee925a612856fce6f9182bb3eee0db. Fixes: 405294f29fae ("KVM: Unconditionally get a ref to /dev/kvm module when creating a VM") Fixes: 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed") Link: https://lore.kernel.org/r/20231018204624.1905300-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-01KVM: Set file_operations.owner appropriately for all such structuresSean Christopherson1-3/+8
Set .owner for all KVM-owned file types so that the KVM module is pinned until any files with callbacks back into KVM are completely freed. Using "struct kvm" as a proxy for the module, i.e. keeping KVM-the-module alive while there are active VMs, doesn't provide full protection. Userspace can invoke delete_module() the instant the last reference to KVM is put. If KVM itself puts the last reference, e.g. via kvm_destroy_vm(), then it's possible for KVM to be preempted and deleted/unloaded before KVM fully exits, e.g. when the task running kvm_destroy_vm() is scheduled back in, it will jump to a code page that is no longer mapped. Note, file types that can call into sub-module code, e.g. kvm-intel.ko or kvm-amd.ko on x86, must use the module pointer passed to kvm_init(), not THIS_MODULE (which points at kvm.ko). KVM assumes that if /dev/kvm is reachable, e.g. VMs are active, then the vendor module is loaded. To reduce the probability of forgetting to set .owner entirely, use THIS_MODULE for stats files where KVM does not call back into vendor code. This reverts commit 70375c2d8fa3fb9b0b59207a9c5df1e2e1205c10, and fixes several other file types that have been buggy since their introduction. Fixes: 70375c2d8fa3 ("Revert "KVM: set owner of cpu and vm file operations"") Fixes: 3bcd0662d66f ("KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file") Reported-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/all/20231010003746.GN800259@ZenIV Link: https://lore.kernel.org/r/20231018204624.1905300-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-12-01KVM: Harden copying of userspace-array against overflowPhilipp Stanner1-3/+2
kvm_main.c utilizes vmemdup_user() and array_size() to copy a userspace array. Currently, this does not check for an overflow. Use the new wrapper vmemdup_array_user() to copy the array more safely. Note, KVM explicitly checks the number of entries before duplicating the array, i.e. adding the overflow check should be a glorified nop. Suggested-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Philipp Stanner <pstanner@redhat.com> Link: https://lore.kernel.org/r/20231102181526.43279-4-pstanner@redhat.com [sean: call out that KVM pre-checks the number of entries] Signed-off-by: Sean Christopherson <seanjc@google.com>
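The hardening described above boils down to checking the element-count times element-size multiplication before allocating and copying. A minimal userspace sketch of that idea follows; `dup_array_checked()` is a hypothetical name used only for illustration, it copies from ordinary memory rather than from userspace, and it relies on the GCC/Clang builtin `__builtin_mul_overflow()`. It is not the kernel's vmemdup_array_user() implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical userspace analogue of an overflow-checked array copy.
 * Returns a heap copy of n elements of elem_size bytes each, or NULL if
 * n * elem_size overflows size_t or the allocation fails.
 */
static void *dup_array_checked(const void *src, size_t n, size_t elem_size)
{
	size_t bytes;
	void *dst;

	/* GCC/Clang builtin: returns true if the multiplication overflowed. */
	if (__builtin_mul_overflow(n, elem_size, &bytes))
		return NULL;

	/* Allocate at least one byte so a zero-length copy is still non-NULL. */
	dst = malloc(bytes ? bytes : 1);
	if (!dst)
		return NULL;

	memcpy(dst, src, bytes);
	return dst;
}
```

A caller that has already bounded the number of entries, as the commit message says KVM does, should never hit the overflow branch, which is why the added check is described as a glorified nop.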
2023-11-30KVM: move KVM_CAP_DEVICE_CTRL to the generic checkWei Wang1-0/+1
KVM_CAP_DEVICE_CTRL allows userspace to check if the kvm_device framework (e.g. KVM_CREATE_DEVICE) is supported by KVM. Move KVM_CAP_DEVICE_CTRL to the generic check for two reasons: 1) it already supports arch agnostic usages (i.e. KVM_DEV_TYPE_VFIO). For example, a userspace VFIO implementation may need to create KVM_DEV_TYPE_VFIO on x86, riscv, or arm etc. It is simpler to have it checked in the generic code than in each arch's code. 2) KVM_CREATE_DEVICE has been added to the generic code. Link: https://lore.kernel.org/all/20221215115207.14784-1-wei.w.wang@intel.com Signed-off-by: Wei Wang <wei.w.wang@intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Acked-by: Anup Patel <anup@brainfault.org> (riscv) Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Link: https://lore.kernel.org/r/20230315101606.10636-1-wei.w.wang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-11-28eventfd: simplify eventfd_signal()Christian Brauner1-2/+2
Ever since the eventfd type was introduced back in 2007 in commit e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal() function only ever passed 1 as a value for @n. There's no point in keeping that additional argument. Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-2-bd549b14ce0c@kernel.org Acked-by: Xu Yilun <yilun.xu@intel.com> Acked-by: Andrew Donnellan <ajd@linux.ibm.com> # ocxl Acked-by: Eric Farman <farman@linux.ibm.com> # s390 Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-14Merge branch 'kvm-guestmemfd' into HEADPaolo Bonzini6-77/+1010
Introduce several new KVM uAPIs to ultimately create a guest-first memory subsystem within KVM, a.k.a. guest_memfd. Guest-first memory allows KVM to provide features, enhancements, and optimizations that are kludgy or outright impossible to implement in a generic memory subsystem. The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which, similar to the generic memfd_create(), creates an anonymous file and returns a file descriptor that refers to it. Again like "regular" memfd files, guest_memfd files live in RAM, have volatile storage, and are automatically released when the last reference is dropped. The key difference from memfd files (and every other memory subsystem) is that guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to convert a guest memory area between the shared and guest-private states. A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to specify attributes for a given page of guest memory. In the long term, it will likely be extended to allow userspace to specify per-gfn RWX protections, including allowing memory to be writable in the guest without it also being writable in host userspace. The immediate and driving use case for guest_memfd are Confidential (CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM. For such use cases, being able to map memory into KVM guests without requiring said memory to be mapped into the host is a hard requirement. While SEV+ and TDX prevent untrusted software from reading guest private data by encrypting guest memory, pKVM provides confidentiality and integrity *without* relying on memory encryption. In addition, with SEV-SNP and especially TDX, accessing guest private memory can be fatal to the host, i.e. KVM must prevent host userspace from accessing guest memory irrespective of hardware behavior. 
Long term, guest_memfd may be useful for use cases beyond CoCo VMs, for example hardening userspace against unintentional accesses to guest memory. As mentioned earlier, KVM's ABI uses userspace VMA protections to define the allow guest protection (with an exception granted to mapping guest memory executable), and similarly KVM currently requires the guest mapping size to be a strict subset of the host userspace mapping size. Decoupling the mappings sizes would allow userspace to precisely map only what is needed and with the required permissions, without impacting guest performance. A guest-first memory subsystem also provides clearer line of sight to things like a dedicated memory pool (for slice-of-hardware VMs) and elimination of "struct page" (for offload setups where userspace _never_ needs to DMA from or into guest memory). guest_memfd is the result of 3+ years of development and exploration; taking on memory management responsibilities in KVM was not the first, second, or even third choice for supporting CoCo VMs. But after many failed attempts to avoid KVM-specific backing memory, and looking at where things ended up, it is quite clear that of all approaches tried, guest_memfd is the simplest, most robust, and most extensible, and the right thing to do for KVM and the kernel at-large. The "development cycle" for this version is going to be very short; ideally, next week I will merge it as is in kvm/next, taking this through the KVM tree for 6.8 immediately after the end of the merge window. The series is still based on 6.6 (plus KVM changes for 6.7) so it will require a small fixup for changes to get_file_rcu() introduced in 6.7 by commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU"). The fixup will be done as part of the merge commit, and most of the text above will become the commit message for the merge. 
Pending post-merge work includes: - hugepage support - looking into using the restrictedmem framework for guest memory - introducing a testing mechanism to poison memory, possibly using the same memory attributes introduced here - SNP and TDX support There are two non-KVM patches buried in the middle of this series: fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure() mm: Add AS_UNMOVABLE to mark mapping as completely unmovable The first is small and mostly suggested-by Christian Brauner; the second a bit less so but it was written by an mm person (Vlastimil Babka).
2023-11-14KVM: x86: Add support for "protected VMs" that can utilize private memorySean Christopherson1-0/+5
Add a new x86 VM type, KVM_X86_SW_PROTECTED_VM, to serve as a development and testing vehicle for Confidential (CoCo) VMs, and potentially to even become a "real" product in the distant future, e.g. a la pKVM. The private memory support in KVM x86 is aimed at AMD's SEV-SNP and Intel's TDX, but those technologies are extremely complex (understatement), difficult to debug, don't support running as nested guests, and require hardware that isn't universally accessible. I.e. relying on SEV-SNP or TDX for maintaining guest private memory isn't a realistic option. At the very least, KVM_X86_SW_PROTECTED_VM will enable a variety of selftests for guest_memfd and private memory support without requiring unique hardware. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20231027182217.3615211-24-seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-14KVM: Allow arch code to track number of memslot address spaces per VMSean Christopherson2-13/+15
Let x86 track the number of address spaces on a per-VM basis so that KVM can disallow SMM memslots for confidential VMs. Confidential VMs are fundamentally incompatible with emulating SMM, which as the name suggests requires being able to read and write guest memory and register state. Disallowing SMM will simplify support for guest private memory, as KVM will not need to worry about tracking memory attributes for multiple address spaces (SMM is the only "non-default" address space across all architectures). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-23-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-14KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memorySean Christopherson5-5/+623
Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based memory that is tied to a specific KVM virtual machine and whose primary purpose is to serve guest memory. A guest-first memory subsystem allows for optimizations and enhancements that are kludgy or outright infeasible to implement/support in a generic memory subsystem. With guest_memfd, guest protections and mapping sizes are fully decoupled from host userspace mappings. E.g. KVM currently doesn't support mapping memory as writable in the guest without it also being writable in host userspace, as KVM's ABI uses VMA protections to define the allow guest protection. Userspace can fudge this by establishing two mappings, a writable mapping for the guest and readable one for itself, but that’s suboptimal on multiple fronts. Similarly, KVM currently requires the guest mapping size to be a strict subset of the host userspace mapping size, e.g. KVM doesn’t support creating a 1GiB guest mapping unless userspace also has a 1GiB guest mapping. Decoupling the mappings sizes would allow userspace to precisely map only what is needed without impacting guest performance, e.g. to harden against unintentional accesses to guest memory. Decoupling guest and userspace mappings may also allow for a cleaner alternative to high-granularity mappings for HugeTLB, which has reached a bit of an impasse and is unlikely to ever be merged. A guest-first memory subsystem also provides clearer line of sight to things like a dedicated memory pool (for slice-of-hardware VMs) and elimination of "struct page" (for offload setups where userspace _never_ needs to mmap() guest memory). More immediately, being able to map memory into KVM guests without mapping said memory into the host is critical for Confidential VMs (CoCo VMs), the initial use case for guest_memfd. 
While AMD's SEV and Intel's TDX prevent untrusted software from reading guest private data by encrypting guest memory with a key that isn't usable by the untrusted host, projects such as Protected KVM (pKVM) provide confidentiality and integrity *without* relying on memory encryption. And with SEV-SNP and TDX, accessing guest private memory can be fatal to the host, i.e. KVM must prevent host userspace from accessing guest memory irrespective of hardware behavior. Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as being mappable only by KVM (or a similarly enlightened kernel subsystem). That approach was abandoned largely due to it needing to play games with PROT_NONE to prevent userspace from accessing guest memory. Attempt #2 was to usurp PG_hwpoison to prevent the host from mapping guest private memory into userspace, but that approach failed to meet several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel wouldn't easily be able to enforce a 1:1 page:guest association, let alone a 1:1 pfn:gfn mapping. And using PG_hwpoison does not work for memory that isn't backed by 'struct page', e.g. if devices gain support for exposing encrypted memory regions to guests. Attempt #3 was to extend the memfd() syscall and wrap shmem to provide dedicated file-based guest memory. That approach made it as far as v10 before feedback from Hugh Dickins and Christian Brauner (and others) led to its demise. Hugh's objection was that piggybacking shmem made no sense for KVM's use case as KVM didn't actually *want* the features provided by shmem. I.e. KVM was using memfd() and shmem to avoid having to manage memory directly, not because memfd() and shmem were the optimal solution, e.g. things like read/write/mmap in shmem were dead weight. Christian pointed out flaws with implementing a partial overlay (wrapping only _some_ of shmem), e.g. 
poking at inode_operations or super_operations would show shmem stuff, but address_space_operations and file_operations would show KVM's overlay. Paraphrasing heavily, Christian suggested KVM stop being lazy and create a proper API. Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org Cc: Fuad Tabba <tabba@google.com> Cc: Vishal Annapurve <vannapurve@google.com> Cc: Ackerley Tng <ackerleytng@google.com> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Maciej Szmigiero <mail@maciej.szmigiero.name> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Quentin Perret <qperret@google.com> Cc: Michael Roth <michael.roth@amd.com> Cc: Wang <wei.w.wang@intel.com> Cc: Liam Merwick <liam.merwick@oracle.com> Cc: Isaku Yamahata <isaku.yamahata@gmail.com> Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Kirill A. 
Shutemov <kirill.shutemov@linux.intel.com> Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Co-developed-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Ackerley Tng <ackerleytng@google.com> Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Co-developed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20231027182217.3615211-17-seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13KVM: Introduce per-page memory attributesChao Peng2-0/+220
In confidential computing usages, whether a page is private or shared is necessary information for KVM to perform operations like page fault handling, page zapping etc. There are other potential use cases for per-page memory attributes, e.g. to make memory read-only (or no-exec, or exec-only, etc.) without having to modify memslots. Introduce the KVM_SET_MEMORY_ATTRIBUTES ioctl, advertised by KVM_CAP_MEMORY_ATTRIBUTES, to allow userspace to set the per-page memory attributes to a guest memory range. Use an xarray to store the per-page attributes internally, with a naive, not fully optimized implementation, i.e. prioritize correctness over performance for the initial implementation. Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX attributes/protections in the future, e.g. to give userspace fine-grained control over read, write, and execute protections for guest memory. Provide arch hooks for handling attribute changes before and after common code sets the new attributes, e.g. x86 will use the "pre" hook to zap all relevant mappings, and the "post" hook to track whether or not hugepages can be used to map the range. To simplify the implementation wrap the entire sequence with kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly guaranteed to be an invalidation. For the initial use case, x86 *will* always invalidate memory, and preventing arch code from creating new mappings while the attributes are in flux makes it much easier to reason about the correctness of consuming attributes. It's possible that future usages may not require an invalidation, e.g. if KVM ends up supporting RWX protections and userspace grants _more_ protections, but again opt for simplicity and punt optimizations to if/when they are needed. 
Suggested-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/all/Y2WB48kD0J4VGynX@google.com Cc: Fuad Tabba <tabba@google.com> Cc: Xu Yilun <yilun.xu@intel.com> Cc: Mickaël Salaün <mic@digikod.net> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20231027182217.3615211-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
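The attribute bit layout described in this commit can be sketched in user-space C. This is an illustration under stated assumptions, not KVM's implementation: the `ATTR_*` names, `set_memory_attributes()`, and the flat array standing in for the xarray are hypothetical. Only the use of bit 3 for PRIVATE, with bits 0-2 left free for possible RWX use, comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; only the bit positions follow the commit message. */
#define ATTR_RWX_MASK  0x7ULL        /* bits 0-2, left free for future R/W/X */
#define ATTR_PRIVATE   (1ULL << 3)   /* bit 3: page is private */

static_assert((ATTR_PRIVATE & ATTR_RWX_MASK) == 0,
	      "PRIVATE must not overlap the RWX bits");

#define NR_PAGES 16

/* Flat array standing in for the per-page store (KVM uses an xarray). */
static uint64_t page_attrs[NR_PAGES];

/* Set attributes for gfns in [start, end); reject unsupported bits. */
static int set_memory_attributes(size_t start, size_t end, uint64_t attrs)
{
	if (attrs & ~ATTR_PRIVATE)
		return -1;	/* only PRIVATE is accepted in this sketch */
	if (start >= end || end > NR_PAGES)
		return -1;
	for (size_t gfn = start; gfn < end; gfn++)
		page_attrs[gfn] = attrs;
	return 0;
}

static bool page_is_private(size_t gfn)
{
	return gfn < NR_PAGES && (page_attrs[gfn] & ATTR_PRIVATE);
}
```

Rejecting unknown bits up front keeps bits 0-2 available for later use without changing behavior for existing callers.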
2023-11-13KVM: Drop .on_unlock() mmu_notifier hookSean Christopherson1-10/+1
Drop the .on_unlock() mmu_notifier hook now that it's no longer used for notifying arch code that memory has been reclaimed. Adding .on_unlock() and invoking it *after* dropping mmu_lock was a terrible idea, as doing so resulted in .on_lock() and .on_unlock() having divergent and asymmetric behavior, and set future developers up for failure, i.e. all but asked for bugs where KVM relied on using .on_unlock() to try to run a callback while holding mmu_lock. Opportunistically add a lockdep assertion in kvm_mmu_invalidate_end() to guard against future bugs of this nature. Reported-by: Isaku Yamahata <isaku.yamahata@intel.com> Link: https://lore.kernel.org/all/20230802203119.GB2021422@ls.amr.corp.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13KVM: Add a dedicated mmu_notifier flag for reclaiming freed memorySean Christopherson1-16/+37
Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having __kvm_handle_hva_range() return whether or not an overlapping memslot was found, i.e. mmu_lock was acquired. Using the .on_unlock() hook works, but kvm_arch_guest_memory_reclaimed() needs to run after dropping mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical. Use a small struct to return the tuple of the notifier-specific return, plus whether or not overlap was found. Because the iteration helpers are __always_inlined, practically speaking, the struct will never actually be returned from a function call (not to mention the size of the struct will be two bytes in practice). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
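The "small struct" return described here might look roughly like the following. This is a sketch under assumptions: `struct notifier_ret`, `struct slot`, and `handle_range()` are hypothetical stand-ins for the kernel's types and iteration helper. It only shows how one return value can carry both the handler's result and whether any slot overlapped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names modeled on the description in the commit message. */
struct notifier_ret {
	bool ret;            /* notifier-specific result, e.g. "flush needed" */
	bool found_memslot;  /* true if any slot overlapped the range */
};

struct slot { unsigned long start, end; };	/* covers [start, end) */

typedef bool (*range_handler_t)(const struct slot *slot);

/* Walk the slots, invoke the handler on overlaps, report both facts. */
static inline struct notifier_ret
handle_range(const struct slot *slots, size_t nr_slots,
	     unsigned long start, unsigned long end, range_handler_t handler)
{
	struct notifier_ret r = { .ret = false, .found_memslot = false };

	for (size_t i = 0; i < nr_slots; i++) {
		if (slots[i].end <= start || slots[i].start >= end)
			continue;	/* no overlap with this slot */
		r.found_memslot = true;
		r.ret |= handler(&slots[i]);
	}
	return r;
}

/* Example handler that always asks for a flush. */
static bool always_flush(const struct slot *slot)
{
	(void)slot;
	return true;
}
```

A caller can then run its "memory reclaimed" step only when `found_memslot` is set, after any lock taken inside the walk has been released.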
2023-11-13KVM: Introduce KVM_SET_USER_MEMORY_REGION2Sean Christopherson1-8/+49
Introduce a "version 2" of KVM_SET_USER_MEMORY_REGION so that additional information can be supplied without setting userspace up to fail. The padding in the new kvm_userspace_memory_region2 structure will be used to pass a file descriptor in addition to the userspace_addr, i.e. allow userspace to point at a file descriptor and map memory into a guest that is NOT mapped into host userspace. Alternatively, KVM could simply add "struct kvm_userspace_memory_region2" without a new ioctl(), but as Paolo pointed out, adding a new ioctl() makes detection of bad flags a bit more robust, e.g. if the new fd field is guarded only by a flag and not a new ioctl(), then a userspace bug (setting a "bad" flag) would generate out-of-bounds access instead of an -EINVAL error. Cc: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-9-seanjc@google.com> Acked-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIERSean Christopherson2-5/+9
Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where appropriate to effectively maintain existing behavior. Using a proper Kconfig will simplify building more functionality on top of KVM's mmu_notifier infrastructure. Add a forward declaration of kvm_gfn_range to kvm_types.h so that including arch/powerpc/include/asm/kvm_ppc.h's with CONFIG_KVM=n doesn't generate warnings due to kvm_gfn_range being undeclared. PPC defines hooks for PR vs. HV without guarding them via #ifdeffery, e.g. bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range); bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); Alternatively, PPC could forward declare kvm_gfn_range, but there's no good reason not to define it in common KVM. Acked-by: Anup Patel <anup@brainfault.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13KVM: WARN if there are dangling MMU invalidations at VM destructionSean Christopherson1-1/+8
Add an assertion that there are no in-progress MMU invalidations when a VM is being destroyed, with the exception of the scenario where KVM unregisters its MMU notifier between an .invalidate_range_start() call and the corresponding .invalidate_range_end(). KVM can't detect unpaired calls from the mmu_notifier due to the above exception waiver, but the assertion can detect KVM bugs, e.g. the bug that *almost* escaped initial guest_memfd development. Link: https://lore.kernel.org/all/e397d30c-c6af-e68f-d18e-b4e3739c5389@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-13KVM: Use gfn instead of hva for mmu_notifier_retryChao Peng1-9/+34
Currently, in the mmu_notifier invalidate path, the hva range is recorded and then checked by mmu_invalidate_retry_hva() in the page fault handling path. However, for the soon-to-be-introduced private memory, a page fault may not have an associated hva, so checking the gfn (gpa) makes more sense. For existing hva-based shared memory, gfn is expected to also work. The only downside is that when multiple gfns alias a single hva, the current algorithm of checking multiple ranges could result in a much larger range being rejected. Such aliasing should be uncommon, so the impact is expected to be small. Suggested-by: Sean Christopherson <seanjc@google.com> Cc: Xu Yilun <yilun.xu@intel.com> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> [sean: convert vmx_set_apic_access_page_addr() to gfn-based API] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Xu Yilun <yilun.xu@linux.intel.com> Message-Id: <20231027182217.3615211-4-seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
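A gfn-based retry check of the kind described here can be sketched as follows. The struct and function names are hypothetical and the locking is omitted; this is a user-space model of the idea, not the kernel code. It also shows why concurrent invalidations can cause a larger range to be rejected: in-flight ranges are merged into one bounding range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t gfn_t;

/* Hypothetical stand-in for the invalidation state tracked per VM. */
struct inval_state {
	long in_progress;        /* number of invalidations in flight */
	unsigned long seq;       /* bumped when an invalidation completes */
	gfn_t range_start;       /* bounding range of in-flight work: */
	gfn_t range_end;         /* [range_start, range_end) */
};

/* Start an invalidation; concurrent ranges merge into one bounding range. */
static void inval_range_add(struct inval_state *s, gfn_t start, gfn_t end)
{
	if (s->in_progress++ == 0) {
		s->range_start = start;
		s->range_end = end;
	} else {
		if (start < s->range_start)
			s->range_start = start;
		if (end > s->range_end)
			s->range_end = end;
	}
}

/* Finish an invalidation; the sequence bump invalidates old snapshots. */
static void inval_end(struct inval_state *s)
{
	assert(s->in_progress > 0);
	s->seq++;
	s->in_progress--;
}

/* Page fault side: must a fault on this gfn be retried? */
static bool inval_retry_gfn(const struct inval_state *s,
			    unsigned long snapshot_seq, gfn_t gfn)
{
	if (s->in_progress && gfn >= s->range_start && gfn < s->range_end)
		return true;
	return s->seq != snapshot_seq;
}
```

The fault path would snapshot `seq` before resolving the page and call `inval_retry_gfn()` before installing the mapping.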
2023-11-13KVM: Assert that mmu_invalidate_in_progress *never* goes negativeSean Christopherson1-2/+1
Move the assertion on the in-progress invalidation count from the primary MMU's notifier path to KVM's common notification path, i.e. assert that the count doesn't go negative even when the invalidation is coming from KVM itself. Opportunistically convert the assertion to a KVM_BUG_ON(), i.e. kill only the affected VM, not the entire kernel. A corrupted count is fatal to the VM, e.g. the non-zero (negative) count will cause mmu_invalidate_retry() to block any and all attempts to install new mappings. But it's far from guaranteed that an end() without a start() is fatal or even problematic to anything other than the target VM, e.g. the underlying bug could simply be a duplicate call to end(). And it's much more likely that a missed invalidation, i.e. a potential use-after-free, would manifest as no notification whatsoever, not an end() without a start(). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-3-seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
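The difference between killing the kernel and killing only the affected VM can be illustrated with a small model. Everything here is hypothetical (`struct vm`, `vm_bug_on()`, and the helper names); it is a sketch of the behavior the commit message describes, not of KVM_BUG_ON() itself.

```c
#include <assert.h>
#include <stdbool.h>

struct vm {
	long invalidate_in_progress;
	bool bugged;	/* once set, this VM is dead; the host carries on */
};

/* Hypothetical analogue of a per-VM bug check: mark only this VM broken. */
static bool vm_bug_on(struct vm *vm, bool cond)
{
	if (cond)
		vm->bugged = true;
	return cond;
}

static void invalidate_begin(struct vm *vm)
{
	vm->invalidate_in_progress++;
}

/* Checked on every end(), regardless of who started the invalidation. */
static void invalidate_end(struct vm *vm)
{
	vm->invalidate_in_progress--;
	vm_bug_on(vm, vm->invalidate_in_progress < 0);
}

/* A non-zero count, including a negative one, blocks new mappings. */
static bool invalidate_retry(const struct vm *vm)
{
	return vm->invalidate_in_progress != 0;
}
```

An unpaired `invalidate_end()` leaves the count at -1, so `invalidate_retry()` stays true for that VM from then on, which is why the corrupted count is fatal to the VM even though nothing else on the host is affected.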
2023-11-13KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn rangesSean Christopherson1-15/+19
Rework and rename "struct kvm_hva_range" into "kvm_mmu_notifier_range" so that the structure can be used to handle notifications that operate on gfn context, i.e. that aren't tied to a host virtual address. Rename the handler typedef too (arguably it should always have been gfn_handler_t). Practically speaking, this is a nop for 64-bit kernels as the only meaningful change is to store start+end as u64s instead of unsigned longs. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-2-seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-09-07Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2-17/+40
Pull kvm updates from Paolo Bonzini: "ARM: - Clean up vCPU targets, always returning generic v8 as the preferred target - Trap forwarding infrastructure for nested virtualization (used for traps that are taken from an L2 guest and are needed by the L1 hypervisor) - FEAT_TLBIRANGE support to only invalidate specific ranges of addresses when collapsing a table PTE to a block PTE. This avoids that the guest refills the TLBs again for addresses that aren't covered by the table PTE. - Fix vPMU issues related to handling of PMUver. - Don't unnecessarily align non-stack allocations in the EL2 VA space - Drop HCR_VIRT_EXCP_MASK, which was never used... - Don't use smp_processor_id() in kvm_arch_vcpu_load(), but the cpu parameter instead - Drop redundant call to kvm_set_pfn_accessed() in user_mem_abort() - Remove prototypes without implementations RISC-V: - Zba, Zbs, Zicntr, Zicsr, Zifencei, and Zihpm support for guest - Added ONE_REG interface for SATP mode - Added ONE_REG interface to enable/disable multiple ISA extensions - Improved error codes returned by ONE_REG interfaces - Added KVM_GET_REG_LIST ioctl() implementation for KVM RISC-V - Added get-reg-list selftest for KVM RISC-V s390: - PV crypto passthrough enablement (Tony, Steffen, Viktor, Janosch) Allows a PV guest to use crypto cards. Card access is governed by the firmware and once a crypto queue is "bound" to a PV VM every other entity (PV or not) loses access until it is not bound anymore. Enablement is done via flags when creating the PV VM. 
- Guest debug fixes (Ilya) x86: - Clean up KVM's handling of Intel architectural events - Intel bugfixes - Add support for SEV-ES DebugSwap, allowing SEV-ES guests to use debug registers and generate/handle #DBs - Clean up LBR virtualization code - Fix a bug where KVM fails to set the target pCPU during an IRTE update - Fix fatal bugs in SEV-ES intrahost migration - Fix a bug where the recent (architecturally correct) change to reinject #BP and skip INT3 broke SEV guests (can't decode INT3 to skip it) - Retry APIC map recalculation if a vCPU is added/enabled - Overhaul emergency reboot code to bring SVM up to par with VMX, tie the "emergency disabling" behavior to KVM actually being loaded, and move all of the logic within KVM - Fix user triggerable WARNs in SVM where KVM incorrectly assumes the TSC ratio MSR cannot diverge from the default when TSC scaling is disabled up related code - Add a framework to allow "caching" feature flags so that KVM can check if the guest can use a feature without needing to search guest CPUID - Rip out the ancient MMU_DEBUG crud and replace the useful bits with CONFIG_KVM_PROVE_MMU - Fix KVM's handling of !visible guest roots to avoid premature triple fault injection - Overhaul KVM's page-track APIs, and KVMGT's usage, to reduce the API surface that is needed by external users (currently only KVMGT), and fix a variety of issues in the process Generic: - Wrap kvm_{gfn,hva}_range.pte in a union to allow mmu_notifier events to pass action specific data without needing to constantly update the main handlers. 
- Drop unused function declarations Selftests: - Add testcases to x86's sync_regs_test for detecting KVM TOCTOU bugs - Add support for printf() in guest code and convert all guest asserts to use printf-based reporting - Clean up the PMU event filter test and add new testcases - Include x86 selftests in the KVM x86 MAINTAINERS entry" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (279 commits) KVM: x86/mmu: Include mmu.h in spte.h KVM: x86/mmu: Use dummy root, backed by zero page, for !visible guest roots KVM: x86/mmu: Disallow guest from using !visible slots for page tables KVM: x86/mmu: Harden TDP MMU iteration against root w/o shadow page KVM: x86/mmu: Harden new PGD against roots without shadow pages KVM: x86/mmu: Add helper to convert root hpa to shadow page drm/i915/gvt: Drop final dependencies on KVM internal details KVM: x86/mmu: Handle KVM bookkeeping in page-track APIs, not callers KVM: x86/mmu: Drop @slot param from exported/external page-track APIs KVM: x86/mmu: Bug the VM if write-tracking is used but not enabled KVM: x86/mmu: Assert that correct locks are held for page write-tracking KVM: x86/mmu: Rename page-track APIs to reflect the new reality KVM: x86/mmu: Drop infrastructure for multiple page-track modes KVM: x86/mmu: Use page-track notifiers iff there are external users KVM: x86/mmu: Move KVM-only page-track declarations to internal header KVM: x86: Remove the unused page-track hook track_flush_slot() drm/i915/gvt: switch from ->track_flush_slot() to ->track_remove_region() KVM: x86: Add a new page-track hook to handle memslot deletion drm/i915/gvt: Don't bother removing write-protection on to-be-deleted slot KVM: x86: Reject memslot MOVE operations if KVMGT is attached ...
2023-08-31Merge tag 'kvm-x86-generic-6.6' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-9/+10
Common KVM changes for 6.6: - Wrap kvm_{gfn,hva}_range.pte in a union to allow mmu_notifier events to pass action specific data without needing to constantly update the main handlers. - Drop unused function declarations
2023-08-31Merge tag 'kvmarm-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEADPaolo Bonzini2-8/+30
KVM/arm64 updates for Linux 6.6 - Add support for TLB range invalidation of Stage-2 page tables, avoiding unnecessary invalidations. Systems that do not implement range invalidation still rely on a full invalidation when dealing with large ranges. - Add infrastructure for forwarding traps taken from a L2 guest to the L1 guest, with L0 acting as the dispatcher, another baby step towards the full nested support. - Simplify the way we deal with the (long deprecated) 'CPU target', resulting in a much needed cleanup. - Fix another set of PMU bugs, both on the guest and host sides, as we seem to never have any shortage of those... - Relax the alignment requirements of EL2 VA allocations for non-stack allocations, as we were otherwise wasting a lot of that precious VA space. - The usual set of non-functional cleanups, although I note the lack of spelling fixes...
2023-08-30Merge tag 'vfio-v6.6-rc1' of https://github.com/awilliam/linux-vfioLinus Torvalds1-84/+77
Pull VFIO updates from Alex Williamson: - VFIO direct character device (cdev) interface support. This extracts the vfio device fd from the container and group model, and is intended to be the native uAPI for use with IOMMUFD (Yi Liu) - Enhancements to the PCI hot reset interface in support of cdev usage (Yi Liu) - Fix a potential race between registering and unregistering vfio files in the kvm-vfio interface and extend use of a lock to avoid extra drop and acquires (Dmitry Torokhov) - A new vfio-pci variant driver for the AMD/Pensando Distributed Services Card (PDS) Ethernet device, supporting live migration (Brett Creeley) - Cleanups to remove redundant owner setup in cdx and fsl bus drivers, and simplify driver init/exit in fsl code (Li Zetao) - Fix uninitialized hole in data structure and pad capability structures for alignment (Stefan Hajnoczi) * tag 'vfio-v6.6-rc1' of https://github.com/awilliam/linux-vfio: (53 commits) vfio/pds: Send type for SUSPEND_STATUS command vfio/pds: fix return value in pds_vfio_get_lm_file() pds_core: Fix function header descriptions vfio: align capability structures vfio/type1: fix cap_migration information leak vfio/fsl-mc: Use module_fsl_mc_driver macro to simplify the code vfio/cdx: Remove redundant initialization owner in vfio_cdx_driver vfio/pds: Add Kconfig and documentation vfio/pds: Add support for firmware recovery vfio/pds: Add support for dirty page tracking vfio/pds: Add VFIO live migration support vfio/pds: register with the pds_core PF pds_core: Require callers of register/unregister to pass PF drvdata vfio/pds: Initial support for pds VFIO driver vfio: Commonize combine_ranges for use in other VFIO drivers kvm/vfio: avoid bouncing the mutex when adding and deleting groups kvm/vfio: ensure kvg instance stays around in kvm_vfio_group_add() docs: vfio: Add vfio device cdev description vfio: Compile vfio_group infrastructure optionally vfio: Move the IOMMU_CAP_CACHE_COHERENCY check in __vfio_register_dev() ...
2023-08-21kvm: explicitly set FOLL_HONOR_NUMA_FAULT in hva_to_pfn_slow()David Hildenbrand1-1/+12
KVM is *the* case we know that really wants to honor NUMA hinting faults. As we want to stop setting FOLL_HONOR_NUMA_FAULT implicitly, set FOLL_HONOR_NUMA_FAULT whenever we might obtain pages on behalf of a VCPU to map them into a secondary MMU, and add a comment why. Do that unconditionally in hva_to_pfn_slow() when calling get_user_pages_unlocked(). kvmppc_book3s_instantiate_page(), hva_to_pfn_fast() and gfn_to_page_many_atomic() are similarly used to map pages into a secondary MMU. However, FOLL_WRITE and get_user_page_fast_only() always implicitly honor NUMA hinting faults -- as documented for FOLL_HONOR_NUMA_FAULT -- so we can limit this change to a single location for now. Don't set it in check_user_page_hwpoison(), where we really only want to check if the mapped page is HW-poisoned. We won't set it for other KVM users of get_user_pages()/pin_user_pages() * arch/powerpc/kvm/book3s_64_mmu_hv.c: not used to map pages into a secondary MMU. * arch/powerpc/kvm/e500_mmu.c: only used on shared TLB pages with userspace * arch/s390/kvm/*: s390x only supports a single NUMA node either way * arch/x86/kvm/svm/sev.c: not used to map pages into a secondary MMU. This is a preparation for making FOLL_HONOR_NUMA_FAULT no longer implicitly be set by get_user_pages() and friends. Link: https://lkml.kernel.org/r/20230803143208.383663-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: liubo <liubo254@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-17KVM: Wrap kvm_{gfn,hva}_range.pte in a per-action unionSean Christopherson1-9/+10
Wrap kvm_{gfn,hva}_range.pte in a union so that future notifier events can pass event specific information up and down the stack without needing to constantly expand and churn the APIs. Lockless aging of SPTEs will pass around a bitmap, and support for memory attributes will pass around the new attributes for the range. Add a "KVM_NO_ARG" placeholder to simplify handling events without an argument (creating a dummy union variable is mildly annoying). Opportunistically drop explicit zero-initialization of the "pte" field, as omitting the field (now a union) has the same effect. Cc: Yu Zhao <yuzhao@google.com> Link: https://lore.kernel.org/all/CAOUHufagkd2Jk3_HrVoFFptRXM=hX2CV8f+M-dka-hJU4bP8kw@mail.gmail.com Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Yu Zhao <yuzhao@google.com> Link: https://lore.kernel.org/r/20230729004144.1054885-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
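The per-action union pattern can be sketched in standalone C. The type and macro names below (`union range_arg`, `NO_ARG`, `dispatch()`) are hypothetical and the members are illustrative; the point is only that adding a new event-specific payload means adding a union member rather than changing every handler signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-action argument; member names are illustrative. */
union range_arg {
	uint64_t pte;         /* e.g. a changed PTE value */
	uint64_t attributes;  /* e.g. new memory attributes for the range */
	uint64_t raw;         /* generic view used by this sketch */
};

/* Placeholder for events that carry no argument. */
#define NO_ARG ((union range_arg){ .raw = 0 })

struct gfn_range {
	uint64_t start, end;
	union range_arg arg;
};

typedef bool (*gfn_handler_t)(const struct gfn_range *range);

/* Common code packs the argument once; handlers pick the member they need. */
static bool dispatch(uint64_t start, uint64_t end, union range_arg arg,
		     gfn_handler_t handler)
{
	struct gfn_range range = { .start = start, .end = end, .arg = arg };

	return handler(&range);
}

/* Example handler: reports whether any payload was supplied. */
static bool arg_is_nonzero(const struct gfn_range *range)
{
	return range->arg.raw != 0;
}
```

Callers without a payload pass `NO_ARG` instead of declaring a dummy union variable.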
2023-08-17KVM: Move kvm_arch_flush_remote_tlbs_memslot() to common codeDavid Matlack1-2/+16
Move kvm_arch_flush_remote_tlbs_memslot() to common code and drop "arch_" from the name. kvm_arch_flush_remote_tlbs_memslot() is just a range-based TLB invalidation where the range is defined by the memslot. Now that kvm_flush_remote_tlbs_range() can be called from common code we can just use that and drop a bunch of duplicate code from the arch directories. Note this adds a lockdep assertion for slots_lock being held when calling kvm_flush_remote_tlbs_memslot(), which was previously only asserted on x86. MIPS has calls to kvm_flush_remote_tlbs_memslot(), but they all hold the slots_lock, so the lockdep assertion continues to hold true. Also drop the CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT ifdef gating kvm_flush_remote_tlbs_memslot(), since it is no longer necessary. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Acked-by: Anup Patel <anup@brainfault.org> Acked-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230811045127.3308641-7-rananta@google.com
2023-08-17KVM: Allow range-based TLB invalidation from common codeDavid Matlack1-0/+13
Make kvm_flush_remote_tlbs_range() visible in common code and create a default implementation that just invalidates the whole TLB. This paves the way for several future features/cleanups: - Introduction of range-based TLBI on ARM. - Eliminating kvm_arch_flush_remote_tlbs_memslot() - Moving the KVM/x86 TDP MMU to common code. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Anup Patel <anup@brainfault.org> Acked-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230811045127.3308641-6-rananta@google.com
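The fallback behavior described here, a range flush that degrades to a full flush when no range-based implementation exists, can be modeled as below. The function-pointer hook and all names are assumptions made for illustration; the kernel's own mechanism for selecting the arch implementation may differ.

```c
#include <assert.h>
#include <stdint.h>

struct vm {
	unsigned long full_flushes;
	unsigned long range_flushes;
	/* Optional arch hook: returns 0 on success, non-zero if unsupported. */
	int (*arch_flush_range)(struct vm *vm, uint64_t gfn, uint64_t nr_pages);
};

/* Flush everything; always available. */
static void flush_remote_tlbs(struct vm *vm)
{
	vm->full_flushes++;
}

/* Try the range-based hook first; otherwise invalidate the whole TLB. */
static void flush_remote_tlbs_range(struct vm *vm, uint64_t gfn,
				    uint64_t nr_pages)
{
	if (vm->arch_flush_range && !vm->arch_flush_range(vm, gfn, nr_pages))
		return;
	flush_remote_tlbs(vm);
}

/* Example arch hook that supports range flushes. */
static int arch_range_ok(struct vm *vm, uint64_t gfn, uint64_t nr_pages)
{
	(void)gfn;
	(void)nr_pages;
	vm->range_flushes++;
	return 0;
}
```

Falling back to a full flush is always correct, only potentially slower, which is why it is a safe default for architectures without range support.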
2023-08-17KVM: Remove CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALLRaghavendra Rao Ananta2-5/+0
kvm_arch_flush_remote_tlbs() and CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL are two mechanisms to solve the same problem, allowing architecture-specific code to provide a non-IPI implementation of remote TLB flushing. Dropping CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL allows KVM to standardize all architectures on kvm_arch_flush_remote_tlbs() instead of maintaining two mechanisms. Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230811045127.3308641-5-rananta@google.com
2023-08-17KVM: Rename kvm_arch_flush_remote_tlb() to kvm_arch_flush_remote_tlbs()David Matlack1-1/+1
Rename kvm_arch_flush_remote_tlb() and the associated macro __KVM_HAVE_ARCH_FLUSH_REMOTE_TLB to kvm_arch_flush_remote_tlbs() and __KVM_HAVE_ARCH_FLUSH_REMOTE_TLBS respectively. Making the name plural matches kvm_flush_remote_tlbs() and makes it more clear that this function can affect more than one remote TLB. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Acked-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20230811045127.3308641-2-rananta@google.com
2023-08-03kvm/vfio: avoid bouncing the mutex when adding and deleting groupsDmitry Torokhov1-17/+9
Stop taking kv->lock mutex in kvm_vfio_update_coherency() and instead call it with this mutex held: the callers of the function usually already have it taken (and released) before calling kvm_vfio_update_coherency(). This avoids bouncing the lock up and down. The exception is kvm_vfio_release() where we do not take the lock, but it is being executed when the very last reference to kvm_device is being dropped, so there are no concerns about concurrency. Suggested-by: Alex Williamson <alex.williamson@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20230714224538.404793-2-dmitry.torokhov@gmail.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-08-03kvm/vfio: ensure kvg instance stays around in kvm_vfio_group_add()Dmitry Torokhov1-1/+1
kvm_vfio_group_add() creates kvg instance, links it to kv->group_list, and calls kvm_vfio_file_set_kvm() with kvg->file as an argument after dropping kv->lock. If we race group addition and deletion calls, kvg instance may get freed by the time we get around to calling kvm_vfio_file_set_kvm(). Previous iterations of the code did not reference kvg->file outside of the critical section, but used a temporary variable. Still, they had similar problem of the file reference being owned by kvg structure and potential for kvm_vfio_group_del() dropping it before kvm_vfio_group_add() had a chance to complete. Fix this by moving call to kvm_vfio_file_set_kvm() under the protection of kv->lock. We already call it while holding the same lock when vfio group is being deleted, so it should be safe here as well. Fixes: 2fc1bec15883 ("kvm: set/clear kvm to/from vfio_group when group add/delete") Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20230714224538.404793-1-dmitry.torokhov@gmail.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-07-29KVM: Grab a reference to KVM for VM and vCPU stats file descriptorsSean Christopherson1-0/+24
Grab a reference to KVM prior to installing VM and vCPU stats file descriptors to ensure the underlying VM and vCPU objects are not freed until the last reference to any and all stats fds are dropped. Note, the stats paths manually invoke fd_install() and so don't need to grab a reference before creating the file. Fixes: ce55c049459c ("KVM: stats: Support binary stats retrieval for a VCPU") Fixes: fcfe1baeddbf ("KVM: stats: Support binary stats retrieval for a VM") Reported-by: Zheng Zhang <zheng.zhang@email.ucr.edu> Closes: https://lore.kernel.org/all/CAC_GQSr3xzZaeZt85k_RCBd5kfiOve8qXo7a81Cq53LuVQ5r=Q@mail.gmail.com Cc: stable@vger.kernel.org Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Message-Id: <20230711230131.648752-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-07-25kvm/vfio: Accept vfio device file from userspaceYi Liu1-6/+6
Define KVM_DEV_VFIO_FILE* and alias them with KVM_DEV_VFIO_GROUP*. Old userspace that uses KVM_DEV_VFIO_GROUP* continues to work. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-6-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-07-25kvm/vfio: Prepare for accepting vfio device fdYi Liu1-57/+58
This renames kvm_vfio_group related helpers to prepare for accepting vfio device fd. No functional change is intended. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-5-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-07-25vfio: Refine vfio file kAPIs for KVMYi Liu1-5/+5
This prepares for making the kAPIs below accept both group files and device files instead of only vfio group files. bool vfio_file_enforced_coherent(struct file *file); void vfio_file_set_kvm(struct file *file, struct kvm *kvm); Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Terrence Xu <terrence.xu@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yanting Jiang <yanting.jiang@intel.com> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Tested-by: Zhenzhong Duan <zhenzhong.duan@intel.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230718135551.6592-3-yi.l.liu@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-07-03Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds3-41/+27
Pull kvm updates from Paolo Bonzini: "ARM64: - Eager page splitting optimization for dirty logging, optionally allowing for a VM to avoid the cost of hugepage splitting in the stage-2 fault path. - Arm FF-A proxy for pKVM, allowing a pKVM host to safely interact with services that live in the Secure world. pKVM intervenes on FF-A calls to guarantee the host doesn't misuse memory donated to the hyp or a pKVM guest. - Support for running the split hypervisor with VHE enabled, known as 'hVHE' mode. This is extremely useful for testing the split hypervisor on VHE-only systems, and paves the way for new use cases that depend on having two TTBRs available at EL2. - Generalized framework for configurable ID registers from userspace. KVM/arm64 currently prevents arbitrary CPU feature set configuration from userspace, but the intent is to relax this limitation and allow userspace to select a feature set consistent with the CPU. - Enable the use of Branch Target Identification (FEAT_BTI) in the hypervisor. - Use a separate set of pointer authentication keys for the hypervisor when running in protected mode, as the host is untrusted at runtime. - Ensure timer IRQs are consistently released in the init failure paths. - Avoid trapping CTR_EL0 on systems with Enhanced Virtualization Traps (FEAT_EVT), as it is a register commonly read from userspace. - Erratum workaround for the upcoming AmpereOne part, which has broken hardware A/D state management. 
RISC-V: - Redirect AMO load/store misaligned traps to KVM guest - Trap-n-emulate AIA in-kernel irqchip for KVM guest - Svnapot support for KVM Guest s390: - New uvdevice secret API - CMM selftest and fixes - fix racy access to target CPU for diag 9c x86: - Fix missing/incorrect #GP checks on ENCLS - Use standard mmu_notifier hooks for handling APIC access page - Drop now unnecessary TR/TSS load after VM-Exit on AMD - Print more descriptive information about the status of SEV and SEV-ES during module load - Add a test for splitting and reconstituting hugepages during and after dirty logging - Add support for CPU pinning in demand paging test - Add support for AMD PerfMonV2, with a variety of cleanups and minor fixes included along the way - Add a "nx_huge_pages=never" option to effectively avoid creating NX hugepage recovery threads (because nx_huge_pages=off can be toggled at runtime) - Move handling of PAT out of MTRR code and dedup SVM+VMX code - Fix output of PIC poll command emulation when there's an interrupt - Add a maintainer's handbook to document KVM x86 processes, preferred coding style, testing expectations, etc. 
- Misc cleanups, fixes and comments Generic: - Miscellaneous bugfixes and cleanups Selftests: - Generate dependency files so that partial rebuilds work as expected" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (153 commits) Documentation/process: Add a maintainer handbook for KVM x86 Documentation/process: Add a label for the tip tree handbook's coding style KVM: arm64: Fix misuse of KVM_ARM_VCPU_POWER_OFF bit index RISC-V: KVM: Remove unneeded semicolon RISC-V: KVM: Allow Svnapot extension for Guest/VM riscv: kvm: define vcpu_sbi_ext_pmu in header RISC-V: KVM: Expose IMSIC registers as attributes of AIA irqchip RISC-V: KVM: Add in-kernel virtualization of AIA IMSIC RISC-V: KVM: Expose APLIC registers as attributes of AIA irqchip RISC-V: KVM: Add in-kernel emulation of AIA APLIC RISC-V: KVM: Implement device interface for AIA irqchip RISC-V: KVM: Skeletal in-kernel AIA irqchip support RISC-V: KVM: Set kvm_riscv_aia_nr_hgei to zero RISC-V: KVM: Add APLIC related defines RISC-V: KVM: Add IMSIC related defines RISC-V: KVM: Implement guest external interrupt line management KVM: x86: Remove PRIx* definitions as they are solely for user space s390/uv: Update query for secret-UVCs s390/uv: replace scnprintf with sysfs_emit s390/uvdevice: Add 'Lock Secret Store' UVC ...
2023-07-01Merge tag 'kvm-x86-vmx-6.5' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini1-18/+0
KVM VMX changes for 6.5: - Fix missing/incorrect #GP checks on ENCLS - Use standard mmu_notifier hooks for handling APIC access page - Misc cleanups
2023-07-01Merge tag 'kvm-x86-generic-6.5' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini3-22/+25
Common KVM changes for 6.5: - Fix unprotected vcpu->pid dereference via debugfs - Fix KVM_BUG() and KVM_BUG_ON() macros with 64-bit conditionals - Refactor failure path in kvm_io_bus_unregister_dev() to simplify the code - Misc cleanups
2023-07-01Merge tag 'kvmarm-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEADPaolo Bonzini1-1/+2
KVM/arm64 updates for 6.5 - Eager page splitting optimization for dirty logging, optionally allowing for a VM to avoid the cost of block splitting in the stage-2 fault path. - Arm FF-A proxy for pKVM, allowing a pKVM host to safely interact with services that live in the Secure world. pKVM intervenes on FF-A calls to guarantee the host doesn't misuse memory donated to the hyp or a pKVM guest. - Support for running the split hypervisor with VHE enabled, known as 'hVHE' mode. This is extremely useful for testing the split hypervisor on VHE-only systems, and paves the way for new use cases that depend on having two TTBRs available at EL2. - Generalized framework for configurable ID registers from userspace. KVM/arm64 currently prevents arbitrary CPU feature set configuration from userspace, but the intent is to relax this limitation and allow userspace to select a feature set consistent with the CPU. - Enable the use of Branch Target Identification (FEAT_BTI) in the hypervisor. - Use a separate set of pointer authentication keys for the hypervisor when running in protected mode, as the host is untrusted at runtime. - Ensure timer IRQs are consistently released in the init failure paths. - Avoid trapping CTR_EL0 on systems with Enhanced Virtualization Traps (FEAT_EVT), as it is a register commonly read from userspace. - Erratum workaround for the upcoming AmpereOne part, which has broken hardware A/D state management. As a consequence of the hVHE series reworking the arm64 software features framework, the for-next/module-alloc branch from the arm64 tree comes along for the ride.
2023-06-28Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mmLinus Torvalds2-7/+9
Pull mm updates from Andrew Morton: - Yosry Ahmed brought back some cgroup v1 stats in OOM logs - Yosry has also eliminated cgroup's atomic rstat flushing - Nhat Pham adds the new cachestat() syscall. It provides userspace with the ability to query pagecache status - a similar concept to mincore() but more powerful and with improved usability - Mel Gorman provides more optimizations for compaction, reducing the prevalence of page rescanning - Lorenzo Stoakes has done some maintenance work on the get_user_pages() interface - Liam Howlett continues with cleanups and maintenance work to the maple tree code. Peng Zhang also does some work on maple tree - Johannes Weiner has done some cleanup work on the compaction code - David Hildenbrand has contributed additional selftests for get_user_pages() - Thomas Gleixner has contributed some maintenance and optimization work for the vmalloc code - Baolin Wang has provided some compaction cleanups - SeongJae Park continues maintenance work on the DAMON code - Huang Ying has done some maintenance on the swap code's usage of device refcounting - Christoph Hellwig has some cleanups for the filemap/directio code - Ryan Roberts provides two patch series which yield some rationalization of the kernel's access to pte entries - use the provided APIs rather than open-coding accesses - Lorenzo Stoakes has some fixes to the interaction between pagecache and directio access to file mappings - John Hubbard has a series of fixes to the MM selftesting code - ZhangPeng continues the folio conversion campaign - Hugh Dickins has been working on the pagetable handling code, mainly with a view to reducing the load on the mmap_lock - Catalin Marinas has reduced the arm64 kmalloc() minimum alignment from 128 to 8 - Domenico Cerasuolo has improved the zswap reclaim mechanism by reorganizing the LRU management - Matthew Wilcox provides some fixups to make gfs2 work better with the buffer_head code 
- Vishal Moola also has done some folio conversion work - Matthew Wilcox has removed the remnants of the pagevec code - their functionality is migrated over to struct folio_batch * tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits) mm/hugetlb: remove hugetlb_set_page_subpool() mm: nommu: correct the range of mmap_sem_read_lock in task_mem() hugetlb: revert use of page_cache_next_miss() Revert "page cache: fix page_cache_next/prev_miss off by one" mm/vmscan: fix root proactive reclaim unthrottling unbalanced node mm: memcg: rename and document global_reclaim() mm: kill [add|del]_page_to_lru_list() mm: compaction: convert to use a folio in isolate_migratepages_block() mm: zswap: fix double invalidate with exclusive loads mm: remove unnecessary pagevec includes mm: remove references to pagevec mm: rename invalidate_mapping_pagevec to mapping_try_invalidate mm: remove struct pagevec net: convert sunrpc from pagevec to folio_batch i915: convert i915_gpu_error to use a folio_batch pagevec: rename fbatch_count() mm: remove check_move_unevictable_pages() drm: convert drm_gem_put_pages() to use a folio_batch i915: convert shmem_sg_free_table() to use a folio_batch scatterlist: add sg_set_folio() ...
2023-06-22KVM: Avoid illegal stage2 mapping on invalid memory slotGavin Shan1-1/+19
We run into a guest hang in edk2 firmware when KSM is kept running on the host. The edk2 firmware is waiting for status 0x80 from QEMU's pflash device (TYPE_PFLASH_CFI01) during the operation of sector erasing or buffered write. The status is returned by reading the memory region of the pflash device and the read request should have been forwarded to QEMU and emulated by it. Unfortunately, the read request is covered by an illegal stage2 mapping when the guest hang issue occurs. The read request is completed with QEMU bypassed and the wrong status is fetched. The edk2 firmware runs into an infinite loop with the wrong status. The illegal stage2 mapping is populated due to same page sharing by KSM at (C) even though the associated memory slot has been marked as invalid at (B) when the memory slot is requested to be deleted. It's notable that the active and inactive memory slots can't be swapped when we're in the middle of kvm_mmu_notifier_change_pte() because kvm->mn_active_invalidate_count is elevated, and kvm_swap_active_memslots() will busy loop until it reaches zero again. Besides, the swapping from the active to the inactive memory slots is also avoided by holding &kvm->srcu in __kvm_handle_hva_range(), corresponding to synchronize_srcu_expedited() in kvm_swap_active_memslots().

  CPU-A                                            CPU-B
  -----                                            -----
  ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION)
  kvm_vm_ioctl_set_memory_region
  kvm_set_memory_region
  __kvm_set_memory_region
  kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE)
    kvm_invalidate_memslot
      kvm_copy_memslot
      kvm_replace_memslot
      kvm_swap_active_memslots        (A)
      kvm_arch_flush_shadow_memslot   (B)
                                                   same page sharing by KSM
                                                   kvm_mmu_notifier_invalidate_range_start
                                                         :
                                                   kvm_mmu_notifier_change_pte
                                                     kvm_handle_hva_range
                                                     __kvm_handle_hva_range
                                                     kvm_set_spte_gfn            (C)
                                                         :
                                                   kvm_mmu_notifier_invalidate_range_end

Fix the issue by skipping the invalid memory slot at (C) to avoid the illegal stage2 mapping so that the read request for the pflash's status is forwarded to QEMU and emulated by it. 
In this way, the correct pflash's status can be returned from QEMU to break the infinite loop in the edk2 firmware. We tried a git-bisect and the first problematic commit is cd4c71835228 ("KVM: arm64: Convert to the gfn-based MMU notifier callbacks"). With this, clean_dcache_guest_page() is called after the memory slots are iterated in kvm_mmu_notifier_change_pte(). clean_dcache_guest_page() is called before the iteration on the memory slots before this commit. This change literally enlarges the racy window between kvm_mmu_notifier_change_pte() and memory slot removal so that we're able to reproduce the issue in a practical test case. However, the issue has existed since commit d5d8184d35c9 ("KVM: ARM: Memory virtualization setup"). Cc: stable@vger.kernel.org # v3.9+ Fixes: d5d8184d35c9 ("KVM: ARM: Memory virtualization setup") Reported-by: Shuai Hu <hshuai@redhat.com> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com> Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Message-Id: <20230615054259.14911-1-gshan@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-06-19mm: ptep_get() conversionRyan Roberts1-4/+7
Convert all instances of direct pte_t* dereferencing to instead use ptep_get() helper. This means that by default, the accesses change from a C dereference to a READ_ONCE(). This is technically the correct thing to do since where pgtables are modified by HW (for access/dirty) they are volatile and therefore we should always ensure READ_ONCE() semantics. But more importantly, by always using the helper, it can be overridden by the architecture to fully encapsulate the contents of the pte. Arch code is deliberately not converted, as the arch code knows best. It is intended that arch code (arm64) will override the default with its own implementation that can (e.g.) hide certain bits from the core code, or determine young/dirty status by mixing in state from another source. Conversion was done using Coccinelle: ---- // $ make coccicheck \ // COCCI=ptepget.cocci \ // SPFLAGS="--include-headers" \ // MODE=patch virtual patch @ depends on patch @ pte_t *v; @@ - *v + ptep_get(v) ---- Then reviewed and hand-edited to avoid multiple unnecessary calls to ptep_get(), instead opting to store the result of a single call in a variable, where it is correct to do so. This aims to negate any cost of READ_ONCE() and will benefit arch-overrides that may be more complex. Included is a fix for an issue in an earlier version of this patch that was pointed out by kernel test robot. The issue arose because config MMU=n elides definition of the ptep helper functions, including ptep_get(). HUGETLB_PAGE=n configs still define a simple huge_ptep_clear_flush() for linking purposes, which dereferences the ptep. So when both configs are disabled, this caused a build error because ptep_get() is not defined. Fix by continuing to do a direct dereference when MMU=n. This is safe because for this config the arch code cannot be trying to virtualize the ptes because none of the ptep helpers are defined. 
Link: https://lkml.kernel.org/r/20230612151545.3317766-4-ryan.roberts@arm.com Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/oe-kbuild-all/202305120142.yXsNEo6H-lkp@intel.com/ Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dave Airlie <airlied@gmail.com> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ian Rogers <irogers@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: SeongJae Park <sj@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
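The access pattern described in the ptep_get() conversion above can be sketched in plain userspace C. This is only an illustration under stated assumptions: `pte_sketch_t`, `ptep_get_sketch()`, the `SKETCH_PTE_*` bit values and `pte_present_and_dirty()` are hypothetical names invented here, and the volatile load merely approximates what READ_ONCE() provides in the kernel.

```c
/* Simplified stand-in for the kernel's pte_t; the real type is per-arch. */
typedef struct { unsigned long pte; } pte_sketch_t;

/* Default-style accessor: one volatile load, similar in spirit to READ_ONCE(). */
static inline pte_sketch_t ptep_get_sketch(const pte_sketch_t *ptep)
{
	pte_sketch_t val;

	val.pte = *(const volatile unsigned long *)&ptep->pte;
	return val;
}

/* Hypothetical bit positions, chosen only for this example. */
#define SKETCH_PTE_PRESENT 0x1UL
#define SKETCH_PTE_DIRTY   0x2UL

/* Read the entry once and test the snapshot, instead of dereferencing twice. */
static inline int pte_present_and_dirty(const pte_sketch_t *ptep)
{
	pte_sketch_t pte = ptep_get_sketch(ptep);

	return (pte.pte & SKETCH_PTE_PRESENT) && (pte.pte & SKETCH_PTE_DIRTY);
}
```

Keeping the result of a single accessor call in a local variable mirrors the hand-editing step mentioned in the commit message: both checks see the same snapshot, and an architecture override of the accessor would be invoked only once.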
2023-06-15Merge branch kvm-arm64/eager-page-splitting into kvmarm/nextOliver Upton1-1/+2
* kvm-arm64/eager-page-splitting: : Eager Page Splitting, courtesy of Ricardo Koller. : : Dirty logging performance is dominated by the cost of splitting : hugepages to PTE granularity. On systems that mere mortals can get their : hands on, each fault incurs the cost of a full break-before-make : pattern, wherein the broadcast invalidation and ensuing serialization : significantly increases fault latency. : : The goal of eager page splitting is to move the cost of hugepage : splitting out of the stage-2 fault path and instead into the ioctls : responsible for managing the dirty log: : : - If manual protection is enabled for the VM, hugepage splitting : happens in the KVM_CLEAR_DIRTY_LOG ioctl. This is desirable as it : provides userspace granular control over hugepage splitting. : : - Otherwise, if userspace relies on the legacy dirty log behavior : (clear on collection), hugepage splitting is done at the moment dirty : logging is enabled for a particular memslot. : : Support for eager page splitting requires explicit opt-in from : userspace, which is realized through the : KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE capability. arm64: kvm: avoid overflow in integer division KVM: arm64: Use local TLBI on permission relaxation KVM: arm64: Split huge pages during KVM_CLEAR_DIRTY_LOG KVM: arm64: Open-code kvm_mmu_write_protect_pt_masked() KVM: arm64: Split huge pages when dirty logging is enabled KVM: arm64: Add kvm_uninit_stage2_mmu() KVM: arm64: Refactor kvm_arch_commit_memory_region() KVM: arm64: Add kvm_pgtable_stage2_split() KVM: arm64: Add KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE KVM: arm64: Export kvm_are_all_memslots_empty() KVM: arm64: Add helper for creating unlinked stage2 subtrees KVM: arm64: Add KVM_PGTABLE_WALK flags for skipping CMOs and BBM TLBIs KVM: arm64: Rename free_removed to free_unlinked Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-06-13kvm/eventfd: use list_for_each_entry when deassign ioeventfdWei Wang1-3/+2
Simplify kvm_deassign_ioeventfd_idx to use list_for_each_entry as the loop just ends at the entry that's found and deleted. Note, coalesced_mmio_ops and ioeventfd_ops are the only instances of kvm_io_device_ops that implement a destructor, all other callers of kvm_io_bus_unregister_dev() are unaffected by this change. Suggested-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Wei Wang <wei.w.wang@intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230207123713.3905-3-wei.w.wang@intel.com [sean: call out that only select users implement a destructor] Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-13KVM: destruct kvm_io_device while unregistering it from kvm_io_busWei Wang3-16/+17
Current usage of kvm_io_device requires users to destruct it with an extra call of kvm_iodevice_destructor after the device gets unregistered from kvm_io_bus. This is not necessary and can cause errors if a user forgot to make the extra call. Simplify the usage by combining kvm_iodevice_destructor into kvm_io_bus_unregister_dev. This reduces LOCs a bit for users and can avoid the leakage of destructing the device explicitly. Signed-off-by: Wei Wang <wei.w.wang@intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20230207123713.3905-2-wei.w.wang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
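The idea of folding the destructor call into unregistration can be illustrated with a small userspace sketch. Everything here (`io_dev_sketch`, `io_bus_sketch`, `bus_register()`, `bus_unregister()`, the fixed-size slot array) is a hypothetical stand-in and not KVM's actual data structures; it only shows why callers can no longer forget the second call.

```c
#include <stddef.h>

struct io_dev_sketch;

struct io_dev_ops_sketch {
	void (*destructor)(struct io_dev_sketch *dev);	/* optional hook */
};

struct io_dev_sketch {
	const struct io_dev_ops_sketch *ops;
	int destroyed;	/* bookkeeping for this example only */
};

#define BUS_SLOTS 4

struct io_bus_sketch {
	struct io_dev_sketch *devs[BUS_SLOTS];
};

/* Put the device into the first free slot; returns -1 if the bus is full. */
static int bus_register(struct io_bus_sketch *bus, struct io_dev_sketch *dev)
{
	for (size_t i = 0; i < BUS_SLOTS; i++) {
		if (!bus->devs[i]) {
			bus->devs[i] = dev;
			return 0;
		}
	}
	return -1;
}

/* Unregister and destruct in one step; returns -1 if dev was not registered. */
static int bus_unregister(struct io_bus_sketch *bus, struct io_dev_sketch *dev)
{
	for (size_t i = 0; i < BUS_SLOTS; i++) {
		if (bus->devs[i] != dev)
			continue;
		bus->devs[i] = NULL;
		if (dev->ops && dev->ops->destructor)
			dev->ops->destructor(dev);
		return 0;
	}
	return -1;
}

/* Example destructor that only counts how often it ran. */
static void count_destroy(struct io_dev_sketch *dev)
{
	dev->destroyed++;
}
```

With this shape, a device whose ops provide no destructor is simply removed from the bus, which matches the note in the follow-up commit that only some users implement one.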
2023-06-09mm/gup: remove vmas parameter from get_user_pages_remote()Lorenzo Stoakes1-2/+1
The only instances of get_user_pages_remote() invocations which used the vmas parameter were for a single page which can instead simply look up the VMA directly. In particular:- - __update_ref_ctr() looked up the VMA but did nothing with it so we simply remove it. - __access_remote_vm() was already using vma_lookup() when the original lookup failed so by doing the lookup directly this also de-duplicates the code. We are able to perform these VMA operations as we already hold the mmap_lock in order to be able to call get_user_pages_remote(). As part of this work we add get_user_page_vma_remote() which abstracts the VMA lookup, error handling and decrementing the page reference count should the VMA lookup fail. This forms part of a broader set of patches intended to eliminate the vmas parameter altogether. [akpm@linux-foundation.org: avoid passing NULL to PTR_ERR] Link: https://lkml.kernel.org/r/d20128c849ecdbf4dd01cc828fcec32127ed939a.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> (for arm64) Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> (for s390) Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Christian König <christian.koenig@amd.com> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-09mm/gup: remove unused vmas parameter from get_user_pages()Lorenzo Stoakes1-1/+1
Patch series "remove the vmas parameter from GUP APIs", v6. (pin_/get)_user_pages[_remote]() each provide an optional output parameter for an array of VMA objects associated with each page in the input range. These provide the means for VMAs to be returned, as long as mm->mmap_lock is never released during the GUP operation (i.e. the internal flag FOLL_UNLOCKABLE is not specified). In addition, these VMAs can only be accessed with the mmap_lock held and become invalidated the moment it is released. The vast majority of invocations do not use this functionality and of those that do, all but one case retrieve a single VMA to perform checks upon. It is not egregious in the single VMA cases to simply replace the operation with a vma_lookup(). In these cases we duplicate the (fast) lookup on a slow path already under the mmap_lock, abstracted to a new get_user_page_vma_remote() inline helper function which also performs error checking and reference count maintenance. The special case is io_uring, where io_pin_pages() specifically needs to assert that the VMAs underlying the range do not result in broken long-term GUP file-backed mappings. As GUP now internally asserts that FOLL_LONGTERM mappings are not file-backed in a broken fashion (i.e. requiring dirty tracking) - as implemented in "mm/gup: disallow FOLL_LONGTERM GUP-nonfast writing to file-backed mappings" - this logic is no longer required and so we can simply remove it altogether from io_uring. Eliminating the vmas parameter eliminates an entire class of dangling pointer errors that might have occurred should the lock have been incorrectly released. In addition, the API is simplified and now clearly expresses what it is intended for - applying the specified GUP flags and (if pinning) returning pinned pages. This change additionally opens the door to further potential improvements in GUP and the possible marrying of disparate code paths. I have run this series against gup_test with no issues. 
Thanks to Matthew Wilcox for suggesting this refactoring! This patch (of 6): No invocation of get_user_pages() uses the vmas parameter, so remove it. The GUP API is confusing and caveated. Recent changes have done much to improve that, however there is more we can do. Exporting vmas is a prime target as the caller has to be extremely careful to preclude their use after the mmap_lock has expired or otherwise be left with dangling pointers. Removing the vmas parameter focuses the GUP functions upon their primary purpose - pinning (and outputting) pages as well as performing the actions implied by the input flags. This is part of a patch series aiming to remove the vmas parameter altogether. Link: https://lkml.kernel.org/r/cover.1684350871.git.lstoakes@gmail.com Link: https://lkml.kernel.org/r/589e0c64794668ffc799651e8d85e703262b1e9d.1684350871.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: Christian König <christian.koenig@amd.com> (for radeon parts) Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Sean Christopherson <seanjc@google.com> (KVM) Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Janosch Frank <frankja@linux.ibm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-06-06KVM: Clean up kvm_vm_ioctl_create_vcpu()Michal Luczaj1-1/+1
Since c9d601548603 ("KVM: allow KVM_BUG/KVM_BUG_ON to handle 64-bit cond") 'cond' is internally converted to boolean, so caller's explicit conversion from void* is unnecessary. Remove the double bang. Signed-off-by: Michal Luczaj <mhal@rbox.co> Reviewed-by: Yuan Yao <yuan.yao@intel.com> base-commit: 76a17bf03a268bc342e08c05d8ddbe607d294eb4 Link: https://lore.kernel.org/r/20230605114852.288964-1-mhal@rbox.co Signed-off-by: Sean Christopherson <seanjc@google.com>
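Why a 64-bit condition needs a boolean conversion can be shown with a short userspace sketch. It assumes, for illustration only, that a macro stores its condition in a 32-bit integer before testing it; `cond_via_u32()` and `cond_via_bool()` are hypothetical helpers, not the KVM_BUG_ON() implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models a check that squeezes the condition into 32 bits before testing it. */
static inline bool cond_via_u32(uint64_t cond)
{
	uint32_t truncated = (uint32_t)cond;	/* upper 32 bits are discarded */

	return truncated != 0;
}

/* Models a check that converts the condition to bool first. */
static inline bool cond_via_bool(uint64_t cond)
{
	bool b = cond;	/* any non-zero value becomes true */

	return b;
}
```

A value such as `1ULL << 32`, or a pointer whose low 32 bits happen to be zero, is non-zero yet looks false after truncation. Once the conversion to bool happens inside the macro, a caller-side `!!` adds nothing, which is what the cleanup above removes.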
2023-06-06KVM: x86: Use standard mmu_notifier invalidate hooks for APIC access pageSean Christopherson1-18/+0
Now that KVM honors past and in-progress mmu_notifier invalidations when reloading the APIC-access page, use KVM's "standard" invalidation hooks to trigger a reload and delete the one-off usage of invalidate_range(). Aside from eliminating one-off code in KVM, dropping KVM's use of invalidate_range() will allow common mmu_notifier to redefine the API to be more strictly focused on invalidating secondary TLBs that share the primary MMU's page tables. Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20230602011518.787006-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-06-01KVM: Don't kfree(NULL) on kzalloc() failure in kvm_assign_ioeventfd_idx()Michal Luczaj1-1/+1
On kzalloc() failure, taking the `goto fail` path leads to kfree(NULL). Such a no-op serves no purpose, so move the kfree() out of that path. Signed-off-by: Michal Luczaj <mhal@rbox.co> Reviewed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Link: https://lore.kernel.org/r/20230327175457.735903-1-mhal@rbox.co Signed-off-by: Sean Christopherson <seanjc@google.com>
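The general shape of such a cleanup, where an allocation failure jumps past the free, might look like the following userspace sketch. It is not the actual patch: `assign_entry_sketch()`, `ctx_put_sketch()` and the label names are hypothetical, and calloc()/free() stand in for kzalloc()/kfree().

```c
#include <errno.h>
#include <stdlib.h>

struct entry_sketch { int key; };

/* Counts releases of a (hypothetical) resource taken before the allocation. */
static int ctx_put_count;

static void ctx_put_sketch(void)
{
	ctx_put_count++;
}

static int assign_entry_sketch(int key, struct entry_sketch **out)
{
	struct entry_sketch *p;
	int ret;

	p = calloc(1, sizeof(*p));
	if (!p) {
		ret = -ENOMEM;
		goto fail;		/* nothing allocated: skip the free */
	}

	if (key < 0) {			/* stand-in for a later validation failure */
		ret = -EINVAL;
		goto free_fail;
	}

	p->key = key;
	*out = p;
	return 0;

free_fail:
	free(p);			/* only reached with a valid allocation */
fail:
	ctx_put_sketch();		/* cleanup shared by every failure path */
	return ret;
}
```

Ordering the labels so that later failures fall through the earlier cleanup keeps each resource released exactly once, without relying on free(NULL) being harmless.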
2023-05-26KVM: Protect vcpu->pid dereference via debugfs with RCUSean Christopherson1-1/+4
Wrap the vcpu->pid dereference in the debugfs hook vcpu_get_pid() with proper RCU read (un)lock. Unlike the code in kvm_vcpu_ioctl(), vcpu_get_pid() is not a simple access; the pid pointer is passed to pid_nr() and fully dereferenced if the pointer is non-NULL. Failure to acquire RCU could result in use-after-free of the old pid if a different task invokes KVM_RUN and puts the last reference to the old vcpu->pid between vcpu_get_pid() reading the pointer and dereferencing it in pid_nr(). Fixes: e36de87d34a7 ("KVM: debugfs: expose pid of vcpu threads") Link: https://lore.kernel.org/r/20230211010719.982919-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-05-19KVM: Fix vcpu_array[0] racesMichal Luczaj1-6/+10
In kvm_vm_ioctl_create_vcpu(), add vcpu to vcpu_array iff it's safe to access vcpu via kvm_get_vcpu() and kvm_for_each_vcpu(), i.e. when there's no failure path requiring vcpu removal and destruction. Such order is important because vcpu_array accessors may end up referencing vcpu at vcpu_array[0] even before online_vcpus is set to 1. When online_vcpus=0, any call to kvm_get_vcpu() goes through array_index_nospec() and ends with an attempt to xa_load(vcpu_array, 0): int num_vcpus = atomic_read(&kvm->online_vcpus); i = array_index_nospec(i, num_vcpus); return xa_load(&kvm->vcpu_array, i); Similarly, when online_vcpus=0, a kvm_for_each_vcpu() does not iterate over an "empty" range, but actually [0, ULONG_MAX]: xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0, \ (atomic_read(&kvm->online_vcpus) - 1)) In both cases, such online_vcpus=0 edge case, even if leading to unnecessary calls to XArray API, should not be an issue; requesting unpopulated indexes/ranges is handled by xa_load() and xa_for_each_range(). However, this means that when the first vCPU is created and inserted in vcpu_array *and* before online_vcpus is incremented, code calling kvm_get_vcpu()/kvm_for_each_vcpu() already has access to that first vCPU. This should not pose a problem assuming that once a vcpu is stored in vcpu_array, it will remain there, but that's not the case: kvm_vm_ioctl_create_vcpu() first inserts to vcpu_array, then requests a file descriptor. 
If create_vcpu_fd() fails, newly inserted vcpu is removed from the vcpu_array, then destroyed: vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus); r = xa_insert(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL_ACCOUNT); kvm_get_kvm(kvm); r = create_vcpu_fd(vcpu); if (r < 0) { xa_erase(&kvm->vcpu_array, vcpu->vcpu_idx); kvm_put_kvm_no_destroy(kvm); goto unlock_vcpu_destroy; } atomic_inc(&kvm->online_vcpus); This results in a possible race condition when a reference to a vcpu is acquired (via kvm_get_vcpu() or kvm_for_each_vcpu()) moments before said vcpu is destroyed. Signed-off-by: Michal Luczaj <mhal@rbox.co> Message-Id: <20230510140410.1093987-2-mhal@rbox.co> Cc: stable@vger.kernel.org Fixes: c5b077549136 ("KVM: Convert the kvm->vcpus array to a xarray", 2021-12-08) Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
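The "[0, ULONG_MAX]" range mentioned above follows from ordinary C integer conversion, which a tiny userspace sketch can demonstrate. `range_last_index()` and `range_is_unbounded()` are hypothetical helpers that only model the `online_vcpus - 1` upper bound being passed where an unsigned long is expected.

```c
#include <limits.h>
#include <stdbool.h>

/* Upper bound that a "0 .. online - 1" walk would use for a given count. */
static unsigned long range_last_index(int online_vcpus)
{
	/* int -1 converts to the largest unsigned long value */
	return (unsigned long)(online_vcpus - 1);
}

/* True when the computed bound covers every possible index. */
static bool range_is_unbounded(int online_vcpus)
{
	return range_last_index(online_vcpus) == ULONG_MAX;
}
```

With a count of zero the bound wraps, so an entry stored at index 0 is visible to a range walk even though no vCPU is counted as online yet. That is why the insertion has to come after every step that can still fail.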
2023-05-19KVM: Don't enable hardware after a restart/shutdown is initiatedSean Christopherson1-1/+16
Reject hardware enabling, i.e. VM creation, if a restart/shutdown has been initiated to avoid re-enabling hardware between kvm_reboot() and machine_{halt,power_off,restart}(). The restart case is especially problematic (for x86) as enabling VMX (or clearing GIF in KVM_RUN on SVM) blocks INIT, which results in the restart/reboot hanging as BIOS is unable to wake and rendezvous with APs. Note, this bug, and the original issue that motivated the addition of kvm_reboot(), is effectively limited to a forced reboot, e.g. `reboot -f`. In a "normal" reboot, userspace will gracefully teardown userspace before triggering the kernel reboot (modulo bugs, errors, etc), i.e. any process that might do ioctl(KVM_CREATE_VM) is long gone. Fixes: 8e1c18157d87 ("KVM: VMX: Disable VMX when system shutdown") Signed-off-by: Sean Christopherson <seanjc@google.com> Acked-by: Marc Zyngier <maz@kernel.org> Message-Id: <20230512233127.804012-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
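The gating described above can be sketched as a flag that the shutdown hook sets and that every enable request checks first. This is a simplified userspace model under stated assumptions: the names are hypothetical, there is no locking, and the `-EBUSY` return value is a choice made for this sketch rather than something the commit message states.

```c
#include <errno.h>
#include <stdbool.h>

static bool shutdown_started_sketch;	/* set once restart/shutdown begins */
static int hw_users_sketch;		/* number of active enable requests */

/* Models the shutdown hook: mark the state and drop all hardware usage. */
static void shutdown_hook_sketch(void)
{
	shutdown_started_sketch = true;
	hw_users_sketch = 0;
}

/* Models an enable request, e.g. one triggered by VM creation. */
static int hw_enable_sketch(void)
{
	if (shutdown_started_sketch)
		return -EBUSY;		/* refuse to re-enable after shutdown began */

	hw_users_sketch++;
	return 0;
}
```

Once the hook has run, later enable attempts fail instead of turning the hardware back on, which is the window between kvm_reboot() and machine_{halt,power_off,restart}() that the commit closes.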
2023-05-19KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdownSean Christopherson1-15/+11
Use syscore_ops.shutdown to disable hardware virtualization during a reboot instead of using the dedicated reboot_notifier so that KVM disables virtualization _after_ system_state has been updated. This will allow fixing a race in KVM's handling of a forced reboot where KVM can end up enabling hardware virtualization between kernel_restart_prepare() and machine_restart(). Rename KVM's hook to match the syscore op to avoid any possible confusion from wiring up a "reboot" helper to a "shutdown" hook (neither "shutdown" nor "reboot" is completely accurate as the hook handles both). Opportunistically rewrite kvm_shutdown()'s comment to make it less VMX specific, and to explain why kvm_rebooting exists. Cc: Marc Zyngier <maz@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: James Morse <james.morse@arm.com> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Zenghui Yu <yuzenghui@huawei.com> Cc: kvmarm@lists.linux.dev Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Aleksandar Markovic <aleksandar.qemu.devel@gmail.com> Cc: Anup Patel <anup@brainfault.org> Cc: Atish Patra <atishp@atishpatra.org> Cc: kvm-riscv@lists.infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com> Acked-by: Marc Zyngier <maz@kernel.org> Message-Id: <20230512233127.804012-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-05-16KVM: arm64: Export kvm_are_all_memslots_empty()Ricardo Koller1-1/+2
Export kvm_are_all_memslots_empty(). This will be used by a future commit when checking before setting a capability. Signed-off-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Link: https://lore.kernel.org/r/20230426172330.1439644-5-ricarkol@google.com Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
2023-05-01Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-16/+14
Pull kvm updates from Paolo Bonzini: "s390: - More phys_to_virt conversions - Improvement of AP management for VSIE (nested virtualization) ARM64: - Numerous fixes for the pathological lock inversion issue that plagued KVM/arm64 since... forever. - New framework allowing SMCCC-compliant hypercalls to be forwarded to userspace, hopefully paving the way for some more features being moved to VMMs rather than be implemented in the kernel. - Large rework of the timer code to allow a VM-wide offset to be applied to both virtual and physical counters as well as a per-timer, per-vcpu offset that complements the global one. This last part allows the NV timer code to be implemented on top. - A small set of fixes to make sure that we don't change anything affecting the EL1&0 translation regime just after having taken an exception to EL2 until we have executed a DSB. This ensures that speculative walks started in EL1&0 have completed. - The usual selftest fixes and improvements. x86: - Optimize CR0.WP toggling by avoiding an MMU reload when TDP is enabled, and by giving the guest control of CR0.WP when EPT is enabled on VMX (VMX-only because SVM doesn't support per-bit controls) - Add CR0/CR4 helpers to query single bits, and clean up related code where KVM was interpreting kvm_read_cr4_bits()'s "unsigned long" return as a bool - Move AMD_PSFD to cpufeatures.h and purge KVM's definition - Avoid unnecessary writes+flushes when the guest is only adding new PTEs - Overhaul .sync_page() and .invlpg() to utilize .sync_page()'s optimizations when emulating invalidations - Clean up the range-based flushing APIs - Revamp the TDP MMU's reaping of Accessed/Dirty bits to clear a single A/D bit using a LOCK AND instead of XCHG, and skip all of the "handle changed SPTE" overhead associated with writing the entire entry - Track the number of "tail" entries in a pte_list_desc to avoid having to walk (potentially) all descriptors during insertion and deletion, which gets quite 
expensive if the guest is spamming fork() - Disallow virtualizing legacy LBRs if architectural LBRs are available, the two are mutually exclusive in hardware - Disallow writes to immutable feature MSRs (notably PERF_CAPABILITIES) after KVM_RUN, similar to CPUID features - Overhaul the vmx_pmu_caps selftest to better validate PERF_CAPABILITIES - Apply PMU filters to emulated events and add test coverage to the pmu_event_filter selftest - AMD SVM: - Add support for virtual NMIs - Fixes for edge cases related to virtual interrupts - Intel AMX: - Don't advertise XTILE_CFG in KVM_GET_SUPPORTED_CPUID if XTILE_DATA is not being reported due to userspace not opting in via prctl() - Fix a bug in emulation of ENCLS in compatibility mode - Allow emulation of NOP and PAUSE for L2 - AMX selftests improvements - Misc cleanups MIPS: - Constify MIPS's internal callbacks (a leftover from the hardware enabling rework that landed in 6.3) Generic: - Drop unnecessary casts from "void *" throughout kvm_main.c - Tweak the layout of "struct kvm_mmu_memory_cache" to shrink the struct size by 8 bytes on 64-bit kernels by utilizing a padding hole Documentation: - Fix goof introduced by the conversion to rST" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (211 commits) KVM: s390: pci: fix virtual-physical confusion on module unload/load KVM: s390: vsie: clarifications on setting the APCB KVM: s390: interrupt: fix virtual-physical confusion for next alert GISA KVM: arm64: Have kvm_psci_vcpu_on() use WRITE_ONCE() to update mp_state KVM: arm64: Acquire mp_state_lock in kvm_arch_vcpu_ioctl_vcpu_init() KVM: selftests: Test the PMU event "Instructions retired" KVM: selftests: Copy full counter values from guest in PMU event filter test KVM: selftests: Use error codes to signal errors in PMU event filter test KVM: selftests: Print detailed info in PMU event filter asserts KVM: selftests: Add helpers for PMC asserts in PMU event filter test KVM: selftests: Add a common helper for the 
PMU event filter guest code KVM: selftests: Fix spelling mistake "perrmited" -> "permitted" KVM: arm64: vhe: Drop extra isb() on guest exit KVM: arm64: vhe: Synchronise with page table walker on MMU update KVM: arm64: pkvm: Document the side effects of kvm_flush_dcache_to_poc() KVM: arm64: nvhe: Synchronise with page table walker on TLBI KVM: arm64: Handle 32bit CNTPCTSS traps KVM: arm64: nvhe: Synchronise with page table walker on vcpu run KVM: arm64: vgic: Don't acquire its_lock before config_lock KVM: selftests: Add test to verify KVM's supported XCR0 ...
2023-04-28Merge tag 'smp-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-0/+3
Pull SMP cross-CPU function-call updates from Ingo Molnar: - Remove diagnostics and adjust config for CSD lock diagnostics - Add a generic IPI-sending tracepoint, as currently there's no easy way to instrument IPI origins: it's arch dependent and for some major architectures it's not even consistently available. * tag 'smp-core-2023-04-27' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: trace,smp: Trace all smp_function_call*() invocations trace: Add trace_ipi_send_cpu() sched, smp: Trace smp callback causing an IPI smp: reword smp call IPI comment treewide: Trace IPIs sent via smp_send_reschedule() irq_work: Trace self-IPIs sent via arch_irq_work_raise() smp: Trace IPIs sent via arch_send_call_function_ipi_mask() sched, smp: Trace IPIs sent via send_call_function_single_ipi() trace: Add trace_ipi_send_cpumask() kernel/smp: Make csdlock_debug= resettable locking/csd_lock: Remove per-CPU data indirection from CSD lock debugging locking/csd_lock: Remove added data from CSD lock debugging locking/csd_lock: Add Kconfig option for csd_debug default
2023-03-31KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependentAlexey Kardashevskiy1-1/+0
When introduced, IRQFD resampling worked on POWER8 with XICS. However KVM on POWER9 has never implemented it - the compatibility mode code ("XICS-on-XIVE") misses the kvm_notify_acked_irq() call and the native XIVE mode does not handle INTx in KVM at all. This moves the capability support advertising to platforms and stops advertising it on XIVE, i.e. POWER9 and later. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Anup Patel <anup@brainfault.org> Acked-by: Nicholas Piggin <npiggin@gmail.com> Message-Id: <20220504074807.3616813-1-aik@ozlabs.ru> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-27KVM: x86/ioapic: Resample the pending state of an IRQ when unmaskingDmytro Maluka1-6/+35
KVM irqfd based emulation of level-triggered interrupts doesn't work quite correctly in some cases, particularly in the case of interrupts that are handled in a Linux guest as oneshot interrupts (IRQF_ONESHOT). Such an interrupt is acked to the device in its threaded irq handler, i.e. later than it is acked to the interrupt controller (EOI at the end of hardirq), not earlier. Linux keeps such an interrupt masked until its threaded handler finishes, to prevent the EOI from re-asserting an unacknowledged interrupt. However, with KVM + vfio (or whatever is listening on the resamplefd) we always notify resamplefd at the EOI, so vfio prematurely unmasks the host physical IRQ, thus a new physical interrupt is fired in the host. This extra interrupt in the host is not a problem per se. The problem is that it is unconditionally queued for injection into the guest, so the guest sees an extra bogus interrupt. [*] At least 2 user-visible issues caused by those extra erroneous interrupts for a oneshot irq in the guest have been observed: 1. System suspend aborted due to a pending wakeup interrupt from ChromeOS EC (drivers/platform/chrome/cros_ec.c). 2. Annoying "invalid report id data" errors from ELAN0000 touchpad (drivers/input/mouse/elan_i2c_core.c), flooding the guest dmesg every time the touchpad is touched. The core issue here is that by the time the guest unmasks the IRQ, the physical IRQ line is no longer asserted (since the guest has acked the interrupt to the device in the meantime), yet we unconditionally inject the interrupt queued into the guest by the previous resampling. So to fix the issue, we need a way to detect that the IRQ is no longer pending, and cancel the queued interrupt in this case. With IOAPIC we are not able to probe the physical IRQ line state directly (at least not if the underlying physical interrupt controller is an IOAPIC too), so in this patch we use the irqfd resampler for that. 
Namely, instead of injecting the queued interrupt, we just notify the resampler that this interrupt is done. If the IRQ line is actually already deasserted, we are done. If it is still asserted, a new interrupt will be shortly triggered through irqfd and injected into the guest. If there is no irqfd resampler registered for this IRQ, we cannot fix the issue, so we keep the existing behavior: immediately and unconditionally inject the queued interrupt. This patch fixes the issue for x86 IOAPIC only. In the long run, we can fix it for other irqchips and other architectures too, possibly taking advantage of reading the physical state of the IRQ line, which is possible with some other irqchips (e.g. with arm64 GIC, maybe even with the legacy x86 PIC). [*] In this description we assume that the interrupt is a physical host interrupt forwarded to the guest e.g. by vfio. Potentially the same issue may occur also with a purely virtual interrupt from an emulated device, e.g. if the guest handles this interrupt, again, as a oneshot interrupt. Signed-off-by: Dmytro Maluka <dmy@semihalf.com> Link: https://lore.kernel.org/kvm/31420943-8c5f-125c-a5ee-d2fde2700083@semihalf.com/ Link: https://lore.kernel.org/lkml/87o7wrug0w.wl-maz@kernel.org/ Message-Id: <20230322204344.50138-3-dmy@semihalf.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
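The unmask decision described in this changelog can be sketched as a small state model. This is a simplified illustration, not the IOAPIC code: `struct model_irq`, `model_notify_resampler()` and `model_unmask()` are hypothetical names, and the listener's reaction (re-triggering only while the line is still asserted) is condensed into a single function call.

```c
#include <stdbool.h>

/* Hypothetical model of one level-triggered IRQ; names are illustrative. */
struct model_irq {
	bool has_resampler; /* an irqfd resampler is registered for this IRQ */
	bool queued;        /* an interrupt was queued while the IRQ was masked */
	bool line_asserted; /* physical line state, known only to the listener */
	int injected;       /* interrupts actually delivered to the guest */
};

/* Models the resamplefd listener (e.g. vfio): it re-triggers the
 * interrupt only if the physical line is still asserted. */
static void model_notify_resampler(struct model_irq *irq)
{
	if (irq->line_asserted)
		irq->injected++;
}

/* Models the guest unmasking the IRQ. */
static void model_unmask(struct model_irq *irq)
{
	if (!irq->queued)
		return;
	irq->queued = false;

	if (irq->has_resampler)
		model_notify_resampler(irq); /* let the line state decide */
	else
		irq->injected++;             /* old behavior: inject unconditionally */
}
```

With a resampler and a deasserted line, the queued interrupt is dropped. Without a resampler, the model keeps the pre-patch behavior and injects it regardless of the line state.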
2023-03-27KVM: irqfd: Make resampler_list an RCU listDmytro Maluka1-2/+6
It is useful to be able to do read-only traversal of the list of all the registered irqfd resamplers without locking the resampler_lock mutex. In particular, we are going to traverse it to search for a resampler registered for the given irq of an irqchip, and that will be done with an irqchip spinlock (ioapic->lock) held, so it is undesirable to lock a mutex in this context. So turn this list into an RCU list. For protecting the read side, reuse kvm->irq_srcu which is already used for protecting a number of irq related things (kvm->irq_routing, irqfd->resampler->list, kvm->irq_ack_notifier_list, kvm->arch.mask_notifier_list). Signed-off-by: Dmytro Maluka <dmy@semihalf.com> Message-Id: <20230322204344.50138-2-dmy@semihalf.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-03-24KVM: Fix comments that refer to the non-existent install_new_memslots()Jun Miao1-7/+7
Fix stale comments that were left behind when install_new_memslots() was replaced by kvm_swap_active_memslots() as part of the scalable memslots rework. Fixes: a54d806688fe ("KVM: Keep memslots in tree-based structures instead of array-based ones") Signed-off-by: Jun Miao <jun.miao@intel.com> Link: https://lore.kernel.org/r/20230223052851.1054799-1-jun.miao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-24treewide: Trace IPIs sent via smp_send_reschedule()Valentin Schneider1-0/+3
To be able to trace invocations of smp_send_reschedule(), rename the arch-specific definitions of it to arch_smp_send_reschedule() and wrap it into an smp_send_reschedule() that contains a tracepoint. Changes to include the declaration of the tracepoint were driven by the following coccinelle script: @func_use@ @@ smp_send_reschedule(...); @include@ @@ #include <trace/events/ipi.h> @no_include depends on func_use && !include@ @@ #include <...> + + #include <trace/events/ipi.h> [csky bits] [riscv bits] Signed-off-by: Valentin Schneider <vschneid@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Guo Ren <guoren@kernel.org> Acked-by: Palmer Dabbelt <palmer@rivosinc.com> Link: https://lore.kernel.org/r/20230307143558.294354-6-vschneid@redhat.com
2023-03-23kvm: kvm_main: Remove unnecessary (void*) conversionsLi kunyu1-7/+5
Assigning a void * pointer does not require an explicit cast. Signed-off-by: Li kunyu <kunyu@nfschina.com> Link: https://lore.kernel.org/r/20221213080236.3969-1-kunyu@nfschina.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2023-03-16KVM: Standardize on "int" return types instead of "long" in kvm_main.cThomas Huth1-2/+2
KVM functions use "long" return values for functions that are wired up to "struct file_operations", but otherwise use "int" return values for functions that can return 0/-errno in order to avoid unintentional divergences between 32-bit and 64-bit kernels. Some code still uses "long" in unnecessary spots, though, which can cause a little bit of confusion and unnecessary size casts. Let's change these spots to use "int" types, too. Signed-off-by: Thomas Huth <thuth@redhat.com> Message-Id: <20230208140105.655814-6-thuth@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-02-15Merge tag 'kvm-riscv-6.3-1' of https://github.com/kvm-riscv/linux into HEADPaolo Bonzini1-3/+3
KVM/riscv changes for 6.3 - Fix wrong usage of PGDIR_SIZE to check page sizes - Fix privilege mode setting in kvm_riscv_vcpu_trap_redirect() - Redirect illegal instruction traps to guest - SBI PMU support for guest
2023-02-01KVM: Destroy target device if coalesced MMIO unregistration failsSean Christopherson1-3/+5
Destroy and free the target coalesced MMIO device if unregistering said device fails. As clearly noted in the code, kvm_io_bus_unregister_dev() does not destroy the target device. BUG: memory leak unreferenced object 0xffff888112a54880 (size 64): comm "syz-executor.2", pid 5258, jiffies 4297861402 (age 14.129s) hex dump (first 32 bytes): 38 c7 67 15 00 c9 ff ff 38 c7 67 15 00 c9 ff ff 8.g.....8.g..... e0 c7 e1 83 ff ff ff ff 00 30 67 15 00 c9 ff ff .........0g..... backtrace: [<0000000006995a8a>] kmalloc include/linux/slab.h:556 [inline] [<0000000006995a8a>] kzalloc include/linux/slab.h:690 [inline] [<0000000006995a8a>] kvm_vm_ioctl_register_coalesced_mmio+0x8e/0x3d0 arch/x86/kvm/../../../virt/kvm/coalesced_mmio.c:150 [<00000000022550c2>] kvm_vm_ioctl+0x47d/0x1600 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3323 [<000000008a75102f>] vfs_ioctl fs/ioctl.c:46 [inline] [<000000008a75102f>] file_ioctl fs/ioctl.c:509 [inline] [<000000008a75102f>] do_vfs_ioctl+0xbab/0x1160 fs/ioctl.c:696 [<0000000080e3f669>] ksys_ioctl+0x76/0xa0 fs/ioctl.c:713 [<0000000059ef4888>] __do_sys_ioctl fs/ioctl.c:720 [inline] [<0000000059ef4888>] __se_sys_ioctl fs/ioctl.c:718 [inline] [<0000000059ef4888>] __x64_sys_ioctl+0x6f/0xb0 fs/ioctl.c:718 [<000000006444fa05>] do_syscall_64+0x9f/0x4e0 arch/x86/entry/common.c:290 [<000000009a4ed50b>] entry_SYSCALL_64_after_hwframe+0x49/0xbe BUG: leak checking failed Fixes: 5d3c4c79384a ("KVM: Stop looking for coalesced MMIO zones if the bus is destroyed") Cc: stable@vger.kernel.org Reported-by: 柳菁峰 <liujingfeng@qianxin.com> Reported-by: Michal Luczaj <mhal@rbox.co> Link: https://lore.kernel.org/r/20221219171924.67989-1-seanjc@google.com Link: https://lore.kernel.org/all/20230118220003.1239032-1-mhal@rbox.co Signed-off-by: Sean Christopherson <seanjc@google.com>
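The ownership rule behind this fix can be illustrated with a userspace model. This is a sketch under stated assumptions: `struct model_dev`, `model_bus_unregister()` and `model_unregister_zone()` are hypothetical names, the failure is injected through a parameter, and `-ENOMEM` is an assumed error code. The only property taken from the changelog is that unregistering does not destroy the target device, so the caller must destroy it whether or not unregistering succeeds.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a coalesced MMIO device. */
struct model_dev {
	int zone;
};

static int model_live_devs; /* devices allocated but not yet freed */

static struct model_dev *model_dev_create(int zone)
{
	struct model_dev *dev = calloc(1, sizeof(*dev));

	if (!dev)
		return NULL;
	dev->zone = zone;
	model_live_devs++;
	return dev;
}

static void model_dev_destroy(struct model_dev *dev)
{
	free(dev);
	model_live_devs--;
}

/* Models removing the device from the bus: it may fail, and it never
 * frees the device it was asked to remove. */
static int model_bus_unregister(struct model_dev *dev, bool fail)
{
	(void)dev;
	return fail ? -ENOMEM : 0; /* assumed error code */
}

/* Caller-side pattern: destroy the device regardless of the result. */
static int model_unregister_zone(struct model_dev *dev, bool fail)
{
	int r = model_bus_unregister(dev, fail);

	model_dev_destroy(dev); /* skipping this on failure would leak dev */
	return r;
}
```

Destroying the device only on success would leave `model_live_devs` non-zero after a failed unregistration, which corresponds to the leak shown in the report above.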
2023-01-24Merge branch 'kvm-v6.2-rc4-fixes' into HEADPaolo Bonzini1-0/+7
ARM: * Fix the PMCR_EL0 reset value after the PMU rework * Correctly handle S2 fault triggered by a S1 page table walk by not always classifying it as a write, as this breaks on R/O memslots * Document why we cannot exit with KVM_EXIT_MMIO when taking a write fault from a S1 PTW on a R/O memslot * Put the Apple M2 on the naughty list for not being able to correctly implement the vgic SEIS feature, just like the M1 before it * Reviewer updates: Alex is stepping down, replaced by Zenghui x86: * Fix various rare locking issues in Xen emulation and teach lockdep to detect them * Documentation improvements * Do not return host topology information from KVM_GET_SUPPORTED_CPUID
2023-01-23Merge tag 'vfio-v6.2-rc6' of https://github.com/awilliam/linux-vfioLinus Torvalds1-3/+3
Pull VFIO fixes from Alex Williamson: - Honor reserved regions when testing for IOMMU fine grained super page support, avoiding a regression on s390 for a firmware device where the existence of the mapping, even if unused can trigger an error state. (Niklas Schnelle) - Fix a deadlock in releasing KVM references by using the alternate .release() rather than .destroy() callback for the kvm-vfio device. (Yi Liu) * tag 'vfio-v6.2-rc6' of https://github.com/awilliam/linux-vfio: kvm/vfio: Fix potential deadlock on vfio group_lock vfio/type1: Respect IOMMU reserved regions in vfio_test_domain_fgsp()
2023-01-20kvm/vfio: Fix potential deadlock on vfio group_lockYi Liu1-3/+3
Currently it is possible that the final put of a KVM reference comes from vfio during its device close operation. This occurs while the vfio group lock is held; however, if the vfio device is still in the kvm device list, then the following call chain could result in a deadlock: VFIO holds group->group_lock/group_rwsem -> kvm_put_kvm -> kvm_destroy_vm -> kvm_destroy_devices -> kvm_vfio_destroy -> kvm_vfio_file_set_kvm -> vfio_file_set_kvm -> try to hold group->group_lock/group_rwsem The key function is kvm_destroy_devices(), which triggers the destroy cb of kvm_device_ops. It calls back to vfio and tries to take group_lock. So if this path doesn't call back to vfio, the deadlock is avoided. There is a way to do that: KVM provides another point at which to free the kvm-vfio device, namely when the device file descriptor is closed. This can be achieved by providing the release cb instead of the destroy cb. Also rename kvm_vfio_destroy() to kvm_vfio_release(). /* * Destroy is responsible for freeing dev. * * Destroy may be called before or after destructors are called * on emulated I/O regions, depending on whether a reference is * held by a vcpu or other kvm component that gets destroyed * after the emulated I/O. */ void (*destroy)(struct kvm_device *dev); /* * Release is an alternative method to free the device. It is * called when the device file descriptor is closed. Once * release is called, the destroy method will not be called * anymore as the device is removed from the device list of * the VM. kvm->lock is held. 
*/ void (*release)(struct kvm_device *dev); Fixes: 421cfe6596f6 ("vfio: remove VFIO_GROUP_NOTIFY_SET_KVM") Reported-by: Alex Williamson <alex.williamson@redhat.com> Suggested-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yi Liu <yi.l.liu@intel.com> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com> Link: https://lore.kernel.org/r/20230114000351.115444-1-mjrosato@linux.ibm.com Link: https://lore.kernel.org/r/20230120150528.471752-1-yi.l.liu@intel.com [aw: update comment as well, s/destroy/release/] Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2023-01-11KVM: Ensure lockdep knows about kvm->lock vs. vcpu->mutex ordering ruleDavid Woodhouse1-0/+7
Documentation/virt/kvm/locking.rst tells us that kvm->lock is taken outside vcpu->mutex. But that doesn't actually happen very often; it's only in some esoteric cases like migration with AMD SEV. This means that lockdep usually doesn't notice, and doesn't do its job of keeping us honest. Ensure that lockdep *always* knows about the ordering of these two locks, by briefly taking vcpu->mutex in kvm_vm_ioctl_create_vcpu() while kvm->lock is held. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20230111180651.14394-3-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Clean up error labels in kvm_init()Sean Christopherson1-4/+4
Convert the last two "out" labels to "err" labels now that the dust has settled, i.e. now that there are no more planned changes to the order of things in kvm_init(). Use "err" instead of "out" as it's easier to describe what failed than it is to describe what needs to be unwound, e.g. if allocating a per-CPU kick mask fails, KVM needs to free any masks that were allocated, and of course needs to unwind previous operations. Reported-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-51-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Opt out of generic hardware enabling on s390 and PPCSean Christopherson2-6/+27
Allow architectures to opt out of the generic hardware enabling logic, and opt out on both s390 and PPC, which don't need to manually enable virtualization as it's always on (when available). In addition to letting s390 and PPC drop a bit of dead code, this will hopefully also allow ARM to clean up its related code, e.g. ARM has its own per-CPU flag to track which CPUs have enabled hardware due to the need to keep hardware enabled indefinitely when pKVM is enabled. Signed-off-by: Sean Christopherson <seanjc@google.com> Acked-by: Anup Patel <anup@brainfault.org> Message-Id: <20221130230934.1014142-50-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Register syscore (suspend/resume) ops early in kvm_init()Sean Christopherson1-34/+34
Register the suspend/resume notifier hooks at the same time KVM registers its reboot notifier so that all the code in kvm_init() that deals with enabling/disabling hardware is bundled together. Opportunistically move KVM's implementations to reside near the reboot notifier code for the same reason. Bunching the code together will allow architectures to opt out of KVM's generic hardware enable/disable logic with minimal #ifdeffery. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-49-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Make hardware_enable_failed a local variable in the "enable all" pathIsaku Yamahata1-19/+16
Rework detecting hardware enabling errors to use a local variable in the "enable all" path to track whether or not enabling was successful across all CPUs. Using a global variable complicates paths that enable hardware only on the current CPU, e.g. kvm_resume() and kvm_online_cpu(). Opportunistically add a WARN if hardware enabling fails during kvm_resume(), KVM is all kinds of hosed if CPU0 fails to enable hardware. The WARN is largely futile in the current code, as KVM BUG()s on spurious faults on VMX instructions, e.g. attempting to run a vCPU on CPU if hardware enabling fails will explode. ------------[ cut here ]------------ kernel BUG at arch/x86/kvm/x86.c:508! invalid opcode: 0000 [#1] SMP CPU: 3 PID: 1009 Comm: CPU 4/KVM Not tainted 6.1.0-rc1+ #11 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_spurious_fault+0xa/0x10 Call Trace: vmx_vcpu_load_vmcs+0x192/0x230 [kvm_intel] vmx_vcpu_load+0x16/0x60 [kvm_intel] kvm_arch_vcpu_load+0x32/0x1f0 vcpu_load+0x2f/0x40 kvm_arch_vcpu_ioctl_run+0x19/0x9d0 kvm_vcpu_ioctl+0x271/0x660 __x64_sys_ioctl+0x80/0xb0 do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 But, the WARN may provide a breadcrumb to understand what went awry, and someday KVM may fix one or both of those bugs, e.g. by finding a way to eat spurious faults no matter the context (easier said than done due to side effects of certain operations, e.g. Intel's VMCLEAR). Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [sean: rebase, WARN on failure in kvm_resume()] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-48-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Use a per-CPU variable to track which CPUs have enabled virtualizationSean Christopherson1-23/+18
Use a per-CPU variable instead of a shared bitmap to track which CPUs have successfully enabled virtualization hardware. Using a per-CPU bool avoids the need for an additional allocation, and arguably yields easier to read code. Using a bitmap would be advantageous if KVM used it to avoid generating IPIs to CPUs that failed to enable hardware, but that's an extreme edge case and not worth optimizing, and the low level helpers would still want to keep their individual checks as attempting to enable virtualization hardware when it's already enabled can be problematic, e.g. Intel's VMXON will fault. Opportunistically change the order in hardware_enable_nolock() to set the flag if and only if hardware enabling is successful, instead of speculatively setting the flag and then clearing it on failure. Add a comment explaining that the check in hardware_disable_nolock() isn't simply paranoia. Waaay back when, commit 1b6c016818a5 ("KVM: Keep track of which cpus have virtualization enabled"), added the logic as a guard against CPU hotplug racing with hardware enable/disable. Now that KVM has eliminated the race by taking cpu_hotplug_lock for read (via cpus_read_lock()) when enabling or disabling hardware, at first glance it appears that the check is now superfluous, i.e. it's tempting to remove the per-CPU flag entirely... Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-47-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
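The "set the flag if and only if enabling succeeds" ordering can be sketched in userspace with a plain array standing in for the per-CPU variable. Everything here is illustrative: `MODEL_NR_CPUS`, `model_hw_enabled`, `model_hw_broken`, `model_arch_enable()` and the `-EIO` error code are assumptions, not kernel names or values.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4 /* arbitrary size for the model */

static bool model_hw_enabled[MODEL_NR_CPUS]; /* stands in for the per-CPU flag */
static bool model_hw_broken[MODEL_NR_CPUS];  /* test knob: enabling fails here */
static int model_disable_calls;              /* counts real disable operations */

/* Models the arch hook that actually turns virtualization on. */
static int model_arch_enable(int cpu)
{
	return model_hw_broken[cpu] ? -EIO : 0; /* assumed error code */
}

static int model_enable_cpu(int cpu)
{
	int r;

	/* Enabling twice can be harmful on real hardware, so bail early. */
	if (model_hw_enabled[cpu])
		return 0;

	r = model_arch_enable(cpu);
	if (r)
		return r; /* flag was never set, so nothing to undo */

	model_hw_enabled[cpu] = true; /* set only after success */
	return 0;
}

static void model_disable_cpu(int cpu)
{
	/* Skip CPUs that never enabled hardware, e.g. after a failure. */
	if (!model_hw_enabled[cpu])
		return;

	model_disable_calls++;
	model_hw_enabled[cpu] = false;
}
```

Because the flag is written only after a successful enable, the disable path can rely on it to tell whether there is anything to undo on a given CPU.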
2022-12-29KVM: Remove on_each_cpu(hardware_disable_nolock) in kvm_exit()Isaku Yamahata1-1/+0
Drop the superfluous invocation of hardware_disable_nolock() during kvm_exit(), as it's nothing more than a glorified nop. KVM automatically disables hardware on all CPUs when the last VM is destroyed, and kvm_exit() cannot be called until the last VM goes away as the calling module is pinned by an elevated refcount of the fops associated with /dev/kvm. This holds true even on x86, where the caller of kvm_exit() is not kvm.ko, but is instead a dependent module, kvm_amd.ko or kvm_intel.ko, as kvm_chardev_ops.owner is set to the module that calls kvm_init(), not hardcoded to the base kvm.ko module. Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> [sean: rework changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-46-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lockIsaku Yamahata1-12/+24
Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock now that KVM hooks CPU hotplug during the ONLINE phase, which can sleep. Previously, KVM hooked the STARTING phase, which is not allowed to sleep and thus could not take kvm_lock (a mutex). This effectively allows the task that's initiating hardware enabling/disabling to be preempted and/or migrated. Note, the Documentation/virt/kvm/locking.rst statement that kvm_count_lock is "raw" because hardware enabling/disabling needs to be atomic with respect to migration is wrong on multiple fronts. First, while regular spinlocks can be preempted, the task holding the lock cannot be migrated. Second, preventing migration is not required. on_each_cpu() disables preemption, which ensures that cpus_hardware_enabled correctly reflects hardware state. The task may be preempted/migrated between bumping kvm_usage_count and invoking on_each_cpu(), but that's perfectly ok as kvm_usage_count is still protected, e.g. other tasks that call hardware_enable_all() will be blocked until the preempted/migrated owner exits its critical section. KVM does have lockless accesses to kvm_usage_count in the suspend/resume flows, but those are safe because all tasks must be frozen prior to suspending CPUs, and a task cannot be frozen while it holds one or more locks (userspace tasks are frozen via a fake signal). Preemption doesn't need to be explicitly disabled in the hotplug path. The hotplug thread is pinned to the CPU that's being hotplugged, and KVM only cares about having a stable CPU, i.e. to ensure hardware is enabled on the correct CPU. Lockdep, i.e. check_preemption_disabled(), plays nice with this state too, as is_percpu_thread() is true for the hotplug thread. 
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-45-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Ensure CPU is stable during low level hardware enable/disableSean Christopherson1-2/+2
Use the non-raw smp_processor_id() in the low level hardware enable/disable helpers as KVM absolutely relies on the CPU being stable, e.g. KVM would end up with incorrect state if the task were migrated between accessing cpus_hardware_enabled and actually enabling/disabling hardware. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-44-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Disable CPU hotplug during hardware enabling/disablingChao Gao1-0/+12
Disable CPU hotplug when enabling/disabling hardware to prevent the corner case where, if the following sequence occurs: 1. A hotplugged CPU marks itself online in cpu_online_mask 2. The hotplugged CPU enables interrupts before invoking KVM's ONLINE callback 3. hardware_{en,dis}able_all() is invoked on another CPU the hotplugged CPU will be included in on_each_cpu() and thus get sent through hardware_{en,dis}able_nolock() before kvm_online_cpu() is called. start_secondary { ... set_cpu_online(smp_processor_id(), true); <- 1 ... local_irq_enable(); <- 2 ... cpu_startup_entry(CPUHP_AP_ONLINE_IDLE); <- 3 } KVM currently fudges around this race by keeping track of which CPUs have done hardware enabling (see commit 1b6c016818a5 "KVM: Keep track of which cpus have virtualization enabled"), but that's an inefficient, convoluted, and hacky solution. Signed-off-by: Chao Gao <chao.gao@intel.com> [sean: split to separate patch, write changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-43-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Rename and move CPUHP_AP_KVM_STARTING to ONLINE sectionChao Gao1-8/+22
The CPU STARTING section doesn't allow callbacks to fail. Move KVM's hotplug callback to ONLINE section so that it can abort onlining a CPU in certain cases to avoid potentially breaking VMs running on existing CPUs. For example, when KVM fails to enable hardware virtualization on the hotplugged CPU. Place KVM's hotplug state before CPUHP_AP_SCHED_WAIT_EMPTY as it ensures when offlining a CPU, all user tasks and non-pinned kernel tasks have left the CPU, i.e. there cannot be a vCPU task around. So, it is safe for KVM's CPU offline callback to disable hardware virtualization at that point. Likewise, KVM's online callback can enable hardware virtualization before any vCPU task gets a chance to run on hotplugged CPUs. Drop kvm_x86_check_processor_compatibility()'s WARN that IRQs are disabled, as the ONLINE section runs with IRQs disabled. The WARN wasn't intended to be a requirement, e.g. disabling preemption is sufficient, the IRQ thing was purely an aggressive sanity check since the helper was only ever invoked via SMP function call. Rename KVM's CPU hotplug callbacks accordingly. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> [sean: drop WARN that IRQs are disabled] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-42-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Drop kvm_arch_check_processor_compat() hookSean Christopherson1-23/+1
Drop kvm_arch_check_processor_compat() and its support code now that all architecture implementations are nops. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390 Acked-by: Anup Patel <anup@brainfault.org> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-Id: <20221130230934.1014142-33-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Drop kvm_arch_{init,exit}() hooksSean Christopherson1-17/+2
Drop kvm_arch_init() and kvm_arch_exit() now that all implementations are nops. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390 Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Acked-by: Anup Patel <anup@brainfault.org> Message-Id: <20221130230934.1014142-30-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Drop arch hardware (un)setup hooksSean Christopherson1-7/+0
Drop kvm_arch_hardware_setup() and kvm_arch_hardware_unsetup() now that all implementations are nops. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> # s390 Acked-by: Anup Patel <anup@brainfault.org> Message-Id: <20221130230934.1014142-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Teardown VFIO ops earlier in kvm_exit()Sean Christopherson1-1/+1
Move the call to kvm_vfio_ops_exit() further up kvm_exit() to try and bring some amount of symmetry to the setup order in kvm_init(), and more importantly so that the arch hooks are invoked dead last by kvm_exit(). This will allow arch code to move away from the arch hooks without any change in ordering between arch code and common code in kvm_exit(). That kvm_vfio_ops_exit() is called last appears to be 100% arbitrary. It was bolted on after the fact by commit 571ee1b68598 ("kvm: vfio: fix unregister kvm_device_ops of vfio"). The nullified kvm_device_ops_table is also local to kvm_main.c and is used only when there are active VMs, so unless arch code is doing something truly bizarre, nullifying the table earlier in kvm_exit() is little more than a nop. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Eric Farman <farman@linux.ibm.com> Message-Id: <20221130230934.1014142-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Allocate cpus_hardware_enabled after arch hardware setupSean Christopherson1-7/+7
Allocate cpus_hardware_enabled after arch hardware setup so that arch "init" and "hardware setup" are called back-to-back and thus can be combined in a future patch. cpus_hardware_enabled is never used before kvm_create_vm(), i.e. doesn't have a dependency with hardware setup and only needs to be allocated before /dev/kvm is exposed to userspace. Free the object before the arch hooks are invoked to maintain symmetry, and so that arch code can move away from the hooks without having to worry about ordering changes. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Message-Id: <20221130230934.1014142-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Initialize IRQ FD after arch hardware setupSean Christopherson1-18/+17
Move initialization of KVM's IRQ FD workqueue below arch hardware setup as a step towards consolidating arch "init" and "hardware setup", and eventually towards dropping the hooks entirely. There is no dependency on the workqueue being created before hardware setup, the workqueue is used only when destroying VMs, i.e. only needs to be created before /dev/kvm is exposed to userspace. Move the destruction of the workqueue before the arch hooks to maintain symmetry, and so that arch code can move away from the hooks without having to worry about ordering changes. Reword the comment about kvm_irqfd_init() needing to come after kvm_arch_init() to call out that kvm_arch_init() must come before common KVM does _anything_, as x86 very subtly relies on that behavior to deal with multiple calls to kvm_init(), e.g. if userspace attempts to load kvm_amd.ko and kvm_intel.ko. Tag the code with a FIXME, as x86's subtle requirement is gross, and invoking an arch callback as the very first action in a helper that is called only from arch code is silly. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29KVM: Register /dev/kvm as the _very_ last thing during initializationSean Christopherson1-9/+22
Register /dev/kvm, i.e. expose KVM to userspace, only after all other setup has completed. Once /dev/kvm is exposed, userspace can start invoking KVM ioctls, creating VMs, etc... If userspace creates a VM before KVM is done with its configuration, bad things may happen, e.g. KVM will fail to properly migrate vCPU state if a VM is created before KVM has registered preemption notifiers. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221130230934.1014142-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-29Merge branch 'kvm-late-6.1' into HEADPaolo Bonzini1-4/+0
x86: * Change tdp_mmu to a read-only parameter * Separate TDP and shadow MMU page fault paths * Enable Hyper-V invariant TSC control selftests: * Use TAP interface for kvm_binary_stats_test and tsc_msrs_test Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-28Merge branch 'kvm-late-6.1-fixes' into HEADPaolo Bonzini1-4/+0
x86: * several fixes to nested VMX execution controls * fixes and clarification to the documentation for Xen emulation * do not unnecessarily release a pmu event with zero period * MMU fixes * fix Coverity warning in kvm_hv_flush_tlb() selftests: * fixes for the ucall mechanism in selftests * other fixes mostly related to compilation with clang
2022-12-27kvm: Remove the unused macro KVM_MMU_READ_{,UN}LOCK()Lai Jiangshan1-4/+0
No code is using KVM_MMU_READ_LOCK() or KVM_MMU_READ_UNLOCK(). They used to be in virt/kvm/pfncache.c: KVM_MMU_READ_LOCK(kvm); retry = mmu_notifier_retry_hva(kvm, mmu_seq, uhva); KVM_MMU_READ_UNLOCK(kvm); However, since 58cd407ca4c6 ("KVM: Fix multiple races in gfn=>pfn cache refresh", 2022-05-25) the code is only relying on the MMU notifier's invalidation count and sequence number. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20221207120617.9409-1-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
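The "invalidation count and sequence number" check mentioned above can be modeled roughly as follows. This is a single-threaded sketch under stated assumptions, not the kernel's code: the struct, field, and function names are hypothetical, and the memory barriers and locking that the real code depends on are omitted.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-VM MMU notifier state. */
struct mmu_notifier_sketch {
	unsigned long invalidate_seq;   /* bumped when an invalidation completes */
	long invalidate_in_progress;    /* number of invalidations in flight */
};

/* An invalidation starts: mappings may be changing from now on. */
static void invalidate_begin_sketch(struct mmu_notifier_sketch *n)
{
	n->invalidate_in_progress++;
}

/* An invalidation ends: bump the sequence so older snapshots become stale. */
static void invalidate_end_sketch(struct mmu_notifier_sketch *n)
{
	n->invalidate_seq++;
	n->invalidate_in_progress--;
}

/*
 * A lookup that snapshotted the sequence before resolving an address must
 * be retried if an invalidation is in flight or has completed since then.
 */
static bool must_retry_sketch(const struct mmu_notifier_sketch *n,
			      unsigned long seq_snapshot)
{
	return n->invalidate_in_progress > 0 ||
	       n->invalidate_seq != seq_snapshot;
}
```

A caller would snapshot invalidate_seq, perform the slow lookup, and loop while must_retry_sketch() returns true.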
2022-12-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds6-92/+188
Pull kvm updates from Paolo Bonzini: "ARM64: - Enable the per-vcpu dirty-ring tracking mechanism, together with an option to keep the good old dirty log around for pages that are dirtied by something other than a vcpu. - Switch to the relaxed parallel fault handling, using RCU to delay page table reclaim and giving better performance under load. - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option, which multi-process VMMs such as crosvm rely on (see merge commit 382b5b87a97d: "Fix a number of issues with MTE, such as races on the tags being initialised vs the PG_mte_tagged flag as well as the lack of support for VM_SHARED when KVM is involved. Patches from Catalin Marinas and Peter Collingbourne"). - Merge the pKVM shadow vcpu state tracking that allows the hypervisor to have its own view of a vcpu, keeping that state private. - Add support for the PMUv3p5 architecture revision, bringing support for 64bit counters on systems that support it, and fix the not-quite-compliant CHAIN-ed counter support for the machines that actually exist out there. - Fix a handful of minor issues around 52bit VA/PA support (64kB pages only) as a prefix of the oncoming support for 4kB and 16kB pages. - Pick a small set of documentation and spelling fixes, because no good merge window would be complete without those. s390: - Second batch of the lazy destroy patches - First batch of KVM changes for kernel virtual != physical address support - Removal of an unused function x86: - Allow compiling out SMM support - Cleanup and documentation of SMM state save area format - Preserve interrupt shadow in SMM state save area - Respond to generic signals during slow page faults - Fixes and optimizations for the non-executable huge page errata fix. - Reprogram all performance counters on PMU filter change - Cleanups to Hyper-V emulation and tests - Process Hyper-V TLB flushes from a nested guest (i.e.
from an L2 guest running on top of an L1 Hyper-V hypervisor) - Advertise several new Intel features - x86 Xen-for-KVM: - Allow the Xen runstate information to cross a page boundary - Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured - Add support for 32-bit guests in SCHEDOP_poll - Notable x86 fixes and cleanups: - One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0). - Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few years back when eliminating unnecessary barriers when switching between vmcs01 and vmcs02. - Clean up vmread_error_trampoline() to make it more obvious that params must be passed on the stack, even for x86-64. - Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective of the current guest CPUID. - Fudge around a race with TSC refinement that results in KVM incorrectly thinking a guest needs TSC scaling when running on a CPU with a constant TSC, but no hardware-enumerated TSC frequency. - Advertise (on AMD) that the SMM_CTL MSR is not supported - Remove unnecessary exports Generic: - Support for responding to signals during page faults; introduces new FOLL_INTERRUPTIBLE flag that was reviewed by mm folks Selftests: - Fix an inverted check in the access tracking perf test, and restore support for asserting that there aren't too many idle pages when running on bare metal. - Fix build errors that occur in certain setups (unsure exactly what is unique about the problematic setup) due to glibc overriding static_assert() to a variant that requires a custom message. - Introduce actual atomics for clear/set_bit() in selftests - Add support for pinning vCPUs in dirty_log_perf_test. - Rename the so called "perf_util" framework to "memstress". - Add a lightweight pseudo RNG for guest use, and use it to randomize the access pattern and write vs. read percentage in the memstress tests. - Add a common ucall implementation; code dedup and pre-work for running SEV (and beyond) guests in selftests.
- Provide a common constructor and arch hook, which will eventually be used by x86 to automatically select the right hypercall (AMD vs. Intel). - A bunch of added/enabled/fixed selftests for ARM64, covering memslots, breakpoints, stage-2 faults and access tracking. - x86-specific selftest changes: - Clean up x86's page table management. - Clean up and enhance the "smaller maxphyaddr" test, and add a related test to cover generic emulation failure. - Clean up the nEPT support checks. - Add X86_PROPERTY_* framework to retrieve multi-bit CPUID values. - Fix an ordering issue in the AMX test introduced by recent conversions to use kvm_cpu_has(), and harden the code to guard against similar bugs in the future. Anything that triggers caching of KVM's supported CPUID, kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if the caching occurs before the test opts in via prctl(). Documentation: - Remove deleted ioctls from documentation - Clean up the docs for the x86 MSR filter. - Various fixes" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (361 commits) KVM: x86: Add proper ReST tables for userspace MSR exits/flags KVM: selftests: Allocate ucall pool from MEM_REGION_DATA KVM: arm64: selftests: Align VA space allocator with TTBR0 KVM: arm64: Fix benign bug with incorrect use of VA_BITS KVM: arm64: PMU: Fix period computation for 64bit counters with 32bit overflow KVM: x86: Advertise that the SMM_CTL MSR is not supported KVM: x86: remove unnecessary exports KVM: selftests: Fix spelling mistake "probabalistic" -> "probabilistic" tools: KVM: selftests: Convert clear/set_bit() to actual atomics tools: Drop "atomic_" prefix from atomic test_and_set_bit() tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests perf tools: Use dedicated non-atomic clear/set bit helpers tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers KVM: arm64: selftests: Enable
single-step without a "full" ucall() KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself KVM: Remove stale comment about KVM_REQ_UNHALT KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR KVM: Reference to kvm_userspace_memory_region in doc and comments KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl ...
2022-12-12Merge remote-tracking branch 'kvm/queue' into HEADPaolo Bonzini2-62/+54
x86 Xen-for-KVM: * Allow the Xen runstate information to cross a page boundary * Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured * add support for 32-bit guests in SCHEDOP_poll x86 fixes: * One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0). * Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few years back when eliminating unnecessary barriers when switching between vmcs01 and vmcs02. * Clean up the MSR filter docs. * Clean up vmread_error_trampoline() to make it more obvious that params must be passed on the stack, even for x86-64. * Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective of the current guest CPUID. * Fudge around a race with TSC refinement that results in KVM incorrectly thinking a guest needs TSC scaling when running on a CPU with a constant TSC, but no hardware-enumerated TSC frequency. * Advertise (on AMD) that the SMM_CTL MSR is not supported * Remove unnecessary exports Selftests: * Fix an inverted check in the access tracking perf test, and restore support for asserting that there aren't too many idle pages when running on bare metal. * Fix an ordering issue in the AMX test introduced by recent conversions to use kvm_cpu_has(), and harden the code to guard against similar bugs in the future. Anything that triggers caching of KVM's supported CPUID, kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if the caching occurs before the test opts in via prctl(). * Fix build errors that occur in certain setups (unsure exactly what is unique about the problematic setup) due to glibc overriding static_assert() to a variant that requires a custom message. * Introduce actual atomics for clear/set_bit() in selftests Documentation: * Remove deleted ioctls from documentation * Various fixes
2022-12-09Merge tag 'kvmarm-6.2' of ↵Paolo Bonzini3-14/+103
https://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 6.2 - Enable the per-vcpu dirty-ring tracking mechanism, together with an option to keep the good old dirty log around for pages that are dirtied by something other than a vcpu. - Switch to the relaxed parallel fault handling, using RCU to delay page table reclaim and giving better performance under load. - Relax the MTE ABI, allowing a VMM to use the MAP_SHARED mapping option, which multi-process VMMs such as crosvm rely on. - Merge the pKVM shadow vcpu state tracking that allows the hypervisor to have its own view of a vcpu, keeping that state private. - Add support for the PMUv3p5 architecture revision, bringing support for 64bit counters on systems that support it, and fix the not-quite-compliant CHAIN-ed counter support for the machines that actually exist out there. - Fix a handful of minor issues around 52bit VA/PA support (64kB pages only) as a prefix of the oncoming support for 4kB and 16kB pages. - Add/Enable/Fix a bunch of selftests covering memslots, breakpoints, stage-2 faults and access tracking. You name it, we got it, we probably broke it. - Pick a small set of documentation and spelling fixes, because no good merge window would be complete without those. As a side effect, this tag also drags: - The 'kvmarm-fixes-6.1-3' tag as a dependency to the dirty-ring series - A shared branch with the arm64 tree that repaints all the system registers to match the ARM ARM's naming, and resulting in interesting conflicts
2022-12-05Merge branch kvm-arm64/dirty-ring into kvmarm-master/nextMarc Zyngier3-14/+103
* kvm-arm64/dirty-ring: : . : Add support for the "per-vcpu dirty-ring tracking with a bitmap : and sprinkles on top", courtesy of Gavin Shan. : : This branch drags the kvmarm-fixes-6.1-3 tag which was already : merged in 6.1-rc4 so that the branch is in a working state. : . KVM: Push dirty information unconditionally to backup bitmap KVM: selftests: Automate choosing dirty ring size in dirty_log_test KVM: selftests: Clear dirty ring states between two modes in dirty_log_test KVM: selftests: Use host page size to map ring buffer in dirty_log_test KVM: arm64: Enable ring-based dirty memory tracking KVM: Support dirty ring in conjunction with bitmap KVM: Move declaration of kvm_cpu_dirty_log_size() to kvm_dirty_ring.h KVM: x86: Introduce KVM_REQ_DIRTY_RING_SOFT_FULL Signed-off-by: Marc Zyngier <maz@kernel.org>
2022-12-02Merge branch 'gpc-fixes' of git://git.infradead.org/users/dwmw2/linux into HEADPaolo Bonzini1-54/+50
Pull Xen-for-KVM changes from David Woodhouse: * add support for 32-bit guests in SCHEDOP_poll * the rest of the gfn-to-pfn cache API cleanup "I still haven't reinstated the last of those patches to make gpc->len immutable." Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-12-02KVM: Remove stale comment about KVM_REQ_UNHALTSean Christopherson1-4/+0
Remove a comment about KVM_REQ_UNHALT being set by kvm_vcpu_check_block() that was missed when KVM_REQ_UNHALT was dropped. Fixes: c59fb1275838 ("KVM: remove KVM_REQ_UNHALT") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221201220433.31366-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30KVM: Skip unnecessary "unmap" if gpc is already valid during refreshSean Christopherson1-2/+1
When refreshing a gfn=>pfn cache, skip straight to unlocking if the cache is already valid instead of stuffing the "old" variables to turn the unmapping outro into a nop. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: Drop @gpa from exported gfn=>pfn cache check() and refresh() helpersSean Christopherson1-6/+11
Drop the @gpa param from the exported check()+refresh() helpers and limit changing the cache's GPA to the activate path. All external users just feed in gpc->gpa, i.e. this is a fancy nop. Allowing users to change the GPA at check()+refresh() is dangerous as those helpers explicitly allow concurrent calls, e.g. KVM could get into a livelock scenario. It's also unclear as to what the expected behavior should be if multiple tasks attempt to refresh with different GPAs. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: Do not partially reinitialize gfn=>pfn cache during activationSean Christopherson1-4/+4
Don't partially reinitialize a gfn=>pfn cache when activating the cache, and instead assert that the cache is not valid during activation. Bug the VM if the assertion fails, as use-after-free and/or data corruption is all but guaranteed if KVM ends up with a valid-but-inactive cache. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: Drop KVM's API to allow temporarily unmapping gfn=>pfn cacheSean Christopherson1-28/+16
Drop kvm_gpc_unmap() as it has no users and unclear requirements. The API was added as part of the original gfn_to_pfn_cache support, but its sole usage[*] was never merged. Fold the guts of kvm_gpc_unmap() into the deactivate path and drop the API. Omit acquiring refresh_lock as concurrent calls to kvm_gpc_deactivate() are not allowed (this is not enforced, e.g. via lockdep, due to it being called during vCPU destruction). If/when temporary unmapping makes a comeback, the desirable behavior is likely to restrict temporary unmapping to vCPU-exclusive mappings and require the vcpu->mutex be held to serialize unmap. Use of the refresh_lock to protect unmapping was somewhat speculatively added by commit 93984f19e7bc ("KVM: Fully serialize gfn=>pfn cache refresh via mutex") to guard against concurrent unmaps, but the primary use case of the temporary unmap, nested virtualization[*], doesn't actually need or want concurrent unmaps. [*] https://lore.kernel.org/all/20211210163625.2886-7-dwmw2@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_refresh()Michal Luczaj1-4/+3
Make kvm_gpc_refresh() use kvm instance cached in gfn_to_pfn_cache. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> [sean: leave kvm_gpc_unmap() as-is] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: Clean up hva_to_pfn_retry()Michal Luczaj1-4/+4
Make hva_to_pfn_retry() use kvm instance cached in gfn_to_pfn_cache. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: Use gfn_to_pfn_cache's immutable "kvm" in kvm_gpc_check()Michal Luczaj1-3/+2
Make kvm_gpc_check() use kvm instance cached in gfn_to_pfn_cache. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: Store immutable gfn_to_pfn_cache propertiesMichal Luczaj1-8/+14
Move the assignment of immutable properties @kvm, @vcpu, and @usage to the initializer. Make _activate() and _deactivate() use stored values. Note, @len is also effectively immutable for most cases, but not in the case of the Xen runstate cache, which may be split across two pages and the length of the first segment will depend on its address. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> [sean: handle @len in a separate patch] Signed-off-by: Sean Christopherson <seanjc@google.com> [dwmw2: acknowledge that @len can actually change for some use cases] Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2022-11-30KVM: x86: Remove unused argument in gpc_unmap_khva()Michal Luczaj1-4/+4
Remove the unused @kvm argument from gpc_unmap_khva(). Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-30KVM: Shorten gfn_to_pfn_cache function namesMichal Luczaj1-10/+10
Formalize "gpc" as the acronym and use it in function names. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-23Merge branch 'kvm-dwmw2-fixes' into HEADPaolo Bonzini1-1/+6
This brings in a few important fixes for Xen emulation. While nobody should be enabling it, the bug effectively allows userspace to read arbitrary memory. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-23KVM: Update gfn_to_pfn_cache khva when it moves within the same pageDavid Woodhouse1-1/+6
In the case where a GPC is refreshed to a different location within the same page, we didn't bother to update it. Mostly we don't need to, but since the ->khva field also includes the offset within the page, that does have to be updated. Fixes: 3ba2c95ea180 ("KVM: Do not incorporate page offset into gfn=>pfn cache user address") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: stable@kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
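The fix described above can be illustrated with a small userspace model. This is only a sketch: the struct and helper names are hypothetical, a 4 KiB page size is assumed, and the real cache carries more state (locking, validity, the pfn itself).

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL   /* assumption: 4 KiB pages */

/* Hypothetical, simplified cache holding only the fields this sketch needs. */
struct gpc_sketch {
	uint64_t gpa;        /* guest physical address being cached */
	uintptr_t page_va;   /* host kernel mapping of the containing page */
	uintptr_t khva;      /* page_va plus the offset of gpa within the page */
};

static uint64_t offset_in_page_sketch(uint64_t addr)
{
	return addr & (SKETCH_PAGE_SIZE - 1);
}

/*
 * Refresh the cache to a new GPA. Returns 1 when the new GPA lies in the
 * same page, in which case the existing mapping is reused but khva is still
 * recomputed because it embeds the in-page offset. Returns 0 when a full
 * remap would be needed (not modeled here).
 */
static int gpc_refresh_sketch(struct gpc_sketch *gpc, uint64_t new_gpa)
{
	int same_page = (gpc->gpa / SKETCH_PAGE_SIZE) ==
			(new_gpa / SKETCH_PAGE_SIZE);

	gpc->gpa = new_gpa;
	if (!same_page)
		return 0;

	gpc->khva = gpc->page_va + (uintptr_t)offset_in_page_sketch(new_gpa);
	return 1;
}
```

Skipping the khva update in the same-page case would leave the pointer aimed at the old offset, which is the bug the commit describes.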
2022-11-18KVM: x86: avoid memslot check in NX hugepage recovery if it cannot succeedPaolo Bonzini1-0/+8
Since gfn_to_memslot() is relatively expensive, it helps to skip it if the memslot cannot possibly have dirty logging enabled. In order to do this, add to struct kvm a counter of the number of log-page memslots. While the correct value can only be read with slots_lock taken, the NX recovery thread is content with using an approximate value. Therefore, the counter is an atomic_t. Based on https://lore.kernel.org/kvm/20221027200316.2221027-2-dmatlack@google.com/ by David Matlack. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
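The counter-based shortcut can be sketched in userspace with C11 atomics. All names below are hypothetical stand-ins, not the kernel's, and the sketch only shows the idea: writers adjust the counter when dirty logging is toggled on a slot, and a reader uses a relaxed, possibly stale load to decide whether the expensive lookup is worth doing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical VM state: number of memslots with dirty logging enabled. */
struct vm_sketch {
	atomic_int nr_log_dirty_slots;
};

/* Called when a slot's dirty-logging state changes (writers are assumed
 * to be serialized by a lock that is not modeled here). */
static void slot_set_dirty_logging_sketch(struct vm_sketch *vm,
					  bool was_on, bool now_on)
{
	if (!was_on && now_on) {
		atomic_fetch_add(&vm->nr_log_dirty_slots, 1);
	} else if (was_on && !now_on) {
		int prev = atomic_fetch_sub(&vm->nr_log_dirty_slots, 1);
		assert(prev > 0);   /* the counter must never go negative */
	}
}

/* Reader side: an approximate value is good enough to skip the lookup. */
static bool may_need_slot_lookup_sketch(struct vm_sketch *vm)
{
	return atomic_load_explicit(&vm->nr_log_dirty_slots,
				    memory_order_relaxed) != 0;
}
```

When the counter reads zero, no slot can have dirty logging enabled, so the per-page memslot lookup can be skipped entirely.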
2022-11-17KVM: Obey kvm.halt_poll_ns in VMs not using KVM_CAP_HALT_POLLDavid Matlack1-3/+24
Obey kvm.halt_poll_ns in VMs not using KVM_CAP_HALT_POLL on every halt, rather than just sampling the module parameter when the VM is first created. This restores the original behavior of kvm.halt_poll_ns for VMs that have not opted into KVM_CAP_HALT_POLL. Notably, this change restores the ability for admins to disable or change the maximum halt-polling time system wide for VMs not using KVM_CAP_HALT_POLL. Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Fixes: acd05785e48c ("kvm: add capability for halt polling") Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20221117001657.1067231-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
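One way to picture the behavior described above is a helper that picks the maximum polling time at each halt. The sketch below is an assumption-laden model, not the kernel's implementation: the struct, the override flag, and the variable standing in for the module parameter are all illustrative names.

```c
#include <stdbool.h>

/* Stand-in for the kvm.halt_poll_ns module parameter (illustrative value). */
static unsigned int halt_poll_ns_param_sketch = 200000;

/* Hypothetical per-VM state. */
struct kvm_sketch {
	bool override_halt_poll_ns;     /* set once the VM uses KVM_CAP_HALT_POLL */
	unsigned int max_halt_poll_ns;  /* per-VM maximum, valid if overridden */
};

/*
 * Evaluated on every halt: VMs that opted in use their own maximum, all
 * others follow the module parameter as it is set right now.
 */
static unsigned int vcpu_max_halt_poll_ns_sketch(const struct kvm_sketch *kvm)
{
	if (kvm->override_halt_poll_ns)
		return kvm->max_halt_poll_ns;
	return halt_poll_ns_param_sketch;
}
```

Because the parameter is read at halt time rather than at VM creation, an admin changing it affects already-running VMs that have not opted in.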
2022-11-17KVM: Avoid re-reading kvm->max_halt_poll_ns during halt-pollingDavid Matlack1-6/+15
Avoid re-reading kvm->max_halt_poll_ns multiple times during halt-polling except when it is explicitly useful, e.g. to check if the max time changed across a halt. kvm->max_halt_poll_ns can be changed at any time by userspace via KVM_CAP_HALT_POLL. This bug is unlikely to cause any serious side-effects. In the worst case one halt polls for shorter or longer than it should, and then is fixed up on the next halt. Furthermore, this is still possible since updates to kvm->max_halt_poll_ns are not synchronized with halts. Fixes: acd05785e48c ("kvm: add capability for halt polling") Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20221117001657.1067231-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-17KVM: Cap vcpu->halt_poll_ns before halting rather than afterDavid Matlack1-4/+6
Cap vcpu->halt_poll_ns based on the max halt polling time just before halting, rather than after the last halt. This arguably provides better accuracy if an admin disables halt polling in between halts, although the improvement is nominal. A side-effect of this change is that grow_halt_poll_ns() no longer needs to access vcpu->kvm->max_halt_poll_ns, which will be useful in a future commit where the max halt polling time can come from the module parameter halt_poll_ns instead. Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20221117001657.1067231-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
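The reordering can be sketched as two small helpers: growing no longer consults the maximum, and the clamp happens immediately before the halt. The names, parameters, and growth policy below are simplified assumptions for illustration, not the kernel's code.

```c
#include <stdint.h>

/* Hypothetical vCPU state: the current adaptive polling window. */
struct vcpu_poll_sketch {
	uint64_t halt_poll_ns;
};

/* Grow the window without looking at any maximum. */
static void grow_halt_poll_ns_sketch(struct vcpu_poll_sketch *v,
				     uint64_t grow, uint64_t grow_start)
{
	uint64_t val = v->halt_poll_ns * grow;

	if (val < grow_start)
		val = grow_start;
	v->halt_poll_ns = val;
}

/*
 * Called right before halting: read the maximum once, clamp the window to
 * it, and return the polling budget for this halt.
 */
static uint64_t halt_poll_budget_sketch(struct vcpu_poll_sketch *v,
					uint64_t max_halt_poll_ns)
{
	if (v->halt_poll_ns > max_halt_poll_ns)
		v->halt_poll_ns = max_halt_poll_ns;
	return v->halt_poll_ns;
}
```

With this ordering, a maximum of 0 set between two halts takes effect on the very next halt, regardless of how far the window had grown.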
2022-11-12KVM: Push dirty information unconditionally to backup bitmapGavin Shan1-3/+2
In mark_page_dirty_in_slot(), we bail out when no running vcpu exists and a running vcpu context is strictly required by architecture. This may cause a backward-compatibility issue. Currently, saving vgic/its tables is the only known case where no running vcpu context is expected. We may have other unknown cases where no running vcpu context exists and it's reported by the warning message and we bail out without pushing the dirty information to the backup bitmap. For this, the application is going to enable the backup bitmap for the unknown cases. However, the dirty information can't be pushed to the backup bitmap even though the backup bitmap is enabled for those unknown cases in the application, until the unknown cases are added to the allowed list of non-running vcpu context with extra code changes to the host kernel. In order to make the new application, where the backup bitmap has been enabled, work with the unchanged host, we continue to push the dirty information to the backup bitmap instead of bailing out early. With the added check on 'memslot->dirty_bitmap' in mark_page_dirty_in_slot(), the kernel crash is avoided silently by the combined conditions: no running vcpu context, kvm_arch_allow_write_without_running_vcpu() returns 'true', and the backup bitmap (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) isn't enabled yet. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221112094322.21911-1-gshan@redhat.com
2022-11-10KVM: Support dirty ring in conjunction with bitmapGavin Shan3-9/+72
ARM64 needs to dirty memory outside of a VCPU context when VGIC/ITS is enabled. This conflicts with ring-based dirty page tracking, which always requires a running VCPU context. Introduce a new flavor of dirty ring that requires the use of both VCPU dirty rings and a dirty bitmap. The expectation is that for non-VCPU sources of dirty memory (such as the VGIC/ITS on arm64), KVM writes to the dirty bitmap. Userspace should scan the dirty bitmap before migrating the VM to the target. Use an additional capability to advertise this behavior. The newly added capability (KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP) can't be enabled before KVM_CAP_DIRTY_LOG_RING_ACQ_REL on ARM64. In this way, the newly added capability is treated as an extension of KVM_CAP_DIRTY_LOG_RING_ACQ_REL. Suggested-by: Marc Zyngier <maz@kernel.org> Suggested-by: Peter Xu <peterx@redhat.com> Co-developed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Gavin Shan <gshan@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-4-gshan@redhat.com
2022-11-10KVM: x86: Introduce KVM_REQ_DIRTY_RING_SOFT_FULLGavin Shan2-4/+31
The VCPU isn't expected to be runnable when the dirty ring becomes soft full, until the dirty pages are harvested and the dirty ring is reset from userspace. So there is a check in each guest's entrance to see if the dirty ring is soft full or not. The VCPU is stopped from running if its dirty ring is soft full. A similar check will be needed when the feature is supported on ARM64. As Marc Zyngier suggested, a new event will avoid the pointless overhead of checking the size of the dirty ring ('vcpu->kvm->dirty_ring_size') in each guest's entrance. Add KVM_REQ_DIRTY_RING_SOFT_FULL. The event is raised when the dirty ring becomes soft full in kvm_dirty_ring_push(). The event is only cleared in the check, done in the newly added helper kvm_dirty_ring_check_request(). Since the VCPU is not runnable when the dirty ring becomes soft full, the KVM_REQ_DIRTY_RING_SOFT_FULL event is always set to prevent the VCPU from running until the dirty pages are harvested and the dirty ring is reset by userspace. kvm_dirty_ring_soft_full() becomes a private function with the newly added helper kvm_dirty_ring_check_request(). The alignment for the various event definitions in kvm_host.h is changed to tab character by the way. In order to avoid using 'container_of()', the argument @ring is replaced by @vcpu in kvm_dirty_ring_push(). Link: https://lore.kernel.org/kvmarm/87lerkwtm5.wl-maz@kernel.org Suggested-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221110104914.31280-2-gshan@redhat.com
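The request flow described above can be modeled in a few lines. This is a hedged sketch: a plain bool stands in for the KVM_REQ_DIRTY_RING_SOFT_FULL request bit, the struct layouts and function names are hypothetical, and the ring entries themselves are not stored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ring bookkeeping: only indices, no entries. */
struct dirty_ring_sketch {
	uint32_t soft_limit;    /* "soft full" threshold */
	uint32_t dirty_index;   /* next slot to fill */
	uint32_t reset_index;   /* everything below this has been harvested */
};

struct vcpu_ring_sketch {
	struct dirty_ring_sketch ring;
	bool req_soft_full;     /* stand-in for the pending request bit */
};

static bool ring_soft_full_sketch(const struct dirty_ring_sketch *r)
{
	return r->dirty_index - r->reset_index >= r->soft_limit;
}

/* Producer: record one dirty page and raise the request on soft full. */
static void ring_push_sketch(struct vcpu_ring_sketch *v)
{
	v->ring.dirty_index++;
	if (ring_soft_full_sketch(&v->ring))
		v->req_soft_full = true;
}

/*
 * Guest-entry check: consume the request, and if the ring is still soft
 * full, re-raise it and report that the vCPU must exit to userspace.
 */
static bool ring_check_request_sketch(struct vcpu_ring_sketch *v)
{
	if (!v->req_soft_full)
		return false;

	v->req_soft_full = false;
	if (ring_soft_full_sketch(&v->ring)) {
		v->req_soft_full = true;
		return true;
	}
	return false;
}

/* Userspace harvested all entries and reset the ring. */
static void ring_reset_sketch(struct vcpu_ring_sketch *v)
{
	v->ring.reset_index = v->ring.dirty_index;
}
```

The common case (no request pending) costs a single flag test at guest entry, which is the overhead reduction the commit message argues for.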
2022-11-09KVM: replace direct irq.h inclusionPaolo Bonzini1-2/+1
virt/kvm/irqchip.c is including "irq.h" from the arch-specific KVM source directory (i.e. not from arch/*/include) for the sole purpose of retrieving irqchip_in_kernel. Making the function inline in a header that is already included, such as asm/kvm_host.h, is not possible because it needs to look at struct kvm which is defined after asm/kvm_host.h is included. So add a kvm_arch_irqchip_in_kernel non-inline function; irqchip_in_kernel() is only performance critical on arm64 and x86, and the non-inline function is enough on all other architectures. irq.h can then be deleted from all architectures except x86. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09kvm: Add interruptible flag to __gfn_to_pfn_memslot()Peter Xu3-14/+20
Add a new "interruptible" flag showing that the caller is willing to be interrupted by signals during the __gfn_to_pfn_memslot() request. Wire it up with a FOLL_INTERRUPTIBLE flag that we've just introduced. This prepares KVM to be able to respond to SIGUSR1 (for QEMU that's the SIGIPI) even during e.g. handling a userfaultfd page fault. No functional change intended. Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221011195809.557016-4-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-09kvm: Add KVM_PFN_ERR_SIGPENDINGPeter Xu1-0/+2
Add a new pfn error to show that we've got a pending signal to handle during hva_to_pfn_slow() procedure (of -EINTR retval). Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221011195809.557016-3-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-11-06Merge tag 'kvmarm-fixes-6.1-3' of ↵Paolo Bonzini1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD * Fix the pKVM stage-1 walker erroneously using the stage-2 accessor * Correctly convert vcpu->kvm to a hyp pointer when generating an exception in a nVHE+MTE configuration * Check that KVM_CAP_DIRTY_LOG_* are valid before enabling them * Fix SMPRI_EL1/TPIDR2_EL0 trapping on VHE * Document the boot requirements for FGT when entering the kernel at EL1
2022-10-31KVM: Check KVM_CAP_DIRTY_LOG_{RING, RING_ACQ_REL} prior to enabling themGavin Shan1-0/+3
There are two capabilities related to ring-based dirty page tracking: KVM_CAP_DIRTY_LOG_RING and KVM_CAP_DIRTY_LOG_RING_ACQ_REL. Both are supported by x86. However, arm64 supports KVM_CAP_DIRTY_LOG_RING_ACQ_REL only when the feature is supported on arm64. Userspace doesn't have to enable the advertised capability, meaning KVM_CAP_DIRTY_LOG_RING can be enabled on arm64 by userspace, which is wrong. Fix it by double checking if the capability has been advertised prior to enabling it. Enabling the capability is rejected if it hasn't been advertised. Fixes: 17601bfed909 ("KVM: Add KVM_CAP_DIRTY_LOG_RING_ACQ_REL capability and config option") Reported-by: Sean Christopherson <seanjc@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221031003621.164306-4-gshan@redhat.com
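The "check that a capability is advertised before enabling it" rule can be modeled in a few lines. This is a hedged sketch only: `toy_vm`, `toy_check_extension()`, and `toy_enable_cap()` are hypothetical stand-ins for illustration, not the kernel's ioctl handlers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical capability identifiers for the two dirty-ring flavors. */
enum toy_cap {
    TOY_CAP_RING,
    TOY_CAP_RING_ACQ_REL,
    TOY_CAP_COUNT
};

struct toy_vm {
    bool advertised[TOY_CAP_COUNT]; /* what the architecture supports */
    bool enabled[TOY_CAP_COUNT];    /* what userspace has switched on */
};

/* Analogous to querying an extension: is the capability advertised? */
static bool toy_check_extension(const struct toy_vm *vm, enum toy_cap cap)
{
    return vm->advertised[cap];
}

/* Refuse to enable anything that was never advertised. */
static int toy_enable_cap(struct toy_vm *vm, enum toy_cap cap)
{
    if (!toy_check_extension(vm, cap))
        return -EINVAL;

    vm->enabled[cap] = true;
    return 0;
}
```

With an arm64-like VM that advertises only the ACQ_REL flavor, an attempt to enable the plain ring capability fails instead of silently succeeding.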
2022-10-27KVM: Reject attempts to consume or refresh inactive gfn_to_pfn_cacheSean Christopherson1-7/+34
Reject kvm_gpc_check() and kvm_gpc_refresh() if the cache is inactive. Not checking the active flag during refresh is particularly egregious, as KVM can end up with a valid, inactive cache, which can lead to a variety of use-after-free bugs, e.g. consuming a NULL kernel pointer or missing an mmu_notifier invalidation due to the cache not being on the list of gfns to invalidate. Note, "active" needs to be set if and only if the cache is on the list of caches, i.e. is reachable via mmu_notifier events. If a relevant mmu_notifier event occurs while the cache is "active" but not on the list, KVM will not acquire the cache's lock and so will not serialize the mmu_notifier event with active users and/or kvm_gpc_refresh(). A race between KVM_XEN_ATTR_TYPE_SHARED_INFO and KVM_XEN_HVM_EVTCHN_SEND can be exploited to trigger the bug. 1. Deactivate shinfo cache: kvm_xen_hvm_set_attr case KVM_XEN_ATTR_TYPE_SHARED_INFO kvm_gpc_deactivate kvm_gpc_unmap gpc->valid = false gpc->khva = NULL gpc->active = false Result: active = false, valid = false 2. Cause cache refresh: kvm_arch_vm_ioctl case KVM_XEN_HVM_EVTCHN_SEND kvm_xen_hvm_evtchn_send kvm_xen_set_evtchn kvm_xen_set_evtchn_fast kvm_gpc_check return -EWOULDBLOCK because !gpc->valid kvm_xen_set_evtchn_fast return -EWOULDBLOCK kvm_gpc_refresh hva_to_pfn_retry gpc->valid = true gpc->khva = not NULL Result: active = false, valid = true 3. Race ioctl KVM_XEN_HVM_EVTCHN_SEND against ioctl KVM_XEN_ATTR_TYPE_SHARED_INFO: kvm_arch_vm_ioctl case KVM_XEN_HVM_EVTCHN_SEND kvm_xen_hvm_evtchn_send kvm_xen_set_evtchn kvm_xen_set_evtchn_fast read_lock gpc->lock kvm_xen_hvm_set_attr case KVM_XEN_ATTR_TYPE_SHARED_INFO mutex_lock kvm->lock kvm_xen_shared_info_init kvm_gpc_activate gpc->khva = NULL kvm_gpc_check [ Check passes because gpc->valid is still true, even though gpc->khva is already NULL. 
] shinfo = gpc->khva pending_bits = shinfo->evtchn_pending CRASH: test_and_set_bit(..., pending_bits) Fixes: 982ed0de4753 ("KVM: Reinstate gfn_to_pfn_cache with invalidation support") Cc: stable@vger.kernel.org Reported-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221013211234.1318131-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
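The invariant this fix enforces, that an inactive cache can be neither consumed nor refreshed, can be illustrated with a toy state machine. This is a simplified single-threaded sketch: the `toy_gpc` struct and helpers are hypothetical, omit all locking, and the `-EINVAL` return value is an assumption for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, lock-free stand-in for a gfn_to_pfn_cache. */
struct toy_gpc {
    bool active; /* cache is on the list reachable by invalidations */
    bool valid;  /* khva currently maps the intended page */
    void *khva;  /* kernel mapping, NULL when unmapped */
};

/* Consumers may only use the mapping if the cache is active and valid. */
static bool toy_gpc_check(const struct toy_gpc *gpc)
{
    return gpc->active && gpc->valid && gpc->khva != NULL;
}

/* Refreshing an inactive cache is rejected, so it can never become valid. */
static int toy_gpc_refresh(struct toy_gpc *gpc, void *mapping)
{
    if (!gpc->active)
        return -EINVAL;

    gpc->khva = mapping;
    gpc->valid = true;
    return 0;
}

static void toy_gpc_deactivate(struct toy_gpc *gpc)
{
    gpc->valid = false;
    gpc->khva = NULL;
    gpc->active = false;
}
```

Replaying steps 1 and 2 of the race against this model, the refresh after deactivation fails, so the "active = false, valid = true" state from the commit message is no longer reachable.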
2022-10-27KVM: Initialize gfn_to_pfn_cache locks in dedicated helperMichal Luczaj1-9/+12
Move the gfn_to_pfn_cache lock initialization to another helper and call the new helper during VM/vCPU creation. There are race conditions possible due to kvm_gfn_to_pfn_cache_init()'s ability to re-initialize the cache's locks. For example: a race between ioctl(KVM_XEN_HVM_EVTCHN_SEND) and kvm_gfn_to_pfn_cache_init() leads to a corrupted shinfo gpc lock. (thread 1) | (thread 2) | kvm_xen_set_evtchn_fast | read_lock_irqsave(&gpc->lock, ...) | | kvm_gfn_to_pfn_cache_init | rwlock_init(&gpc->lock) read_unlock_irqrestore(&gpc->lock, ...) | Rename "cache_init" and "cache_destroy" to activate+deactivate to avoid implying that the cache really is destroyed/freed. Note, there are more races in the newly named kvm_gpc_activate() that will be addressed separately. Fixes: 982ed0de4753 ("KVM: Reinstate gfn_to_pfn_cache with invalidation support") Cc: stable@vger.kernel.org Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michal Luczaj <mhal@rbox.co> [sean: call out that this is a bug fix] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20221013211234.1318131-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-27KVM: debugfs: Return retval of simple_attr_open() if it failsHou Wenlong1-7/+6
Although simple_attr_open() fails only with -ENOMEM with current code base, it would be nicer to return retval of simple_attr_open() directly in kvm_debugfs_open(). No functional change intended. Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Message-Id: <69d64d93accd1f33691b8a383ae555baee80f943.1665975828.git.houwenlong.hwl@antgroup.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-22kvm: Add support for arch compat vm ioctlsAlexander Graf1-0/+11
We will introduce the first architecture specific compat vm ioctl in the next patch. Add all necessary boilerplate to allow architectures to override compat vm ioctls when necessary. Signed-off-by: Alexander Graf <graf@amazon.com> Message-Id: <20221017184541.2658-2-graf@amazon.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-10-12Merge tag 'vfio-v6.1-rc1' of https://github.com/awilliam/linux-vfioLinus Torvalds1-13/+32
Pull VFIO updates from Alex Williamson: - Prune private items from vfio_pci_core.h to a new internal header, fix missed function rename, and refactor vfio-pci interrupt defines (Jason Gunthorpe) - Create consistent naming and handling of ioctls with a function per ioctl for vfio-pci and vfio group handling, use proper type args where available (Jason Gunthorpe) - Implement a set of low power device feature ioctls allowing userspace to make use of power states such as D3cold where supported (Abhishek Sahu) - Remove device counter on vfio groups, which had restricted the page pinning interface to singleton groups to account for limitations in the type1 IOMMU backend. Document usage as limited to emulated IOMMU devices, ie. traditional mdev devices where this restriction is consistent (Jason Gunthorpe) - Correct function prefix in hisi_acc driver incurred during previous refactoring (Shameer Kolothum) - Correct typo and remove redundant warning triggers in vfio-fsl driver (Christophe JAILLET) - Introduce device level DMA dirty tracking uAPI and implementation in the mlx5 variant driver (Yishai Hadas & Joao Martins) - Move much of the vfio_device life cycle management into vfio core, simplifying and avoiding duplication across drivers. This also facilitates adding a struct device to vfio_device which begins the introduction of device rather than group level user support and fills a gap allowing userspace identify devices as vfio capable without implicit knowledge of the driver (Kevin Tian & Yi Liu) - Split vfio container handling to a separate file, creating a more well defined API between the core and container code, masking IOMMU backend implementation from the core, allowing for an easier future transition to an iommufd based implementation of the same (Jason Gunthorpe) - Attempt to resolve race accessing the iommu_group for a device between vfio releasing DMA ownership and removal of the device from the IOMMU driver. 
Follow-up with support to allow vfio_group to exist with NULL iommu_group pointer to support existing userspace use cases of holding the group file open (Jason Gunthorpe) - Fix error code and hi/lo register manipulation issues in the hisi_acc variant driver, along with various code cleanups (Longfang Liu) - Fix a prior regression in GVT-g group teardown, resulting in unreleased resources (Jason Gunthorpe) - A significant cleanup and simplification of the mdev interface, consolidating much of the open coded per driver sysfs interface support into the mdev core (Christoph Hellwig) - Simplification of tracking and locking around vfio_groups that fall out from previous refactoring (Jason Gunthorpe) - Replace trivial open coded f_ops tests with new helper (Alex Williamson) * tag 'vfio-v6.1-rc1' of https://github.com/awilliam/linux-vfio: (77 commits) vfio: More vfio_file_is_group() use cases vfio: Make the group FD disassociate from the iommu_group vfio: Hold a reference to the iommu_group in kvm for SPAPR vfio: Add vfio_file_is_group() vfio: Change vfio_group->group_rwsem to a mutex vfio: Remove the vfio_group->users and users_comp vfio/mdev: add mdev available instance checking to the core vfio/mdev: consolidate all the description sysfs into the core code vfio/mdev: consolidate all the available_instance sysfs into the core code vfio/mdev: consolidate all the name sysfs into the core code vfio/mdev: consolidate all the device_api sysfs into the core code vfio/mdev: remove mtype_get_parent_dev vfio/mdev: remove mdev_parent_dev vfio/mdev: unexport mdev_bus_type vfio/mdev: remove mdev_from_dev vfio/mdev: simplify mdev_type handling vfio/mdev: embedd struct mdev_parent in the parent data structure vfio/mdev: make mdev.h standalone includable drm/i915/gvt: simplify vgpu configuration management drm/i915/gvt: fix a memory leak in intel_gvt_init_vgpu_types ...
2022-10-07vfio: Hold a reference to the iommu_group in kvm for SPAPRJason Gunthorpe1-11/+14
SPAPR exists completely outside the normal iommu driver framework, the groups it creates are fake and are only created to enable VFIO's uAPI. Thus, it does not need to follow the iommu core rule that the iommu_group will only be touched while a driver is attached. Carry a group reference into KVM and have KVM directly manage the lifetime of this object independently of VFIO. This means KVM no longer relies on the vfio group file being valid to maintain the group reference. Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/2-v2-15417f29324e+1c-vfio_group_disassociate_jgg@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-10-07vfio: Add vfio_file_is_group()Jason Gunthorpe1-2/+18
This replaces uses of vfio_file_iommu_group() which were only detecting if the file is a VFIO file with no interest in the actual group. The only remaining user of vfio_file_iommu_group() is in KVM for the SPAPR stuff. It passes the iommu_group into the arch code through kvm for some reason. Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Eric Farman <farman@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1-v2-15417f29324e+1c-vfio_group_disassociate_jgg@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2022-10-03Merge tag 'kvmarm-6.1' of ↵Paolo Bonzini3-3/+24
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for v6.1 - Fixes for single-stepping in the presence of an async exception as well as the preservation of PSTATE.SS - Better handling of AArch32 ID registers on AArch64-only systems - Fixes for the dirty-ring API, allowing it to work on architectures with relaxed memory ordering - Advertise the new kvmarm mailing list - Various minor cleanups and spelling fixes
2022-09-29KVM: Add KVM_CAP_DIRTY_LOG_RING_ACQ_REL capability and config optionMarc Zyngier2-1/+22
In order to differentiate between architectures that require no extra synchronisation when accessing the dirty ring and those that do, add a new capability (KVM_CAP_DIRTY_LOG_RING_ACQ_REL) that identifies the latter sort. TSO architectures can obviously advertise both, while relaxed architectures must only advertise the ACQ_REL version. This requires some configuration symbol rejigging, with HAVE_KVM_DIRTY_RING being only indirectly selected by two top-level config symbols: - HAVE_KVM_DIRTY_RING_TSO for strongly ordered architectures (x86) - HAVE_KVM_DIRTY_RING_ACQ_REL for weakly ordered architectures (arm64) Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20220926145120.27974-3-maz@kernel.org
2022-09-29KVM: Use acquire/release semantics when accessing dirty ring GFN stateMarc Zyngier1-2/+2
The current implementation of the dirty ring has an implicit requirement that stores to the dirty ring from userspace must be: - ordered with one another - visible from another CPU executing a ring reset While these implicit requirements work well for x86 (and any other TSO-like architecture), they do not work for more relaxed architectures such as arm64 where stores to different addresses can be freely reordered, and loads from these addresses may not observe writes from another CPU unless the required barriers (or acquire/release semantics) are used. In order to start fixing this, upgrade the ring reset accesses: - the kvm_dirty_gfn_harvested() helper now uses acquire semantics so it is ordered after all previous writes, including that from userspace - the kvm_dirty_gfn_set_invalid() helper now uses release semantics so that the next_slot and next_offset reads don't drift past the entry invalidation This is only a partial fix as the userspace side also needs upgrading. Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Link: https://lore.kernel.org/r/20220926145120.27974-2-maz@kernel.org
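The acquire/release pairing described here can be approximated in portable C11 atomics. This is a sketch under assumptions: the kernel uses its own smp_load_acquire()/smp_store_release() primitives, and the `toy_*` names, flag values, and entry layout below are hypothetical illustrations rather than the real uAPI structure.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for a dirty-ring entry. */
#define TOY_GFN_F_DIRTY (1u << 0)
#define TOY_GFN_F_RESET (1u << 1)

struct toy_dirty_gfn {
    _Atomic uint32_t flags;
    uint32_t slot;
    uint64_t offset;
};

/* Userspace side: publish "harvested" with release ordering. */
static void toy_gfn_mark_harvested(struct toy_dirty_gfn *gfn)
{
    atomic_store_explicit(&gfn->flags, TOY_GFN_F_RESET, memory_order_release);
}

/* Reset side: the acquire load orders the later slot/offset reads after it. */
static bool toy_gfn_harvested(struct toy_dirty_gfn *gfn)
{
    uint32_t flags = atomic_load_explicit(&gfn->flags, memory_order_acquire);

    return (flags & TOY_GFN_F_RESET) != 0;
}

/* Reset side: the release store keeps the earlier reads before the invalidation. */
static void toy_gfn_set_invalid(struct toy_dirty_gfn *gfn)
{
    atomic_store_explicit(&gfn->flags, 0, memory_order_release);
}

/* Consume one entry if userspace has harvested it. */
static bool toy_ring_reset_one(struct toy_dirty_gfn *gfn,
                               uint32_t *slot, uint64_t *offset)
{
    if (!toy_gfn_harvested(gfn))
        return false;

    *slot = gfn->slot;
    *offset = gfn->offset;
    toy_gfn_set_invalid(gfn);
    return true;
}
```

On a TSO machine such as x86 these orderings cost essentially nothing extra; on arm64 they are what prevents the reads of the entry from being reordered around the flag accesses.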
2022-09-26KVM: remove KVM_REQ_UNHALTPaolo Bonzini1-3/+1
KVM_REQ_UNHALT is now unnecessary because it is replaced by the return value of kvm_vcpu_block/kvm_vcpu_halt. Remove it. No functional change intended. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Acked-by: Marc Zyngier <maz@kernel.org> Message-Id: <20220921003201.1441511-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-09-26KVM: fix memoryleak in kvm_init()Miaohe Lin1-3/+2
When alloc_cpumask_var_node() fails for a certain cpu, there might be some cpumasks already allocated for the percpu cpu_kick_mask. We should free these cpumasks or a memory leak will occur. Fixes: baff59ccdc65 ("KVM: Pre-allocate cpumasks for kvm_make_all_cpus_request_except()") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Link: https://lore.kernel.org/r/20220823063414.59778-1-linmiaohe@huawei.com Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
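The general shape of this kind of fix, freeing everything allocated so far when a per-CPU allocation fails partway through, might look like the sketch below. The array, the failure-injection knob `toy_fail_at`, and all `toy_*` helpers are hypothetical and exist only to make the unwind path testable in userspace.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define TOY_NR_CPUS 4

static void *toy_masks[TOY_NR_CPUS];
static int toy_fail_at = -1; /* hypothetical knob: CPU whose allocation fails */

static void *toy_alloc_mask(int cpu)
{
    if (cpu == toy_fail_at)
        return NULL;
    return calloc(1, 16);
}

/* Free every mask; free(NULL) is a no-op, so unallocated slots are fine. */
static void toy_free_masks(void)
{
    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
        free(toy_masks[cpu]);
        toy_masks[cpu] = NULL;
    }
}

/* Allocate one mask per CPU; on failure, release the ones already allocated. */
static int toy_init(void)
{
    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
        toy_masks[cpu] = toy_alloc_mask(cpu);
        if (!toy_masks[cpu]) {
            toy_free_masks();
            return -ENOMEM;
        }
    }
    return 0;
}
```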
2022-08-19KVM: Drop unnecessary initialization of "ops" in kvm_ioctl_create_device()Li kunyu1-1/+1
The variable is initialized but it is only used after its assignment. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Li kunyu <kunyu@nfschina.com> Message-Id: <20220819021535.483702-1-kunyu@nfschina.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19KVM: Drop unnecessary initialization of "npages" in hva_to_pfn_slow()Li kunyu1-1/+1
The variable is initialized but it is only used after its assignment. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Li kunyu <kunyu@nfschina.com> Message-Id: <20220819022804.483914-1-kunyu@nfschina.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19KVM: Rename mmu_notifier_* to mmu_invalidate_*Chao Peng2-33/+36
The motivation of this renaming is to make these variables and related helper functions less mmu_notifier bound and can also be used for non mmu_notifier based page invalidation. mmu_invalidate_* was chosen to better describe the purpose of 'invalidating' a page that those variables are used for. - mmu_notifier_seq/range_start/range_end are renamed to mmu_invalidate_seq/range_start/range_end. - mmu_notifier_retry{_hva} helper functions are renamed to mmu_invalidate_retry{_hva}. - mmu_notifier_count is renamed to mmu_invalidate_in_progress to avoid confusion with mn_active_invalidate_count. - While here, also update kvm_inc/dec_notifier_count() to kvm_mmu_invalidate_begin/end() to match the change for mmu_notifier_count. No functional change intended. Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Message-Id: <20220816125322.1110439-3-chao.p.peng@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19KVM: Move coalesced MMIO initialization (back) into kvm_create_vm()Sean Christopherson1-5/+6
Invoke kvm_coalesced_mmio_init() from kvm_create_vm() now that allocating and initializing coalesced MMIO objects is separate from registering any associated devices. Moving coalesced MMIO cleans up the last oddity where KVM does VM creation/initialization after kvm_create_vm(), and more importantly after kvm_arch_post_init_vm() is called and the VM is added to the global vm_list, i.e. after the VM is fully created as far as KVM is concerned. Originally, kvm_coalesced_mmio_init() was called by kvm_create_vm(), but the original implementation was completely devoid of error handling. Commit 6ce5a090a9a0 ("KVM: coalesced_mmio: fix kvm_coalesced_mmio_init()'s error handling") fixed the various bugs, and in doing so rightly moved the call to after kvm_create_vm() because kvm_coalesced_mmio_init() also registered the coalesced MMIO device. Commit 2b3c246a682c ("KVM: Make coalesced mmio use a device per zone") cleaned up that mess by having each zone register a separate device, i.e. moved device registration to its logical home in kvm_vm_ioctl_register_coalesced_mmio(). As a result, kvm_coalesced_mmio_init() is now a "pure" initialization helper and can be safely called from kvm_create_vm(). Opportunistically drop the #ifdef, KVM provides stubs for kvm_coalesced_mmio_{init,free}() when CONFIG_KVM_MMIO=n (s390). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220816053937.2477106-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19KVM: Unconditionally get a ref to /dev/kvm module when creating a VMSean Christopherson1-10/+4
Unconditionally get a reference to the /dev/kvm module when creating a VM instead of using try_module_get(), which will fail if the module is in the process of being forcefully unloaded. The error handling when try_module_get() fails doesn't properly unwind all that has been done, e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM from the global list. Not removing VMs from the global list tends to be fatal, e.g. leads to use-after-free explosions. The obvious alternative would be to add proper unwinding, but the justification for using try_module_get(), "rmmod --wait", is completely bogus as support for "rmmod --wait", i.e. delete_module() without O_NONBLOCK, was removed by commit 3f2b9c9cdf38 ("module: remove rmmod --wait option.") nearly a decade ago. It's still possible for try_module_get() to fail due to the module dying (more like being killed), as the module will be tagged MODULE_STATE_GOING by "rmmod --force", i.e. delete_module(..., O_TRUNC), but playing nice with forced unloading is an exercise in futility and gives a false sense of security. Using try_module_get() only prevents acquiring _new_ references, it doesn't magically put the references held by other VMs, and forced unloading doesn't wait, i.e. "rmmod --force" on KVM is all but guaranteed to cause spectacular fireworks; the window where KVM will fail try_module_get() is tiny compared to the window where KVM is building and running the VM with an elevated module refcount. Addressing KVM's inability to play nice with "rmmod --force" is firmly out-of-scope. Forcefully unloading any module taints the kernel (for obvious reasons) _and_ requires the kernel to be built with CONFIG_MODULE_FORCE_UNLOAD=y, which is off by default and comes with the amusing disclaimer that it's "mainly for kernel developers and desperate users". In other words, KVM is free to scoff at bug reports due to using "rmmod --force" while VMs may be running. 
Fixes: 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed") Cc: stable@vger.kernel.org Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220816053937.2477106-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-19KVM: Properly unwind VM creation if creating debugfs failsSean Christopherson1-8/+8
Properly unwind VM creation if kvm_create_vm_debugfs() fails. A recent change to invoke kvm_create_vm_debugfs() in kvm_create_vm() was led astray by buggy try_module_get() handling added by commit 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed"). The debugfs error path effectively inherits the bad error path of try_module_get(), e.g. KVM leaves the to-be-freed VM on vm_list even though KVM appears to do the right thing by calling module_put() and falling through. Opportunistically hoist kvm_create_vm_debugfs() above the call to kvm_arch_post_init_vm() so that the "post-init" arch hook is actually invoked after the VM is initialized (ignoring kvm_coalesced_mmio_init() for the moment). x86 is the only non-nop implementation of the post-init hook, and it doesn't allocate/initialize any objects that are reachable via debugfs code (spawns a kthread worker for the NX huge page mitigation). Leave the buggy try_module_get() alone for now, it will be fixed in a separate commit. Fixes: b74ed7a68ec1 ("KVM: Actually create debugfs in kvm_create_vm()") Reported-by: syzbot+744e173caec2e1627ee0@syzkaller.appspotmail.com Cc: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Message-Id: <20220816053937.2477106-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10KVM: Actually create debugfs in kvm_create_vm()Oliver Upton1-17/+19
Doing debugfs creation after vm creation leaves things in a quasi-initialized state for a while. This is further complicated by the fact that we tear down debugfs from kvm_destroy_vm(). Align debugfs and stats init/destroy with the vm init/destroy pattern to avoid any headaches. Note the fix for a benign mistake in error handling for calls to kvm_arch_create_vm_debugfs() rolled in. Since all implementations of the function return 0 unconditionally it isn't actually a bug at the moment. Lastly, tear down debugfs/stats data in the kvm_create_vm_debugfs() error path. Previously it was safe to assume that kvm_destroy_vm() would take out the garbage, that is no longer the case. Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20220720092259.3491733-6-oliver.upton@linux.dev> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10KVM: Pass the name of the VM fd to kvm_create_vm_debugfs()Oliver Upton1-3/+6
At the time the VM fd is used in kvm_create_vm_debugfs(), the fd has been allocated but not yet installed. It is only really useful as an identifier in strings for the VM (such as debugfs). Treat it exactly as such by passing the string name of the fd to kvm_create_vm_debugfs(), futureproofing against possible misuse of the VM fd. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20220720092259.3491733-5-oliver.upton@linux.dev> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10KVM: Get an fd before creating the VMOliver Upton1-13/+17
Allocate a VM's fd at the very beginning of kvm_dev_ioctl_create_vm() so that KVM can use the fd value to generate strings, e.g. for debugfs, when creating and initializing the VM. Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20220720092259.3491733-4-oliver.upton@linux.dev> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10KVM: Shove vcpu stats_id init into kvm_vcpu_init()Oliver Upton1-4/+4
Initialize stats_id alongside other kvm_vcpu fields to make it more difficult to unintentionally access stats_id before it's set. No functional change intended. Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20220720092259.3491733-3-oliver.upton@linux.dev> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-10KVM: Shove vm stats_id init into kvm_create_vm()Oliver Upton1-3/+3
Initialize stats_id alongside other struct kvm fields to make it more difficult to unintentionally access stats_id before it's set. While at it, move the format string to the first line of the call and fix the indentation of the second line. No functional change intended. Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20220720092259.3491733-2-oliver.upton@linux.dev> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-08-01Merge remote-tracking branch 'kvm/next' into kvm-next-5.20Paolo Bonzini2-138/+312
KVM/s390, KVM/x86 and common infrastructure changes for 5.20 x86: * Permit guests to ignore single-bit ECC errors * Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache * Intel IPI virtualization * Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS * PEBS virtualization * Simplify PMU emulation by just using PERF_TYPE_RAW events * More accurate event reinjection on SVM (avoid retrying instructions) * Allow getting/setting the state of the speaker port data bit * Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent * "Notify" VM exit (detect microarchitectural hangs) for Intel * Cleanups for MCE MSR emulation s390: * add an interface to provide a hypervisor dump for secure guests * improve selftests to use TAP interface * enable interpretive execution of zPCI instructions (for PCI passthrough) * First part of deferred teardown * CPU Topology * PV attestation * Minor fixes Generic: * new selftests API using struct kvm_vcpu instead of a (vm, id) tuple x86: * Use try_cmpxchg64 instead of cmpxchg64 * Bugfixes * Ignore benign host accesses to PMU MSRs when PMU is disabled * Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior * x86/MMU: Allow NX huge pages to be disabled on a per-vm basis * Port eager page splitting to shadow MMU as well * Enable CMCI capability by default and handle injected UCNA errors * Expose pid of vcpu threads in debugfs * x2AVIC support for AMD * cleanup PIO emulation * Fixes for LLDT/LTR emulation * Don't require refcounted "struct page" to create huge SPTEs x86 cleanups: * Use separate namespaces for guest PTEs and shadow PTEs bitmasks * PIO emulation * Reorganize rmap API, mostly around rmap destruction * Do not workaround very old KVM bugs for L0 that runs with nesting enabled * new selftests API for CPUID
2022-07-29KVM: Add gfp_custom flag in struct kvm_mmu_memory_cacheAnup Patel1-1/+3
The kvm_mmu_topup_memory_cache() always uses GFP_KERNEL_ACCOUNT for memory allocation which prevents its use in atomic context. To address this limitation of kvm_mmu_topup_memory_cache(), we add gfp_custom flag in struct kvm_mmu_memory_cache. When the gfp_custom flag is set to some GFP_xyz flags, the kvm_mmu_topup_memory_cache() will use that instead of GFP_KERNEL_ACCOUNT. Signed-off-by: Anup Patel <apatel@ventanamicro.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Anup Patel <anup@brainfault.org>
2022-06-24KVM: debugfs: expose pid of vcpu threadsVineeth Pillai1-2/+13
Add a new debugfs file to expose the pid of each vcpu thread. This is very helpful for userland tools to get the vcpu pids without worrying about thread naming conventions of the VMM. Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org> Message-Id: <20220523190327.2658-1-vineeth@bitbyteword.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-24KVM: Allow for different capacities in kvm_mmu_memory_cache structsDavid Matlack1-3/+30
Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at declaration time rather than being fixed for all declarations. This will be used in a follow-up commit to declare a cache in x86 with a capacity of 512+ objects without having to increase the capacity of all caches in KVM. This change requires each cache now specify its capacity at runtime, since the cache struct itself no longer has a fixed capacity known at compile time. To protect against someone accidentally defining a kvm_mmu_memory_cache struct directly (without the extra storage), this commit includes a WARN_ON() in kvm_mmu_topup_memory_cache(). In order to support different capacities, this commit changes the objects pointer array to be dynamically allocated the first time the cache is topped-up. While here, opportunistically clean up the stack-allocated kvm_mmu_memory_cache structs in riscv and arm64 to use designated initializers. No functional change intended. Reviewed-by: Marc Zyngier <maz@kernel.org> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20220516232138.1783324-22-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
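A rough userspace model of a cache whose capacity is chosen by the caller and whose pointer array is allocated on the first top-up is sketched below. Everything here is an assumption for illustration: the `toy_*` names, the use of calloc() in place of kernel allocators, and the `-EIO` return for a capacity mismatch are not taken from the commit.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-use object cache. */
struct toy_memory_cache {
    int nobjs;      /* objects currently held */
    int capacity;   /* size of the objects array, fixed at first top-up */
    void **objects; /* allocated lazily on the first top-up */
};

/* Ensure at least @min objects are cached, filling up to @capacity. */
static int toy_topup(struct toy_memory_cache *mc, int capacity, int min,
                     size_t obj_size)
{
    if (mc->nobjs >= min)
        return 0;

    if (!mc->objects) {
        if (capacity <= 0 || min > capacity)
            return -EIO;
        mc->objects = calloc((size_t)capacity, sizeof(void *));
        if (!mc->objects)
            return -ENOMEM;
        mc->capacity = capacity;
    }

    /* The capacity must not change once the array exists. */
    if (mc->capacity != capacity)
        return -EIO;

    while (mc->nobjs < mc->capacity) {
        void *obj = calloc(1, obj_size);

        if (!obj)
            return mc->nobjs >= min ? 0 : -ENOMEM;
        mc->objects[mc->nobjs++] = obj;
    }
    return 0;
}

/* Pop one object, or NULL if the cache is empty. */
static void *toy_alloc(struct toy_memory_cache *mc)
{
    return mc->nobjs ? mc->objects[--mc->nobjs] : NULL;
}

/* Release all cached objects and the pointer array itself. */
static void toy_free_cache(struct toy_memory_cache *mc)
{
    while (mc->nobjs)
        free(mc->objects[--mc->nobjs]);
    free(mc->objects);
    mc->objects = NULL;
    mc->capacity = 0;
}
```

Allocating the array lazily means a zero-initialized struct is a valid empty cache, which keeps declarations cheap for callers that never top it up.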
2022-06-20KVM: Do not zero initialize 'pfn' in hva_to_pfn()Sean Christopherson1-1/+1
Drop the unnecessary initialization of the local 'pfn' variable in hva_to_pfn(). First and foremost, '0' is not an invalid pfn, it's a perfectly valid pfn on most architectures. I.e. if hva_to_pfn() were to return an "uninitialized" pfn, it would actually be interpreted as a legal pfn by most callers. Second, hva_to_pfn() can't return an uninitialized pfn as hva_to_pfn() explicitly sets pfn to an error value (or returns an error value directly) if a helper returns failure, and all helpers set the pfn on success. The zeroing of 'pfn' was introduced by commit 2fc843117d64 ("KVM: reorganize hva_to_pfn"), probably to avoid "uninitialized variable" warnings on statements that return pfn. However, no compiler seems to produce them, making the initialization unnecessary. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220429010416.2788472-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20KVM: Rename/refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page()Sean Christopherson1-14/+52
Rename and refactor kvm_is_reserved_pfn() to kvm_pfn_to_refcounted_page() to better reflect what KVM is actually checking, and to eliminate extra pfn_to_page() lookups. The kvm_release_pfn_*() and kvm_try_get_pfn() helpers in particular benefit from "refcounted" nomenclature, as it's not all that obvious why KVM needs to get/put refcounts for some PG_reserved pages (ZERO_PAGE and ZONE_DEVICE). Add a comment to call out that the list of exceptions to PG_reserved is all but guaranteed to be incomplete. The list has mostly been compiled by people throwing noodles at KVM and finding out they stick a little too well, e.g. the ZERO_PAGE's refcount overflowed and ZONE_DEVICE pages didn't get freed. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220429010416.2788472-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20KVM: Take a 'struct page', not a pfn in kvm_is_zone_device_page()Sean Christopherson1-4/+4
Operate on a 'struct page' instead of a pfn when checking if a page is a ZONE_DEVICE page, and rename the helper accordingly. Generally speaking, KVM doesn't actually care about ZONE_DEVICE memory, i.e. shouldn't do anything special for ZONE_DEVICE memory. Rather, KVM wants to treat ZONE_DEVICE memory like regular memory, and the need to identify ZONE_DEVICE memory only arises as an exception to PG_reserved pages. In other words, KVM should only ever check for ZONE_DEVICE memory after KVM has already verified that there is a struct page associated with the pfn. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220429010416.2788472-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20KVM: Remove kvm_vcpu_gfn_to_page() and kvm_vcpu_gpa_to_page()Sean Christopherson1-20/+11
Drop helpers to convert a gfn/gpa to a 'struct page' in the context of a vCPU. KVM doesn't require that guests be backed by 'struct page' memory, thus any use of helpers that assume 'struct page' is bound to be flawed, as was the case for the recently removed last user in x86's nested VMX. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220429010416.2788472-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20KVM: Don't WARN if kvm_pfn_to_page() encounters a "reserved" pfnSean Christopherson1-3/+1
Drop a WARN_ON() if kvm_pfn_to_page() encounters a "reserved" pfn, which in this context means a struct page that has PG_reserved but is not a/the ZERO_PAGE and is not a ZONE_DEVICE page. The usage, via gfn_to_page(), in x86 is safe as gfn_to_page() is used only to retrieve a page from a KVM-controlled memslot, but the usage in PPC and s390 operates on arbitrary gfns and thus memslots that can be backed by incompatible memory. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220429010416.2788472-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-20KVM: Avoid pfn_to_page() and vice versa when releasing pagesSean Christopherson1-21/+43
Invert the order of KVM's page/pfn release helpers so that the "inner" helper operates on a page instead of a pfn. As pointed out by Linus[*], converting between struct page and a pfn isn't necessarily cheap, and that's not even counting the overhead of is_error_noslot_pfn() and kvm_is_reserved_pfn(). Even if the checks were dirt cheap, there's no reason to convert from a page to a pfn and back to a page, just to mark the page dirty/accessed or to put a reference to the page. Opportunistically drop a stale declaration of kvm_set_page_accessed() from kvm_host.h (there was no implementation). No functional change intended. [*] https://lore.kernel.org/all/CAHk-=wifQimj2d6npq-wCi5onYPjzQg4vyO4tFcPJJZr268cRw@mail.gmail.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220429010416.2788472-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
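The inversion described above, where the page-based helper becomes the "inner" one and the pfn-based helper converts exactly once, can be sketched as follows. This is a toy userspace model under stated assumptions: struct toy_page, toy_memmap, and the toy_*() helpers are hypothetical, and the lookup counter exists only to make the saved conversions visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of a struct page with the state we care about. */
struct toy_page {
	bool dirty;
	int refcount;
};

static struct toy_page toy_memmap[8];
static int toy_lookups; /* counts pfn -> page conversions */

/* Stand-in for a pfn -> page lookup, which is not necessarily cheap. */
static struct toy_page *toy_pfn_to_page(unsigned long pfn)
{
	toy_lookups++;
	if (pfn >= sizeof(toy_memmap) / sizeof(toy_memmap[0]))
		return NULL;
	return &toy_memmap[pfn];
}

/* Inner helpers operate directly on the page; no conversion is needed. */
static void toy_set_page_dirty(struct toy_page *page)
{
	page->dirty = true;
}

static void toy_release_page_dirty(struct toy_page *page)
{
	toy_set_page_dirty(page);
	page->refcount--; /* drop the reference held by the caller */
}

/* Outer helper converts the pfn once, then defers to the page helper. */
static void toy_release_pfn_dirty(unsigned long pfn)
{
	struct toy_page *page = toy_pfn_to_page(pfn);

	if (page)
		toy_release_page_dirty(page);
}
```

Callers that already hold a page call `toy_release_page_dirty()` directly and never pay for a lookup; callers that only have a pfn pay for exactly one.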
2022-06-20KVM: Don't set Accessed/Dirty bits for ZERO_PAGESean Christopherson1-2/+14
Don't set Accessed/Dirty bits for a struct page with PG_reserved set, i.e. don't set A/D bits for the ZERO_PAGE. The ZERO_PAGE (or pages depending on the architecture) should obviously never be written, and similarly there's no point in marking it accessed as the page will never be swapped out or reclaimed. The comment in page-flags.h is quite clear that PG_reserved pages should be managed only by their owner, and strictly following that mandate also simplifies KVM's logic. Fixes: 7df003c85218 ("KVM: fix overflow of zero page refcount with ksm running") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220429010416.2788472-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>