kernel/git/paulus/powerpc.git - PowerPC development tree

Age	Commit message (Collapse)	Author	Files	Lines
2021-02-11	KVM: PPC: Book3S HV: Fix host radix SLB optimisation with hash guestskvm-ppc-next-5.12-1 kvm-ppc-next	Nicholas Piggin	1	-0/+3
	Commit 68ad28a4cdd4 ("KVM: PPC: Book3S HV: Fix radix guest SLB side channel") incorrectly removed the radix host instruction patch to skip re-loading the host SLB entries when exiting from a hash guest. Restore it. Fixes: 68ad28a4cdd4 ("KVM: PPC: Book3S HV: Fix radix guest SLB side channel") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2021-02-11	KVM: PPC: Book3S HV: Ensure radix guest has no SLB entries	Paul Mackerras	1	-1/+5
	Commit 68ad28a4cdd4 ("KVM: PPC: Book3S HV: Fix radix guest SLB side channel") changed the older guest entry path, with the side effect that vcpu->arch.slb_max no longer gets cleared for a radix guest. This means that a HPT guest which loads some SLB entries, switches to radix mode, runs the guest using the old guest entry path (e.g., because the indep_threads_mode module parameter has been set to false), and then switches back to HPT mode would now see the old SLB entries being present, whereas previously it would have seen no SLB entries. To avoid changing guest-visible behaviour, this adds a store instruction to clear vcpu->arch.slb_max for a radix guest using the old guest entry path. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2021-02-10	KVM: PPC: Don't always report hash MMU capability for P9 < DD2.2	Fabiano Rosas	3	-2/+13
	These machines don't support running both MMU types at the same time, so remove the KVM_CAP_PPC_MMU_HASH_V3 capability when the host is using Radix MMU. [paulus@ozlabs.org - added defensive check on kvmppc_hv_ops->hash_v3_possible] Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2021-02-10	KVM: PPC: Book3S HV: Save and restore FSCR in the P9 path	Fabiano Rosas	1	-0/+4
	The Facility Status and Control Register is a privileged SPR that defines the availability of some features in problem state. Since it can be written by the guest, we must restore it to the previous host value after guest exit. This restoration is currently done by taking the value from current->thread.fscr, which in the P9 path is not enough anymore because the guest could context switch the QEMU thread, causing the guest-current value to be saved into the thread struct. The above situation manifested when running a QEMU linked against a libc with System Call Vectored support, which causes scv instructions to be run by QEMU early during the guest boot (during SLOF), at which point the FSCR is 0 due to guest entry. After a few scv calls (1 to a couple hundred), the context switching happens and the QEMU thread runs with the guest value, resulting in a Facility Unavailable interrupt. This patch saves and restores the host value of FSCR in the inner guest entry loop in a way independent of current->thread.fscr. The old way of doing it is still kept in place because it works for the old entry path. Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2021-02-10	KVM: PPC: remove unneeded semicolon	Yang Li	1	-1/+1
	Eliminate the following coccicheck warning: ./arch/powerpc/kvm/booke.c:701:2-3: Unneeded semicolon Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2021-02-10	KVM: PPC: Book3S HV: Use POWER9 SLBIA IH=6 variant to clear SLB	Nicholas Piggin	1	-3/+3
	IH=6 may preserve hypervisor real-mode ERAT entries and is the recommended SLBIA hint for switching partitions. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2021-02-10	KVM: PPC: Book3S HV: No need to clear radix host SLB before loading HPT guest	Nicholas Piggin	1	-1/+5
	Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2021-02-10	KVM: PPC: Book3S HV: Fix radix guest SLB side channel	Nicholas Piggin	1	-8/+31
	The slbmte instruction is legal in radix mode, including radix guest mode. This means radix guests can load the SLB with arbitrary data. KVM host does not clear the SLB when exiting a guest if it was a radix guest, which would allow a rogue radix guest to use the SLB as a side channel to communicate with other guests. Fix this by ensuring the SLB is cleared when coming out of a radix guest. Only the first 4 entries are a concern, because radix guests always run with LPCR[UPRT]=1, which limits the reach of slbmte. slbia is not used (except in a non-performance-critical path) because it can clear cached translations. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2021-02-10	KVM: PPC: Book3S HV: Remove support for running HPT guest on RPT host ↵	Nicholas Piggin	5	-226/+32
	without mixed mode support This reverts much of commit c01015091a770 ("KVM: PPC: Book3S HV: Run HPT guests on POWER9 radix hosts"), which was required to run HPT guests on RPT hosts on early POWER9 CPUs without support for "mixed mode", which meant the host could not run with MMU on while guests were running. This code has some corner case bugs, e.g., when the guest hits a machine check or HMI the primary locks up waiting for secondaries to switch LPCR to host, which they never do. This could all be fixed in software, but most CPUs in production have mixed mode support, and those that don't are believed to be all in installations that don't use this capability. So simplify things and remove support. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Tested-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2021-02-10	KVM: PPC: Book3S HV: Introduce new capability for 2nd DAWR	Ravi Bangoria	6	-0/+35
	Introduce KVM_CAP_PPC_DAWR1 which can be used by QEMU to query whether KVM supports 2nd DAWR or not. The capability is by default disabled even when the underlying CPU supports 2nd DAWR. QEMU needs to check and enable it manually to use the feature. Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2021-02-10	KVM: PPC: Book3S HV: Add infrastructure to support 2nd DAWR	Ravi Bangoria	9	-1/+91
	KVM code assumes single DAWR everywhere. Add code to support 2nd DAWR. DAWR is a hypervisor resource and thus H_SET_MODE hcall is used to set/ unset it. Introduce new case H_SET_MODE_RESOURCE_SET_DAWR1 for 2nd DAWR. Also, KVM will support 2nd DAWR only if CPU_FTR_DAWR1 is set. Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2021-02-10	KVM: PPC: Book3S HV: Rename current DAWR macros and variables	Ravi Bangoria	5	-30/+30
	Power10 is introducing a second DAWR (Data Address Watchpoint Register). Use real register names (with suffix 0) from ISA for current macros and variables used by kvm. One exception is KVM_REG_PPC_DAWR. Keep it as it is because it's uapi so changing it will break userspace. Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2021-02-10	KVM: PPC: Book3S HV: Allow nested guest creation when L0 hv_guest_state > L1	Ravi Bangoria	2	-12/+60
	On powerpc, L1 hypervisor takes help of L0 using H_ENTER_NESTED hcall to load L2 guest state in cpu. L1 hypervisor prepares the L2 state in struct hv_guest_state and passes a pointer to it via hcall. Using that pointer, L0 reads/writes that state directly from/to L1 memory. Thus L0 must be aware of hv_guest_state layout of L1. Currently it uses version field to achieve this. i.e. If L0 hv_guest_state.version != L1 hv_guest_state.version, L0 won't allow nested kvm guest. This restriction can be loosened up a bit. L0 can be taught to understand older layout of hv_guest_state, if we restrict the new members to be added only at the end, i.e. we can allow nested guest even when L0 hv_guest_state.version > L1 hv_guest_state.version. Though, the other way around is not possible. Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2021-02-09	Documentation: kvm: fix warning	Paolo Bonzini	1	-1/+1
	Documentation/virt/kvm/api.rst:4927: WARNING: Title underline too short. 4.130 KVM_XEN_VCPU_GET_ATTR -------------------------- Fixes: e1f68169a4f8 ("KVM: Add documentation for Xen hypercall and shared_info updates") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86/xen: Allow reset of Xen attributes	David Woodhouse	1	-10/+28
	In order to support Xen SHUTDOWN_soft_reset (for guest kexec, etc.) the VMM needs to be able to tear everything down and return the Xen features to a clean slate. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20210208232326.1830370-1-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86/mmu: Make HVA handler retpoline-friendly	Maciej S. Szmigiero	2	-15/+22
	When retpolines are enabled they have high overhead in the inner loop inside kvm_handle_hva_range() that iterates over the provided memory area. Let's mark this function and its TDP MMU equivalent __always_inline so compiler will be able to change the call to the actual handler function inside each of them into a direct one. This significantly improves performance on the unmap test on the existing kernel memslot code (tested on a Xeon 8167M machine): 30 slots in use: Test Before After Improvement Unmap 0.0353s 0.0334s 5% Unmap 2M 0.00104s 0.000407s 61% 509 slots in use: Test Before After Improvement Unmap 0.0742s 0.0740s None Unmap 2M 0.00221s 0.00159s 28% Looks like having an indirect call in these functions (and, so, a retpoline) might have interfered with unrolling of the whole loop in the CPU. Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <732d3fe9eb68aa08402a638ab0309199fa89ae56.1612810129.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86: hyper-v: Drop hv_vcpu_to_vcpu() helper	Vitaly Kuznetsov	1	-7/+4
	hv_vcpu_to_vcpu() helper is only used by other helpers and is not very complex, we can drop it without much regret. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-16-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86: hyper-v: Allocate Hyper-V context lazily	Vitaly Kuznetsov	3	-18/+26
	Hyper-V context is only needed for guests which use Hyper-V emulation in KVM (e.g. Windows/Hyper-V guests) so we don't actually need to allocate it in kvm_arch_vcpu_create(), we can postpone the action until Hyper-V specific MSRs are accessed or SynIC is enabled. Once allocated, let's keep the context alive for the lifetime of the vCPU as an attempt to free it would require additional synchronization with other vCPUs and normally it is not supposed to happen. Note, Hyper-V style hypercall enablement is done by writing to HV_X64_MSR_GUEST_OS_ID so we don't need to worry about allocating Hyper-V context from kvm_hv_hypercall(). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-15-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86: hyper-v: Make Hyper-V emulation enablement conditional	Vitaly Kuznetsov	10	-46/+69
	Hyper-V emulation is enabled in KVM unconditionally. This is bad at least from security standpoint as it is an extra attack surface. Ideally, there should be a per-VM capability explicitly enabled by VMM but currently it is not the case and we can't mandate one without breaking backwards compatibility. We can, however, check guest visible CPUIDs and only enable Hyper-V emulation when "Hv#1" interface was exposed in HYPERV_CPUID_INTERFACE. Note, VMMs are free to act in any sequence they like, e.g. they can try to set MSRs first and CPUIDs later so we still need to allow the host to read/write Hyper-V specific MSRs unconditionally. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-14-vkuznets@redhat.com> [Add selftest vcpu_set_hv_cpuid API to avoid breaking xen_vmcall_test. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86: hyper-v: Allocate 'struct kvm_vcpu_hv' dynamically	Vitaly Kuznetsov	4	-12/+27
	Hyper-V context is only needed for guests which use Hyper-V emulation in KVM (e.g. Windows/Hyper-V guests). 'struct kvm_vcpu_hv' is, however, quite big, it accounts for more than 1/4 of the total 'struct kvm_vcpu_arch' which is also quite big already. This all looks like a waste. Allocate 'struct kvm_vcpu_hv' dynamically. This patch does not bring any (intentional) functional change as we still allocate the context unconditionally but it paves the way to doing that only when needed. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-13-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86: hyper-v: Prepare to meet unallocated Hyper-V context	Vitaly Kuznetsov	5	-14/+29
	Currently, Hyper-V context is part of 'struct kvm_vcpu_arch' and is always available. As a preparation to allocating it dynamically, check that it is not NULL at call sites which can normally proceed without it i.e. the behavior is identical to the situation when Hyper-V emulation is not being used by the guest. When Hyper-V context for a particular vCPU is not allocated, we may still need to get 'vp_index' from there. E.g. in a hypothetical situation when Hyper-V emulation was enabled on one CPU and wasn't on another, Hyper-V style send-IPI hypercall may still be used. Luckily, vp_index is always initialized to kvm_vcpu_get_idx() and can only be changed when Hyper-V context is present. Introduce kvm_hv_get_vpindex() helper for simplification. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-12-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86: hyper-v: Always use to_hv_vcpu() accessor to get to 'struct ↵	Vitaly Kuznetsov	5	-11/+21
	kvm_vcpu_hv' As a preparation to allocating Hyper-V context dynamically, make it clear who's the user of the said context. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-11-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86: hyper-v: Stop shadowing global 'current_vcpu' variable	Vitaly Kuznetsov	1	-6/+5
	'current_vcpu' variable in KVM is a per-cpu pointer to the currently scheduled vcpu. kvm_hv_flush_tlb()/kvm_hv_send_ipi() functions used to have local 'vcpu' variable to iterate over vCPUs but it's gone now and there's no need to use anything but the standard 'vcpu' as an argument. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-10-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86: hyper-v: Introduce to_kvm_hv() helper	Vitaly Kuznetsov	4	-53/+64
	Spelling '&kvm->arch.hyperv' correctly is hard. Also, this makes the code more consistent with vmx/svm where to_kvm_vmx()/to_kvm_svm() are already being used. Opportunistically change kvm_hv_msr_{get,set}_crash_{data,ctl}() and kvm_hv_msr_set_crash_data() to take 'kvm' instead of 'vcpu' as these MSRs are partition wide. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-9-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86: hyper-v: Rename vcpu_to_hv_syndbg() to to_hv_syndbg()	Vitaly Kuznetsov	2	-5/+5
	vcpu_to_hv_syndbg()'s argument is always 'vcpu' so there's no need to have an additional prefix. Also, this makes the code more consistent with vmx/svm where to_vmx()/to_svm() are being used. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-8-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86: hyper-v: Rename vcpu_to_stimer()/stimer_to_vcpu()	Vitaly Kuznetsov	2	-21/+21
	vcpu_to_stimers()'s argument is almost always 'vcpu' so there's no need to have an additional prefix. Also, this makes the naming more consistent with to_hv_vcpu()/to_hv_synic(). Rename stimer_to_vcpu() to hv_stimer_to_vcpu() for consitency. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-7-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86: hyper-v: Rename vcpu_to_synic()/synic_to_vcpu()	Vitaly Kuznetsov	4	-20/+20
	vcpu_to_synic()'s argument is almost always 'vcpu' so there's no need to have an additional prefix. Also, as this is used outside of hyper-v emulation code, add '_hv_' part to make it clear what this s. This makes the naming more consistent with to_hv_vcpu(). Rename synic_to_vcpu() to hv_synic_to_vcpu() for consistency. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-6-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86: hyper-v: Rename vcpu_to_hv_vcpu() to to_hv_vcpu()	Vitaly Kuznetsov	2	-14/+14
	vcpu_to_hv_vcpu()'s argument is almost always 'vcpu' so there's no need to have an additional prefix. Also, this makes the code more consistent with vmx/svm where to_vmx()/to_svm() are being used. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86: hyper-v: Drop unused kvm_hv_vapic_assist_page_enabled()	Vitaly Kuznetsov	1	-5/+0
	kvm_hv_vapic_assist_page_enabled() seems to be unused since its introduction in commit 10388a07164c1 ("KVM: Add HYPER-V apic access MSRs"), drop it. Reported-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	selftests: kvm: Properly set Hyper-V CPUIDs in evmcs_test	Vitaly Kuznetsov	1	-1/+38
	Generally, when Hyper-V emulation is enabled, VMM is supposed to set Hyper-V CPUID identifications so the guest knows that Hyper-V features are available. evmcs_test doesn't currently do that but so far Hyper-V emulation in KVM was enabled unconditionally. As we are about to change that, proper Hyper-V CPUID identification should be set in selftests as well. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	selftests: kvm: Move kvm_get_supported_hv_cpuid() to common code	Vitaly Kuznetsov	3	-28/+39
	kvm_get_supported_hv_cpuid() may come handy in all Hyper-V related tests. Split it off hyperv_cpuid test, create system-wide and vcpu versions. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210126134816.1880136-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: Raise the maximum number of user memslots	Vitaly Kuznetsov	6	-9/+2
	Current KVM_USER_MEM_SLOTS limits are arch specific (512 on Power, 509 on x86, 32 on s390, 16 on MIPS) but they don't really need to be. Memory slots are allocated dynamically in KVM when added so the only real limitation is 'id_to_index' array which is 'short'. We don't have any other KVM_MEM_SLOTS_NUM/KVM_USER_MEM_SLOTS-sized statically defined structures. Low KVM_USER_MEM_SLOTS can be a limiting factor for some configurations. In particular, when QEMU tries to start a Windows guest with Hyper-V SynIC enabled and e.g. 256 vCPUs the limit is hit as SynIC requires two pages per vCPU and the guest is free to pick any GFN for each of them, this fragments memslots as QEMU wants to have a separate memslot for each of these pages (which are supposed to act as 'overlay' pages). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210127175731.2020089-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	selftests: kvm: Raise the default timeout to 120 seconds	Vitaly Kuznetsov	1	-0/+1
	With the updated maximum number of user memslots (32) set_memory_region_test sometimes takes longer than the default 45 seconds to finish. Raise the value to an arbitrary 120 seconds. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210127175731.2020089-6-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86: move kvm_inject_gp up from kvm_set_dr to callers	Paolo Bonzini	3	-29/+20
	Push the injection of #GP up to the callers, so that they can just use kvm_complete_insn_gp. __kvm_set_dr is pretty much what the callers can use together with kvm_complete_insn_gp, so rename it to kvm_set_dr and drop the old kvm_set_dr wrapper. This also allows nested VMX code, which really wanted to use __kvm_set_dr, to use the right function. While at it, remove the kvm_require_dr() check from the SVM interception. The APM states: All normal exception checks take precedence over the SVM intercepts. which includes the CR4.DE=1 #UD. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86: reading DR cannot fail	Paolo Bonzini	4	-9/+7
	kvm_get_dr and emulator_get_dr except an in-range value for the register number so they cannot fail. Change the return type to void. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: SVM: Remove an unnecessary forward declaration	Sean Christopherson	1	-2/+0
	Drop a defunct forward declaration of svm_complete_interrupts(). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210205005750.3841462-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: SVM: Move AVIC vCPU kicking snippet to helper function	Sean Christopherson	1	-16/+19
	Add a helper function to handle kicking non-running vCPUs when sending virtual IPIs. A future patch will change SVM's interception functions to take @vcpu instead of @svm, at which piont declaring and modifying 'vcpu' in a case statement is confusing, and potentially dangerous. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210205005750.3841462-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86: Restore all 64 bits of DR6 and DR7 during RSM on x86-64	Sean Christopherson	1	-2/+2
	Restore the full 64-bit values of DR6 and DR7 when emulating RSM on x86-64, as defined by both Intel's SDM and AMD's APM. Note, bits 63:32 of DR6 and DR7 are reserved, so this is a glorified nop unless the SMM handler is poking into SMRAM, which it most definitely shouldn't be doing since both Intel and AMD list the DR6 and DR7 fields as read-only. Fixes: 660a5d517aaa ("KVM: x86: save/load state on SMM switch") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210205012458.3872687-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86: Remove misleading DR6/DR7 adjustments from RSM emulation	Sean Christopherson	1	-4/+4
	Drop the DR6/7 volatile+fixed bits adjustments in RSM emulation, which are redundant and misleading. The necessary adjustments are made by kvm_set_dr(), which properly sets the fixed bits that are conditional on the vCPU model. Note, KVM incorrectly reads only bits 31:0 of the DR6/7 fields when emulating RSM on x86-64. On the plus side for this change, that bug makes removing "& DRx_VOLATILE" a nop. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210205012458.3872687-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86/xen: Use hva_t for holding hypercall page address	Sean Christopherson	1	-2/+6
	Use hva_t, a.k.a. unsigned long, for the local variable that holds the hypercall page address. On 32-bit KVM, gcc complains about using a u64 due to the implicit cast from a 64-bit value to a 32-bit pointer. arch/x86/kvm/xen.c: In function ‘kvm_xen_write_hypercall_page’: arch/x86/kvm/xen.c:300:22: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast] 300 \| page = memdup_user((u8 __user *)blob_addr, PAGE_SIZE); Cc: Joao Martins <joao.m.martins@oracle.com> Cc: David Woodhouse <dwmw@amazon.co.uk> Fixes: 23200b7a30de ("KVM: x86/xen: intercept xen hypercalls if enabled") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210208201502.1239867-1-seanjc@google.com> Acked-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: x86/xen: Remove extra unlock in kvm_xen_hvm_set_attr()	David Woodhouse	1	-2/+0
	This accidentally ended up locking and then immediately unlocking kvm->lock at the beginning of the function. Fix it. Fixes: a76b9641ad1c ("KVM: x86/xen: add KVM_XEN_HVM_SET_ATTR/KVM_XEN_HVM_GET_ATTR") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20210208232326.1830370-2-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	KVM: Use kvm_pfn_t for local PFN variable in hva_to_pfn_remapped()	Sean Christopherson	1	-1/+1
	Use kvm_pfn_t, a.k.a. u64, for the local 'pfn' variable when retrieving a so called "remapped" hva/pfn pair. In theory, the hva could resolve to a pfn in high memory on a 32-bit kernel. This bug was inadvertantly exposed by commit bd2fae8da794 ("KVM: do not assume PTE is writable after follow_pfn"), which added an error PFN value to the mix, causing gcc to comlain about overflowing the unsigned long. arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function ‘hva_to_pfn_remapped’: include/linux/kvm_host.h:89:30: error: conversion from ‘long long unsigned int’ to ‘long unsigned int’ changes value from ‘9218868437227405314’ to ‘2’ [-Werror=overflow] 89 \| #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2) \| ^ virt/kvm/kvm_main.c:1935:9: note: in expansion of macro ‘KVM_PFN_ERR_RO_FAULT’ Cc: stable@vger.kernel.org Fixes: add6a0cd1c5b ("KVM: MMU: try to fix up page faults before giving up") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210208201940.1258328-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-09	mm: provide a saner PTE walking API for modules	Paolo Bonzini	5	-13/+47
	Currently, the follow_pfn function is exported for modules but follow_pte is not. However, follow_pfn is very easy to misuse, because it does not provide protections (so most of its callers assume the page is writable!) and because it returns after having already unlocked the page table lock. Provide instead a simplified version of follow_pte that does not have the pmdpp and range arguments. The older version survives as follow_invalidate_pte() for use by fs/dax.c. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-08	KVM: x86: compile out TDP MMU on 32-bit systems	Paolo Bonzini	6	-51/+53
	The TDP MMU assumes that it can do atomic accesses to 64-bit PTEs. Rather than just disabling it, compile it out completely so that it is possible to use for example 64-bit xchg. To limit the number of stubs, wrap all accesses to tdp_mmu_enabled or tdp_mmu_page with a function. Calls to all other functions in tdp_mmu.c are eliminated and do not even reach the linker. Reviewed-by: Sean Christopherson <seanjc@google.com> Tested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-08	i915: kvmgt: the KVM mmu_lock is now an rwlock	Paolo Bonzini	1	-6/+6
	Adjust the KVMGT page tracking callbacks. Cc: Zhenyu Wang <zhenyuw@linux.intel.com> Cc: Zhi Wang <zhi.a.wang@intel.com> Cc: intel-gvt-dev@lists.freedesktop.org Cc: intel-gfx@lists.freedesktop.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86: Add helper to consolidate "raw" reserved GPA mask calculations	Sean Christopherson	4	-9/+20
	Add a helper to generate the mask of reserved GPA bits _without_ any adjustments for repurposed bits, and use it to replace a variety of open coded variants in the MTRR and APIC_BASE flows. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Add helper to generate mask of reserved HPA bits	Sean Christopherson	1	-5/+9
	Add a helper to generate the mask of reserved PA bits in the host. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86: Use reserved_gpa_bits to calculate reserved PxE bits	Sean Christopherson	3	-59/+58
	Use reserved_gpa_bits, which accounts for exceptions to the maxphyaddr rule, e.g. SEV's C-bit, for the page {table,directory,etc...} entry (PxE) reserved bits checks. For SEV, the C-bit is ignored by hardware when walking pages tables, e.g. the APM states: Note that while the guest may choose to set the C-bit explicitly on instruction pages and page table addresses, the value of this bit is a don't-care in such situations as hardware always performs these as private accesses. Such behavior is expected to hold true for other features that repurpose GPA bits, e.g. KVM could theoretically emulate SME or MKTME, which both allow non-zero repurposed bits in the page tables. Conceptually, KVM should apply reserved GPA checks universally, and any features that do not adhere to the basic rule should be explicitly handled, i.e. if a GPA bit is repurposed but not allowed in page tables for whatever reason. Refactor __reset_rsvds_bits_mask() to take the pre-generated reserved bits mask, and opportunistically clean up its code, e.g. to align lines and comments. Practically speaking, this is change is a likely a glorified nop given the current KVM code base. SEV's C-bit is the only repurposed GPA bit, and KVM doesn't support shadowing encrypted page tables (which is theoretically possible via SEV debug APIs). Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86: SEV: Treat C-bit as legal GPA bit regardless of vCPU mode	Sean Christopherson	6	-9/+8
	Rename cr3_lm_rsvd_bits to reserved_gpa_bits, and use it for all GPA legality checks. AMD's APM states: If the C-bit is an address bit, this bit is masked from the guest physical address when it is translated through the nested page tables. Thus, any access that can conceivably be run through NPT should ignore the C-bit when checking for validity. For features that KVM emulates in software, e.g. MTRRs, there is no clear direction in the APM for how the C-bit should be handled. For such cases, follow the SME behavior inasmuch as possible, since SEV is is essentially a VM-specific variant of SME. For SME, the APM states: In this case the upper physical address bits are treated as reserved when the feature is enabled except where otherwise indicated. Collecting the various relavant SME snippets in the APM and cross- referencing the omissions with Linux kernel code, this leaves MTTRs and APIC_BASE as the only flows that KVM emulates that should _not_ ignore the C-bit. Note, this means the reserved bit checks in the page tables are technically broken. This will be remedied in a future patch. Although the page table checks are technically broken, in practice, it's all but guaranteed to be irrelevant. NPT is required for SEV, i.e. shadowing page tables isn't needed in the common case. Theoretically, the checks could be in play for nested NPT, but it's extremely unlikely that anyone is running nested VMs on SEV, as doing so would require L1 to expose sensitive data to L0, e.g. the entire VMCB. And if anyone is running nested VMs, L0 can't read the guest's encrypted memory, i.e. L1 would need to put its NPT in shared memory, in which case the C-bit will never be set. Or, L1 could use shadow paging, but again, if L0 needs to read page tables, e.g. to load PDPTRs, the memory can't be encrypted if L1 has any expectation of L0 doing the right thing. Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: nSVM: Use common GPA helper to check for illegal CR3	Sean Christopherson	1	-1/+1
	Replace an open coded check for an invalid CR3 with its equivalent helper. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: VMX: Use GPA legality helpers to replace open coded equivalents	Sean Christopherson	2	-20/+8
	Replace a variety of open coded GPA checks with the recently introduced common helpers. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86: Add a helper to handle legal GPA with an alignment requirement	Sean Christopherson	1	-1/+7
	Add a helper to genericize checking for a legal GPA that also must conform to an arbitrary alignment, and use it in the existing page_address_valid(). Future patches will replace open coded variants in VMX and SVM. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86: Add a helper to check for a legal GPA	Sean Christopherson	1	-6/+11
	Add a helper to check for a legal GPA, and use it to consolidate code in existing, related helpers. Future patches will extend usage to VMX and SVM code, properly handle exceptions to the maxphyaddr rule, and add more helpers. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: nSVM: Don't strip host's C-bit from guest's CR3 when reading PDPTRs	Sean Christopherson	1	-1/+1
	Don't clear the SME C-bit when reading a guest PDPTR, as the GPA (CR3) is in the guest domain. Barring a bizarre paravirtual use case, this is likely a benign bug. SME is not emulated by KVM, loading SEV guest PDPTRs is doomed as KVM can't use the correct key to read guest memory, and setting guest MAXPHYADDR higher than the host, i.e. overlapping the C-bit, would cause faults in the guest. Note, for SEV guests, stripping the C-bit is technically aligned with CPU behavior, but for KVM it's the greater of two evils. Because KVM doesn't have access to the guest's encryption key, ignoring the C-bit would at best result in KVM reading garbage. By keeping the C-bit, KVM will fail its read (unless userspace creates a memslot with the C-bit set). The guest will still undoubtedly die, as KVM will use '0' for the PDPTR value, but that's preferable to interpreting encrypted data as a PDPTR. Fixes: d0ec49d4de90 ("kvm/x86/svm: Support Secure Memory Encryption within KVM") Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86: Set so called 'reserved CR3 bits in LM mask' at vCPU reset	Sean Christopherson	1	-0/+1
	Set cr3_lm_rsvd_bits, which is effectively an invalid GPA mask, at vCPU reset. The reserved bits check needs to be done even if userspace never configures the guest's CPUID model. Cc: stable@vger.kernel.org Fixes: 0107973a80ad ("KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: Add documentation for Xen hypercall and shared_info updates	David Woodhouse	1	-5/+166
	Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	KVM: x86: declare Xen HVM shared info capability and add test case	David Woodhouse	4	-1/+175
	Instead of adding a plethora of new KVM_CAP_XEN_FOO capabilities, just add bits to the return value of KVM_CAP_XEN_HVM. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	KVM: x86/xen: Add event channel interrupt vector upcall	David Woodhouse	6	-1/+74
	It turns out that we can't handle event channels entirely in userspace by delivering them as ExtINT, because KVM is a bit picky about when it accepts ExtINT interrupts from a legacy PIC. The in-kernel local APIC has to have LVT0 configured in APIC_MODE_EXTINT and unmasked, which isn't necessarily the case for Xen guests especially on secondary CPUs. To cope with this, add kvm_xen_get_interrupt() which checks the evtchn_pending_upcall field in the Xen vcpu_info, and delivers the Xen upcall vector (configured by KVM_XEN_ATTR_TYPE_UPCALL_VECTOR) if it's set regardless of LAPIC LVT0 configuration. This gives us the minimum support we need for completely userspace-based implementation of event channels. This does mean that vcpu_enter_guest() needs to check for the evtchn_pending_upcall flag being set, because it can't rely on someone having set KVM_REQ_EVENT unless we were to add some way for userspace to do so manually. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	KVM: x86/xen: register vcpu time info region	Joao Martins	4	-0/+23
	Allow the Xen emulated guest the ability to register secondary vcpu time information. On Xen guests this is used in order to be mapped to userspace and hence allow vdso gettimeofday to work. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	KVM: x86/xen: setup pvclock updates	Joao Martins	2	-15/+21
	Parameterise kvm_setup_pvclock_page() a little bit so that it can be invoked for different gfn_to_hva_cache structures, and with different offsets. Then we can invoke it for the normal KVM pvclock and also for the Xen one in the vcpu_info. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	KVM: x86/xen: register vcpu info	Joao Martins	3	-1/+29
	The vcpu info supersedes the per vcpu area of the shared info page and the guest vcpus will use this instead. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	KVM: x86/xen: Add KVM_XEN_VCPU_SET_ATTR/KVM_XEN_VCPU_GET_ATTR	David Woodhouse	4	-0/+65
	This will be used for per-vCPU setup such as runstate and vcpu_info. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	KVM: x86/xen: update wallclock region	Joao Martins	3	-8/+43
	Wallclock on Xen is written in the shared_info page. To that purpose, export kvm_write_wall_clock() and pass on the GPA of its location to populate the shared_info wall clock data. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	xen: add wc_sec_hi to struct shared_info	David Woodhouse	2	-1/+6
	Xen added this in 2015 (Xen 4.6). On x86_64 and Arm it fills what was previously a 32-bit hole in the generic shared_info structure; on i386 it had to go at the end of struct arch_shared_info. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	KVM: x86/xen: register shared_info page	Joao Martins	3	-4/+42
	Add KVM_XEN_ATTR_TYPE_SHARED_INFO to allow hypervisor to know where the guest's shared info page is. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	KVM: x86/xen: add definitions of compat_shared_info, compat_vcpu_info	David Woodhouse	1	-0/+36
	There aren't a lot of differences for the things that the kernel needs to care about, but there are a few. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	KVM: x86/xen: latch long_mode when hypercall page is set up	David Woodhouse	3	-1/+24
	Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	KVM: x86/xen: add KVM_XEN_HVM_SET_ATTR/KVM_XEN_HVM_GET_ATTR	Joao Martins	4	-0/+63
	This will be used to set up shared info pages etc. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	KVM: x86/xen: Add kvm_xen_enabled static key	David Woodhouse	3	-2/+27
	The code paths for Xen support are all fairly lightweight but if we hide them behind this, they're even more lightweight for any system which isn't actually hosting Xen guests. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	KVM: x86/xen: Move KVM_XEN_HVM_CONFIG handling to xen.c	David Woodhouse	3	-13/+20
	This is already more complex than the simple memcpy it originally had. Move it to xen.c with the rest of the Xen support. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	KVM: x86/xen: Fix coexistence of Xen and Hyper-V hypercalls	Joao Martins	3	-17/+68
	Disambiguate Xen vs. Hyper-V calls by adding 'orl $0x80000000, %eax' at the start of the Hyper-V hypercall page when Xen hypercalls are also enabled. That bit is reserved in the Hyper-V ABI, and those hypercall numbers will never be used by Xen (because it does precisely the same trick). Switch to using kvm_vcpu_write_guest() while we're at it, instead of open-coding it. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	KVM: x86/xen: intercept xen hypercalls if enabled	Joao Martins	10	-32/+369
	Add a new exit reason for emulator to handle Xen hypercalls. Since this means KVM owns the ABI, dispense with the facility for the VMM to provide its own copy of the hypercall pages; just fill them in directly using VMCALL/VMMCALL as we do for the Hyper-V hypercall page. This behaviour is enabled by a new INTERCEPT_HCALL flag in the KVM_XEN_HVM_CONFIG ioctl structure, and advertised by the same flag being returned from the KVM_CAP_XEN_HVM check. Rename xen_hvm_config() to kvm_xen_write_hypercall_page() and move it to the nascent xen.c while we're at it, and add a test case. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	KVM: x86/xen: Fix __user pointer handling for hypercall page installation	David Woodhouse	1	-3/+5
	The address we give to memdup_user() isn't correctly tagged as __user. This is harmless enough as it's a one-off use and we're doing exactly the right thing, but fix it anyway to shut the checker up. Otherwise it'll whine when the (now legacy) code gets moved around in a later patch. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	KVM: x86/xen: fix Xen hypercall page msr handling	Joao Martins	1	-2/+3
	Xen usually places its MSR at 0x40000000 or 0x40000200 depending on whether it is running in viridian mode or not. Note that this is not ABI guaranteed, so it is possible for Xen to advertise the MSR some place else. Given the way xen_hvm_config() is handled, if the former address is selected, this will conflict with Hyper-V's MSR (HV_X64_MSR_GUEST_OS_ID) which unconditionally uses the same address. Given that the MSR location is arbitrary, move the xen_hvm_config() handling to the top of kvm_set_msr_common() before falling through. Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
2021-02-04	KVM: x86/mmu: Allow parallel page faults for the TDP MMU	Ben Gardon	1	-2/+10
	Make the last few changes necessary to enable the TDP MMU to handle page faults in parallel while holding the mmu_lock in read mode. Reviewed-by: Peter Feiner <pfeiner@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-24-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Mark SPTEs in disconnected pages as removed	Ben Gardon	1	-6/+30
	When clearing TDP MMU pages what have been disconnected from the paging structure root, set the SPTEs to a special non-present value which will not be overwritten by other threads. This is needed to prevent races in which a thread is clearing a disconnected page table, but another thread has already acquired a pointer to that memory and installs a mapping in an already cleared entry. This can lead to memory leaks and accounting errors. Reviewed-by: Peter Feiner <pfeiner@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-23-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Flush TLBs after zap in TDP MMU PF handler	Ben Gardon	2	-10/+74
	When the TDP MMU is allowed to handle page faults in parallel there is the possiblity of a race where an SPTE is cleared and then imediately replaced with a present SPTE pointing to a different PFN, before the TLBs can be flushed. This race would violate architectural specs. Ensure that the TLBs are flushed properly before other threads are allowed to install any present value for the SPTE. Reviewed-by: Peter Feiner <pfeiner@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-22-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Use atomic ops to set SPTEs in TDP MMU map	Ben Gardon	3	-34/+130
	To prepare for handling page faults in parallel, change the TDP MMU page fault handler to use atomic operations to set SPTEs so that changes are not lost if multiple threads attempt to modify the same SPTE. Reviewed-by: Peter Feiner <pfeiner@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-21-bgardon@google.com> [Document new locking rules. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Factor out functions to add/remove TDP MMU pages	Ben Gardon	1	-8/+39
	Move the work of adding and removing TDP MMU pages to/from "secondary" data structures to helper functions. These functions will be built on in future commits to enable MMU operations to proceed (mostly) in parallel. No functional change expected. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-20-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Use an rwlock for the x86 MMU	Ben Gardon	10	-84/+118
	Add a read / write lock to be used in place of the MMU spinlock on x86. The rwlock will enable the TDP MMU to handle page faults, and other operations in parallel in future commits. Reviewed-by: Peter Feiner <pfeiner@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-19-bgardon@google.com> [Introduce virt/kvm/mmu_lock.h - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	sched: Add cond_resched_rwlock	Ben Gardon	2	-0/+52
	Safely rescheduling while holding a spin lock is essential for keeping long running kernel operations running smoothly. Add the facility to cond_resched rwlocks. CC: Ingo Molnar <mingo@redhat.com> CC: Will Deacon <will@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Waiman Long <longman@redhat.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-9-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	sched: Add needbreak for rwlocks	Ben Gardon	1	-0/+17
	Contention awareness while holding a spin lock is essential for reducing latency when long running kernel operations can hold that lock. Add the same contention detection interface for read/write spin locks. CC: Ingo Molnar <mingo@redhat.com> CC: Will Deacon <will@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Waiman Long <longman@redhat.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-8-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	locking/rwlocks: Add contention detection for rwlocks	Ben Gardon	2	-6/+25
	rwlocks do not currently have any facility to detect contention like spinlocks do. In order to allow users of rwlocks to better manage latency, add contention detection for queued rwlocks. CC: Ingo Molnar <mingo@redhat.com> CC: Will Deacon <will@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Waiman Long <longman@redhat.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-7-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Protect TDP MMU page table memory with RCU	Ben Gardon	4	-21/+103
	In order to enable concurrent modifications to the paging structures in the TDP MMU, threads must be able to safely remove pages of page table memory while other threads are traversing the same memory. To ensure threads do not access PT memory after it is freed, protect PT memory with RCU. Protecting concurrent accesses to page table memory from use-after-free bugs could also have been acomplished using walk_shadow_page_lockless_begin/end() and READING_SHADOW_PAGE_TABLES, coupling with the barriers in a TLB flush. The use of RCU for this case has several distinct advantages over that approach. 1. Disabling interrupts for long running operations is not desirable. Future commits will allow operations besides page faults to operate without the exclusive protection of the MMU lock and those operations are too long to disable iterrupts for their duration. 2. The use of RCU here avoids long blocking / spinning operations in perfromance critical paths. By freeing memory with an asynchronous RCU API we avoid the longer wait times TLB flushes experience when overlapping with a thread in walk_shadow_page_lockless_begin/end(). 3. RCU provides a separation of concerns when removing memory from the paging structure. Because the RCU callback to free memory can be scheduled immediately after a TLB flush, there's no need for the thread to manually free a queue of pages later, as commit_zap_pages does. Fixes: 95fb5b0258b7 ("kvm: x86/mmu: Support MMIO in the TDP MMU") Reviewed-by: Peter Feiner <pfeiner@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-18-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Clear dirtied pages mask bit before early break	Ben Gardon	1	-2/+2
	In clear_dirty_pt_masked, the loop is intended to exit early after processing each of the GFNs with corresponding bits set in mask. This does not work as intended if another thread has already cleared the dirty bit or writable bit on the SPTE. In that case, the loop would proceed to the next iteration early and the bit in mask would not be cleared. As a result the loop could not exit early and would proceed uselessly. Move the unsetting of the mask bit before the check for a no-op SPTE change. Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-17-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Skip no-op changes in TDP MMU functions	Ben Gardon	1	-2/+4
	Skip setting SPTEs if no change is expected. Reviewed-by: Peter Feiner <pfeiner@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-16-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed	Ben Gardon	1	-10/+22
	Given certain conditions, some TDP MMU functions may not yield reliably / frequently enough. For example, if a paging structure was very large but had few, if any writable entries, wrprot_gfn_range could traverse many entries before finding a writable entry and yielding because the check for yielding only happens after an SPTE is modified. Fix this issue by moving the yield to the beginning of the loop. Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Reviewed-by: Peter Feiner <pfeiner@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-15-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Ensure forward progress when yielding in TDP MMU iter	Ben Gardon	3	-23/+23
	In some functions the TDP iter risks not making forward progress if two threads livelock yielding to one another. This is possible if two threads are trying to execute wrprot_gfn_range. Each could write protect an entry and then yield. This would reset the tdp_iter's walk over the paging structure and the loop would end up repeating the same entry over and over, preventing either thread from making forward progress. Fix this issue by only yielding if the loop has made forward progress since the last yield. Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU") Reviewed-by: Peter Feiner <pfeiner@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-14-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Rename goal_gfn to next_last_level_gfn	Ben Gardon	2	-12/+12
	The goal_gfn field in tdp_iter can be misleading as it implies that it is the iterator's final goal. It is really a target for the lowest gfn mapped by the leaf level SPTE the iterator will traverse towards. Change the field's name to be more precise. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-13-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Merge flush and non-flush tdp_mmu_iter_cond_resched	Ben Gardon	1	-29/+13
	The flushing and non-flushing variants of tdp_mmu_iter_cond_resched have almost identical implementations. Merge the two functions and add a flush parameter. Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-12-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Fix braces in kvm_recover_nx_lpages	Ben Gardon	1	-2/+2
	No functional change intended. Fixes: 29cf0f5007a2 ("kvm: x86/mmu: NX largepage recovery for TDP MMU") Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-10-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Factor out handling of removed page tables	Ben Gardon	1	-29/+42
	Factor out the code to handle a disconnected subtree of the TDP paging structure from the code to handle the change to an individual SPTE. Future commits will build on this to allow asynchronous page freeing. No functional change intended. Reviewed-by: Peter Feiner <pfeiner@google.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-6-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Don't redundantly clear TDP MMU pt memory	Ben Gardon	1	-1/+0
	The KVM MMU caches already guarantee that shadow page table memory will be zeroed, so there is no reason to re-zero the page in the TDP MMU page fault handler. No functional change intended. Reviewed-by: Peter Feiner <pfeiner@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-5-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Add lockdep when setting a TDP MMU SPTE	Ben Gardon	1	-0/+2
	Add lockdep to __tdp_mmu_set_spte to ensure that SPTEs are only modified under the MMU lock. No functional change intended. Reviewed-by: Peter Feiner <pfeiner@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-4-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Add comment on __tdp_mmu_set_spte	Ben Gardon	1	-0/+16
	__tdp_mmu_set_spte is a very important function in the TDP MMU which already accepts several arguments and will take more in future commits. To offset this complexity, add a comment to the function describing each of the arguemnts. No functional change intended. Reviewed-by: Peter Feiner <pfeiner@google.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-3-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: change TDP MMU yield function returns to match cond_resched	Ben Gardon	1	-10/+29
	Currently the TDP MMU yield / cond_resched functions either return nothing or return true if the TLBs were not flushed. These are confusing semantics, especially when making control flow decisions in calling functions. To clean things up, change both functions to have the same return value semantics as cond_resched: true if the thread yielded, false if it did not. If the function yielded in the _flush_ version, then the TLBs will have been flushed. Reviewed-by: Peter Feiner <pfeiner@google.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-2-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86: move kvm_inject_gp up from kvm_set_xcr to callers	Paolo Bonzini	3	-14/+8
	Push the injection of #GP up to the callers, so that they can just use kvm_complete_insn_gp. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: cleanup DR6/DR7 reserved bits checks	Paolo Bonzini	1	-2/+2
	kvm_dr6_valid and kvm_dr7_valid check that bits 63:32 are zero. Using them makes it easier to review the code for inconsistencies. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: move EXIT_FASTPATH_REENTER_GUEST to common code	Paolo Bonzini	3	-22/+15
	Now that KVM is using static calls, calling vmx_vcpu_run and vmx_sync_pir_to_irr does not incur anymore the cost of a retpoline. Therefore there is no need anymore to handle EXIT_FASTPATH_REENTER_GUEST in vendor code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	selftest: kvm: x86: test KVM_GET_CPUID2 and guest visible CPUIDs against ↵	Vitaly Kuznetsov	5	-0/+233
	KVM_GET_SUPPORTED_CPUID Commit 181f494888d5 ("KVM: x86: fix CPUID entries returned by KVM_GET_CPUID2 ioctl") revealed that we're not testing KVM_GET_CPUID2 ioctl at all. Add a test for it and also check that from inside the guest visible CPUIDs are equal to it's output. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210129161821.74635-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Add '__func__' in rmap_printk()	Stephen Zhang	2	-11/+11
	Given the common pattern: rmap_printk("%s:"..., __func__,...) we could improve this by adding '__func__' in rmap_printk(). Signed-off-by: Stephen Zhang <stephenzhangzsd@gmail.com> Message-Id: <1611713325-3591-1-git-send-email-stephenzhangzsd@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: SVM: Replace hard-coded value with #define	Krish Sadhukhan	1	-1/+1
	Replace the hard-coded value for bit# 1 in EFLAGS, with the available #define. Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Message-Id: <20210203012842.101447-2-krish.sadhukhan@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: SVM: use .prepare_guest_switch() to handle CPU register save/setup	Michael Roth	3	-47/+56
	Currently we save host state like user-visible host MSRs, and do some initial guest register setup for MSR_TSC_AUX and MSR_AMD64_TSC_RATIO in svm_vcpu_load(). Defer this until just before we enter the guest by moving the handling to kvm_x86_ops.prepare_guest_switch() similarly to how it is done for the VMX implementation. Additionally, since handling of saving/restoring host user MSRs is the same both with/without SEV-ES enabled, move that handling to common code. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-Id: <20210202190126.2185715-4-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: SVM: remove uneeded fields from host_save_users_msrs	Michael Roth	3	-21/+8
	Now that the set of host user MSRs that need to be individually saved/restored are the same with/without SEV-ES, we can drop the .sev_es_restored flag and just iterate through the list unconditionally for both cases. A subsequent patch can then move these loops to a common path. Signed-off-by: Michael Roth <michael.roth@amd.com> Message-Id: <20210202190126.2185715-3-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: SVM: use vmsave/vmload for saving/restoring additional host state	Michael Roth	3	-43/+10
	Using a guest workload which simply issues 'hlt' in a tight loop to generate VMEXITs, it was observed (on a recent EPYC processor) that a significant amount of the VMEXIT overhead measured on the host was the result of MSR reads/writes in svm_vcpu_load/svm_vcpu_put according to perf: 67.49%--kvm_arch_vcpu_ioctl_run \| \|--23.13%--vcpu_put \| kvm_arch_vcpu_put \| \| \| \|--21.31%--native_write_msr \| \| \| --1.27%--svm_set_cr4 \| \|--16.11%--vcpu_load \| \| \| --15.58%--kvm_arch_vcpu_load \| \| \| \|--13.97%--svm_set_cr4 \| \| \| \| \| \|--12.64%--native_read_msr Most of these MSRs relate to 'syscall'/'sysenter' and segment bases, and can be saved/restored using 'vmsave'/'vmload' instructions rather than explicit MSR reads/writes. In doing so there is a significant reduction in the svm_vcpu_load/svm_vcpu_put overhead measured for the above workload: 50.92%--kvm_arch_vcpu_ioctl_run \| \|--19.28%--disable_nmi_singlestep \| \|--13.68%--vcpu_load \| kvm_arch_vcpu_load \| \| \| \|--9.19%--svm_set_cr4 \| \| \| \| \| --6.44%--native_read_msr \| \| \| --3.55%--native_write_msr \| \|--6.05%--kvm_inject_nmi \|--2.80%--kvm_sev_es_mmio_read \|--2.19%--vcpu_put \| \| \| --1.25%--kvm_arch_vcpu_put \| native_write_msr Quantifying this further, if we look at the raw cycle counts for a normal iteration of the above workload (according to 'rdtscp'), kvm_arch_vcpu_ioctl_run() takes ~4600 cycles from start to finish with the current behavior. Using 'vmsave'/'vmload', this is reduced to ~2800 cycles, a savings of 39%. While this approach doesn't seem to manifest in any noticeable improvement for more realistic workloads like UnixBench, netperf, and kernel builds, likely due to their exit paths generally involving IO with comparatively high latencies, it does improve overall overhead of KVM_RUN significantly, which may still be noticeable for certain situations. It also simplifies some aspects of the code. With this change, explicit save/restore is no longer needed for the following host MSRs, since they are documented[1] as being part of the VMCB State Save Area: MSR_STAR, MSR_LSTAR, MSR_CSTAR, MSR_SYSCALL_MASK, MSR_KERNEL_GS_BASE, MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP, MSR_FS_BASE, MSR_GS_BASE and only the following MSR needs individual handling in svm_vcpu_put/svm_vcpu_load: MSR_TSC_AUX We could drop the host_save_user_msrs array/loop and instead handle MSR read/write of MSR_TSC_AUX directly, but we leave that for now as a potential follow-up. Since 'vmsave'/'vmload' also handles the LDTR and FS/GS segment registers (and associated hidden state)[2], some of the code previously used to handle this is no longer needed, so we drop it as well. The first public release of the SVM spec[3] also documents the same handling for the host state in question, so we make these changes unconditionally. Also worth noting is that we 'vmsave' to the same page that is subsequently used by 'vmrun' to record some host additional state. This is okay, since, in accordance with the spec[2], the additional state written to the page by 'vmrun' does not overwrite any fields written by 'vmsave'. This has also been confirmed through testing (for the above CPU, at least). [1] AMD64 Architecture Programmer's Manual, Rev 3.33, Volume 2, Appendix B, Table B-2 [2] AMD64 Architecture Programmer's Manual, Rev 3.31, Volume 3, Chapter 4, VMSAVE/VMLOAD [3] Secure Virtual Machine Architecture Reference Manual, Rev 3.01 Suggested-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Michael Roth <michael.roth@amd.com> Message-Id: <20210202190126.2185715-2-michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: SVM: Use asm goto to handle unexpected #UD on SVM instructions	Sean Christopherson	3	-16/+67
	Add svm_asm() macros, a la the existing vmx_asm() macros, to handle faults on SVM instructions instead of using the generic __ex(), a.k.a. __kvm_handle_fault_on_reboot(). Using asm goto generates slightly better code as it eliminates the in-line JMP+CALL sequences that are needed by __kvm_handle_fault_on_reboot() to avoid triggering BUG() from fixup (which generates bad stack traces). Using SVM specific macros also drops the last user of __ex() and the the last asm linkage to kvm_spurious_fault(), and adds a helper for VMSAVE, which may gain an addition call site in the future (as part of optimizing the SVM context switching). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20201231002702.2223707-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: VMX: Use the kernel's version of VMXOFF	Sean Christopherson	2	-13/+9
	Drop kvm_cpu_vmxoff() in favor of the kernel's cpu_vmxoff(). Modify the latter to return -EIO on fault so that KVM can invoke kvm_spurious_fault() when appropriate. In addition to the obvious code reuse, dropping kvm_cpu_vmxoff() also eliminates VMX's last usage of the __ex()/__kvm_handle_fault_on_reboot() macros, thus helping pave the way toward dropping them entirely. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20201231002702.2223707-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: VMX: Move Intel PT shenanigans out of VMXON/VMXOFF flows	Sean Christopherson	1	-4/+7
	Move the Intel PT tracking outside of the VMXON/VMXOFF helpers so that a future patch can drop KVM's kvm_cpu_vmxoff() in favor of the kernel's cpu_vmxoff() without an associated PT functional change, and without losing symmetry between the VMXON and VMXOFF flows. Barring undocumented behavior, this should have no meaningful effects as Intel PT behavior does not interact with CR4.VMXE. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20201231002702.2223707-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM/nVMX: Use __vmx_vcpu_run in nested_vmx_check_vmentry_hw	Uros Bizjak	4	-32/+5
	Replace inline assembly in nested_vmx_check_vmentry_hw with a call to __vmx_vcpu_run. The function is not performance critical, so (double) GPR save/restore in __vmx_vcpu_run can be tolerated, as far as performance effects are concerned. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Reviewed-and-tested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> [sean: dropped versioning info from changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20201231002702.2223707-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	x86/virt: Mark flags and memory as clobbered by VMXOFF	David P. Reed	1	-1/+2
	Explicitly tell the compiler that VMXOFF modifies flags (like all VMX instructions), and mark memory as clobbered since VMXOFF must not be reordered and also may have memory side effects (though the kernel really shouldn't be accessing the root VMCS anyways). Practically speaking, adding the clobbers is most likely a nop; the primary motivation is to properly document VMXOFF's behavior. For the flags clobber, both Clang and GCC automatically mark flags as clobbered; this is noted in commit 4b1e54786e48 ("KVM/x86: Use assembly instruction mnemonics instead of .byte streams"), which intentionally removed the previous clobber. But, neither Clang nor GCC documents this behavior, and there's no downside to including the clobber. For the memory clobber, the RFLAGS.IF and CR4.VMXE manipulations that immediately follow VMXOFF have compiler barriers of their own, i.e. VMXOFF can't get reordered after clearing CR4.VMXE, which is really what's of interest. Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David P. Reed <dpreed@deepplum.com> [sean: rewrote changelog, dropped comment adjustments] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20201231002702.2223707-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	x86/reboot: Force all cpus to exit VMX root if VMX is supported	Sean Christopherson	1	-20/+10
	Force all CPUs to do VMXOFF (via NMI shootdown) during an emergency reboot if VMX is _supported_, as VMX being off on the current CPU does not prevent other CPUs from being in VMX root (post-VMXON). This fixes a bug where a crash/panic reboot could leave other CPUs in VMX root and prevent them from being woken via INIT-SIPI-SIPI in the new kernel. Fixes: d176720d34c7 ("x86: disable VMX on all CPUs on reboot") Cc: stable@vger.kernel.org Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David P. Reed <dpreed@deepplum.com> [sean: reworked changelog and further tweaked comment] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20201231002702.2223707-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	x86/virt: Eat faults on VMXOFF in reboot flows	Sean Christopherson	1	-5/+12
	Silently ignore all faults on VMXOFF in the reboot flows as such faults are all but guaranteed to be due to the CPU not being in VMX root. Because (a) VMXOFF may be executed in NMI context, e.g. after VMXOFF but before CR4.VMXE is cleared, (b) there's no way to query the CPU's VMX state without faulting, and (c) the whole point is to get out of VMX root, eating faults is the simplest way to achieve the desired behaior. Technically, VMXOFF can fault (or fail) for other reasons, but all other fault and failure scenarios are mode related, i.e. the kernel would have to magically end up in RM, V86, compat mode, at CPL>0, or running with the SMI Transfer Monitor active. The kernel is beyond hosed if any of those scenarios are encountered; trying to do something fancy in the error path to handle them cleanly is pointless. Fixes: 1e9931146c74 ("x86: asm/virtext.h: add cpu_vmxoff() inline function") Reported-by: David P. Reed <dpreed@deepplum.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20201231002702.2223707-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86: use static calls to reduce kvm_x86_ops overhead	Jason Baron	13	-197/+193
	Convert kvm_x86_ops to use static calls. Note that all kvm_x86_ops are covered here except for 'pmu_ops and 'nested ops'. Here are some numbers running cpuid in a loop of 1 million calls averaged over 5 runs, measured in the vm (lower is better). Intel Xeon 3000MHz: \|default \|mitigations=off ------------------------------------- vanilla \|.671s \|.486s static call\|.573s(-15%)\|.458s(-6%) AMD EPYC 2500MHz: \|default \|mitigations=off ------------------------------------- vanilla \|.710s \|.609s static call\|.664s(-6%) \|.609s(0%) Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Jason Baron <jbaron@akamai.com> Message-Id: <e057bf1b8a7ad15652df6eeba3f907ae758d3399.1610680941.git.jbaron@akamai.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86: introduce definitions to support static calls for kvm_x86_ops	Jason Baron	3	-0/+149
	Use static calls to improve kvm_x86_ops performance. Introduce the definitions that will be used by a subsequent patch to actualize the savings. Add a new kvm-x86-ops.h header that can be used for the definition of static calls. This header is also intended to be used to simplify the defition of svm_kvm_ops and vmx_x86_ops. Note that all functions in kvm_x86_ops are covered here except for 'pmu_ops' and 'nested ops'. I think they can be covered by static calls in a simlilar manner, but were omitted from this series to reduce scope and because I don't think they have as large of a performance impact. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Jason Baron <jbaron@akamai.com> Message-Id: <e5cc82ead7ab37b2dceb0837a514f3f8bea4f8d1.1610680941.git.jbaron@akamai.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: X86: prepend vmx/svm prefix to additional kvm_x86_ops functions	Jason Baron	4	-27/+27
	A subsequent patch introduces macros in preparation for simplifying the definition for vmx_x86_ops and svm_x86_ops. Making the naming more uniform expands the coverage of the macros. Add vmx/svm prefix to the following functions: update_exception_bitmap(), enable_nmi_window(), enable_irq_window(), update_cr8_intercept and enable_smi_window(). Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Jason Baron <jbaron@akamai.com> Message-Id: <ed594696f8e2c2b2bfc747504cee9bbb2a269300.1610680941.git.jbaron@akamai.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: Stop using deprecated jump label APIs	Cun Li	4	-32/+23
	The use of 'struct static_key' and 'static_key_false' is deprecated. Use the new API. Signed-off-by: Cun Li <cun.jia.li@gmail.com> Message-Id: <20210111152435.50275-1-cun.jia.li@gmail.com> [Make it compile. While at it, rename kvm_no_apic_vcpu to kvm_has_noapic_vcpu; the former reads too much like "true if no vCPU has an APIC". - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: SVM: Fix #GP handling for doubly-nested virtualization	Wei Huang	1	-2/+18
	Under the case of nested on nested (L0, L1, L2 are all hypervisors), we do not support emulation of the vVMLOAD/VMSAVE feature, the L0 hypervisor can inject the proper #VMEXIT to inform L1 of what is happening and L1 can avoid invoking the #GP workaround. For this reason we turns on guest VM's X86_FEATURE_SVME_ADDR_CHK bit for KVM running inside VM to receive the notification and change behavior. Similarly we check if vcpu is under guest mode before emulating the vmware-backdoor instructions. For the case of nested on nested, we let the guest handle it. Co-developed-by: Bandan Das <bsd@redhat.com> Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Wei Huang <wei.huang2@amd.com> Tested-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210126081831.570253-5-wei.huang2@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: SVM: Add support for SVM instruction address check change	Wei Huang	2	-0/+4
	New AMD CPUs have a change that checks #VMEXIT intercept on special SVM instructions before checking their EAX against reserved memory region. This change is indicated by CPUID_0x8000000A_EDX[28]. If it is 1, #VMEXIT is triggered before #GP. KVM doesn't need to intercept and emulate #GP faults as #GP is supposed to be triggered. Co-developed-by: Bandan Das <bsd@redhat.com> Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Wei Huang <wei.huang2@amd.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210126081831.570253-4-wei.huang2@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: SVM: Add emulation support for #GP triggered by SVM instructions	Bandan Das	1	-18/+91
	While running SVM related instructions (VMRUN/VMSAVE/VMLOAD), some AMD CPUs check EAX against reserved memory regions (e.g. SMM memory on host) before checking VMCB's instruction intercept. If EAX falls into such memory areas, #GP is triggered before VMEXIT. This causes problem under nested virtualization. To solve this problem, KVM needs to trap #GP and check the instructions triggering #GP. For VM execution instructions, KVM emulates these instructions. Co-developed-by: Wei Huang <wei.huang2@amd.com> Signed-off-by: Wei Huang <wei.huang2@amd.com> Signed-off-by: Bandan Das <bsd@redhat.com> Message-Id: <20210126081831.570253-3-wei.huang2@amd.com> [Conditionally enable #GP intercept. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86: Factor out x86 instruction emulation with decoding	Wei Huang	2	-23/+41
	Move the instruction decode part out of x86_emulate_instruction() for it to be used in other places. Also kvm_clear_exception_queue() is moved inside the if-statement as it doesn't apply when KVM are coming back from userspace. Co-developed-by: Bandan Das <bsd@redhat.com> Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Wei Huang <wei.huang2@amd.com> Message-Id: <20210126081831.570253-2-wei.huang2@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: X86: Rename DR6_INIT to DR6_ACTIVE_LOW	Chenyi Qiang	7	-25/+38
	DR6_INIT contains the 1-reserved bits as well as the bit that is cleared to 0 when the condition (e.g. RTM) happens. The value can be used to initialize dr6 and also be the XOR mask between the #DB exit qualification (or payload) and DR6. Concerning that DR6_INIT is used as initial value only once, rename it to DR6_ACTIVE_LOW and apply it in other places, which would make the incoming changes for bus lock debug exception more simple. Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Message-Id: <20210202090433.13441-2-chenyi.qiang@intel.com> [Define DR6_FIXED_1 from DR6_ACTIVE_LOW and DR6_VOLATILE. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	selftests: kvm/x86: add test for pmu msr MSR_IA32_PERF_CAPABILITIES	Like Xu	5	-1/+169
	This test will check the effect of various CPUID settings on the MSR_IA32_PERF_CAPABILITIES MSR, check that whatever user space writes with KVM_SET_MSR is _not_ modified from the guest and can be retrieved with KVM_GET_MSR, and check that invalid LBR formats are rejected. Signed-off-by: Like Xu <like.xu@linux.intel.com> Message-Id: <20210201051039.255478-12-like.xu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: vmx/pmu: Expose LBR_FMT in the MSR_IA32_PERF_CAPABILITIES	Like Xu	1	-1/+8
	Userspace could enable guest LBR feature when the exactly supported LBR format value is initialized to the MSR_IA32_PERF_CAPABILITIES and the LBR is also compatible with vPMU version and host cpu model. The LBR could be enabled on the guest if host perf supports LBR (checked via x86_perf_get_lbr()) and the vcpu model is compatible with the host one. Signed-off-by: Like Xu <like.xu@linux.intel.com> Message-Id: <20210201051039.255478-11-like.xu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: vmx/pmu: Release guest LBR event via lazy release mechanism	Like Xu	3	-1/+24
	The vPMU uses GUEST_LBR_IN_USE_IDX (bit 58) in 'pmu->pmc_in_use' to indicate whether a guest LBR event is still needed by the vcpu. If the vcpu no longer accesses LBR related registers within a scheduling time slice, and the enable bit of LBR has been unset, vPMU will treat the guest LBR event as a bland event of a vPMC counter and release it as usual. Also, the pass-through state of LBR records msrs is cancelled. Signed-off-by: Like Xu <like.xu@linux.intel.com> Message-Id: <20210201051039.255478-10-like.xu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: vmx/pmu: Emulate legacy freezing LBRs on virtual PMI	Like Xu	5	-3/+39
	The current vPMU only supports Architecture Version 2. According to Intel SDM "17.4.7 Freezing LBR and Performance Counters on PMI", if IA32_DEBUGCTL.Freeze_LBR_On_PMI = 1, the LBR is frozen on the virtual PMI and the KVM would emulate to clear the LBR bit (bit 0) in IA32_DEBUGCTL. Also, guest needs to re-enable IA32_DEBUGCTL.LBR to resume recording branches. Signed-off-by: Like Xu <like.xu@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Message-Id: <20210201051039.255478-9-like.xu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: vmx/pmu: Reduce the overhead of LBR pass-through or cancellation	Like Xu	2	-0/+16
	When the LBR records msrs has already been pass-through, there is no need to call vmx_update_intercept_for_lbr_msrs() again and again, and vice versa. Signed-off-by: Like Xu <like.xu@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Message-Id: <20210201051039.255478-8-like.xu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: vmx/pmu: Pass-through LBR msrs when the guest LBR event is ACTIVE	Like Xu	3	-3/+135
	In addition to DEBUGCTLMSR_LBR, any KVM trap caused by LBR msrs access will result in a creation of guest LBR event per-vcpu. If the guest LBR event is scheduled on with the corresponding vcpu context, KVM will pass-through all LBR records msrs to the guest. The LBR callstack mechanism implemented in the host could help save/restore the guest LBR records during the event context switches, which reduces a lot of overhead if we save/restore tens of LBR msrs (e.g. 32 LBR records entries) in the much more frequent VMX transitions. To avoid reclaiming LBR resources from any higher priority event on host, KVM would always check the exist of guest LBR event and its state before vm-entry as late as possible. A negative result would cancel the pass-through state, and it also prevents real registers accesses and potential data leakage. If host reclaims the LBR between two checks, the interception state and LBR records can be safely preserved due to native save/restore support from guest LBR event. The KVM emits a pr_warn() when the LBR hardware is unavailable to the guest LBR event. The administer is supposed to reminder users that the guest result may be inaccurate if someone is using LBR to record hypervisor on the host side. Suggested-by: Andi Kleen <ak@linux.intel.com> Co-developed-by: Wei Wang <wei.w.wang@intel.com> Signed-off-by: Wei Wang <wei.w.wang@intel.com> Signed-off-by: Like Xu <like.xu@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Message-Id: <20210201051039.255478-7-like.xu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: vmx/pmu: Create a guest LBR event when vcpu sets DEBUGCTLMSR_LBR	Like Xu	3	-0/+76
	When vcpu sets DEBUGCTLMSR_LBR in the MSR_IA32_DEBUGCTLMSR, the KVM handler would create a guest LBR event which enables the callstack mode and none of hardware counter is assigned. The host perf would schedule and enable this event as usual but in an exclusive way. The guest LBR event will be released when the vPMU is reset but soon, the lazy release mechanism would be applied to this event like a vPMC. Suggested-by: Andi Kleen <ak@linux.intel.com> Co-developed-by: Wei Wang <wei.w.wang@intel.com> Signed-off-by: Wei Wang <wei.w.wang@intel.com> Signed-off-by: Like Xu <like.xu@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Message-Id: <20210201051039.255478-6-like.xu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: vmx/pmu: Add PMU_CAP_LBR_FMT check when guest LBR is enabled	Like Xu	4	-2/+25
	Usespace could set the bits [0, 5] of the IA32_PERF_CAPABILITIES MSR which tells about the record format stored in the LBR records. The LBR will be enabled on the guest if host perf supports LBR (checked via x86_perf_get_lbr()) and the vcpu model is compatible with the host one. Signed-off-by: Like Xu <like.xu@linux.intel.com> Message-Id: <20210201051039.255478-4-like.xu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: vmx/pmu: Add PMU_CAP_LBR_FMT check when guest LBR is enabled	Paolo Bonzini	4	-0/+43
	Usespace could set the bits [0, 5] of the IA32_PERF_CAPABILITIES MSR which tells about the record format stored in the LBR records. The LBR will be enabled on the guest if host perf supports LBR (checked via x86_perf_get_lbr()) and the vcpu model is compatible with the host one. Signed-off-by: Like Xu <like.xu@linux.intel.com> Message-Id: <20210201051039.255478-4-like.xu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/pmu: preserve IA32_PERF_CAPABILITIES across CPUID refresh	Paolo Bonzini	1	-6/+10
	Once MSR_IA32_PERF_CAPABILITIES is changed via vmx_set_msr(), the value should not be changed by cpuid(). To ensure that the new value is kept, the default initialization path is moved to intel_pmu_init(). The effective value of the MSR will be 0 if PDCM is clear, however. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/vmx: Make vmx_set_intercept_for_msr() non-static	Like Xu	2	-1/+3
	To make code responsibilities clear, we may resue and invoke the vmx_set_intercept_for_msr() in other vmx-specific files (e.g. pmu_intel.c), so expose it to passthrough LBR msrs later. Signed-off-by: Like Xu <like.xu@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Message-Id: <20210201051039.255478-2-like.xu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: VMX: read/write MSR_IA32_DEBUGCTLMSR from GUEST_IA32_DEBUGCTL	Like Xu	4	-18/+28
	SVM already has specific handlers of MSR_IA32_DEBUGCTLMSR in the svm_get/set_msr, so the x86 common part can be safely moved to VMX. This allows KVM to store the bits it supports in GUEST_IA32_DEBUGCTL. Add vmx_supported_debugctl() to refactor the throwing logic of #GP. Signed-off-by: Like Xu <like.xu@linux.intel.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Message-Id: <20210108013704.134985-2-like.xu@linux.intel.com> [Merge parts of Chenyi Qiang's "KVM: X86: Expose bus lock debug exception to guest". - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: VMX: Use x2apic_mode to avoid RDMSR when querying PI state	Sean Christopherson	1	-3/+3
	Use x2apic_mode instead of x2apic_enabled() when adjusting the destination ID during Posted Interrupt updates. This avoids the costly RDMSR that is hidden behind x2apic_enabled(). Reported-by: luferry <luferry@163.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210115220354.434807-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	x86/apic: Export x2apic_mode for use by KVM in "warm" path	Sean Christopherson	1	-0/+1
	Export x2apic_mode so that KVM can query whether x2APIC is active without having to incur the RDMSR in x2apic_enabled(). When Posted Interrupts are in use for a guest with an assigned device, KVM ends up checking for x2APIC at least once every time a vCPU halts. KVM could obviously snapshot x2apic_enabled() to avoid the RDMSR, but that's rather silly given that x2apic_mode holds the exact info needed by KVM. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210115220354.434807-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: X86: Add the Document for KVM_CAP_X86_BUS_LOCK_EXIT	Chenyi Qiang	1	-3/+42
	Introduce a new capability named KVM_CAP_X86_BUS_LOCK_EXIT, which is used to handle bus locks detected in guest. It allows the userspace to do custom throttling policies to mitigate the 'noisy neighbour' problem. Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Message-Id: <20201106090315.18606-5-chenyi.qiang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: VMX: Enable bus lock VM exit	Chenyi Qiang	10	-4/+83
	Virtual Machine can exploit bus locks to degrade the performance of system. Bus lock can be caused by split locked access to writeback(WB) memory or by using locks on uncacheable(UC) memory. The bus lock is typically >1000 cycles slower than an atomic operation within a cache line. It also disrupts performance on other cores (which must wait for the bus lock to be released before their memory operations can complete). To address the threat, bus lock VM exit is introduced to notify the VMM when a bus lock was acquired, allowing it to enforce throttling or other policy based mitigations. A VMM can enable VM exit due to bus locks by setting a new "Bus Lock Detection" VM-execution control(bit 30 of Secondary Processor-based VM execution controls). If delivery of this VM exit was preempted by a higher priority VM exit (e.g. EPT misconfiguration, EPT violation, APIC access VM exit, APIC write VM exit, exception bitmap exiting), bit 26 of exit reason in vmcs field is set to 1. In current implementation, the KVM exposes this capability through KVM_CAP_X86_BUS_LOCK_EXIT. The user can get the supported mode bitmap (i.e. off and exit) and enable it explicitly (disabled by default). If bus locks in guest are detected by KVM, exit to user space even when current exit reason is handled by KVM internally. Set a new field KVM_RUN_BUS_LOCK in vcpu->run->flags to inform the user space that there is a bus lock detected in guest. Document for Bus Lock VM exit is now available at the latest "Intel Architecture Instruction Set Extensions Programming Reference". Document Link: https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Message-Id: <20201106090315.18606-4-chenyi.qiang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: X86: Reset the vcpu->run->flags at the beginning of vcpu_run	Chenyi Qiang	1	-1/+4
	Reset the vcpu->run->flags at the beginning of kvm_arch_vcpu_ioctl_run. It can avoid every thunk of code that needs to set the flag clear it, which increases the odds of missing a case and ending up with a flag in an undefined state. Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Message-Id: <20201106090315.18606-3-chenyi.qiang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: VMX: Convert vcpu_vmx.exit_reason to a union	Sean Christopherson	3	-49/+86
	Convert vcpu_vmx.exit_reason from a u32 to a union (of size u32). The full VM_EXIT_REASON field is comprised of a 16-bit basic exit reason in bits 15:0, and single-bit modifiers in bits 31:16. Historically, KVM has only had to worry about handling the "failed VM-Entry" modifier, which could only be set in very specific flows and required dedicated handling. I.e. manually stripping the FAILED_VMENTRY bit was a somewhat viable approach. But even with only a single bit to worry about, KVM has had several bugs related to comparing a basic exit reason against the full exit reason store in vcpu_vmx. Upcoming Intel features, e.g. SGX, will add new modifier bits that can be set on more or less any VM-Exit, as opposed to the significantly more restricted FAILED_VMENTRY, i.e. correctly handling everything in one-off flows isn't scalable. Tracking exit reason in a union forces code to explicitly choose between consuming the full exit reason and the basic exit, and is a convenient way to document and access the modifiers. No functional change intended. Cc: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Message-Id: <20201106090315.18606-2-chenyi.qiang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM/SVM: add support for SEV attestation command	Brijesh Singh	5	-0/+118
	The SEV FW version >= 0.23 added a new command that can be used to query the attestation report containing the SHA-256 digest of the guest memory encrypted through the KVM_SEV_LAUNCH_UPDATE_{DATA, VMSA} commands and sign the report with the Platform Endorsement Key (PEK). See the SEV FW API spec section 6.8 for more details. Note there already exist a command (KVM_SEV_LAUNCH_MEASURE) that can be used to get the SHA-256 digest. The main difference between the KVM_SEV_LAUNCH_MEASURE and KVM_SEV_ATTESTATION_REPORT is that the latter can be called while the guest is running and the measurement value is signed with PEK. Cc: James Bottomley <jejb@linux.ibm.com> Cc: Tom Lendacky <Thomas.Lendacky@amd.com> Cc: David Rientjes <rientjes@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Borislav Petkov <bp@alien8.de> Cc: John Allen <john.allen@amd.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: linux-crypto@vger.kernel.org Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Acked-by: David Rientjes <rientjes@google.com> Tested-by: James Bottomley <jejb@linux.ibm.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Message-Id: <20210104151749.30248-1-brijesh.singh@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: selftests: Disable dirty logging with vCPUs running	Ben Gardon	1	-5/+5
	Disabling dirty logging is much more intestesting from a testing perspective if the vCPUs are still running. This also excercises the code-path in which collapsible SPTEs must be faulted back in at a higher level after disabling dirty logging. To: linux-kselftest@vger.kernel.org CC: Peter Xu <peterx@redhat.com> CC: Andrew Jones <drjones@redhat.com> CC: Thomas Huth <thuth@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-29-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: selftests: Add backing src parameter to dirty_log_perf_test	Ben Gardon	8	-15/+63
	Add a parameter to control the backing memory type for dirty_log_perf_test so that the test can be run with hugepages. To: linux-kselftest@vger.kernel.org CC: Peter Xu <peterx@redhat.com> CC: Andrew Jones <drjones@redhat.com> CC: Thomas Huth <thuth@redhat.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-28-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: selftests: Add memslot modification stress test	Ben Gardon	3	-0/+213
	Add a memslot modification stress test in which a memslot is repeatedly created and removed while vCPUs access memory in another memslot. Most userspaces do not create or remove memslots on running VMs which makes it hard to test races in adding and removing memslots without a dedicated test. Adding and removing a memslot also has the effect of tearing down the entire paging structure, which leads to more page faults and pressure on the page fault handling path than a one-and-done memory population test. Reviewed-by: Jacob Xu <jacobhxu@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210112214253.463999-7-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: selftests: Add option to overlap vCPU memory access	Ben Gardon	4	-18/+57
	Add an option to overlap the ranges of memory each vCPU accesses instead of partitioning them. This option will increase the probability of multiple vCPUs faulting on the same page at the same time, and causing interesting races, if there are bugs in the page fault handler or elsewhere in the kernel. Reviewed-by: Jacob Xu <jacobhxu@google.com> Reviewed-by: Makarand Sonare <makarandsonare@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210112214253.463999-6-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: selftests: Fix population stage in dirty_log_perf_test	Ben Gardon	1	-3/+8
	Currently the population stage in the dirty_log_perf_test does nothing as the per-vCPU iteration counters are not initialized and the loop does not wait for each vCPU. Remedy those errors. Reviewed-by: Jacob Xu <jacobhxu@google.com> Reviewed-by: Makarand Sonare <makarandsonare@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210112214253.463999-5-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: selftests: Convert iterations to int in dirty_log_perf_test	Ben Gardon	1	-13/+13
	In order to add an iteration -1 to indicate that the memory population phase has not yet completed, convert the interations counters to ints. No functional change intended. Reviewed-by: Jacob Xu <jacobhxu@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210112214253.463999-4-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: selftests: Avoid flooding debug log while populating memory	Ben Gardon	1	-5/+4
	Peter Xu pointed out that a log message printed while waiting for the memory population phase of the dirty_log_perf_test will flood the debug logs as there is no delay after printing the message. Since the message does not provide much value anyway, remove it. Reviewed-by: Jacob Xu <jacobhxu@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210112214253.463999-3-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: selftests: Rename timespec_diff_now to timespec_elapsed	Ben Gardon	4	-13/+13
	In response to some earlier comments from Peter Xu, rename timespec_diff_now to the much more sensible timespec_elapsed. No functional change intended. Reviewed-by: Jacob Xu <jacobhxu@google.com> Reviewed-by: Makarand Sonare <makarandsonare@google.com> Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210112214253.463999-2-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Remove the defunct update_pte() paging hook	Sean Christopherson	3	-51/+2
	Remove the update_pte() shadow paging logic, which was obsoleted by commit 4731d4c7a077 ("KVM: MMU: out of sync shadow core"), but never removed. As pointed out by Yu, KVM never write protects leaf page tables for the purposes of shadow paging, and instead marks their associated shadow page as unsync so that the guest can write PTEs at will. The update_pte() path, which predates the unsync logic, optimizes COW scenarios by refreshing leaf SPTEs when they are written, as opposed to zapping the SPTE, restarting the guest, and installing the new SPTE on the subsequent fault. Since KVM no longer write-protects leaf page tables, update_pte() is unreachable and can be dropped. Reported-by: Yu Zhang <yu.c.zhang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210115004051.4099250-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: selftests: Test IPI to halted vCPU in xAPIC while backing page moves	Peter Shier	4	-0/+566
	When a guest is using xAPIC KVM allocates a backing page for the required EPT entry for the APIC access address set in the VMCS. If mm decides to move that page the KVM mmu notifier will update the VMCS with the new HPA. This test induces a page move to test that APIC access continues to work correctly. It is a directed test for commit e649b3f0188f "KVM: x86: Fix APIC page invalidation race". Tested: ran for 1 hour on a skylake, migrating backing page every 1ms Depends on patch "selftests: kvm: Add exception handling to selftests" from aaronlewis@google.com that has not yet been queued. Signed-off-by: Peter Shier <pshier@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Message-Id: <20201105223823.850068-1-pshier@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: Expose AVX_VNNI instruction to guset	Yang Zhong	1	-1/+1
	Expose AVX (VEX-encoded) versions of the Vector Neural Network Instructions to guest. The bit definition: CPUID.(EAX=7,ECX=1):EAX[bit 4] AVX_VNNI The following instructions are available when this feature is present in the guest. 1. VPDPBUS: Multiply and Add Unsigned and Signed Bytes 2. VPDPBUSDS: Multiply and Add Unsigned and Signed Bytes with Saturation 3. VPDPWSSD: Multiply and Add Signed Word Integers 4. VPDPWSSDS: Multiply and Add Signed Integers with Saturation This instruction is currently documented in the latest "extensions" manual (ISE). It will appear in the "main" manual (SDM) in the future. Signed-off-by: Yang Zhong <yang.zhong@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Message-Id: <20210105004909.42000-3-yang.zhong@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	Enumerate AVX Vector Neural Network instructions	Kyung Min Park	1	-0/+1
	Add AVX version of the Vector Neural Network (VNNI) Instructions. A processor supports AVX VNNI instructions if CPUID.0x07.0x1:EAX[4] is present. The following instructions are available when this feature is present. 1. VPDPBUS: Multiply and Add Unsigned and Signed Bytes 2. VPDPBUSDS: Multiply and Add Unsigned and Signed Bytes with Saturation 3. VPDPWSSD: Multiply and Add Signed Word Integers 4. VPDPWSSDS: Multiply and Add Signed Integers with Saturation The only in-kernel usage of this is kvm passthrough. The CPU feature flag is shown as "avx_vnni" in /proc/cpuinfo. This instruction is currently documented in the latest "extensions" manual (ISE). It will appear in the "main" manual (SDM) in the future. Signed-off-by: Kyung Min Park <kyung.min.park@intel.com> Signed-off-by: Yang Zhong <yang.zhong@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Message-Id: <20210105004909.42000-2-yang.zhong@intel.com> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	x86: kvm: style: Simplify bool comparison	YANG LI	1	-1/+1
	Fix the following coccicheck warning: ./arch/x86/kvm/x86.c:8012:5-48: WARNING: Comparison to bool Signed-off-by: YANG LI <abaci-bugfix@linux.alibaba.com> Reported-by: Abaci Robot <abaci@linux.alibaba.com> Message-Id: <1610357578-66081-1-git-send-email-abaci-bugfix@linux.alibaba.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86: Zap the oldest MMU pages, not the newest	Sean Christopherson	1	-1/+1
	Walk the list of MMU pages in reverse in kvm_mmu_zap_oldest_mmu_pages(). The list is FIFO, meaning new pages are inserted at the head and thus the oldest pages are at the tail. Using a "forward" iterator causes KVM to zap MMU pages that were just added, which obliterates guest performance once the max number of shadow MMU pages is reached. Fixes: 6b82ef2c9cf1 ("KVM: x86/mmu: Batch zap MMU pages when recycling oldest pages") Reported-by: Zdenek Kaspar <zkaspar82@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210113205030.3481307-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Use boolean returns for (S)PTE accessors	Sean Christopherson	2	-9/+5
	Return a 'bool' instead of an 'int' for various PTE accessors that are boolean in nature, e.g. is_shadow_present_pte(). Returning an int is goofy and potentially dangerous, e.g. if a flag being checked is moved into the upper 32 bits of a SPTE, then the compiler may silently squash the entire check since casting to an int is guaranteed to yield a return value of '0'. Opportunistically refactor is_last_spte() so that it naturally returns a bool value instead of letting it implicitly cast 0/1 to false/true. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210123003003.3137525-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: X86: use vzalloc() instead of vmalloc/memset	Tian Tao	1	-2/+1
	fixed the following warning： /virt/kvm/dirty_ring.c:70:20-27: WARNING: vzalloc should be used for ring -> dirty_gfns, instead of vmalloc/memset. Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Message-Id: <1611547045-13669-1-git-send-email-tiantao6@hisilicon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86: Take KVM's SRCU lock only if steal time update is needed	Sean Christopherson	1	-9/+11
	Enter a SRCU critical section for a memslots lookup during steal time update if and only if a steal time update is actually needed. Taking the lock can be avoided if steal time is disabled by the guest, or if KVM knows it has already flagged the vCPU as being preempted. Reword the comment to be more precise as to exactly why memslots will be queried. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210123000334.3123628-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86: Remove obsolete disabling of page faults in kvm_arch_vcpu_put()	Sean Christopherson	1	-10/+0
	Remove the disabling of page faults across kvm_steal_time_set_preempted() as KVM now accesses the steal time struct (shared with the guest) via a cached mapping (see commit b043138246a4, "x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed".) The cache lookup is flagged as atomic, thus it would be a bug if KVM tried to resolve a new pfn, i.e. we want the splat that would be reached via might_fault(). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210123000334.3123628-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: do not assume PTE is writable after follow_pfn	Paolo Bonzini	1	-3/+12
	In order to convert an HVA to a PFN, KVM usually tries to use the get_user_pages family of functinso. This however is not possible for VM_IO vmas; in that case, KVM instead uses follow_pfn. In doing this however KVM loses the information on whether the PFN is writable. That is usually not a problem because the main use of VM_IO vmas with KVM is for BARs in PCI device assignment, however it is a bug. To fix it, use follow_pte and check pte_write while under the protection of the PTE lock. The information can be used to fail hva_to_pfn_remapped or passed back to the caller via *writable. Usage of follow_pfn was introduced in commit add6a0cd1c5b ("KVM: MMU: try to fix up page faults before giving up", 2016-07-05); however, even older version have the same issue, all the way back to commit 2e2e3738af33 ("KVM: Handle vma regions with no backing page", 2008-07-20), as they also did not check whether the PFN was writable. Fixes: 2e2e3738af33 ("KVM: Handle vma regions with no backing page") Reported-by: David Stevens <stevensd@google.com> Cc: 3pvd@google.com Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04	KVM: x86/mmu: Fix TDP MMU zap collapsible SPTEs	Ben Gardon	1	-3/+3
	There is a bug in the TDP MMU function to zap SPTEs which could be replaced with a larger mapping which prevents the function from doing anything. Fix this by correctly zapping the last level SPTEs. Cc: stable@vger.kernel.org Fixes: 14881998566d ("kvm: x86/mmu: Support disabling dirty logging for the tdp MMU") Signed-off-by: Ben Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-11-bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-03	KVM: x86: cleanup CR3 reserved bits checks	Paolo Bonzini	3	-13/+5
	If not in long mode, the low bits of CR3 are reserved but not enforced to be zero, so remove those checks. If in long mode, however, the MBZ bits extend down to the highest physical address bit of the guest, excluding the encryption bit. Make the checks consistent with the above, and match them between nested_vmcb_checks and KVM_SET_SREGS. Cc: stable@vger.kernel.org Fixes: 761e41693465 ("KVM: nSVM: Check that MBZ bits in CR3 and CR4 are not set on vmrun of nested guests") Fixes: a780a3ea6282 ("KVM: X86: Fix reserved bits check for MOV to CR3") Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-03	KVM: SVM: Treat SVM as unsupported when running as an SEV guest	Sean Christopherson	2	-0/+6
	Don't let KVM load when running as an SEV guest, regardless of what CPUID says. Memory is encrypted with a key that is not accessible to the host (L0), thus it's impossible for L0 to emulate SVM, e.g. it'll see garbage when reading the VMCB. Technically, KVM could decrypt all memory that needs to be accessible to the L0 and use shadow paging so that L0 does not need to shadow NPT, but exposing such information to L0 largely defeats the purpose of running as an SEV guest. This can always be revisited if someone comes up with a use case for running VMs inside SEV guests. Note, VMLOAD, VMRUN, etc... will also #GP on GPAs with C-bit set, i.e. KVM is doomed even if the SEV guest is debuggable and the hypervisor is willing to decrypt the VMCB. This may or may not be fixed on CPUs that have the SVME_ADDR_CHK fix. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210202212017.2486595-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-02	KVM: x86: Update emulator context mode if SYSENTER xfers to 64-bit mode	Sean Christopherson	1	-0/+2
	Set the emulator context to PROT64 if SYSENTER transitions from 32-bit userspace (compat mode) to a 64-bit kernel, otherwise the RIP update at the end of x86_emulate_insn() will incorrectly truncate the new RIP. Note, this bug is mostly limited to running an Intel virtual CPU model on an AMD physical CPU, as other combinations of virtual and physical CPUs do not trigger full emulation. On Intel CPUs, SYSENTER in compatibility mode is legal, and unconditionally transitions to 64-bit mode. On AMD CPUs, SYSENTER is illegal in compatibility mode and #UDs. If the vCPU is AMD, KVM injects a #UD on SYSENTER in compat mode. If the pCPU is Intel, SYSENTER will execute natively and not trigger #UD->VM-Exit (ignoring guest TLB shenanigans). Fixes: fede8076aab4 ("KVM: x86: handle wrap around 32-bit address space") Cc: stable@vger.kernel.org Signed-off-by: Jonny Barker <jonny@jonnybarker.com> [sean: wrote changelog] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210202165546.2390296-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-01	KVM: x86: Supplement __cr4_reserved_bits() with X86_FEATURE_PCID check	Vitaly Kuznetsov	1	-0/+2
	Commit 7a873e455567 ("KVM: selftests: Verify supported CR4 bits can be set before KVM_SET_CPUID2") reveals that KVM allows to set X86_CR4_PCIDE even when PCID support is missing: ==== Test Assertion Failure ==== x86_64/set_sregs_test.c:41: rc pid=6956 tid=6956 - Invalid argument 1 0x000000000040177d: test_cr4_feature_bit at set_sregs_test.c:41 2 0x00000000004014fc: main at set_sregs_test.c:119 3 0x00007f2d9346d041: ?? ??:0 4 0x000000000040164d: _start at ??:? KVM allowed unsupported CR4 bit (0x20000) Add X86_FEATURE_PCID feature check to __cr4_reserved_bits() to make kvm_is_valid_cr4() fail. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210201142843.108190-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-01	KVM/x86: assign hva with the right value to vm_munmap the pages	Zheng Zhan Liang	1	-1/+1
	Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Zheng Zhan Liang <zhengzhanliang@huorong.cn> Message-Id: <20210201055310.267029-1-zhengzhanliang@huorong.cn> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-01	KVM: x86: Allow guests to see MSR_IA32_TSX_CTRL even if tsx=off	Paolo Bonzini	2	-13/+30
	Userspace that does not know about KVM_GET_MSR_FEATURE_INDEX_LIST will generally use the default value for MSR_IA32_ARCH_CAPABILITIES. When this happens and the host has tsx=on, it is possible to end up with virtual machines that have HLE and RTM disabled, but TSX_CTRL available. If the fleet is then switched to tsx=off, kvm_get_arch_capabilities() will clear the ARCH_CAP_TSX_CTRL_MSR bit and it will not be possible to use the tsx=off hosts as migration destinations, even though the guests do not have TSX enabled. To allow this migration, allow guests to write to their TSX_CTRL MSR, while keeping the host MSR unchanged for the entire life of the guests. This ensures that TSX remains disabled and also saves MSR reads and writes, and it's okay to do because with tsx=off we know that guests will not have the HLE and RTM features in their CPUID. (If userspace sets bogus CPUID data, we do not expect HLE and RTM to work in guests anyway). Cc: stable@vger.kernel.org Fixes: cbbaa2727aa3 ("KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-28	Fix unsynchronized access to sev members through svm_register_enc_region	Peter Gonda	1	-7/+10
	Grab kvm->lock before pinning memory when registering an encrypted region; sev_pin_memory() relies on kvm->lock being held to ensure correctness when checking and updating the number of pinned pages. Add a lockdep assertion to help prevent future regressions. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Sean Christopherson <seanjc@google.com> Cc: x86@kernel.org Cc: kvm@vger.kernel.org Cc: stable@vger.kernel.org Cc: linux-kernel@vger.kernel.org Fixes: 1e80fdc09d12 ("KVM: SVM: Pin guest memory when SEV is active") Signed-off-by: Peter Gonda <pgonda@google.com> V2 - Fix up patch description - Correct file paths svm.c -> sev.c - Add unlock of kvm->lock on sev_pin_memory error V1 - https://lore.kernel.org/kvm/20210126185431.1824530-1-pgonda@google.com/ Message-Id: <20210127161524.2832400-1-pgonda@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-28	KVM: Documentation: Fix documentation for nested.	Yu Zhang	2	-3/+5
	Nested VMX was enabled by default in commit 1e58e5e59148 ("KVM: VMX: enable nested virtualization by default"), which was merged in Linux 4.20. This patch is to fix the documentation accordingly. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Message-Id: <20210128154747.4242-1-yu.c.zhang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-28	Merge tag 'kvmarm-fixes-5.11-3' of ↵	Paolo Bonzini	1	-9/+11
	git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 5.11, take #3 - Avoid clobbering extra registers on initialisation
2021-01-28	KVM: x86: fix CPUID entries returned by KVM_GET_CPUID2 ioctl	Michael Roth	1	-1/+1
	Recent commit 255cbecfe0 modified struct kvm_vcpu_arch to make 'cpuid_entries' a pointer to an array of kvm_cpuid_entry2 entries rather than embedding the array in the struct. KVM_SET_CPUID and KVM_SET_CPUID2 were updated accordingly, but KVM_GET_CPUID2 was missed. As a result, KVM_GET_CPUID2 currently returns random fields from struct kvm_vcpu_arch to userspace rather than the expected CPUID values. Fix this by treating 'cpuid_entries' as a pointer when copying its contents to userspace buffer. Fixes: 255cbecfe0c9 ("KVM: x86: allocate vcpu->arch.cpuid_entries dynamically") Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Michael Roth <michael.roth@amd.com.com> Message-Id: <20210128024451.1816770-1-michael.roth@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-25	KVM: x86: allow KVM_REQ_GET_NESTED_STATE_PAGES outside guest mode for VMX	Paolo Bonzini	3	-9/+29
	VMX also uses KVM_REQ_GET_NESTED_STATE_PAGES for the Hyper-V eVMCS, which may need to be loaded outside guest mode. Therefore we cannot WARN in that case. However, that part of nested_get_vmcs12_pages is _not_ needed at vmentry time. Split it out of KVM_REQ_GET_NESTED_STATE_PAGES handling, so that both vmentry and migration (and in the latter case, independent of is_guest_mode) do the parts that are needed. Cc: <stable@vger.kernel.org> # 5.10.x: f2c7ef3ba: KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES Cc: <stable@vger.kernel.org> # 5.10.x Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-25	KVM: x86: Revert "KVM: x86: Mark GPRs dirty when written"	Sean Christopherson	1	-26/+25
	Revert the dirty/available tracking of GPRs now that KVM copies the GPRs to the GHCB on any post-VMGEXIT VMRUN, even if a GPR is not dirty. Per commit de3cd117ed2f ("KVM: x86: Omit caching logic for always-available GPRs"), tracking for GPRs noticeably impacts KVM's code footprint. This reverts commit 1c04d8c986567c27c56c05205dceadc92efb14ff. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210122235049.3107620-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-25	KVM: SVM: Unconditionally sync GPRs to GHCB on VMRUN of SEV-ES guest	Sean Christopherson	1	-9/+6
	Drop the per-GPR dirty checks when synchronizing GPRs to the GHCB, the GRPs' dirty bits are set from time zero and never cleared, i.e. will always be seen as dirty. The obvious alternative would be to clear the dirty bits when appropriate, but removing the dirty checks is desirable as it allows reverting GPR dirty+available tracking, which adds overhead to all flavors of x86 VMs. Note, unconditionally writing the GPRs in the GHCB is tacitly allowed by the GHCB spec, which allows the hypervisor (or guest) to provide unnecessary info; it's the guest's responsibility to consume only what it needs (the hypervisor is untrusted after all). The guest and hypervisor can supply additional state if desired but must not rely on that additional state being provided. Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Fixes: 291bd20d5d88 ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210122235049.3107620-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-25	KVM: nVMX: Sync unsync'd vmcs02 state to vmcs12 on migration	Maxim Levitsky	1	-5/+8
	Even when we are outside the nested guest, some vmcs02 fields may not be in sync vs vmcs12. This is intentional, even across nested VM-exit, because the sync can be delayed until the nested hypervisor performs a VMCLEAR or a VMREAD/VMWRITE that affects those rarely accessed fields. However, during KVM_GET_NESTED_STATE, the vmcs12 has to be up to date to be able to restore it. To fix that, call copy_vmcs02_to_vmcs12_rare() before the vmcs12 contents are copied to userspace. Fixes: 7952d769c29ca ("KVM: nVMX: Sync rarely accessed guest fields only when needed") Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210114205449.8715-2-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-25	kvm: tracing: Fix unmatched kvm_entry and kvm_exit events	Lorenzo Brescia	3	-2/+5
	On VMX, if we exit and then re-enter immediately without leaving the vmx_vcpu_run() function, the kvm_entry event is not logged. That means we will see one (or more) kvm_exit, without its (their) corresponding kvm_entry, as shown here: CPU-1979 [002] 89.871187: kvm_entry: vcpu 1 CPU-1979 [002] 89.871218: kvm_exit: reason MSR_WRITE CPU-1979 [002] 89.871259: kvm_exit: reason MSR_WRITE It also seems possible for a kvm_entry event to be logged, but then we leave vmx_vcpu_run() right away (if vmx->emulation_required is true). In this case, we will have a spurious kvm_entry event in the trace. Fix these situations by moving trace_kvm_entry() inside vmx_vcpu_run() (where trace_kvm_exit() already is). A trace obtained with this patch applied looks like this: CPU-14295 [000] 8388.395387: kvm_entry: vcpu 0 CPU-14295 [000] 8388.395392: kvm_exit: reason MSR_WRITE CPU-14295 [000] 8388.395393: kvm_entry: vcpu 0 CPU-14295 [000] 8388.395503: kvm_exit: reason EXTERNAL_INTERRUPT Of course, not calling trace_kvm_entry() in common x86 code any longer means that we need to adjust the SVM side of things too. Signed-off-by: Lorenzo Brescia <lorenzo.brescia@edu.unito.it> Signed-off-by: Dario Faggioli <dfaggioli@suse.com> Message-Id: <160873470698.11652.13483635328769030605.stgit@Wayrath> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-25	KVM: Documentation: Update description of KVM_{GET,CLEAR}_DIRTY_LOG	Zenghui Yu	1	-9/+7
	Update various words, including the wrong parameter name and the vague description of the usage of "slot" field. Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Message-Id: <20201208043439.895-1-yuzenghui@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-25	KVM: x86: get smi pending status correctly	Jay Zhou	1	-0/+4
	The injection process of smi has two steps: Qemu KVM Step1: cpu->interrupt_request &= \ ~CPU_INTERRUPT_SMI; kvm_vcpu_ioctl(cpu, KVM_SMI) call kvm_vcpu_ioctl_smi() and kvm_make_request(KVM_REQ_SMI, vcpu); Step2: kvm_vcpu_ioctl(cpu, KVM_RUN, 0) call process_smi() if kvm_check_request(KVM_REQ_SMI, vcpu) is true, mark vcpu->arch.smi_pending = true; The vcpu->arch.smi_pending will be set true in step2, unfortunately if vcpu paused between step1 and step2, the kvm_run->immediate_exit will be set and vcpu has to exit to Qemu immediately during step2 before mark vcpu->arch.smi_pending true. During VM migration, Qemu will get the smi pending status from KVM using KVM_GET_VCPU_EVENTS ioctl at the downtime, then the smi pending status will be lost. Signed-off-by: Jay Zhou <jianjay.zhou@huawei.com> Signed-off-by: Shengen Zhuang <zhuangshengen@huawei.com> Message-Id: <20210118084720.1585-1-jianjay.zhou@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-25	KVM: x86/pmu: Fix HW_REF_CPU_CYCLES event pseudo-encoding in intel_arch_events[]	Like Xu	1	-1/+1
	The HW_REF_CPU_CYCLES event on the fixed counter 2 is pseudo-encoded as 0x0300 in the intel_perfmon_event_map[]. Correct its usage. Fixes: 62079d8a4312 ("KVM: PMU: add proper support for fixed counter 2") Signed-off-by: Like Xu <like.xu@linux.intel.com> Message-Id: <20201230081916.63417-1-like.xu@linux.intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-25	KVM: x86/pmu: Fix UBSAN shift-out-of-bounds warning in intel_pmu_refresh()	Like Xu	1	-0/+4
	Since we know vPMU will not work properly when (1) the guest bit_width(s) of the [gp\|fixed] counters are greater than the host ones, or (2) guest requested architectural events exceeds the range supported by the host, so we can setup a smaller left shift value and refresh the guest cpuid entry, thus fixing the following UBSAN shift-out-of-bounds warning: shift exponent 197 is too large for 64-bit type 'long long unsigned int' Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395 intel_pmu_refresh.cold+0x75/0x99 arch/x86/kvm/vmx/pmu_intel.c:348 kvm_vcpu_after_set_cpuid+0x65a/0xf80 arch/x86/kvm/cpuid.c:177 kvm_vcpu_ioctl_set_cpuid2+0x160/0x440 arch/x86/kvm/cpuid.c:308 kvm_arch_vcpu_ioctl+0x11b6/0x2d70 arch/x86/kvm/x86.c:4709 kvm_vcpu_ioctl+0x7b9/0xdb0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3386 vfs_ioctl fs/ioctl.c:48 [inline] __do_sys_ioctl fs/ioctl.c:753 [inline] __se_sys_ioctl fs/ioctl.c:739 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported-by: syzbot+ae488dc136a4cc6ba32b@syzkaller.appspotmail.com Signed-off-by: Like Xu <like.xu@linux.intel.com> Message-Id: <20210118025800.34620-1-like.xu@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-25	KVM: x86: Add more protection against undefined behavior in rsvd_bits()	Sean Christopherson	1	-1/+8
	Add compile-time asserts in rsvd_bits() to guard against KVM passing in garbage hardcoded values, and cap the upper bound at '63' for dynamic values to prevent generating a mask that would overflow a u64. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210113204515.3473079-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-25	KVM: Documentation: Fix spec for KVM_CAP_ENABLE_CAP_VM	Quentin Perret	1	-1/+1
	The documentation classifies KVM_ENABLE_CAP with KVM_CAP_ENABLE_CAP_VM as a vcpu ioctl, which is incorrect. Fix it by specifying it as a VM ioctl. Fixes: e5d83c74a580 ("kvm: make KVM_CAP_ENABLE_CAP_VM architecture agnostic") Signed-off-by: Quentin Perret <qperret@google.com> Message-Id: <20210108165349.747359-1-qperret@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-01-25	Merge tag 'kvmarm-fixes-5.11-2' of ↵	Paolo Bonzini	6	-49/+74
	git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 5.11, take #2 - Don't allow tagged pointers to point to memslots - Filter out ARMv8.1+ PMU events on v8.0 hardware - Hide PMU registers from userspace when no PMU is configured - More PMU cleanups - Don't try to handle broken PSCI firmware - More sys_reg() to reg_to_encoding() conversions
2021-01-25	KVM: arm64: Don't clobber x4 in __do_hyp_init	Andrew Scull	1	-9/+11
	arm_smccc_1_1_hvc() only adds write contraints for x0-3 in the inline assembly for the HVC instruction so make sure those are the only registers that change when __do_hyp_init is called. Tested-by: David Brazdil <dbrazdil@google.com> Signed-off-by: Andrew Scull <ascull@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210125145415.122439-3-ascull@google.com
2021-01-21	KVM: Forbid the use of tagged userspace addresses for memslots	Marc Zyngier	2	-0/+4
	The use of a tagged address could be pretty confusing for the whole memslot infrastructure as well as the MMU notifiers. Forbid it altogether, as it never quite worked the first place. Cc: stable@vger.kernel.org Reported-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-01-21	KVM: arm64: Filter out v8.1+ events on v8.0 HW	Marc Zyngier	1	-3/+7
	When running on v8.0 HW, make sure we don't try to advertise events in the 0x4000-0x403f range. Cc: stable@vger.kernel.org Fixes: 88865beca9062 ("KVM: arm64: Mask out filtered events in PCMEID{0,1}_EL1") Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210121105636.1478491-1-maz@kernel.org
2021-01-21	KVM: arm64: Compute TPIDR_EL2 ignoring MTE tag	Steven Price	1	-1/+2
	KASAN in HW_TAGS mode will store MTE tags in the top byte of the pointer. When computing the offset for TPIDR_EL2 we don't want anything in the top byte, so remove the tag to ensure the computation is correct no matter what the tag. Fixes: 94ab5b61ee16 ("kasan, arm64: enable CONFIG_KASAN_HW_TAGS") Signed-off-by: Steven Price <steven.price@arm.com> [maz: added comment] Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210108161254.53674-1-steven.price@arm.com
2021-01-14	KVM: arm64: Use the reg_to_encoding() macro instead of sys_reg()	Alexandru Elisei	1	-10/+7
	The reg_to_encoding() macro is a wrapper over sys_reg() and conveniently takes a sys_reg_desc or a sys_reg_params argument and returns the 32 bit register encoding. Use it instead of calling sys_reg() directly. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210106144218.110665-1-alexandru.elisei@arm.com
2021-01-14	KVM: arm64: Allow PSCI SYSTEM_OFF/RESET to return	David Brazdil	1	-8/+5
	The KVM/arm64 PSCI relay assumes that SYSTEM_OFF and SYSTEM_RESET should not return, as dictated by the PSCI spec. However, there is firmware out there which breaks this assumption, leading to a hyp panic. Make KVM more robust to broken firmware by allowing these to return. Signed-off-by: David Brazdil <dbrazdil@google.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20201229160059.64135-1-dbrazdil@google.com
2021-01-14	KVM: arm64: Simplify handling of absent PMU system registers	Marc Zyngier	1	-7/+1
	Now that all PMU registers are gated behind a .visibility callback, remove the other checks against an absent PMU. Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-01-14	KVM: arm64: Hide PMU registers from userspace when not available	Marc Zyngier	1	-20/+48
	It appears that while we are now able to properly hide PMU registers from the guest when a PMU isn't available (either because none has been configured, the host doesn't have the PMU support compiled in, or that the HW doesn't have one at all), we are still exposing more than we should to userspace. Introduce a visibility callback gating all the PMU registers, which covers both usrespace and guest. Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-01-10	Linux 5.11-rc3	Linus Torvalds	1	-1/+1

2021-01-10	Merge tag 'kbuild-fixes-v5.11' of ↵	Linus Torvalds	6	-14/+6
	git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - Search for <ncurses.h> in the default header path of HOSTCC - Tweak the option order to be kind to old BSD awk - Remove 'kvmconfig' and 'xenconfig' shorthands - Fix documentation * tag 'kbuild-fixes-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: Documentation: kbuild: Fix section reference kconfig: remove 'kvmconfig' and 'xenconfig' shorthands lib/raid6: Let $(UNROLL) rules work with macOS userland kconfig: Support building mconf with vendor sysroot ncurses kconfig: config script: add a little user help MAINTAINERS: adjust GCC PLUGINS after gcc-plugin.sh removal
2021-01-10	Merge tag 'scsi-fixes' of ↵	Linus Torvalds	5	-26/+123
	git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI fixes from James Bottomley: "This is two driver fixes (megaraid_sas and hisi_sas). The megaraid one is a revert of a previous revert of a cpu hotplug fix which exposed a bug in the block layer which has been fixed in this merge window. The hisi_sas performance enhancement comes from switching to interrupt managed completion queues, which depended on the addition of devm_platform_get_irqs_affinity() which is now upstream via the irq tree in the last merge window" * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: scsi: hisi_sas: Expose HW queues for v2 hw Revert "Revert "scsi: megaraid_sas: Added support for shared host tagset for cpuhotplug""
2021-01-10	Merge tag 'block-5.11-2021-01-10' of git://git.kernel.dk/linux-block	Linus Torvalds	22	-59/+172
	Pull block fixes from Jens Axboe: - Missing CRC32 selections (Arnd) - Fix for a merge window regression with bdev inode init (Christoph) - bcache fixes - rnbd fixes - NVMe pull request from Christoph: - fix a race in the nvme-tcp send code (Sagi Grimberg) - fix a list corruption in an nvme-rdma error path (Israel Rukshin) - avoid a possible double fetch in nvme-pci (Lalithambika Krishnakumar) - add the susystem NQN quirk for a Samsung driver (Gopal Tiwari) - fix two compiler warnings in nvme-fcloop (James Smart) - don't call sleeping functions from irq context in nvme-fc (James Smart) - remove an unused argument (Max Gurtovoy) - remove unused exports (Minwoo Im) - Use-after-free fix for partition iteration (Ming) - Missing blk-mq debugfs flag annotation (John) - Bdev freeze regression fix (Satya) - blk-iocost NULL pointer deref fix (Tejun) * tag 'block-5.11-2021-01-10' of git://git.kernel.dk/linux-block: (26 commits) bcache: set bcache device into read-only mode for BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET bcache: introduce BCH_FEATURE_INCOMPAT_LOG_LARGE_BUCKET_SIZE for large bucket bcache: check unsupported feature sets for bcache register bcache: fix typo from SUUP to SUPP in features.h bcache: set pdev_set_uuid before scond loop iteration blk-mq-debugfs: Add decode for BLK_MQ_F_TAG_HCTX_SHARED block/rnbd-clt: avoid module unload race with close confirmation block/rnbd: Adding name to the Contributors List block/rnbd-clt: Fix sg table use after free block/rnbd-srv: Fix use after free in rnbd_srv_sess_dev_force_close block/rnbd: Select SG_POOL for RNBD_CLIENT block: pre-initialize struct block_device in bdev_alloc_inode fs: Fix freeze_bdev()/thaw_bdev() accounting of bd_fsfreeze_sb nvme: remove the unused status argument from nvme_trace_bio_complete nvmet-rdma: Fix list_del corruption on queue establishment failure nvme: unexport functions with no external caller nvme: avoid possible double fetch in handling CQE nvme-tcp: Fix possible race of io_work and direct send nvme-pci: mark Samsung PM1725a as IGNORE_DEV_SUBNQN nvme-fcloop: Fix sscanf type and list_first_entry_or_null warnings ...
2021-01-10	Merge tag 'io_uring-5.11-2021-01-10' of git://git.kernel.dk/linux-block	Linus Torvalds	1	-89/+167
	Pull io_uring fixes from Jens Axboe: "A bit larger than I had hoped at this point, but it's all changes that will be directed towards stable anyway. In detail: - Fix a merge window regression on error return (Matthew) - Remove useless variable declaration/assignment (Ye Bin) - IOPOLL fixes (Pavel) - Exit and cancelation fixes (Pavel) - fasync lockdep complaint fix (Pavel) - Ensure SQPOLL is synchronized with creator life time (Pavel)" * tag 'io_uring-5.11-2021-01-10' of git://git.kernel.dk/linux-block: io_uring: stop SQPOLL submit on creator's death io_uring: add warn_once for io_uring_flush() io_uring: inline io_uring_attempt_task_drop() io_uring: io_rw_reissue lockdep annotations io_uring: synchronise ev_posted() with waitqueues io_uring: dont kill fasync under completion_lock io_uring: trigger eventfd for IOPOLL io_uring: Fix return value from alloc_fixed_file_ref_node io_uring: Delete useless variable ‘id’ in io_prep_async_work io_uring: cancel more aggressively in exit_work io_uring: drop file refs after task cancel io_uring: patch up IOPOLL overflow_flush sync io_uring: synchronise IOPOLL on task_submit fail
2021-01-10	Merge tag 'usb-5.11-rc3' of ↵	Linus Torvalds	33	-225/+267
	git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB fixes from Greg KH: "Here are a number of small USB driver fixes for 5.11-rc3. Include in here are: - USB gadget driver fixes for reported issues - new usb-serial driver ids - dma from stack bugfixes - typec bugfixes - dwc3 bugfixes - xhci driver bugfixes - other small misc usb driver bugfixes All of these have been in linux-next with no reported issues" * tag 'usb-5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (35 commits) usb: dwc3: gadget: Clear wait flag on dequeue usb: typec: Send uevent for num_altmodes update usb: typec: Fix copy paste error for NVIDIA alt-mode description usb: gadget: enable super speed plus kcov, usb: hide in_serving_softirq checks in __usb_hcd_giveback_urb usb: uas: Add PNY USB Portable SSD to unusual_uas usb: gadget: configfs: Preserve function ordering after bind failure usb: gadget: select CONFIG_CRC32 usb: gadget: core: change the comment for usb_gadget_connect usb: gadget: configfs: Fix use-after-free issue with udc_name usb: dwc3: gadget: Restart DWC3 gadget when enabling pullup usb: usbip: vhci_hcd: protect shift size USB: usblp: fix DMA to stack USB: serial: iuu_phoenix: fix DMA from stack USB: serial: option: add LongSung M5710 module support USB: serial: option: add Quectel EM160R-GL USB: Gadget: dummy-hcd: Fix shift-out-of-bounds bug usb: gadget: f_uac2: reset wMaxPacketSize usb: dwc3: ulpi: Fix USB2.0 HS/FS/LS PHY suspend regression usb: dwc3: ulpi: Replace CPU-based busyloop with Protocol-based one ...
2021-01-10	Merge tag 'staging-5.11-rc3' of ↵	Linus Torvalds	5	-29/+21
	git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging driver fixes from Greg KH: "Here are some small staging driver fixes for 5.11-rc3. Nothing major, just resolving some reported issues: - cleanup some remaining mentions of the ION drivers that were removed in 5.11-rc1 - comedi driver bugfix - two error path memory leak fixes All have been in linux-next for a while with no reported issues" * tag 'staging-5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: staging: ION: remove some references to CONFIG_ION staging: mt7621-dma: Fix a resource leak in an error handling path Staging: comedi: Return -EFAULT if copy_to_user() fails staging: spmi: hisi-spmi-controller: Fix some error handling paths
2021-01-10	Merge tag 'char-misc-5.11-rc3' of ↵	Linus Torvalds	21	-217/+322
	git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver fixes from Greg KH: "Here are some small char and misc driver fixes for 5.11-rc3. The majority here are fixes for the habanalabs drivers, but also in here are: - crypto driver fix - pvpanic driver fix - updated font file - interconnect driver fixes All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-5.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (26 commits) Fonts: font_ter16x32: Update font with new upstream Terminus release misc: pvpanic: Check devm_ioport_map() for NULL speakup: Add github repository URL and bug tracker MAINTAINERS: Update Georgi's email address crypto: asym_tpm: correct zero out potential secrets habanalabs: Fix memleak in hl_device_reset interconnect: imx8mq: Use icc_sync_state interconnect: imx: Remove a useless test interconnect: imx: Add a missing of_node_put after of_device_is_available interconnect: qcom: fix rpmh link failures habanalabs: fix order of status check habanalabs: register to pci shutdown callback habanalabs: add validation cs counter, fix misplaced counters habanalabs/gaudi: retry loading TPC f/w on -EINTR habanalabs: adjust pci controller init to new firmware habanalabs: update comment in hl_boot_if.h habanalabs/gaudi: enhance reset message habanalabs: full FW hard reset support habanalabs/gaudi: disable CGM at HW initialization habanalabs: Revise comment to align with mirror list name ...
2021-01-11	Documentation: kbuild: Fix section reference	Viresh Kumar	1	-1/+1
	Section 3.11 was incorrectly called 3.9, fix it. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2021-01-10	Merge tag 'arc-5.11-rc3-fixes' of ↵	Linus Torvalds	5	-232/+3
	git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc Pull ARC fixes from Vineet Gupta: - Address the 2nd boot failure due to snafu in signal handling code (first was generic console ttynull issue) - misc other fixes * tag 'arc-5.11-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: ARC: [hsdk]: Enable FPU_SAVE_RESTORE ARC: unbork 5.11 bootup: fix snafu in _TIF_NOTIFY_SIGNAL handling include/soc: remove headers for EZChip NPS arch/arc: add copy_user_page() to <asm/page.h> to fix build error on ARC