|
The adapter is freed before we check its flags. This was caused
by commit 144be3d ("net/cxgb4: Avoid disabling PCI device for
twice"). The problem was reported by Intel's "0-day" tool.
The patch fixes it so that commit 144be3d does not have to be
reverted. This addresses bug#110450.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
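The ordering problem described above (checking flags on an object that
has already been freed) can be illustrated with a minimal userspace
sketch. The structure, the flag name and the function below are
hypothetical and are not taken from the cxgb4 driver; the only point is
that the flags are read while the object is still valid.

```c
#include <stdlib.h>

/* Hypothetical stand-ins; not the cxgb4 driver's real types or flags. */
#define SKETCH_DEV_ENABLED 0x1u

struct sketch_adapter {
    unsigned int flags;
};

/*
 * Release the adapter and report whether the device was still enabled.
 * The flags are copied to a local BEFORE free(), so nothing is read
 * from freed memory. Returns 1 if the device was enabled, else 0.
 */
static int release_adapter(struct sketch_adapter *a)
{
    unsigned int flags = a->flags;  /* read while the object is valid */

    free(a);
    return (flags & SKETCH_DEV_ENABLED) != 0;
}
```

A caller would use the return value to decide whether the device still
needs to be disabled, instead of dereferencing the adapter after the
free.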
|
|
We may retrieve the adapter's statistics during EEH recovery,
which should be disallowed. Otherwise, it could cause repeated
EEH errors, and EEH recovery would eventually fail.
The patch reuses the statistics lock and checks that the net_device
is attached before retrieving statistics, so that the problem is
avoided.
This addresses bug#110450.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
|
|
If an EEH error happens to the adapter and we have to remove it
from the system for some reason (e.g. more than 5 EEH errors
detected from the device in the last hour), the adapter will be
disabled twice, separately by eeh_err_detected() and remove_one(),
which triggers the following unexpected backtrace. The patch avoids
that.
This addresses bug#110450.
WARNING: at drivers/pci/pci.c:1431
CPU: 12 PID: 121 Comm: eehd Not tainted 3.13.0-rc7+ #1
task: c0000001823a3780 ti: c00000018240c000 task.ti: c00000018240c000
NIP: c0000000003c1e40 LR: c0000000003c1e3c CTR: 0000000001764c5c
REGS: c00000018240f470 TRAP: 0700 Not tainted (3.13.0-rc7+)
MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 28000024 XER: 00000004
CFAR: c000000000706528 SOFTE: 1
GPR00: c0000000003c1e3c c00000018240f6f0 c0000000010fe1f8 0000000000000035
GPR04: 0000000000000000 0000000000000000 00000000003ae509 0000000000000000
GPR08: 000000000000346f 0000000000000000 0000000000000000 0000000000003fef
GPR12: 0000000028000022 c00000000ec93000 c0000000000c11b0 c000000184ac3e40
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24: 0000000000000000 c0000000009398d8 c00000000101f9c0 c0000001860ae000
GPR28: c000000182ba0000 00000000000001f0 c0000001860ae6f8 c0000001860ae000
NIP [c0000000003c1e40] .pci_disable_device+0xd0/0xf0
LR [c0000000003c1e3c] .pci_disable_device+0xcc/0xf0
Call Trace:
[c0000000003c1e3c] .pci_disable_device+0xcc/0xf0 (unreliable)
[d0000000073881c4] .remove_one+0x174/0x320 [cxgb4]
[c0000000003c57e0] .pci_device_remove+0x60/0x100
[c00000000046396c] .__device_release_driver+0x9c/0x120
[c000000000463a20] .device_release_driver+0x30/0x60
[c0000000003bcdb4] .pci_stop_bus_device+0x94/0xd0
[c0000000003bcf48] .pci_stop_and_remove_bus_device+0x18/0x30
[c00000000003f548] .pcibios_remove_pci_devices+0xa8/0x140
[c000000000035c00] .eeh_handle_normal_event+0xa0/0x3c0
[c000000000035f50] .eeh_handle_event+0x30/0x2b0
[c0000000000362c4] .eeh_event_handler+0xf4/0x1b0
[c0000000000c12b8] .kthread+0x108/0x130
[c00000000000a168] .ret_from_kernel_thread+0x5c/0x74
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
|
|
Occasional failures have been seen with split-core mode and migration
where the message "KVM: couldn't grab cpu" appears. This increases
the length of time that we wait from 1ms to 10ms, which seems to
work around the issue.
Fixes: BZ 110865
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This patch fixes the EEH recovery issue in bnx2x.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
LTC-Bugzilla: #110449
|
|
On Tuleta systems, HTX reports data miscompares after EEH recovery.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
|
|
Add a memory barrier to ensure the valid bit is read before
any of the cqe payload is read. This fixes an issue seen
on Power where the cqe payload was getting loaded before
the valid bit. When this occurred, we saw an iotag out of
range error when a command completed, but since the iotag
looked invalid the command didn't get completed to scsi core.
Later we hit the command timeout, attempted to abort the command,
then waited for the aborted command to get returned. Since the
adapter had already returned the command, we timed out waiting
and ended up escalating EEH all the way to host reset. This
patch fixes this issue.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
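The ordering requirement above (read the valid bit first, the payload
only afterwards) can be sketched in portable C11 with a release store on
the producer side and an acquire load on the consumer side. This is only
an illustration under assumptions: the structure layout and function
names are hypothetical, and a real driver reading entries written by an
adapter would use the kernel's own barrier primitives rather than C11
atomics.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical completion-queue entry; not the driver's real layout. */
struct cqe {
    _Atomic uint32_t valid;  /* set by the producer after the payload */
    uint32_t iotag;          /* payload, written before `valid` is set */
};

/* Producer: write the payload, then publish it with a release store. */
static void cqe_publish(struct cqe *e, uint32_t iotag)
{
    e->iotag = iotag;
    atomic_store_explicit(&e->valid, 1, memory_order_release);
}

/*
 * Consumer: load the valid flag with acquire ordering BEFORE touching
 * the payload, so the payload load cannot be satisfied ahead of the
 * flag. Returns true and fills *iotag only if the entry is valid.
 */
static bool cqe_consume(struct cqe *e, uint32_t *iotag)
{
    if (!atomic_load_explicit(&e->valid, memory_order_acquire))
        return false;
    *iotag = e->iotag;
    return true;
}
```

Without the acquire ordering, a weakly ordered CPU is allowed to load
the payload before the flag, which matches the stale-iotag symptom the
message describes.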
|
|
Pulled from 3.10.23 stable for bug 110340.
From abb5100737bba3f82b5514350fea89ca361ac66c Mon Sep 17 00:00:00 2001
From: Peter Hurley <peter@hurleysoftware.com>
Date: Sat, 3 May 2014 14:04:59 +0200
Subject: n_tty: Fix n_tty_write crash when echoing in raw mode
commit 4291086b1f081b869c6d79e5b7441633dc3ace00 upstream.
The tty atomic_write_lock does not provide an exclusion guarantee for
the tty driver if the termios settings are LECHO & !OPOST. And since
it is unexpected and not allowed to call TTY buffer helpers like
tty_insert_flip_string concurrently, this may lead to crashes when
concurrent writers call pty_write. In that case the following two
writers:
* the ECHOing from a workqueue and
* pty_write from the process
race and can overflow the corresponding TTY buffer as follows.
If we look into tty_insert_flip_string_fixed_flag, there is:
    int space = __tty_buffer_request_room(port, goal, flags);
    struct tty_buffer *tb = port->buf.tail;
    ...
    memcpy(char_buf_ptr(tb, tb->used), chars, space);
    ...
    tb->used += space;
so the race of the two can result in something like this:
        A                                       B
__tty_buffer_request_room
                                        __tty_buffer_request_room
memcpy(buf(tb->used), ...)
tb->used += space;
                                        memcpy(buf(tb->used), ...) ->BOOM
B's memcpy is past the tty_buffer due to the previous A's tb->used
increment.
Since the N_TTY line discipline input processing can output
concurrently with a tty write, obtain the N_TTY ldisc output_lock to
serialize echo output with normal tty writes. This ensures the tty
buffer helper tty_insert_flip_string is not called concurrently and
everything is fine.
Note that this is nicely reproducible by an ordinary user using
forkpty and some setup around that (raw termios + ECHO). And it is
present in kernels at least after commit
d945cb9cce20ac7143c2de8d88b187f62db99bdc (pty: Rework the pty layer to
use the normal buffering logic) in 2.6.31-rc3.
js: add more info to the commit log
js: switch to bool
js: lock unconditionally
js: lock only the tty->ops->write call
References: CVE-2014-0196
Reported-and-tested-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Pulled from 3.10.23 stable for bug 110340.
From a9ded882d5168e2fd5c0c20e2874f85c56016b4b Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Fri, 28 Mar 2014 20:41:50 +0100
Subject: KVM: ioapic: fix assignment of ioapic->rtc_status.pending_eoi
(CVE-2014-0155)
commit 5678de3f15010b9022ee45673f33bcfc71d47b60 upstream.
QE reported that they got the BUG_ON in ioapic_service to trigger.
I cannot reproduce it, but there are two reasons why this could happen.
The less likely but also easiest one is when kvm_irq_delivery_to_apic
does not deliver to any APIC and returns -1.
Because irqe.shorthand == 0, the kvm_for_each_vcpu loop in that
function is never reached. However, you can target the similar loop in
kvm_irq_delivery_to_apic_fast; just program a zero logical destination
address into the IOAPIC, or an out-of-range physical destination address.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
We have observed that on machines with all their memory in a single
node, it is possible to hit an out of memory situation where kernel
allocations (which can't use the CMA pool) fail, triggering the OOM
killer, yet reclaim doesn't start because there is still free memory
in the CMA pool. To alleviate this situation somewhat, this reduces
the default CMA pool size from 5% to 3% of system memory. The 3%
should still be enough in most situations, and if not, the user can
specify a different amount on the kernel command line.
This should help with BZ 110181.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
As Ben suggested, it's useful to dump the PE's location code
for site engineers when hitting EEH errors. The patch introduces
the function eeh_pe_loc_get() to retrieve the location code from
the device tree so that we can output it when hitting EEH errors.
If the primary PE bus is the root bus, the PHB's dev-node is tried
prior to the root port's dev-node. Otherwise, the dev-node of the
upstream bridge of the primary PE bus is checked for the location
code directly.
This fixes BZ 109585. Please apply to the next build for GA.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The first bug is that we are testing the C (changed) bit in the hashed
page table without first doing a tlbie. The architecture allows the
update of the C bit to happen at any time up until we do a tlbie for
the page. However, we don't want to do a tlbie for every page on every
pass of a migration operation. Thus we do the tlbie if there are no
vcpus currently running, which would indicate the final phase of
migration. If any vcpus are running then reading the dirty log is
already racy because pages could get dirtied immediately after we
check them. Also, we don't need to do the tlbie if the HPT entry
doesn't allow writing, since in that case the C bit can not get set.
The second bug is that in the case where we see a dirty 16MB page
followed by a dirty 4kB page (both mapping to the same guest real
address), we return 1 rather than 16MB / PAGE_SIZE. The return value,
indicating the number of dirty pages, needs to reflect the largest
dirty page we come across, not the last dirty page we see.
Fixes: 109551 (this time for sure)
Signed-off-by: Paul Mackerras <paulus@samba.org>
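The second bug described above comes down to keeping the largest page
count seen rather than the last one. A minimal sketch of that rule,
assuming a 4kB system page and using hypothetical names (this is not
the kernel's kvm_test_clear_dirty code):

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL  /* assumed system page size */

/*
 * Given the sizes (in bytes) of the dirty HPT entries that map one
 * guest real address, return how many system pages to mark dirty:
 * the count for the LARGEST dirty page seen, not the last one.
 * Returns 0 when no entry was dirty.
 */
static unsigned long dirty_npages(const unsigned long *dirty_sizes, size_t n)
{
    unsigned long npages = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned long cur = dirty_sizes[i] / SKETCH_PAGE_SIZE;

        if (cur == 0)
            cur = 1;        /* any dirty entry covers at least one page */
        if (cur > npages)
            npages = cur;   /* keep the maximum, never overwrite it */
    }
    return npages;
}
```

With a dirty 16MB entry followed by a dirty 4kB entry, this yields
16MB / 4kB = 4096 pages, where the "last one wins" logic would have
returned 1.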
|
|
The dirty map has one bit per system page (4K/64K), and when we populate
the dirty map, we reset the Change bit in the HPT, which is expected to
contain pages smaller than or equal to the system page size. This works
until we start using huge pages (16MB). In that case, we mark just a
single system page dirty and miss the rest of the 16MB page, which may be
dirty as well.
This changes kvm_test_clear_dirty to return the actual number of pages
which is calculated from HPT entry.
This changes kvmppc_hv_get_dirty_log() to make pages dirty starting from
the rounded guest physical page number.
[paulus@samba.org - don't advance i in the loop to set dirty bits, so
that we make sure to clear C in all HPTEs.]
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
table.
We reserve 5% of total RAM for CMA allocation, and not using it can
result in us running out of NUMA node memory in specific
configurations. One caveat is that we may not get a node-local HPT with
a pinned vcpu configuration. But currently libvirt also pins the vcpus
to the cpuset only after creating the hash page table.
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
|
|
Commit 63fa7d4 ("powerpc/eeh: Escalate error on non-existing PE")
escalates the frozen state on a non-existing PE to a fenced PHB. It
was meant to improve kdump reliability. After that, commit 716a0e8
("powerpc/powernv: Reset PHB in kdump kernel") was introduced to
apply a complete reset on all PHBs to increase kdump reliability.
Commit 63fa7d4 is therefore no longer useful, and issuing a PHB reset
on a PHB that is not fenced at the hardware level would cause
unexpected problems. So I'd like to revert it.
This addresses bug#109562.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
|
|
When we have the corner case of a frozen parent and child PE at the
same time, we have to handle the frozen parent PE prior to the
child. Without clearing the frozen state on the parent PE, the child
PE can't be recovered successfully.
There are 2 ways (polling and interrupt) for a frozen PE to be
reported. If there is a frozen parent PE, we have to report
and handle it first.
This addresses bug#109562.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
|
|
Since commit cb523e09 ("powerpc/eeh: Avoid I/O access during PE
reset"), the PE is kept in frozen state at the hardware level until
the PE reset is completely done. After that, we explicitly clear
the frozen state of the affected PE. However, there might be
frozen child PEs of the affected PE, and we need to clear their
frozen state as well. Otherwise, the recovery is going to fail.
This addresses bug#109562.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
|
|
Currently we forward MCEs to the guest which have been recovered by guest,
and for unhandled errors we do not deliver the MCE to the guest. It looks
like, with no FWNMI support in qemu, the guest just panics whenever we
deliver recovered MCEs to it. Also, the existing code used to return to the
host for unhandled errors, which caused the guest to hang with soft lockups
inside the guest and made it difficult to recover the guest instance.
This patch now forwards all fatal MCEs to the guest, causing the guest to
crash/panic. For recovered errors we just go back to normal functioning of
the guest instead of returning to the host. This fixes the soft lockup
issues in the guest.
This patch also fixes an issue where guest MCE events were not logged to
host console.
This patch fixes bz108165 and bz108413
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
|
|
During split-core operations, one of the online CPUs is nominated as the
"master" and then stop_machine() is invoked to perform the split/unsplit
procedure. Between these 2 steps, if CPU hotplug occurs and takes the
just nominated "master" CPU offline, then the split/unsplit procedure
does not complete properly and leads to undesirable effects.
So protect the entire split-core operation with get/put_online_cpus()
to synchronize with CPU hotplug.
Fixes bz 105509.
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
The hardware manages the resync during split-core operations, on newer
revisions (DD2.1 and higher). So we don't need to call opal_resync_timebase()
on those systems.
Fixes bz 105856.
[Srivatsa: Added changelog]
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
We don't see the MCE counter increase in /proc/interrupts, which gives the
false impression that no MCE occurred even when there were MCE events.
The machine check early handling was added for PowerKVM, and we missed
incrementing the MCE count in the early handler.
We also increment the MCE counter in the machine_check_exception call, but
in most cases where we handle the error, the hypervisor never reaches there
unless it's fatal and we want to crash. Only in a fatal situation may we
see a double increment of the MCE count. We need to fix that. But for
now it is always good to have some count increased instead of zero.
This fixes the MCE count issue mentioned in bz108413
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
|
|
Without this, we get lockdep errors
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Currently the machine check handler does not check for stack overflow on
nested machine checks. If we repeatedly hit another MCE from the same
address while inside the machine check handler, we risk a stack
overflow, which can cause massive memory corruption. This patch limits the
nested MCE level to 4 and panics when we cross level 4.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The size of the sysparam sysfs files is determined from the device tree
at boot. However the buffer is hard coded to 64 bytes. If we encounter a
parameter that is larger than 64 bytes, or misparse the device tree, the
buffer will overflow when reading or writing to the parameter.
Check it at discovery time, and if the parameter is too large, do not
create a sysfs entry for it.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The sysparam code currently uses the userspace supplied number of
bytes when memcpy()ing in to a local 64-byte buffer.
Limit the maximum number of bytes by the size of the buffer.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
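Limiting the copy length to the buffer size, as described above, can be
sketched as follows. The function name and constant are hypothetical;
the actual patch may reject oversized writes with an error rather than
truncate them, which this message does not spell out.

```c
#include <stddef.h>
#include <string.h>

#define PARAM_BUF_SIZE 64  /* the 64-byte buffer mentioned in the message */

/*
 * Copy at most PARAM_BUF_SIZE bytes from src into dst, regardless of
 * the caller-supplied count. Returns the number of bytes copied.
 */
static size_t copy_param(char dst[PARAM_BUF_SIZE], const char *src,
                         size_t count)
{
    if (count > PARAM_BUF_SIZE)
        count = PARAM_BUF_SIZE;  /* never trust the caller's length */
    memcpy(dst, src, count);
    return count;
}
```

The clamp happens before the memcpy(), so an oversized count can no
longer write past the end of the local buffer.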
|
|
The OPAL calls return int64_t values, which the sysparam code
stores in an int, and the sysfs callback returns ssize_t. Make the code
easier to read by consistently using ssize_t.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
When a sysparam query in OPAL returned a negative value (error code),
sysfs would spew out a decent chunk of memory; almost 64K more than
expected. This was traced to a sign/unsigned mix up in the OPAL sysparam
sysfs code at sys_param_show.
The return value of sys_param_show is a ssize_t, calculated using
return ret ? ret : attr->param_size;
Alan Modra explains:
"attr->param_size" is an unsigned int, "ret" an int, so the overall
expression has type unsigned int. Result is that ret is cast to
unsigned int before being cast to ssize_t.
Instead of using the ternary operator, set ret to the param_size if an
error is not detected. The same bug exists in the sysfs write callback;
this patch fixes it in the same way.
A note on debugging this next time: on my system gcc will warn about
this if compiled with -Wsign-compare, which is not enabled by -Wall,
only -Wextra.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
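The conversion Alan Modra describes can be reproduced in a few lines of
userspace C. In this sketch int64_t stands in for a 64-bit ssize_t and
int is assumed to be 32 bits wide; the function names are hypothetical.

```c
#include <stdint.h>

/*
 * Buggy form: the ternary mixes int and unsigned int, so the whole
 * expression has type unsigned int. A negative ret is converted to a
 * large unsigned value before being widened to 64 bits.
 */
static int64_t show_buggy(int ret, unsigned int param_size)
{
    return ret ? ret : param_size;
}

/*
 * Fixed form: widen ret as a signed value first and only substitute
 * param_size when no error was reported.
 */
static int64_t show_fixed(int ret, unsigned int param_size)
{
    int64_t out = ret;

    if (out == 0)
        out = param_size;
    return out;
}
```

Under those assumptions, show_buggy(-1, 4) returns a large positive
number instead of -1, so the caller treats the error code as a byte
count, while show_fixed(-1, 4) returns -1.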
|
|
Today CPUs in fast sleep are being woken up to handle their timers
by the tick broadcast framework using a hrtimer queued on a nominated
broadcast CPU. The hrtimer is programmed for the earlier of the next
wakeup and a broadcast period which happens to be a jiffy. This
programming is being done incorrectly today. The current time
is noted, the tick broadcast interrupt handler is called, and then
the time at which the hrtimer needs to be programmed is decided. By
then the noted current time is stale and the hrtimer is forwarded
much further ahead than required, leading to delayed broadcast
interrupts being delivered to sleeping cpus.
Fix this by noting the current time just before programming the hrtimer.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
|
|
The current code does not check for unhandled/unrecovered errors and returns
from the interrupt if it is a recoverable exception, which in turn triggers
the same machine check exception in a loop, causing the hypervisor to become
unresponsive.
This patch fixes this situation and forces the hypervisor to panic for
unhandled/unrecovered errors.
This patch also fixes another issue where the unrecoverable_exception routine
was called in real mode in case of an unrecoverable exception (MSR_RI = 0).
This causes another exception, vector 0x300 (data access), during the system
crash, leading to confusion while debugging the cause of the crash.
With the above fixes we now print correct console messages (see below) while
crashing the system in case of unhandled/unrecoverable machine checks.
--------------
Severe Machine check interrupt [[Not recovered]
Initiator: CPU
Error type: UE [Instruction fetch]
Effective address: 0000000030002864
Oops: Machine check, sig: 7 [#1]
SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: bork(O) bridge stp llc kvm [last unloaded: bork]
CPU: 36 PID: 55162 Comm: bash Tainted: G O 3.14.0mce #1
task: c000002d72d022d0 ti: c000000007ec0000 task.ti: c000002d72de4000
NIP: 0000000030002864 LR: 00000000300151a4 CTR: 000000003001518c
REGS: c000000007ec3d80 TRAP: 0200 Tainted: G O (3.14.0mce)
MSR: 9000000000041002 <SF,HV,ME,RI> CR: 28222848 XER: 20000000
CFAR: 0000000030002838 DAR: d0000000004d0000 DSISR: 00000000 SOFTE: 1
GPR00: 000000003001512c 0000000031f92cb0 0000000030078af0 0000000030002864
GPR04: d0000000004d0000 0000000000000000 0000000030002864 ffffffffffffffc9
GPR08: 0000000000000024 0000000030008af0 000000000000002c c00000000150e728
GPR12: 9000000000041002 0000000031f90000 0000000010142550 0000000040000000
GPR16: 0000000010143cdc 0000000000000000 00000000101306fc 00000000101424dc
GPR20: 00000000101424e0 000000001013c6f0 0000000000000000 0000000000000000
GPR24: 0000000010143ce0 00000000100f6440 c000002d72de7e00 c000002d72860250
GPR28: c000002d72860240 c000002d72ac0038 0000000000000008 0000000000040000
NIP [0000000030002864] 0x30002864
LR [00000000300151a4] 0x300151a4
Call Trace:
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
---[ end trace 7285f0beac1e29d3 ]---
Sending IPI to other CPUs
IPI complete
OPAL V3 detected !
--------------
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The PPC_MSG_TIMER IPI message slot was introduced for the tick broadcast
IPIs which are required to wakeup sleeping CPUs. The decrementer of the
CPUs that enter fast sleep stops as a consequence of entering the idle
state. Therefore such CPUs have to be woken up in time to handle their
timers by a broadcast CPU which sends the PPC_MSG_TIMER IPIs to them.
This IPI message is being parsed wrongly in smp_ipi_demux(). Thus the
tick broadcast interrupt handler is never executed on the sleeping CPU.
This could have led to unpleasant side effects like not handling timers
in time on the sleeping cpus. But since the sleeping CPUs still receive
the tick broadcast IPI, they are awoken from the idle state and their
decrementers are back in action.
As a result, it's possible that they are managing to handle timers
before they go to sleep again.
Hence timers are being handled on the sleeping cpus, although the tick
broadcast interrupt handler, which is actually supposed to ensure that,
is never being called today due to the wrong number of shift bits while
parsing the tick broadcast IPI.
However we need to note that as a result of this discrepancy, timer
handling on the sleeping cpus may be unstable. This could be one of
the reasons we are observing some softlockups in the cpuidle wakeup path.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
PCI resets will attempt to take the device_lock for any device to be
reset. This is a problem if that lock is already held, for instance
in the device remove path. It's not sufficient to simply kill the
user process or skip the reset if called after .remove as a race could
result in the same deadlock. Instead, we handle all resets as "best
effort" using the PCI "try" reset interfaces. This prevents the user
from being able to induce a deadlock by triggering a reset.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
(cherry picked from commit 890ed578df82f5b7b5a874f9f2fa4f117305df5f)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
LTC-Bugzilla: #104951
|
|
When doing a function/slot/bus reset PCI grabs the device_lock for each
device to block things like suspend and driver probes, but call paths exist
where this lock may already be held. This creates an opportunity for
deadlock. For instance, vfio allows userspace to issue resets so long as
it owns the device(s). If a driver unbind .remove callback races with
userspace issuing a reset, we have a deadlock as userspace gets stuck
waiting on device_lock while another thread has device_lock and waits for
.remove to complete. To resolve this, we can make a version of the reset
interfaces which use trylock. With this, we can safely attempt a reset and
return error to userspace if there is contention.
[bhelgaas: the deadlock happens when A (userspace) has a file descriptor for
the device, and B waits in this path:
driver_detach
device_lock # take device_lock
__device_release_driver
pci_device_remove # pci_bus_type.remove
vfio_pci_remove # pci_driver .remove
vfio_del_group_dev
wait_event(vfio.release_q, !vfio_dev_present) # wait (holding device_lock)
Now B is stuck until A gives up the file descriptor. If A tries to acquire
device_lock for any reason, we deadlock because A is waiting for B to release
the lock, and B is waiting for A to release the file descriptor.]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
(cherry picked from commit 61cf16d8bd38c3dc52033ea75d5b1f8368514a17)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
LTC-Bugzilla: #104951
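The trylock idea above (attempt the reset, and report contention
instead of blocking) can be sketched in userspace with a C11 atomic
flag. Everything here is hypothetical: the structure, the function
names and the -1 return value are illustrative and are not the PCI
core's interfaces.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical device; the flag stands in for device_lock. */
struct sketch_dev {
    atomic_flag lock;
    int resets;
};

/* Try to take the lock without blocking; true on success. */
static bool sketch_trylock(struct sketch_dev *d)
{
    return !atomic_flag_test_and_set_explicit(&d->lock,
                                              memory_order_acquire);
}

static void sketch_unlock(struct sketch_dev *d)
{
    atomic_flag_clear_explicit(&d->lock, memory_order_release);
}

/*
 * "Best effort" reset: returns 0 on success, or -1 if the lock is
 * already held, instead of waiting and risking a deadlock.
 */
static int sketch_try_reset(struct sketch_dev *d)
{
    if (!sketch_trylock(d))
        return -1;
    d->resets++;  /* placeholder for the actual reset work */
    sketch_unlock(d);
    return 0;
}
```

If another path (such as a remove callback) already holds the lock, the
caller gets an error it can pass back to userspace rather than sleeping
on the lock while the holder waits for the caller.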
|
|
When PCI_ERS_RESULT_CAN_RECOVER is returned from device drivers, the
EEH core should enable I/O and DMA for the affected PE. However,
DMA was not enabled in eeh_handle_normal_event().
Besides, the frozen state of the affected PE should be cleared
after successful recovery, but it was not.
The patch fixes both of these issues. This addresses
bug#105179.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
|
|
As pointed out by Alexey, we hit a build failure without
exporting the functions when (CONFIG_VFIO_PCI == M). This should
have been part of commit 9762b50 ("drivers/vfio/pci: Fix MSIx message
lost").
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
|
|
At present, if a PR guest on a POWER8 machine tries to access some
disabled functionality such as transactional memory, the result is
a facility-unavailable interrupt, which isn't handled in
kvmppc_handle_exit_pr(), resulting in a call to BUG(), crashing
the PR host kernel.
This adds code to handle the facility-unavailable interrupts and
give the guest an illegal instruction interrupt, instead of crashing
the PR host.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This provides basic support for the KVM_REG_PPC_ARCH_COMPAT register
in PR KVM. At present the value is sanity-checked when set, but
doesn't actually affect anything yet.
Implementing this makes it possible to use a qemu command-line
argument such as "-cpu host,compat=power7" on a POWER8 machine,
just as we would with HV KVM.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
The Power ISA states that an mtspr or mfspr to/from an unimplemented
SPR should be a no-op in privileged mode, rather than causing a
program interrupt (0x700 vector), with the exception of mtspr to SPR 0
and mfspr from SPRs 0, 4, 5 or 6.
Currently our SPR emulation code doesn't follow this rule. This
modifies the code in kvmppc_core_emulate_m[ft]spr_pr() to check
the PR bit in the MSR when we detect an unknown SPR number, and
only return EMULATE_FAIL (which results in a program interrupt)
if PR is 0 or the SPR number is one of the ones which are specifically
defined to cause a program interrupt.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
each other
It's possible that the tick_broadcast_force_mask contains cpus which are not
in cpu_online_mask when a broadcast tick occurs. This could happen under the
following circumstance, assuming CPU1 is among the CPUs waiting for broadcast
and is the cpu being hotplugged out.
CPU0                                    CPU1

Run CPU_DOWN_PREPARE notifiers

Start stop_machine                      Gets woken up by IPI to run
                                        stop_machine, sets itself in
                                        tick_broadcast_force_mask if the
                                        time of broadcast interrupt is around
                                        the same time as this IPI.

                                        Start stop_machine
                                          set_cpu_online(cpu1, false)
End stop_machine                        End stop_machine

Broadcast interrupt
  Finds that cpu1 in
  tick_broadcast_force_mask is offline
  and triggers the WARN_ON in
  tick_handle_oneshot_broadcast()

                                        Clears all broadcast masks
                                        in CPU_DEAD stage.
While the hotplugged cpu clears its bit in the tick_broadcast_oneshot_mask
and tick_broadcast_pending mask during BROADCAST_EXIT, it *sets* its bit
in the tick_broadcast_force_mask if the broadcast interrupt is found to be
around the same time as the present time. Today we clear all the broadcast
masks and shutdown tick devices in the CPU_DEAD stage. But as shown above
the broadcast interrupt could occur before this stage is reached and the
WARN_ON() gets triggered when it is found that the tick_broadcast_force_mask
contains an offline cpu.
Please note that a scenario such as the above will occur *only if the
broadcast interrupt is delayed under some circumstance*. Ideally the
broadcast interrupt in the above scenario should have occurred before we
reach the irq_disabled stage of stop_machine and should have seen a valid
broadcast mask. But for some reason that is yet to be understood it is
getting delayed, leading to the above scenario.
Another point to notice is that for a small duration, between the CPU_DYING
stage where the hotplugged cpu clears its bit from the cpu_online_mask and
the CPU_DEAD stage where the same bit gets cleared from the
broadcast_force_mask, these two masks are out of sync with each other, which
makes the above scenario possible.
The temporary solution to this is to move the clearing of broadcast masks to
the CPU_DYING notification stage. The reason is, it is during this stage that
the hotplugged cpu clears itself from the cpu_online_mask() and runs
notifications relevant to this stage including those to clear the broadcast masks
(with this patch).
All this, while the rest of the cpus are busy spinning in stop_machine to notice
this change. By the time this stage ends and all cpus resume work, the hotplugged
cpu would have cleared itself from the cpu_online_mask and the broadcast cpu mask
thus keeping them in sync with each other at such times when the rest of the cpus
can read these masks.
Since the above mentioned delay in the broadcast interrupt has not triggered
any soft lockups so far, we are assuming it's a non-fatal issue and have this
patch to prevent the warning from popping up in this case.
Suggested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@au1.ibm.com>
|
|
The changes to increment _mapcount were added w.r.t. THP change
3526741f0964c88bc2ce511e1078359052bf225b. Later this was fixed
to handle the hugetlb case in 44518d2b32646e37b4b7a0813bbbe98dc21c7f8f.
Instead of backporting 44518, we can remove the _mapcount update since
we don't support THP for kvm host yet.
Fixes: bz# 108558
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
|
|
In the kdump scenario, the first kernel doesn't shut down PCI devices
and the kdump kernel cleans the PHB IODA table at early probe time.
That means the kdump kernel can't handle PCI transactions left
pending by the first kernel. Otherwise, lots of EEH errors and frozen
PEs will be detected.
In order to avoid the EEH errors, the PHB is reset to drop all
PCI transactions from the first kernel. It looks good on P7, but needs
to be verified on P8.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The problem was initially reported by Wendy, who tried to pass through
an IPR adapter, which was connected to the PHB root port directly, to a
KVM based guest. When doing that, pci_reset_bridge_secondary_bus() was
called by the VFIO driver and linkDown was detected by the root port.
That caused all PEs to be frozen.
The patch fixes the issue by routing the reset for the secondary bus
of the root port to the underlying firmware. For that, one more weak
function, pci_reset_secondary_bus(), is introduced so that individual
platforms can override it and do a specific reset for the bridge's
secondary bus.
Reported-by: Wendy Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Basically, we have 3 types of resets to fulfil PE reset: fundamental,
hot and PHB reset. For the latter 2 cases, we need the PCI bus reset hold
and settlement delays as specified by the PCI spec. PowerNV and pSeries
platforms run on top of different firmware and some of the
delays are already covered by the underlying firmware (PowerNV).
The patch unifies the delays so they are done in the backend instead of
the EEH core.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Resetting a root port involves more work than resetting PCIe switch
ports, so the root port reset should be done in firmware instead
of the kernel itself. The problem was introduced by commit 5b2e198e
("powerpc/powernv: Rework EEH reset").
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
In pseries_eeh_get_state(), EEH_STATE_UNAVAILABLE is always
overwritten by EEH_STATE_NOT_SUPPORT because of a missing
"break" there. The patch fixes the issue.
Reported-by: Joe Perches <joe@perches.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
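The missing-"break" pattern described in this message can be sketched in plain C. This is only an illustration, not the code of pseries_eeh_get_state(): the enum, the function names and the firmware code values (0 and 5) are assumptions made up for the example.

```c
/* Hypothetical state values; only the names echo the commit message. */
enum demo_state {
	DEMO_STATE_OK,
	DEMO_STATE_UNAVAILABLE,
	DEMO_STATE_NOT_SUPPORT,
};

/* Buggy shape: the "unavailable" case runs on into the default case. */
static enum demo_state demo_get_state_buggy(int fw_code)
{
	enum demo_state result;

	switch (fw_code) {
	case 0:
		result = DEMO_STATE_OK;
		break;
	case 5:
		result = DEMO_STATE_UNAVAILABLE;
		/* fall through - this is the bug, "break" is missing */
	default:
		result = DEMO_STATE_NOT_SUPPORT;
	}
	return result;
}

/* Fixed shape: every case ends with its own "break". */
static enum demo_state demo_get_state_fixed(int fw_code)
{
	enum demo_state result;

	switch (fw_code) {
	case 0:
		result = DEMO_STATE_OK;
		break;
	case 5:
		result = DEMO_STATE_UNAVAILABLE;
		break;
	default:
		result = DEMO_STATE_NOT_SUPPORT;
		break;
	}
	return result;
}
```

With the buggy variant, code 5 is reported as "not supported"; with the fixed variant it is reported as "unavailable".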
|
|
Once one specific PE has been marked as EEH_PE_ISOLATED, it's in
the middle of recovery or has been removed permanently. We needn't report
the frozen PE again. Otherwise, we will endlessly report the
same frozen PE.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The issue was detected in a somewhat complicated test case where
we have multiple hierarchical PEs, as shown in the following figure:
+-----------------+
| PE#3     p2p#0  |
|          p2p#1  |
+-----------------+
         |
+-----------------+
| PE#4     pdev#0 |
|          pdev#1 |
+-----------------+
PE#4 (which has 2 PCI devices) is the child of PE#3, which has 2 p2p
bridges. We accidentally hit a less-known scenario: PE#4 was removed
permanently from the system because of permanent failure (e.g.
exceeding the max allowed failure times in the last hour), then we detected
EEH errors on PE#3 and tried to recover it. However, eeh_dev instances
for pdev#0/1 were not detached from PE#4, which was still connected to
PE#3. All of that was because we rely on count-based
pcibios_release_device(), which isn't reliable enough. When doing
recovery for PE#3, we still applied hotplug on PE#4 and pdev#0/1, which
are not valid any more. Eventually, we ran into a kernel crash.
The patch fixes the above issue from two aspects. For unplug, we simply
skip those permanently removed PEs, whose state is (EEH_PE_STATE_ISOLATED
&& !EEH_PE_STATE_RECOVERING) and whose frozen count is greater
than EEH_MAX_ALLOWED_FREEZES. For plug, we mark all permanently
removed EEH devices with EEH_DEV_REMOVED and return 0xFF's on reads of
their PCI config space so that the PCI core will omit them.
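The "permanently removed" test used for unplug can be sketched roughly as follows. This is a minimal model, not the EEH core code: struct demo_pe, the DEMO_* flag values and the limit of 5 are assumptions standing in for the real definitions.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the EEH flags and limit named above. */
#define DEMO_PE_ISOLATED		(1 << 0)
#define DEMO_PE_RECOVERING		(1 << 1)
#define DEMO_MAX_ALLOWED_FREEZES	5

struct demo_pe {
	int state;		/* bitmask of DEMO_PE_* flags */
	int freeze_count;	/* freezes counted for this PE */
};

/* True when the PE is isolated, not being recovered, and over the limit. */
static bool demo_pe_removed_permanently(const struct demo_pe *pe)
{
	bool isolated = (pe->state & DEMO_PE_ISOLATED) != 0;
	bool recovering = (pe->state & DEMO_PE_RECOVERING) != 0;

	return isolated && !recovering &&
	       pe->freeze_count > DEMO_MAX_ALLOWED_FREEZES;
}
```

A PE that is isolated but still recovering, or one that has not exceeded the freeze limit, is not treated as removed and would still take part in hotplug.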
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The patch introduces bootarg "eeh=off" to disable EEH functionality.
Also, it creates /sys/kernel/debug/powerpc/eeh_enable to disable
or enable EEH functionality. By default, we have the functionality
enabled.
For the PowerNV platform, we fall back to the conventional
mechanism of clearing frozen PE during PCI config access if we're
going to disable EEH functionality. Otherwise, we rely on
EEH for error recovery.
The patch also fixes the issue that we failed to cover the case
of disabled EEH functionality in function ioda_eeh_event(). Those
events driven by interrupt should be cleared to avoid endless
reporting.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
There are 2 EEH subsystem variables: eeh_subsystem_enabled and
eeh_probe_mode. We needn't maintain 2 variables; we can just
have one variable and introduce different flags. The patch also
introduces an additional flag, EEH_FORCE_DISABLED, which will be used
to disable the EEH subsystem via boot parameter ("eeh=off") in future.
Besides, the patch also introduces flag EEH_ENABLED, which will be
changed to disable or enable EEH functionality on the fly through
a debugfs entry in future.
With the patch applied, the criteria to check whether EEH
functionality is enabled are changed to:
!EEH_FORCE_DISABLED && EEH_ENABLED : Enabled
Other cases                        : Disabled
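The criteria above amount to a two-flag check on a single variable. A minimal sketch, assuming hypothetical bit values for the two flags (the real values are not given in this message):

```c
#include <stdbool.h>

/* Hypothetical bit assignments for the two flags. */
#define DEMO_EEH_ENABLED	0x1
#define DEMO_EEH_FORCE_DISABLED	0x2

/* Enabled only when not force-disabled and the enable bit is set. */
static bool demo_eeh_enabled(int flags)
{
	return !(flags & DEMO_EEH_FORCE_DISABLED) &&
	       (flags & DEMO_EEH_ENABLED);
}
```

Setting the force-disable bit wins over the enable bit, which matches the "Other cases" row.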
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
When calling into eeh_gather_pci_data() on the pSeries platform, we
possibly don't have a pci_dev instance yet, but eeh_dev is always
ready. So we use the cached capability from eeh_dev instead of pci_dev
for the log dump there. In order to keep things unified, we also cache
PCI capability positions in eeh_dev for PowerNV.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The patch replaces printk(KERN_WARNING ...) with pr_warn() in the
function eeh_gather_pci_data().
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
We have suffered from recursive frozen PEs a lot, which were caused
by IO accesses during the PE reset. Ben came up with the good
idea to keep the PE frozen until recovery (BAR restore) gets done.
With that, IO accesses during PE reset are dropped by hardware
and won't incur recursive frozen PEs any more.
The patch implements the idea. We don't clear the frozen state
until PE reset is done completely. During the period, the EEH
core expects unfrozen state from backend to keep going. So we
have to reuse EEH_PE_RESET flag, which has been set during PE
reset, to return normal state from backend. The side effect is
we have to clear the frozen state twice (by PE reset and by clearing it
explicitly), but that's harmless.
We have some limitations on pHyp. pHyp doesn't allow enabling
IO or DMA for an unfrozen PE. So we don't enable them on an unfrozen PE
in eeh_pci_enable(). We have to enable IO before grabbing logs on
pHyp. Otherwise, 0xFF's are always returned from PCI config space.
Also, we had the wrong return value from eeh_pci_enable() for the
EEH_OPT_THAW_DMA case. The patch fixes it too.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The EEH PowerNV backends need to use their own PCI config
accessors as the normal ones could be blocked during PE reset.
The patch also removes the unnecessary parameter "hose" from the
function ioda_eeh_bridge_reset().
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
We've observed multiple PE reset failures because of PCI-CFG
access during that period. Potentially, some device drivers
can't support EEH very well and they can't put the device into a
motionless state before PE reset. So those device drivers might
produce PCI-CFG accesses during PE reset. Also, we could have
PCI-CFG access from user space (e.g. "lspci"). Since access to
a frozen PE should return 0xFF's, we can block PCI-CFG access
during the period of PE reset so that we won't get recursive EEH
errors.
The patch adds flag EEH_PE_RESET, which is kept during PE reset.
The PowerNV/pSeries PCI-CFG accessors reuse the flag to block
PCI-CFG accordingly.
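A rough sketch of a config accessor that honours such a flag is shown here. It is a model only: struct demo_cfg_dev, the DEMO_PE_RESET value and the -1 return code are assumptions and not the PowerNV or pSeries accessor API.

```c
#include <stdint.h>

#define DEMO_PE_RESET	(1 << 2)	/* hypothetical flag value */
#define DEMO_NR_REGS	4

struct demo_cfg_dev {
	int pe_state;			/* DEMO_PE_RESET is set while resetting */
	uint32_t cfg[DEMO_NR_REGS];	/* pretend config-space registers */
};

/* Returns 0 on success, -1 when the read is blocked or out of range. */
static int demo_read_config(const struct demo_cfg_dev *dev, unsigned int reg,
			    uint32_t *val)
{
	if (reg >= DEMO_NR_REGS || (dev->pe_state & DEMO_PE_RESET)) {
		*val = 0xFFFFFFFFu;	/* same data a frozen PE would return */
		return -1;
	}
	*val = dev->cfg[reg];
	return 0;
}
```

The caller sees all-ones during the reset window without the hardware ever being touched, so no new EEH error can be raised by the access.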
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
When doing PE reset, EEH_PE_ISOLATED is cleared unconditionally.
However, we should clear it only if the PE reset has cleared the
frozen state successfully. Otherwise, the flag should be kept.
The patch fixes the issue.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
For some fields (e.g. LEM, MMIO, DMA) in the PHB diag-data dump, it's
meaningless to print them just because they have non-zero values in the
corresponding mask registers, since we always have non-zero values
in the mask registers. The patch only prints those fields if we
have non-zero values in the primary registers (e.g. LEM, MMIO, DMA
status) so that we can save a couple of lines. The patch also removes
the unnecessary blank line before "brdgCtl:" and the two leading spaces used
as prefix in each line, as Ben suggested.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Commit c7062d83fe7b ("powerpc/ppc64: Do not turn AIL (reloc-on
interrupts) too early") added code to set the AIL bit in the LPCR
without checking whether the kernel is running in hypervisor mode.
The result is that when the kernel is running as a guest (i.e.,
under PowerKVM or PowerVM), the processor takes a privileged
instruction interrupt at that point, causing a panic. The visible
result is that the kernel hangs after printing "returning from
prom_init".
This fixes it by checking for hypervisor mode being available
before setting LPCR. If we are not in hypervisor mode, we enable
relocation-on interrupts later in pSeries_setup_arch using the
H_SET_MODE hcall.
This fixes BZ 108728.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
When the guest cedes the vcpu or the vcpu has no guest to
run, it naps. Clear the runlatch bit of the vcpu before
napping to indicate an idle cpu.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
The secondary threads in the core have their runlatch bits cleared since they
are offline. When the secondary threads are called in to start a guest their
runlatch bits need to be set to indicate that they are busy. The primary
thread has its runlatch bit set though, but there is no harm in setting this
bit once again. Hence set the runlatch bit for all threads before they start
the guest.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
Up until now we have been setting the runlatch bits for a busy CPU and
clearing it when a CPU enters idle state. The runlatch bit has thus
been consistent with the utilization of a CPU as long as the CPU is online.
However when a CPU is hotplugged out the runlatch bit is not cleared. It
needs to be cleared to indicate an unused CPU. OCC consumes the runlatch bit
to decide the utilization of a thread and ends up seeing the offline threads
as busy. Hence this patch has the runlatch bit cleared for an offline CPU
just before entering an idle state and sets it immediately after it exits
the idle state.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
The issue happens in a dual controller configuration. We got the
sysfs warnings when doing rmmod on the ipr module.
enclosure_unregister() in drivers/misc/enclosure.c calls device_unregister()
for each component device: device_unregister()->device_del()->kobject_del()
->sysfs_remove_dir(). sysfs_remove_dir() sets kobj->sd = NULL.
For each component device, enclosure_component_release()->
enclosure_remove_links()->sysfs_remove_link() then checks kobj->sd again,
which was already set to NULL by device_unregister(). So we saw all these
sysfs WARNINGs.
sysfs: can not remove 'enclosure_device: P1-D1 2SS6', no directory
------------[ cut here ]------------
WARNING: at fs/sysfs/inode.c:325
Modules linked in: fuse loop dm_mod ses enclosure ipr(-) ipv6 ibmveth libata sg ext3 jbd mbcache sd_mod crc_t10dif crct10dif_common ibmvscsi scsi_transport_srp scsi_tgt scsi_dh_rdac scsi_dh_emc scsi_dh_hp_sw scsi_dh_alua scsi_dh scsi_mod
CPU: 0 PID: 4006 Comm: rmmod Not tainted 3.12.0-scsi-0.11-ppc64 #1
task: c0000000f769aba0 ti: c0000000f8f9c000 task.ti: c0000000f8f9c000
NIP: c0000000002b038c LR: c0000000002b0388 CTR: 0000000000000000
REGS: c0000000f8f9ee70 TRAP: 0700 Not tainted (3.12.0-scsi-0.11-ppc64)
MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 28008444 XER: 20000000
SOFTE: 1
CFAR: c000000000736118
GPR00: c0000000002b0388 c0000000f8f9f0f0 c0000000010ed630 0000000000000047
GPR04: c000000001502628 c000000001513010 0000000000000689 652027656e636c6f
GPR08: 737572655f646576 c000000000ae2b7c 0000000000a20000 c000000000add630
GPR12: 0000000028008442 c000000007f20000 0000000000000000 0000000010146920
GPR16: 00000000100cb9d8 0000000010093088 0000000010146920 0000000000000000
GPR20: 0000000000000000 0000000010161900 00000000100ce458 0000000000000000
GPR24: 0000000010161940 0000000000000000 d0000000046ad440 0000000000000000
GPR28: c0000000f8f9f270 0000000000000000 c0000000fcb882c8 0000000000000000
NIP [c0000000002b038c] .sysfs_hash_and_remove+0xe4/0xf0
LR [c0000000002b0388] .sysfs_hash_and_remove+0xe0/0xf0
Call Trace:
[c0000000f8f9f0f0] [c0000000002b0388] .sysfs_hash_and_remove+0xe0/0xf0 (unreliable)
[c0000000f8f9f190] [c0000000002b4134] .sysfs_remove_link+0x24/0x60
[c0000000f8f9f200] [d000000004df037c] .enclosure_remove_links+0x64/0xa0 [enclosure]
[c0000000f8f9f2d0] [d000000004df0518] .enclosure_component_release+0x30/0x60 [enclosure]
[c0000000f8f9f350] [c000000000540068] .device_release+0x50/0xd8
[c0000000f8f9f3d0] [c0000000003b6f80] .kobject_cleanup+0xb8/0x230
[c0000000f8f9f460] [c00000000053f404] .put_device+0x1c/0x30
[c0000000f8f9f4d0] [d000000004df0db0] .enclosure_unregister+0xa0/0xe8 [enclosure]
[c0000000f8f9f560] [d000000004f90094] .ses_intf_remove_enclosure+0x8c/0xa8 [ses]
[c0000000f8f9f5f0] [c0000000005413ec] .device_del+0xf4/0x268
[c0000000f8f9f680] [c000000000541594] .device_unregister+0x34/0x88
[c0000000f8f9f700] [d000000001423d3c] .__scsi_remove_device+0x104/0x128 [scsi_mod]
[c0000000f8f9f780] [d00000000141eff8] .scsi_forget_host+0x70/0xa0 [scsi_mod]
[c0000000f8f9f800] [d000000001413dc0] .scsi_remove_host+0x88/0x178 [scsi_mod]
[c0000000f8f9f890] [d00000000469db5c] .ipr_remove+0x7c/0xf8 [ipr]
[c0000000f8f9f920] [c0000000003fe1f4] .pci_device_remove+0x64/0xf0
[c0000000f8f9f9b0] [c000000000544f10] .__device_release_driver+0xd0/0x158
[c0000000f8f9fa40] [c0000000005450d8] .driver_detach+0x140/0x148
[c0000000f8f9fae0] [c000000000543848] .bus_remove_driver+0xe0/0x188
[c0000000f8f9fb70] [c00000000054628c] .driver_unregister+0x3c/0x80
[c0000000f8f9fbf0] [c0000000003fe35c] .pci_unregister_driver+0x34/0xe8
[c0000000f8f9fc90] [d0000000046a5fb4] .ipr_exit+0x2c/0x44 [ipr]
[c0000000f8f9fd20] [c0000000001359dc] .SyS_delete_module+0x204/0x308
[c0000000f8f9fe30] [c000000000009f60] syscall_exit+0x0/0xa0
Instruction dump:
e8010010 eb81ffe0 7c0803a6 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 3c62ff8a
7ca42b78 3863c388 48485d45 60000000 <0fe00000> 3860fffe 4bffff94 fba1ffe8
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
|
|
commit 789b5e0315284463617e106baad360cb9e8db3ac upstream.
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:
	get_online_cpus();
	for_each_online_cpu(cpu)
		init_cpu(cpu);
	register_cpu_notifier(&foobar_cpu_notifier);
	put_online_cpus();
This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).
Interestingly, the raid5 code can actually prevent double initialization and
hence can use the following simplified form of callback registration:
	register_cpu_notifier(&foobar_cpu_notifier);
	get_online_cpus();
	for_each_online_cpu(cpu)
		init_cpu(cpu);
	put_online_cpus();
A hotplug operation that occurs between registering the notifier and calling
get_online_cpus(), won't disrupt anything, because the code takes care to
perform the memory allocations only once.
So reorganize the code in raid5 this way to fix the deadlock with callback
registration.
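The reordering is only safe because the per-CPU init can run twice without harm. A minimal user-space sketch of such an idempotent init follows; it is not the raid5 code, and demo_scratch, demo_alloc_count and the 64-byte buffer size are made up for the example.

```c
#include <stdlib.h>

#define DEMO_NR_CPUS	4

static void *demo_scratch[DEMO_NR_CPUS];	/* per-CPU buffers */
static int demo_alloc_count;			/* number of real allocations */

/* Safe to call twice for the same CPU: the second call is a no-op. */
static int demo_init_cpu(int cpu)
{
	if (cpu < 0 || cpu >= DEMO_NR_CPUS)
		return -1;
	if (demo_scratch[cpu])
		return 0;	/* already set up, e.g. by a hotplug callback */

	demo_scratch[cpu] = malloc(64);
	if (!demo_scratch[cpu])
		return -1;
	demo_alloc_count++;
	return 0;
}
```

If a hotplug callback and the for_each_online_cpu() loop both reach the same CPU, only the first call allocates.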
This fixes BZ 103213.
Cc: linux-raid@vger.kernel.org
Fixes: 36d1c6476be51101778882897b315bd928c8c7b5
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
free_scratch_buffer() helper to condense code further and wrote the changelog.]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Commit 2775d6230 (md: Avoid deadlock in raid5_alloc_percpu) only partially
fixed the deadlock involving CPU hotplug notifiers. In particular, it fixed
the deadlock possibility in register_cpu_notifier(), but left the deadlock
in unregister_cpu_notifier() unfixed. So revert this commit so that we can
fix both the deadlocks properly, using the solution that was accepted
upstream.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
kvm_vfio_spapr_tce_release was spelled as ikvm_vfio_ispapr_tce_release
which caused compilation to break in case of CONFIG_KVM_VFIO=n. Fix it.
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
The global_invalidates() function contains a check that is intended
to tell whether we are currently executing in the context of a hypercall
issued by the guest. The reason is that the optimization of using a
local TLB invalidate instruction is only valid in that context. The
check was testing local_paca->kvm_hstate.kvm_vcore, which gets set
when entering the guest but no longer gets cleared when exiting the
guest. To fix this, we use the kvm_vcpu field instead, which does
get cleared when exiting the guest, by the kvmppc_release_hwthread()
calls inside kvmppc_run_core().
The effect of having the check wrong was that when kvmppc_do_h_remove()
got called from htab_write() on the destination machine during a
migration, it cleared the current cpu's bit in kvm->arch.need_tlb_flush.
This meant that when the guest started running in the destination VM,
it may miss out on doing a complete TLB flush, and therefore may end
up using stale TLB entries from a previous guest that used the same
LPID value.
This should make migration more reliable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
The OPAL log is now accessed through sysfs at /sys/firmware/opal/msglog,
so remove the old and buggy debugfs file.
Signed-off-by: Joel Stanley <joel@jms.id.au>
|
|
Create a driver attribute named "cpuinfo_nominal_freq" which will in
turn create a read-only sysfs interface that will be used to export
the nominal frequency to the userspace. This will be necessary for
creating an optimal "performance" policy which should be running the
on-demand governor with "scaling_max_freq" to be set to the value
exported via "cpuinfo_max_freq" and "scaling_min_freq" to be set to
the nominal frequency exported via "cpuinfo_nominal_freq".
The patch caches the values of max, min, nominal pstate ids and
nr_pstates queried from the DT during the initialization of the driver
so that they can be used in other places in the driver for
validation.
Also, it adds a helper method that returns the frequency corresponding to
a pstate id.
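A helper of that kind could look roughly like the sketch below. The table contents, the struct layout and the name demo_pstate_id_to_freq() are assumptions for illustration; the real driver fills its table from the device tree.

```c
#include <stddef.h>

struct demo_pstate {
	int id;			/* pstate id */
	unsigned int freq_khz;	/* matching frequency in kHz */
};

/* Made-up table; real values would be queried from the DT at init. */
static const struct demo_pstate demo_pstates[] = {
	{  0, 3500000 },
	{ -1, 3400000 },
	{ -2, 3300000 },
};

/* Returns the frequency for a pstate id, or 0 if the id is unknown. */
static unsigned int demo_pstate_id_to_freq(int id)
{
	size_t i;

	for (i = 0; i < sizeof(demo_pstates) / sizeof(demo_pstates[0]); i++)
		if (demo_pstates[i].id == id)
			return demo_pstates[i].freq_khz;
	return 0;
}
```

Returning 0 for an unknown id lets callers use the helper for validation as well as lookup.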
This has been backported from the version posted against mainline
which can be found here:
https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg76990.html
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
|
|
We had added the debug prints to confirm the idle state exit
by the cpus. This was mainly to test if fast sleep was working
fine. Now that we are confident about its functioning we
can get rid of these prints.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
|
|
This reworks the opal message log following upstream review. A bug was
fixed where wrapped logs were not read correctly, and locking was added
to reduce the impact of races between reading counters and the buffer
contents.
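Reading a wrapped circular log in order can be sketched like this. It is a generic model, not the OPAL message log layout: the parameters (next write offset, wrapped flag) are assumptions about how such a buffer might be described.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the logical contents of a circular buffer into "out", oldest
 * byte first. "next" is the offset where the next byte would be
 * written; "wrapped" says whether the writer has gone past the end
 * at least once. "out" must have room for "size" bytes.
 */
static size_t demo_log_read(const char *buf, size_t size, size_t next,
			    int wrapped, char *out)
{
	size_t n = 0;

	if (wrapped) {
		/* The oldest data starts at "next" and runs to the end. */
		memcpy(out, buf + next, size - next);
		n = size - next;
	}
	/* The newest data runs from the start of the buffer up to "next". */
	memcpy(out + n, buf, next);
	return n + next;
}
```

Without the first copy, a reader of a wrapped log would see only the newest part, or see it in the wrong order.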
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
In autogroup_create(), a tg is allocated and added to the task_groups
list. If CONFIG_RT_GROUP_SCHED is set, this tg is then modified while on
the list, without locking. This can race with someone walking the list,
like __enable_runtime() during CPU unplug, and result in a use-after-free
bug.
To fix this, move sched_online_group(), which adds the tg to the list,
to the end of the autogroup_create() function after the modification.
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1369411669-46971-2-git-send-email-gerald.schaefer@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
(cherry picked from commit 41261b6a832ea0e788627f6a8707854423f9ff49)
|
|
The firmware can notify us when new input data is available, so
let's make sure we wakeup the HVC thread in that case.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
opal_notifier_register() is missing a pending "unregister" variant
and should be exposed to modules.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Turn them on at the same time as we allow MSR_IR/DR in the paca
kernel MSR, ie, after the MMU has been setup enough to be able
to handle relocated access to the linear mapping.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
If we take an interrupt such as a trap caused by a BUG_ON before the
MMU has been set up, the interrupt handlers try to enable virtual mode
and cause a recursive crash, making the original problem very hard
to debug.
This fixes it by adjusting the "kernel_msr" value in the PACA so that
it only has MSR_IR and MSR_DR (translation for instruction and data)
set after the MMU has been initialized for the processor.
We may still not have a console yet but at least we don't get into
a recursive fault (and early debug console or memory dump via JTAG
of the kernel buffer *will* give us the proper error).
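The idea of holding back the translation bits can be sketched as a small helper. The bit positions and the function name below are illustrative assumptions, not a copy of the kernel's definitions.

```c
/* Illustrative bit positions for the two translation-enable bits. */
#define DEMO_MSR_IR	(1UL << 5)	/* instruction relocation */
#define DEMO_MSR_DR	(1UL << 4)	/* data relocation */

/* Value to store as "kernel_msr": IR/DR only once the MMU is ready. */
static unsigned long demo_kernel_msr(unsigned long base, int mmu_ready)
{
	if (mmu_ready)
		return base | DEMO_MSR_IR | DEMO_MSR_DR;
	return base & ~(DEMO_MSR_IR | DEMO_MSR_DR);
}
```

An interrupt handler that loads this value before the MMU is initialized stays in real mode and therefore cannot fault on translation.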
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This call will not be understood by OPAL, and will cause it to add an error
to its log. Among other things, this is useful for testing the
behaviour of the log as it fills up.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
OPAL provides an in-memory circular buffer containing a message log
populated with various runtime messages produced by the firmware.
Provide a sysfs interface /sys/firmware/opal/messages for userspace to
view the messages.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
We had a mix & match of flags used when creating legacy ports
depending on where we found them in the device-tree. Among others
we were missing UPF_SKIP_TEST for some kind of ISA ports which is
a problem as quite a few UARTs out there don't support the loopback
test (such as a lot of BMCs).
Let's pick the set of flags used by the SoC code and generalize it
which means autoconf, no loopback test, irq maybe shared and fixed
port.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Helps debug funky firmware issues
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Commit b1022fbd293564de91596b8775340cf41ad5214c and subsequent ones
(in 3.10) introduced some preparatory changes for THP which consist
of trying to read the actual HPTE page size from the hash table to
perform the right variant of tlbie. However this has two issues:
- The hash entry can have been evicted and replaced by another
one with a different page size. This can in turn cause us to use
an impossible combination of psize and actual_psize, in turn
causing tlbie to be called with an invalid LP bit combination
causing a HW checkstop
- The whole business is unnecessary as in 3.10 we don't have THP
and thus always have psize == actual_psize
When THP was actually enabled in 3.11, we discovered that this wasn't
going to work and changed the code significantly to pass the proper
actual_psize from the upper layers rather than trying to deduce it
from the HPTE.
However, we didn't "fix" 3.10 as we didn't realize that the bug
introduced an exposure without THP being enabled.
If a user page was hashed as a 64k page, and later got evicted from
the hash and replaced with a 4k hash entry (due to a segment being
demoted to 4k, for example by subpage protection or because it's
an IO page), we could get into a situation where we tried to
do a tlbie with a psize of 64k and actual_psize of 4k which is
deadly.
This is a 3.10-only fix for this situation which essentially removes
the actual_psize business from the normal updatepp and invalidate
path in hash_native_64.c since we know on 3.10 that the psize coming
from the upper levels is always correct (no THP).
As such it's a partial revert of b1022fbd293564de91596b8775340cf41ad5214c
(we don't touch the bolted path etc... those should be fine and we
want to minimize churn).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Fill in asic family specific versions rather than
using the generic version. This lets us handle asic
specific differences more easily. In this case, we
disable sw swapping of the rptr writeback value on
r6xx+ since the hw does it for us. Fixes bogus
rptr readback on BE systems.
v2: remove missed cpu_to_le32(), add comments
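Why a redundant software swap yields a bogus value can be shown with two small helpers. This is a generic byte-order sketch, not radeon code; demo_read_le32() and demo_swab32() are hypothetical names.

```c
#include <stdint.h>

/* Assemble a 32-bit value from little-endian bytes on any host. */
static uint32_t demo_read_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Unconditionally reverse the byte order of a 32-bit value. */
static uint32_t demo_swab32(uint32_t v)
{
	return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
	       ((v << 8) & 0x00FF0000u) | (v << 24);
}
```

If the hardware has already delivered the value in CPU byte order, applying demo_swab32() on top reverses it again and the pointer no longer matches what was written.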
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(backported from commit ea31bf697d27270188a93cd78cf9de4bc968aca3)
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
LTC-Bugzilla: #99530
|
|
Now that we have callbacks for [rw]ptr handling we can
remove the special handling for the DMA rings and use
the callbacks instead.
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(backported from commit 2e1e6dad6a6d437e4c40611fdcc4e6cd9e2f969e)
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
LTC-Bugzilla: #99530
|
|
The hardware just doesn't support this correctly.
Disable it before we accidentally write anywhere we shouldn't.
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(backported from commit 02c9f7fa4e7230fc4ae8bf26f64e45aa76011f9c)
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
LTC-Bugzilla: #99530
|
|
Give the ring functions a separate structure and let the asic
structure point to the ring specific functions. This simplifies
the code and allows us to make changes at only one point.
No change in functionality.
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(backported from commit 76a0df859defc53e6cb61f698a48ac7da92c8d84)
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
LTC-Bugzilla: #99530
|
|
Add callbacks to the radeon_asic struct to handle
rptr/wptr fetches and wptr updates.
We currently use one version for all rings, but this
allows us to override with ring specific versions.
Needed for compute rings on CIK.
v2: update as per Christian's comments
v3: fix some rebase cruft
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit f93bdefe6269067afc85688d45c646cde350e0d8)
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
LTC-Bugzilla: #99530
|
|
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
|
|
While checking powersaving mode in machine check handler at 0x200, we
clobber CFAR register. Fix it by saving and restoring it during beq/bgt.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
|
|
The branch target should be the func addr, not the addr of func_descr_t.
So use ppc_function_entry() to generate the right target addr.
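The difference between a descriptor address and the entry address can be modelled as below. This is a simplified sketch assuming a three-word descriptor; struct demo_func_descr and demo_function_entry() are stand-ins, not the kernel's func_descr_t or ppc_function_entry().

```c
/* Simplified descriptor: a "function pointer" refers to this record. */
struct demo_func_descr {
	unsigned long entry;	/* address of the first instruction */
	unsigned long toc;	/* TOC base for the function */
	unsigned long env;	/* environment pointer */
};

/* Return the real branch target stored inside the descriptor. */
static unsigned long demo_function_entry(const void *func)
{
	return ((const struct demo_func_descr *)func)->entry;
}
```

Branching to the descriptor's own address would execute data, which is the kind of mistake the commit describes.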
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Fix the 'sysfs' file duplication by passing an initialized
char array to strncpy(), since the result is not NUL-terminated
if the source exceeds 'copy_length' bytes.
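The strncpy() pitfall and one way around it can be sketched in user-space C. demo_copy_name() is a hypothetical helper, not the function touched by this patch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy "src" into "dst" so that "dst" is always NUL-terminated.
 * strncpy() alone leaves no terminator when "src" has at least as many
 * characters as the limit, so the buffer is zeroed first and the last
 * byte is never written.
 */
static void demo_copy_name(char *dst, size_t dst_size, const char *src)
{
	if (dst_size == 0)
		return;
	memset(dst, 0, dst_size);
	strncpy(dst, src, dst_size - 1);
}
```

With an uninitialized destination and a limit equal to the buffer size, a long source would leave the buffer unterminated and a later string operation could read past its end.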
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
|
|
Add the appropriate definition and table entry for new hardware support.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch adds formatting error overlay 0x21 to improve debug capabilities.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
There is no need to call pci_disable_msi() or pci_disable_msix()
in case the call to pci_enable_msi() or pci_enable_msix() failed.
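The resulting error-path shape can be sketched with mocked helpers. Nothing here is the PCI API: demo_enable_msi(), demo_disable_msi() and demo_setup_irq() are stand-ins that only count calls.

```c
static int demo_disable_calls;	/* how often the undo step ran */

/* Mock: pretend to enable MSI; fails when asked to. */
static int demo_enable_msi(int should_fail)
{
	return should_fail ? -1 : 0;
}

/* Mock: undo step, only meaningful after a successful enable. */
static void demo_disable_msi(void)
{
	demo_disable_calls++;
}

static int demo_setup_irq(int enable_fails, int later_step_fails)
{
	if (demo_enable_msi(enable_fails))
		return -1;		/* nothing was enabled, nothing to undo */

	if (later_step_fails) {
		demo_disable_msi();	/* undo only what succeeded */
		return -1;
	}
	return 0;
}
```

Only a failure after a successful enable triggers the disable call.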
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Acked-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
If, when the ipr driver loads, the adapter is in an EEH error state,
it will currently oops and not be able to recover, as it attempts
to access memory that has not yet been allocated. We've seen this
occur in some kexec scenarios. The following patch fixes the oops
and also allows the driver to recover from these probe time EEH errors.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Add the appropriate definition and table entry for new hardware support.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch removes the extended delay bit on GSCSI read/write ops, so
performance will be significantly better.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Acked-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
|
|
Add the appropriate definitions and table entries for new adapter support.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
|
|
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Acked-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
|
|
The 'ctl' field of the 'struct ata_taskfile' is not really dual purpose, i.e.
it is not intended for storing the alternate status register (which is mapped
at the same address in the legacy IDE controllers) in the qc_fill_rtf() method.
No other 'libata' driver except 'drivers/scsi/ipr.c' stores the alternate status
register's value in the 'ctl' field of 'qc->result_tf', hence this driver should
not do this as well...
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
|
|
Currently we save the host PMU configuration, counter values, etc.,
when entering a guest, and restore it on return from the guest.
(We have to do this because the guest has control of the PMU while
it is executing.) However, we missed saving/restoring the SIAR and
SDAR registers, as well as the registers which are new on POWER8,
namely SIER and MMCR2.
This adds code to save the values of these registers when entering
the guest and restore them on exit. This also works around the bug
where setting PMAE with a counter already negative doesn't generate
an interrupt. This was already worked around for the guest PMU state
in an earlier commit, and is worked around for the host PMU state here.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This adds workarounds for two hardware bugs in the POWER8 performance
monitor unit (PMU), both related to interrupt generation. The effect
of these bugs is that PMU interrupts can get lost, leading to tools
such as perf reporting fewer counts and samples than they should.
The first bug relates to the PMAO (perf. mon. alert occurred) bit in
MMCR0; setting it should cause an interrupt, but doesn't. The other
bug relates to the PMAE (perf. mon. alert enable) bit in MMCR0.
Setting PMAE when a counter is negative and counter negative
conditions are enabled to cause alerts should cause an alert, but
doesn't.
The workaround for the first bug is to create conditions where a
counter will overflow, whenever we are about to restore a MMCR0
value that has PMAO set (and PMAO_SYNC clear). The workaround for
the second bug is to freeze all counters using MMCR2 before reading
MMCR0.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Somehow, the code that restores the guest transactional memory state
got put in the middle of the code sequence that restores the guest
PMU (performance monitor unit) state. This results in corruption of
the value written to MMCR0 if the guest is in transactional state.
This fixes it by moving the TM state-restoring code to come just before
the PMU state-restoring code. This comes out in the patch as the
first part of the PMU state-restoring code being moved down to just
before the second part of the PMU state-restoring code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Some POWER8 revisions have a hardware bug where we can lose a PMU
exception. This commit adds a workaround to detect the bad condition and
rectify the situation.
See the comment in the commit for a full description.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Some power8 revisions have a hardware bug where we can lose a
Performance Monitor (PMU) exception under certain circumstances.
We will be adding a workaround for this case; see the next commit for
details. The observed behaviour is that writing PMAO doesn't cause an
exception as we would expect, hence the name of the feature.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This patch converts Event TRB's 3rd field, which has type le32, to CPU
byteorder before using it to retrieve the Slot ID with TRB_TO_SLOT_ID macro.
This bug was found using sparse.
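As a rough user-space illustration of why the conversion matters (this is not the driver code), the sketch below decodes a 32-bit little-endian TRB field from its raw bytes and then takes the Slot ID from bits 31:24, which is the job of the TRB_TO_SLOT_ID macro. The helper names and the sample bytes are made up for this example.

```python
import struct


def trb_to_slot_id(field_cpu: int) -> int:
    """Model of TRB_TO_SLOT_ID(): the Slot ID lives in bits 31:24."""
    return (field_cpu >> 24) & 0xFF


def slot_id_from_le32(raw: bytes) -> int:
    """Convert the little-endian field to a native integer first (le32_to_cpu)."""
    (field_cpu,) = struct.unpack("<I", raw)
    return trb_to_slot_id(field_cpu)


def slot_id_unconverted_on_big_endian(raw: bytes) -> int:
    """What a big-endian CPU would see if the le32 value were used as-is."""
    (field_wrong,) = struct.unpack(">I", raw)
    return trb_to_slot_id(field_wrong)


# Hypothetical field contents as they sit in memory (little-endian, Slot ID 5).
SAMPLE_FIELD = bytes([0x00, 0x84, 0x00, 0x05])
```

On a little-endian host both helpers would agree, which may be why a static checker such as sparse is the tool that catches this class of bug.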
Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
[Backport of 7e76ad431545d013911ddc744843118b43d01e89]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch converts TRB_CYCLE to le32 to update correctly the Cycle Bit in
'control' field of the link TRB.
This bug was found using sparse.
Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
[Backport of 587194873820a4a1b2eda260ac851394095afd77]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
In the kexec scenario, we failed to load the mlx4 driver in the
second kernel because the ownership bit was still held by the first
kernel, which never released it correctly.
The patch adds a shutdown() interface so that the ownership can
be released correctly in the first kernel. It also helps avoid
EEH errors during the boot stage of the second kernel caused
by undesired traffic, which can't be handled by hardware during
that stage on the Power platform.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Tested-by: Wei Yang <weiyang@linux.vnet.ibm.com>
|
|
The problem is specific to the case of BIST issued to an IPR adapter
on the guest side. The IPR driver does something like this:
pci_save_state(), BIST reset and then pci_restore_state(). We lose
everything in the MSIx table with the BIST reset and we never have a
chance to restore the MSIx table in that case.
pci_restore_msix_state(), called by pci_restore_state(), masks all MSIx
vectors via the MSIx capability, restores the MSIx table, and then unmasks
all MSIx vectors. We force the host kernel to restore the MSIx
vector in the step of unmasking all MSIx vectors to fix the issue.
The patch is under review in the Linux community at the moment. It would
be better to have an ack from Ben and Alexey if we really want this to be
in Frobisher.
It responds to bug#103589.
Reported-by: Wen Xiong <wenxiong@us.ibm.com>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
|
|
We may detect EEH errors during reboot, particularly in the kexec
path, but it's impossible for device drivers and the EEH core to handle
or recover from them properly.
The patch registers a reboot notifier for EEH and disables the EEH
subsystem during reboot. That means the EEH errors are going to be
cleared by hardware reset or by the second kernel during the early stage
of PCI probe.
It's backporting commit 66f9af83e56bfa12964d251df9d60fb571579913
("powerpc/eeh: Disable EEH on reboot") from 3.14 upstream for
bug#103590
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The patch cleans up the variable eeh_subsystem_enabled so that external
code need not refer to it directly. Instead, we will use the
functions eeh_enabled() and eeh_set_enable() to operate on the variable.
It's backporting 2ec5a0adf60c23bb6b0a95d3b96a8c1ff1e1aa5a ("powerpc/eeh:
Cleanup on eeh_subsystem_enabled") from 3.14 upstream for bug#103590
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
When doing a reset in order to recover the affected PE, we issue a
hot reset on the PE's primary bus if it's not the root bus. Otherwise, we
issue a hot or fundamental reset on the root port or PHB accordingly.
For the latter case, we didn't cover the situation where the PE only
includes the root port, which potentially causes a kernel crash upon
an EEH error on the PE.
The patch reworks the logic of EEH reset to improve the code
readability and also avoid the kernel crash.
It's backporting commit 5b2e198e50f6ba57081586b853163ea1bb95f1a8
("powerpc/powernv: Rework EEH reset") from 3.14 upstream for
bug#103590
Cc: stable@vger.kernel.org
Reported-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
A malicious guest can register an IOMMU in KVM while a TCE request is
being passed from the real to virtual mode. If vcpu->arch.tce_rm_fail
was previously used and not cleared because of missing LIOBN entry in KVM,
this may cause unwanted put_page() in the virtual mode handler.
This moves @tce_rm_fail earlier to avoid using the incorrect tce_rm_fail
flag value.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
The PCI core has the function pci_reset_function() to reset the
specified PCI device. Before the reset starts, the state of the PCI
device is saved, and it is restored after the reset. The real reset work
could be routed to pcibios_set_pcie_reset_state() by quirks. However,
the PCI bus or PCI device hasn't fully settled down for the restore (PCI
config and MMIO for the MSIx table) after the reset, which introduces
an unnecessary frozen PE. As a result, we're stopped from passing through
the IPR adapter from host to KVM-based guest.
The patch adds a delay in pcibios_set_pcie_reset_state() so that the
PCI bus/device can settle down fully before restoring PCI device
states. It's part of the fixes regarding bug#103297 and bug#103589.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
|
|
In periodic mode we remove offline cpus from the broadcast propagation
mask. In oneshot mode we fail to do so. This was not a problem so far,
but the recent changes to the broadcast propagation introduced a
constellation which can result in a NULL pointer dereference.
What happens is:
    CPU0                        CPU1
                                idle()
                                  arch_idle()
                                    tick_broadcast_oneshot_control(OFF);
                                      set cpu1 in tick_broadcast_force_mask
                                  if (cpu_offline())
                                     arch_cpu_dead()
    cpu_dead_cleanup(cpu1)
      cpu1 tickdevice pointer = NULL
    broadcast interrupt
      dereference cpu1 tickdevice pointer -> OOPS
We dereference the pointer because cpu1 is still set in
tick_broadcast_force_mask and tick_do_broadcast() expects a valid
cpumask and therefore lacks any further checks.
Remove the cpu from the tick_broadcast_force_mask before we set the
tick device pointer to NULL. Also add a sanity check to the oneshot
broadcast function, so we can detect such issues w/o crashing the
machine.
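The ordering described above can be modelled in a few lines of user-space Python. This is only a toy model, assuming a set for the force mask and a dict of per-cpu tick devices; the class and method names are invented for illustration and are not kernel symbols.

```python
class BroadcastModel:
    """Toy model of the force mask and the per-cpu tick device pointers."""

    def __init__(self, ncpus):
        self.tick_device = {cpu: "tickdev%d" % cpu for cpu in range(ncpus)}
        self.force_mask = set()

    def enter_idle(self, cpu):
        # Stands in for the idle path setting the cpu in the force mask.
        self.force_mask.add(cpu)

    def cpu_dead_cleanup(self, cpu):
        # The fix: drop the cpu from the mask first, then clear the pointer.
        self.force_mask.discard(cpu)
        self.tick_device[cpu] = None

    def broadcast(self):
        """Return (delivered, skipped); skipped cpus had no tick device."""
        delivered, skipped = [], []
        for cpu in sorted(self.force_mask):
            if self.tick_device[cpu] is None:
                # Sanity check: report the stale mask bit instead of crashing.
                skipped.append(cpu)
                continue
            delivered.append(cpu)
        return delivered, skipped
```

With the cleanup ordering in place the sanity check should never fire; it exists so that a stale mask bit shows up as a diagnostic rather than an oops.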
Reported-by: Prarit Bhargava <prarit@redhat.com>
Cc: athorlton@sgi.com
Cc: CAI Qian <caiqian@redhat.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1306261303260.4013@ionos.tec.linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit c9b5a266b103af873abb9ac03bc3d067702c8f4b)
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
|
|
Fast sleep can be enabled today, only after writing into the proc interface
/proc/sys/kernel/powersave-nap with a value greater than 1. Remove this
constraint, now that we have a stable framework to support fast sleep, so
that it is enabled by default at boot.
However the same proc interface is also used to convey if deep idle states
beyond snooze can be entered into or not. Hence retain the check on
powersave-nap in fast sleep to verify if this is the case.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
|
|
Add a configuration file to use when building the skiroot
(Sapphire bootloader) kernel.
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
As Ben suggested, the patch prints PHB diag-data with multiple
fields in one line and omits the line if the fields of that
line are all zero.
With the patch applied, the PHB3 diag-data dump looks like:
PHB3 PHB#3 Diag-data (Version: 1)
brdgCtl: 00000002
RootSts: 0000000f 00400000 b0830008 00100147 00002000
nFir: 0000000000000000 0030006e00000000 0000000000000000
PhbSts: 0000001c00000000 0000000000000000
Lem: 0000000000100000 42498e327f502eae 0000000000000000
InAErr: 8000000000000000 8000000000000000 0402030000000000 \
0000000000000000
PE[ 8] A/B: 8480002b00000000 8000000000000000
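A minimal sketch of the formatting rule, written in Python rather than as the kernel's printing code: each line carries several fields in zero-padded hex, and a line whose fields are all zero is dropped. The function names and the "AllZero" row used in the note below are made up for the example.

```python
def format_diag_line(label, values, width):
    """Return 'label: v1 v2 ...' in zero-padded hex, or None if all values are zero."""
    if not any(values):
        return None
    fields = " ".join("{:0{w}x}".format(v, w=width) for v in values)
    return "{}: {}".format(label, fields)


def dump_diag(rows):
    """rows is a list of (label, values, hex_width); all-zero rows are omitted."""
    lines = []
    for label, values, width in rows:
        line = format_diag_line(label, values, width)
        if line is not None:
            lines.append(line)
    return lines
```

For instance, dump_diag([("brdgCtl", [2], 8), ("AllZero", [0, 0], 16)]) keeps only the brdgCtl line.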
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The PHB diag-data is useful for locating the root cause of a
frozen PE or fenced PHB. However, the EEH core enables the IO path by
clearing part of the HW registers before collecting it, so we end up with
broken PHB diag-data.
The patch intends to fix it by dumping the PHB diag-data immediately
when frozen/fenced state on PE or PHB is detected for the first time
in eeh_ops::get_state() or next_error() backend.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The flag PNV_EEH_STATE_ENABLED is put into pnv_phb::eeh_state,
which is protected by CONFIG_EEH. We don't need that. Instead, we
can have pnv_phb::flags and maintain all flags there, which is
the purpose of the patch. The patch also renames PNV_EEH_STATE_ENABLED
to PNV_PHB_FLAG_EEH.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The PHB state PNV_EEH_STATE_REMOVED maintained in pnv_phb is no
longer useful and duplicates EEH_PE_ISOLATED. The
patch replaces PNV_EEH_STATE_REMOVED with EEH_PE_ISOLATED.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The PE state (for eeh_pe instance) EEH_PE_PHB_DEAD duplicates
EEH_PE_ISOLATED. Originally, those PHBs (PHB PE) with EEH_PE_PHB_DEAD
would be removed from the system. However, it's safe to replace
that with EEH_PE_ISOLATED.
The patch also clears EEH_PE_RECOVERING after a fenced PHB has been
handled, whether recovery failed or succeeded. It makes the PHB PE state
consistent with:
    PHB functions normally              NONE
    PHB has been removed                EEH_PE_ISOLATED
    PHB fenced, recovery in progress    EEH_PE_ISOLATED | RECOVERING
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Clean up the code by removing an unnecessary enumeration, merging the
fragmented data structure and dropping some conditional checks in the
node traversal in __init code.
This also fixes a bug of sysfs file duplication.
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This fixes memory corruption which happens when VFIO is used with
PR KVM.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
Firmware update on PowerNV platform takes several minutes. During
this time one CPU is stuck in FW and the kernel complains about "soft
lockups".
This patch returns all secondary CPUs to firmware before starting
firmware update process.
[ Reworked a bit and cleaned up -- BenH ]
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Cherry pick 3b3f89ac6614d6bc2e2edb32e49d4906d931c795, implementing the
error log reading code we're pushing upstream.
This changes the userspace interface for reading and acknowledging
error logs, so userspace code will have to change if it relied on the
old way.
Based on a patch by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
This patch adds support to read error logs from OPAL and export
them to userspace through a sysfs interface.
We export each log entry as a directory in /sys/firmware/opal/elog/
Currently, OPAL will buffer up to 128 error log records. We don't
need to have any knowledge of this limit on the Linux side as that
is actually largely transparent to us.
Each error log entry has the following files: id, type, acknowledge, raw.
Currently we just export the raw binary error log in the 'raw' attribute.
In a future patch, we may parse more of the error log to make it a bit
easier for userspace (e.g. to be able to display a brief summary in
petitboot without having to have a full parser).
If we have >128 logs from OPAL, we'll only be notified of 128 until
userspace starts acknowledging them. This limitation may be lifted in
the future and with this patch, that should "just work" from the linux side.
A userspace daemon should:
- wait for error log entries using normal mechanisms (we announce creation)
- read error log entry
- save error log entry safely to disk
- acknowledge the error log entry
- rinse, repeat.
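The daemon steps listed above could be sketched roughly as follows. This is a hedged illustration, not the actual daemon: it assumes each entry directory holds the 'raw' and 'acknowledge' files named in this message, that writing the string "ack" acknowledges an entry (the exact string is an assumption), and it leaves out the step of waiting for new entries. The function name and the ".bin" suffix are invented.

```python
import os


def drain_error_logs(elog_root, save_dir):
    """Save and acknowledge every entry under elog_root; return the saved names."""
    saved = []
    os.makedirs(save_dir, exist_ok=True)
    for name in sorted(os.listdir(elog_root)):
        entry = os.path.join(elog_root, name)
        raw_path = os.path.join(entry, "raw")
        if not os.path.isfile(raw_path):
            continue  # not an error log entry directory
        # Read the raw binary error log.
        with open(raw_path, "rb") as src:
            data = src.read()
        # Save it safely to disk before acknowledging.
        with open(os.path.join(save_dir, name + ".bin"), "wb") as dst:
            dst.write(data)
            dst.flush()
            os.fsync(dst.fileno())
        # Acknowledge only after the copy is on disk ("ack" is an assumption).
        with open(os.path.join(entry, "acknowledge"), "w") as ack:
            ack.write("ack")
        saved.append(name)
    return saved
```

A real daemon would presumably call something like drain_error_logs("/sys/firmware/opal/elog", ...) each time it is told that a new entry has appeared. Acknowledging only after fsync() means a crash in between leaves the log available from OPAL.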
On the Linux side, we read the error log when we're notified of it. This
possibly isn't ideal as it would be better to only read them on-demand.
However, this doesn't really work with current OPAL interface, so we
read the error log immediately when notified at the moment.
I've tested this pretty extensively and am rather confident that the
linux side of things works rather well. There is currently an issue with
the service processor side of things for >128 error logs though.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Conflicts:
arch/powerpc/include/asm/opal.h
arch/powerpc/platforms/powernv/Makefile
arch/powerpc/platforms/powernv/opal-elog.c
|
|
This patch makes the sysfs interface match that of what's pushed upstream.
Changes in kernel:
- fetch dump on-demand
- directory per dump
- in sysfs rather than debugfs
Userspace changes needed:
- read from sysfs rather than debugfs.
This enables support for userspace to fetch and initiate FSP and
Platform dumps from the service processor (via firmware) through sysfs.
Based on original patch from Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Flow:
- We register for OPAL notification events.
- OPAL sends new dump available notification.
- We make information on dump available via sysfs
- Userspace requests dump contents
- We retrieve the dump via OPAL interface
- User copies the dump data
- userspace sends ack for dump
- We send ACK to OPAL.
sysfs files:
- We add the /sys/firmware/opal/dump directory
- echoing 1 (well, anything, but in future we may support
different dump types) to /sys/firmware/opal/dump/initiate_dump
will initiate a dump.
- Each dump that we've been notified of gets a directory
in /sys/firmware/opal/dump/ with a name of the dump type and ID (in hex,
as this is what's used elsewhere to identify the dump).
- Each dump has files: id, type, dump and acknowledge
dump is binary and is the dump itself.
echoing 'ack' to acknowledge (currently any string will do) will
acknowledge the dump and it will soon after disappear from sysfs.
OPAL APIs:
- opal_dump_init()
- opal_dump_info()
- opal_dump_read()
- opal_dump_ack()
- opal_dump_resend_notification()
Currently we are only ever notified for one dump at a time (until
the user explicitly acks the current dump, then we get a notification
of the next dump), but this kernel code should "just work" when OPAL
starts notifying us of all the dumps present.
Changes since v2:
- fix bug where we would free the dump buffer after userspace read it,
refetching if needed. Refetching doesn't currently work, so we must
keep the dump around for subsequent reads.
Changes since v1:
- Add support for getting dump type from OPAL through new OPAL call
(falling back to old OPAL_DUMP_INFO call if OPAL_DUMP_INFO2 isn't
supported)
- use dump type in directory name for dump
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Conflicts:
arch/powerpc/include/asm/opal.h
arch/powerpc/platforms/powernv/Makefile
arch/powerpc/platforms/powernv/opal-dump.c
arch/powerpc/platforms/powernv/opal-wrappers.S
arch/powerpc/platforms/powernv/opal.c
|
|
In copy_oldmem_page, the current check, which uses max_pfn and min_low_pfn
to decide whether the page is backed or not, is not valid when the memory
layout is not contiguous.
This happens when running as a QEMU/KVM guest, where RTAS is mapped higher
in the memory. In that case max_pfn points to the end of RTAS, and a hole
between the end of the kdump kernel and RTAS is not backed by PTEs. As a
consequence, the kdump kernel crashes in copy_oldmem_page when directly
accessing the pages in that hole.
This fix relies on memblock's memblock_is_region_memory service to
check whether the page being read is part of the directly accessible memory.
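To show the difference between the two checks, here is a small Python model (not the kernel implementation). It assumes memory is described as a list of merged, non-overlapping (base, size) regions, and all addresses in the test layout below are hypothetical.

```python
def is_region_memory(regions, base, size):
    """True if [base, base + size) lies entirely inside one memory region."""
    end = base + size
    for r_base, r_size in regions:
        if r_base <= base and end <= r_base + r_size:
            return True
    return False


def below_max_pfn(regions, base, size):
    """The old-style check: only compares against the highest backed address."""
    top = max(r_base + r_size for r_base, r_size in regions)
    return base + size <= top


# Hypothetical layout: kdump kernel memory, a hole, then RTAS mapped higher up.
PAGE = 0x10000
LAYOUT = [(0x0, 0x10000000), (0x70000000, 0x01000000)]
```

A page in the hole passes the old-style bound check but fails the region check, which is the case that used to crash.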
This is a backport of upstream patch
https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-February/115569.html
This fixes LTC BUG #104729
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
|
|
Without a shutdown handler, T4 cards behave very badly after a kexec.
Some firmware calls return errors indicating allocation failures, for
example. This is probably because those resources were not released by
a BYE message to the firmware.
Using the remove handler guarantees we will use a well tested path.
With this patch applied, I managed to use kexec multiple times and
probe and iSCSI login worked every time.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
LTC-Bugzilla: #103241
(cherry picked from commit 687d705c031916b83953b714917b04d899e23cf5)
|
|
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
|
|
https://bugzilla.linux.ibm.com/show_bug.cgi?id=104249
https://bugzilla.linux.ibm.com/show_bug.cgi?id=104444
Signed-off-by: Wang Sen <wangsen@linux.vnet.ibm.com>
|
|
On p8 systems with the relocation-on-exception feature enabled, we are
seeing the kdump kernel hang at interrupt vector 0xc*4400. The reason is
that, with this feature enabled, exceptions are raised with the MMU on
(IR=DR=1) at the default offset of 0xc*4000. Since the exception is raised
in virtual mode, it requires the vector region to be executable, without
which it fails to fetch and execute instructions at 0xc*4xxx. For the
default kernel, since the kernel is loaded at real 0, the htab mappings set
the entire kernel text region executable. But for a relocatable kernel
(e.g. the kdump case) we only copy the interrupt vectors down to real 0 and
never mark that region as executable, because on p7 and below we always
get exceptions in real mode.
This patch fixes this issue by marking as executable the htab mapping range
that overlaps with the interrupt vector region for a relocatable kernel.
Thanks to Ben who helped me to debug this issue and find the root cause.
This is at least part of the fix for kdump failures that we are seeing
in bug 103693.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
(cherry picked from commit 429d2e8342954d337abe370d957e78291032d867)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Disable relocation on exception while going down even in the kdump case.
This is because we are about to clear htab mappings while kexec-ing into
the kdump kernel and we may run into issues if we still have AIL ON.
This is at least part of the fix for kdump failures that we are seeing
in bug 103693.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
(cherry picked from commit 3ec8b78fcc5aa7745026d8d85a4e9ab52c922765)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
acked it.
This fixes a bug where we would get two events from OPAL with DUMP_AVAIL
set (which is valid for OPAL to do) and in the second run of extract_dump()
we would fail to free the memory previously allocated for the dump
(leaking ~6MB+) as well as on the second dump_read_data() call OPAL
would not retrieve the dump, leaving us with a dump in linux that was
the correct size but all zeros.
Changes since v1: fixed typo
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
LTC-Bugzilla: #104211
|
|
Commit 082fee36bd2c ("KVM: PPC: Book3S HV: Make physical thread 0 do
the MMU switching") reordered the guest entry/exit code so that most
of the guest register save/restore code happened in guest MMU context.
A side effect of that is that the timebase still contains the guest
timebase value at the point where we compute and use vcpu->arch.dec_expires,
and therefore that is now a guest timebase value rather than a host
timebase value. That in turn means that the timeouts computed in
kvmppc_set_timer() are wrong if the timebase offset for the guest is
non-zero. The consequence of that is things such as "sleep 1" in a
guest after migration may sleep for much longer than they should.
This fixes the problem by converting between guest and host timebase
values as necessary, by adding or subtracting the timebase offset.
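A worked example of the conversion, as a hedged Python sketch rather than the actual KVM code. It assumes the guest timebase equals the host timebase plus the offset, with 64-bit wraparound; the function names and the numbers in the usage note are invented.

```python
MASK64 = (1 << 64) - 1


def host_to_guest_tb(host_tb, tb_offset):
    """Guest view of a host timebase value (assumes guest = host + offset)."""
    return (host_tb + tb_offset) & MASK64


def guest_to_host_tb(guest_tb, tb_offset):
    """Host view of a guest timebase value, e.g. a decrementer expiry."""
    return (guest_tb - tb_offset) & MASK64


def ticks_until(expiry_host_tb, now_host_tb):
    """Timeout in timebase ticks, with both arguments in host units."""
    return (expiry_host_tb - now_host_tb) & MASK64
```

With a host timebase of 1000000 and an offset of 500000, a guest expiry 512 ticks away is 1500512 in guest units. Subtracting the host "now" from it directly gives 500512 ticks, while converting it to host units first gives the intended 512.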
This also fixes an incorrect comment.
This is part of the fix for many of the migration-related bug reports.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
In kdump kernel we see a hang during subcore_init() at
unsplit_core()->wait_for_sync_step(). In kdump kernel we always boot with
maxcpus=1 and all other cpus are waiting inside OPAL. Hence, with 1 online
cpu, the master thread keeps waiting indefinitely for the secondary threads
to set split_state. The same is true for all cases where max_cpus is not
aligned with threads_per_core. This patch fixes this issue by disabling the
core split/unsplit feature if max_cpus is not aligned with threads_per_core.
This also fixes kdump hang issue.
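The alignment condition amounts to a simple divisibility test. The Python sketch below states it under the assumption that "aligned" means max_cpus is a whole multiple of threads_per_core; the function name is hypothetical.

```python
def subcore_split_allowed(max_cpus, threads_per_core):
    """Allow core split/unsplit only when every core has all its threads usable."""
    return max_cpus % threads_per_core == 0
```

For the kdump case, subcore_split_allowed(1, 8) is False, so the feature stays disabled and the wait for secondary threads is never entered.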
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
|
|
Signed-off-by: Crístian Viana <vianac@linux.vnet.ibm.com>
|
|
This fixes one of the corner cases which produced a wrong backtrace
from put_page().
BZ: 103055
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
icp_native_flush_interrupt() function is supposed to clear a pending
interrupt, like local_irq_enable(); local_irq_disable() would, but
without calling generic code. Unfortunately it missed clearing
the "IPI pending" flag in the PACA (local_paca->kvm_hstate.host_ipi).
The effect of this flag being set is that secondary CPU threads won't
go into the KVM guest, leading to messages like:
kvmppc_wait_for_nap timeout 0 1
when a KVM HV guest is run. This fixes it by adding a call to
kvmppc_set_host_ipi to clear the flag.
This fixes BZ 103513.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Commit 595e4f7e697e ("KVM: PPC: Book3S HV: Use load/store_fp_state
functions in HV guest entry/exit") changed the register usage in
kvmppc_save_fp() and kvmppc_load_fp() but omitted changing the
instructions that load and save VRSAVE. The result is that the
VRSAVE value was loaded from a constant address, and saved to a
location past the end of the vcpu struct, causing host kernel
memory corruption and various kinds of host kernel crashes.
This fixes the problem by using register r31, which contains the
vcpu pointer, instead of r3 and r4.
This should help resolve several bugzillas involving guest or host
crashes and hangs, including 98456, 102775, 103534, 100504, and
possibly others.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
register_cpu_notifier() can deadlock if called inside a
get/put_online_cpus block. To avoid this, move the call to
register_cpu_notifier before the get_online_cpus().
[paulus@samba.org - renamed alloc_xxx to alloc_percpu_areas, fixed
compile errors, made up patch description]
This fixes BZ 103213.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
change log:
PPC: KVM: fix to compile without VFIO
vfio: fix in-kernel and ioctl handlers
Fix a bug where asking for a POWER8 guest on a POWER7 system doesn't fail, but should
Fix and performance improvements for nested virtualization
LTC BZ 101114 CPU Build0.6: Host Cpu Offline/online leads to instruction dump and further cpu online/offline functions are not
PowerKVM Build 8 host platform support
Fix problems reported by the kernel RCU checking machinery and may help fix the memory corruption issues we have been seeing
LTC BZ 101123 Unable to bring up LE guest using libvirt/virsh
Fixes a bug with not resetting page struct pointer which caused bugs in calling code.
Fix one of the corner cases when the realmode handler fails to handle T_PUT_TCE_INDIRECT call and passes it further to the vir
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
|
|
The existing handler assumes that the first failed TCE entry's host
physical address is saved in the tce_tmp_hpas cache, but it is not. So
the virtmode handler has to read it from the TCE list again, which is
what this patch does.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
to config-powerpc64 and config-powerpc64p7
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
|
|
The code in remove_cache_dir() is supposed to remove the "cache"
subdirectory from the sysfs directory for a CPU when that CPU is
being offlined. It tries to do this by calling kobject_put() on
the kobject for the subdirectory. However, the subdirectory only
gets removed once the last reference goes away, and the reference
being put here may well not be the last reference. That means
that the "cache" subdirectory may still exist when the offlining
operation has finished. If the same CPU subsequently gets onlined,
the code tries to add a new "cache" subdirectory. If the old
subdirectory has not yet been removed, we get a WARN_ON in the
sysfs code, with stack trace, and an error message printed on the
console. Further, we ultimately end up with an online cpu with no
"cache" subdirectory.
This fixes it by doing an explicit kobject_del() at the point where
we want the subdirectory to go away. kobject_del() removes the sysfs
directory even though the object still exists in memory. The object
will get freed at some point in the future. A subsequent onlining
operation can create a new sysfs directory, even if the old object
still exists in memory, without causing any problems.
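The difference between dropping a reference and explicitly deleting the directory can be shown with a toy reference-counted object in Python. This only models the behaviour described above; the class, its methods and the set standing in for sysfs are all invented for the illustration.

```python
class Kobj:
    """Toy stand-in for a kobject that owns one sysfs directory name."""

    def __init__(self, name, sysfs):
        if name in sysfs:
            # Stands in for the WARN_ON about a duplicate sysfs directory.
            raise RuntimeError("sysfs: duplicate directory '%s'" % name)
        self.name = name
        self.sysfs = sysfs
        self.refcount = 1
        self.in_sysfs = True
        sysfs.add(name)

    def get(self):
        self.refcount += 1

    def delete(self):
        """Like kobject_del(): remove the directory now, keep the object."""
        if self.in_sysfs:
            self.sysfs.discard(self.name)
            self.in_sysfs = False

    def put(self):
        """Like kobject_put(): the directory only goes away with the last ref."""
        self.refcount -= 1
        if self.refcount == 0:
            self.delete()
```

Because delete() records that the directory is already gone, a late put() on the old object does not touch the directory created by the new one.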
This fixes BZ 101114.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This fixes a bug with not resetting page struct pointer
which caused bugs in calling code.
Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
This does for PR KVM what c9438092cae4 ("KVM: PPC: Book3S HV: Take SRCU
read lock around kvm_read_guest() call") did for HV KVM, that is,
eliminate a "suspicious rcu_dereference_check() usage!" warning by
taking the SRCU lock around the call to kvmppc_rtas_hcall().
It also fixes a return of RESUME_HOST to return EMULATE_FAIL instead,
since kvmppc_h_pr() is supposed to return EMULATE_* values.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Now that we have the vcpu floating-point and vector state stored in
the same type of struct as the main kernel uses, we can load that
state directly from the vcpu struct instead of having extra copies
to/from the thread_struct. Similarly, when the guest state needs to
be saved, we can have it saved directly to the vcpu struct by
setting the current->thread.fp_save_area and current->thread.vr_save_area
pointers. That also means that we don't need to back up and restore
userspace's FP/vector state. This all makes the code simpler and
faster.
Note that it's not necessary to save or modify current->thread.fpexc_mode,
since nothing in KVM uses or is affected by its value. Nor is it
necessary to touch used_vr or used_vsr.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This fixes missing read/write TCE bits in VFIO map/unmap ioctls.
This fixes the real mode handler to switch to the virtual mode if
pte does not have "write" AND "dirty" bits set.
This fixes get_user_pages_fast() call in the virtual mode handler
to use correct write flag (used to be 0 always).
This adds a lock around a kvm_memory_slot struct use.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
(cherry picked from commit 754177ee49cd27c9380e7bb9c0de6f8488197ca3)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
This removes the code that handles the H_SET_MODE_RESOURCE_LE and
H_SET_MODE_RESOURCE_ADDR_TRANS_MODE subfunctions of the H_SET_MODE
hypercall from the kernel. Instead we now return H_TOO_HARD which
causes the hypercall to be sent up to userspace to be handled there.
In addition we now also send any other subfunction which we don't
recognize to userspace.
The reason for doing these two subfunctions in userspace is that they
need to modify LPCR across all vcpus of the guest. Modifying LPCR in
the kernel like this introduces a race between the kernel's
modification and any modification that userspace might be doing on
another vcpu. Therefore it's better to let userspace do all the
modifications, so it can do any necessary synchronization itself.
This also adds code to make sure that the MSR_LE bit in intr_msr
(the MSR value we set when synthesizing an interrupt for the guest)
is in sync with the ILE bit in the virtual core's LPCR value. This
is necessary for implementing the LE subfunction of H_SET_MODE in
userspace.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
The load_up_fpu and load_up_altivec functions were never intended to
be called from C, and do things like modifying the MSR value in their
callers' stack frames, which are assumed to be interrupt frames. In
addition, on 32-bit Book S they require the MMU to be off.
This makes KVM use the new load_fp_state() and load_vr_state() functions
instead of load_up_fpu/altivec. This means we can remove the assembler
glue in book3s_rmhandlers.S, and potentially fixes a bug on Book E,
where load_up_fpu was called directly from C.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
(cherry picked from commit 6a87e5da59bf1d1a4186bf27ad8aa5dc3b03dd63)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
With HV KVM, some high-frequency hypercalls such as H_ENTER are handled
in real mode, and need to access the memslots array for the guest.
Accessing the memslots array is safe, because we hold the SRCU read
lock for the whole time that a guest vcpu is running. However, the
checks that kvm_memslots() does when lockdep is enabled are potentially
unsafe in real mode, when only the linear mapping is available.
Furthermore, kvm_memslots() can be called from a secondary CPU thread,
which is an offline CPU from the point of view of the host kernel,
and is not running the task which holds the SRCU read lock.
To avoid false positives in the checks in kvm_memslots(), and to avoid
possible side effects from doing the checks in real mode, this replaces
kvm_memslots() with kvm_memslots_raw() in all the places that execute
in real mode. kvm_memslots_raw() is a new function that is like
kvm_memslots() but uses rcu_dereference_raw_notrace() instead of
kvm_dereference_check().
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Since the guest can read the machine's PVR (Processor Version Register)
directly and see the real value, we should disallow userspace from
setting any value for the guest's PVR other than the real host value.
Therefore this makes kvm_arch_vcpu_set_sregs_hv() check the supplied
PVR value and return an error if it is different from the host value,
which has been put into vcpu->arch.pvr at vcpu creation time.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
On PowerNV platforms, when a CPU is offline, we put it into nap mode.
It's possible that the CPU wakes up from nap mode while it is still
offline due to a stray IPI. A misdirected device interrupt could also
potentially cause it to wake up. In that circumstance, we need to clear
the interrupt so that the CPU can go back to nap mode.
In the past the clearing of the interrupt was accomplished by briefly
enabling interrupts and allowing the normal interrupt handling code
(do_IRQ() etc.) to handle the interrupt. This has the problem that
this code calls irq_enter() and irq_exit(), which call functions such
as account_system_vtime() which use RCU internally. Use of RCU is not
permitted on offline CPUs and will trigger errors if RCU checking is
enabled.
To avoid calling into any generic code which might use RCU, we adopt
a different method of clearing interrupts on offline CPUs. Since we
are on the PowerNV platform, we know that the system interrupt
controller is a XICS being driven directly (i.e. not via hcalls) by
the kernel. Hence this adds a new icp_native_flush_interrupt()
function to the native-mode XICS driver and arranges to call that
when an offline CPU is woken from nap. This new function reads the
interrupt from the XICS. If it is an IPI, it clears the IPI; if it
is a device interrupt, it prints a warning and disables the source.
Then it does the end-of-interrupt processing for the interrupt.
The other thing that briefly enabling interrupts did was to check and
clear the irq_happened flag in this CPU's PACA. Therefore, after
flushing the interrupt from the XICS, we also clear all bits except
the PACA_IRQ_HARD_DIS (interrupts are hard disabled) bit from the
irq_happened flag. The PACA_IRQ_HARD_DIS flag is set by power7_nap()
and is left set to indicate that interrupts are hard disabled. This
means we then have to ignore that flag in power7_nap(), which is
reasonable since it doesn't indicate that any interrupt event needs
servicing.
Signed-off-by: Paul Mackerras <paulus@samba.org>
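The irq_happened handling described above can be illustrated with a minimal sketch. The flag names mirror the kernel's PACA_IRQ_* names, but the numeric values and both helper functions are assumptions made for this example, not the kernel's code.

```python
# Illustrative flag values (assumptions for this sketch, not taken from the patch).
PACA_IRQ_HARD_DIS = 0x01  # interrupts are hard disabled
PACA_IRQ_DBELL = 0x02
PACA_IRQ_EE = 0x04
PACA_IRQ_DEC = 0x08


def clear_pending_except_hard_dis(irq_happened: int) -> int:
    """Drop every pending-event bit but keep the hard-disabled marker."""
    return irq_happened & PACA_IRQ_HARD_DIS


def needs_servicing(irq_happened: int) -> bool:
    """Ignore PACA_IRQ_HARD_DIS: it records state, not a pending event."""
    return (irq_happened & ~PACA_IRQ_HARD_DIS) != 0
```

With these helpers, a CPU woken by a stray IPI would mask its flags down to PACA_IRQ_HARD_DIS and then see nothing left to service before napping again.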
|
|
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
The policy->cpus mask populated by the cpufreq driver is expected to be hotplug
invariant, since the cpufreq core copies this mask as-it-is to policy->related_cpus mask (which shouldn't vary upon hotplug).
The cpufreq core code later prunes the offline CPUs from the policy->cpus mask.
At the moment, the powerpc cpufreq driver uses topology_thread_cpumask() to
populate policy->cpus during .init(), and hence this is NOT hotplug invariant.
Due to this, we hit the following bug:
1. Once we offline all threads of a core, say CPUs 8-15, and online
CPU 8 back, its related cpus mask shows:
$ cat /sys/devices/system/cpu/cpu8/cpufreq/related_cpus
8
[ It should have actually shown 8 9 10 11 12 13 14 15 ]
2. When we try to online the next sibling thread (CPU 9), it tries to do a fresh
initialization since it is not listed in the related_cpus mask of CPU 8. (Note
that for CPU 9, the cpufreq driver would have populated the related_cpus mask
as [ 8 9 ], since those are the 2 online threads in that core so far). During
CPU 9 init, it fails in the call to cpufreq_add_dev_symlink() because it
tries to initialize the sysfs files for CPU 8 as well (which had already been
initialized) while iterating through the policy->cpus.
As a result, we hit this bug while onlining CPU 9:
[ 1019.458183] sysfs: cannot create duplicate filename '/devices/system/cpu/cpu8/cpufreq'
[ 1019.458270] ------------[ cut here ]------------
[ 1019.458338] WARNING: at fs/sysfs/dir.c:530
[ 1019.458367] Modules linked in: xt_tcpudp ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
xt_conntrack nf_conntrack iptable_filter ip_tables x_tables kvm binfmt_misc autofs4 lpfc
[ 1019.458543] CPU: 76 PID: 73014 Comm: bash Not tainted 3.10.11-cpufreq-10 #1
[ 1019.458590] task: c000000ff02c3200 ti: c000000fe7604000 task.ti: c000000fe7604000
[ 1019.458645] NIP: c000000000284634 LR: c000000000284630 CTR: c0000000005b5d10
[ 1019.458700] REGS: c000000fe7606fa0 TRAP: 0700 Not tainted (3.10.11-cpufreq-10)
[ 1019.458754] MSR: 9000000100029032 <SF,HV,EE,ME,IR,DR,RI> CR: 28222824 XER: 20000000
[ 1019.458883] SOFTE: 1
[ 1019.458903] CFAR: c000000000874d6c
[ 1019.458930]
GPR00: c000000000284630 c000000fe7607220 c000000000d9ab60 000000000000004a
GPR04: 0000000000000000 000000000000005a c000000000c82fb8 c000000004482448
GPR08: c000000000c7ab60 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000028222822 c00000000fe13000 0000000010142550 c000000000ce8d70
GPR16: 0000000000000001 c000000000f28c68 0000000000000000 c000000003c20030
GPR20: c000000ff6d91800 c000000000ce8fc8 c000000000b45340 c000000000e26858
GPR24: c000000000ce8d70 0000000000000000 0000000000000001 c000000ff6d91a70
GPR28: c000000fef1b2000 c000000fe7607320 c000000fc98087a0 ffffffffffffffef
[ 1019.459605] NIP [c000000000284634] .sysfs_add_one+0xe4/0x100
[ 1019.459653] LR [c000000000284630] .sysfs_add_one+0xe0/0x100
[ 1019.459689] PACATMSCRATCH [9000000100009032]
[ 1019.459726] Call Trace:
[ 1019.459747] [c000000fe7607220] [c000000000284630] .sysfs_add_one+0xe0/0x100 (unreliable)
[ 1019.459813] [c000000fe76072b0] [c0000000002854dc] .sysfs_do_create_link_sd+0x10c/0x320
[ 1019.459879] [c000000fe7607370] [c000000000718318] .cpufreq_add_dev_interface+0x2e8/0x410
[ 1019.459943] [c000000fe7607710] [c000000000718da0] .cpufreq_add_dev+0x590/0x6d0
[ 1019.460009] [c000000fe7607810] [c000000000899580] .cpufreq_cpu_callback+0x7c/0x94
[ 1019.460073] [c000000fe7607890] [c00000000086f40c] .notifier_call_chain+0x8c/0x100
[ 1019.460138] [c000000fe7607930] [c000000000091450] .cpu_notify+0x40/0xa0
[ 1019.460194] [c000000fe76079b0] [c00000000089696c] ._cpu_up+0x17c/0x1ec
[ 1019.460249] [c000000fe7607a70] [c000000000896b40] .cpu_up+0x164/0x194
[ 1019.460304] [c000000fe7607b00] [c000000000746edc] .store_online+0xbc/0xa60
[ 1019.460361] [c000000fe7607bb0] [c0000000004faf64] .dev_attr_store+0x64/0xa0
[ 1019.460417] [c000000fe7607c40] [c000000000282244] .sysfs_write_file+0xf4/0x1d0
[ 1019.460482] [c000000fe7607cf0] [c0000000001f1fa8] .vfs_write+0xe8/0x260
[ 1019.460537] [c000000fe7607d90] [c0000000001f2c44] .SyS_write+0x64/0xe0
[ 1019.460593] [c000000fe7607e30] [c000000000009d54] syscall_exit+0x0/0x98
[ 1019.460647] Instruction dump:
[ 1019.460675] 481b0b2d 60000000 e89e0010 7f83e378 38a01000 481b0b19 60000000 7f84e378
[ 1019.460774] 3c62ffd5 38632cf0 485f06dd 60000000 <0fe00000> 7f83e378 4bf5f8a5 60000000
[ 1019.460952] ---[ end trace 600f2280a5b2cd86 ]---
None of this would have occurred if related_cpus had remained unchanged during
hotplug, because in that case, CPU 9 would have done a light-weight init, thus
avoiding this duplication bug. So fix this by populating policy->cpus in a
hotplug invariant manner in the cpufreq driver.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
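The difference between a hotplug-dependent and a hotplug-invariant sibling mask can be sketched as follows. This is a toy model, not the driver's code: THREADS_PER_CORE = 8 matches the CPUs 8-15 example above, and both function names are hypothetical.

```python
THREADS_PER_CORE = 8  # assumption, matching the CPUs 8-15 example


def online_sibling_mask(cpu: int, online: set) -> list:
    """Siblings as seen through an online-only mask (varies with hotplug)."""
    first = cpu - cpu % THREADS_PER_CORE
    return [c for c in range(first, first + THREADS_PER_CORE) if c in online]


def invariant_sibling_mask(cpu: int) -> list:
    """All threads of the core, regardless of which ones are online."""
    first = cpu - cpu % THREADS_PER_CORE
    return list(range(first, first + THREADS_PER_CORE))


# Core 1 (CPUs 8-15) was fully offlined, then only CPU 8 came back.
online = set(range(0, 8)) | {8}
print(online_sibling_mask(8, online))   # only the online thread
print(invariant_sibling_mask(8))        # the whole core
```

In this model the invariant mask already lists CPU 9 when CPU 8 is initialized, which is the condition the message says allows CPU 9 to take the light-weight init path.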
|
|
The platform provides power data in watts, whereas hwmon expects it in
microwatts.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
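The unit conversion amounts to a multiplication by one million. A minimal sketch, with a hypothetical helper name:

```python
MICROWATTS_PER_WATT = 1_000_000


def watts_to_microwatts(watts: int) -> int:
    """Scale a platform reading in watts to the microwatts hwmon expects."""
    return watts * MICROWATTS_PER_WATT
```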
|
|
This patch is an incremental patch on top of commit af93eec4.
It adds support for resending the dump-available notification and updates
the README file. It also fixes a few other minor issues.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Detect and recover from a machine check taken inside OPAL on a special
SCOM load instruction. On a specific SCOM read via MMIO we may get a machine
check exception with SRR0 pointing inside OPAL. To recover from the MC
in this scenario, get a recovery instruction address and return to it from
the MC.
OPAL will export the machine check recoverable ranges through
device tree node mcheck-recoverable-ranges under ibm,opal:
# hexdump /proc/device-tree/ibm,opal/mcheck-recoverable-ranges
0000000 0000 0000 3000 2804 0000 000c 0000 0000
0000010 3000 2814 0000 0000 3000 27f0 0000 000c
0000020 0000 0000 3000 2814 xxxx xxxx xxxx xxxx
0000030 llll llll yyyy yyyy yyyy yyyy
...
...
#
where:
xxxx xxxx xxxx xxxx = Starting instruction address
llll llll = Length of the address range.
yyyy yyyy yyyy yyyy = recovery address
Each recoverable address range entry is a (start address, len,
recovery address) tuple: 2 cells each for the start and recovery addresses
and 1 cell for len, totalling 5 cells per entry. At kernel boot time, build
up the recovery table from the list of recovery ranges in the device-tree
node; the table is used during a machine check exception to recover from an
MMIO SCOM UE.
Changes in v2:
- As per Ben's comment, added mcheck-recoverable-ranges property under
ibm,opal node.
- Changed the format of the mcheck-recoverable-ranges list.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
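The 5-cell entry layout described above can be decoded with a short sketch. It assumes 32-bit big-endian cells (the usual device-tree encoding) and that the hexdump above shows bytes in that order. The function names are hypothetical, and treating the range end as exclusive is an assumption the message does not confirm.

```python
import struct
from typing import List, Optional, Tuple

# 2 cells start (64-bit), 1 cell len (32-bit), 2 cells recovery (64-bit) = 20 bytes.
ENTRY = struct.Struct(">QIQ")


def parse_ranges(blob: bytes) -> List[Tuple[int, int, int]]:
    """Split the property into (start, length, recovery) tuples."""
    if len(blob) % ENTRY.size:
        raise ValueError("property length is not a multiple of 5 cells")
    return list(ENTRY.iter_unpack(blob))


def find_recovery(ranges: List[Tuple[int, int, int]], nip: int) -> Optional[int]:
    """Return the recovery address for nip, or None if no range covers it.

    Assumption: a range covers start <= nip < start + length.
    """
    for start, length, recovery in ranges:
        if start <= nip < start + length:
            return recovery
    return None


# The first two entries from the hexdump above.
SAMPLE = bytes.fromhex(
    "00000000300028040000000c0000000030002814"
    "00000000300027f00000000c0000000030002814"
)
ranges = parse_ranges(SAMPLE)
```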
|
|
This patch adds basic kernel enablement for reading power values, fan
speed rpm and temperature values on powernv platforms which will
be exported to user space through /sys interface.
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch enables fetching of various platform sensor data through
OPAL and expects a sensor handle from the driver to pass to OPAL.
Signed-off-by: Sahir K <sahirk1@linux.vnet.ibm.com>
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch enables reading and updating of system parameters through
OPAL call.
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch adds support for notifying clients of the completion of their
requests. Clients request a token before making an OPAL call
and then wait for the response.
This patch uses the messaging infrastructure to pull the data into Linux
by registering for the message type OPAL_MSG_ASYNC_COMP.
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Sometimes, especially in the scenario of loading another kernel with kdump,
we get an EEH error on a non-existent PE. That means the PEEV / PEST in
the corresponding PHB is in a messy state and we can't handle that case.
The patch escalates the error to a fenced PHB so that the PHB can be
reset in order to recover from the errors on non-existent PEs.
Reported-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Tested-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
A single PCI-error-related OPAL event may correspond to multiple
EEH errors, for example multiple frozen PEs detected on
different PHBs. Unfortunately, we didn't cover that case. The patch
enumerates the return values from eeh_ops::next_error() and changes
eeh_handle_special_event() and eeh_ops::next_error() to handle all
existing EEH errors.
As Ben pointed out, we needn't list_for_each_entry_safe() since we
are not deleting any PHB from the hose_list and the EEH serialized
lock should be held while purging EEH events. The patch covers those
suggestions as well.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
If we set irq_work on a processor and immediately afterward, before the
irq work has a chance to be processed, we change the decrementer value,
we can seriously delay the handling of that irq_work.
Fix it by checking in a few places for pending irq work, first before
changing the decrementer in decrementer_set_next_event() and after
changing it in the same function and in timer_interrupt().
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Hugh Dickins reported that commit b5ff4211a829
("powerpc/book3s: Queue up and process delayed MCE events") breaks the
PowerMac G5 boot. This patch fixes it by moving the MCE event processing
away from syscall exit, which was the wrong place to do it in the first
place, and using the irq work framework to delay processing of MCE events.
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
At present we assume the candidate image is <= 256MB. But on P8, the
candidate image size can go up to 750MB. Hence increase the
candidate image max size to 1GB.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
There have been some weird bugs in the past where the kernel tried to associate
threads of the same core to different NUMA nodes, and things went haywire after
that point (as expected).
But unfortunately, root-causing such issues has been quite challenging, due to
the lack of appropriate debug checks in the kernel. These bugs usually lead to
some odd soft-lockups in the scheduler's build-sched-domain code in the CPU
hotplug path, which makes it very hard to trace it back to the incorrect
cpu-to-node mappings.
So add appropriate debug checks to catch such invalid cpu-to-node mappings
as early as possible.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
On POWER platforms, the hypervisor can notify the guest kernel about dynamic
changes in the cpu-numa associativity (VPHN topology update). Hence the
cpu-to-node mappings that we got from the firmware during boot, may no longer
be valid after such updates. This is handled using the arch_update_cpu_topology()
hook in the scheduler, and the sched-domains are rebuilt according to the new
mappings.
But unfortunately, at the moment, CPU hotplug ignores these updated mappings
and instead queries the firmware for the cpu-to-numa relationships and uses
them during CPU online. So the kernel can end up assigning wrong NUMA nodes
to CPUs during subsequent CPU hotplug online operations (after booting).
Further, a particularly problematic scenario can result from this bug:
On POWER platforms, the SMT mode can be switched between 1, 2, 4 (and even 8)
threads per core. The switch to Single-Threaded (ST) mode is performed by
offlining all except the first CPU thread in each core. Switching back to
SMT mode involves onlining those other threads back, in each core.
Now consider this scenario:
1. During boot, the kernel gets the cpu-to-node mappings from the firmware
and assigns the CPUs to NUMA nodes appropriately, during CPU online.
2. Later on, the hypervisor updates the cpu-to-node mappings dynamically and
communicates this update to the kernel. The kernel in turn updates its
cpu-to-node associations and rebuilds its sched domains. Everything is
fine so far.
3. Now, the user switches the machine from SMT to ST mode (say, by running
ppc64_cpu --smt=1). This involves offlining all except 1 thread in each
core.
4. The user then tries to switch back from ST to SMT mode (say, by running
ppc64_cpu --smt=4), and this involves onlining those threads back. Since
CPU hotplug ignores the new mappings, it queries the firmware and tries to
associate the newly onlined sibling threads to the old NUMA nodes. This
results in sibling threads within the same core getting associated with
different NUMA nodes, which is incorrect.
The scheduler's build-sched-domains code gets thoroughly confused with this
and enters an infinite loop and causes soft-lockups, as explained in detail
in commit 3be7db6ab (powerpc: VPHN topology change updates all siblings).
So to fix this, use the numa_cpu_lookup_table to remember the updated
cpu-to-node mappings, and use them during CPU hotplug online operations.
Further, we also need to ensure that all threads in a core are assigned to a
common NUMA node, irrespective of whether all those threads were online during
the topology update. To achieve this, we take care not to use cpu_sibling_mask()
since it is not hotplug invariant. Instead, we use cpu_first_sibling_thread()
and set up the mappings manually using the 'threads_per_core' value for that
particular platform. This helps us ensure that we don't hit this bug with any
combination of CPU hotplug and SMT mode switching.
Cc: stable@vger.kernel.org
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
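The idea of updating every thread of a core, whether online or not, can be sketched with a plain dictionary standing in for numa_cpu_lookup_table. The helper names and the threads_per_core value of 8 are assumptions for illustration; this is not the kernel's implementation.

```python
THREADS_PER_CORE = 8  # assumption; the real value is platform specific


def first_sibling_thread(cpu: int, threads_per_core: int = THREADS_PER_CORE) -> int:
    """Lowest-numbered thread of the core that cpu belongs to."""
    return cpu - cpu % threads_per_core


def update_core_node(table: dict, cpu: int, node: int,
                     threads_per_core: int = THREADS_PER_CORE) -> None:
    """Record the new node for every thread of cpu's core, online or not."""
    first = first_sibling_thread(cpu, threads_per_core)
    for sibling in range(first, first + threads_per_core):
        table[sibling] = node


# 16 CPUs start on node 0; the hypervisor then moves CPU 8's core to node 1
# while only CPU 8 is online (ST mode).
lookup_table = {cpu: 0 for cpu in range(16)}
update_core_node(lookup_table, 8, 1)
```

When CPUs 9-15 are onlined later, consulting the table rather than the firmware would place them on node 1 alongside CPU 8 in this model.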
|
|
Some devices, for example PCI root ports, don't have an IOMMU table and
group. We needn't detach them from their IOMMU group. Otherwise, the
kernel may crash by dereferencing a NULL IOMMU group,
as the following backtrace indicates:
.iommu_group_remove_device+0x74/0x1b0
.iommu_bus_notifier+0x94/0xb4
.notifier_call_chain+0x78/0xe8
.__blocking_notifier_call_chain+0x7c/0xbc
.blocking_notifier_call_chain+0x38/0x48
.device_del+0x50/0x234
.pci_remove_bus_device+0x88/0x138
.pci_stop_and_remove_bus_device+0x2c/0x40
.pcibios_remove_pci_devices+0xcc/0xfc
.pcibios_remove_pci_devices+0x3c/0xfc
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
When an EEH error hits a specific PCI device before its driver
is loaded, we apply hotplug to recover from the error. During the
plug time, the PCI device is probed and its driver is loaded.
Then we wrongly call the error handlers if the driver supports
EEH explicitly.
The patch fixes that by introducing the flag EEH_DEV_NO_HANDLER and
setting it before we remove the PCI device. In turn, we avoid wrongly
calling the error handlers of the PCI device after its driver is loaded.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The patch implements the EEH operation backend restore_config()
for PowerNV platform. That relies on OPAL API opal_pci_reinit()
where we reinitialize the error reporting properly after PE or
PHB reset.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
After a reset of a specific PE or PHB, we never configure AER
correctly on the PowerNV platform. We needn't care about it on the
pSeries platform. The patch introduces an additional EEH operation,
eeh_ops::restore_config(), so that we have a chance to configure AER
correctly on the PowerNV platform.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
We don't have IO ports on PHB3 and the assignment of variable
"iomap_off" on PHB3 is meaningless. The patch just removes the
unnecessary assignment to the variable. The code change should
have been part of commit c35d2a8c ("powerpc/powernv: Needn't IO
segment map for PHB3").
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Prevent ioda_eeh_hub_diag() from clobbering itself when called by supplying
a per-PHB buffer for P7IOC hub diagnostic data. Take care to inform OPAL of
the correct size for the buffer.
[Small style change to the use of sizeof -- BenH]
Signed-off-by: Brian W Hart <hartb@linux.vnet.ibm.com>
Acked-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
In EXCEPTION_PROLOG_COMMON() we check to see if the stack pointer (r1)
is valid when coming from the kernel. If it's not valid, we die but
with a nice oops message.
Currently we allocate a stack frame (subtract INT_FRAME_SIZE) before we
check to see if the stack pointer is negative. Unfortunately, this
won't detect a bad stack where r1 is less than INT_FRAME_SIZE.
This patch fixes the check to compare the modified r1 with
-INT_FRAME_SIZE. With this, bad kernel stack pointers (including NULL
pointers) are correctly detected again.
Kudos to Paulus for finding this.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: stable@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
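The arithmetic behind the two checks can be worked through in a small model. It treats r1 as a 64-bit value compared as a signed integer and assumes valid kernel stack addresses have the top bit set, so they are negative when signed. INT_FRAME_SIZE = 0x300 is a hypothetical value chosen only for this illustration.

```python
INT_FRAME_SIZE = 0x300  # hypothetical frame size, for illustration only
MASK64 = (1 << 64) - 1


def as_signed64(value: int) -> int:
    """Interpret the low 64 bits of value as a two's-complement integer."""
    value &= MASK64
    return value - (1 << 64) if value & (1 << 63) else value


def old_check_flags_bad(r1: int) -> bool:
    """Old behaviour: allocate the frame, then test whether r1 is non-negative."""
    return as_signed64(r1 - INT_FRAME_SIZE) >= 0


def new_check_flags_bad(r1: int) -> bool:
    """Fixed behaviour: compare the modified r1 with -INT_FRAME_SIZE."""
    return as_signed64(r1 - INT_FRAME_SIZE) >= -INT_FRAME_SIZE


for r1 in (0x0, 0x100, 0x10000, 0xC00000018240F6F0):
    print(hex(r1), old_check_flags_bad(r1), new_check_flags_bad(r1))
```

In this model a NULL r1 wraps to a negative value after the subtraction, so the old test lets it through while the new comparison flags it.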
|
|
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
|
|
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
|
|
Makefile.config
Makefile.release
cpupower.config
cpupower.service
mcp8_configs.tar.gz
mod-extra.list
mod-extra.sh
mod-sign.sh
x509.genkey
Currently, when building the kernel on the koji server, we need to build an srpm first.
The kernel.spec file (the spec file is from mcp) includes some source files,
which makes it really hard to build the kernel package on the koji server:
1) Get kernel.spec from git repo.
2) Get SOURCES files from mcp cvs
3) Build a kernel package from kernel.spec + SOURCES files
make srpm to create a kernel.srpm(contains kernel.spec + SOURCES)
4) run koji build --scratch mcp8-rawhide kernel.srpm to build a kernel package.
By copying all these source files to the git repo, we can build the kernel easily
with the following steps:
1) Get kernel.spec from git repo.
2) Build a kernel package from kernel.spec(only contain the spec file)
rpmbuild -bs kernel.spec
3) run koji build --scratch mcp8-rawhide kernel.srpm to build a kernel package.
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
|
|
Requested by Alexey Kardashevskiy
Revert "PPC: KVM: move TCE cache alloc/free from KVM-common to KVM-HV"
This reverts commit af4c301bd5b700f62597bcdf8e6f66bd2fd65db9.
|
|
When a write to MMIO happens and there is an ioeventfd for it that
is handled successfully, ioeventfd_write() returns 0 (success) and
kvmppc_handle_store() returns EMULATE_DONE. Then kvmppc_emulate_mmio()
converts EMULATE_DONE to RESUME_GUEST_NV, which broke out of the loop.
This adds handling of RESUME_GUEST_NV in kvmppc_vcpu_run_hv().
Cc: Michael S. Tsirkin <mst@redhat.com>
Suggested-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
|
|
Since commit 52367a763d8046190754ab43743e42638564a2d1
("cxgb4/cxgb4vf: Code cleanup to enable T4 Configuration File support"),
we have failures like this during cxgb4 probe:
cxgb4 0000:01:00.4: bad SGE FL page buffer sizes [65536, 65536]
cxgb4: probe of 0000:01:00.4 failed with error -22
This happens whenever software parameters are used, without a
configuration file. That happens when the hardware was already
initialized (after kexec, or after csiostor is loaded).
It happens that these values are acceptable, rendering fl_pg_order equal
to 0, which is the case of a hard init when the page size is equal or
larger than 65536.
Accepting fl_large_pg equal to fl_small_pg solves the issue, and
shouldn't cause any trouble besides a possible performance reduction
when smaller pages are used. And that can be fixed by a configuration
file.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
|
|
This changes the PCI power management scheme of the bnx2x driver to be similar
to those of most network drivers - the driver will now change the power state
to D3hot whenever the driver is removed, instead of whenever an
interface is unloaded.
This change enables the driver to access its eeprom via ethtool callbacks
even when interfaces are unloaded (such access requires the function to be
in D0active).
Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: Ariel Elior <ariele@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The upstream kernel got a VFIO KVM device which we support on PPC64 to
associate LIOBNs with IOMMU groups.
This moves the existing real/virtual mode handlers to newer codebase.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
The idea of the tce_tmp_hpas cache is to pass partially processed TCEs
from real mode to virtual mode via H_TOO_HARD mechanism.
Since TCE hypercalls are never called in real mode under PR KVM, move them
to HV KVM.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
In addition to the external VFIO user API, a VFIO KVM device
has been introduced recently.
sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap
via hypercalls which take a logical bus id (LIOBN) as a target IOMMU
identifier. LIOBNs are made up and linked to IOMMU groups by the user
space. In order to accelerate IOMMU operations in the KVM, we need
to tell KVM the information about LIOBN-to-group mapping.
For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter
is added. It accepts a pair of a VFIO group fd and LIOBN.
This also adds a new kvm_vfio_find_group_by_liobn() function which
receives a kvm struct, a LIOBN and a callback. As it increases the IOMMU
group use counter, KVM is required to pass a callback which is
called when the VFIO group is about to be removed from VFIO-KVM tracking, so
that KVM is able to call iommu_group_put() to release the IOMMU group.
The KVM uses kvm_vfio_find_group_by_liobn() once per KVM run and caches
the result in kvm_arch. iommu_group_put() for all groups will be called
when KVM finishes (in the SPAPR TCE in KVM enablement patch).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
We currently use some ad-hoc arch variables tied to legacy KVM device
assignment to manage emulation of instructions that depend on whether
non-coherent DMA is present. Create an interface for this, adapting
legacy KVM device assignment and adding VFIO via the KVM-VFIO device.
For now we assume that non-coherent DMA is possible any time we have a
VFIO group. Eventually an interface can be developed as part of the
VFIO external user interface to query the coherency of a group.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit e0f0bbc527f6e9c0261f1d16b2a0b47612b7f235)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
Default to operating in coherent mode. This simplifies the logic when
we switch to a model of registering and unregistering noncoherent I/O
with KVM.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit d96eb2c6f480769bff32054e78b964860dae4d56)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
So far we've succeeded at making KVM and VFIO mostly unaware of each
other, but areas are cropping up where a connection beyond eventfds
and irqfds needs to be made. This patch introduces a KVM-VFIO device
that is meant to be a gateway for such interaction. The user creates
the device and can add and remove VFIO groups to it via file
descriptors. When a group is added, KVM verifies the group is valid
and gets a reference to it via the VFIO external user interface.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit ec53500fae421e07c5d035918ca454a429732ef4)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
The current VFIO_DEVICE_RESET interface only maps to PCI use cases
where we can isolate the reset to the individual PCI function. This
means the device must support FLR (PCIe or AF), PM reset on D3hot->D0
transition, device specific reset, or be a singleton device on a bus
for a secondary bus reset. FLR does not have widespread support,
PM reset is not very reliable, and bus topology is dictated by the
system and device design. We need to provide a means for a user to
induce a bus reset in cases where the existing mechanisms are not
available or not reliable.
This device specific extension to VFIO provides the user with this
ability. Two new ioctls are introduced:
- VFIO_DEVICE_PCI_GET_HOT_RESET_INFO
- VFIO_DEVICE_PCI_HOT_RESET
The first provides the user with information about the extent of
devices affected by a hot reset. This is essentially a list of
devices and the IOMMU groups they belong to. The user may then
initiate a hot reset by calling the second ioctl. We must be
careful that the user has ownership of all the affected devices
found via the first ioctl, so the second ioctl takes a list of file
descriptors for the VFIO groups affected by the reset. Each group
must have IOMMU protection established for the ioctl to succeed.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
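The ownership rule described above, that every group touched by the reset must be among the groups the user passes in, can be modelled as a set comparison. This is only a conceptual sketch: the device addresses, group numbers and function name are made up, and the real ioctl works with file descriptors rather than group IDs.

```python
def missing_groups(affected: dict, owned_groups: set) -> set:
    """Groups touched by the hot reset that the caller did not supply.

    affected maps a device address to its IOMMU group, as a stand-in for
    the information the first ioctl is described as returning.
    """
    return set(affected.values()) - owned_groups


# Hypothetical topology: three functions behind one bus, in two groups.
affected = {"0000:01:00.0": 5, "0000:01:00.1": 5, "0000:01:00.2": 7}

print(missing_groups(affected, {5}))     # group 7 is not owned: reset refused
print(missing_groups(affected, {5, 7}))  # empty set: reset may proceed
```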
|
|
Having PCIe/PCI-X capability isn't enough to assume that there are
extended capabilities. Both specs define that the first capability
header is all zero if there are no extended capabilities. Testing
for this avoids an erroneous message about hiding capability 0x0 at
offset 0x100.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
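The all-zero test on the first extended capability header can be sketched against a raw copy of configuration space. The sketch assumes the little-endian PCI config layout with the first extended header at offset 0x100 and the standard field split (16-bit ID, 4-bit version, 12-bit next offset). The function name is hypothetical.

```python
import struct
from typing import Optional, Tuple

PCI_CFG_SPACE_SIZE = 0x100  # first extended capability header lives here


def first_ext_cap(config: bytes) -> Optional[Tuple[int, int, int]]:
    """Return (cap_id, version, next_offset), or None if there are no
    extended capabilities (header reads as all zero or space is too short)."""
    if len(config) < PCI_CFG_SPACE_SIZE + 4:
        return None
    (header,) = struct.unpack_from("<I", config, PCI_CFG_SPACE_SIZE)
    if header == 0:
        return None
    cap_id = header & 0xFFFF
    version = (header >> 16) & 0xF
    next_offset = (header >> 20) & 0xFFC  # low two bits are reserved
    return cap_id, version, next_offset


empty = bytearray(4096)               # device with no extended capabilities
with_cap = bytearray(4096)
struct.pack_into("<I", with_cap, PCI_CFG_SPACE_SIZE, 0x14010001)
```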
|
|
eventfd_fget() tests to see whether the file is an eventfd file, which
we then immediately pass to eventfd_ctx_fileget(), which again tests
whether the file is an eventfd file. Simplify slightly by using
fdget() so that we only test that we're looking at an eventfd once.
fget() could also be used, but fdget() makes use of fget_light() for
another slight optimization.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
|
|
Users of pci_reset_bus() and pci_reset_slot() need a way to probe
whether the bus or slot supports reset. Add trivial helper functions
and export them as vfio-pci will make use of these.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
(cherry picked from commit 9a3d2b9beefd5b07c1d8f70ded01b88f203ee304)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
One PCI bus reset function to rule them all.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
(cherry picked from commit 1b95ce8fc9c12fdb60047f2f9950f29e76e7c66d)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
The PCI spec indicates that with stable power, reset needs to be
asserted for a minimum of 1ms (Trst). We should be able to assume
stable power for a Hot Reset, but we add another millisecond as
a fudge factor to make sure the reset is seen on the bus for at least
a full 1ms.
After reset is de-asserted we must wait for devices to complete
initialization. The specs refer to this as "recovery time" (Trhfa).
For PCI this is 2^25 clock cycles or 2^26 for PCI-X. For minimum
bus speeds, both of those come to 1s. PCIe "softens" this
requirement with the Configuration Request Retry Status (CRS)
completion status. Theoretically we could use CRS to shorten the
wait time. We don't make use of that here, using a fixed 1s delay
to allow devices to re-initialize.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
(cherry picked from commit de0c548c33429cc78fd47a3c190c6d00b0e4e441)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
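The timing figures above can be checked with a little arithmetic. The cycle counts come from the message; the 33 MHz and 66 MHz bus clocks are assumptions standing in for the "minimum bus speeds" it mentions.

```python
def recovery_time_seconds(cycles: int, bus_hz: float) -> float:
    """Time taken by a given number of bus clock cycles."""
    return cycles / bus_hz


# Assumed minimum bus clocks (not stated in the message).
PCI_MIN_HZ = 33_000_000
PCIX_MIN_HZ = 66_000_000

pci_wait = recovery_time_seconds(2 ** 25, PCI_MIN_HZ)
pcix_wait = recovery_time_seconds(2 ** 26, PCIX_MIN_HZ)

# Reset assertion: 1 ms minimum (Trst) plus 1 ms fudge factor.
TRST_MS = 1
FUDGE_MS = 1
assert_time_ms = TRST_MS + FUDGE_MS

print(f"PCI:   {pci_wait:.3f} s")
print(f"PCI-X: {pcix_wait:.3f} s")
print(f"reset asserted for {assert_time_ms} ms")
```

Under these assumed clocks both waits come out at roughly one second, in line with the fixed 1s delay the message describes.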
|
|
Devices come out of reset in D0. Restoring a device to a different
post-reset state takes more smarts than our simple config space
restore, which can leave devices in an inconsistent state. For
example, if a device is reset in D3, but the restore doesn't
successfully return the device to D3, then the actual state of the
device and dev->current_state are contradictory. Put everything
in D0 going into the reset, then we don't need to do anything
special on the way out.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
(cherry picked from commit a6cbaadea0af9b4aa6eee2882f2aa761ab91a4f8)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
Sometimes pci_reset_function() is not sufficient. We have cases where
devices do not support any kind of reset, but there might be multiple
functions on the bus preventing pci_reset_function() from doing a
secondary bus reset. We also have cases where a device will advertise
that it supports a PM reset, but really does nothing on D3hot->D0
(graphics cards are notorious for this). These devices often also
have more than one function, so even blacklisting PM reset for them
wouldn't allow a secondary bus reset through pci_reset_function().
If a driver supports multiple devices it should have the ability to
induce a bus reset when it needs to. This patch provides that ability
through pci_reset_slot() and pci_reset_bus(). It's the caller's
responsibility when using these interfaces to understand that all of
the devices in or below the slot (or on or below the bus) will be
reset and therefore should be under control of the caller. PCI state
of all the affected devices is saved and restored around these resets,
but internal state of all of the affected devices is reset (which
should be the intention).
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
(cherry picked from commit 090a3c5322e900f468b3205b76d0837003ad57b2)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|