Age | Commit message | Author | Files | Lines
2014-06-03 | net/cxgb4: Fix referencing freed adapter [HEAD, powerkvm, master] | Gavin Shan | 1 | -1/+1
The adapter is freed before we check its flags. This was caused by commit 144be3d ("net/cxgb4: Avoid disabling PCI device for towice"). The problem was reported by Intel's "0-day" tool. The patch fixes it so that commit 144be3d does not have to be reverted. This is in response to bug#110450. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
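The ordering problem described above can be illustrated with a minimal user-space sketch. It is not the cxgb4 code: `struct adapter`, `DEV_ENABLED`, `fake_disable_device()` and `remove_adapter()` are hypothetical names, and the one-line driver change itself is not shown in this log. The idea is simply to read the flag while the object is still valid and act on the saved copy after freeing it.

```c
#include <stdbool.h>
#include <stdlib.h>

#define DEV_ENABLED 0x1u            /* hypothetical flag bit */

struct adapter {                    /* hypothetical stand-in for the driver's adapter */
    unsigned int flags;
};

/* Hypothetical stand-in for disabling the PCI device: only counts calls. */
static int disable_calls;
static void fake_disable_device(void) { disable_calls++; }

/*
 * Sample the flag before the adapter is freed, then use only the saved
 * copy. Testing adap->flags after free() would be a use-after-free.
 */
static void remove_adapter(struct adapter *adap)
{
    bool was_enabled = (adap->flags & DEV_ENABLED) != 0;

    free(adap);                     /* adap must not be dereferenced past here */

    if (was_enabled)
        fake_disable_device();
}
```

Calling `remove_adapter()` on an adapter whose flag is clear leaves `disable_calls` unchanged, which mirrors the intent of not disabling the device a second time.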
2014-06-03 | net/cxgb4: Don't retrieve stats during recovery | Gavin Shan | 1 | -0/+10
The adapter's statistics may be retrieved during EEH recovery, which should be disallowed. Otherwise, it could trigger repeated EEH errors and EEH recovery would eventually fail. The patch reuses the statistics lock and checks that the net_device is attached before retrieving statistics, so that the problem is avoided. This is in response to bug#110450. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
2014-06-03 | net/cxgb4: Avoid disabling PCI device for towice | Gavin Shan | 2 | -7/+21
If an EEH error hits the adapter and we have to remove it from the system for some reason (e.g. more than 5 EEH errors detected from the device in the last hour), the adapter is disabled twice, separately by eeh_err_detected() and remove_one(), which causes the following unexpected backtrace. The patch tries to avoid it. This is in response to bug#110450.

  WARNING: at drivers/pci/pci.c:1431
  CPU: 12 PID: 121 Comm: eehd Not tainted 3.13.0-rc7+ #1
  task: c0000001823a3780 ti: c00000018240c000 task.ti: c00000018240c000
  NIP: c0000000003c1e40 LR: c0000000003c1e3c CTR: 0000000001764c5c
  REGS: c00000018240f470 TRAP: 0700 Not tainted (3.13.0-rc7+)
  MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 28000024 XER: 00000004
  CFAR: c000000000706528 SOFTE: 1
  GPR00: c0000000003c1e3c c00000018240f6f0 c0000000010fe1f8 0000000000000035
  GPR04: 0000000000000000 0000000000000000 00000000003ae509 0000000000000000
  GPR08: 000000000000346f 0000000000000000 0000000000000000 0000000000003fef
  GPR12: 0000000028000022 c00000000ec93000 c0000000000c11b0 c000000184ac3e40
  GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  GPR24: 0000000000000000 c0000000009398d8 c00000000101f9c0 c0000001860ae000
  GPR28: c000000182ba0000 00000000000001f0 c0000001860ae6f8 c0000001860ae000
  NIP [c0000000003c1e40] .pci_disable_device+0xd0/0xf0
  LR [c0000000003c1e3c] .pci_disable_device+0xcc/0xf0
  Call Trace:
  [c0000000003c1e3c] .pci_disable_device+0xcc/0xf0 (unreliable)
  [d0000000073881c4] .remove_one+0x174/0x320 [cxgb4]
  [c0000000003c57e0] .pci_device_remove+0x60/0x100
  [c00000000046396c] .__device_release_driver+0x9c/0x120
  [c000000000463a20] .device_release_driver+0x30/0x60
  [c0000000003bcdb4] .pci_stop_bus_device+0x94/0xd0
  [c0000000003bcf48] .pci_stop_and_remove_bus_device+0x18/0x30
  [c00000000003f548] .pcibios_remove_pci_devices+0xa8/0x140
  [c000000000035c00] .eeh_handle_normal_event+0xa0/0x3c0
  [c000000000035f50] .eeh_handle_event+0x30/0x2b0
  [c0000000000362c4] .eeh_event_handler+0xf4/0x1b0
  [c0000000000c12b8] .kthread+0x108/0x130
  [c00000000000a168] .ret_from_kernel_thread+0x5c/0x74

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
2014-06-03 | KVM: PPC: Book3S HV: Increase timeout for grabbing secondary threads | Paul Mackerras | 1 | -1/+1
Occasional failures have been seen with split-core mode and migration where the message "KVM: couldn't grab cpu" appears. This increases the length of time that we wait from 1ms to 10ms, which seems to work around the issue. Fixes: BZ 110865 Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-05-23 | bnx2x: EEH recovory failed with Shiner adapter | Wang Sen | 1 | -2/+2
This patch fixes the EEH recovery issue in bnx2x. Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com> LTC-Bugzilla: #110449
2014-05-22 | bnx2x: Fix kernel crash and EEH recovery issues | wenxiong@linux.vnet.ibm.com | 1 | -0/+2
On Tuleta systems, HTX reports a data miscompare after EEH recovery. Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
2014-05-22 | lpfc: Add iotag memory barrier | Brian King | 1 | -0/+2
Add a memory barrier to ensure the valid bit is read before any of the cqe payload is read. This fixes an issue seen on Power where the cqe payload was getting loaded before the valid bit. When this occurred, we saw an iotag out of range error when a command completed, but since the iotag looked invalid the command didn't get completed to scsi core. Later we hit the command timeout, attempted to abort the command, then waited for the aborted command to get returned. Since the adapter had already returned the command, we timed out waiting and ended up escalating EEH all the way to host reset. This patch fixes this issue. Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
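The "valid bit first, payload second" ordering can be sketched in portable C11. This is an analogy rather than the lpfc change: in the driver the producer is the adapter writing the completion entry by DMA, and the barrier primitive the patch uses is not shown in this log. Here `struct cqe`, `cqe_publish()` and `cqe_consume()` are hypothetical names, and an acquire load stands in for the read barrier.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct cqe {                        /* hypothetical completion-queue entry */
    uint32_t payload;               /* written by the producer first */
    atomic_bool valid;              /* published last */
};

/* Producer: fill in the payload, then publish the valid flag. */
static void cqe_publish(struct cqe *e, uint32_t payload)
{
    e->payload = payload;
    atomic_store_explicit(&e->valid, true, memory_order_release);
}

/*
 * Consumer: the acquire load orders the payload read after the valid
 * check, so a weakly ordered CPU cannot load the payload early.
 */
static bool cqe_consume(struct cqe *e, uint32_t *out)
{
    if (!atomic_load_explicit(&e->valid, memory_order_acquire))
        return false;               /* entry not ready, payload untouched */
    *out = e->payload;
    return true;
}
```

Without the acquire ordering, the consumer could observe `valid == true` together with a stale payload, which corresponds to the out-of-range iotag described in the commit message.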
2014-05-16 | CVE-2014-0196 | Mike Ranweiler | 1 | -0/+4
Pulled from 3.10.23 stable for bug 110340.

From abb5100737bba3f82b5514350fea89ca361ac66c Mon Sep 17 00:00:00 2001
From: Peter Hurley <peter@hurleysoftware.com>
Date: Sat, 3 May 2014 14:04:59 +0200
Subject: n_tty: Fix n_tty_write crash when echoing in raw mode

commit 4291086b1f081b869c6d79e5b7441633dc3ace00 upstream.

The tty atomic_write_lock does not provide an exclusion guarantee for the tty driver if the termios settings are LECHO & !OPOST. And since it is unexpected and not allowed to call TTY buffer helpers like tty_insert_flip_string concurrently, this may lead to crashes when concurrent writers call pty_write. In that case the following two writers:
* the ECHOing from a workqueue and
* pty_write from the process
race and can overflow the corresponding TTY buffer as follows.

If we look into tty_insert_flip_string_fixed_flag, there is:

  int space = __tty_buffer_request_room(port, goal, flags);
  struct tty_buffer *tb = port->buf.tail;
  ...
  memcpy(char_buf_ptr(tb, tb->used), chars, space);
  ...
  tb->used += space;

so the race of the two can result in something like this:

  A                                B
  __tty_buffer_request_room
                                   __tty_buffer_request_room
  memcpy(buf(tb->used), ...)
  tb->used += space;
                                   memcpy(buf(tb->used), ...) ->BOOM

B's memcpy is past the tty_buffer due to the previous A's tb->used increment.

Since the N_TTY line discipline input processing can output concurrently with a tty write, obtain the N_TTY ldisc output_lock to serialize echo output with normal tty writes. This ensures the tty buffer helper tty_insert_flip_string is not called concurrently and everything is fine.

Note that this is nicely reproducible by an ordinary user using forkpty and some setup around that (raw termios + ECHO). And it is present in kernels at least after commit d945cb9cce20ac7143c2de8d88b187f62db99bdc (pty: Rework the pty layer to use the normal buffering logic) in 2.6.31-rc3.

js: add more info to the commit log
js: switch to bool
js: lock unconditionally
js: lock only the tty->ops->write call

References: CVE-2014-0196
Reported-and-tested-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
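The serialization described in this commit can be sketched outside the kernel. The following is a simplified user-space analogy, not the n_tty code: `struct tbuf`, `BUF_SIZE` and `tbuf_write()` are hypothetical, and a pthread mutex stands in for the ldisc output_lock. The point is that the room calculation, the copy and the `used` update happen as one critical section, so a second writer cannot copy past the end of the buffer.

```c
#include <pthread.h>
#include <string.h>

#define BUF_SIZE 16                 /* hypothetical buffer capacity */

struct tbuf {                       /* hypothetical stand-in for a tty buffer */
    char data[BUF_SIZE];
    size_t used;
    pthread_mutex_t output_lock;    /* serializes all writers */
};

/*
 * Copy up to 'goal' bytes into the buffer and return how many fit.
 * Holding the lock across "compute room, memcpy, advance used" closes
 * the window in which another writer could advance 'used' in between.
 */
static size_t tbuf_write(struct tbuf *tb, const char *chars, size_t goal)
{
    size_t space;

    pthread_mutex_lock(&tb->output_lock);
    space = BUF_SIZE - tb->used;
    if (space > goal)
        space = goal;
    memcpy(tb->data + tb->used, chars, space);
    tb->used += space;
    pthread_mutex_unlock(&tb->output_lock);

    return space;
}
```

A caller that gets back fewer bytes than requested knows the buffer is full, and no writer ever observes a stale `used` value.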
2014-05-16 | CVE-2014-0155 | Mike Ranweiler | 1 | -1/+1
Pulled from 3.10.23 stable for bug 110340.

From a9ded882d5168e2fd5c0c20e2874f85c56016b4b Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Fri, 28 Mar 2014 20:41:50 +0100
Subject: KVM: ioapic: fix assignment of ioapic->rtc_status.pending_eoi (CVE-2014-0155)

commit 5678de3f15010b9022ee45673f33bcfc71d47b60 upstream.

QE reported that they got the BUG_ON in ioapic_service to trigger. I cannot reproduce it, but there are two reasons why this could happen.

The less likely but also easiest one, is when kvm_irq_delivery_to_apic does not deliver to any APIC and returns -1.

Because irqe.shorthand == 0, the kvm_for_each_vcpu loop in that function is never reached. However, you can target the similar loop in kvm_irq_delivery_to_apic_fast; just program a zero logical destination address into the IOAPIC, or an out-of-range physical destination address.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-05-15 | KVM: PPC: Book3S HV: Reduce default CMA pool size | Paul Mackerras | 1 | -2/+5
We have observed that on machines with all their memory in a single node, it is possible to hit an out of memory situation where kernel allocations (which can't use the CMA pool) fail, triggering the OOM killer, yet reclaim doesn't start because there is still free memory in the CMA pool. To alleviate this situation somewhat, this reduces the default CMA pool size from 5% to 3% of system memory. The 3% should still be enough in most situations, and if not, the user can specify a different amount on the kernel command line. This should help with BZ 110181. Signed-off-by: Paul Mackerras <paulus@samba.org>
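To get a feel for what the change from 5% to 3% means in absolute terms, here is a small arithmetic sketch. `cma_pool_bytes()` is a hypothetical helper, and the plain integer percentage is an assumption; how the kernel actually rounds or aligns the pool size is not shown in this log.

```c
#include <stdint.h>

/*
 * Hypothetical helper: size of a pool that takes 'percent' percent of
 * total RAM, using truncating integer arithmetic. Assumes the product
 * total_ram_bytes * percent fits in 64 bits.
 */
static uint64_t cma_pool_bytes(uint64_t total_ram_bytes, unsigned int percent)
{
    return total_ram_bytes * percent / 100;
}
```

For a machine with 64 GiB of RAM (68719476736 bytes) this gives 3435973836 bytes (about 3.2 GiB) at 5% and 2061584302 bytes (about 1.9 GiB) at 3%, so roughly 1.3 GiB more stays available to ordinary kernel allocations.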
2014-05-12 | powerpc/eeh: Dump PE location code | Gavin Shan | 4 | -11/+81
As Ben suggested, it's useful to dump the PE's location code for site engineers when hitting EEH errors. The patch introduces the function eeh_pe_loc_get() to retrieve the location code from the dev-tree so that we can output it when hitting EEH errors. If the primary PE bus is the root bus, the PHB's dev-node is tried prior to the root port's dev-node. Otherwise, the dev-node of the upstream bridge of the primary PE bus is checked for the location code directly. This fixes BZ 109585. Please apply to the next build for GA. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-05-08 | KVM: PPC: Book3S HV: Fix two bugs in dirty-page tracking | Paul Mackerras | 1 | -11/+40
The first bug is that we are testing the C (changed) bit in the hashed page table without first doing a tlbie. The architecture allows the update of the C bit to happen at any time up until we do a tlbie for the page. However, we don't want to do a tlbie for every page on every pass of a migration operation. Thus we do the tlbie if there are no vcpus currently running, which would indicate the final phase of migration. If any vcpus are running then reading the dirty log is already racy because pages could get dirtied immediately after we check them. Also, we don't need to do the tlbie if the HPT entry doesn't allow writing, since in that case the C bit can not get set. The second bug is that in the case where we see a dirty 16MB page followed by a dirty 4kB page (both mapping to the same guest real address), we return 1 rather than 16MB / PAGE_SIZE. The return value, indicating the number of dirty pages, needs to reflect the largest dirty page we come across, not the last dirty page we see. Fixes: 109551 (this time for sure) Signed-off-by: Paul Mackerras <paulus@samba.org>
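The second bug (returning the size of the last dirty mapping instead of the largest) comes down to keeping a running maximum. The sketch below is illustrative only: `dirty_npages()` and `PAGE_SIZE_BYTES` are hypothetical, a 4 kB system page is assumed, and each array element represents the size in bytes of one dirty mapping of the same guest real address (0 meaning clean).

```c
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096UL      /* assumed system page size (4 kB) */

/*
 * Return the number of system pages that must be marked dirty: the
 * largest dirty mapping seen, not the one that happened to come last.
 */
static unsigned long dirty_npages(const unsigned long *dirty_sizes, size_t n)
{
    unsigned long npages = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned long cur = dirty_sizes[i] / PAGE_SIZE_BYTES;

        if (cur > npages)           /* keep the maximum, do not overwrite */
            npages = cur;
    }
    return npages;
}
```

With a dirty 16 MB mapping followed by a dirty 4 kB mapping, this returns 16 MB / 4 kB = 4096 rather than 1, regardless of the order in which the two are seen.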
2014-05-04 | PPC: KVM: fix dirty map for hugepages | Paul Mackerras | 1 | -4/+7
The dirty map has one bit per system page (4K/64K), and when we populate the dirty map, we reset the Change bit in the HPT, which is expected to contain pages smaller than or equal to the system page size. This works until we start using huge pages (16MB). In this case, we mark just a single system page dirty and miss the rest of the 16MB page, which may be dirty as well. This changes kvm_test_clear_dirty to return the actual number of pages, which is calculated from the HPT entry. This changes kvmppc_hv_get_dirty_log() to make pages dirty starting from the rounded guest physical page number. [paulus@samba.org - don't advance i in the loop to set dirty bits, so that we make sure to clear C in all HPTEs.] Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-05-04 | powerpc/kvm: Don't try to allocate from kernel page allocator for hash page table. | Aneesh Kumar K.V | 1 | -17/+6
We reserve 5% of total ram for CMA allocation and not using that can result in us running out of numa node memory with specific configuration. One caveat is we may not have node local hpt with pinned vcpu configuration. But currently libvirt also pins the vcpu to cpuset after creating hash page table. Reviewed-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2014-05-04 | powerpc/powernv: Don't escalate non-existing frozen PE | Gavin Shan | 1 | -6/+6
Commit 63fa7d4 ("powerpc/eeh: Escalate error on non-existing PE") escalates the frozen state on a non-existing PE to a fenced PHB. It was meant to improve kdump reliability. After that, commit 716a0e8 ("powrpc/powernv: Reset PHB in kdump kernel") was introduced to apply a complete reset on all PHBs to increase kdump reliability. Commit 63fa7d4 is therefore no longer useful, and issuing a PHB reset on a PHB that is not fenced at the hardware level would cause unexpected problems. So I'd like to revert it. This is in response to bug#109562. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
2014-05-04 | powerpc/eeh: Report frozen parent PE prior to child PE | Gavin Shan | 2 | -5/+52
When we have the corner case of a frozen parent and child PE at the same time, we have to handle the frozen parent PE prior to the child. Without clearing the frozen state on the parent PE, the child PE can't be recovered successfully. There are 2 ways (polling and interrupt) for a frozen PE to be reported. If there is a frozen parent PE, we have to report and handle it first. This is in response to bug#109562. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
2014-05-04 | powerpc/eeh: Clear frozen state for child PE | Gavin Shan | 1 | -4/+16
Since commit cb523e09 ("powerpc/eeh: Avoid I/O access during PE reset"), the PE is kept in the frozen state at the hardware level until the PE reset is completely done. After that, we explicitly clear the frozen state of the affected PE. However, there might be frozen child PEs of the affected PE and we need to clear their frozen state as well. Otherwise, the recovery is going to fail. This is in response to bug#109562. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
2014-04-29 | powerpc/book3s: Improve machine check delivery to guest. | Mahesh Salgaonkar | 2 | -11/+23
Currently we forward MCEs to guest which have been recovered by guest. And for unhandled errors we do not deliver the MCE to guest. It looks like with no support of FWNMI in qemu, the guest just panics whenever we deliver the recovered MCEs to it. Also, the existing code used to return to the host for unhandled errors, which was causing the guest to hang with soft lockups inside the guest and made it difficult to recover the guest instance. This patch now forwards all fatal MCEs to the guest, causing the guest to crash/panic. And for recovered errors we just go back to normal functioning of the guest instead of returning to the host. This fixes soft lockup issues in the guest. This patch also fixes an issue where guest MCE events were not logged to the host console. This patch fixes bz108165 and bz108413. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
2014-04-29 | powerpc/powernv: Protect split-core operations from CPU hotplug | Srivatsa S. Bhat | 1 | -0/+4
During split-core operations, one of the online CPUs is nominated as the "master" and then stop_machine() is invoked to perform the split/unsplit procedure. Between these 2 steps, if CPU hotplug occurs and takes the just nominated "master" CPU offline, then the split/unsplit procedure does not complete properly and leads to undesirable effects. So protect the entire split-core operation with get/put_online_cpus() to synchronize with CPU hotplug. Fixes bz 105509. Acked-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
2014-04-29 | powerpc: Remove timebase resync during split-core operations on DD2.1 chips | Alistair Popple | 1 | -6/+9
The hardware manages the resync during split-core operations, on newer revisions (DD2.1 and higher). So we don't need to call opal_resync_timebase() on those systems. Fixes bz 105856. [Srivatsa: Added changelog] Signed-off-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
2014-04-24 | powerpc/book3s: Increment the mce counter during machine_check_early call. | Mahesh Salgaonkar | 1 | -0/+2
We don't see the MCE counter getting increased in /proc/interrupts, which gives the false impression that no MCE occurred even when there were MCE events. The machine check early handling was added for PowerKVM and we failed to increment the MCE count in the early handler. We also increment the mce counter in the machine_check_exception call, but in most cases where we handle the error the hypervisor never reaches there unless it's fatal and we want to crash. Only in a fatal situation may we see a double increment of the mce count. We need to fix that. But for now it is always good to have some count increased instead of zero. This fixes the MCE count issue mentioned in bz108413. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
2014-04-22 | powerpc/powernv: Add missing sysfs_attr_init() | Benjamin Herrenschmidt | 1 | -0/+1
Without this, we get lockdep errors Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-22 | powerpc/book3s: Add stack overflow check in machine check handler. | Mahesh Salgaonkar | 1 | -4/+20
Currently the machine check handler does not check for stack overflow on nested machine checks. If we repeatedly hit another MCE from the same address while inside the machine check handler, we risk a stack overflow, which can cause huge memory corruption. This patch limits the nested MCE level to 4 and panics when we cross level 4. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-22 | powerpc/powernv: Check sysparam size before creation | Joel Stanley | 1 | -0/+5
The size of the sysparam sysfs files is determined from the device tree at boot. However, the buffer is hard coded to 64 bytes. If we encounter a parameter that is larger than 64 bytes, or misparse the device tree, the buffer will overflow when reading or writing the parameter. Check it at discovery time, and if the parameter is too large, do not create a sysfs entry for it. Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-22 | powerpc/powernv: Fix typos in sysparam code | Joel Stanley | 1 | -2/+2
Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-22 | powerpc/powernv: Check sysfs size before copying | Joel Stanley | 1 | -0/+4
The sysparam code currently uses the userspace-supplied number of bytes when memcpy()ing into a local 64-byte buffer. Limit the maximum number of bytes by the size of the buffer. Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
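A minimal sketch of the clamp described above, written as ordinary user-space C. It is not the sysparam code: `MAX_PARAM_DATA_LEN`, `param_data_buf` and `param_store()` are hypothetical names, and only the 64-byte buffer size comes from the commit message.

```c
#include <string.h>

#define MAX_PARAM_DATA_LEN 64       /* buffer size from the commit message; name is hypothetical */

static char param_data_buf[MAX_PARAM_DATA_LEN];

/*
 * Copy at most MAX_PARAM_DATA_LEN bytes of caller-supplied data into
 * the fixed buffer and return the number of bytes actually stored.
 */
static size_t param_store(const char *buf, size_t count)
{
    if (count > MAX_PARAM_DATA_LEN)
        count = MAX_PARAM_DATA_LEN; /* never trust the caller's length */

    memcpy(param_data_buf, buf, count);
    return count;
}
```

A 128-byte write is truncated to 64 bytes instead of running past the end of the buffer; whether the real callback truncates silently or reports an error is not stated in this log.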
2014-04-22 | powerpc/powernv: Use ssize_t for sysparam return values | Joel Stanley | 1 | -5/+6
The OPAL calls return int64_t values, which the sysparam code stores in an int, and the sysfs callback returns ssize_t. Make the code easier to read by consistently using ssize_t. Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-22 | powerpc/powernv: Fix sysparam sysfs error handling | Joel Stanley | 1 | -2/+5
When a sysparam query in OPAL returned a negative value (error code), sysfs would spew out a decent chunk of memory; almost 64K more than expected. This was traced to a signed/unsigned mix-up in the OPAL sysparam sysfs code at sys_param_show. The return value of sys_param_show is an ssize_t, calculated using

  return ret ? ret : attr->param_size;

Alan Modra explains: "attr->param_size" is an unsigned int, "ret" an int, so the overall expression has type unsigned int. Result is that ret is cast to unsigned int before being cast to ssize_t. Instead of using the ternary operator, set ret to the param_size if an error is not detected. The same bug exists in the sysfs write callback; this patch fixes it in the same way. A note on debugging this next time: on my system gcc will warn about this if compiled with -Wsign-compare, which is not enabled by -Wall, only -Wextra. Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
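The conversion Alan Modra describes can be reproduced in a few lines of standalone C. This is a sketch, not the sysparam source: `struct param_attr`, `show_buggy()` and `show_fixed()` are hypothetical, and the observed values assume a platform where ssize_t is wider than int (such as 64-bit Linux).

```c
#include <sys/types.h>              /* ssize_t */

struct param_attr {                 /* hypothetical, mirrors the field named in the log */
    unsigned int param_size;
};

/*
 * Buggy form: the two arms of ?: are int and unsigned int, so the usual
 * arithmetic conversions make the whole expression unsigned int. A
 * negative error code becomes a large positive number before it is
 * widened to ssize_t.
 */
static ssize_t show_buggy(int ret, const struct param_attr *attr)
{
    return ret ? ret : attr->param_size;
}

/*
 * Fixed form: keep the error code in a signed variable and substitute
 * the size only on success, so no unsigned conversion touches it.
 */
static ssize_t show_fixed(int ret, const struct param_attr *attr)
{
    ssize_t out = ret;

    if (!out)
        out = attr->param_size;
    return out;
}
```

With `ret = -5`, `show_fixed()` returns -5, while `show_buggy()` returns 4294967291 when ssize_t is 64 bits wide, so the caller treats the error as a huge byte count.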
2014-04-22 | tick-broadcast/cpuidle: Fix the programming of the broadcast hrtimer | Benjamin Herrenschmidt | 1 | -2/+1
Today CPUs in fast sleep are being woken up to handle their timers by the tick broadcast framework using a hrtimer queued on a nominated broadcast CPU. The hrtimer is programmed for the earlier of the next wakeup and a broadcast period, which happens to be a jiffy. This programming is being done incorrectly today. The current time is noted, the tick broadcast interrupt handler is called, then the time at which the hrtimer needs to be programmed is decided. By then the noted current time is stale and the hrtimer is forwarded much further ahead than required, leading to delayed broadcast interrupts being delivered to sleeping cpus. Fix this by noting the current time just before programming the hrtimer. Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
2014-04-22 | powerpc/book3s: Improve machine check handling for unhandled errors | Mahesh Salgaonkar | 1 | -3/+37
The current code does not check for unhandled/unrecovered errors and returns from the interrupt if it is a recoverable exception, which in turn triggers the same machine check exception in a loop, causing the hypervisor to become unresponsive. This patch fixes this situation and forces the hypervisor to panic for unhandled/unrecovered errors.

This patch also fixes another issue where the unrecoverable_exception routine was called in real mode in case of an unrecoverable exception (MSR_RI = 0). This causes another exception, vector 0x300 (data access), during the system crash, leading to confusion while debugging the cause of the crash.

With the above fixes we now throw correct console messages (see below) while crashing the system in case of unhandled/unrecoverable machine checks.

--------------
Severe Machine check interrupt [[Not recovered]
  Initiator: CPU
  Error type: UE [Instruction fetch]
    Effective address: 0000000030002864
Oops: Machine check, sig: 7 [#1]
SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: bork(O) bridge stp llc kvm [last unloaded: bork]
CPU: 36 PID: 55162 Comm: bash Tainted: G O 3.14.0mce #1
task: c000002d72d022d0 ti: c000000007ec0000 task.ti: c000002d72de4000
NIP: 0000000030002864 LR: 00000000300151a4 CTR: 000000003001518c
REGS: c000000007ec3d80 TRAP: 0200 Tainted: G O (3.14.0mce)
MSR: 9000000000041002 <SF,HV,ME,RI> CR: 28222848 XER: 20000000
CFAR: 0000000030002838 DAR: d0000000004d0000 DSISR: 00000000 SOFTE: 1
GPR00: 000000003001512c 0000000031f92cb0 0000000030078af0 0000000030002864
GPR04: d0000000004d0000 0000000000000000 0000000030002864 ffffffffffffffc9
GPR08: 0000000000000024 0000000030008af0 000000000000002c c00000000150e728
GPR12: 9000000000041002 0000000031f90000 0000000010142550 0000000040000000
GPR16: 0000000010143cdc 0000000000000000 00000000101306fc 00000000101424dc
GPR20: 00000000101424e0 000000001013c6f0 0000000000000000 0000000000000000
GPR24: 0000000010143ce0 00000000100f6440 c000002d72de7e00 c000002d72860250
GPR28: c000002d72860240 c000002d72ac0038 0000000000000008 0000000000040000
NIP [0000000030002864] 0x30002864
LR [00000000300151a4] 0x300151a4
Call Trace:
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
---[ end trace 7285f0beac1e29d3 ]---
Sending IPI to other CPUs
IPI complete
OPAL V3 detected !
--------------

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-22 | tick-broadcast/cpuidle: Fix the demuxing of PPC_MSG_TIMER IPI message | Benjamin Herrenschmidt | 1 | -1/+1
The PPC_MSG_TIMER IPI message slot was introduced for the tick broadcast IPIs which are required to wake up sleeping CPUs. The decrementer of the CPUs that enter fast sleep stops as a consequence of entering the idle state. Therefore such CPUs have to be woken up in time to handle their timers by a broadcast CPU which sends the PPC_MSG_TIMER IPIs to them. This IPI message is being parsed wrongly in smp_ipi_demux(). Thus the tick broadcast interrupt handler is never executed on the sleeping CPU. This could have led to unpleasant side effects like not handling timers in time on the sleeping cpus. But since the sleeping CPUs still receive the tick broadcast IPI, they are awoken from the idle state and their decrementers are back in action. As a result, it's possible that they are managing to handle timers before they go to sleep again. Hence timers are being handled on the sleeping cpus although the tick broadcast interrupt handler, which is actually supposed to ensure that, is never being called today due to the wrong number of shift bits while parsing the tick broadcast IPI. However, we need to note that as a result of this discrepancy, timer handling on the sleeping cpus may be unstable. This could be one of the reasons we are observing some softlockups in the cpuidle wakeup path. Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-16 | vfio-pci: Use pci "try" reset interface | Alex Williamson | 1 | -20/+9
PCI resets will attempt to take the device_lock for any device to be reset. This is a problem if that lock is already held, for instance in the device remove path. It's not sufficient to simply kill the user process or skip the reset if called after .remove as a race could result in the same deadlock. Instead, we handle all resets as "best effort" using the PCI "try" reset interfaces. This prevents the user from being able to induce a deadlock by triggering a reset. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> (cherry picked from commit 890ed578df82f5b7b5a874f9f2fa4f117305df5f) Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com> LTC-Bugzilla: #104951
2014-04-16 | PCI: Add pci_try_reset_function(), pci_try_reset_slot(), pci_try_reset_bus() | Alex Williamson | 2 | -0/+158
When doing a function/slot/bus reset PCI grabs the device_lock for each device to block things like suspend and driver probes, but call paths exist where this lock may already be held. This creates an opportunity for deadlock. For instance, vfio allows userspace to issue resets so long as it owns the device(s). If a driver unbind .remove callback races with userspace issuing a reset, we have a deadlock as userspace gets stuck waiting on device_lock while another thread has device_lock and waits for .remove to complete. To resolve this, we can make a version of the reset interfaces which use trylock. With this, we can safely attempt a reset and return error to userspace if there is contention.

[bhelgaas: the deadlock happens when A (userspace) has a file descriptor for the device, and B waits in this path:

  driver_detach
    device_lock                     # take device_lock
    __device_release_driver
      pci_device_remove             # pci_bus_type.remove
        vfio_pci_remove             # pci_driver .remove
          vfio_del_group_dev
            wait_event(vfio.release_q, !vfio_dev_present)   # wait (holding device_lock)

Now B is stuck until A gives up the file descriptor. If A tries to acquire device_lock for any reason, we deadlock because A is waiting for B to release the lock, and B is waiting for A to release the file descriptor.]

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
(cherry picked from commit 61cf16d8bd38c3dc52033ea75d5b1f8368514a17)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
LTC-Bugzilla: #104951
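The "try" variant of a reset can be sketched with a pthread mutex standing in for device_lock. This is a user-space analogy under stated assumptions, not the PCI core code: `struct fake_device` and `try_reset()` are hypothetical, and returning -EAGAIN on contention is this sketch's own choice of error code.

```c
#include <errno.h>
#include <pthread.h>

struct fake_device {                /* hypothetical stand-in for a device */
    pthread_mutex_t lock;           /* plays the role of device_lock */
    int reset_count;
};

/*
 * Best-effort reset: take the lock only if it is free. If another path
 * already holds it (for example a remove path waiting on the caller),
 * report contention instead of blocking, which is what avoids the
 * deadlock described above.
 */
static int try_reset(struct fake_device *dev)
{
    if (pthread_mutex_trylock(&dev->lock) != 0)
        return -EAGAIN;             /* contended: caller may retry later */

    dev->reset_count++;             /* placeholder for the actual reset work */
    pthread_mutex_unlock(&dev->lock);
    return 0;
}
```

The design trade-off is that a reset can now fail spuriously under contention, so callers have to treat it as best effort, as the vfio-pci commit in this log does.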
2014-04-16 | powerpc/eeh: Can't recover from non-PE-reset case | Gavin Shan | 1 | -3/+9
When PCI_ERS_RESULT_CAN_RECOVER is returned by device drivers, the EEH core should enable I/O and DMA for the affected PE. However, enabling DMA was missed in eeh_handle_normal_event(). Besides, the frozen state of the affected PE should be cleared after successful recovery, but it wasn't. The patch fixes both issues. This is in response to bug#105179. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
2014-04-16 | PCI: Export MSI message relevant functions | Gavin Shan | 1 | -0/+2
As pointed out by Alexey, we're going to hit a build failure without exporting the functions when (CONFIG_VFIO_PCI == M). It should be part of commit 9762b50 ("drivers/vfio/pci: Fix MSIx message lost"). Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
2014-04-16 | KVM: PPC: Book3S PR: Handle facility-unavailable interrupts gracefully | Paul Mackerras | 2 | -0/+6
At present, if a PR guest on a POWER8 machine tries to access some disabled functionality such as transactional memory, the result is a facility-unavailable interrupt, which isn't handled in kvmppc_handle_exit_pr(), resulting in a call to BUG(), crashing the PR host kernel. This adds code to handle the facility-unavailable interrupts and give the guest an illegal instruction interrupt, instead of crashing the PR host. Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-04-16 | KVM: PPC: Book3S PR: Implement ARCH_COMPAT register | Paul Mackerras | 2 | -0/+42
This provides basic support for the KVM_REG_PPC_ARCH_COMPAT register in PR KVM. At present the value is sanity-checked when set, but doesn't actually affect anything yet. Implementing this makes it possible to use a qemu command-line argument such as "-cpu host,compat=power7" on a POWER8 machine, just as we would with HV KVM. Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-04-16 | KVM: PPC: Book3S PR: Unimplemented SPRs in supervisor mode don't cause trap | Paul Mackerras | 1 | -10/+15
The Power ISA states that an mtspr or mfspr to/from an unimplemented SPR should be a no-op in privileged mode, rather than causing a program interrupt (0x700 vector), with the exception of mtspr to SPR 0 and mfspr from SPRs 0, 4, 5 or 6. Currently our SPR emulation code doesn't follow this rule. This modifies the code in kvmppc_core_emulate_m[ft]spr_pr() to check the PR bit in the MSR when we detect an unknown SPR number, and only return EMULATE_FAIL (which results in a program interrupt) if PR is 0 or the SPR number is one of the ones which are specifically defined to cause a program interrupt. Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-04-14tick, broadcast:Keep the cpu_online_mask and broadcast masks in sync with ↵Preeti U Murthy1-1/+1
each other Its possible that the tick_broadcast_force_mask contains cpus which are not in cpu_online_mask when a broadcast tick occurs. This could happen under the following circumstance assuming CPU1 is among the CPUs waiting for broadcast and the cpu being hotplugged out. CPU0 CPU1 Run CPU_DOWN_PREPARE notifiers Start stop_machine Gets woken up by IPI to run stop_machine, sets itself in tick_broadcast_force_mask if the time of broadcast interrupt is around the same time as this IPI. Start stop_machine set_cpu_online(cpu1, false) End stop_machine End stop_machine Broadcast interrupt Finds that cpu1 in tick_broadcast_force_mask is offline and triggers the WARN_ON in tick_handle_oneshot_broadcast() Clears all broadcast masks in CPU_DEAD stage. While the hotplugged cpu clears its bit in the tick_broadcast_oneshot_mask and tick_broadcast_pending mask during BROADCAST_EXIT, it *sets* its bit in the tick_broadcast_force_mask if the broadcast interrupt is found to be around the same time as the present time. Today we clear all the broadcast masks and shutdown tick devices in the CPU_DEAD stage. But as shown above the broadcast interrupt could occur before this stage is reached and the WARN_ON() gets triggered when it is found that the tick_broadcast_force_mask contains an offline cpu. Please note that a scenario such as above will occur *only if the broadcast interrupt is delayed under some circumstance*. Ideally the broadcast interrupt in the above scenario should have occured before we reach the irq_disabled stage of stop_machine and should have seen a valid broadcast mask. But for some reason that is yet to be understood it is getting delayed leading to the above scenario. 
Besides this, another point to note is that for a small duration between the CPU_DYING stage, where the hotplugged cpu clears its bit from the cpu_online_mask, and the CPU_DEAD stage, where the broadcast_force_mask gets cleared of the same, these two masks are out of sync with each other, thus triggering the above scenario. The temporary solution to this is to move the clearing of broadcast masks to the CPU_DYING notification stage. The reason is, it is during this stage that the hotplugged cpu clears itself from the cpu_online_mask and runs notifications relevant to this stage including those to clear the broadcast masks (with this patch). All this happens while the rest of the cpus are too busy spinning in stop_machine to notice this change. By the time this stage ends and all cpus resume work, the hotplugged cpu would have cleared itself from the cpu_online_mask and the broadcast cpu mask, thus keeping them in sync with each other at such times when the rest of the cpus can read these masks. Since the above mentioned delay in the broadcast interrupt has not triggered any soft lockups so far, we are assuming it's a non-fatal issue and have this patch to prevent the warning from popping up in this case. Suggested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@au1.ibm.com>
2014-04-14powerpc/mm: Don't update page->_mapcount for hugetlb tail pagesAneesh Kumar K.V1-10/+0
The changes to increment _mapcount were added w.r.t. THP change 3526741f0964c88bc2ce511e1078359052bf225b. Later this was fixed to handle the hugetlb case in 44518d2b32646e37b4b7a0813bbbe98dc21c7f8f. Instead of backporting 44518, we can remove the _mapcount update since we don't support THP for kvm host yet. Fixes: bz# 108558 Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
2014-04-14powrpc/powernv: Reset PHB in kdump kernelGavin Shan3-4/+24
In the kdump scenario, the first kernel doesn't shut down PCI devices and the kdump kernel cleans the PHB IODA table at early probe time. That means the kdump kernel can't support PCI transactions piled up by the first kernel. Otherwise, lots of EEH errors and frozen PEs will be detected. In order to avoid the EEH errors, the PHB is reset to drop all PCI transactions from the first kernel. It looks good on P7, but needs to be verified on P8. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-14powerpc/pci: Mask linkDown on resetting PCI busGavin Shan6-10/+74
The problem was initially reported by Wendy, who tried to pass through an IPR adapter, which was connected to the PHB root port directly, to a KVM based guest. When doing that, pci_reset_bridge_secondary_bus() was called by the VFIO driver and linkDown was detected by the root port. That caused all PEs to be frozen. The patch fixes the issue by routing the reset for the secondary bus of the root port to the underlying firmware. For that, one more weak function pci_reset_secondary_bus() is introduced so that individual platforms can override it and do a platform-specific reset for the bridge's secondary bus. Reported-by: Wendy Xiong <wenxiong@linux.vnet.ibm.com> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-14powerpc/eeh: Make the delay for PE reset unifiedGavin Shan4-16/+30
Basically, we have 3 types of resets to fulfil PE reset: fundamental, hot and PHB reset. For the latter 2 cases, we need the PCI bus reset hold and settlement delays as specified by the PCI spec. PowerNV and pSeries platforms run on top of different firmware and some of the delays are already covered by the underlying firmware (PowerNV). The patch unifies the delays so that they are done in the backend, instead of the EEH core. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-14powerpc/powernv: Reset root port in firmwareGavin Shan1-7/+6
Resetting root port has more stuff to do than that for PCIe switch ports and we should have resetting root port done in firmware instead of the kernel itself. The problem was introduced by commit 5b2e198e ("powerpc/powernv: Rework EEH reset"). Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-14powerpc/pseries: Fix overwritten PE stateGavin Shan1-0/+1
In pseries_eeh_get_state(), EEH_STATE_UNAVAILABLE is always overwritten by EEH_STATE_NOT_SUPPORT because of the missed "break" there. The patch fixes the issue. Reported-by: Joe Perches <joe@perches.com> Cc: linux-stable <stable@vger.kernel.org> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
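The class of bug described above, a missing "break" that lets one switch case run on into the next, can be reproduced in a few lines. This is a sketch and not the body of pseries_eeh_get_state(): the firmware state codes 0 and 5 and the numeric values of the EEH_STATE_* names are assumptions made for illustration.

```c
/* Hypothetical values for illustration; the kernel's definitions differ. */
enum {
    EEH_STATE_NOT_SUPPORT = -1,
    EEH_STATE_UNAVAILABLE = 1,
    EEH_STATE_ACTIVE      = 2,
};

/* Buggy shape: the "unavailable" case runs on into the default case. */
int get_state_buggy(int fw_state)
{
    int result;

    switch (fw_state) {
    case 0:
        result = EEH_STATE_ACTIVE;
        break;
    case 5:
        result = EEH_STATE_UNAVAILABLE;
        /* falls through - the missing "break" is the bug */
    default:
        result = EEH_STATE_NOT_SUPPORT;
        break;
    }
    return result;
}

/* Fixed shape: every case ends with its own break. */
int get_state_fixed(int fw_state)
{
    int result;

    switch (fw_state) {
    case 0:
        result = EEH_STATE_ACTIVE;
        break;
    case 5:
        result = EEH_STATE_UNAVAILABLE;
        break;
    default:
        result = EEH_STATE_NOT_SUPPORT;
        break;
    }
    return result;
}
```

In the buggy variant the value assigned for state 5 is always overwritten, which matches the symptom in the commit message; the one-line fix accounts for the 1-0/+1 diffstat.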
2014-04-14powerpc/powernv: Fix endless reporting frozen PEGavin Shan1-0/+7
Once one specific PE has been marked as EEH_PE_ISOLATED, it's in the middle of recovery or has been removed permanently. We needn't report the frozen PE again. Otherwise, we would endlessly report the same frozen PE. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-14powerpc/eeh: No hotplug on permanently removed devGavin Shan6-20/+102
The issue was detected in a somewhat complicated test case where we have multiple hierarchical PEs, as shown in the following figure:

    +-----------------+
    | PE#3     p2p#0  |
    |          p2p#1  |
    +-----------------+
            |
    +-----------------+
    | PE#4     pdev#0 |
    |          pdev#1 |
    +-----------------+

PE#4 (which has 2 PCI devices) is the child of PE#3, which has 2 p2p bridges. We accidentally hit a less-known scenario: PE#4 was removed permanently from the system because of permanent failure (e.g. exceeding the max allowed failure times in last hour), then we detected EEH errors on PE#3 and tried to recover it. However, eeh_dev instances for pdev#0/1 were not detached from PE#4, which was still connected to PE#3. All of that was because we rely on count-based pcibios_release_device(), which isn't reliable enough. When doing recovery for PE#3, we still apply hotplug on PE#4 and pdev#0/1, which are not valid any more. Eventually, we run into a kernel crash. The patch fixes the above issue from two aspects. For unplug, we simply skip those permanently removed PEs, whose state is (EEH_PE_STATE_ISOLATED && !EEH_PE_STATE_RECOVERING) and whose frozen count is greater than EEH_MAX_ALLOWED_FREEZES. For plug, we mark all permanently removed EEH devices with EEH_DEV_REMOVED and return 0xFF's on reads of their PCI config space so that the PCI core will omit them. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
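The "permanently removed" test used for the unplug path can be written as a single predicate. The sketch below is a userspace model under stated assumptions: the struct, the flag bit positions and the limit of 5 are hypothetical stand-ins, and only the shape of the condition is taken from the commit message.

```c
#include <stdbool.h>

/* Hypothetical flag bits and limit; only their roles follow the text. */
#define PE_STATE_ISOLATED    (1u << 0)
#define PE_STATE_RECOVERING  (1u << 1)
#define MAX_ALLOWED_FREEZES  5

struct pe_model {
    unsigned int state;     /* bitmask of PE_STATE_* */
    int freeze_count;       /* number of freezes recorded for this PE */
};

/* True when the PE is isolated, not being recovered, and over the limit. */
bool pe_permanently_removed(const struct pe_model *pe)
{
    return (pe->state & PE_STATE_ISOLATED) &&
           !(pe->state & PE_STATE_RECOVERING) &&
           pe->freeze_count > MAX_ALLOWED_FREEZES;
}
```

A PE that is isolated but still recovering, or isolated with a freeze count within the limit, is not skipped, so ordinary recovery is unaffected.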
2014-04-14powerpc/eeh: Allow to disable EEHGavin Shan3-7/+70
The patch introduces bootarg "eeh=off" to disable EEH functionality. Also, it creates /sys/kernel/debug/powerpc/eeh_enable to disable or enable EEH functionality. By default, we have the functionality enabled. For the PowerNV platform, we fall back to the conventional mechanism of clearing the frozen PE during PCI config access if EEH functionality is disabled. Otherwise, we rely on EEH for error recovery. The patch also fixes the issue that ioda_eeh_event() did not cover the case of disabled EEH functionality. Those events driven by interrupt should be cleared to avoid endless reporting. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-14powerpc/eeh: Cleanup EEH subsystem variablesGavin Shan2-26/+34
There are 2 EEH subsystem variables: eeh_subsystem_enabled and eeh_probe_mode. We needn't maintain 2 variables; we can have just one variable and introduce different flags. The patch also introduces the additional flag EEH_FORCE_DISABLED, which will be used to disable the EEH subsystem via boot parameter ("eeh=off") in future. Besides, the patch also introduces flag EEH_ENABLED, which will be changed to disable or enable EEH functionality on the fly through a debugfs entry in future. With the patch applied, the criteria to check whether EEH functionality is enabled are changed to:

!EEH_FORCE_DISABLED && EEH_ENABLED : Enabled
Other cases                       : Disabled

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
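The enable criteria above amount to a two-flag check on a single word. A minimal sketch follows, assuming illustrative bit values and a hypothetical helper name; the real definitions live in the kernel headers and may differ.

```c
#include <stdbool.h>

/* Illustrative bit values; treat them as assumptions. */
#define EEH_ENABLED         0x1u
#define EEH_FORCE_DISABLED  0x2u

/* Enabled only when not force-disabled and the enable flag is set. */
bool eeh_is_enabled(unsigned int flags)
{
    if (flags & EEH_FORCE_DISABLED)
        return false;
    return (flags & EEH_ENABLED) != 0;
}
```

Keeping both facts in one flags word means a forced disable overrides a later runtime enable without a second variable to keep in sync.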
2014-04-14powerpc/eeh: Use cached capability for log dumpGavin Shan4-20/+53
When calling into eeh_gather_pci_data() on the pSeries platform, we possibly don't have a pci_dev instance yet, but eeh_dev is always ready. So we use the cached capability from eeh_dev instead of pci_dev for the log dump there. In order to keep things unified, we also cache PCI capability positions in eeh_dev for PowerNV. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-14powerpc/eeh: Cleanup eeh_gather_pci_data()Gavin Shan1-14/+12
The patch replaces printk(KERN_WARNING ...) with pr_warn() in the function eeh_gather_pci_data(). Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-14powerpc/eeh: Avoid I/O access during PE resetGavin Shan3-70/+99
We have suffered from recursive frozen PEs a lot, caused by IO accesses during PE reset. Ben came up with the good idea of keeping the PE frozen until recovery (BAR restore) is done. With that, IO accesses during PE reset are dropped by hardware and no longer incur recursive frozen PEs. The patch implements the idea. We don't clear the frozen state until PE reset is completely done. During that period, the EEH core expects an unfrozen state from the backend in order to keep going. So we reuse the EEH_PE_RESET flag, which is set during PE reset, to return a normal state from the backend. The side effect is that we clear the frozen state twice (once by PE reset and once explicitly), but that's harmless. We have some limitations on pHyp. pHyp doesn't allow enabling IO or DMA for an unfrozen PE. So we don't enable them on an unfrozen PE in eeh_pci_enable(). We have to enable IO before grabbing logs on pHyp. Otherwise, 0xFF's are always returned from PCI config space. Also, we had a wrong return value from eeh_pci_enable() for the EEH_OPT_THAW_DMA case. The patch fixes that too. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-14powerpc/powernv: Use EEH PCI config accessorsGavin Shan1-11/+12
For EEH PowerNV backends, they need to use their own PCI config accessors as the normal ones could be blocked during PE reset. The patch also removes the unnecessary parameter "hose" from the function ioda_eeh_bridge_reset(). Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-14powerpc/eeh: Block PCI-CFG access during PE resetGavin Shan4-54/+126
We've observed multiple PE reset failures because of PCI-CFG access during that period. Potentially, some device drivers don't support EEH very well and can't put the device into a motionless state before PE reset. So those device drivers might produce PCI-CFG accesses during PE reset. Also, we could have PCI-CFG access from user space (e.g. "lspci"). Since access to a frozen PE should return 0xFF's, we can block PCI-CFG access during the period of PE reset so that we won't get recursive EEH errors. The patch adds flag EEH_PE_RESET, which is kept during PE reset. The PowerNV/pSeries PCI-CFG accessors reuse the flag to block PCI-CFG accordingly. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
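The blocking behaviour can be modelled as an accessor that consults a reset flag before touching hardware. The sketch below is a userspace illustration only: the struct, the flag value and the function names are hypothetical, and hw_value stands in for whatever a real config read would have returned.

```c
#include <stdint.h>

#define PE_IN_RESET (1u << 0)   /* hypothetical stand-in for EEH_PE_RESET */

struct cfg_pe {
    unsigned int state;
};

/* All-ones pattern for a 1-, 2- or 4-byte config read. */
uint32_t cfg_all_ones(int size)
{
    return size >= 4 ? 0xFFFFFFFFu : (1u << (8 * size)) - 1u;
}

/*
 * Model of a config read accessor. hw_value stands in for what the
 * hardware would return if the access were allowed through.
 */
uint32_t cfg_read(const struct cfg_pe *pe, uint32_t hw_value, int size)
{
    if (pe->state & PE_IN_RESET)
        return cfg_all_ones(size);  /* blocked: never reaches hardware */
    return hw_value & cfg_all_ones(size);
}
```

Returning all ones mirrors what a read from a frozen PE yields anyway, so callers such as lspci see the same result they would get from a frozen device while the hardware is left alone during the reset.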
2014-04-14powerpc/eeh: EEH_PE_ISOLATED not reflect HW stateGavin Shan1-7/+3
When doing PE reset, EEH_PE_ISOLATED is cleared unconditionally. However, we should clear it only if the PE reset has cleared the frozen state successfully. Otherwise, the flag should be kept. The patch fixes the issue. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-14powerpc/powernv: Remove fields in PHB diag-data dumpGavin Shan1-51/+40
For some fields (e.g. LEM, MMIO, DMA) in the PHB diag-data dump, it's meaningless to print them if they have non-zero values in the corresponding mask registers because we always have non-zero values in the mask registers. The patch only prints those fields if we have non-zero values in the primary registers (e.g. LEM, MMIO, DMA status) so that we can save a couple of lines. The patch also removes the unnecessary spare line before "brdgCtl:" and the two leading spaces as prefix in each line, as Ben suggested. Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-04-11powerpc: Don't try to set LPCR unless we're in hypervisor modePaul Mackerras1-1/+2
Commit c7062d83fe7b ("powerpc/ppc64: Do not turn AIL (reloc-on interrupts) too early") added code to set the AIL bit in the LPCR without checking whether the kernel is running in hypervisor mode. The result is that when the kernel is running as a guest (i.e., under PowerKVM or PowerVM), the processor takes a privileged instruction interrupt at that point, causing a panic. The visible result is that the kernel hangs after printing "returning from prom_init". This fixes it by checking for hypervisor mode being available before setting LPCR. If we are not in hypervisor mode, we enable relocation-on interrupts later in pSeries_setup_arch using the H_SET_MODE hcall. This fixes BZ 108728. Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-04-10kvm: Clear the runlatch bit of a vcpu before nappingPreeti U Murthy1-1/+11
When the guest cedes the vcpu or the vcpu has no guest to run, it naps. Clear the runlatch bit of the vcpu before napping to indicate an idle cpu. Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
2014-04-10kvm: Set the runlatch bit of a CPU just before starting guestPreeti U Murthy1-0/+6
The secondary threads in the core have their runlatch bits cleared since they are offline. When the secondary threads are called in to start a guest their runlatch bits need to be set to indicate that they are busy. The primary thread has its runlatch bit set though, but there is no harm in setting this bit once again. Hence set the runlatch bit for all threads before they start guest. Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
2014-04-10kvm: Set the runlatch bits correctly for offline cpusPreeti U Murthy1-0/+3
Up until now we have been setting the runlatch bits for a busy CPU and clearing it when a CPU enters idle state. The runlatch bit has thus been consistent with the utilization of a CPU as long as the CPU is online. However when a CPU is hotplugged out the runlatch bit is not cleared. It needs to be cleared to indicate an unused CPU. OCC consumes the runlatch bit to decide the utilization of a thread and ends up seeing the offline threads as busy. Hence this patch has the runlatch bit cleared for an offline CPU just before entering an idle state and sets it immediately after it exits the idle state. Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
2014-04-09enclosure: fix WARN_ON in dual path device removingwenxiong@linux.vnet.ibm.com1-1/+2
The issue happens in a dual controller configuration. We got sysfs warnings when removing the ipr module with rmmod. enclosure_unregister() in drivers/misc/enclosure.c calls device_unregister() for each component device: device_unregister() -> device_del() -> kobject_del() -> sysfs_remove_dir(). sysfs_remove_dir() sets kobj->sd = NULL. Then, for each component device, enclosure_component_release() -> enclosure_remove_links() -> sysfs_remove_link() checks kobj->sd again, which has already been set to NULL by device_unregister(). So we saw all these sysfs warnings. sysfs: can not remove 'enclosure_device: P1-D1 2SS6', no directory ------------[ cut here ]------------ WARNING: at fs/sysfs/inode.c:325 Modules linked in: fuse loop dm_mod ses enclosure ipr(-) ipv6 ibmveth libata sg ext3 jbd mbcache sd_mod crc_t10dif crct10dif_common ibmvscsi scsi_transport_srp scsi_tgt scsi_dh_rdac scsi_dh_emc scsi_dh_hp_sw scsi_dh_alua scsi_dh scsi_mod CPU: 0 PID: 4006 Comm: rmmod Not tainted 3.12.0-scsi-0.11-ppc64 #1 task: c0000000f769aba0 ti: c0000000f8f9c000 task.ti: c0000000f8f9c000 NIP: c0000000002b038c LR: c0000000002b0388 CTR: 0000000000000000 REGS: c0000000f8f9ee70 TRAP: 0700 Not tainted (3.12.0-scsi-0.11-ppc64) MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 28008444 XER: 20000000 SOFTE: 1 CFAR: c000000000736118 GPR00: c0000000002b0388 c0000000f8f9f0f0 c0000000010ed630 0000000000000047 GPR04: c000000001502628 c000000001513010 0000000000000689 652027656e636c6f GPR08: 737572655f646576 c000000000ae2b7c 0000000000a20000 c000000000add630 GPR12: 0000000028008442 c000000007f20000 0000000000000000 0000000010146920 GPR16: 00000000100cb9d8 0000000010093088 0000000010146920 0000000000000000 GPR20: 0000000000000000 0000000010161900 00000000100ce458 0000000000000000 GPR24: 0000000010161940 0000000000000000 d0000000046ad440 0000000000000000 GPR28: c0000000f8f9f270 0000000000000000 c0000000fcb882c8 0000000000000000 NIP [c0000000002b038c] .sysfs_hash_and_remove+0xe4/0xf0 LR [c0000000002b0388] 
.sysfs_hash_and_remove+0xe0/0xf0 Call Trace: [c0000000f8f9f0f0] [c0000000002b0388] .sysfs_hash_and_remove+0xe0/0xf0 (unreliable) [c0000000f8f9f190] [c0000000002b4134] .sysfs_remove_link+0x24/0x60 [c0000000f8f9f200] [d000000004df037c] .enclosure_remove_links+0x64/0xa0 [enclosure] [c0000000f8f9f2d0] [d000000004df0518] .enclosure_component_release+0x30/0x60 [enclosure] [c0000000f8f9f350] [c000000000540068] .device_release+0x50/0xd8 [c0000000f8f9f3d0] [c0000000003b6f80] .kobject_cleanup+0xb8/0x230 [c0000000f8f9f460] [c00000000053f404] .put_device+0x1c/0x30 [c0000000f8f9f4d0] [d000000004df0db0] .enclosure_unregister+0xa0/0xe8 [enclosure] [c0000000f8f9f560] [d000000004f90094] .ses_intf_remove_enclosure+0x8c/0xa8 [ses] [c0000000f8f9f5f0] [c0000000005413ec] .device_del+0xf4/0x268 [c0000000f8f9f680] [c000000000541594] .device_unregister+0x34/0x88 [c0000000f8f9f700] [d000000001423d3c] .__scsi_remove_device+0x104/0x128 [scsi_mod] [c0000000f8f9f780] [d00000000141eff8] .scsi_forget_host+0x70/0xa0 [scsi_mod] [c0000000f8f9f800] [d000000001413dc0] .scsi_remove_host+0x88/0x178 [scsi_mod] [c0000000f8f9f890] [d00000000469db5c] .ipr_remove+0x7c/0xf8 [ipr] [c0000000f8f9f920] [c0000000003fe1f4] .pci_device_remove+0x64/0xf0 [c0000000f8f9f9b0] [c000000000544f10] .__device_release_driver+0xd0/0x158 [c0000000f8f9fa40] [c0000000005450d8] .driver_detach+0x140/0x148 [c0000000f8f9fae0] [c000000000543848] .bus_remove_driver+0xe0/0x188 [c0000000f8f9fb70] [c00000000054628c] .driver_unregister+0x3c/0x80 [c0000000f8f9fbf0] [c0000000003fe35c] .pci_unregister_driver+0x34/0xe8 [c0000000f8f9fc90] [d0000000046a5fb4] .ipr_exit+0x2c/0x44 [ipr] [c0000000f8f9fd20] [c0000000001359dc] .SyS_delete_module+0x204/0x308 [c0000000f8f9fe30] [c000000000009f60] syscall_exit+0x0/0xa0 Instruction dump: e8010010 eb81ffe0 7c0803a6 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 3c62ff8a 7ca42b78 3863c388 48485d45 60000000 <0fe00000> 3860fffe 4bffff94 fba1ffe8 o Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
2014-04-09md/raid5: Fix CPU hotplug callback registrationOleg Nesterov1-46/+44
commit 789b5e0315284463617e106baad360cb9e8db3ac upstream.

Subsystems that want to register CPU hotplug callbacks, as well as perform initialization for the CPUs that are already online, often do it as shown below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently with CPU hotplug operations).

Interestingly, the raid5 code can actually prevent double initialization and hence can use the following simplified form of callback registration:

	register_cpu_notifier(&foobar_cpu_notifier);

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	put_online_cpus();

A hotplug operation that occurs between registering the notifier and calling get_online_cpus(), won't disrupt anything, because the code takes care to perform the memory allocations only once. So reorganize the code in raid5 this way to fix the deadlock with callback registration. This fixes BZ 103213.

Cc: linux-raid@vger.kernel.org Fixes: 36d1c6476be51101778882897b315bd928c8c7b5 Signed-off-by: Oleg Nesterov <oleg@redhat.com> [Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the free_scratch_buffer() helper to condense code further and wrote the changelog.] Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
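Why the simplified ordering is safe when per-CPU setup is idempotent can be shown with a single-threaded userspace model. Everything here is an assumption made for illustration: the arrays, function names and NR_CPUS value are hypothetical, and the model does not attempt to reproduce the locks involved in the ABBA deadlock, only the "initialize at most once" property the message relies on.

```c
#include <stdbool.h>

#define NR_CPUS 4

bool cpu_online[NR_CPUS];
bool notifier_registered;
bool scratch_ready[NR_CPUS];
int  alloc_count[NR_CPUS];  /* how many times each CPU's buffer was allocated */

/* Idempotent per-CPU setup: a second call for the same CPU does nothing. */
void init_cpu(int cpu)
{
    if (scratch_ready[cpu])
        return;
    scratch_ready[cpu] = true;
    alloc_count[cpu]++;
}

/* Stand-in for a hotplug "online" event reaching the notifier. */
void hotplug_cpu_up(int cpu)
{
    cpu_online[cpu] = true;
    if (notifier_registered)
        init_cpu(cpu);
}

/* Step 1 of the simplified form: register the notifier first. */
void register_notifier(void)
{
    notifier_registered = true;
}

/* Step 2: initialize whatever is online by now. */
void init_online_cpus(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu_online[cpu])
            init_cpu(cpu);
}
```

If a CPU comes online between register_notifier() and init_online_cpus(), it is initialized by the notifier and then skipped by the loop, so each online CPU ends up with exactly one allocation.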
2014-04-09md/raid5: Revert partial fix of CPU hotplug callback registrationSrivatsa S. Bhat1-34/+39
Commit 2775d6230 (md: Avoid deadlock in raid5_alloc_percpu) only partially fixed the deadlock involving CPU hotplug notifiers. In particular, it fixed the deadlock possibility in register_cpu_notifier(), but left the deadlock in unregister_cpu_notifier() unfixed. So revert this commit so that we can fix both the deadlocks properly, using the solution that was accepted upstream. Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
2014-04-03KVM: PPC: VFIO: Fix compile error introduced by a typoSrivatsa S. Bhat1-1/+1
kvm_vfio_spapr_tce_release was spelled as ikvm_vfio_ispapr_tce_release which caused compilation to break in case of CONFIG_KVM_VFIO=n. Fix it. Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
2014-04-02KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates()Paul Mackerras1-1/+2
The global_invalidates() function contains a check that is intended to tell whether we are currently executing in the context of a hypercall issued by the guest. The reason is that the optimization of using a local TLB invalidate instruction is only valid in that context. The check was testing local_paca->kvm_hstate.kvm_vcore, which gets set when entering the guest but no longer gets cleared when exiting the guest. To fix this, we use the kvm_vcpu field instead, which does get cleared when exiting the guest, by the kvmppc_release_hwthread() calls inside kvmppc_run_core(). The effect of having the check wrong was that when kvmppc_do_h_remove() got called from htab_write() on the destination machine during a migration, it cleared the current cpu's bit in kvm->arch.need_tlb_flush. This meant that when the guest started running in the destination VM, it may miss out on doing a complete TLB flush, and therefore may end up using stale TLB entries from a previous guest that used the same LPID value. This should make migration more reliable. Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-04-02powerpc/powernv: Remove unused debugfs OPAL log interfaceJoel Stanley1-103/+0
The OPAL log is now accessed through sysfs at /sys/firmware/opal/msglog, so remove the old and buggy debugfs file. Signed-off-by: Joel Stanley <joel@jms.id.au>
2014-04-01powernv, cpufreq: Export nominal frequency via sysfs.Gautham R. Shenoy1-1/+48
Create a driver attribute named "cpuinfo_nominal_freq" which will in turn create a read-only sysfs interface that will be used to export the nominal frequency to the userspace. This will be necessary for creating an optimal "performance" policy which should be running the on-demand governor with "scaling_max_freq" set to the value exported via "cpuinfo_max_freq" and "scaling_min_freq" set to the nominal frequency exported via "cpuinfo_nominal_freq". The patch caches the values of max, min, nominal pstate ids and nr_pstates queried from the DT during the initialization of the driver so that they can be used in other places in the driver for validation. Also, it adds a helper method that returns the frequency corresponding to a pstate id. This has been backported from the version posted against mainline which can be found here: https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg76990.html Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
2014-04-01cpuidle:Remove the debug messages printed on exit from idle statePreeti U Murthy1-19/+2
We had added the debug prints to confirm the idle state exit by the cpus. This was mainly to test if fast sleep was working fine. Now that we are confident about its functioning we can get rid of these prints. Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
2014-04-01powerpc/powernv: OPAL message log interface reworkJoel Stanley1-18/+40
This reworks the opal message log following upstream review. A bug was fixed where wrapped logs were not read correctly, and locking was added to reduce the impact of races between reading counters and the buffer contents. Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
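Reading a wrapped circular log in the right order is the kind of detail this rework touches. The sketch below is a generic illustration, not the OPAL log layout: the function name and the (next, wrapped) bookkeeping are assumptions, and it leaves out the locking mentioned in the message. The caller is assumed to supply an output buffer of at least size bytes.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the contents of a circular log into out[], oldest byte first.
 * buf/size describe the ring, next is the index where the next byte will
 * be written, and wrapped is nonzero once the writer has passed the end
 * of the ring at least once. Returns the number of bytes copied.
 */
size_t ring_log_read(const char *buf, size_t size, size_t next, int wrapped,
                     char *out)
{
    size_t n = 0;

    if (wrapped) {
        /* Oldest data sits between the write position and the end. */
        memcpy(out, buf + next, size - next);
        n = size - next;
    }
    /* Newest data sits between the start and the write position. */
    memcpy(out + n, buf, next);
    return n + next;
}
```

For an 8-byte ring into which "ABCDEFGHIJ" was written, the buffer holds "IJCDEFGH" with next = 2, and the function yields "CDEFGHIJ". A reader that ignores the wrap would return only "IJ" or return the bytes out of order.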
2014-03-31sched/autogroup: Fix race with task_groups listGerald Schaefer1-2/+1
In autogroup_create(), a tg is allocated and added to the task_groups list. If CONFIG_RT_GROUP_SCHED is set, this tg is then modified while on the list, without locking. This can race with someone walking the list, like __enable_runtime() during CPU unplug, and result in a use-after-free bug. To fix this, move sched_online_group(), which adds the tg to the list, to the end of the autogroup_create() function after the modification. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1369411669-46971-2-git-send-email-gerald.schaefer@de.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> (cherry picked from commit 41261b6a832ea0e788627f6a8707854423f9ff49)
2014-03-31tty/hvc_opal: Kick the HVC thread on OPAL console eventsBenjamin Herrenschmidt1-1/+25
The firmware can notify us when new input data is available, so let's make sure we wake up the HVC thread in that case. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-31powerpc/powernv: Add opal_notifier_unregister() and export to modulesBenjamin Herrenschmidt2-0/+16
opal_notifier_register() is missing a pending "unregister" variant and should be exposed to modules. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-31powerpc/ppc64: Do not turn AIL (reloc-on interrupts) too earlyBenjamin Herrenschmidt2-5/+15
Turn them on at the same time as we allow MSR_IR/DR in the paca kernel MSR, ie, after the MMU has been setup enough to be able to handle relocated access to the linear mapping. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-31powerpc/ppc64: Gracefully handle early interruptsBenjamin Herrenschmidt2-1/+17
If we take an interrupt such as a trap caused by a BUG_ON before the MMU has been set up, the interrupt handlers try to enable virtual mode and cause a recursive crash, making the original problem very hard to debug. This fixes it by adjusting the "kernel_msr" value in the PACA so that it only has MSR_IR and MSR_DR (translation for instruction and data) set after the MMU has been initialized for the processor. We may still not have a console yet but at least we don't get into a recursive fault (and early debug console or memory dump via JTAG of the kernel buffer *will* give us the proper error). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-31powerpc/powernv: Add invalid OPAL callJoel Stanley3-0/+6
This call will not be understood by OPAL, and will cause it to add an error to its log. Among other things, this is useful for testing the behaviour of the log as it fills up. Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-31powerpc/powernv: Add OPAL message log interfaceJoel Stanley4-2/+106
OPAL provides an in-memory circular buffer containing a message log populated with various runtime messages produced by the firmware. Provide a sysfs interface /sys/firmware/opal/messages for userspace to view the messages. Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-31powerpc/serial: Use saner flags when creating legacy portsBenjamin Herrenschmidt1-6/+9
We had a mix & match of flags used when creating legacy ports depending on where we found them in the device-tree. Among others we were missing UPF_SKIP_TEST for some kind of ISA ports which is a problem as quite a few UARTs out there don't support the loopback test (such as a lot of BMCs). Let's pick the set of flags used by the SoC code and generalize it which means autoconf, no loopback test, irq maybe shared and fixed port. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-31powerpc/ppc64: Print CPU/MMU/FW features at bootBenjamin Herrenschmidt1-0/+5
Helps debug funky firmware issues Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-31powerpc/mm: Fix page size passed to tlbieBenjamin Herrenschmidt1-20/+5
Commit b1022fbd293564de91596b8775340cf41ad5214c and subsequent ones (in 3.10) introduced some preparatory changes for THP which consist of trying to read the actual HPTE page size from the hash table to perform the right variant of tlbie. However this has two issues:

 - The hash entry can have been evicted and replaced by another one with a different page size. This can in turn cause us to use an impossible combination of psize and actual_psize, in turn causing tlbie to be called with an invalid LP bit combination causing a HW checkstop

 - The whole business is unnecessary as in 3.10 we don't have THP and thus always have psize == actual_psize

When THP was actually enabled in 3.11, we discovered that this wasn't going to work and changed the code significantly to pass the proper actual_psize from the upper layers rather than trying to deduce it from the HPTE. However, we didn't "fix" 3.10 as we didn't realize that the bug introduced an exposure without THP being enabled. If a user page was hashed as a 64k page, and later got evicted from the hash and replaced with a 4k hash entry (due to a segment being demoted to 4k, for example by subpage protection or because it's an IO page), we could get into a situation where we tried to do a tlbie with a psize of 64k and actual_psize of 4k which is deadly. This is a 3.10-only fix for this situation which essentially removes the actual_psize business from the normal updatepp and invalidate path in hash_native_64.c since we know on 3.10 that the psize coming from the upper levels is always correct (no THP). As such it's a partial revert of b1022fbd293564de91596b8775340cf41ad5214c (we don't touch the bolted path etc... those should be fine and we want to minimize churn). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-25drm/radeon: remove generic rptr/wptr functions (v2)Alex Deucher10-99/+254
Fill in asic family specific versions rather than using the generic version. This lets us handle asic specific differences more easily. In this case, we disable sw swapping of the rptr writeback value on r6xx+ since the hw does it for us. Fixes bogus rptr readback on BE systems. v2: remove missed cpu_to_le32(), add comments Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (backported from commit ea31bf697d27270188a93cd78cf9de4bc968aca3) Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com> LTC-Bugzilla: #99530
2014-03-25drm/radeon: remove special handling for the DMA ringChristian König10-45/+84
Now that we have callbacks for [rw]ptr handling we can remove the special handling for the DMA rings and use the callbacks instead. Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (backported from commit 2e1e6dad6a6d437e4c40611fdcc4e6cd9e2f969e) Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com> LTC-Bugzilla: #99530
2014-03-25drm/radeon: rework UVD writeback & [rw]ptr handlingChristian König8-28/+37
The hardware just doesn't support this correctly. Disable it before we accidentally write anywhere we shouldn't. Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (backported from commit 02c9f7fa4e7230fc4ae8bf26f64e45aa76011f9c) Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com> LTC-Bugzilla: #99530
2014-03-25drm/radeon: rework ring function handlingChristian König3-599/+254
Give the ring functions a separate structure and let the asic structure point to the ring specific functions. This simplifies the code and allows us to make changes at only one point. No change in functionality. Signed-off-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (backported from commit 76a0df859defc53e6cb61f698a48ac7da92c8d84) Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com> LTC-Bugzilla: #99530
2014-03-25drm/radeon: use callbacks for ring pointer handling (v3)Alex Deucher4-14/+182
Add callbacks to the radeon_asic struct to handle rptr/wptr fetches and wptr updates. We currently use one version for all rings, but this allows us to override with ring specific versions. Needed for compute rings on CIK. v2: update as per Christian's comments v3: fix some rebase cruft Signed-off-by: Alex Deucher <alexander.deucher@amd.com> (cherry picked from commit f93bdefe6269067afc85688d45c646cde350e0d8) Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com> LTC-Bugzilla: #99530
2014-03-19conf: default IO scheduler is set to deadlineEli Qiao1-0/+6
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
2014-03-19powerpc/book3s: Fix CFAR clobbering issue in machine check handler.Mahesh Salgaonkar2-0/+13
While checking powersaving mode in machine check handler at 0x200, we clobber CFAR register. Fix it by saving and restoring it during beq/bgt. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
2014-03-19powerpc/ftrace: bugfix for test_24bit_addrLiu Ping Fan1-0/+1
The branch target should be the func addr, not the addr of func_descr_t. So use ppc_function_entry() to generate the right target addr. Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-19hwsensors/ibmpowernv: Fix up sysfs file duplicationNeelesh Gupta1-0/+1
Fixing up the 'sysfs' file duplication by passing the initialized char array to strncpy() function as the result is not %NUL-terminated if the source exceeds 'copy_length' bytes. Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
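The strncpy() behaviour this message relies on can be sketched in plain C. This is only an illustration and not the driver's code: NAME_LEN and copy_name() are hypothetical names. Zero-filling the destination first means the result stays NUL-terminated even when the source is longer than copy_length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_LEN 8 /* hypothetical buffer size, for illustration only */

/*
 * Copy at most copy_length bytes of src into dst.
 * strncpy() does not append a NUL when src has copy_length or more
 * characters, so dst is zero-filled first; with copy_length < NAME_LEN
 * at least one trailing zero byte always survives.
 */
void copy_name(char dst[NAME_LEN], const char *src, size_t copy_length)
{
	assert(copy_length < NAME_LEN);
	memset(dst, 0, NAME_LEN);
	strncpy(dst, src, copy_length);
}
```

Without the memset(), a source longer than copy_length would leave whatever bytes were already in dst after the copied prefix, so two different sensors could end up with unpredictable or clashing names.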
2014-03-19ipr: Add new CCIN definition for Grand Canyon supportwenxiong@vmlinux.vnet.ibm.com2-0/+9
Add the appropriate definition and table entry for new hardware support. Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-19ipr: Format HCAM overlay ID 0x21wenxiong@vmlinux.vnet.ibm.com2-0/+53
This patch adds formatting error overlay 0x21 to improve debug capabilities. Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-19ipr: Get rid of superfluous call to pci_disable_msi/msix()Alexander Gordeev1-2/+0
There is no need to call pci_disable_msi() or pci_disable_msix() in case the call to pci_enable_msi() or pci_enable_msix() failed. Signed-off-by: Alexander Gordeev <agordeev@redhat.com> Acked-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-19ipr: Handle early EEHBrian King2-89/+179
If, when the ipr driver loads, the adapter is in an EEH error state, it will currently oops and not be able to recover, as it attempts to access memory that has not yet been allocated. We've seen this occur in some kexec scenarios. The following patch fixes the oops and also allows the driver to recover from these probe time EEH errors. Signed-off-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-19ipr: Add new CCIN definition for new hardware supportwenxiong@vmlinux.vnet.ibm.com2-0/+3
Add the appropriate definition and table entry for new hardware support. Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-19ipr: Remove extended delay bit on GSCSI reads/writes opswenxiong@vmlinux.vnet.ibm.com2-1/+6
This patch removes the extended delay bit on GSCSI read/write ops, which makes performance significantly better. Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-19ipr: Increase msi-x interrupt vectors to 16wenxiong@linux.vnet.ibm.com2-2/+2
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com> Acked-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2014-03-19ipr: Add several new CCIN definitions for new adapters supportwenxiong@linux.vnet.ibm.com2-0/+21
Add the appropriate definitions and table entries for new adapter support. Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2014-03-19ipr: IOA Status Code(IOASC) updatewenxiong@linux.vnet.ibm.com1-1/+39
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com> Acked-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2014-03-19ipr: qc_fill_rtf() method should not store alternate status registerSergei Shtylyov1-1/+0
The 'ctl' field of the 'struct ata_taskfile' is not really dual purpose, i.e. it is not intended for storing the alternate status register (which is mapped at the same address in the legacy IDE controllers) in the qc_fill_rtf() method. No other 'libata' driver except 'drivers/scsi/ipr.c' stores the alternate status register's value in the 'ctl' field of 'qc->result_tf', hence this driver should not do this as well... Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Acked-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: James Bottomley <JBottomley@Parallels.com>
2014-03-18KVM: PPC: Book3S HV: Save/restore host PMU registers that are new in POWER8Paul Mackerras3-1/+33
Currently we save the host PMU configuration, counter values, etc., when entering a guest, and restore it on return from the guest. (We have to do this because the guest has control of the PMU while it is executing.) However, we missed saving/restoring the SIAR and SDAR registers, as well as the registers which are new on POWER8, namely SIER and MMCR2. This adds code to save the values of these registers when entering the guest and restore them on exit. This also works around the bug where setting PMAE with a counter already negative doesn't generate an interrupt. This was already worked around for the guest PMU state in an earlier commit, and is worked around for the host PMU state here. Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-03-18KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugsPaul Mackerras2-7/+64
This adds workarounds for two hardware bugs in the POWER8 performance monitor unit (PMU), both related to interrupt generation. The effect of these bugs is that PMU interrupts can get lost, leading to tools such as perf reporting fewer counts and samples than they should. The first bug relates to the PMAO (perf. mon. alert occurred) bit in MMCR0; setting it should cause an interrupt, but doesn't. The other bug relates to the PMAE (perf. mon. alert enable) bit in MMCR0. Setting PMAE when a counter is negative and counter negative conditions are enabled to cause alerts should cause an alert, but doesn't. The workaround for the first bug is to create conditions where a counter will overflow, whenever we are about to restore a MMCR0 value that has PMAO set (and PMAO_SYNC clear). The workaround for the second bug is to freeze all counters using MMCR2 before reading MMCR0. Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-03-18KVM: PPC: Book3S HV: Move guest TM restore code out of PMU restore sequencePaul Mackerras1-38/+38
Somehow, the code that restores the guest transactional memory state got put in the middle of the code sequence that restores the guest PMU (performance monitor unit) state. This results in corruption of the value written to MMCR0 if the guest is in transactional state. This fixes it by moving the TM state-restoring code to come just before the PMU state-restoring code. This comes out in the patch as the first part of the PMU state-restoring code being moved down to just before the second part of the PMU state-restoring code. Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-03-18powerpc/perf: Add lost exception workaroundMichael Ellerman3-1/+89
Some power8 revisions have a hardware bug where we can lose a PMU exception. This commit adds a workaround to detect the bad condition and rectify the situation. See the comment in the commit for a full description. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-03-18powerpc: Add a cpu feature CPU_FTR_PMAO_BUGMichael Ellerman2-3/+5
Some power8 revisions have a hardware bug where we can lose a Performance Monitor (PMU) exception under certain circumstances. We will be adding a workaround for this case, see the next commit for details. The observed behaviour is that writing PMAO doesn't cause an exception as we would expect, hence the name of the feature. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-03-10xhci: fix incorrect type in assignment in handle_device_notification()Xenia Ragiadakou1-1/+1
This patch converts Event TRB's 3rd field, which has type le32, to CPU byteorder before using it to retrieve the Slot ID with TRB_TO_SLOT_ID macro. This bug was found using sparse. Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com> Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com> [Backport of 7e76ad431545d013911ddc744843118b43d01e89] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-10xhci: convert TRB_CYCLE to le32 before using it to set Link TRB's cycle bitXenia Ragiadakou1-2/+3
This patch converts TRB_CYCLE to le32 to update correctly the Cycle Bit in 'control' field of the link TRB. This bug was found using sparse. Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com> Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com> [Backport of 587194873820a4a1b2eda260ac851394095afd77] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
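The two xhci fixes above share one idea: a TRB field is stored little-endian, so it has to be converted before host-order macros or constants are applied to it. The sketch below models the field as four raw bytes so that it behaves the same on any host. The helper names are hypothetical, and the two macros are stand-ins written for this example rather than copied from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the xHCI definitions; treat the values as assumptions. */
#define TRB_CYCLE          (1u << 0)
#define TRB_TO_SLOT_ID(p)  (((p) & (0xffu << 24)) >> 24)

/* Decode a 32-bit little-endian field into host byte order. */
uint32_t le32_load(const uint8_t field[4])
{
	assert(field != NULL);
	return (uint32_t)field[0] | ((uint32_t)field[1] << 8) |
	       ((uint32_t)field[2] << 16) | ((uint32_t)field[3] << 24);
}

/* Encode a host-order value into a 32-bit little-endian field. */
void le32_store(uint8_t field[4], uint32_t value)
{
	assert(field != NULL);
	field[0] = (uint8_t)(value & 0xff);
	field[1] = (uint8_t)((value >> 8) & 0xff);
	field[2] = (uint8_t)((value >> 16) & 0xff);
	field[3] = (uint8_t)((value >> 24) & 0xff);
}

/* Convert first, then extract: the macro expects a host-order value. */
uint32_t event_slot_id(const uint8_t flags[4])
{
	return TRB_TO_SLOT_ID(le32_load(flags));
}

/* Toggle the cycle bit without disturbing the other control bits. */
void link_trb_toggle_cycle(uint8_t control[4])
{
	le32_store(control, le32_load(control) ^ TRB_CYCLE);
}
```

On a little-endian host skipping the conversion happens to work, which is why this class of bug tends to show up only on big-endian machines or under sparse.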
2014-03-06net/mlx4: Support shutdown() interfaceGavin Shan1-0/+1
In the kexec scenario, we failed to load the mlx4 driver in the second kernel because the ownership bit was held by the first kernel and never released correctly. The patch adds the shutdown() interface so that the ownership can be released correctly in the first kernel. It also helps avoid EEH errors during the boot stage of the second kernel caused by undesired traffic, which can't be handled by hardware during that stage on the Power platform. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Tested-by: Wei Yang <weiyang@linux.vnet.ibm.com>
2014-03-05drivers/vfio/pci: Fix MSIx message lostGavin Shan1-0/+19
The problem is specific to the case of BIST issued to the IPR adapter on the guest side. The IPR driver does something like this: pci_save_state(), BIST reset and then pci_restore_state(). We lose everything in the MSIx table with the BIST reset and we never have a chance to restore the MSIx table in that case. pci_restore_msix_state(), called by pci_restore_state(), masks all MSIx vectors through the MSIx capability, restores the MSIx table, and then unmasks all MSIx vectors. We force the host kernel to restore the MSIx vector in the step of unmasking all MSIx vectors to fix the issue. The patch is under review at the moment in the Linux community. It'd be better to have an ack from Ben and Alexey if we really want this to be in Frobisher. It's in response to bug#103589. Reported-by: Wen Xiong <wenxiong@us.ibm.com> Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
2014-03-05powerpc/eeh: Disable EEH on rebootGavin Shan2-1/+22
We possibly detect EEH errors during reboot, particularly in the kexec path, but it's impossible for device drivers and the EEH core to handle or recover from them properly. The patch registers one reboot notifier for EEH and disables the EEH subsystem during reboot. That means the EEH errors are going to be cleared by hardware reset or by the second kernel during the early stage of PCI probe. It's backporting commit 66f9af83e56bfa12964d251df9d60fb571579913 ("powerpc/eeh: Disable EEH on reboot") from 3.14 upstream for bug#103590 Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-05powerpc/eeh: Cleanup on eeh_subsystem_enabledGavin Shan4-10/+27
The patch cleans up variable eeh_subsystem_enabled so that external code no longer needs to refer to the variable directly. Instead, we will use functions eeh_enabled() and eeh_set_enable() to operate on the variable. It's backporting 2ec5a0adf60c23bb6b0a95d3b96a8c1ff1e1aa5a ("powerpc/eeh: Cleanup on eeh_subsystem_enabled") from 3.14 upstream for bug#103590 Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-05powerpc/powernv: Rework EEH resetGavin Shan1-25/+4
When doing reset in order to recover the affected PE, we issue hot reset on PE primary bus if it's not root bus. Otherwise, we issue hot or fundamental reset on root port or PHB accordingly. For the latter case, we didn't cover the situation where the PE only includes the root port, which potentially causes a kernel crash upon an EEH error to the PE. The patch reworks the logic of EEH reset to improve the code readability and also avoid the kernel crash. It's backporting commit 5b2e198e50f6ba57081586b853163ea1bb95f1a8 ("powerpc/powernv: Rework EEH reset") from 3.14 upstream for bug#103590 Cc: stable@vger.kernel.org Reported-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com> Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-04powerpc: fix realmode-failure flag resetAlexey Kardashevskiy1-6/+6
A malicious guest can register an IOMMU in KVM while a TCE request is being passed from the real to virtual mode. If vcpu->arch.tce_rm_fail was previously used and not cleared because of missing LIOBN entry in KVM, this may cause unwanted put_page() in the virtual mode handler. This moves @tce_rm_fail earlier to avoid using the incorrect tce_rm_fail flag value. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-03-04powerpc/eeh: More reliability of PCI dev resetGavin Shan1-8/+11
The PCI core has function pci_reset_function() to do reset on the specified PCI device. Before the reset starts, the state of the PCI device is saved and it is restored after reset. The real reset work could be routed to pcibios_set_pcie_reset_state() by quirks. However, the PCI bus or PCI device isn't settled down fully for restore (PCI config and MMIO for MSIx table) after reset and it would introduce unnecessary frozen PE. Eventually, we're stopped from passing through IPR adapter from host to KVM-based guest. The patch adds delay in pcibios_set_pcie_reset_state() so that the PCI bus/device can settle down fully before restoring PCI device states. It's part of the fixes regarding bug#103297 and bug#103589. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
2014-03-04tick: Make oneshot broadcast robust vs. CPU offliningThomas Gleixner1-2/+11
In periodic mode we remove offline cpus from the broadcast propagation mask. In oneshot mode we fail to do so. This was not a problem so far, but the recent changes to the broadcast propagation introduced a constellation which can result in a NULL pointer dereference. What happens is:

CPU0                            CPU1
                                idle()
                                  arch_idle()
                                    tick_broadcast_oneshot_control(OFF);
                                      set cpu1 in tick_broadcast_force_mask
                                  if (cpu_offline())
                                    arch_cpu_dead()
cpu_dead_cleanup(cpu1)
  cpu1 tickdevice pointer = NULL
broadcast interrupt
  dereference cpu1 tickdevice pointer -> OOPS

We dereference the pointer because cpu1 is still set in tick_broadcast_force_mask and tick_do_broadcast() expects a valid cpumask and therefore lacks any further checks. Remove the cpu from the tick_broadcast_force_mask before we set the tick device pointer to NULL. Also add a sanity check to the oneshot broadcast function, so we can detect such issues w/o crashing the machine. Reported-by: Prarit Bhargava <prarit@redhat.com> Cc: athorlton@sgi.com Cc: CAI Qian <caiqian@redhat.com> Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1306261303260.4013@ionos.tec.linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> (cherry picked from commit c9b5a266b103af873abb9ac03bc3d067702c8f4b) Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
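The ordering rule and the sanity check described in this message can be modelled in a few lines of user-space C. This is a simplified sketch, not the kernel's tick code: the mask, the per-CPU pointer array and every function name here are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS 8 /* hypothetical CPU count for the model */

struct tick_device {
	int events; /* number of broadcast events delivered */
};

static struct tick_device *cpu_tick_device[NR_CPUS];
static uint32_t broadcast_force_mask;

/* Bring a CPU into the model: give it a device and set its mask bit. */
void cpu_online(unsigned int cpu, struct tick_device *td)
{
	assert(cpu < NR_CPUS && td != NULL);
	cpu_tick_device[cpu] = td;
	broadcast_force_mask |= UINT32_C(1) << cpu;
}

/* Clear the mask bit first, and only then drop the device pointer. */
void cpu_dead_cleanup(unsigned int cpu)
{
	assert(cpu < NR_CPUS);
	broadcast_force_mask &= ~(UINT32_C(1) << cpu);
	cpu_tick_device[cpu] = NULL;
}

/* Deliver a broadcast; returns how many CPUs received it. */
int do_broadcast(void)
{
	int delivered = 0;

	for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(broadcast_force_mask & (UINT32_C(1) << cpu)))
			continue;
		/* Sanity check: skip a stale mask bit instead of crashing. */
		if (cpu_tick_device[cpu] == NULL)
			continue;
		cpu_tick_device[cpu]->events++;
		delivered++;
	}
	return delivered;
}
```

Either measure alone would avoid the oops in this model; doing both, as the message describes, keeps the mask consistent and still survives if a stale bit appears some other way.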
2014-03-04cpuidle/powernv: Enable Fastsleep at boot timePreeti U Murthy1-1/+4
Fast sleep can be enabled today, only after writing into the proc interface /proc/sys/kernel/powersave-nap with a value greater than 1. Remove this constraint, now that we have a stable framework to support fast sleep, so that it is enabled by default at boot. However the same proc interface is also used to convey if deep idle states beyond snooze can be entered into or not. Hence retain the check on powersave-nap in fast sleep to verify if this is the case. Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
2014-03-04sapphire: Add skiroot configurationJeremy Kerr1-0/+227
Add a configuration file to use when building the skiroot (Sapphire bootloader) kernel. Signed-off-by: Jeremy Kerr <jk@ozlabs.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-04powerpc/powernv: Refactor PHB diag-data dumpGavin Shan1-95/+125
As Ben suggested, the patch prints PHB diag-data with multiple fields in one line and omits the line if the fields of that line are all zero. With the patch applied, the PHB3 diag-data dump looks like: PHB3 PHB#3 Diag-data (Version: 1) brdgCtl: 00000002 RootSts: 0000000f 00400000 b0830008 00100147 00002000 nFir: 0000000000000000 0030006e00000000 0000000000000000 PhbSts: 0000001c00000000 0000000000000000 Lem: 0000000000100000 42498e327f502eae 0000000000000000 InAErr: 8000000000000000 8000000000000000 0402030000000000 \ 0000000000000000 PE[ 8] A/B: 8480002b00000000 8000000000000000 Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Jeremy Kerr <jk@ozlabs.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
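The "several fields per line, skip the line when all of them are zero" rule can be sketched as a small formatting helper. This is an illustration only: format_diag_line() is a hypothetical name, and the fixed 16-digit hex width is an assumption based on the 64-bit fields in the sample dump above.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format "label: v0 v1 ..." into buf.
 * Returns 0 (and an empty string) when every value is zero,
 * the number of characters written otherwise, or -1 if buf is too small.
 */
int format_diag_line(char *buf, size_t len, const char *label,
		     const uint64_t *vals, size_t count)
{
	bool nonzero = false;
	size_t used;
	int n;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < count; i++)
		if (vals[i] != 0)
			nonzero = true;
	if (!nonzero)
		return 0; /* nothing worth printing on this line */

	n = snprintf(buf, len, "%s:", label);
	if (n < 0 || (size_t)n >= len)
		return -1;
	used = (size_t)n;

	for (size_t i = 0; i < count; i++) {
		n = snprintf(buf + used, len - used, " %016" PRIx64, vals[i]);
		if (n < 0 || (size_t)n >= len - used)
			return -1;
		used += (size_t)n;
	}
	return (int)used;
}
```

A caller would print the buffer only when the return value is positive, which keeps an otherwise healthy PHB's dump down to the handful of lines that carry information.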
2014-03-04powerpc/powernv: Dump PHB diag-data immediatelyGavin Shan1-37/+42
The PHB diag-data is useful to help locating the root cause for frozen PE or fenced PHB. However, EEH core enables IO path by clearing part of HW registers before collecting it and eventually we got broken PHB diag-data. The patch intends to fix it by dumping the PHB diag-data immediately when frozen/fenced state on PE or PHB is detected for the first time in eeh_ops::get_state() or next_error() backend. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Jeremy Kerr <jk@ozlabs.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-04powerpc/powernv: Move PNV_EEH_STATE_ENABLED aroundGavin Shan3-11/+6
The flag PNV_EEH_STATE_ENABLED is put into pnv_phb::eeh_state, which is protected by CONFIG_EEH. We don't need that. Instead, we can have pnv_phb::flags and maintain all flags there, which is the purpose of the patch. The patch also renames PNV_EEH_STATE_ENABLED to PNV_PHB_FLAG_EEH. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Jeremy Kerr <jk@ozlabs.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-04powerpc/powernv: Remove PNV_EEH_STATE_REMOVEDGavin Shan2-42/+15
The PHB state PNV_EEH_STATE_REMOVED maintained in pnv_phb isn't so useful any more and it duplicates EEH_PE_ISOLATED. The patch replaces PNV_EEH_STATE_REMOVED with EEH_PE_ISOLATED. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Jeremy Kerr <jk@ozlabs.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-04powerpc/eeh: Remove EEH_PE_PHB_DEADGavin Shan3-14/+6
The PE state (for eeh_pe instance) EEH_PE_PHB_DEAD duplicates EEH_PE_ISOLATED. Originally, those PHBs (PHB PE) with EEH_PE_PHB_DEAD would be removed from the system. However, it's safe to replace that with EEH_PE_ISOLATED. The patch also clears EEH_PE_RECOVERING after a fenced PHB has been handled, whether that ended in failure or success. It makes the PHB PE state consistent with:

PHB functions normally              NONE
PHB has been removed                EEH_PE_ISOLATED
PHB fenced, recovery in progress    EEH_PE_ISOLATED | RECOVERING

Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Jeremy Kerr <jk@ozlabs.org> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-04hwsensors/ibmpowernv: Cleanup and fix up sysfs file duplicationNeelesh Gupta1-87/+76
Cleaning up the code, removing an unnecessary enumeration, clubbing the fragmented data structure and some conditional checks in node traversal in __init code. This also fixes a bug of sysfs file duplication. Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-04powerpc: vfio: remove incorrect put_page() for PR KVMAlexey Kardashevskiy2-2/+2
This fixes memory corruption which happens when VFIO is used with PR KVM. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-03-04powerpc/powernv: Return secondary CPUs to firmware before FW updateVasant Hegde3-7/+66
Firmware update on PowerNV platform takes several minutes. During this time one CPU is stuck in FW and the kernel complains about "soft lockups". This patch returns all secondary CPUs to firmware before starting firmware update process. [ Reworked a bit and cleaned up -- BenH ] Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-04powerpc/powernv: Read OPAL error log and export it through sysfsStewart Smith2-212/+275
Cherry pick 3b3f89ac6614d6bc2e2edb32e49d4906d931c795, implementing the error log reading code we're pushing upstream. This changes the userspace interface for reading and acknowledging error logs, so userspace code will have to change if it relied on the old way. Based on a patch by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> This patch adds support to read error logs from OPAL and export them to userspace through a sysfs interface. We export each log entry as a directory in /sys/firmware/opal/elog/ Currently, OPAL will buffer up to 128 error log records, we don't need to have any knowledge of this limit on the Linux side as that is actually largely transparent to us. Each error log entry has the following files: id, type, acknowledge, raw. Currently we just export the raw binary error log in the 'raw' attribute. In a future patch, we may parse more of the error log to make it a bit easier for userspace (e.g. to be able to display a brief summary in petitboot without having to have a full parser). If we have >128 logs from OPAL, we'll only be notified of 128 until userspace starts acknowledging them. This limitation may be lifted in the future and with this patch, that should "just work" from the linux side. A userspace daemon should: - wait for error log entries using normal mechanisms (we announce creation) - read error log entry - save error log entry safely to disk - acknowledge the error log entry - rinse, repeat. On the Linux side, we read the error log when we're notified of it. This possibly isn't ideal as it would be better to only read them on-demand. However, this doesn't really work with current OPAL interface, so we read the error log immediately when notified at the moment. I've tested this pretty extensively and am rather confident that the linux side of things works rather well. There is currently an issue with the service processor side of things for >128 error logs though. 
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Conflicts: arch/powerpc/include/asm/opal.h arch/powerpc/platforms/powernv/Makefile arch/powerpc/platforms/powernv/opal-elog.c
2014-03-04Backport upstream powerpc/powernv Platform dump sysfs interfaceStewart Smith4-239/+364
This patch makes the sysfs interface match that of what's pushed upstream. changes in kernel: - fetch dump on-demand - directory per dump - in sysfs rather than debugfs Userspace changes needed - read from sysfs rather than debugfs. This enables support for userspace to fetch and initiate FSP and Platform dumps from the service processor (via firmware) through sysfs. Based on original patch from Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Flow: - We register for OPAL notification events. - OPAL sends new dump available notification. - We make information on dump available via sysfs - Userspace requests dump contents - We retrieve the dump via OPAL interface - User copies the dump data - userspace sends ack for dump - We send ACK to OPAL. sysfs files: - We add the /sys/firmware/opal/dump directory - echoing 1 (well, anything, but in future we may support different dump types) to /sys/firmware/opal/dump/initiate_dump will initiate a dump. - Each dump that we've been notified of gets a directory in /sys/firmware/opal/dump/ with a name of the dump type and ID (in hex, as this is what's used elsewhere to identify the dump). - Each dump has files: id, type, dump and acknowledge dump is binary and is the dump itself. echoing 'ack' to acknowledge (currently any string will do) will acknowledge the dump and it will soon after disappear from sysfs. OPAL APIs: - opal_dump_init() - opal_dump_info() - opal_dump_read() - opal_dump_ack() - opal_dump_resend_notification() Currently we are only ever notified for one dump at a time (until the user explicitly acks the current dump, then we get a notification of the next dump), but this kernel code should "just work" when OPAL starts notifying us of all the dumps present. Changes since v2: - fix bug where we would free the dump buffer after userspace read it, refetching if needed. Refetching doesn't currently work, so we must keep the dump around for subsequent reads. 
Changes since v1: - Add support for getting dump type from OPAL through new OPAL call (falling back to old OPAL_DUMP_INFO call if OPAL_DUMP_INFO2 isn't supported) - use dump type in directory name for dump Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Conflicts: arch/powerpc/include/asm/opal.h arch/powerpc/platforms/powernv/Makefile arch/powerpc/platforms/powernv/opal-dump.c arch/powerpc/platforms/powernv/opal-wrappers.S arch/powerpc/platforms/powernv/opal.c
2014-03-04powerpc/crashdump : fix page frame number check in copy_oldmem_pageMahesh Salgaonkar1-3/+5
In copy_oldmem_page, the current check using max_pfn and min_low_pfn to decide if the page is backed or not, is not valid when the memory layout is not continuous. This happens when running as a QEMU/KVM guest, where RTAS is mapped higher in the memory. In that case max_pfn points to the end of RTAS, and a hole between the end of the kdump kernel and RTAS is not backed by PTEs. As a consequence, the kdump kernel is crashing in copy_oldmem_page when accessing in a direct way the pages in that hole. This fix relies on the memblock's service memblock_is_region_memory to check if the read page is part or not of the directly accessible memory. This is a backport of upstream patch https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-February/115569.html This fixes LTC BUG #104729 Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
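Why a min/max page frame check breaks on a layout with a hole, and why a per-region check does not, can be shown with a small model. This is a sketch under assumptions: struct mem_region and both helpers are hypothetical and only imitate the idea behind memblock_is_region_memory(), not its implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous block of directly accessible memory (model only). */
struct mem_region {
	uint64_t base;
	uint64_t size;
};

/* Bounds-only test: looks only at the lowest and highest address. */
bool addr_in_min_max(uint64_t addr, uint64_t min_addr, uint64_t max_addr)
{
	return addr >= min_addr && addr < max_addr;
}

/* True only if [base, base + size) lies entirely inside one region. */
bool region_is_memory(const struct mem_region *regions, size_t count,
		      uint64_t base, uint64_t size)
{
	assert(regions != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		uint64_t start = regions[i].base;
		uint64_t end = start + regions[i].size;

		if (base >= start && base < end && size <= end - base)
			return true;
	}
	return false;
}
```

With two regions separated by a gap, an address inside the gap passes the bounds-only test but fails the region test, which mirrors the hole between the kdump kernel and RTAS described above.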
2014-03-04net/cxgb4: use remove handler as shutdown handlerThadeu Lima de Souza Cascardo1-0/+1
Without a shutdown handler, T4 cards behave very badly after a kexec. Some firmware calls return errors indicating allocation failures, for example. This is probably because those resources were not released by a BYE message to the firmware. Using the remove handler guarantees we will use a well tested path. With this patch applied, I managed to use kexec multiple times and probe and iSCSI login worked every time. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> LTC-Bugzilla: #103241 (cherry picked from commit 687d705c031916b83953b714917b04d899e23cf5)
2014-02-25spec : Fix build issue on koji server.Eli Qiao1-6/+11
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
2014-02-20config: disable CONFIG_SCSI_CHELSIO_FCOE to avoid system crashWang Sen1-1/+1
https://bugzilla.linux.ibm.com/show_bug.cgi?id=104249 https://bugzilla.linux.ibm.com/show_bug.cgi?id=104444 Signed-off-by: Wang Sen <wangsen@linux.vnet.ibm.com>
2014-02-19powerpc: Fix kdump hang issue on p8 with relocation on exception enabled.Mahesh Salgaonkar2-0/+26
On p8 systems, with the relocation on exception feature enabled we are seeing the kdump kernel hang at interrupt vector 0xc*4400. The reason is, with this feature enabled, exceptions are raised with MMU (IR=DR=1) ON with the default offset of 0xc*4000. Since the exception is raised in virtual mode it requires the vector region to be executable, without which it fails to fetch and execute the instruction at 0xc*4xxx. For the default kernel, since the kernel is loaded at real 0, the htab mappings set the entire kernel text region executable. But for a relocatable kernel (e.g. kdump case) we only copy interrupt vectors down to real 0 and never mark that region as executable because in p7 and below we always get exceptions in real mode. This patch fixes this issue by marking htab mapping range as executable that overlaps with the interrupt vector region for relocatable kernel. Thanks to Ben who helped me to debug this issue and find the root cause. This is at least part of the fix for kdump failures that we are seeing in bug 103693. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> (cherry picked from commit 429d2e8342954d337abe370d957e78291032d867) Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-02-19powerpc/pseries: Disable relocation on exception while going down during crash.Mahesh Salgaonkar1-2/+1
Disable relocation on exception while going down even in kdump case. This is because we are about to clear htab mappings while kexec-ing into kdump kernel and we may run into issues if we still have AIL ON. This is at least part of the fix for kdump failures that we are seeing in bug 103693. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> (cherry picked from commit 3ec8b78fcc5aa7745026d8d85a4e9ab52c922765) Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-02-19powernv: don't attempt to refetch the FSP dump until the user has explicitly ↵Stewart Smith1-0/+8
acked it. This fixes a bug where we would get two events from OPAL with DUMP_AVAIL set (which is valid for OPAL to do) and in the second run of extract_dump() we would fail to free the memory previously allocated for the dump (leaking ~6MB+) as well as on the second dump_read_data() call OPAL would not retrieve the dump, leaving us with a dump in linux that was the correct size but all zeros. Changes since v1: fixed typo Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> LTC-Bugzilla: #104211
2014-02-12KVM: PPC: Book3S HV: Fix decrementer timeouts with non-zero TB offsetPaul Mackerras1-1/+9
Commit 082fee36bd2c ("KVM: PPC: Book3S HV: Make physical thread 0 do the MMU switching") reordered the guest entry/exit code so that most of the guest register save/restore code happened in guest MMU context. A side effect of that is that the timebase still contains the guest timebase value at the point where we compute and use vcpu->arch.dec_expires, and therefore that is now a guest timebase value rather than a host timebase value. That in turn means that the timeouts computed in kvmppc_set_timer() are wrong if the timebase offset for the guest is non-zero. The consequence of that is things such as "sleep 1" in a guest after migration may sleep for much longer than they should. This fixes the problem by converting between guest and host timebase values as necessary, by adding or subtracting the timebase offset. This also fixes an incorrect comment. This is part of the fix for many of the migration-related bug reports. Signed-off-by: Paul Mackerras <paulus@samba.org>
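The conversion the commit describes is a single add or subtract of the timebase offset. A minimal sketch follows, assuming that the guest timebase equals the host timebase plus the offset and that values wrap at 64 bits; the function names are illustrative and are not the kernel's.

```python
MASK64 = (1 << 64) - 1  # assumption: timebase values are 64-bit and wrap


def guest_to_host_tb(guest_tb, tb_offset):
    """Convert a guest timebase value to the host timebase (assumes guest = host + offset)."""
    return (guest_tb - tb_offset) & MASK64


def host_to_guest_tb(host_tb, tb_offset):
    """Convert a host timebase value to the guest timebase (assumes guest = host + offset)."""
    return (host_tb + tb_offset) & MASK64
```

A timeout computed from a guest-timebase expiry without the first conversion would be off by exactly `tb_offset` ticks, which is harmless only when the offset is zero.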
2014-02-12powerpc/powernv: Respect max_cpus while initializing core split/unsplit feature.Mahesh Salgaonkar1-1/+13
In the kdump kernel we see a hang during subcore_init() at unsplit_core()->wait_for_sync_step(). The kdump kernel always boots with maxcpus=1 and all other cpus are waiting inside OPAL, hence with 1 online cpu the master thread keeps waiting indefinitely for the secondary threads to set split_state. The same is true in all cases where max_cpus is not aligned with threads_per_core. This patch fixes the issue by disabling the core split/unsplit feature if max_cpus is not aligned with threads_per_core. This also fixes the kdump hang issue. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
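The alignment condition amounts to a remainder test. The sketch below is an illustration only; the function name is hypothetical and the kernel's actual check may differ in detail.

```python
def core_split_supported(max_cpus, threads_per_core):
    """Return True when max_cpus is a whole multiple of threads_per_core.

    Hypothetical helper: if the counts are not aligned, some secondary
    threads of a core never come online, so the master thread would wait
    on them forever during the unsplit step.
    """
    return max_cpus % threads_per_core == 0
```

For the kdump case (maxcpus=1 with 8 threads per core) the test fails, so the feature would be left disabled.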
2014-02-05Bump SPEC file to pbeta1Crístian Viana1-1/+9
Signed-off-by: Crístian Viana <vianac@linux.vnet.ibm.com>
2014-02-04vfio: remove redundant put_page() when failed in real modeAlexey Kardashevskiy2-8/+9
This fixes one of the corner cases which produced a wrong backtrace from put_page(). BZ: 103055 Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-02-04powerpc/powernv: Clear IPI flag when flushing interruptsPaul Mackerras1-1/+3
icp_native_flush_interrupt() function is supposed to clear a pending interrupt, like local_irq_enable(); local_irq_disable() would, but without calling generic code. Unfortunately it missed clearing the "IPI pending" flag in the PACA (local_paca->kvm_hstate.host_ipi). The effect of this flag being set is that secondary CPU threads won't go into the KVM guest, leading to messages like: kvmppc_wait_for_nap timeout 0 1 when a KVM HV guest is run. This fixes it by adding a call to kvmppc_set_host_ipi to clear the flag. This fixes BZ 103513. Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-02-04KVM: PPC: Book3S HV: Fix register usage when loading/saving VRSAVEPaul Mackerras1-2/+6
Commit 595e4f7e697e ("KVM: PPC: Book3S HV: Use load/store_fp_state functions in HV guest entry/exit") changed the register usage in kvmppc_save_fp() and kvmppc_load_fp() but omitted changing the instructions that load and save VRSAVE. The result is that the VRSAVE value was loaded from a constant address, and saved to a location past the end of the vcpu struct, causing host kernel memory corruption and various kinds of host kernel crashes. This fixes the problem by using register r31, which contains the vcpu pointer, instead of r3 and r4. This should help resolve several bugzillas involving guest or host crashes and hangs, including 98456, 102775, 103534, 100504, and possibly others. Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-02-04powerpc: fix to compile without CONFIG_IOMMU_APIAlexey Kardashevskiy1-7/+0
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-01-29md: Avoid deadlock in raid5_alloc_percpuOleg Nesterov1-39/+34
register_cpu_notifier() can deadlock if called inside a get/put_online_cpus block. To avoid this, move the call to register_cpu_notifier before the get_online_cpus(). [paulus@samba.org - renamed alloc_xxx to alloc_percpu_areas, fixed compile errors, made up patch description] This fixes BZ 103213. Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-01-22spec: pbuild8Eli Qiao1-2/+15
change log: PPC: KVM: fix to compile without VFIO vfio: fix in-kernel and ioctl handlers Fix a bug where asking for a POWER8 guest on a POWER7 system doesn't fail, but should Fix and performance improvements for nested virtualization LTC BZ 101114 CPU Build0.6: Host Cpu Offline/online leads to instruction dump and further cpu online/offline functions are not PowerKVM Build 8 host platform support Fix problems reported by the kernel RCU checking machinery and may help fix the memory corruption issues we have been seeing LTC BZ 101123 Unable to bring up LE guest using libvirt/virsh Fixes a bug with not resetting page struct pointer which caused bugs in calling code. Fix one of the corner cases when the realmode handler fails to handle T_PUT_TCE_INDIRECT call and passes it further to the vir Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
2014-01-22vfio: fix virtmode handlerAlexey Kardashevskiy2-1/+9
The existing handler assumes that the first failed TCE entry's host physical address is saved in the tce_tmp_hpas cache, but it is not, so the virtmode handler has to read it from the TCE list again. This patch makes it do so. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-01-22Add CONFIG_SENSORS_IBMPOWERNV=yEli Qiao2-0/+4
to config-powerpc64 and config-powerpc64p7 Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
2014-01-22powerpc: Make sure "cache" directory is removed when offlining cpuPaul Mackerras1-0/+3
The code in remove_cache_dir() is supposed to remove the "cache" subdirectory from the sysfs directory for a CPU when that CPU is being offlined. It tries to do this by calling kobject_put() on the kobject for the subdirectory. However, the subdirectory only gets removed once the last reference goes away, and the reference being put here may well not be the last reference. That means that the "cache" subdirectory may still exist when the offlining operation has finished. If the same CPU subsequently gets onlined, the code tries to add a new "cache" subdirectory. If the old subdirectory has not yet been removed, we get a WARN_ON in the sysfs code, with stack trace, and an error message printed on the console. Further, we ultimately end up with an online cpu with no "cache" subdirectory. This fixes it by doing an explicit kobject_del() at the point where we want the subdirectory to go away. kobject_del() removes the sysfs directory even though the object still exists in memory. The object will get freed at some point in the future. A subsequent onlining operation can create a new sysfs directory, even if the old object still exists in memory, without causing any problems. This fixes BZ 101114. Signed-off-by: Paul Mackerras <paulus@samba.org>
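The difference between dropping a reference and explicitly deleting the sysfs entry can be modelled in a few lines. This is a toy userspace model, not kernel code: `ToyKobject` and its methods are hypothetical stand-ins for the kobject_get(), kobject_put() and kobject_del() behaviour described above, and a Python set stands in for the parent sysfs directory.

```python
class ToyKobject:
    """Toy reference-counted object with a visible directory entry."""

    def __init__(self, name, sysfs_dir):
        self.name = name
        self.sysfs_dir = sysfs_dir  # shared set of names, stands in for sysfs
        self.refcount = 1
        self.visible = False

    def add(self):
        # Creating a duplicate name is the situation that triggers the warning.
        if self.name in self.sysfs_dir:
            raise FileExistsError("cannot create duplicate filename " + self.name)
        self.sysfs_dir.add(self.name)
        self.visible = True

    def get(self):
        self.refcount += 1

    def delete(self):
        # Like kobject_del(): remove the directory entry now; the object lives on.
        if self.visible:
            self.sysfs_dir.discard(self.name)
            self.visible = False

    def put(self):
        # Like kobject_put(): the entry only disappears with the last reference.
        self.refcount -= 1
        if self.refcount == 0:
            self.delete()
```

If another holder still has a reference, `put()` alone leaves the name in the directory and a later `add()` of the same name fails; calling `delete()` first makes the re-add succeed even though the old object still exists.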
2014-01-22vfio: fix realmode guest phys address converterAlexey Kardashevskiy1-1/+3
This fixes a bug with not resetting page struct pointer which caused bugs in calling code. Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-01-22KVM: PPC: Book3S PR: Take SRCU read lock around RTAS kvm_read_guest() callPaul Mackerras1-2/+7
This does for PR KVM what c9438092cae4 ("KVM: PPC: Book3S HV: Take SRCU read lock around kvm_read_guest() call") did for HV KVM, that is, eliminate a "suspicious rcu_dereference_check() usage!" warning by taking the SRCU lock around the call to kvmppc_rtas_hcall(). It also fixes a return of RESUME_HOST to return EMULATE_FAIL instead, since kvmppc_h_pr() is supposed to return EMULATE_* values. Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-01-22KVM: PPC: Book3S: Load/save FP/VMX/VSX state directly to/from vcpu structPaul Mackerras3-73/+19
Now that we have the vcpu floating-point and vector state stored in the same type of struct as the main kernel uses, we can load that state directly from the vcpu struct instead of having extra copies to/from the thread_struct. Similarly, when the guest state needs to be saved, we can have it saved directly to the vcpu struct by setting the current->thread.fp_save_area and current->thread.vr_save_area pointers. That also means that we don't need to back up and restore userspace's FP/vector state. This all makes the code simpler and faster. Note that it's not necessary to save or modify current->thread.fpexc_mode, since nothing in KVM uses or is affected by its value. Nor is it necessary to touch used_vr or used_vsr. Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-01-22vfio: fix in-kernel and ioctl handlersAlexey Kardashevskiy3-3/+8
This fixes missing read/write TCE bits in VFIO map/unmap ioctls. This fixes the real mode handler to switch to the virtual mode if pte does not have "write" AND "dirty" bits set. This fixes get_user_pages_fast() call in the virtual mode handler to use correct write flag (used to be 0 always). This adds a lock around a kvm_memory_slot struct use. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> (cherry picked from commit 754177ee49cd27c9380e7bb9c0de6f8488197ca3) Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-01-22KVM: PPC: Book3S HV: Send some subcommands of H_SET_MODE to userspacePaul Mackerras1-53/+24
This removes the code that handles the H_SET_MODE_RESOURCE_LE and H_SET_MODE_RESOURCE_ADDR_TRANS_MODE subfunctions of the H_SET_MODE hypercall from the kernel. Instead we now return H_TOO_HARD which causes the hypercall to be sent up to userspace to be handled there. In addition we now also send any other subfunction which we don't recognize to userspace. The reason for doing these two subfunctions in userspace is that they need to modify LPCR across all vcpus of the guest. Modifying LPCR in the kernel like this introduces a race between the kernel's modification and any modification that userspace might be doing on another vcpu. Therefore it's better to let userspace do all the modifications, so it can do any necessary synchronization itself. This also adds code to make sure that the MSR_LE bit in intr_msr (the MSR value we set when synthesizing an interrupt for the guest) is in sync with the ILE bit in the virtual core's LPCR value. This is necessary for implementing the LE subfunction of H_SET_MODE in userspace. Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-01-22KVM: PPC: Use load_fp/vr_state rather than load_up_fpu/altivecPaul Mackerras6-63/+14
The load_up_fpu and load_up_altivec functions were never intended to be called from C, and do things like modifying the MSR value in their callers' stack frames, which are assumed to be interrupt frames. In addition, on 32-bit Book S they require the MMU to be off. This makes KVM use the new load_fp_state() and load_vr_state() functions instead of load_up_fpu/altivec. This means we can remove the assembler glue in book3s_rmhandlers.S, and potentially fixes a bug on Book E, where load_up_fpu was called directly from C. Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-01-22PPC: KVM: fix to compile without VFIOAlexey Kardashevskiy1-1/+1
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> (cherry picked from commit 6a87e5da59bf1d1a4186bf27ad8aa5dc3b03dd63) Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-01-22KVM: PPC: Book3S HV: Don't use kvm_memslots() in real modePaul Mackerras3-4/+16
With HV KVM, some high-frequency hypercalls such as H_ENTER are handled in real mode, and need to access the memslots array for the guest. Accessing the memslots array is safe, because we hold the SRCU read lock for the whole time that a guest vcpu is running. However, the checks that kvm_memslots() does when lockdep is enabled are potentially unsafe in real mode, when only the linear mapping is available. Furthermore, kvm_memslots() can be called from a secondary CPU thread, which is an offline CPU from the point of view of the host kernel, and is not running the task which holds the SRCU read lock. To avoid false positives in the checks in kvm_memslots(), and to avoid possible side effects from doing the checks in real mode, this replaces kvm_memslots() with kvm_memslots_raw() in all the places that execute in real mode. kvm_memslots_raw() is a new function that is like kvm_memslots() but uses rcu_dereference_raw_notrace() instead of kvm_dereference_check(). Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-01-22KVM: PPC: Book3S HV: Only accept host PVR value for guest PVRPaul Mackerras1-1/+3
Since the guest can read the machine's PVR (Processor Version Register) directly and see the real value, we should disallow userspace from setting any value for the guest's PVR other than the real host value. Therefore this makes kvm_arch_vcpu_set_sregs_hv() check the supplied PVR value and return an error if it is different from the host value, which has been put into vcpu->arch.pvr at vcpu creation time. Signed-off-by: Paul Mackerras <paulus@samba.org>
2014-01-22powerpc/powernv: Don't call generic code on offline cpusPaul Mackerras4-15/+32
On PowerNV platforms, when a CPU is offline, we put it into nap mode. It's possible that the CPU wakes up from nap mode while it is still offline due to a stray IPI. A misdirected device interrupt could also potentially cause it to wake up. In that circumstance, we need to clear the interrupt so that the CPU can go back to nap mode. In the past the clearing of the interrupt was accomplished by briefly enabling interrupts and allowing the normal interrupt handling code (do_IRQ() etc.) to handle the interrupt. This has the problem that this code calls irq_enter() and irq_exit(), which call functions such as account_system_vtime() which use RCU internally. Use of RCU is not permitted on offline CPUs and will trigger errors if RCU checking is enabled. To avoid calling into any generic code which might use RCU, we adopt a different method of clearing interrupts on offline CPUs. Since we are on the PowerNV platform, we know that the system interrupt controller is a XICS being driven directly (i.e. not via hcalls) by the kernel. Hence this adds a new icp_native_flush_interrupt() function to the native-mode XICS driver and arranges to call that when an offline CPU is woken from nap. This new function reads the interrupt from the XICS. If it is an IPI, it clears the IPI; if it is a device interrupt, it prints a warning and disables the source. Then it does the end-of-interrupt processing for the interrupt. The other thing that briefly enabling interrupts did was to check and clear the irq_happened flag in this CPU's PACA. Therefore, after flushing the interrupt from the XICS, we also clear all bits except the PACA_IRQ_HARD_DIS (interrupts are hard disabled) bit from the irq_happened flag. The PACA_IRQ_HARD_DIS flag is set by power7_nap() and is left set to indicate that interrupts are hard disabled. This means we then have to ignore that flag in power7_nap(), which is reasonable since it doesn't indicate that any interrupt event needs servicing. 
Signed-off-by: Paul Mackerras <paulus@samba.org>
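The last step described in this commit, clearing every bit of irq_happened except PACA_IRQ_HARD_DIS, is a single mask operation. In the sketch below the flag value 0x01 is an assumption made for illustration, and the function name is hypothetical.

```python
PACA_IRQ_HARD_DIS = 0x01  # assumed bit value, for illustration only


def flush_irq_happened(irq_happened):
    """Keep only the hard-disabled bit; drop all recorded pending-interrupt bits."""
    return irq_happened & PACA_IRQ_HARD_DIS
```

Any other bits that were set are discarded, while the hard-disabled state set on entry to nap is preserved.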
2014-01-22pci: add "fundamental reset" quirkThadeu Lima de Souza Cascardo1-0/+21
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-01-22powernv/cpufreq: Fix crash on hotplug using a hotplug-invariant cpumaskSrivatsa S. Bhat1-2/+8
The policy->cpus mask populated by the cpufreq driver is expected to be hotplug invariant, since the cpufreq core copies this mask as-it-is to policy->related_cpus mask (which shouldn't vary upon hotplug). The cpufreq core code later prunes the offlines cpus from the policy->cpus mask. At the moment, the powerpc cpufreq driver uses topology_thread_cpumask() to populate policy->cpus during .init(), and hence this is NOT hotplug invariant. Due to this, we hit the following bug: 1. Once we offline all threads of a core, say CPUs 8-15, and online CPU 8 back, its related cpus mask shows: $ cat /sys/devices/system/cpu/cpu8/cpufreq/related_cpus 8 [ It should have actually shown 8 9 10 11 12 13 14 15 ] 2. When we try to online the next sibling thread (CPU 9), it tries to do a fresh initialization since it is not listed in the related_cpus mask of CPU 8.(Note that for CPU 9, the cpufreq driver would have populated the related_cpus mask as [ 8 9 ], since those are the 2 online threads in that core so far). During CPU 9 init, it fails in the call to cpufreq_add_dev_symlink() because it tries to initialize the sysfs files for CPU 8 as well (which had already been initialized) while iterating through the policy->cpus. 
As a result, we hit this bug while onlining CPU 9: [ 1019.458183] sysfs: cannot create duplicate filename '/devices/system/cpu/cpu8/cpufreq' [ 1019.458270] ------------[ cut here ]------------ [ 1019.458338] WARNING: at fs/sysfs/dir.c:530 [ 1019.458367] Modules linked in: xt_tcpudp ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables kvm binfmt_misc autofs4 lpfc [ 1019.458543] CPU: 76 PID: 73014 Comm: bash Not tainted 3.10.11-cpufreq-10 #1 [ 1019.458590] task: c000000ff02c3200 ti: c000000fe7604000 task.ti: c000000fe7604000 [ 1019.458645] NIP: c000000000284634 LR: c000000000284630 CTR: c0000000005b5d10 [ 1019.458700] REGS: c000000fe7606fa0 TRAP: 0700 Not tainted (3.10.11-cpufreq-10) [ 1019.458754] MSR: 9000000100029032 <SF,HV,EE,ME,IR,DR,RI> CR: 28222824 XER: 20000000 [ 1019.458883] SOFTE: 1 [ 1019.458903] CFAR: c000000000874d6c [ 1019.458930] GPR00: c000000000284630 c000000fe7607220 c000000000d9ab60 000000000000004a GPR04: 0000000000000000 000000000000005a c000000000c82fb8 c000000004482448 GPR08: c000000000c7ab60 0000000000000000 0000000000000000 0000000000000000 GPR12: 0000000028222822 c00000000fe13000 0000000010142550 c000000000ce8d70 GPR16: 0000000000000001 c000000000f28c68 0000000000000000 c000000003c20030 GPR20: c000000ff6d91800 c000000000ce8fc8 c000000000b45340 c000000000e26858 GPR24: c000000000ce8d70 0000000000000000 0000000000000001 c000000ff6d91a70 GPR28: c000000fef1b2000 c000000fe7607320 c000000fc98087a0 ffffffffffffffef [ 1019.459605] NIP [c000000000284634] .sysfs_add_one+0xe4/0x100 [ 1019.459653] LR [c000000000284630] .sysfs_add_one+0xe0/0x100 [ 1019.459689] PACATMSCRATCH [9000000100009032] [ 1019.459726] Call Trace: [ 1019.459747] [c000000fe7607220] [c000000000284630] .sysfs_add_one+0xe0/0x100 (unreliable) [ 1019.459813] [c000000fe76072b0] [c0000000002854dc] .sysfs_do_create_link_sd+0x10c/0x320 [ 1019.459879] [c000000fe7607370] [c000000000718318] .cpufreq_add_dev_interface+0x2e8/0x410 [ 1019.459943] 
[c000000fe7607710] [c000000000718da0] .cpufreq_add_dev+0x590/0x6d0 [ 1019.460009] [c000000fe7607810] [c000000000899580] .cpufreq_cpu_callback+0x7c/0x94 [ 1019.460073] [c000000fe7607890] [c00000000086f40c] .notifier_call_chain+0x8c/0x100 [ 1019.460138] [c000000fe7607930] [c000000000091450] .cpu_notify+0x40/0xa0 [ 1019.460194] [c000000fe76079b0] [c00000000089696c] ._cpu_up+0x17c/0x1ec [ 1019.460249] [c000000fe7607a70] [c000000000896b40] .cpu_up+0x164/0x194 [ 1019.460304] [c000000fe7607b00] [c000000000746edc] .store_online+0xbc/0xa60 [ 1019.460361] [c000000fe7607bb0] [c0000000004faf64] .dev_attr_store+0x64/0xa0 [ 1019.460417] [c000000fe7607c40] [c000000000282244] .sysfs_write_file+0xf4/0x1d0 [ 1019.460482] [c000000fe7607cf0] [c0000000001f1fa8] .vfs_write+0xe8/0x260 [ 1019.460537] [c000000fe7607d90] [c0000000001f2c44] .SyS_write+0x64/0xe0 [ 1019.460593] [c000000fe7607e30] [c000000000009d54] syscall_exit+0x0/0x98 [ 1019.460647] Instruction dump: [ 1019.460675] 481b0b2d 60000000 e89e0010 7f83e378 38a01000 481b0b19 60000000 7f84e378 [ 1019.460774] 3c62ffd5 38632cf0 485f06dd 60000000 <0fe00000> 7f83e378 4bf5f8a5 60000000 [ 1019.460952] ---[ end trace 600f2280a5b2cd86 ]--- None of this would have occurred if related_cpus had remained unchanged during hotplug, because in that case, CPU 9 would have done a light-weight init, thus avoiding this duplication bug. So fix this by populating policy->cpus in a hotplug invariant manner in the cpufreq driver. Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-22powerpc/powernv: Add power sensor data retrieval from FSPVaidyanathan Srinivasan1-0/+8
Platform will provide power data in watts, hwmon expects in micro-watts. Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-22powerpc/powernv: Fix platform dump interfaceVasant Hegde3-19/+38
This patch is an incremental patch on top of commit af93eec4. It adds support for resending the dump-available notification and updates the README file. It also fixes a few other minor issues. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-22powerpc/book3s: Recover from MC in sapphire on SCOM read via MMIO.Mahesh Salgaonkar8-10/+146
Detect and recover from machine check when inside opal on a special scom load instructions. On specific SCOM read via MMIO we may get a machine check exception with SRR0 pointing inside opal. To recover from MC in this scenario, get a recovery instruction address and return to it from MC. OPAL will export the machine check recoverable ranges through device tree node mcheck-recoverable-ranges under ibm,opal: # hexdump /proc/device-tree/ibm,opal/mcheck-recoverable-ranges 0000000 0000 0000 3000 2804 0000 000c 0000 0000 0000010 3000 2814 0000 0000 3000 27f0 0000 000c 0000020 0000 0000 3000 2814 xxxx xxxx xxxx xxxx 0000030 llll llll yyyy yyyy yyyy yyyy ... ... # where: xxxx xxxx xxxx xxxx = Starting instruction address llll llll = Length of the address range. yyyy yyyy yyyy yyyy = recovery address Each recoverable address range entry is an (start address, len, recovery address), 2 cells each for start and recovery address, 1 cell for len, totalling 5 cells per entry. During kernel boot time, build up the recovery table with the list of recovery ranges from device-tree node which will be used during machine check exception to recover from MMIO SCOM UE. Changes in v2: - As per Ben's comment, added mcheck-recoverable-ranges property under ibm,opal node. - Changed the format of the mcheck-recoverable-ranges list. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
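The five-cell entry layout described above (two cells of start address, one cell of length, two cells of recovery address) can be decoded as follows. This is a userspace sketch, not the kernel's parser: it assumes 32-bit big-endian cells, assumes a range covers `start <= addr < start + len`, and uses hypothetical function names.

```python
import struct

# One entry = 5 cells of 32 bits: start (2 cells), length (1 cell), recovery (2 cells).
ENTRY = struct.Struct(">QIQ")  # big-endian, no padding: 8 + 4 + 8 = 20 bytes


def parse_ranges(prop):
    """Decode the raw property bytes into (start, length, recovery) tuples."""
    if len(prop) % ENTRY.size:
        raise ValueError("property length is not a multiple of 5 cells")
    return [ENTRY.unpack_from(prop, off) for off in range(0, len(prop), ENTRY.size)]


def recovery_address(ranges, addr):
    """Return the recovery address for addr, or None if no range covers it."""
    for start, length, recover in ranges:
        if start <= addr < start + length:  # half-open range is an assumption
            return recover
    return None
```

Fed the two entries visible in the hexdump (read as big-endian), `parse_ranges` yields start addresses 0x30002804 and 0x300027f0, each 0xc bytes long with recovery address 0x30002814.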
2014-01-22powerpc/powernv: hwmon driver for power values, fan rpm and temperatureNeelesh Gupta3-0/+541
This patch adds basic kernel enablement for reading power values, fan speed rpm and temperature values on powernv platforms which will be exported to user space through /sys interface. Signed-off-by: Shivaprasad G Bhat <sbhat@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-22powerpc/powernv: Enable fetching of platform sensor dataNeelesh Gupta4-1/+70
This patch enables fetching of various platform sensor data through OPAL and expects a sensor handle from the driver to pass to OPAL. Signed-off-by: Sahir K <sahirk1@linux.vnet.ibm.com> Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-22powerpc/powernv: Enable reading and updating of system parametersNeelesh Gupta5-1/+308
This patch enables reading and updating of system parameters through OPAL call. Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-22powerpc/powernv: Infrastructure to support OPAL async completionNeelesh Gupta3-2/+215
This patch adds support for notifying the clients of their request completion. Clients request for the token before making OPAL call and then wait for the response. This patch uses messaging infrastructure to pull the data to linux by registering itself for the message type OPAL_MSG_ASYNC_COMP. Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-22powerpc/eeh: Escalate error on non-existing PEGavin Shan1-10/+21
Sometimes, especially in the scenario of loading another kernel with kdump, we get an EEH error on a non-existing PE. That means the PEEV / PEST in the corresponding PHB would be messy and we can't handle that case. The patch escalates the error to a fenced PHB so that the PHB can be reset in order to recover from the errors on non-existing PEs. Reported-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Tested-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-22powerpc/eeh: Handle multiple EEH errorsGavin Shan3-87/+112
For one PCI-error-relevant OPAL event, we may have multiple EEH errors, for example multiple frozen PEs detected on different PHBs. Unfortunately, we didn't cover that case. The patch enumerates the return value from eeh_ops::next_error() and changes eeh_handle_special_event() and eeh_ops::next_error() to handle all existing EEH errors. As Ben pointed out, we needn't use list_for_each_entry_safe() since we are not deleting any PHB from the hose_list, and the EEH serialized lock should be held while purging EEH events. The patch covers those suggestions as well. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-22powerpc: Fix races with irq_workBenjamin Herrenschmidt1-0/+12
If we set irq_work on a processor and immediately afterward, before the irq work has a chance to be processed, we change the decrementer value, we can seriously delay the handling of that irq_work. Fix it by checking in a few places for pending irq work, first before changing the decrementer in decrementer_set_next_event() and after changing it in the same function and in timer_interrupt(). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
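The idea of the fix can be sketched as a rule for choosing the value programmed into the decrementer: when irq work is pending, the requested timeout is overridden so the interrupt fires almost immediately. The function name and the choice of 1 as the short value are assumptions for illustration, not a transcription of the kernel code.

```python
def next_decrementer_value(requested, irq_work_pending):
    """Pick the decrementer value, letting pending irq work pre-empt a long timeout."""
    if irq_work_pending:
        # Assumed short value: fire as soon as possible so the work is not delayed.
        return 1
    return requested
```

Without the override, a large `requested` value would postpone the pending irq work until that timer expired.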
2014-01-22Move processing of MCE queued event out from syscall exit path.Mahesh Salgaonkar3-9/+10
Hugh Dickins reported an issue that b5ff4211a829 "powerpc/book3s: Queue up and process delayed MCE events" breaks the PowerMac G5 boot. This patch fixes it by moving the MCE event processing away from syscall exit, which was the wrong place to do it in the first place, and using the irq work framework to delay processing of the MCE event. Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-22powerpc/powernv: Increase candidate fw image sizeVasant Hegde1-2/+2
At present we assume candidate image is <= 256MB. But in P8, candidate image size can go up to 750MB. Hence increasing candidate image max size to 1GB. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-22powerpc: Add debug checks to catch invalid cpu-to-node mappingsSrivatsa S. Bhat1-2/+24
There have been some weird bugs in the past where the kernel tried to associate threads of the same core to different NUMA nodes, and things went haywire after that point (as expected). But unfortunately, root-causing such issues has been quite challenging, due to the lack of appropriate debug checks in the kernel. These bugs usually lead to some odd soft-lockups in the scheduler's build-sched-domain code in the CPU hotplug path, which makes it very hard to trace them back to the incorrect cpu-to-node mappings. So add appropriate debug checks to catch such invalid cpu-to-node mappings as early as possible. Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-22powerpc: Fix the setup of CPU-to-Node mappings during CPU onlineSrivatsa S. Bhat2-4/+76
On POWER platforms, the hypervisor can notify the guest kernel about dynamic changes in the cpu-numa associativity (VPHN topology update). Hence the cpu-to-node mappings that we got from the firmware during boot, may no longer be valid after such updates. This is handled using the arch_update_cpu_topology() hook in the scheduler, and the sched-domains are rebuilt according to the new mappings. But unfortunately, at the moment, CPU hotplug ignores these updated mappings and instead queries the firmware for the cpu-to-numa relationships and uses them during CPU online. So the kernel can end up assigning wrong NUMA nodes to CPUs during subsequent CPU hotplug online operations (after booting). Further, a particularly problematic scenario can result from this bug: On POWER platforms, the SMT mode can be switched between 1, 2, 4 (and even 8) threads per core. The switch to Single-Threaded (ST) mode is performed by offlining all except the first CPU thread in each core. Switching back to SMT mode involves onlining those other threads back, in each core. Now consider this scenario: 1. During boot, the kernel gets the cpu-to-node mappings from the firmware and assigns the CPUs to NUMA nodes appropriately, during CPU online. 2. Later on, the hypervisor updates the cpu-to-node mappings dynamically and communicates this update to the kernel. The kernel in turn updates its cpu-to-node associations and rebuilds its sched domains. Everything is fine so far. 3. Now, the user switches the machine from SMT to ST mode (say, by running ppc64_cpu --smt=1). This involves offlining all except 1 thread in each core. 4. The user then tries to switch back from ST to SMT mode (say, by running ppc64_cpu --smt=4), and this involves onlining those threads back. Since CPU hotplug ignores the new mappings, it queries the firmware and tries to associate the newly onlined sibling threads to the old NUMA nodes. 
This results in sibling threads within the same core getting associated with different NUMA nodes, which is incorrect. The scheduler's build-sched-domains code gets thoroughly confused with this and enters an infinite loop and causes soft-lockups, as explained in detail in commit 3be7db6ab (powerpc: VPHN topology change updates all siblings). So to fix this, use the numa_cpu_lookup_table to remember the updated cpu-to-node mappings, and use them during CPU hotplug online operations. Further, we also need to ensure that all threads in a core are assigned to a common NUMA node, irrespective of whether all those threads were online during the topology update. To achieve this, we take care not to use cpu_sibling_mask() since it is not hotplug invariant. Instead, we use cpu_first_sibling_thread() and set up the mappings manually using the 'threads_per_core' value for that particular platform. This helps us ensure that we don't hit this bug with any combination of CPU hotplug and SMT mode switching. Cc: stable@vger.kernel.org Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
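The hotplug-invariant update described above can be sketched as follows: derive the first thread of the core from the CPU number and assign the node to every thread slot in that core, whether or not the thread is currently online. The function names and the dict used as the lookup table are hypothetical, and the bit-mask trick assumes threads_per_core is a power of two.

```python
def first_thread_of_core(cpu, threads_per_core):
    """First CPU number in cpu's core (assumes threads_per_core is a power of two)."""
    return cpu & ~(threads_per_core - 1)


def update_node_for_core(lookup_table, cpu, node, threads_per_core):
    """Record node for every thread of cpu's core, online or not."""
    first = first_thread_of_core(cpu, threads_per_core)
    for sibling in range(first, first + threads_per_core):
        lookup_table[sibling] = node
```

Because the loop does not consult any online mask, a thread that is offline during the topology update still picks up the new node when it is onlined later.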
2014-01-22powerpc/iommu: Don't detach device without IOMMU groupGavin Shan1-0/+11
Some devices, for example PCI root port, don't have IOMMU table and group. We needn't detach them from their IOMMU group. Otherwise, it potentially incurs kernel crash because of referring NULL IOMMU group as following backtrace indicates: .iommu_group_remove_device+0x74/0x1b0 .iommu_bus_notifier+0x94/0xb4 .notifier_call_chain+0x78/0xe8 .__blocking_notifier_call_chain+0x7c/0xbc .blocking_notifier_call_chain+0x38/0x48 .device_del+0x50/0x234 .pci_remove_bus_device+0x88/0x138 .pci_stop_and_remove_bus_device+0x2c/0x40 .pcibios_remove_pci_devices+0xcc/0xfc .pcibios_remove_pci_devices+0x3c/0xfc Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-22powerpc/eeh: Hotplug improvementGavin Shan3-4/+24
When an EEH error hits a specific PCI device before its driver is loaded, we apply hotplug to recover from the error. During the plug time, the PCI device is probed and its driver is loaded. Then we wrongly call the error handlers if the driver supports EEH explicitly. The patch fixes this by introducing the flag EEH_DEV_NO_HANDLER and setting it before we remove the PCI device. In turn, we avoid wrongly calling the error handlers of the PCI device after its driver is loaded. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-22powerpc/eeh: Call opal_pci_reinit() on powernv for restoring config spaceGavin Shan2-3/+28
The patch implements the EEH operation backend restore_config() for PowerNV platform. That relies on OPAL API opal_pci_reinit() where we reinitialize the error reporting properly after PE or PHB reset. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-22powerpc/eeh: Add restore_config operationGavin Shan4-2/+9
After a reset on a specific PE or PHB, we never configure AER correctly on the PowerNV platform. We needn't care about it on the pSeries platform. The patch introduces an additional EEH operation, eeh_ops::restore_config(), so that we have a chance to configure AER correctly on the PowerNV platform. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-22powerpc/powernv: Remove unnecessary assignmentGavin Shan1-2/+1
We don't have IO ports on PHB3 and the assignment of variable "iomap_off" on PHB3 is meaningless. The patch just removes the unnecessary assignment to the variable. The code change should have been part of commit c35d2a8c ("powerpc/powernv: Needn't IO segment map for PHB3"). Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-22powernv/eeh: Add buffer for P7IOC hub error dataBrian W Hart2-14/+5
Prevent ioda_eeh_hub_diag() from clobbering itself when called by supplying a per-PHB buffer for P7IOC hub diagnostic data. Take care to inform OPAL of the correct size for the buffer. [Small style change to the use of sizeof -- BenH] Signed-off-by: Brian W Hart <hartb@linux.vnet.ibm.com> Acked-by: Gavin Shan <shangw@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-01-22powerpc: Fix bad stack check in exception entryMichael Neuling1-1/+1
In EXCEPTION_PROLOG_COMMON() we check to see if the stack pointer (r1) is valid when coming from the kernel. If it's not valid, we die but with a nice oops message. Currently we allocate a stack frame (subtract INT_FRAME_SIZE) before we check to see if the stack pointer is negative. Unfortunately, this won't detect a bad stack where r1 is less than INT_FRAME_SIZE. This patch fixes the check to compare the modified r1 with -INT_FRAME_SIZE. With this, bad kernel stack pointers (including NULL pointers) are correctly detected again. Kudos to Paulus for finding this. Signed-off-by: Michael Neuling <mikey@neuling.org> cc: stable@vger.kernel.org Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
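The difference between the two checks can be modelled in plain C. This is a sketch, not the assembly from the patch: the `INT_FRAME_SIZE` value below is illustrative, the function names are hypothetical, and the code assumes the usual two's-complement conversion from `uint64_t` to `int64_t`. Valid kernel stack pointers have the top bit set, so they are negative when viewed as signed values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INT_FRAME_SIZE 800  /* illustrative; the real value is arch-defined */

/* Model of "allocate a stack frame": r1 after subtracting the frame size. */
static int64_t alloc_frame(uint64_t r1)
{
    return (int64_t)(r1 - INT_FRAME_SIZE);
}

/* Old check: the stack is flagged bad only if the modified r1 is >= 0. */
static bool old_check_flags_bad(uint64_t r1)
{
    return alloc_frame(r1) >= 0;
}

/* Fixed check: compare the modified r1 with -INT_FRAME_SIZE instead. */
static bool new_check_flags_bad(uint64_t r1)
{
    return alloc_frame(r1) >= -(int64_t)INT_FRAME_SIZE;
}

/* Usage example: a NULL stack pointer slips past the old check only. */
static void demo(void)
{
    assert(!old_check_flags_bad(0));   /* missed: 0 - 800 is negative */
    assert(new_check_flags_bad(0));    /* caught: -800 >= -800 */
}
```

Any original r1 in the range 0 to INT_FRAME_SIZE - 1 becomes negative after the subtraction, which is why the old comparison against zero treated it as a valid kernel address.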
2014-01-22spec: powerkvm build7 update 1Eli Qiao1-1/+5
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
2014-01-16spec: Do not use SOURCE file in kernel.specEli Qiao1-39/+12
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
2014-01-16Move these source file to git repo:Eli Qiao9-0/+561
Makefile.config Makefile.release cpupower.config cpupower.service mcp8_configs.tar.gz mod-extra.list mod-extra.sh mod-sign.sh x509.genkey Currently, when building the kernel on the koji server, we need to build an srpm first. The kernel.spec file (which comes from mcp) includes some source files, which makes it really hard to build the kernel package on the koji server: 1) Get kernel.spec from git repo. 2) Get SOURCES files from mcp cvs. 3) Build a kernel package from kernel.spec + SOURCES files: make srpm to create a kernel.srpm (contains kernel.spec + SOURCES). 4) Run koji build --scratch mcp8-rawhide kernel.srpm to build a kernel package. By copying all these source files to the git repo, we can build the kernel easily with the following steps: 1) Get kernel.spec from git repo. 2) Build a kernel package from kernel.spec (only contains the spec file): rpmbuild -bs kernel.spec 3) Run koji build --scratch mcp8-rawhide kernel.srpm to build a kernel package. Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
2014-01-13This commit breaks PR KVM.Eli Qiao2-15/+22
Requested by Alexey Kardashevskiy. Revert "PPC: KVM: move TCE cache alloc/free from KVM-common to KVM-HV" This reverts commit af4c301bd5b700f62597bcdf8e6f66bd2fd65db9.
2014-01-13PPC: KVM: fix VCPU run for HV KVMAlexey Kardashevskiy1-1/+1
When a write to MMIO happens and there is an ioeventfd for it that is handled successfully, ioeventfd_write() returns 0 (success) and kvmppc_handle_store() returns EMULATE_DONE. Then kvmppc_emulate_mmio() converts EMULATE_DONE to RESUME_GUEST_NV, and this breaks out of the loop. This adds handling of RESUME_GUEST_NV in kvmppc_vcpu_run_hv(). Cc: Michael S. Tsirkin <mst@redhat.com> Suggested-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
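The loop behaviour described above can be illustrated with a small user-space model. It is one plausible shape of the change, not the patch itself: the numeric values of the `RESUME_*` codes, the `run_vcpu()` helper and the two predicates are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative values only; the kernel defines its own RESUME_* codes. */
enum { RESUME_GUEST = 0, RESUME_GUEST_NV = 1, RESUME_HOST = 2 };

/* Before: only RESUME_GUEST keeps the vcpu in the run loop. */
static bool reenter_old(int r)
{
    return r == RESUME_GUEST;
}

/* After: RESUME_GUEST_NV is also treated as "go back into the guest". */
static bool reenter_new(int r)
{
    return r == RESUME_GUEST || r == RESUME_GUEST_NV;
}

/*
 * Model of the run loop: 'exits' holds the result of each guest exit.
 * Returns how many times the guest was entered before the loop stopped.
 */
static size_t run_vcpu(const int *exits, size_t n, bool (*reenter)(int))
{
    size_t entries = 0;
    int r;

    assert(exits != NULL && n > 0);
    do {
        r = exits[entries++];
    } while (entries < n && reenter(r));
    return entries;
}
```

With an exit sequence that contains a successfully handled ioeventfd write (modelled as RESUME_GUEST_NV) in the middle, the old predicate leaves the loop at that point, while the new one keeps running until a result that really requires the host.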
2014-01-10spec : bump kernel version to 3.10.23 for pbuild7Eli Qiao1-5/+18
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
2014-01-08cxgb4: allow large buffer size to have page sizeThadeu Lima de Souza Cascardo1-1/+1
Since commit 52367a763d8046190754ab43743e42638564a2d1 ("cxgb4/cxgb4vf: Code cleanup to enable T4 Configuration File support"), we have failures like this during cxgb4 probe: cxgb4 0000:01:00.4: bad SGE FL page buffer sizes [65536, 65536] cxgb4: probe of 0000:01:00.4 failed with error -22 This happens whenever software parameters are used, without a configuration file. That happens when the hardware was already initialized (after kexec, or after csiostor is loaded). It happens that these values are acceptable, rendering fl_pg_order equal to 0, which is the case of a hard init when the page size is equal or larger than 65536. Accepting fl_large_pg equal to fl_small_pg solves the issue, and shouldn't cause any trouble besides a possible performance reduction when smaller pages are used. And that can be fixed by a configuration file. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
2014-01-08bnx2x: Change to D3hot only on removalYuval Mintz2-10/+24
This changes the PCI power management scheme of the bnx2x driver to be similar to those of most network drivers - the driver now changes the power state into D3hot whenever the driver is removed, instead of whenever an interface is unloaded. This change enables the driver to access its eeprom via ethtool callbacks even when interfaces are unloaded (such access requires the function to be in D0active). Signed-off-by: Yuval Mintz <yuvalmin@broadcom.com> Signed-off-by: Ariel Elior <ariele@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-06vfio: use VFIO KVM device for real/virtual mode TCE hypercallsAlexey Kardashevskiy5-55/+160
The upstream kernel got a VFIO KVM device which we support on PPC64 to associate LIOBNs with IOMMU groups. This moves the existing real/virtual mode handlers to newer codebase. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-01-06PPC: KVM: move TCE cache alloc/free from KVM-common to KVM-HVAlexey Kardashevskiy2-22/+15
The idea of the tce_tmp_hpas cache is to pass partially processed TCEs from real mode to virtual mode via H_TOO_HARD mechanism. Since TCE hypercalls are never called in real mode under PR KVM, move them to HV KVM. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-01-06powerpc: fix compilation without CONFIG_IOMMU_APIAlexey Kardashevskiy1-0/+17
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-01-06KVM: PPC: vfio kvm device: support spapr tceAlexey Kardashevskiy6-3/+164
In addition to the external VFIO user API, a VFIO KVM device has been introduced recently. sPAPR TCE IOMMU is para-virtualized and the guest does map/unmap via hypercalls which take a logical bus id (LIOBN) as a target IOMMU identifier. LIOBNs are made up and linked to IOMMU groups by the user space. In order to accelerate IOMMU operations in the KVM, we need to tell KVM the information about LIOBN-to-group mapping. For that, a new KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE_LIOBN parameter is added. It accepts a pair of a VFIO group fd and LIOBN. This also adds a new kvm_vfio_find_group_by_liobn() function which receives kvm struct, LIOBN and a callback. As it increases the IOMMU group use counter, the KVM is required to pass a callback which is called when the VFIO group is about to be removed from VFIO-KVM tracking, so the KVM is able to call iommu_group_put() to release the IOMMU group. The KVM uses kvm_vfio_find_group_by_liobn() once per KVM run and caches the result in kvm_arch. iommu_group_put() for all groups will be called when KVM finishes (in the SPAPR TCE in KVM enablement patch). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-01-06kvm: Create non-coherent DMA registerationAlex Williamson6-4/+92
We currently use some ad-hoc arch variables tied to legacy KVM device assignment to manage emulation of instructions that depend on whether non-coherent DMA is present. Create an interface for this, adapting legacy KVM device assignment and adding VFIO via the KVM-VFIO device. For now we assume that non-coherent DMA is possible any time we have a VFIO group. Eventually an interface can be developed as part of the VFIO external user interface to query the coherency of a group. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> (cherry picked from commit e0f0bbc527f6e9c0261f1d16b2a0b47612b7f235) Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-01-06kvm/x86: Convert iommu_flags to iommu_noncoherentAlex Williamson6-15/+12
Default to operating in coherent mode. This simplifies the logic when we switch to a model of registering and unregistering noncoherent I/O with KVM. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> (cherry picked from commit d96eb2c6f480769bff32054e78b964860dae4d56) Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-01-06kvm: Add VFIO deviceAlexey Kardashevskiy8-1/+257
So far we've succeeded at making KVM and VFIO mostly unaware of each other, but areas are cropping up where a connection beyond eventfds and irqfds needs to be made. This patch introduces a KVM-VFIO device that is meant to be a gateway for such interaction. The user creates the device and can add and remove VFIO groups to it via file descriptors. When a group is added, KVM verifies the group is valid and gets a reference to it via the VFIO external user interface. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> (cherry picked from commit ec53500fae421e07c5d035918ca454a429732ef4) Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-01-06vfio-pci: PCI hot reset interfaceAlex Williamson2-1/+323
The current VFIO_DEVICE_RESET interface only maps to PCI use cases where we can isolate the reset to the individual PCI function. This means the device must support FLR (PCIe or AF), PM reset on D3hot->D0 transition, device specific reset, or be a singleton device on a bus for a secondary bus reset. FLR does not have widespread support, PM reset is not very reliable, and bus topology is dictated by the system and device design. We need to provide a means for a user to induce a bus reset in cases where the existing mechanisms are not available or not reliable. This device specific extension to VFIO provides the user with this ability. Two new ioctls are introduced: - VFIO_DEVICE_PCI_GET_HOT_RESET_INFO - VFIO_DEVICE_PCI_HOT_RESET The first provides the user with information about the extent of devices affected by a hot reset. This is essentially a list of devices and the IOMMU groups they belong to. The user may then initiate a hot reset by calling the second ioctl. We must be careful that the user has ownership of all the affected devices found via the first ioctl, so the second ioctl takes a list of file descriptors for the VFIO groups affected by the reset. Each group must have IOMMU protection established for the ioctl to succeed. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-01-06vfio-pci: Test for extended config spaceAlex Williamson1-3/+8
Having PCIe/PCI-X capability isn't enough to assume that there are extended capabilities. Both specs define that the first capability header is all zero if there are no extended capabilities. Testing for this avoids an erroneous message about hiding capability 0x0 at offset 0x100. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
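The test described above can be sketched against a raw copy of config space. This is a user-space illustration, not the vfio-pci code: `has_extended_caps()` and the buffer-based interface are hypothetical, and the only facts taken from the message are that extended capabilities start at offset 0x100 and that an all-zero first header means there are none.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EXT_CAP_START 0x100u  /* offset of the first extended capability header */

/*
 * Returns true if the config-space image 'cfg' (cfg_len bytes, little-endian
 * as on the bus) carries at least one extended capability.
 */
static bool has_extended_caps(const uint8_t *cfg, size_t cfg_len)
{
    assert(cfg != NULL);

    /* No extended config space at all: nothing to look at. */
    if (cfg_len < EXT_CAP_START + 4)
        return false;

    uint32_t hdr = (uint32_t)cfg[EXT_CAP_START] |
                   ((uint32_t)cfg[EXT_CAP_START + 1] << 8) |
                   ((uint32_t)cfg[EXT_CAP_START + 2] << 16) |
                   ((uint32_t)cfg[EXT_CAP_START + 3] << 24);

    /* An all-zero first header means "no extended capabilities". */
    return hdr != 0;
}
```

Checking the header before walking the list is what avoids treating "capability 0x0 at offset 0x100" as a real entry.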
2014-01-06vfio-pci: Use fdget() rather than eventfd_fget()Alex Williamson1-19/+16
eventfd_fget() tests to see whether the file is an eventfd file, which we then immediately pass to eventfd_ctx_fileget(), which again tests whether the file is an eventfd file. Simplify slightly by using fdget() so that we only test that we're looking at an eventfd once. fget() could also be used, but fdget() makes use of fget_light() for another slight optimization. Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
2014-01-06PCI: Add pci_probe_reset_slot() and pci_probe_reset_bus()Alex Williamson2-0/+26
Users of pci_reset_bus() and pci_reset_slot() need a way to probe whether the bus or slot supports reset. Add trivial helper functions and export them as vfio-pci will make use of these. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> (cherry picked from commit 9a3d2b9beefd5b07c1d8f70ded01b88f203ee304) Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-01-06PCI: Remove aer_do_secondary_bus_reset()Alex Williamson3-38/+4
One PCI bus reset function to rule them all. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> (cherry picked from commit 1b95ce8fc9c12fdb60047f2f9950f29e76e7c66d) Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-01-06PCI: Tune secondary bus reset timingAlex Williamson1-2/+14
The PCI spec indicates that with stable power, reset needs to be asserted for a minimum of 1ms (Trst). We should be able to assume stable power for a Hot Reset, but we add another millisecond as a fudge factor to make sure the reset is seen on the bus for at least a full 1ms. After reset is de-asserted we must wait for devices to complete initialization. The specs refer to this as "recovery time" (Trhfa). For PCI this is 2^25 clock cycles or 2^26 for PCI-X. For minimum bus speeds, both of those come to 1s. PCIe "softens" this requirement with the Configuration Request Retry Status (CRS) completion status. Theoretically we could use CRS to shorten the wait time. We don't make use of that here, using a fixed 1s delay to allow devices to re-initialize. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> (cherry picked from commit de0c548c33429cc78fd47a3c190c6d00b0e4e441) Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
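The "both of those come to 1s" arithmetic can be checked with a few lines of C. The clock rates used in the usage note (33 MHz for PCI, 66 MHz for PCI-X) are assumptions made for this sketch; the message itself only says "minimum bus speeds". `recovery_seconds()` is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Time in seconds needed for 2^log2_cycles bus clock cycles
 * at the given clock frequency in Hz.
 */
static double recovery_seconds(unsigned log2_cycles, double clock_hz)
{
    assert(log2_cycles < 64 && clock_hz > 0.0);
    return (double)(UINT64_C(1) << log2_cycles) / clock_hz;
}
```

Under those assumed clocks, recovery_seconds(25, 33e6) and recovery_seconds(26, 66e6) evaluate to the same figure, slightly above one second, which is consistent with the fixed 1s delay chosen in the patch.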
2014-01-06PCI: Wake-up devices before saving config space for resetAlex Williamson1-0/+7
Devices come out of reset in D0. Restoring a device to a different post-reset state takes more smarts than our simple config space restore, which can leave devices in an inconsistent state. For example, if a device is reset in D3, but the restore doesn't successfully return the device to D3, then the actual state of the device and dev->current_state are contradictory. Put everything in D0 going into the reset, then we don't need to do anything special on the way out. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> (cherry picked from commit a6cbaadea0af9b4aa6eee2882f2aa761ab91a4f8) Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
2014-01-06PCI: Add pci_reset_slot() and pci_reset_bus()Alex Williamson2-0/+211
Sometimes pci_reset_function() is not sufficient. We have cases where devices do not support any kind of reset, but there might be multiple functions on the bus preventing pci_reset_function() from doing a secondary bus reset. We also have cases where a device will advertise that it supports a PM reset, but really does nothing on D3hot->D0 (graphics cards are notorious for this). These devices often also have more than one function, so even blacklisting PM reset for them wouldn't allow a secondary bus reset through pci_reset_function(). If a driver supports multiple devices it should have the ability to induce a bus reset when it needs to. This patch provides that ability through pci_reset_slot() and pci_reset_bus(). It's the caller's responsibility when using these interfaces to understand that all of the devices in or below the slot (or on or below the bus) will be reset and therefore should be under control of the caller. PCI state of all the affected devices is saved and restored around these resets, but internal state of all of the affected devices is reset (which should be the intention). Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> (cherry picked from commit 090a3c5322e900f468b3205b76d0837003ad57b2) Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>