|
commit edfbbf388f293d70bf4b7c0bc38774d05e6f711a upstream.
A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10
by commit a31ad380bed817aa25f8830ad23e1a0480fef797. The changes made to
aio_read_events_ring() failed to correctly limit the index into
ctx->ring_pages[], allowing an attacker to cause the subsequent kmap() of
an arbitrary page with a copy_to_user() to copy the contents into userspace.
This vulnerability has been assigned CVE-2014-0206. Thanks to Mateusz and
Petr for disclosing this issue.
This patch applies to v3.12+. A separate backport is needed for 3.10/3.11.
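The missing bound can be illustrated with a small userspace sketch (the constant and names below are illustrative stand-ins, not the kernel's actual values):

```c
/* Sketch of the out-of-bounds index: aio_read_events_ring() derives a
 * page index for ctx->ring_pages[] from the ring head. */
#define AIO_EVENTS_PER_PAGE 128u /* illustrative, not the kernel's value */

/* Without the modulo, a head beyond nr_events selects a page past the
 * end of ring_pages[], whose contents kmap()+copy_to_user() would then
 * leak to userspace. The clamp is the essence of the fix. */
static unsigned event_page_index(unsigned head, unsigned nr_events)
{
    head %= nr_events;                 /* the missing bound */
    return head / AIO_EVENTS_PER_PAGE; /* now always a valid page */
}
```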
[jmoyer@redhat.com: backported to 3.10]
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d36db46c2cba973557eb6138d22210c4e0cf17d6)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
commit f8567a3845ac05bb28f3c1b478ef752762bd39ef upstream.
The aio cleanups and optimizations by kmo that were merged into the 3.10
tree added a regression for userspace event reaping. Specifically, the
reference counts are not decremented if the event is reaped in userspace,
leading to the application being unable to submit further aio requests.
This patch applies to 3.12+. A separate backport is required for 3.10/3.11.
This issue was uncovered as part of CVE-2014-0206.
[jmoyer@redhat.com: backported to 3.10]
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 6745cb91b5ec93a1b34221279863926fba43d0d7)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
commit 23adbe12ef7d3d4195e80800ab36b37bee28cd03 upstream.
The kernel has no concept of capabilities with respect to inodes; inodes
exist independently of namespaces. For example, inode_capable(inode,
CAP_LINUX_IMMUTABLE) would be nonsense.
This patch changes inode_capable to check for uid and gid mappings and
renames it to capable_wrt_inode_uidgid, which should make it more
obvious what it does.
Fixes CVE-2014-4014.
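A simplified userspace model of the renamed check (the structs here are toy stand-ins; the kernel uses kuid_has_mapping(), kgid_has_mapping() and ns_capable()):

```c
#include <stdbool.h>

/* Toy model of a user namespace's uid/gid mappings (illustrative) */
struct userns_model {
    unsigned uid_base, uid_count;
    unsigned gid_base, gid_count;
};

static bool uid_mapped(const struct userns_model *ns, unsigned uid)
{
    return uid >= ns->uid_base && uid - ns->uid_base < ns->uid_count;
}

static bool gid_mapped(const struct userns_model *ns, unsigned gid)
{
    return gid >= ns->gid_base && gid - ns->gid_base < ns->gid_count;
}

/* capable_wrt_inode_uidgid: a namespace capability over an inode only
 * counts when the inode's uid AND gid both map into the namespace. */
static bool capable_wrt_inode_uidgid_model(const struct userns_model *ns,
                                           unsigned i_uid, unsigned i_gid,
                                           bool ns_has_cap)
{
    return uid_mapped(ns, i_uid) && gid_mapped(ns, i_gid) && ns_has_cap;
}
```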
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 4f80c6c1825a91cecf3b3bd19c824e768d98fe48)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
in futex_requeue(..., requeue_pi=1)
commit e9c243a5a6de0be8e584c604d353412584b592f8 upstream.
If uaddr == uaddr2, then we have broken the rule of only requeueing from
a non-pi futex to a pi futex with this call. If we attempt this, then
dangling pointers may be left for rt_waiter resulting in an exploitable
condition.
This change brings futex_requeue() in line with futex_wait_requeue_pi()
which performs the same check as per commit 6f7b0a2a5c0f ("futex: Forbid
uaddr == uaddr2 in futex_wait_requeue_pi()")
[ tglx: Compare the resulting keys as well, as uaddrs might be
different depending on the mapping ]
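The added guard can be sketched in userspace as follows (struct futex_key and match_futex() are simplified stand-ins for the kernel's):

```c
#include <errno.h>

/* Simplified stand-in for the kernel's futex_key */
struct futex_key { unsigned long word; void *ptr; };

static int match_futex(const struct futex_key *k1, const struct futex_key *k2)
{
    return k1->word == k2->word && k1->ptr == k2->ptr;
}

/* Sketch of the check added to futex_requeue() for requeue_pi=1:
 * reject the same uaddr outright, and also reject distinct uaddrs
 * that resolve to the same key (tglx's addition for aliased mappings). */
static int requeue_pi_sanity_check(const void *uaddr1, const void *uaddr2,
                                   const struct futex_key *key1,
                                   const struct futex_key *key2)
{
    if (uaddr1 == uaddr2)
        return -EINVAL;
    if (match_futex(key1, key2))
        return -EINVAL;
    return 0;
}
```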
Fixes CVE-2014-3153.
Reported-by: Pinkie Pie
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b58623fb64ff0454ec20bce7a02275a20c23086d)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
commit e6a623460e5fc960ac3ee9f946d3106233fd28d8 upstream.
This fixes CVE-2014-1739.
Signed-off-by: Salva Peiró <speiro@ai2.upv.es>
Acked-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 4e32a7c66fae40bde0fbff8cbc893eabe8575135)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
vhost LE fixes
|
|
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from commit edc39cd39f5e2e659d598159ff6185752badcf73)
Signed-off-by: Scott Garfinkle <seg@us.ibm.com>
|
|
The virtqueue structure shares a few attributes with the guest OS
which need to be byteswapped when the endian order of the host is
different.
This patch uses the vq->byteswap attribute to decide whether or not
to byteswap data being accessed in the guest memory.
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from commit a9a8f7a5697686356fc0c0728f59a3379cf0a212)
Signed-off-by: Scott Garfinkle <seg@us.ibm.com>
|
|
commit a9a8f7a56976 "vhost: Byteswap virtqueue attributes"
missed a few byteswaps in vhost_add_used(). This patch adds
the vq_put_user() calls required when accessing data in
the guest memory.
BZ: 108753
Branch: powerkvm-v2.1.1
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from commit e7fb63f8f2444a8f687b53311d8e83267fcaa143)
Signed-off-by: Scott Garfinkle <seg@us.ibm.com>
|
|
This patch does not fix any known issues in the previous vhost patchset.
Nevertheless, the byteswap attribute needs to be re-initialized like
all other virtqueue attributes.
BZ: 108753
Branch: powerkvm-v2.1.1
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from commit af892adc789c20a9b8c42195cee2d6e2861ad037)
Signed-off-by: Scott Garfinkle <seg@us.ibm.com>
|
|
The VHOST_VRING_F_BYTESWAP flag is used by the host to byteswap
data of the vring when the guest and the host have a different
endian order.
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from commit 28936b89d8e5faa76ef424b2ec356c8ee67c884e)
Signed-off-by: Scott E. Garfinkle <seg@us.ibm.com>
|
|
This patch adds a few helper routines around get_user and put_user
to ease byteswapping.
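The helper idea can be sketched in userspace roughly like this (names are illustrative; the kernel versions wrap get_user()/put_user() and key off the vq->byteswap attribute):

```c
#include <stdint.h>

/* Swap only when the host/guest endian orders differ (flag-driven) */
static uint16_t bswap16_if(int byteswap, uint16_t v)
{
    return byteswap ? (uint16_t)((v >> 8) | (v << 8)) : v;
}

/* vq_get_user-style read of a 16-bit field shared with the guest */
static uint16_t vq_get16(int byteswap, const uint16_t *guest_field)
{
    return bswap16_if(byteswap, *guest_field);
}

/* vq_put_user-style write of a 16-bit field shared with the guest */
static void vq_put16(int byteswap, uint16_t *guest_field, uint16_t v)
{
    *guest_field = bswap16_if(byteswap, v);
}
```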
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from commit 727174f8bb2d1be197cb94e5c0341fa640952a79)
Signed-off-by: Scott E. Garfinkle <seg@us.ibm.com>
|
|
The ConnectX device selftest speed test shouldn't fail on 1G and 40G link
speeds.
BZ: 113608
Cherry-pick of 313c2d375b1c9b648d9d4b96ec1b8185ac6a78c5
Signed-off-by: Carol Soto <clsoto@linux.vnet.ibm.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Unfortunately, the LPCR got defined as a 32-bit register in the
one_reg interface. This is unfortunate because KVM allows userspace
to control the DPFD (default prefetch depth) field, which is in the
upper 32 bits. The result is that DPFD always gets set to 0, which
reduces performance in the guest.
We can't just change KVM_REG_PPC_LPCR to be a 64-bit register ID,
since that would break existing userspace binaries. Instead we define
a new KVM_REG_PPC_LPCR_64 id which is 64-bit. Userspace can still use
the old KVM_REG_PPC_LPCR id, but we now only modify those fields in
the bottom 32 bits that userspace can modify (ILE, TC and AIL).
If userspace uses the new KVM_REG_PPC_LPCR_64 id, it can modify DPFD
as well.
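The masking scheme can be sketched as follows (a userspace model; the mask used in the test is illustrative, not the real ILE, TC and AIL bit positions):

```c
#include <stdint.h>

/* Old 32-bit KVM_REG_PPC_LPCR id: only touch the whitelisted
 * low-order fields, leaving DPFD (in the upper 32 bits) alone. */
static uint64_t lpcr_set_32(uint64_t lpcr, uint32_t val, uint64_t user_mask)
{
    return (lpcr & ~user_mask) | ((uint64_t)val & user_mask);
}

/* New 64-bit KVM_REG_PPC_LPCR_64 id: the whole register is writable,
 * so userspace can set DPFD as well. */
static uint64_t lpcr_set_64(uint64_t lpcr, uint64_t val)
{
    (void)lpcr;
    return val;
}
```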
BZ: 111438
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
pci_get_slot() takes the PCI bus semaphore, so it is not safe to call
in interrupt context. However, we may check for EEH errors, and thus
call the function, in interrupt context. To avoid using pci_get_slot(),
we turn to the device tree to fetch the location code.
Otherwise, we might run into the WARN_ON() as the following messages indicate:
WARNING: at drivers/pci/search.c:223
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.16.0-rc3+ #72
task: c000000001367af0 ti: c000000001444000 task.ti: c000000001444000
NIP: c000000000497b70 LR: c000000000037530 CTR: 000000003003d114
REGS: c000000001446fa0 TRAP: 0700 Not tainted (3.16.0-rc3+)
MSR: 9000000000029032 <SF,HV,EE,ME,IR,DR,RI> CR: 48002422 XER: 20000000
CFAR: c00000000003752c SOFTE: 0
NIP [c000000000497b70] .pci_get_slot+0x40/0x110
LR [c000000000037530] .eeh_pe_loc_get+0x150/0x190
Call Trace:
.of_get_property+0x30/0x60 (unreliable)
.eeh_pe_loc_get+0x150/0x190
.eeh_dev_check_failure+0x1b4/0x550
.eeh_check_failure+0x90/0xf0
.lpfc_sli_check_eratt+0x504/0x7c0 [lpfc]
.lpfc_poll_eratt+0x64/0x100 [lpfc]
.call_timer_fn+0x64/0x190
.run_timer_softirq+0x2cc/0x3e0
Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
CMA is introduced to provide physically contiguous pages at runtime.
For this purpose, it reserves memory at boot time. Although the memory
is reserved, it can still be used to satisfy movable memory allocation
requests. This use case benefits systems that need the CMA reserved
memory only infrequently, and it is one of the main purposes of
introducing CMA.
But there is a problem in the current implementation: it behaves like
a plain reserved-memory approach. The pages in the CMA reserved area
are hardly ever used for movable memory allocation. This is caused by
the combination of allocation and reclaim policy.
The pages in the CMA reserved area are allocated only when there is no
other movable memory, that is, as a fallback allocation. So by the time
this fallback allocation starts, the system is under heavy memory
pressure. Even under that pressure, movable allocations succeed easily,
since there are many pages in the CMA reserved area. But this is not
the case for unmovable and reclaimable allocations, because they cannot
use pages in the CMA reserved area. For watermark checking, these
allocations regard the system's free memory as (free pages - free CMA
pages), that is, free unmovable pages + free reclaimable pages + free
movable pages. Because movable pages are already exhausted, the only
free pages left are of the unmovable and reclaimable types, and this is
a really small amount, so the watermark check fails. It wakes up kswapd
to make enough free memory for unmovable and reclaimable allocation,
and kswapd does so. So before we fully utilize the pages in the CMA
reserved area, kswapd starts to reclaim memory and tries to push free
memory over the high watermark. This watermark check by kswapd does not
account for free CMA pages, so many movable pages are reclaimed. After
that, we have a lot of movable pages again, so the fallback allocation
does not happen again. To conclude, the amount of free memory in
meminfo, which includes free CMA pages, hovers around 512 MB if I
reserve 512 MB of memory for CMA.
I found this problem with the following experiment.
4 CPUs, 1024 MB, VIRTUAL MACHINE
make -j16
CMA reserve: 0 MB 512 MB
Elapsed-time: 225.2 472.5
Average-MemFree: 322490 KB 630839 KB
To solve this problem, I can think of the following two possible solutions:
1. allocate the pages on CMA reserved memory first, and if they are
exhausted, allocate movable pages.
2. interleaved allocation: try to allocate specific amounts of memory
from CMA reserved memory and then allocate from free movable memory.
I tested approach #1 and found a problem. Although free memory in
meminfo can hover around the low watermark, it fluctuates heavily,
because too many pages are reclaimed when kswapd is invoked. The reason
for this behaviour is that successively allocated CMA pages sit on the
LRU list in that order and kswapd reclaims them in the same order.
Reclaiming this memory doesn't help the watermark check in kswapd, so
too many pages are reclaimed, I guess.
So, I implemented approach #2.
One thing I should note is that we should not switch the allocation
target (movable list or CMA) on every allocation attempt, since that
would prevent allocated pages from being physically contiguous, which
could hurt the performance of some I/O devices. To avoid this, I keep
the same allocation target for at least pageblock_nr_pages attempts,
and make this number reflect the ratio of free pages excluding free
CMA pages to free CMA pages. With this approach, the system works very
smoothly and fully utilizes the pages in the CMA reserved area.
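The batching idea can be sketched like this (names and the bookkeeping structure are illustrative, not the kernel's; real batch sizes come from the free-page counters):

```c
/* Stay on one allocation target for a whole batch, alternating between
 * normal movable memory and the CMA area in a ratio mirroring their
 * free-page counts, so allocations remain physically contiguous. */
enum target { TARGET_MOVABLE, TARGET_CMA };

struct cma_interleave {
    long movable_batch;  /* replenished from free_pages - free_cma */
    long cma_batch;      /* replenished from free_cma */
    long movable_left, cma_left;
};

static enum target choose_target(struct cma_interleave *s)
{
    if (s->movable_left <= 0 && s->cma_left <= 0) { /* both batches spent */
        s->movable_left = s->movable_batch;
        s->cma_left = s->cma_batch;
    }
    if (s->movable_left > 0) {
        s->movable_left--;
        return TARGET_MOVABLE;
    }
    s->cma_left--;
    return TARGET_CMA;
}
```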
Following is the experimental result of this patch.
4 CPUs, 1024 MB, VIRTUAL MACHINE
make -j16
<Before>
CMA reserve: 0 MB 512 MB
Elapsed-time: 225.2 472.5
Average-MemFree: 322490 KB 630839 KB
nr_free_cma: 0 131068
pswpin: 0 261666
pswpout: 75 1241363
<After>
CMA reserve: 0 MB 512 MB
Elapsed-time: 222.7 224
Average-MemFree: 325595 KB 393033 KB
nr_free_cma: 0 61001
pswpin: 0 6
pswpout: 44 502
There is no difference if we don't have CMA reserved memory (the 0 MB
case). But with CMA reserved memory (the 512 MB case), this patch lets
us fully utilize the reserved memory, and the system behaves as if it
didn't reserve any memory.
With this patch, we aggressively allocate pages from the CMA reserved
area, so CMA allocation latency can increase. Below are the
experimental results on latency.
4 CPUs, 1024 MB, VIRTUAL MACHINE
CMA reserve: 512 MB
Background Workload: make -jN
Real Workload: 8 MB CMA allocation/free 20 times with 5 sec interval
N: 1 4 8 16
Elapsed-time(Before): 4309.75 9511.09 12276.1 77103.5
Elapsed-time(After): 5391.69 16114.1 19380.3 34879.2
So generally we see a latency increase, and the ratio of the increase
is rather big, up to 70%. But under heavy workload it shows a latency
decrease of up to 55%. The increase may be a worst-case scenario, but
reducing it would be important for some systems, so I can say that
this patch has both advantages and disadvantages in terms of latency.
Although I think this patch is the right direction for CMA, it has a
side effect in the following case. If there is a small memory zone and
CMA occupies most of it, the LRU for that zone will hold many CMA
pages. When reclaim starts, these CMA pages will be reclaimed but not
counted for watermark checking, so too many CMA pages could be
reclaimed unnecessarily. Until now this couldn't happen, because free
CMA pages weren't used easily. But with this patch free CMA pages are
used easily, so this problem becomes possible. I will handle it in
another patchset after some investigation.
v2:
- In fastpath, just replenish counters. Calculation is done whenever
CMA area is varied
v3:
- Use unsigned type in adjust_managed_cma_page_count() (per Gioh)
- Fix +/- count when calling adjust_managed_cma_page_count() (per Gioh)
- Instead of implementing __rmqueue_cma() which has another
__rmqueue_smallest(), choose_rmqueue_migratetype() is implemented to
change original migratetype to MIGRATE_CMA according to criteria. It
helps not to violate layering. (per Minchan in offline discussion)
BZ: 111727
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
commit 3617f2ca6d0eba48114308532945a7f1577816a4 upstream.
When a CPU is hot removed we'll cancel all the delayed work items
via gov_cancel_work(). Normally this will just cancel a delayed
timer on each CPU that the policy is managing and the work won't
run, but if the work is already running the workqueue code will
wait for the work to finish before continuing to prevent the
work items from re-queuing themselves like they normally do. This
scheme will work most of the time, except for the case where the
work function determines that it should adjust the delay for all
other CPUs that the policy is managing. If this scenario occurs,
the canceling CPU will cancel its own work but queue up the other
CPUs works to run. For example:
CPU0 CPU1
---- ----
cpu_down()
...
__cpufreq_remove_dev()
cpufreq_governor_dbs()
case CPUFREQ_GOV_STOP:
gov_cancel_work(dbs_data, policy);
cpu0 work is canceled
timer is canceled
cpu1 work is canceled <work runs>
<waits for cpu1> od_dbs_timer()
gov_queue_work(*, *, true);
cpu0 work queued
cpu1 work queued
cpu2 work queued
...
cpu1 work is canceled
cpu2 work is canceled
...
At the end of the GOV_STOP case cpu0 still has a work queued to
run although the code is expecting all of the works to be
canceled. __cpufreq_remove_dev() will then proceed to
re-initialize all the other CPUs works except for the CPU that is
going down. The CPUFREQ_GOV_START case in cpufreq_governor_dbs()
will trample over the queued work and debugobjects will spit out
a warning:
WARNING: at lib/debugobjects.c:260 debug_print_object+0x94/0xbc()
ODEBUG: init active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x10
Modules linked in:
CPU: 0 PID: 1491 Comm: sh Tainted: G W 3.10.0 #19
[<c010c178>] (unwind_backtrace+0x0/0x11c) from [<c0109dec>] (show_stack+0x10/0x14)
[<c0109dec>] (show_stack+0x10/0x14) from [<c01904cc>] (warn_slowpath_common+0x4c/0x6c)
[<c01904cc>] (warn_slowpath_common+0x4c/0x6c) from [<c019056c>] (warn_slowpath_fmt+0x2c/0x3c)
[<c019056c>] (warn_slowpath_fmt+0x2c/0x3c) from [<c0388a7c>] (debug_print_object+0x94/0xbc)
[<c0388a7c>] (debug_print_object+0x94/0xbc) from [<c0388e34>] (__debug_object_init+0x2d0/0x340)
[<c0388e34>] (__debug_object_init+0x2d0/0x340) from [<c019e3b0>] (init_timer_key+0x14/0xb0)
[<c019e3b0>] (init_timer_key+0x14/0xb0) from [<c0635f78>] (cpufreq_governor_dbs+0x3e8/0x5f8)
[<c0635f78>] (cpufreq_governor_dbs+0x3e8/0x5f8) from [<c06325a0>] (__cpufreq_governor+0xdc/0x1a4)
[<c06325a0>] (__cpufreq_governor+0xdc/0x1a4) from [<c0633704>] (__cpufreq_remove_dev.isra.10+0x3b4/0x434)
[<c0633704>] (__cpufreq_remove_dev.isra.10+0x3b4/0x434) from [<c08989f4>] (cpufreq_cpu_callback+0x60/0x80)
[<c08989f4>] (cpufreq_cpu_callback+0x60/0x80) from [<c08a43c0>] (notifier_call_chain+0x38/0x68)
[<c08a43c0>] (notifier_call_chain+0x38/0x68) from [<c01938e0>] (__cpu_notify+0x28/0x40)
[<c01938e0>] (__cpu_notify+0x28/0x40) from [<c0892ad4>] (_cpu_down+0x7c/0x2c0)
[<c0892ad4>] (_cpu_down+0x7c/0x2c0) from [<c0892d3c>] (cpu_down+0x24/0x40)
[<c0892d3c>] (cpu_down+0x24/0x40) from [<c0893ea8>] (store_online+0x2c/0x74)
[<c0893ea8>] (store_online+0x2c/0x74) from [<c04519d8>] (dev_attr_store+0x18/0x24)
[<c04519d8>] (dev_attr_store+0x18/0x24) from [<c02a69d4>] (sysfs_write_file+0x100/0x148)
[<c02a69d4>] (sysfs_write_file+0x100/0x148) from [<c0255c18>] (vfs_write+0xcc/0x174)
[<c0255c18>] (vfs_write+0xcc/0x174) from [<c0255f70>] (SyS_write+0x38/0x64)
[<c0255f70>] (SyS_write+0x38/0x64) from [<c0106120>] (ret_fast_syscall+0x0/0x30)
BZ: 113137
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d8996f63abe5a9d9b24f7a4df2c8459659d0e76f)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
commit 95731ebb114c5f0c028459388560fc2a72fe5049 upstream.
Cpufreq governors' stop and start operations should be carried out
in sequence. Otherwise, there will be unexpected behavior, like in
the example below.
Suppose there are 4 CPUs and policy->cpu=CPU0, CPU1/2/3 are linked
to CPU0. The normal sequence is:
1) Current governor is userspace. An application tries to set the
governor to ondemand. It will call __cpufreq_set_policy() in
which it will stop the userspace governor and then start the
ondemand governor.
2) Current governor is userspace. The online of CPU3 runs on CPU0.
It will call cpufreq_add_policy_cpu() in which it will first
stop the userspace governor, and then start it again.
If the sequence of the above two cases interleaves, it becomes:
1) Application stops userspace governor
2) Hotplug stops userspace governor
which is a problem, because the governor shouldn't be stopped twice
in a row. What happens next is:
3) Application starts ondemand governor
4) Hotplug starts a governor
In step 4, the hotplug is supposed to start the userspace governor,
but now the governor has been changed by the application to ondemand,
so the ondemand governor is started once again, which is incorrect.
The solution is to prevent policy governors from being stopped
multiple times in a row. A governor should only be stopped once for
one policy. After it has been stopped, no more governor stop
operations should be executed.
Also add a mutex to serialize governor operations.
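The stop-once rule plus the serializing mutex can be modeled in userspace like this (names are illustrative; the kernel's version lives in the cpufreq core):

```c
#include <pthread.h>
#include <stdbool.h>

/* Per-policy state: a mutex serializes governor operations, and a
 * flag rejects a second GOV_STOP while the governor is already off. */
struct policy_model {
    pthread_mutex_t lock;
    bool governor_enabled;
};

static int gov_stop(struct policy_model *p)
{
    int ret = 0;
    pthread_mutex_lock(&p->lock);
    if (!p->governor_enabled)
        ret = -1;               /* already stopped: refuse double stop */
    else
        p->governor_enabled = false;
    pthread_mutex_unlock(&p->lock);
    return ret;
}

static int gov_start(struct policy_model *p)
{
    int ret = 0;
    pthread_mutex_lock(&p->lock);
    if (p->governor_enabled)
        ret = -1;               /* already running: refuse double start */
    else
        p->governor_enabled = true;
    pthread_mutex_unlock(&p->lock);
    return ret;
}
```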
BZ: 113137
[rjw: Changelog. And you owe me a beverage of my choice.]
Signed-off-by: Xiaoguang Chen <chenxg@marvell.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit ba17ca46b968001df16f672ffe694fd0a12512f2)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
We are seeing a lot of PMU warnings on POWER8:
Can't find PMC that caused IRQ
Looking closer, the active PMC is 0 at this point and we took a PMU
exception on the transition from negative to 0. Some versions of POWER8
have an issue where they edge detect and not level detect PMC overflows.
A number of places program the PMC with (0x80000000 - period_left),
where period_left can be negative. We can either fix all of these or
just ensure that period_left is always >= 1.
This patch takes the second option.
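The clamp can be sketched as a small userspace model (the actual write to the PMC register is omitted):

```c
#include <stdint.h>

/* Some POWER8 revisions edge-detect rather than level-detect PMC
 * overflow, so the counter must never be programmed at or past the
 * overflow point. Clamping period_left to >= 1 guarantees the value
 * written is strictly below 0x80000000. */
static uint32_t pmc_program_value(int64_t period_left)
{
    if (period_left < 1)
        period_left = 1;
    return (uint32_t)(0x80000000LL - period_left);
}
```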
Cc: <stable@vger.kernel.org>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
On POWER8 when switching to a KVM guest we set bits in MMCR2 to freeze
the PMU counters. Aside from on boot they are then never reset,
resulting in stuck perf counters for any user in the guest or host.
We now set MMCR2 to 0 whenever enabling the PMU, which provides a sane
state for perf to use the PMU counters under either the guest or the
host.
This was manifesting as a bug with ppc64_cpu --frequency:
$ sudo ppc64_cpu --frequency
WARNING: couldn't run on cpu 0
WARNING: couldn't run on cpu 8
...
WARNING: couldn't run on cpu 144
WARNING: couldn't run on cpu 152
min: 18446744073.710 GHz (cpu -1)
max: 0.000 GHz (cpu -1)
avg: 0.000 GHz
The command uses a perf counter to measure CPU cycles over a fixed
amount of time, in order to approximate the frequency of the machine.
The counters were returning zero once a guest was started, regardless of
whether it was still running or had been shut down.
By dumping the value of MMCR2, it was observed that once a guest is
running MMCR2 is set to 1s - which stops counters from running:
$ sudo sh -c 'echo p > /proc/sysrq-trigger'
CPU: 0 PMU registers, ppmu = POWER8 n_counters = 6
PMC1: 5b635e38 PMC2: 00000000 PMC3: 00000000 PMC4: 00000000
PMC5: 1bf5a646 PMC6: 5793d378 PMC7: deadbeef PMC8: deadbeef
MMCR0: 0000000080000000 MMCR1: 000000001e000000 MMCRA: 0000040000000000
MMCR2: fffffffffffffc00 EBBHR: 0000000000000000
EBBRR: 0000000000000000 BESCR: 0000000000000000
SIAR: 00000000000a51cc SDAR: c00000000fc40000 SIER: 0000000001000000
This is done unconditionally in book3s_hv_interrupts.S upon entering the
guest, and the original value is only saved/restored if the host has
indicated it was using the PMU. This is okay, however the user of the
PMU needs to ensure that it is in a defined state when it starts using
it.
BZ: 112045
Fixes: e05b9b9e5c10 ("powerpc/perf: Power8 PMU support")
Cc: stable@vger.kernel.org
Signed-off-by: Joel Stanley <joel@jms.id.au>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Instead of separate bits for every POWER8 PMU feature, have a single one
for v2.07 of the architecture.
This saves us adding a MMCR2 define for a future patch.
BZ: 112045
Cc: stable@vger.kernel.org
Signed-off-by: Joel Stanley <joel@jms.id.au>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
These two registers are already saved in the block above. Aside from
being unnecessary, by the time we get down to the second save location
r8 no longer contains MMCR2, so we are clobbering the saved value with
PMC5.
MMCR2 primarily consists of counter freeze bits. So restoring the value
of PMC5 into MMCR2 will most likely have the effect of freezing
counters.
BZ: 112045
Fixes: 72cde5a88d37 ("KVM: PPC: Book3S HV: Save/restore host PMU registers that are new in POWER8")
Cc: stable@vger.kernel.org
Signed-off-by: Joel Stanley <joel@jms.id.au>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
On the PowerNV platform, we are holding an unnecessary refcount on a
pci_dev, which means the pci_dev is not destroyed when hotplugging a
PCI device.
This patch releases the unnecessary refcount.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
(cherry picked from commit 4966bfa1b3347ee75e6d93859a2e8ce9a662390c)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
During an EEH hotplug event, iommu_add_device() is invoked three
times, and two of those invocations trigger a warning or an error.
The three invocations of iommu_add_device() are:
pci_device_add
...
set_iommu_table_base_and_group <- 1st time, fail
device_add
...
tce_iommu_bus_notifier <- 2nd time, succeeds
pcibios_add_pci_devices
...
pcibios_setup_bus_devices <- 3rd time, re-attach
The first invocation fails, since dev->kobj->sd is not yet
initialized; it is initialized in device_add().
The third invocation's warning is triggered by the re-attach of the
iommu_group.
After applying this patch, the error
iommu_tce: 0003:05:00.0 has not been added, ret=-14
and the warning
[ 204.123609] ------------[ cut here ]------------
[ 204.123645] WARNING: at arch/powerpc/kernel/iommu.c:1125
[ 204.123680] Modules linked in: xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT bnep bluetooth 6lowpan_iphc rfkill xt_conntrack ebtable_nat ebtable_broute bridge stp llc mlx4_ib ib_sa ib_mad ib_core ib_addr ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw bnx2x tg3 mlx4_core nfsd ptp mdio ses libcrc32c nfs_acl enclosure be2net pps_core shpchp lockd kvm uinput sunrpc binfmt_misc lpfc scsi_transport_fc ipr scsi_tgt
[ 204.124356] CPU: 18 PID: 650 Comm: eehd Not tainted 3.14.0-rc5yw+ #102
[ 204.124400] task: c0000027ed485670 ti: c0000027ed50c000 task.ti: c0000027ed50c000
[ 204.124453] NIP: c00000000003cf80 LR: c00000000006c648 CTR: c00000000006c5c0
[ 204.124506] REGS: c0000027ed50f440 TRAP: 0700 Not tainted (3.14.0-rc5yw+)
[ 204.124558] MSR: 9000000000029032 <SF,HV,EE,ME,IR,DR,RI> CR: 88008084 XER: 20000000
[ 204.124682] CFAR: c00000000006c644 SOFTE: 1
GPR00: c00000000006c648 c0000027ed50f6c0 c000000001398380 c0000027ec260300
GPR04: c0000027ea92c000 c00000000006ad00 c0000000016e41b0 0000000000000110
GPR08: c0000000012cd4c0 0000000000000001 c0000027ec2602ff 0000000000000062
GPR12: 0000000028008084 c00000000fdca200 c0000000000d1d90 c0000027ec281a80
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000001
GPR24: 000000005342697b 0000000000002906 c000001fe6ac9800 c000001fe6ac9800
GPR28: 0000000000000000 c0000000016e3a80 c0000027ea92c090 c0000027ea92c000
[ 204.125353] NIP [c00000000003cf80] .iommu_add_device+0x30/0x1f0
[ 204.125399] LR [c00000000006c648] .pnv_pci_ioda_dma_dev_setup+0x88/0xb0
[ 204.125443] Call Trace:
[ 204.125464] [c0000027ed50f6c0] [c0000027ed50f750] 0xc0000027ed50f750 (unreliable)
[ 204.125526] [c0000027ed50f750] [c00000000006c648] .pnv_pci_ioda_dma_dev_setup+0x88/0xb0
[ 204.125588] [c0000027ed50f7d0] [c000000000069cc8] .pnv_pci_dma_dev_setup+0x78/0x340
[ 204.125650] [c0000027ed50f870] [c000000000044408] .pcibios_setup_device+0x88/0x2f0
[ 204.125712] [c0000027ed50f940] [c000000000046040] .pcibios_setup_bus_devices+0x60/0xd0
[ 204.125774] [c0000027ed50f9c0] [c000000000043acc] .pcibios_add_pci_devices+0xdc/0x1c0
[ 204.125837] [c0000027ed50fa50] [c00000000086f970] .eeh_reset_device+0x36c/0x4f0
[ 204.125939] [c0000027ed50fb20] [c00000000003a2d8] .eeh_handle_normal_event+0x448/0x480
[ 204.126068] [c0000027ed50fbc0] [c00000000003a35c] .eeh_handle_event+0x4c/0x340
[ 204.126192] [c0000027ed50fc80] [c00000000003a74c] .eeh_event_handler+0xfc/0x1b0
[ 204.126319] [c0000027ed50fd30] [c0000000000d1ea0] .kthread+0x110/0x130
[ 204.126430] [c0000027ed50fe30] [c00000000000a460] .ret_from_kernel_thread+0x5c/0x7c
[ 204.126556] Instruction dump:
[ 204.126610] 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff71 7c7e1b78 60000000
[ 204.126787] 60000000 e87e0298 3143ffff 7d2a1910 <0b090000> 2fa90000 40de00c8 ebfe0218
[ 204.126966] ---[ end trace 6e7aefd80add2973 ]---
are cleared.
This patch removes the iommu_add_device() call in
pnv_pci_ioda_dma_dev_setup(), which reverts part of the change in
commit d905c5df ("PPC: POWERNV: move iommu_add_device earlier").
It addresses bug#110805.
Upstream commit 3f28c5af3964c11e61e9a58df77cae5ebdb8209e
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
I am seeing an issue where a CPU running perf eventually hangs.
Traces show timer interrupts happening every 4 seconds even
when a userspace task is running on the CPU. /proc/timer_list
also shows pending hrtimers have not run in over an hour,
including the scheduler.
Looking closer, decrementers_next_tb is getting set to
0xffffffffffffffff, and at that point we will never take
a timer interrupt again.
In __timer_interrupt() we set decrementers_next_tb to
0xffffffffffffffff and rely on ->event_handler to update it:
*next_tb = ~(u64)0;
if (evt->event_handler)
evt->event_handler(evt);
In this case ->event_handler is hrtimer_interrupt. This will eventually
call back through the clockevents code with the next event to be
programmed:
static int decrementer_set_next_event(unsigned long evt,
struct clock_event_device *dev)
{
/* Don't adjust the decrementer if some irq work is pending */
if (test_irq_work_pending())
return 0;
__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
If irq work came in between these two points, we will return
before updating decrementers_next_tb and we never process a timer
interrupt again.
This looks to have been introduced by 0215f7d8c53f (powerpc: Fix races
with irq_work). Fix it by removing the early exit and relying on
code later on in the function to force an early decrementer:
/* We may have raced with new irq work */
if (test_irq_work_pending())
set_dec(1);
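The fixed ordering can be modeled in userspace like this (stand-in globals replace the per-CPU variables and the real set_dec()):

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the per-CPU state and the decrementer register */
static uint64_t decrementers_next_tb;
static uint64_t dec_value;
static bool irq_work_pending;

static bool test_irq_work_pending(void) { return irq_work_pending; }
static void set_dec(uint64_t v) { dec_value = v; }

/* Fixed shape: no early exit, so decrementers_next_tb is always
 * updated; racing irq work just forces an immediate decrementer. */
static int decrementer_set_next_event_model(uint64_t evt, uint64_t tb_now)
{
    decrementers_next_tb = tb_now + evt;
    set_dec(evt);
    /* We may have raced with new irq work */
    if (test_irq_work_pending())
        set_dec(1);
    return 0;
}
```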
Liu Ping Fan <pingfank@linux.vnet.ibm.com>: backport from upstream
commit 8050936caf125fbe54111ba5e696b68a360556ba to fix bug#104457.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org # 3.14+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This brings the permissions in line with the upstream implementation,
allowing users to see the state of the system.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Some backends call hvc_kick() to wakeup the HVC thread from its
slumber upon incoming characters. This however doesn't work
properly because it uses msleep_interruptible() which is mostly
immune to wake_up_process(). It will basically go back to sleep
until the timeout is expired (only signals can really wake it).
Replace it with a simple schedule_timeout_interruptible() instead,
which may wake up earlier every now and then, but we really don't
care in this case.
Backport of upstream 15a2743193b099f82657ca315dd2e1091be6c1d3
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Commit f5c57710dd62dd06f176934a8b4b8accbf00f9f8 ("powerpc/eeh: Use
partial hotplug for EEH unaware drivers") introduces eeh_rmv_device,
which may grab a reference to a driver, but not release it.
That prevents a driver from being removed after it has gone through EEH
recovery.
This patch drops the reference if it was taken.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Acked-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
(cherry picked from commit 8cc6b6cd8713457be80202fc4264f05d20bc5e1b)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Currently, when an off-line CPU wakes up from nap, we check for the
possible reasons for the wakeup (split-core mode change, or the CPU
needs to come online) before clearing the IPI that woke us. This
leaves a possible race in the situation where we wake up for some
unrelated reason, typically a leftover IPI from a KVM guest exit.
If some other CPU sets a flag and then sends an IPI at just the right
time, it is possible that we don't see the flag but then clear the
IPI that the other CPU sent, and therefore miss a wakeup.
To fix this, we clear the IPI first, and only check for any flags
after a barrier. That way if we miss the flag setting we are sure not
to have cleared the IPI that the other CPU sent.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
On the PowerNV platform, EEH errors are reported by IO accessors or by a
poller driven by interrupt. After the PE is isolated, we won't produce
another EEH event for the PE. The current implementation can lose an EEH
event in this way:
The interrupt handler queues one "special" event, which drives the poller.
The EEH thread hasn't picked up the special event yet. An IO accessor kicks
in, the frozen PE is marked as "isolated" and an EEH event is queued to the
list. The EEH thread runs because of the special event and purges all
existing EEH events. However, we never produce another EEH event for the
frozen PE. Eventually, the PE stays marked as "isolated" and we don't have
an EEH event to recover it.
The patch fixes the issue by keeping EEH events for PEs that have been
marked as "isolated", with the help of an additional "force" argument to
eeh_remove_event().
The problem was reported by Rolf Brudeseth and we don't have an open bug
tracking it. The patch has been merged into mainline Linux; this is a port
to Frobisher.
Reported-by: Rolf Brudeseth <rolfb@us.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
(cherry picked from commit cff261f6bd03612e792e4c8872c6ad049f743863)
Signed-off-by: Sanket Rathi <sanket@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
(cherry picked from commit 725dd399ae69d0703c0417f9ce0ce065d2a914d1)
Signed-off-by: Sanket Rathi <sanket@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
(cherry picked from commit 4902b381c6c99e5edaca1e2549f0a5149d90feec)
Signed-off-by: Sanket Rathi <sanket@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
(cherry picked from commit 164cecd1b9aed821d29ee9543ea4ad7435321823)
Signed-off-by: Sanket Rathi <sanket@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
getting aborted
Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
(cherry picked from commit afbd8d8884325bcc4fc4c12fcb2eccbf9356feca)
Signed-off-by: Sanket Rathi <sanket@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
(cherry picked from commit 3be30e0e4486b3568044efe27caf405296d7845a)
Signed-off-by: Sanket Rathi <sanket@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
(cherry picked from commit 91f32d01d9fff7f5f15f3ad136e55dc42d02f9ff)
Signed-off-by: Sanket Rathi <sanket@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
(cherry picked from commit 3bf41ba9376cda911e908dca36fe016293ad8fef)
Signed-off-by: Sanket Rathi <sanket@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
commit 621b5060e823301d0cba4cb52a7ee3491922d291 upstream.
When we fork/clone we currently don't copy any of the TM state to the new
thread. This results in a TM bad thing (program check) when the new process is
switched in as the kernel does a tmrechkpt with TEXASR FS not set. Also, since
R1 is from userspace, we trigger the bad kernel stack pointer detection. So we
end up with something like this:
Bad kernel stack pointer 0 at c0000000000404fc
cpu 0x2: Vector: 700 (Program Check) at [c00000003ffefd40]
pc: c0000000000404fc: restore_gprs+0xc0/0x148
lr: 0000000000000000
sp: 0
msr: 9000000100201030
current = 0xc000001dd1417c30
paca = 0xc00000000fe00800 softe: 0 irq_happened: 0x01
pid = 0, comm = swapper/2
WARNING: exception is not recoverable, can't continue
The below fixes this by flushing the TM state before we copy the task_struct to
the clone. To do this we go through the tmreclaim path, which removes the
checkpointed registers from the CPU and transitions the CPU out of TM suspend
mode. Hence we need to call tmrechkpt after to restore the checkpointed state
and the TM mode for the current task.
To make this fail from userspace is simply:
tbegin
li r0, 2
sc
<boom>
Kudos to Adhemerval Zanella Neto for finding this.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: Adhemerval Zanella Neto <azanella@br.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[Backported to 3.10: context adjust]
Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit aece4fa7368debd14ac07ebaf569587ff02cc596)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
commit e6b8fd028b584ffca7a7255b8971f254932c9fce upstream.
We can't take an IRQ when we're about to do a trechkpt as our GPR state is set
to user GPR values.
We've hit this when running some IBM Java stress tests in the lab resulting in
the following dump:
cpu 0x3f: Vector: 700 (Program Check) at [c000000007eb3d40]
pc: c000000000050074: restore_gprs+0xc0/0x148
lr: 00000000b52a8184
sp: ac57d360
msr: 8000000100201030
current = 0xc00000002c500000
paca = 0xc000000007dbfc00 softe: 0 irq_happened: 0x00
pid = 34535, comm = Pooled Thread #
R00 = 00000000b52a8184 R16 = 00000000b3e48fda
R01 = 00000000ac57d360 R17 = 00000000ade79bd8
R02 = 00000000ac586930 R18 = 000000000fac9bcc
R03 = 00000000ade60000 R19 = 00000000ac57f930
R04 = 00000000f6624918 R20 = 00000000ade79be8
R05 = 00000000f663f238 R21 = 00000000ac218a54
R06 = 0000000000000002 R22 = 000000000f956280
R07 = 0000000000000008 R23 = 000000000000007e
R08 = 000000000000000a R24 = 000000000000000c
R09 = 00000000b6e69160 R25 = 00000000b424cf00
R10 = 0000000000000181 R26 = 00000000f66256d4
R11 = 000000000f365ec0 R27 = 00000000b6fdcdd0
R12 = 00000000f66400f0 R28 = 0000000000000001
R13 = 00000000ada71900 R29 = 00000000ade5a300
R14 = 00000000ac2185a8 R30 = 00000000f663f238
R15 = 0000000000000004 R31 = 00000000f6624918
pc = c000000000050074 restore_gprs+0xc0/0x148
cfar= c00000000004fe28 dont_restore_vec+0x1c/0x1a4
lr = 00000000b52a8184
msr = 8000000100201030 cr = 24804888
ctr = 0000000000000000 xer = 0000000000000000 trap = 700
This moves tm_recheckpoint to a C function and moves the tm_restore_sprs into
that function. It then adds IRQ disabling over the trechkpt critical section.
It also sets the TEXASR FS in the signals code to ensure this is never set now
that we explicitly write the TM SPRs in tm_recheckpoint.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b2b708cf2f9c51bf5a75845eb0b2f2390707957c)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
The adapter is freed before we check its flags. This was caused
by commit 144be3d ("net/cxgb4: Avoid disabling PCI device for
twice"). The problem was reported by Intel's "0-day" tool.
The patch fixes it while avoiding a revert of commit 144be3d. It's
in response to bug#110450.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
|
|
We can possibly retrieve the adapter's statistics during EEH recovery,
and that should be disallowed. Otherwise, it can incur a repeated
EEH error, and EEH recovery would eventually fail.
The patch reuses the statistics lock and checks that the net_device is
attached before retrieving statistics, so that the problem can be
avoided.
It's in response to bug#110450.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
|
|
If an EEH error happens on the adapter and we have to remove
it from the system for some reason (e.g. more than 5 EEH errors
detected on the device in the last hour), the adapter will be disabled
twice, separately by eeh_err_detected() and remove_one(), which
incurs the following unexpected backtrace. The patch avoids
that.
It's in response to bug#110450.
WARNING: at drivers/pci/pci.c:1431
CPU: 12 PID: 121 Comm: eehd Not tainted 3.13.0-rc7+ #1
task: c0000001823a3780 ti: c00000018240c000 task.ti: c00000018240c000
NIP: c0000000003c1e40 LR: c0000000003c1e3c CTR: 0000000001764c5c
REGS: c00000018240f470 TRAP: 0700 Not tainted (3.13.0-rc7+)
MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 28000024 XER: 00000004
CFAR: c000000000706528 SOFTE: 1
GPR00: c0000000003c1e3c c00000018240f6f0 c0000000010fe1f8 0000000000000035
GPR04: 0000000000000000 0000000000000000 00000000003ae509 0000000000000000
GPR08: 000000000000346f 0000000000000000 0000000000000000 0000000000003fef
GPR12: 0000000028000022 c00000000ec93000 c0000000000c11b0 c000000184ac3e40
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24: 0000000000000000 c0000000009398d8 c00000000101f9c0 c0000001860ae000
GPR28: c000000182ba0000 00000000000001f0 c0000001860ae6f8 c0000001860ae000
NIP [c0000000003c1e40] .pci_disable_device+0xd0/0xf0
LR [c0000000003c1e3c] .pci_disable_device+0xcc/0xf0
Call Trace:
[c0000000003c1e3c] .pci_disable_device+0xcc/0xf0 (unreliable)
[d0000000073881c4] .remove_one+0x174/0x320 [cxgb4]
[c0000000003c57e0] .pci_device_remove+0x60/0x100
[c00000000046396c] .__device_release_driver+0x9c/0x120
[c000000000463a20] .device_release_driver+0x30/0x60
[c0000000003bcdb4] .pci_stop_bus_device+0x94/0xd0
[c0000000003bcf48] .pci_stop_and_remove_bus_device+0x18/0x30
[c00000000003f548] .pcibios_remove_pci_devices+0xa8/0x140
[c000000000035c00] .eeh_handle_normal_event+0xa0/0x3c0
[c000000000035f50] .eeh_handle_event+0x30/0x2b0
[c0000000000362c4] .eeh_event_handler+0xf4/0x1b0
[c0000000000c12b8] .kthread+0x108/0x130
[c00000000000a168] .ret_from_kernel_thread+0x5c/0x74
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
|
|
Occasional failures have been seen with split-core mode and migration
where the message "KVM: couldn't grab cpu" appears. This increases
the length of time that we wait from 1ms to 10ms, which seems to
work around the issue.
Fixes: BZ 110865
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This patch fixes the EEH recovery issue in bnx2x.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
LTC-Bugzilla: #110449
|
|
On the Tuleta system, HTX hits a data miscompare issue after EEH recovery.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
|
|
Add a memory barrier to ensure the valid bit is read before
any of the cqe payload is read. This fixes an issue seen
on Power where the cqe payload was getting loaded before
the valid bit. When this occurred, we saw an iotag out of
range error when a command completed, but since the iotag
looked invalid the command didn't get completed to scsi core.
Later we hit the command timeout, attempt to abort the command,
then wait for the aborted command to be returned. Since the
adapter has already returned the command, we time out waiting
and end up escalating EEH all the way to a host reset. This
patch fixes the issue.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
|
|
Pulled from 3.10.23 stable for bug 110340.
From abb5100737bba3f82b5514350fea89ca361ac66c Mon Sep 17 00:00:00 2001
From: Peter Hurley <peter@hurleysoftware.com>
Date: Sat, 3 May 2014 14:04:59 +0200
Subject: n_tty: Fix n_tty_write crash when echoing in raw mode
commit 4291086b1f081b869c6d79e5b7441633dc3ace00 upstream.
The tty atomic_write_lock does not provide an exclusion guarantee for
the tty driver if the termios settings are LECHO & !OPOST. And since
it is unexpected and not allowed to call TTY buffer helpers like
tty_insert_flip_string concurrently, this may lead to crashes when
concurrent writers call pty_write. In that case the following two
writers:
* the ECHOing from a workqueue and
* pty_write from the process
race and can overflow the corresponding TTY buffer like follows.
If we look into tty_insert_flip_string_fixed_flag, there is:
	int space = __tty_buffer_request_room(port, goal, flags);
	struct tty_buffer *tb = port->buf.tail;
	...
	memcpy(char_buf_ptr(tb, tb->used), chars, space);
	...
	tb->used += space;
so the race of the two can result in something like this:
A: __tty_buffer_request_room
B: __tty_buffer_request_room
A: memcpy(buf(tb->used), ...)
A: tb->used += space;
B: memcpy(buf(tb->used), ...) -> BOOM
B's memcpy is past the tty_buffer due to the previous A's tb->used
increment.
Since the N_TTY line discipline input processing can output
concurrently with a tty write, obtain the N_TTY ldisc output_lock to
serialize echo output with normal tty writes. This ensures the tty
buffer helper tty_insert_flip_string is not called concurrently and
everything is fine.
Note that this is nicely reproducible by an ordinary user using
forkpty and some setup around that (raw termios + ECHO). And it is
present in kernels at least after commit
d945cb9cce20ac7143c2de8d88b187f62db99bdc (pty: Rework the pty layer to
use the normal buffering logic) in 2.6.31-rc3.
js: add more info to the commit log
js: switch to bool
js: lock unconditionally
js: lock only the tty->ops->write call
References: CVE-2014-0196
Reported-and-tested-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Pulled from 3.10.23 stable for bug 110340.
From a9ded882d5168e2fd5c0c20e2874f85c56016b4b Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Fri, 28 Mar 2014 20:41:50 +0100
Subject: KVM: ioapic: fix assignment of ioapic->rtc_status.pending_eoi
(CVE-2014-0155)
commit 5678de3f15010b9022ee45673f33bcfc71d47b60 upstream.
QE reported that they got the BUG_ON in ioapic_service to trigger.
I cannot reproduce it, but there are two reasons why this could happen.
The less likely but also easiest one, is when kvm_irq_delivery_to_apic
does not deliver to any APIC and returns -1.
Because irqe.shorthand == 0, the kvm_for_each_vcpu loop in that
function is never reached. However, you can target the similar loop in
kvm_irq_delivery_to_apic_fast; just program a zero logical destination
address into the IOAPIC, or an out-of-range physical destination address.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
We have observed that on machines with all their memory in a single
node, it is possible to hit an out of memory situation where kernel
allocations (which can't use the CMA pool) fail, triggering the OOM
killer, yet reclaim doesn't start because there is still free memory
in the CMA pool. To alleviate this situation somewhat, this reduces
the default CMA pool size from 5% to 3% of system memory. The 3%
should still be enough in most situations, and if not, the user can
specify a different amount on the kernel command line.
This should help with BZ 110181.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
As Ben suggested, it's meaningful to dump the PE's location code
for site engineers when hitting EEH errors. The patch introduces
the function eeh_pe_loc_get() to retrieve the location code from the
device tree so that we can output it when hitting EEH errors.
If the primary PE bus is the root bus, the PHB's device node is
tried prior to the root port's device node. Otherwise, the upstream
bridge's device node of the primary PE bus is checked for the
location code directly.
This fixes BZ 109585. Please apply to the next build for GA.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The first bug is that we are testing the C (changed) bit in the hashed
page table without first doing a tlbie. The architecture allows the
update of the C bit to happen at any time up until we do a tlbie for
the page. However, we don't want to do a tlbie for every page on every
pass of a migration operation. Thus we do the tlbie if there are no
vcpus currently running, which would indicate the final phase of
migration. If any vcpus are running then reading the dirty log is
already racy because pages could get dirtied immediately after we
check them. Also, we don't need to do the tlbie if the HPT entry
doesn't allow writing, since in that case the C bit can not get set.
The second bug is that in the case where we see a dirty 16MB page
followed by a dirty 4kB page (both mapping to the same guest real
address), we return 1 rather than 16MB / PAGE_SIZE. The return value,
indicating the number of dirty pages, needs to reflect the largest
dirty page we come across, not the last dirty page we see.
Fixes: 109551 (this time for sure)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
The dirty map has one bit per system page (4K/64K), and when we populate
the dirty map, we reset the Change bit in the HPT, whose entry is expected
to map a page less than or equal to the system page size. This works until
we start using huge pages (16MB). In that case, we mark dirty just a single
system page and miss the rest of the 16MB page, which may be dirty as well.
This changes kvm_test_clear_dirty to return the actual number of pages
which is calculated from HPT entry.
This changes kvmppc_hv_get_dirty_log() to make pages dirty starting from
the rounded guest physical page number.
[paulus@samba.org - don't advance i in the loop to set dirty bits, so
that we make sure to clear C in all HPTEs.]
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
table.
We reserve 5% of total RAM for CMA allocation, and not using it can
result in us running out of NUMA node memory with certain
configurations. One caveat is that we may not get a node-local HPT with
a pinned-vcpu configuration. But currently libvirt also pins the vcpus
to a cpuset after creating the hash page table.
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
|
|
Commit 63fa7d4 ("powerpc/eeh: Escalate error on non-existing PE")
escalates the frozen state on a non-existing PE to a fenced PHB. It
was meant to improve kdump reliability. After that, commit 716a0e8
("powerpc/powernv: Reset PHB in kdump kernel") was introduced to
apply a complete reset on all PHBs to increase kdump reliability.
Commit 63fa7d4 is therefore no longer useful, and issuing a PHB reset
on a PHB that is not fenced at the hardware level would cause
unexpected problems. So I'd like to revert it.
It's in response to bug#109562.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
|
|
When we hit the corner case of a frozen parent and child PE at the
same time, we have to handle the frozen parent PE prior to the
child. Without clearing the frozen state on the parent PE, the child
PE can't be recovered successfully.
There are two ways (polling and interrupt) for a frozen PE to be
reported. If there is a frozen parent PE out there, we have to report
and handle it first.
It's in response to bug#109562.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
|
|
Since commit cb523e09 ("powerpc/eeh: Avoid I/O access during PE
reset"), the PE is kept in frozen state at the hardware level until
the PE reset is done completely. After that, we explicitly clear
the frozen state of the affected PE. However, there might be
frozen child PEs of the affected PE, and we need to clear their
frozen state as well. Otherwise, the recovery is going to fail.
It's in response to bug#109562.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
|
|
Currently we forward to the guest MCEs which have already been recovered,
and for unhandled errors we do not deliver the MCE to the guest. It looks
like, with no FWNMI support in qemu, the guest just panics whenever we
deliver the recovered MCEs to it. Also, the existing code used to return
to the host for unhandled errors, which was causing the guest to hang with
soft lockups inside the guest and made it difficult to recover the guest
instance.
This patch now forwards all fatal MCEs to the guest, causing the guest to
crash/panic. And for recovered errors we just go back to normal functioning
of the guest instead of returning to the host. This fixes the soft lockup
issues in the guest.
This patch also fixes an issue where guest MCE events were not logged to
host console.
This patch fixes bz108165 and bz108413
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
|
|
During split-core operations, one of the online CPUs is nominated as the
"master" and then stop_machine() is invoked to perform the split/unsplit
procedure. Between these 2 steps, if CPU hotplug occurs and takes the
just nominated "master" CPU offline, then the split/unsplit procedure
does not complete properly and leads to undesirable effects.
So protect the entire split-core operation with get/put_online_cpus()
to synchronize with CPU hotplug.
Fixes bz 105509.
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
On newer revisions (DD2.1 and higher), the hardware manages the resync
during split-core operations. So we don't need to call opal_resync_timebase()
on those systems.
Fixes bz 105856.
[Srivatsa: Added changelog]
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
We don't see the MCE counter getting incremented in /proc/interrupts, which
gives the false impression that no MCE occurred even when there were MCE
events.
The machine check early handling was added for PowerKVM and we missed
incrementing the MCE count in the early handler.
We also increment the MCE counters in the machine_check_exception call, but
in most cases where we handle the error the hypervisor never reaches there
unless the error is fatal and we want to crash. Only in a fatal situation
may we see a double increment of the MCE count. We need to fix that, but
for now it is always better to have some count incremented instead of zero.
This fixes the MCE count issue mentioned in bz108413
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
|
|
Without this, we get lockdep errors
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Currently machine check handler does not check for stack overflow for
nested machine check. If we hit another MCE while inside the machine check
handler repeatedly from the same address, then we risk a stack
overflow, which can cause huge memory corruption. This patch limits the
nested MCE level to 4 and panic when we cross level 4.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The size of the sysparam sysfs files is determined from the device tree
at boot. However the buffer is hard coded to 64 bytes. If we encounter a
parameter that is larger than 64 bytes, or mis-parse the device tree, the
buffer will overflow when reading or writing to the parameter.
Check it at discovery time, and if the parameter is too large, do not
create a sysfs entry for it.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The sysparam code currently uses the userspace supplied number of
bytes when memcpy()ing in to a local 64-byte buffer.
Limit the maximum number of bytes by the size of the buffer.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The OPAL calls are returning int64_t values, which the sysparam code
stores in an int, and the sysfs callback returns ssize_t. Make the code
easier to read by consistently using ssize_t.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
When a sysparam query in OPAL returned a negative value (error code),
sysfs would spew out a decent chunk of memory; almost 64K more than
expected. This was traced to a sign/unsigned mix up in the OPAL sysparam
sysfs code at sys_param_show.
The return value of sys_param_show is a ssize_t, calculated using
return ret ? ret : attr->param_size;
Alan Modra explains:
"attr->param_size" is an unsigned int, "ret" an int, so the overall
expression has type unsigned int. Result is that ret is cast to
unsigned int before being cast to ssize_t.
Instead of using the ternary operator, set ret to the param_size if an
error is not detected. The same bug exists in the sysfs write callback;
this patch fixes it in the same way.
A note on debugging this next time: on my system gcc will warn about
this if compiled with -Wsign-compare, which is not enabled by -Wall,
only -Wextra.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Today CPUs in fast sleep are being woken up to handle their timers
by the tick broadcast framework using a hrtimer queued on a nominated
broadcast CPU. The hrtimer is programmed for the earlier of the next
wakeup and a broadcast period which happens to be a jiffy. This
programming is being done incorrectly today. The current time
is noted, the tick broadcast interrupt handler is called, then the
time at which the hrtimer needs to be programmed is decided. By
then the noted current time would be stale and the hrtimer would
be forwarded much further ahead than required, leading to delayed
broadcast interrupts being delivered to sleeping cpus.
Fix this by noting the current time just before programming the hrtimer.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
|
|
Current code does not check for unhandled/unrecovered errors and returns
from the interrupt as if it were a recoverable exception, which in turn
triggers the same machine check exception in a loop, causing the
hypervisor to become unresponsive.
This patch fixes this situation and forces hypervisor to panic for
unhandled/unrecovered errors.
This patch also fixes another issue where unrecoverable_exception routine
was called in real mode in case of unrecoverable exception (MSR_RI = 0).
This causes another exception vector 0x300 (data access) during system crash
leading to confusion while debugging cause of the system crash.
With the above fixes we now throw correct console messages (see below) while
crashing the system in case of unhandled/unrecoverable machine checks.
--------------
Severe Machine check interrupt [Not recovered]
Initiator: CPU
Error type: UE [Instruction fetch]
Effective address: 0000000030002864
Oops: Machine check, sig: 7 [#1]
SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: bork(O) bridge stp llc kvm [last unloaded: bork]
CPU: 36 PID: 55162 Comm: bash Tainted: G O 3.14.0mce #1
task: c000002d72d022d0 ti: c000000007ec0000 task.ti: c000002d72de4000
NIP: 0000000030002864 LR: 00000000300151a4 CTR: 000000003001518c
REGS: c000000007ec3d80 TRAP: 0200 Tainted: G O (3.14.0mce)
MSR: 9000000000041002 <SF,HV,ME,RI> CR: 28222848 XER: 20000000
CFAR: 0000000030002838 DAR: d0000000004d0000 DSISR: 00000000 SOFTE: 1
GPR00: 000000003001512c 0000000031f92cb0 0000000030078af0 0000000030002864
GPR04: d0000000004d0000 0000000000000000 0000000030002864 ffffffffffffffc9
GPR08: 0000000000000024 0000000030008af0 000000000000002c c00000000150e728
GPR12: 9000000000041002 0000000031f90000 0000000010142550 0000000040000000
GPR16: 0000000010143cdc 0000000000000000 00000000101306fc 00000000101424dc
GPR20: 00000000101424e0 000000001013c6f0 0000000000000000 0000000000000000
GPR24: 0000000010143ce0 00000000100f6440 c000002d72de7e00 c000002d72860250
GPR28: c000002d72860240 c000002d72ac0038 0000000000000008 0000000000040000
NIP [0000000030002864] 0x30002864
LR [00000000300151a4] 0x300151a4
Call Trace:
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
---[ end trace 7285f0beac1e29d3 ]---
Sending IPI to other CPUs
IPI complete
OPAL V3 detected !
--------------
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The PPC_MSG_TIMER IPI message slot was introduced for the tick broadcast
IPIs which are required to wakeup sleeping CPUs. The decrementer of the
CPUs that enter fast sleep stops as a consequence of entering the idle
state. Therefore such CPUs have to be woken up in time to handle their
timers by a broadcast CPU which sends the PPC_MSG_TIMER IPIs to them.
This IPI message is being parsed wrongly in smp_ipi_demux(). Thus the
tick broadcast interrupt handler is never executed on the sleeping CPU.
This could have led to unpleasant side effects like not handling timers
in time on the sleeping cpus. But since the sleeping CPUs still receive
the tick broadcast IPI, they are awoken from the idle state and their
decrementers are back in action.
As a result, it's possible that they manage to handle timers
before they go to sleep again.
Hence timers are being handled on the sleeping cpus, although the tick
broadcast interrupt handler, which is actually supposed to ensure that,
is never called today due to the wrong number of shift bits used while
parsing the tick broadcast IPI.
However we need to note that as a result of this discrepancy, timer
handling on the sleeping cpus may be unstable. This could be one of
the reasons we are observing some softlockups in the cpuidle wakeup path.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
PCI resets will attempt to take the device_lock for any device to be
reset. This is a problem if that lock is already held, for instance
in the device remove path. It's not sufficient to simply kill the
user process or skip the reset if called after .remove as a race could
result in the same deadlock. Instead, we handle all resets as "best
effort" using the PCI "try" reset interfaces. This prevents the user
from being able to induce a deadlock by triggering a reset.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
(cherry picked from commit 890ed578df82f5b7b5a874f9f2fa4f117305df5f)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
LTC-Bugzilla: #104951
|
|
When doing a function/slot/bus reset PCI grabs the device_lock for each
device to block things like suspend and driver probes, but call paths exist
where this lock may already be held. This creates an opportunity for
deadlock. For instance, vfio allows userspace to issue resets so long as
it owns the device(s). If a driver unbind .remove callback races with
userspace issuing a reset, we have a deadlock as userspace gets stuck
waiting on device_lock while another thread has device_lock and waits for
.remove to complete. To resolve this, we can make a version of the reset
interfaces which use trylock. With this, we can safely attempt a reset and
return error to userspace if there is contention.
[bhelgaas: the deadlock happens when A (userspace) has a file descriptor for
the device, and B waits in this path:
driver_detach
device_lock # take device_lock
__device_release_driver
pci_device_remove # pci_bus_type.remove
vfio_pci_remove # pci_driver .remove
vfio_del_group_dev
wait_event(vfio.release_q, !vfio_dev_present) # wait (holding device_lock)
Now B is stuck until A gives up the file descriptor. If A tries to acquire
device_lock for any reason, we deadlock because A is waiting for B to release
the lock, and B is waiting for A to release the file descriptor.]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
(cherry picked from commit 61cf16d8bd38c3dc52033ea75d5b1f8368514a17)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
LTC-Bugzilla: #104951
|
|
When PCI_ERS_RESULT_CAN_RECOVER is returned from device drivers, the
EEH core should enable I/O and DMA for the affected PE. However,
enabling DMA was missed in eeh_handle_normal_event().
Besides, the frozen state of the affected PE should be cleared
after successful recovery, but we didn't do that.
The patch fixes both of the issues above, in response to
bug #105179.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
|
|
As pointed by Alexey, we're going to hit build failure without
exporting the functions when (CONFIG_VFIO_PCI == M). It should
be part of commit 9762b50 ("drivers/vfio/pci: Fix MSIx message
lost").
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
|
|
At present, if a PR guest on a POWER8 machine tries to access some
disabled functionality such as transactional memory, the result is
a facility-unavailable interrupt, which isn't handled in
kvmppc_handle_exit_pr(), resulting in a call to BUG(), crashing
the PR host kernel.
This adds code to handle the facility-unavailable interrupts and
give the guest an illegal instruction interrupt, instead of crashing
the PR host.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This provides basic support for the KVM_REG_PPC_ARCH_COMPAT register
in PR KVM. At present the value is sanity-checked when set, but
doesn't actually affect anything yet.
Implementing this makes it possible to use a qemu command-line
argument such as "-cpu host,compat=power7" on a POWER8 machine,
just as we would with HV KVM.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
The Power ISA states that an mtspr or mfspr to/from an unimplemented
SPR should be a no-op in privileged mode, rather than causing an
program interrupt (0x700 vector), with the exception of mtspr to SPR 0
and mfspr from SPRs 0, 4, 5 or 6.
Currently our SPR emulation code doesn't follow this rule. This
modifies the code in kvmppc_core_emulate_m[ft]spr_pr() to check
the PR bit in the MSR when we detect an unknown SPR number, and
only return EMULATE_FAIL (which results in a program interrupt)
if PR is set or the SPR number is one of the ones which are
specifically defined to cause a program interrupt.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
It's possible that the tick_broadcast_force_mask contains cpus which are not
in cpu_online_mask when a broadcast tick occurs. This could happen under the
following circumstance, assuming CPU1 is among the CPUs waiting for broadcast
and is the cpu being hotplugged out.
CPU0 CPU1
Run CPU_DOWN_PREPARE notifiers
Start stop_machine Gets woken up by IPI to run
stop_machine, sets itself in
tick_broadcast_force_mask if the
time of broadcast interrupt is around
the same time as this IPI.
Start stop_machine
set_cpu_online(cpu1, false)
End stop_machine End stop_machine
Broadcast interrupt
Finds that cpu1 in
tick_broadcast_force_mask is offline
and triggers the WARN_ON in
tick_handle_oneshot_broadcast()
Clears all broadcast masks
in CPU_DEAD stage.
While the hotplugged cpu clears its bit in the tick_broadcast_oneshot_mask
and tick_broadcast_pending mask during BROADCAST_EXIT, it *sets* its bit
in the tick_broadcast_force_mask if the broadcast interrupt is found to be
around the same time as the present time. Today we clear all the broadcast
masks and shutdown tick devices in the CPU_DEAD stage. But as shown above
the broadcast interrupt could occur before this stage is reached and the
WARN_ON() gets triggered when it is found that the tick_broadcast_force_mask
contains an offline cpu.
Please note that a scenario such as above will occur *only if the broadcast
interrupt is delayed under some circumstance*. Ideally the broadcast interrupt
in the above scenario should have occurred before we reach the irq_disabled
stage of stop_machine and should have seen a valid broadcast mask. But for
some reason that is yet to be understood it is getting delayed leading to the
above scenario.
Besides this, another point to notice is that for a small duration between
the CPU_DYING stage, where the hotplugged cpu clears its bit from the
cpu_online_mask, and the CPU_DEAD stage, where the broadcast_force_mask gets
cleared of the same, these two masks are out of sync with each other,
thus triggering the above scenario.
The temporary solution to this is to move the clearing of broadcast masks to
the CPU_DYING notification stage. The reason is, it is during this stage that
the hotplugged cpu clears itself from the cpu_online_mask and runs
notifications relevant to this stage including those to clear the broadcast masks
(with this patch).
All this, while the rest of the cpus are busy spinning in stop_machine to notice
this change. By the time this stage ends and all cpus resume work, the hotplugged
cpu would have cleared itself from the cpu_online_mask and the broadcast cpu mask
thus keeping them in sync with each other at such times when the rest of the cpus
can read these masks.
Since the above mentioned delay in the broadcast interrupt has not triggered
any soft lockups so far, we are assuming it's a non-fatal issue and have this
patch to prevent the warning from popping up in this case.
Suggested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@au1.ibm.com>
|
|
The change to increment _mapcount was added w.r.t. THP commit
3526741f0964c88bc2ce511e1078359052bf225b. It was later fixed
to handle the hugetlb case in 44518d2b32646e37b4b7a0813bbbe98dc21c7f8f.
Instead of backporting 44518, we can remove the _mapcount update, since
we don't support THP for the kvm host yet.
Fixes: bz# 108558
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
|
|
In the kdump scenario, the first kernel doesn't shut down PCI devices
and the kdump kernel cleans the PHB IODA table at early probe time.
That means the kdump kernel can't support PCI transactions piled
up by the first kernel. Otherwise, lots of EEH errors and frozen PEs
will be detected.
In order to avoid the EEH errors, the PHB is reset to drop all
PCI transactions from the first kernel. It looks good on P7, but needs
to be verified on P8.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The problem was initially reported by Wendy, who tried to pass through
an IPR adapter, which was connected to a PHB root port directly, to a KVM
based guest. When doing that, pci_reset_bridge_secondary_bus() was
called by the VFIO driver and linkDown was detected by the root port.
That caused all PEs to be frozen.
The patch fixes the issue by routing the reset for the secondary bus
of the root port to the underlying firmware. For that, one more weak function
pci_reset_secondary_bus() is introduced so that individual platforms
can override it and do a specific reset for the bridge's secondary bus.
Reported-by: Wendy Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Basically, we have 3 types of resets to fulfil PE reset: fundamental,
hot and PHB reset. For the latter 2 cases, we need the PCI bus reset hold
and settlement delays as specified by the PCI spec. The PowerNV and pSeries
platforms run on top of different firmware, and some of the
delays are already covered by the underlying firmware (PowerNV).
The patch unifies the delays to be done in the backend, instead of
the EEH core.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Resetting the root port involves more than resetting PCIe switch
ports, and resetting the root port should be done in firmware instead
of in the kernel itself. The problem was introduced by commit 5b2e198e
("powerpc/powernv: Rework EEH reset").
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
In pseries_eeh_get_state(), EEH_STATE_UNAVAILABLE is always
overwritten by EEH_STATE_NOT_SUPPORT because of a missing
"break" there. The patch fixes the issue.
Reported-by: Joe Perches <joe@perches.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Once a specific PE has been marked as EEH_PE_ISOLATED, it's in
the middle of recovery or removed permanently. We needn't report
the frozen PE again. Otherwise, we will have endless reporting
of the same frozen PE.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The issue was detected in a bit complicated test case where
we have multiple hierarchical PEs shown as following figure:
+-----------------+
| PE#3 p2p#0 |
| p2p#1 |
+-----------------+
|
+-----------------+
| PE#4 pdev#0 |
| pdev#1 |
+-----------------+
PE#4 (with 2 PCI devices) is the child of PE#3, which has 2 p2p
bridges. We accidentally hit a less-known scenario: PE#4 was removed
permanently from the system because of a permanent failure (e.g.
exceeding the max allowed failure count in the last hour), then we detected
EEH errors on PE#3 and tried to recover it. However, the eeh_dev instances
for pdev#0/1 were not detached from PE#4, which was still connected to
PE#3. All of that was because we rely on the count-based
pcibios_release_device(), which isn't reliable enough. When doing
recovery for PE#3, we still applied hotplug on PE#4 and pdev#0/1, which
were not valid any more. Eventually, we ran into a kernel crash.
The patch fixes the above issue from two aspects. For unplug, we simply
skip those permanently removed PEs, whose state is (EEH_PE_STATE_ISOLATED
&& !EEH_PE_STATE_RECOVERING) and whose frozen count is greater
than EEH_MAX_ALLOWED_FREEZES. For plug, we mark all permanently
removed EEH devices with EEH_DEV_REMOVED and return 0xFF's on reads
of their PCI config space so that the PCI core will omit them.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The patch introduces the boot argument "eeh=off" to disable EEH functionality.
Also, it creates /sys/kernel/debug/powerpc/eeh_enable to disable
or enable EEH functionality. By default, the functionality is
enabled.
For the PowerNV platform, we restore the conventional
mechanism of clearing frozen PEs during PCI config access if
EEH functionality is disabled. Conversely, if it is enabled, we
rely on EEH for error recovery.
The patch also fixes the issue that the case of disabled EEH
functionality wasn't covered in ioda_eeh_event(). Those
events driven by interrupt should be cleared to avoid endless
reporting.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
There're 2 EEH subsystem variables: eeh_subsystem_enabled and
eeh_probe_mode. We needn't maintain 2 variables and we can just
have one variable and introduce different flags. The patch also
introduces the additional flag EEH_FORCE_DISABLED, which will be used
to disable the EEH subsystem via the boot parameter ("eeh=off") in future.
Besides, the patch introduces the flag EEH_ENABLED, which can be
changed to disable or enable EEH functionality on the fly through
a debugfs entry in future.
With the patch applied, the criteria to check whether EEH
functionality is enabled becomes:
!EEH_FORCE_DISABLED && EEH_ENABLED : Enabled
Other cases : Disabled
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
When calling into eeh_gather_pci_data() on the pSeries platform, we
possibly don't have a pci_dev instance yet, but the eeh_dev is always
ready. So we use the cached capability from eeh_dev instead of pci_dev
for the log dump there. In order to keep things unified, we also cache
the PCI capability positions in eeh_dev for PowerNV as well.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The patch replaces printk(KERN_WARNING ...) with pr_warn() in the
function eeh_gather_pci_data().
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
We have suffered recursive frozen PEs a lot, caused
by IO accesses during PE reset. Ben came up with the good
idea to keep the PE frozen until recovery (BAR restore) is done.
With that, IO accesses during PE reset are dropped by hardware
and won't incur recursive frozen PEs any more.
The patch implements the idea. We don't clear the frozen state
until PE reset is done completely. During that period, the EEH
core expects an unfrozen state from the backend to keep going. So we
reuse the EEH_PE_RESET flag, which is set during PE
reset, to return the normal state from the backend. The side effect is
that we clear the frozen state twice (by PE reset and then
explicitly), but that's harmless.
We have some limitations on pHyp. pHyp doesn't allow enabling
IO or DMA for an unfrozen PE, so we don't enable them on unfrozen PEs
in eeh_pci_enable(). We have to enable IO before grabbing logs on
pHyp; otherwise, 0xFF's are always returned from PCI config space.
Also, we had a wrong return value from eeh_pci_enable() for the
EEH_OPT_THAW_DMA case. The patch fixes that too.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The EEH PowerNV backends need to use their own PCI config
accessors, as the normal ones could be blocked during PE reset.
The patch also removes the unnecessary parameter "hose" from the
function ioda_eeh_bridge_reset().
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
We've observed multiple PE reset failures because of PCI-CFG
access during that period. Potentially, some device drivers
can't support EEH very well and they can't put the device to
motionless state before PE reset. So those device drivers might
produce PCI-CFG accesses during PE reset. Also, we could have
PCI-CFG accesses from user space (e.g. "lspci"). Since access to
a frozen PE should return 0xFF's, we can block PCI-CFG access
during the period of PE reset so that we won't get recursive EEH
errors.
The patch adds flag EEH_PE_RESET, which is kept during PE reset.
The PowerNV/pSeries PCI-CFG accessors then use the flag to block
PCI-CFG accesses accordingly.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
When doing PE reset, EEH_PE_ISOLATED is cleared unconditionally.
However, we should only clear it if the PE reset has cleared the
frozen state successfully; otherwise, the flag should be kept.
The patch fixes the issue.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
For some fields (e.g. LEM, MMIO, DMA) in the PHB diag-data dump, it's
meaningless to print them merely because the corresponding mask
registers hold non-zero values, since the mask registers are always
non-zero. The patch only prints those fields if we have non-zero
values in the primary registers (e.g. LEM, MMIO, DMA status) so that
we can save a couple of lines. The patch also removes the unnecessary
blank line before "brdgCtl:" and the two leading spaces prefixing each
line, as Ben suggested.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Commit c7062d83fe7b ("powerpc/ppc64: Do not turn AIL (reloc-on
interrupts) too early") added code to set the AIL bit in the LPCR
without checking whether the kernel is running in hypervisor mode.
The result is that when the kernel is running as a guest (i.e.,
under PowerKVM or PowerVM), the processor takes a privileged
instruction interrupt at that point, causing a panic. The visible
result is that the kernel hangs after printing "returning from
prom_init".
This fixes it by checking for hypervisor mode being available
before setting LPCR. If we are not in hypervisor mode, we enable
relocation-on interrupts later in pSeries_setup_arch using the
H_SET_MODE hcall.
This fixes BZ 108728.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
When the guest cedes the vcpu, or the vcpu has no guest to
run, it naps. Clear the runlatch bit of the vcpu before
napping to indicate an idle cpu.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
The secondary threads in the core have their runlatch bits cleared since they
are offline. When the secondary threads are called in to start a guest their
runlatch bits need to be set to indicate that they are busy. The primary
thread has its runlatch bit set though, but there is no harm in setting this
bit once again. Hence set the runlatch bit for all threads before they start
a guest.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
Up until now we have been setting the runlatch bits for a busy CPU and
clearing it when a CPU enters idle state. The runlatch bit has thus
been consistent with the utilization of a CPU as long as the CPU is online.
However when a CPU is hotplugged out the runlatch bit is not cleared. It
needs to be cleared to indicate an unused CPU. OCC consumes the runlatch bit
to decide the utilization of a thread and ends up seeing the offline threads
as busy. Hence this patch has the runlatch bit cleared for an offline CPU
just before entering an idle state and sets it immediately after it exits
the idle state.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
The issue happens in a dual controller configuration. We got
sysfs warnings when rmmod'ing the ipr module.
enclosure_unregister() in drivers/misc/enclosure.c calls device_unregister()
for each component device; device_unregister() -> device_del() -> kobject_del()
-> sysfs_remove_dir(). In sysfs_remove_dir(), kobj->sd is set to NULL.
Then, for each component device, enclosure_component_release() ->
enclosure_remove_links() -> sysfs_remove_link() checks kobj->sd again, which
was already set to NULL during device_unregister(). So we see all these
sysfs WARNINGs.
sysfs: can not remove 'enclosure_device: P1-D1 2SS6', no directory
------------[ cut here ]------------
WARNING: at fs/sysfs/inode.c:325
Modules linked in: fuse loop dm_mod ses enclosure ipr(-) ipv6 ibmveth libata sg ext3 jbd mbcache sd_mod crc_t10dif crct10dif_common ibmvscsi scsi_transport_srp scsi_tgt scsi_dh_rdac scsi_dh_emc scsi_dh_hp_sw scsi_dh_alua scsi_dh scsi_mod
CPU: 0 PID: 4006 Comm: rmmod Not tainted 3.12.0-scsi-0.11-ppc64 #1
task: c0000000f769aba0 ti: c0000000f8f9c000 task.ti: c0000000f8f9c000
NIP: c0000000002b038c LR: c0000000002b0388 CTR: 0000000000000000
REGS: c0000000f8f9ee70 TRAP: 0700 Not tainted (3.12.0-scsi-0.11-ppc64)
MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 28008444 XER: 20000000
SOFTE: 1
CFAR: c000000000736118
GPR00: c0000000002b0388 c0000000f8f9f0f0 c0000000010ed630 0000000000000047
GPR04: c000000001502628 c000000001513010 0000000000000689 652027656e636c6f
GPR08: 737572655f646576 c000000000ae2b7c 0000000000a20000 c000000000add630
GPR12: 0000000028008442 c000000007f20000 0000000000000000 0000000010146920
GPR16: 00000000100cb9d8 0000000010093088 0000000010146920 0000000000000000
GPR20: 0000000000000000 0000000010161900 00000000100ce458 0000000000000000
GPR24: 0000000010161940 0000000000000000 d0000000046ad440 0000000000000000
GPR28: c0000000f8f9f270 0000000000000000 c0000000fcb882c8 0000000000000000
NIP [c0000000002b038c] .sysfs_hash_and_remove+0xe4/0xf0
LR [c0000000002b0388] .sysfs_hash_and_remove+0xe0/0xf0
Call Trace:
[c0000000f8f9f0f0] [c0000000002b0388] .sysfs_hash_and_remove+0xe0/0xf0 (unreliable)
[c0000000f8f9f190] [c0000000002b4134] .sysfs_remove_link+0x24/0x60
[c0000000f8f9f200] [d000000004df037c] .enclosure_remove_links+0x64/0xa0 [enclosure]
[c0000000f8f9f2d0] [d000000004df0518] .enclosure_component_release+0x30/0x60 [enclosure]
[c0000000f8f9f350] [c000000000540068] .device_release+0x50/0xd8
[c0000000f8f9f3d0] [c0000000003b6f80] .kobject_cleanup+0xb8/0x230
[c0000000f8f9f460] [c00000000053f404] .put_device+0x1c/0x30
[c0000000f8f9f4d0] [d000000004df0db0] .enclosure_unregister+0xa0/0xe8 [enclosure]
[c0000000f8f9f560] [d000000004f90094] .ses_intf_remove_enclosure+0x8c/0xa8 [ses]
[c0000000f8f9f5f0] [c0000000005413ec] .device_del+0xf4/0x268
[c0000000f8f9f680] [c000000000541594] .device_unregister+0x34/0x88
[c0000000f8f9f700] [d000000001423d3c] .__scsi_remove_device+0x104/0x128 [scsi_mod]
[c0000000f8f9f780] [d00000000141eff8] .scsi_forget_host+0x70/0xa0 [scsi_mod]
[c0000000f8f9f800] [d000000001413dc0] .scsi_remove_host+0x88/0x178 [scsi_mod]
[c0000000f8f9f890] [d00000000469db5c] .ipr_remove+0x7c/0xf8 [ipr]
[c0000000f8f9f920] [c0000000003fe1f4] .pci_device_remove+0x64/0xf0
[c0000000f8f9f9b0] [c000000000544f10] .__device_release_driver+0xd0/0x158
[c0000000f8f9fa40] [c0000000005450d8] .driver_detach+0x140/0x148
[c0000000f8f9fae0] [c000000000543848] .bus_remove_driver+0xe0/0x188
[c0000000f8f9fb70] [c00000000054628c] .driver_unregister+0x3c/0x80
[c0000000f8f9fbf0] [c0000000003fe35c] .pci_unregister_driver+0x34/0xe8
[c0000000f8f9fc90] [d0000000046a5fb4] .ipr_exit+0x2c/0x44 [ipr]
[c0000000f8f9fd20] [c0000000001359dc] .SyS_delete_module+0x204/0x308
[c0000000f8f9fe30] [c000000000009f60] syscall_exit+0x0/0xa0
Instruction dump:
e8010010 eb81ffe0 7c0803a6 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 3c62ff8a
7ca42b78 3863c388 48485d45 60000000 <0fe00000> 3860fffe 4bffff94 fba1ffe8
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
|
|
commit 789b5e0315284463617e106baad360cb9e8db3ac upstream.
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
register_cpu_notifier(&foobar_cpu_notifier);
put_online_cpus();
This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).
Interestingly, the raid5 code can actually prevent double initialization and
hence can use the following simplified form of callback registration:
register_cpu_notifier(&foobar_cpu_notifier);
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
put_online_cpus();
A hotplug operation that occurs between registering the notifier and calling
get_online_cpus() won't disrupt anything, because the code takes care to
perform the memory allocations only once.
So reorganize the code in raid5 this way to fix the deadlock with callback
registration.
This fixes BZ 103213.
Cc: linux-raid@vger.kernel.org
Fixes: 36d1c6476be51101778882897b315bd928c8c7b5
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
free_scratch_buffer() helper to condense code further and wrote the changelog.]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Commit 2775d6230 (md: Avoid deadlock in raid5_alloc_percpu) only partially
fixed the deadlock involving CPU hotplug notifiers. In particular, it fixed
the deadlock possibility in register_cpu_notifier(), but left the deadlock
in unregister_cpu_notifier() unfixed. So revert this commit so that we can
fix both the deadlocks properly, using the solution that was accepted
upstream.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
kvm_vfio_spapr_tce_release was spelled as ikvm_vfio_ispapr_tce_release
which caused compilation to break in case of CONFIG_KVM_VFIO=n. Fix it.
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
The global_invalidates() function contains a check that is intended
to tell whether we are currently executing in the context of a hypercall
issued by the guest. The reason is that the optimization of using a
local TLB invalidate instruction is only valid in that context. The
check was testing local_paca->kvm_hstate.kvm_vcore, which gets set
when entering the guest but no longer gets cleared when exiting the
guest. To fix this, we use the kvm_vcpu field instead, which does
get cleared when exiting the guest, by the kvmppc_release_hwthread()
calls inside kvmppc_run_core().
The effect of having the check wrong was that when kvmppc_do_h_remove()
got called from htab_write() on the destination machine during a
migration, it cleared the current cpu's bit in kvm->arch.need_tlb_flush.
This meant that when the guest started running in the destination VM,
it may miss out on doing a complete TLB flush, and therefore may end
up using stale TLB entries from a previous guest that used the same
LPID value.
This should make migration more reliable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
The OPAL log is now accessed through sysfs at /sys/firmware/opal/msglog,
so remove the old and buggy debugfs file.
Signed-off-by: Joel Stanley <joel@jms.id.au>
|
|
Create a driver attribute named "cpuinfo_nominal_freq" which will in
turn create a read-only sysfs interface that will be used to export
the nominal frequency to the userspace. This will be necessary for
creating an optimal "performance" policy which should be running the
on-demand governor with "scaling_max_freq" to be set to the value
exported via "cpuinfo_max_freq" and "scaling_min_freq" to be set to
the nominal frequency exported via "cpuinfo_nominal_freq".
The patch caches the values of the max, min and nominal pstate ids and
nr_pstates queried from the DT during the initialization of the driver
so that they can be used in other places in the driver for
validation.
Also, it adds a helper method that returns the frequency corresponding to
a pstate id.
This has been backported from the version posted against mainline
which can be found here:
https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg76990.html
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
|
|
We had added the debug prints to confirm the idle state exit
by the cpus. This was mainly to test if fast sleep was working
fine. Now that we are confident about its functioning we
can get rid of these prints.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
|
|
This reworks the opal message log following upstream review. A bug was
fixed where wrapped logs were not read correctly, and locking was added
to reduce the impact of races between reading counters and the buffer
contents.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
In autogroup_create(), a tg is allocated and added to the task_groups
list. If CONFIG_RT_GROUP_SCHED is set, this tg is then modified while on
the list, without locking. This can race with someone walking the list,
like __enable_runtime() during CPU unplug, and result in a use-after-free
bug.
To fix this, move sched_online_group(), which adds the tg to the list,
to the end of the autogroup_create() function after the modification.
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1369411669-46971-2-git-send-email-gerald.schaefer@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
(cherry picked from commit 41261b6a832ea0e788627f6a8707854423f9ff49)
|
|
The firmware can notify us when new input data is available, so
let's make sure we wakeup the HVC thread in that case.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
opal_notifier_register() is missing a matching "unregister" variant
and should be exposed to modules.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Turn them on at the same time as we allow MSR_IR/DR in the paca
kernel MSR, ie, after the MMU has been setup enough to be able
to handle relocated access to the linear mapping.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
If we take an interrupt such as a trap caused by a BUG_ON before the
MMU has been set up, the interrupt handlers try to enable virtual mode
and cause a recursive crash, making the original problem very hard
to debug.
This fixes it by adjusting the "kernel_msr" value in the PACA so that
it only has MSR_IR and MSR_DR (translation for instruction and data)
set after the MMU has been initialized for the processor.
We may still not have a console yet but at least we don't get into
a recursive fault (and early debug console or memory dump via JTAG
of the kernel buffer *will* give us the proper error).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This call will not be understood by OPAL, and will cause it to add an error
to its log. Among other things, this is useful for testing the
behaviour of the log as it fills up.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
OPAL provides an in-memory circular buffer containing a message log
populated with various runtime messages produced by the firmware.
Provide a sysfs interface /sys/firmware/opal/messages for userspace to
view the messages.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
We had a mix & match of flags used when creating legacy ports
depending on where we found them in the device-tree. Among others,
we were missing UPF_SKIP_TEST for some kinds of ISA ports, which is
a problem as quite a few UARTs out there don't support the loopback
test (such as a lot of BMCs).
Let's pick the set of flags used by the SoC code and generalize it
which means autoconf, no loopback test, irq maybe shared and fixed
port.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Helps debug funky firmware issues
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Commit b1022fbd293564de91596b8775340cf41ad5214c and subsequent ones
(in 3.10) introduced some preparatory changes for THP which consist
of trying to read the actual HPTE page size from the hash table to
perform the right variant of tlbie. However this has two issues:
- The hash entry can have been evicted and replaced by another
one with a different page size. This can in turn cause us to use
an impossible combination of psize and actual_psize, in turn
causing tlbie to be called with an invalid LP bit combination
causing a HW checkstop
- The whole business is unnecessary as in 3.10 we don't have THP
and thus always have psize == actual_psize
When THP was actually enabled in 3.11, we discovered that this wasn't
going to work and changed the code significantly to pass the proper
actual_psize from the upper layers rather than trying to deduce it
from the HPTE.
However, we didn't "fix" 3.10 as we didn't realize that the bug
introduced an exposure without THP being enabled.
If a user page was hashed as a 64k page, and later got evicted from
the hash and replaced with a 4k hash entry (due to a segment being
demoted to 4k, for example by subpage protection or because it's
an IO page), we could get into a situation where we tried to
do a tlbie with a psize of 64k and actual_psize of 4k which is
deadly.
This is a 3.10-only fix for this situation which essentially removes
the actual_psize business from the normal updatepp and invalidate
path in hash_native_64.c since we know on 3.10 that the psize coming
from the upper levels is always correct (no THP).
As such it's a partial revert of b1022fbd293564de91596b8775340cf41ad5214c
(we don't touch the bolted path etc... those should be fine and we
want to minimize churn).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Fill in asic family specific versions rather than
using the generic version. This lets us handle asic
specific differences more easily. In this case, we
disable sw swapping of the rptr writeback value on
r6xx+ since the hw does it for us. Fixes bogus
rptr readback on BE systems.
v2: remove missed cpu_to_le32(), add comments
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(backported from commit ea31bf697d27270188a93cd78cf9de4bc968aca3)
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
LTC-Bugzilla: #99530
|
|
Now that we have callbacks for [rw]ptr handling we can
remove the special handling for the DMA rings and use
the callbacks instead.
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(backported from commit 2e1e6dad6a6d437e4c40611fdcc4e6cd9e2f969e)
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
LTC-Bugzilla: #99530
|
|
The hardware just doesn't support this correctly.
Disable it before we accidentally write anywhere we shouldn't.
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(backported from commit 02c9f7fa4e7230fc4ae8bf26f64e45aa76011f9c)
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
LTC-Bugzilla: #99530
|
|
Give the ring functions a separate structure and let the asic
structure point to the ring specific functions. This simplifies
the code and allows us to make changes at only one point.
No change in functionality.
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(backported from commit 76a0df859defc53e6cb61f698a48ac7da92c8d84)
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
LTC-Bugzilla: #99530
|
|
Add callbacks to the radeon_asic struct to handle
rptr/wptr fetches and wptr updates.
We currently use one version for all rings, but this
allows us to override with ring-specific versions.
Needed for compute rings on CIK.
v2: update as per Christian's comments
v3: fix some rebase cruft
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit f93bdefe6269067afc85688d45c646cde350e0d8)
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
LTC-Bugzilla: #99530
|
|
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
|
|
While checking the powersaving mode in the machine check handler at 0x200, we
clobber the CFAR register. Fix it by saving and restoring it across the beq/bgt.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
|
|
The branch target should be the function address, not the address of the
func_descr_t. So use ppc_function_entry() to generate the right target address.
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Fix up the 'sysfs' file duplication by passing the initialized
char array to the strncpy() function, as the result is not %NUL-terminated
if the source exceeds 'copy_length' bytes.
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
|
|
Add the appropriate definition and table entry for new hardware support.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch adds formatting error overlay 0x21 to improve debug capabilities.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
There is no need to call pci_disable_msi() or pci_disable_msix()
in case the call to pci_enable_msi() or pci_enable_msix() failed.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Acked-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
If, when the ipr driver loads, the adapter is in an EEH error state,
it will currently oops and not be able to recover, as it attempts
to access memory that has not yet been allocated. We've seen this
occur in some kexec scenarios. The following patch fixes the oops
and also allows the driver to recover from these probe time EEH errors.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Add the appropriate definition and table entry for new hardware support.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch removes the extended delay bit on GSCSI read/write ops;
performance will be significantly better.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Acked-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
|
|
Add the appropriate definitions and table entries for new adapter support.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
|
|
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Acked-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
|
|
The 'ctl' field of the 'struct ata_taskfile' is not really dual purpose, i.e.
it is not intended for storing the alternate status register (which is mapped
at the same address in the legacy IDE controllers) in the qc_fill_rtf() method.
No other 'libata' driver except 'drivers/scsi/ipr.c' stores the alternate status
register's value in the 'ctl' field of 'qc->result_tf', hence this driver should
not do this as well...
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
|
|
Currently we save the host PMU configuration, counter values, etc.,
when entering a guest, and restore it on return from the guest.
(We have to do this because the guest has control of the PMU while
it is executing.) However, we missed saving/restoring the SIAR and
SDAR registers, as well as the registers which are new on POWER8,
namely SIER and MMCR2.
This adds code to save the values of these registers when entering
the guest and restore them on exit. This also works around the bug
where setting PMAE with a counter already negative doesn't generate
an interrupt. This was already worked around for the guest PMU state
in an earlier commit, and is worked around for the host PMU state here.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This adds workarounds for two hardware bugs in the POWER8 performance
monitor unit (PMU), both related to interrupt generation. The effect
of these bugs is that PMU interrupts can get lost, leading to tools
such as perf reporting fewer counts and samples than they should.
The first bug relates to the PMAO (perf. mon. alert occurred) bit in
MMCR0; setting it should cause an interrupt, but doesn't. The other
bug relates to the PMAE (perf. mon. alert enable) bit in MMCR0.
Setting PMAE when a counter is negative and counter negative
conditions are enabled to cause alerts should cause an alert, but
doesn't.
The workaround for the first bug is to create conditions where a
counter will overflow, whenever we are about to restore a MMCR0
value that has PMAO set (and PMAO_SYNC clear). The workaround for
the second bug is to freeze all counters using MMCR2 before reading
MMCR0.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Somehow, the code that restores the guest transactional memory state
got put in the middle of the code sequence that restores the guest
PMU (performance monitor unit) state. This results in corruption of
the value written to MMCR0 if the guest is in transactional state.
This fixes it by moving the TM state-restoring code to come just before
the PMU state-restoring code. This comes out in the patch as the
first part of the PMU state-restoring code being moved down to just
before the second part of the PMU state-restoring code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Some power8 revisions have a hardware bug where we can lose a PMU
exception, this commit adds a workaround to detect the bad condition and
rectify the situation.
See the comment in the commit for a full description.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Some power8 revisions have a hardware bug where we can lose a
Performance Monitor (PMU) exception under certain circumstances.
We will be adding a workaround for this case, see the next commit for
details. The observed behaviour is that writing PMAO doesn't cause an
exception as we would expect, hence the name of the feature.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This patch converts Event TRB's 3rd field, which has type le32, to CPU
byteorder before using it to retrieve the Slot ID with TRB_TO_SLOT_ID macro.
This bug was found using sparse.
Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
[Backport of 7e76ad431545d013911ddc744843118b43d01e89]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch converts TRB_CYCLE to le32 to update correctly the Cycle Bit in
'control' field of the link TRB.
This bug was found using sparse.
Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
[Backport of 587194873820a4a1b2eda260ac851394095afd77]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
In a kexec scenario, we failed to load the mlx4 driver in the
second kernel because the ownership bit was held by the first
kernel and never released correctly.
The patch adds a shutdown() interface so that the ownership can
be released correctly in the first kernel. It also helps avoid
EEH errors during the boot stage of the second kernel caused
by undesired traffic, which can't be handled by hardware during
that stage on the Power platform.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Tested-by: Wei Yang <weiyang@linux.vnet.ibm.com>
|
|
The problem is specific to the case of a BIST issued to an IPR adapter
on the guest side. The IPR driver does something like this:
pci_save_state(), BIST reset and then pci_restore_state(). We lose
everything in the MSIx table with the BIST reset and never have a chance
to restore the MSIx table in that case.
pci_restore_msix_state(), called by pci_restore_state(), masks all MSIx
vectors via the MSIx capability, restores the MSIx table, and then unmasks
all MSIx vectors. We force the host kernel to restore the MSIx
vectors in the step of unmasking all MSIx vectors to fix the issue.
The patch is under review in the Linux community at the moment. It would be
better to have acks from Ben and Alexey if we really want this in Frobisher.
It responds to bug#103589.
Reported-by: Wen Xiong <wenxiong@us.ibm.com>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
|
|
We may detect EEH errors during reboot, particularly in the kexec
path, but it's impossible for device drivers and the EEH core to handle
or recover from them properly.
The patch registers a reboot notifier for EEH and disables the EEH
subsystem during reboot. That means the EEH errors are going to be
cleared by hardware reset or by the second kernel during the early
stage of PCI probe.
It's backporting commit 66f9af83e56bfa12964d251df9d60fb571579913
("powerpc/eeh: Disable EEH on reboot") from 3.14 upstream for
bug#103590
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The patch cleans up the variable eeh_subsystem_enabled so that we needn't
refer to the variable directly from outside. Instead, we use the
functions eeh_enabled() and eeh_set_enable() to operate on the variable.
It's backporting 2ec5a0adf60c23bb6b0a95d3b96a8c1ff1e1aa5a ("powerpc/eeh:
Cleanup on eeh_subsystem_enabled") from 3.14 upstream for bug#103590
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
When doing reset in order to recover the affected PE, we issue
hot reset on PE primary bus if it's not root bus. Otherwise, we
issue hot or fundamental reset on root port or PHB accordingly.
For the latter case, we didn't cover the situation where the PE only
includes the root port, which potentially causes a kernel crash upon
an EEH error to the PE.
The patch reworks the logic of EEH reset to improve the code
readability and also avoid the kernel crash.
It's backporting commit 5b2e198e50f6ba57081586b853163ea1bb95f1a8
("powerpc/powernv: Rework EEH reset") from 3.14 upstream for
bug#103590
Cc: stable@vger.kernel.org
Reported-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
A malicious guest can register an IOMMU in KVM while a TCE request is
being passed from real to virtual mode. If vcpu->arch.tce_rm_fail
was previously used and not cleared because of a missing LIOBN entry in KVM,
this may cause an unwanted put_page() in the virtual mode handler.
This moves @tce_rm_fail earlier to avoid using the incorrect tce_rm_fail
flag value.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
The PCI core has function pci_reset_function() to do reset on the
specified PCI device. Before the reset starts, the state of the PCI
device is saved and it is restored after reset. The real reset work
could be routed to pcibios_set_pcie_reset_state() by quirks. However,
the PCI bus or PCI device isn't settled down fully for restore (PCI
config and MMIO for MSIx table) after reset and it would introduce
unnecessary frozen PE. Eventually, we're stopped from passing through
IPR adapter from host to KVM-based guest.
The patch adds a delay in pcibios_set_pcie_reset_state() so that the
PCI bus/device can settle down fully before restoring PCI device
states. It's part of the fixes regarding bug#103297 and bug#103589.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
|
|
In periodic mode we remove offline cpus from the broadcast propagation
mask. In oneshot mode we fail to do so. This was not a problem so far,
but the recent changes to the broadcast propagation introduced a
constellation which can result in a NULL pointer dereference.
What happens is:
CPU0 CPU1
idle()
arch_idle()
tick_broadcast_oneshot_control(OFF);
set cpu1 in tick_broadcast_force_mask
if (cpu_offline())
arch_cpu_dead()
cpu_dead_cleanup(cpu1)
cpu1 tickdevice pointer = NULL
broadcast interrupt
dereference cpu1 tickdevice pointer -> OOPS
We dereference the pointer because cpu1 is still set in
tick_broadcast_force_mask and tick_do_broadcast() expects a valid
cpumask and therefore lacks any further checks.
Remove the cpu from the tick_broadcast_force_mask before we set the
tick device pointer to NULL. Also add a sanity check to the oneshot
broadcast function, so we can detect such issues w/o crashing the
machine.
Reported-by: Prarit Bhargava <prarit@redhat.com>
Cc: athorlton@sgi.com
Cc: CAI Qian <caiqian@redhat.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1306261303260.4013@ionos.tec.linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit c9b5a266b103af873abb9ac03bc3d067702c8f4b)
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
|
|
Fast sleep can currently be enabled only after writing a value greater
than 1 into the proc interface /proc/sys/kernel/powersave-nap. Remove this
constraint, now that we have a stable framework to support fast sleep, so
that it is enabled by default at boot.
However the same proc interface is also used to convey if deep idle states
beyond snooze can be entered into or not. Hence retain the check on
powersave-nap in fast sleep to verify if this is the case.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
|
|
Add a configuration file to use when building the skiroot
(Sapphire bootloader) kernel.
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
As Ben suggested, the patch prints PHB diag-data with multiple
fields in one line and omits the line if the fields of that
line are all zero.
With the patch applied, the PHB3 diag-data dump looks like:
PHB3 PHB#3 Diag-data (Version: 1)
brdgCtl: 00000002
RootSts: 0000000f 00400000 b0830008 00100147 00002000
nFir: 0000000000000000 0030006e00000000 0000000000000000
PhbSts: 0000001c00000000 0000000000000000
Lem: 0000000000100000 42498e327f502eae 0000000000000000
InAErr: 8000000000000000 8000000000000000 0402030000000000 \
0000000000000000
PE[ 8] A/B: 8480002b00000000 8000000000000000
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The PHB diag-data is useful to help locating the root cause for
frozen PE or fenced PHB. However, EEH core enables IO path by clearing
part of HW registers before collecting it and eventually we got broken
PHB diag-data.
The patch intends to fix it by dumping the PHB diag-data immediately
when frozen/fenced state on PE or PHB is detected for the first time
in eeh_ops::get_state() or next_error() backend.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The flag PNV_EEH_STATE_ENABLED is put into pnv_phb::eeh_state,
which is protected by CONFIG_EEH. We don't need that. Instead, we
can have pnv_phb::flags and maintain all flags there, which is
the purpose of the patch. The patch also renames PNV_EEH_STATE_ENABLED
to PNV_PHB_FLAG_EEH.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The PHB state PNV_EEH_STATE_REMOVED maintained in pnv_phb isn't
so useful any more and duplicates EEH_PE_ISOLATED. The
patch replaces PNV_EEH_STATE_REMOVED with EEH_PE_ISOLATED.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The PE state (for the eeh_pe instance) EEH_PE_PHB_DEAD is a duplicate of
EEH_PE_ISOLATED. Originally, those PHBs (PHB PEs) with EEH_PE_PHB_DEAD
would be removed from the system. However, it's safe to replace
that with EEH_PE_ISOLATED.
The patch also clears EEH_PE_RECOVERING after a fenced PHB has been handled,
either failure or success. It makes the PHB PE state consistent with:
PHB functions normally NONE
PHB has been removed EEH_PE_ISOLATED
PHB fenced, recovery in progress EEH_PE_ISOLATED | RECOVERING
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Clean up the code: remove an unnecessary enumeration, consolidate the
fragmented data structures and simplify some conditional checks in node
traversal in __init code.
This also fixes a sysfs file duplication bug.
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This fixes memory corruption which happens when VFIO is used with
PR KVM.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
Firmware update on PowerNV platform takes several minutes. During
this time one CPU is stuck in FW and the kernel complains about "soft
lockups".
This patch returns all secondary CPUs to firmware before starting
firmware update process.
[ Reworked a bit and cleaned up -- BenH ]
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Cherry pick 3b3f89ac6614d6bc2e2edb32e49d4906d931c795, implementing the
error log reading code we're pushing upstream.
This changes the userspace interface for reading and acknowledging
error logs, so userspace code will have to change if it relied on the
old way.
Based on a patch by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
This patch adds support to read error logs from OPAL and export
them to userspace through a sysfs interface.
We export each log entry as a directory in /sys/firmware/opal/elog/
Currently, OPAL will buffer up to 128 error log records, we don't
need to have any knowledge of this limit on the Linux side as that
is actually largely transparent to us.
Each error log entry has the following files: id, type, acknowledge, raw.
Currently we just export the raw binary error log in the 'raw' attribute.
In a future patch, we may parse more of the error log to make it a bit
easier for userspace (e.g. to be able to display a brief summary in
petitboot without having to have a full parser).
If we have >128 logs from OPAL, we'll only be notified of 128 until
userspace starts acknowledging them. This limitation may be lifted in
the future and with this patch, that should "just work" from the linux side.
A userspace daemon should:
- wait for error log entries using normal mechanisms (we announce creation)
- read error log entry
- save error log entry safely to disk
- acknowledge the error log entry
- rinse, repeat.
On the Linux side, we read the error log when we're notified of it. This
possibly isn't ideal as it would be better to only read them on-demand.
However, this doesn't really work with the current OPAL interface, so we
read the error log immediately when notified at the moment.
I've tested this pretty extensively and am rather confident that the
linux side of things works rather well. There is currently an issue with
the service processor side of things for >128 error logs though.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Conflicts:
arch/powerpc/include/asm/opal.h
arch/powerpc/platforms/powernv/Makefile
arch/powerpc/platforms/powernv/opal-elog.c
|
|
This patch makes the sysfs interface match that of what's pushed upstream.
changes in kernel:
- fetch dump on-demand
- directory per dump
- in sysfs rather than debugfs
Userspace changes needed
- read from sysfs rather than debugfs.
This enables support for userspace to fetch and initiate FSP and
Platform dumps from the service processor (via firmware) through sysfs.
Based on original patch from Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Flow:
- We register for OPAL notification events.
- OPAL sends new dump available notification.
- We make information on dump available via sysfs
- Userspace requests dump contents
- We retrieve the dump via OPAL interface
- User copies the dump data
- userspace sends ack for dump
- We send ACK to OPAL.
sysfs files:
- We add the /sys/firmware/opal/dump directory
- echoing 1 (well, anything, but in future we may support
different dump types) to /sys/firmware/opal/dump/initiate_dump
will initiate a dump.
- Each dump that we've been notified of gets a directory
in /sys/firmware/opal/dump/ with a name of the dump type and ID (in hex,
as this is what's used elsewhere to identify the dump).
- Each dump has files: id, type, dump and acknowledge
dump is binary and is the dump itself.
echoing 'ack' to acknowledge (currently any string will do) will
acknowledge the dump and it will soon after disappear from sysfs.
OPAL APIs:
- opal_dump_init()
- opal_dump_info()
- opal_dump_read()
- opal_dump_ack()
- opal_dump_resend_notification()
Currently we are only ever notified for one dump at a time (until
the user explicitly acks the current dump, then we get a notification
of the next dump), but this kernel code should "just work" when OPAL
starts notifying us of all the dumps present.
Changes since v2:
- fix bug where we would free the dump buffer after userspace read it,
refetching if needed. Refetching doesn't currently work, so we must
keep the dump around for subsequent reads.
Changes since v1:
- Add support for getting dump type from OPAL through new OPAL call
(falling back to old OPAL_DUMP_INFO call if OPAL_DUMP_INFO2 isn't
supported)
- use dump type in directory name for dump
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Conflicts:
arch/powerpc/include/asm/opal.h
arch/powerpc/platforms/powernv/Makefile
arch/powerpc/platforms/powernv/opal-dump.c
arch/powerpc/platforms/powernv/opal-wrappers.S
arch/powerpc/platforms/powernv/opal.c
|
|
In copy_oldmem_page, the current check, which uses max_pfn and min_low_pfn to
decide if the page is backed or not, is not valid when the memory layout is
not continuous.
This happens when running as a QEMU/KVM guest, where RTAS is mapped higher
in the memory. In that case max_pfn points to the end of RTAS, and a hole
between the end of the kdump kernel and RTAS is not backed by PTEs. As a
consequence, the kdump kernel is crashing in copy_oldmem_page when accessing
in a direct way the pages in that hole.
This fix relies on the memblock's service memblock_is_region_memory to
check if the read page is part or not of the directly accessible memory.
This is a backport of upstream patch
https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-February/115569.html
This fixes LTC BUG #104729
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
|
|
Without a shutdown handler, T4 cards behave very badly after a kexec.
Some firmware calls return errors indicating allocation failures, for
example. This is probably because those resources were not released by
a BYE message to the firmware.
Using the remove handler guarantees we will use a well-tested path.
With this patch applied, I managed to use kexec multiple times, and
probe and iSCSI login worked every time.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
LTC-Bugzilla: #103241
(cherry picked from commit 687d705c031916b83953b714917b04d899e23cf5)
|
|
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
|
|
https://bugzilla.linux.ibm.com/show_bug.cgi?id=104249
https://bugzilla.linux.ibm.com/show_bug.cgi?id=104444
Signed-off-by: Wang Sen <wangsen@linux.vnet.ibm.com>
|
|
On p8 systems with the relocation-on-exception feature enabled, we are seeing
the kdump kernel hang at interrupt vector 0xc*4400. The reason is, with this
feature enabled, exceptions are raised with the MMU (IR=DR=1) ON at the
default offset of 0xc*4000. Since the exception is raised in virtual mode, it
requires the vector region to be executable, without which it fails to
fetch and execute instruction at 0xc*4xxx. For default kernel since kernel
is loaded at real 0, the htab mappings sets the entire kernel text region
executable. But for relocatable kernel (e.g. kdump case) we only copy
interrupt vectors down to real 0 and never mark that region as
executable, because on p7 and below we always take exceptions in real mode.
This patch fixes this issue by marking htab mapping range as executable
that overlaps with the interrupt vector region for relocatable kernel.
Thanks to Ben who helped me to debug this issue and find the root cause.
This is at least part of the fix for kdump failures that we are seeing
in bug 103693.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
(cherry picked from commit 429d2e8342954d337abe370d957e78291032d867)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Disable relocation on exceptions while going down, even in the kdump case.
This is because we are about to clear the htab mappings while kexec-ing into
the kdump kernel and we may run into issues if we still have AIL ON.
This is at least part of the fix for kdump failures that we are seeing
in bug 103693.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
(cherry picked from commit 3ec8b78fcc5aa7745026d8d85a4e9ab52c922765)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This fixes a bug where we would get two events from OPAL with DUMP_AVAIL
set (which is valid for OPAL to do) and in the second run of extract_dump()
we would fail to free the memory previously allocated for the dump
(leaking ~6MB+) as well as on the second dump_read_data() call OPAL
would not retrieve the dump, leaving us with a dump in linux that was
the correct size but all zeros.
Changes since v1: fixed typo
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
LTC-Bugzilla: #104211
|
|
Commit 082fee36bd2c ("KVM: PPC: Book3S HV: Make physical thread 0 do
the MMU switching") reordered the guest entry/exit code so that most
of the guest register save/restore code happened in guest MMU context.
A side effect of that is that the timebase still contains the guest
timebase value at the point where we compute and use vcpu->arch.dec_expires,
and therefore that is now a guest timebase value rather than a host
timebase value. That in turn means that the timeouts computed in
kvmppc_set_timer() are wrong if the timebase offset for the guest is
non-zero. The consequence of that is things such as "sleep 1" in a
guest after migration may sleep for much longer than they should.
This fixes the problem by converting between guest and host timebase
values as necessary, by adding or subtracting the timebase offset.
This also fixes an incorrect comment.
This is part of the fix for many of the migration-related bug reports.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
In the kdump kernel we see a hang during subcore_init() at
unsplit_core()->wait_for_sync_step(). In the kdump kernel we always boot with
maxcpus=1 and all other cpus are waiting inside OPAL; hence, with 1 online
cpu, the master thread keeps waiting on secondary threads to set split_state
indefinitely. This is equally true for all cases where max_cpus is not aligned
with threads_per_core. This patch fixes the issue by disabling the
core split/unsplit feature if max_cpus is not aligned with threads_per_core.
This also fixes the kdump hang issue.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
|
|
Signed-off-by: Crístian Viana <vianac@linux.vnet.ibm.com>
|
|
This fixes one of the corner cases which produced a wrong backtrace
from put_page().
BZ: 103055
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
The icp_native_flush_interrupt() function is supposed to clear a pending
interrupt, like local_irq_enable(); local_irq_disable() would, but
without calling generic code. Unfortunately it missed clearing
the "IPI pending" flag in the PACA (local_paca->kvm_hstate.host_ipi).
The effect of this flag being set is that secondary CPU threads won't
go into the KVM guest, leading to messages like:
kvmppc_wait_for_nap timeout 0 1
when a KVM HV guest is run. This fixes it by adding a call to
kvmppc_set_host_ipi to clear the flag.
This fixes BZ 103513.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Commit 595e4f7e697e ("KVM: PPC: Book3S HV: Use load/store_fp_state
functions in HV guest entry/exit") changed the register usage in
kvmppc_save_fp() and kvmppc_load_fp() but omitted changing the
instructions that load and save VRSAVE. The result is that the
VRSAVE value was loaded from a constant address, and saved to a
location past the end of the vcpu struct, causing host kernel
memory corruption and various kinds of host kernel crashes.
This fixes the problem by using register r31, which contains the
vcpu pointer, instead of r3 and r4.
This should help resolve several bugzillas involving guest or host
crashes and hangs, including 98456, 102775, 103534, 100504, and
possibly others.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
register_cpu_notifier() can deadlock if called inside a
get/put_online_cpus block. To avoid this, move the call to
register_cpu_notifier before the get_online_cpus().
[paulus@samba.org - renamed alloc_xxx to alloc_percpu_areas, fixed
compile errors, made up patch description]
This fixes BZ 103213.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
change log:
PPC: KVM: fix to compile without VFIO
vfio: fix in-kernel and ioctl handlers
Fix a bug where asking for a POWER8 guest on a POWER7 system doesn't fail, but should
Fix and performance improvements for nested virtualization
LTC BZ 101114 CPU Build0.6: Host Cpu Offline/online leads to instruction dump and further cpu online/offline functions are not
PowerKVM Build 8 host platform support
Fix problems reported by the kernel RCU checking machinery and may help fix the memory corruption issues we have been seeing
LTC BZ 101123 Unable to bring up LE guest using libvirt/virsh
Fixes a bug with not resetting page struct pointer which caused bugs in calling code.
Fix one of the corner cases when the realmode handler fails to handle T_PUT_TCE_INDIRECT call and passes it further to the vir
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
|
|
The existing handler assumes that the first failed TCE entry's host
physical address is saved in the tce_tmp_hpas cache, but it is not, so
the virtmode handler has to read it from the TCE list again, which is
what this patch does.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
to config-powerpc64 and config-powerpc64p7
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
|
|
The code in remove_cache_dir() is supposed to remove the "cache"
subdirectory from the sysfs directory for a CPU when that CPU is
being offlined. It tries to do this by calling kobject_put() on
the kobject for the subdirectory. However, the subdirectory only
gets removed once the last reference goes away, and the reference
being put here may well not be the last reference. That means
that the "cache" subdirectory may still exist when the offlining
operation has finished. If the same CPU subsequently gets onlined,
the code tries to add a new "cache" subdirectory. If the old
subdirectory has not yet been removed, we get a WARN_ON in the
sysfs code, with stack trace, and an error message printed on the
console. Further, we ultimately end up with an online cpu with no
"cache" subdirectory.
This fixes it by doing an explicit kobject_del() at the point where
we want the subdirectory to go away. kobject_del() removes the sysfs
directory even though the object still exists in memory. The object
will get freed at some point in the future. A subsequent onlining
operation can create a new sysfs directory, even if the old object
still exists in memory, without causing any problems.
This fixes BZ 101114.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This fixes a bug where the page struct pointer was not reset,
which caused bugs in calling code.
Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
This does for PR KVM what c9438092cae4 ("KVM: PPC: Book3S HV: Take SRCU
read lock around kvm_read_guest() call") did for HV KVM, that is,
eliminate a "suspicious rcu_dereference_check() usage!" warning by
taking the SRCU lock around the call to kvmppc_rtas_hcall().
It also fixes a return of RESUME_HOST to return EMULATE_FAIL instead,
since kvmppc_h_pr() is supposed to return EMULATE_* values.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Now that we have the vcpu floating-point and vector state stored in
the same type of struct as the main kernel uses, we can load that
state directly from the vcpu struct instead of having extra copies
to/from the thread_struct. Similarly, when the guest state needs to
be saved, we can have it saved directly to the vcpu struct by
setting the current->thread.fp_save_area and current->thread.vr_save_area
pointers. That also means that we don't need to back up and restore
userspace's FP/vector state. This all makes the code simpler and
faster.
Note that it's not necessary to save or modify current->thread.fpexc_mode,
since nothing in KVM uses or is affected by its value. Nor is it
necessary to touch used_vr or used_vsr.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This fixes missing read/write TCE bits in VFIO map/unmap ioctls.
This fixes the real mode handler to switch to the virtual mode if
pte does not have "write" AND "dirty" bits set.
This fixes get_user_pages_fast() call in the virtual mode handler
to use correct write flag (used to be 0 always).
This adds a lock around a kvm_memory_slot struct use.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
(cherry picked from commit 754177ee49cd27c9380e7bb9c0de6f8488197ca3)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
This removes the code that handles the H_SET_MODE_RESOURCE_LE and
H_SET_MODE_RESOURCE_ADDR_TRANS_MODE subfunctions of the H_SET_MODE
hypercall from the kernel. Instead we now return H_TOO_HARD which
causes the hypercall to be sent up to userspace to be handled there.
In addition we now also send any other subfunction which we don't
recognize to userspace.
The reason for doing these two subfunctions in userspace is that they
need to modify LPCR across all vcpus of the guest. Modifying LPCR in
the kernel like this introduces a race between the kernel's
modification and any modification that userspace might be doing on
another vcpu. Therefore it's better to let userspace do all the
modifications, so it can do any necessary synchronization itself.
This also adds code to make sure that the MSR_LE bit in intr_msr
(the MSR value we set when synthesizing an interrupt for the guest)
is in sync with the ILE bit in the virtual core's LPCR value. This
is necessary for implementing the LE subfunction of H_SET_MODE in
userspace.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
The load_up_fpu and load_up_altivec functions were never intended to
be called from C, and do things like modifying the MSR value in their
callers' stack frames, which are assumed to be interrupt frames. In
addition, on 32-bit Book S they require the MMU to be off.
This makes KVM use the new load_fp_state() and load_vr_state() functions
instead of load_up_fpu/altivec. This means we can remove the assembler
glue in book3s_rmhandlers.S, and potentially fixes a bug on Book E,
where load_up_fpu was called directly from C.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
(cherry picked from commit 6a87e5da59bf1d1a4186bf27ad8aa5dc3b03dd63)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
With HV KVM, some high-frequency hypercalls such as H_ENTER are handled
in real mode, and need to access the memslots array for the guest.
Accessing the memslots array is safe, because we hold the SRCU read
lock for the whole time that a guest vcpu is running. However, the
checks that kvm_memslots() does when lockdep is enabled are potentially
unsafe in real mode, when only the linear mapping is available.
Furthermore, kvm_memslots() can be called from a secondary CPU thread,
which is an offline CPU from the point of view of the host kernel,
and is not running the task which holds the SRCU read lock.
To avoid false positives in the checks in kvm_memslots(), and to avoid
possible side effects from doing the checks in real mode, this replaces
kvm_memslots() with kvm_memslots_raw() in all the places that execute
in real mode. kvm_memslots_raw() is a new function that is like
kvm_memslots() but uses rcu_dereference_raw_notrace() instead of
rcu_dereference_check().
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Since the guest can read the machine's PVR (Processor Version Register)
directly and see the real value, we should disallow userspace from
setting any value for the guest's PVR other than the real host value.
Therefore this makes kvm_arch_vcpu_set_sregs_hv() check the supplied
PVR value and return an error if it is different from the host value,
which has been put into vcpu->arch.pvr at vcpu creation time.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
On PowerNV platforms, when a CPU is offline, we put it into nap mode.
It's possible that the CPU wakes up from nap mode while it is still
offline due to a stray IPI. A misdirected device interrupt could also
potentially cause it to wake up. In that circumstance, we need to clear
the interrupt so that the CPU can go back to nap mode.
In the past the clearing of the interrupt was accomplished by briefly
enabling interrupts and allowing the normal interrupt handling code
(do_IRQ() etc.) to handle the interrupt. This has the problem that
this code calls irq_enter() and irq_exit(), which call functions such
as account_system_vtime() which use RCU internally. Use of RCU is not
permitted on offline CPUs and will trigger errors if RCU checking is
enabled.
To avoid calling into any generic code which might use RCU, we adopt
a different method of clearing interrupts on offline CPUs. Since we
are on the PowerNV platform, we know that the system interrupt
controller is a XICS being driven directly (i.e. not via hcalls) by
the kernel. Hence this adds a new icp_native_flush_interrupt()
function to the native-mode XICS driver and arranges to call that
when an offline CPU is woken from nap. This new function reads the
interrupt from the XICS. If it is an IPI, it clears the IPI; if it
is a device interrupt, it prints a warning and disables the source.
Then it does the end-of-interrupt processing for the interrupt.
The other thing that briefly enabling interrupts did was to check and
clear the irq_happened flag in this CPU's PACA. Therefore, after
flushing the interrupt from the XICS, we also clear all bits except
the PACA_IRQ_HARD_DIS (interrupts are hard disabled) bit from the
irq_happened flag. The PACA_IRQ_HARD_DIS flag is set by power7_nap()
and is left set to indicate that interrupts are hard disabled. This
means we then have to ignore that flag in power7_nap(), which is
reasonable since it doesn't indicate that any interrupt event needs
servicing.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
The policy->cpus mask populated by the cpufreq driver is expected to be
hotplug invariant, since the cpufreq core copies this mask as-is to the
policy->related_cpus mask (which shouldn't vary upon hotplug). The
cpufreq core code later prunes the offline CPUs from the policy->cpus
mask.
At the moment, the powerpc cpufreq driver uses topology_thread_cpumask() to
populate policy->cpus during .init(), and hence this is NOT hotplug invariant.
Due to this, we hit the following bug:
1. Once we offline all threads of a core, say CPUs 8-15, and online
CPU 8 back, its related cpus mask shows:
$ cat /sys/devices/system/cpu/cpu8/cpufreq/related_cpus
8
[ It should have actually shown 8 9 10 11 12 13 14 15 ]
2. When we try to online the next sibling thread (CPU 9), it tries to do a fresh
initialization since it is not listed in the related_cpus mask of CPU 8.(Note
that for CPU 9, the cpufreq driver would have populated the related_cpus mask
as [ 8 9 ], since those are the 2 online threads in that core so far). During
CPU 9 init, it fails in the call to cpufreq_add_dev_symlink() because it
tries to initialize the sysfs files for CPU 8 as well (which had already been
initialized) while iterating through the policy->cpus.
As a result, we hit this bug while onlining CPU 9:
[ 1019.458183] sysfs: cannot create duplicate filename '/devices/system/cpu/cpu8/cpufreq'
[ 1019.458270] ------------[ cut here ]------------
[ 1019.458338] WARNING: at fs/sysfs/dir.c:530
[ 1019.458367] Modules linked in: xt_tcpudp ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
xt_conntrack nf_conntrack iptable_filter ip_tables x_tables kvm binfmt_misc autofs4 lpfc
[ 1019.458543] CPU: 76 PID: 73014 Comm: bash Not tainted 3.10.11-cpufreq-10 #1
[ 1019.458590] task: c000000ff02c3200 ti: c000000fe7604000 task.ti: c000000fe7604000
[ 1019.458645] NIP: c000000000284634 LR: c000000000284630 CTR: c0000000005b5d10
[ 1019.458700] REGS: c000000fe7606fa0 TRAP: 0700 Not tainted (3.10.11-cpufreq-10)
[ 1019.458754] MSR: 9000000100029032 <SF,HV,EE,ME,IR,DR,RI> CR: 28222824 XER: 20000000
[ 1019.458883] SOFTE: 1
[ 1019.458903] CFAR: c000000000874d6c
[ 1019.458930]
GPR00: c000000000284630 c000000fe7607220 c000000000d9ab60 000000000000004a
GPR04: 0000000000000000 000000000000005a c000000000c82fb8 c000000004482448
GPR08: c000000000c7ab60 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000028222822 c00000000fe13000 0000000010142550 c000000000ce8d70
GPR16: 0000000000000001 c000000000f28c68 0000000000000000 c000000003c20030
GPR20: c000000ff6d91800 c000000000ce8fc8 c000000000b45340 c000000000e26858
GPR24: c000000000ce8d70 0000000000000000 0000000000000001 c000000ff6d91a70
GPR28: c000000fef1b2000 c000000fe7607320 c000000fc98087a0 ffffffffffffffef
[ 1019.459605] NIP [c000000000284634] .sysfs_add_one+0xe4/0x100
[ 1019.459653] LR [c000000000284630] .sysfs_add_one+0xe0/0x100
[ 1019.459689] PACATMSCRATCH [9000000100009032]
[ 1019.459726] Call Trace:
[ 1019.459747] [c000000fe7607220] [c000000000284630] .sysfs_add_one+0xe0/0x100 (unreliable)
[ 1019.459813] [c000000fe76072b0] [c0000000002854dc] .sysfs_do_create_link_sd+0x10c/0x320
[ 1019.459879] [c000000fe7607370] [c000000000718318] .cpufreq_add_dev_interface+0x2e8/0x410
[ 1019.459943] [c000000fe7607710] [c000000000718da0] .cpufreq_add_dev+0x590/0x6d0
[ 1019.460009] [c000000fe7607810] [c000000000899580] .cpufreq_cpu_callback+0x7c/0x94
[ 1019.460073] [c000000fe7607890] [c00000000086f40c] .notifier_call_chain+0x8c/0x100
[ 1019.460138] [c000000fe7607930] [c000000000091450] .cpu_notify+0x40/0xa0
[ 1019.460194] [c000000fe76079b0] [c00000000089696c] ._cpu_up+0x17c/0x1ec
[ 1019.460249] [c000000fe7607a70] [c000000000896b40] .cpu_up+0x164/0x194
[ 1019.460304] [c000000fe7607b00] [c000000000746edc] .store_online+0xbc/0xa60
[ 1019.460361] [c000000fe7607bb0] [c0000000004faf64] .dev_attr_store+0x64/0xa0
[ 1019.460417] [c000000fe7607c40] [c000000000282244] .sysfs_write_file+0xf4/0x1d0
[ 1019.460482] [c000000fe7607cf0] [c0000000001f1fa8] .vfs_write+0xe8/0x260
[ 1019.460537] [c000000fe7607d90] [c0000000001f2c44] .SyS_write+0x64/0xe0
[ 1019.460593] [c000000fe7607e30] [c000000000009d54] syscall_exit+0x0/0x98
[ 1019.460647] Instruction dump:
[ 1019.460675] 481b0b2d 60000000 e89e0010 7f83e378 38a01000 481b0b19 60000000 7f84e378
[ 1019.460774] 3c62ffd5 38632cf0 485f06dd 60000000 <0fe00000> 7f83e378 4bf5f8a5 60000000
[ 1019.460952] ---[ end trace 600f2280a5b2cd86 ]---
None of this would have occurred if related_cpus had remained unchanged during
hotplug, because in that case, CPU 9 would have done a light-weight init, thus
avoiding this duplication bug. So fix this by populating policy->cpus in a
hotplug invariant manner in the cpufreq driver.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The platform provides power data in watts, but hwmon expects
micro-watts.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch is an incremental patch on top of commit af93eec4.
It adds support for resending the dump available notification and updates
the README file. It also fixes a few other minor issues.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Detect and recover from a machine check that occurs inside OPAL on
special SCOM load instructions. On specific SCOM reads via MMIO we may get
a machine check exception with SRR0 pointing inside OPAL. To recover from
an MC in this scenario, get a recovery instruction address and return to
it from the MC handler.
OPAL will export the machine check recoverable ranges through
device tree node mcheck-recoverable-ranges under ibm,opal:
# hexdump /proc/device-tree/ibm,opal/mcheck-recoverable-ranges
0000000 0000 0000 3000 2804 0000 000c 0000 0000
0000010 3000 2814 0000 0000 3000 27f0 0000 000c
0000020 0000 0000 3000 2814 xxxx xxxx xxxx xxxx
0000030 llll llll yyyy yyyy yyyy yyyy
...
...
#
where:
xxxx xxxx xxxx xxxx = Starting instruction address
llll llll = Length of the address range.
yyyy yyyy yyyy yyyy = recovery address
Each recoverable address range entry is a (start address, len,
recovery address) tuple, with 2 cells each for the start and recovery
addresses and 1 cell for the len, totalling 5 cells per entry. During
kernel boot time, build up the
recovery table with the list of recovery ranges from device-tree node which
will be used during machine check exception to recover from MMIO SCOM UE.
Changes in v2:
- As per Ben's comment, added mcheck-recoverable-ranges property under
ibm,opal node.
- Changed the format of the mcheck-recoverable-ranges list.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch adds basic kernel enablement for reading power values, fan
speed (RPM) and temperature values on PowerNV platforms, which are
exported to user space through the /sys interface.
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|