|
commit edfbbf388f293d70bf4b7c0bc38774d05e6f711a upstream.
A kernel memory disclosure was introduced in aio_read_events_ring() in v3.10
by commit a31ad380bed817aa25f8830ad23e1a0480fef797. The changes made to
aio_read_events_ring() failed to correctly limit the index into
ctx->ring_pages[], allowing an attacker to cause the subsequent kmap() of
an arbitrary page with a copy_to_user() to copy the contents into userspace.
This vulnerability has been assigned CVE-2014-0206. Thanks to Mateusz and
Petr for disclosing this issue.
This patch applies to v3.12+. A separate backport is needed for 3.10/3.11.
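The missing bound can be illustrated with a small userspace sketch (the constant and names below are illustrative stand-ins, not the kernel's actual values):

```c
/* Sketch of the out-of-bounds index: aio_read_events_ring() derives a
 * page index for ctx->ring_pages[] from the ring head. */
#define AIO_EVENTS_PER_PAGE 128u /* illustrative, not the kernel's value */

/* Without the modulo, a head beyond nr_events selects a page past the
 * end of ring_pages[], whose contents kmap()+copy_to_user() would then
 * leak to userspace. The clamp is the essence of the fix. */
static unsigned event_page_index(unsigned head, unsigned nr_events)
{
    head %= nr_events;                 /* the missing bound */
    return head / AIO_EVENTS_PER_PAGE; /* now always a valid page */
}
```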
[jmoyer@redhat.com: backported to 3.10]
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d36db46c2cba973557eb6138d22210c4e0cf17d6)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
commit f8567a3845ac05bb28f3c1b478ef752762bd39ef upstream.
The aio cleanups and optimizations by kmo that were merged into the 3.10
tree added a regression for userspace event reaping. Specifically, the
reference counts are not decremented if the event is reaped in userspace,
leading to the application being unable to submit further aio requests.
This patch applies to 3.12+. A separate backport is required for 3.10/3.11.
This issue was uncovered as part of CVE-2014-0206.
[jmoyer@redhat.com: backported to 3.10]
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Kent Overstreet <kmo@daterainc.com>
Cc: Mateusz Guzik <mguzik@redhat.com>
Cc: Petr Matousek <pmatouse@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 6745cb91b5ec93a1b34221279863926fba43d0d7)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
commit 23adbe12ef7d3d4195e80800ab36b37bee28cd03 upstream.
The kernel has no concept of capabilities with respect to inodes; inodes
exist independently of namespaces. For example, inode_capable(inode,
CAP_LINUX_IMMUTABLE) would be nonsense.
This patch changes inode_capable to check for uid and gid mappings and
renames it to capable_wrt_inode_uidgid, which should make it more
obvious what it does.
Fixes CVE-2014-4014.
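A simplified userspace model of the renamed check (the structs here are toy stand-ins; the kernel uses kuid_has_mapping(), kgid_has_mapping() and ns_capable()):

```c
#include <stdbool.h>

/* Toy model of a user namespace's uid/gid mappings (illustrative) */
struct userns_model {
    unsigned uid_base, uid_count;
    unsigned gid_base, gid_count;
};

static bool uid_mapped(const struct userns_model *ns, unsigned uid)
{
    return uid >= ns->uid_base && uid - ns->uid_base < ns->uid_count;
}

static bool gid_mapped(const struct userns_model *ns, unsigned gid)
{
    return gid >= ns->gid_base && gid - ns->gid_base < ns->gid_count;
}

/* capable_wrt_inode_uidgid: a namespace capability over an inode only
 * counts when the inode's uid AND gid both map into the namespace. */
static bool capable_wrt_inode_uidgid_model(const struct userns_model *ns,
                                           unsigned i_uid, unsigned i_gid,
                                           bool ns_has_cap)
{
    return uid_mapped(ns, i_uid) && gid_mapped(ns, i_gid) && ns_has_cap;
}
```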
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 4f80c6c1825a91cecf3b3bd19c824e768d98fe48)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
in futex_requeue(..., requeue_pi=1)
commit e9c243a5a6de0be8e584c604d353412584b592f8 upstream.
If uaddr == uaddr2, then we have broken the rule of only requeueing from
a non-pi futex to a pi futex with this call. If we attempt this, then
dangling pointers may be left for rt_waiter resulting in an exploitable
condition.
This change brings futex_requeue() in line with futex_wait_requeue_pi()
which performs the same check as per commit 6f7b0a2a5c0f ("futex: Forbid
uaddr == uaddr2 in futex_wait_requeue_pi()")
[ tglx: Compare the resulting keys as well, as uaddrs might be
different depending on the mapping ]
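The added guard can be sketched in userspace as follows (struct futex_key and match_futex() are simplified stand-ins for the kernel's):

```c
#include <errno.h>

/* Simplified stand-in for the kernel's futex_key */
struct futex_key { unsigned long word; void *ptr; };

static int match_futex(const struct futex_key *k1, const struct futex_key *k2)
{
    return k1->word == k2->word && k1->ptr == k2->ptr;
}

/* Sketch of the check added to futex_requeue() for requeue_pi=1:
 * reject the same uaddr outright, and also reject distinct uaddrs
 * that resolve to the same key (tglx's addition for aliased mappings). */
static int requeue_pi_sanity_check(const void *uaddr1, const void *uaddr2,
                                   const struct futex_key *key1,
                                   const struct futex_key *key2)
{
    if (uaddr1 == uaddr2)
        return -EINVAL;
    if (match_futex(key1, key2))
        return -EINVAL;
    return 0;
}
```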
Fixes CVE-2014-3153.
Reported-by: Pinkie Pie
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b58623fb64ff0454ec20bce7a02275a20c23086d)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
commit e6a623460e5fc960ac3ee9f946d3106233fd28d8 upstream.
This fixes CVE-2014-1739.
Signed-off-by: Salva Peiró <speiro@ai2.upv.es>
Acked-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 4e32a7c66fae40bde0fbff8cbc893eabe8575135)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
vhost LE fixes
|
|
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from commit edc39cd39f5e2e659d598159ff6185752badcf73)
Signed-off-by: Scott Garfinkle <seg@us.ibm.com>
|
|
The virtqueue structure shares a few attributes with the guest OS
which need to be byteswapped when the endian order of the host is
different.
This patch uses the vq->byteswap attribute to decide whether or not
to byteswap data being accessed in the guest memory.
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from commit a9a8f7a5697686356fc0c0728f59a3379cf0a212)
Signed-off-by: Scott Garfinkle <seg@us.ibm.com>
|
|
commit a9a8f7a56976 "vhost: Byteswap virtqueue attributes"
missed a few byteswaps in vhost_add_used(). This patch adds
the vq_put_user() calls required when accessing data in
the guest memory.
BZ: 108753
Branch: powerkvm-v2.1.1
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from commit e7fb63f8f2444a8f687b53311d8e83267fcaa143)
Signed-off-by: Scott Garfinkle <seg@us.ibm.com>
|
|
This patch does not fix any known issues in the previous vhost patchset.
Nevertheless, the byteswap attribute needs to be re-initialized like
all other virtqueue attributes.
BZ: 108753
Branch: powerkvm-v2.1.1
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Cc: Paul Mackerras <paulus@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from commit af892adc789c20a9b8c42195cee2d6e2861ad037)
Signed-off-by: Scott Garfinkle <seg@us.ibm.com>
|
|
The VHOST_VRING_F_BYTESWAP flag is used by the host to byteswap
data of the vring when the guest and the host have a different
endian order.
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from commit 28936b89d8e5faa76ef424b2ec356c8ee67c884e)
Signed-off-by: Scott E. Garfinkle <seg@us.ibm.com>
|
|
This patch adds a few helper routines around get_user and put_user
to ease byteswapping.
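The helper idea can be sketched in userspace roughly like this (names are illustrative; the kernel versions wrap get_user()/put_user() and key off the vq->byteswap attribute):

```c
#include <stdint.h>

/* Swap only when the host/guest endian orders differ (flag-driven) */
static uint16_t bswap16_if(int byteswap, uint16_t v)
{
    return byteswap ? (uint16_t)((v >> 8) | (v << 8)) : v;
}

/* vq_get_user-style read of a 16-bit field shared with the guest */
static uint16_t vq_get16(int byteswap, const uint16_t *guest_field)
{
    return bswap16_if(byteswap, *guest_field);
}

/* vq_put_user-style write of a 16-bit field shared with the guest */
static void vq_put16(int byteswap, uint16_t *guest_field, uint16_t v)
{
    *guest_field = bswap16_if(byteswap, v);
}
```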
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
(cherry picked from commit 727174f8bb2d1be197cb94e5c0341fa640952a79)
Signed-off-by: Scott E. Garfinkle <seg@us.ibm.com>
|
|
The ConnectX device selftest speed test shouldn't fail on 1G and 40G link
speeds.
BZ: 113608
Cherry-pick of 313c2d375b1c9b648d9d4b96ec1b8185ac6a78c5
Signed-off-by: Carol Soto <clsoto@linux.vnet.ibm.com>
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Unfortunately, the LPCR got defined as a 32-bit register in the
one_reg interface. This is unfortunate because KVM allows userspace
to control the DPFD (default prefetch depth) field, which is in the
upper 32 bits. The result is that DPFD always gets set to 0, which
reduces performance in the guest.
We can't just change KVM_REG_PPC_LPCR to be a 64-bit register ID,
since that would break existing userspace binaries. Instead we define
a new KVM_REG_PPC_LPCR_64 id which is 64-bit. Userspace can still use
the old KVM_REG_PPC_LPCR id, but we now only modify those fields in
the bottom 32 bits that userspace can modify (ILE, TC and AIL).
If userspace uses the new KVM_REG_PPC_LPCR_64 id, it can modify DPFD
as well.
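The masking scheme can be sketched as follows (a userspace model; the mask used in the test is illustrative, not the real ILE, TC and AIL bit positions):

```c
#include <stdint.h>

/* Old 32-bit KVM_REG_PPC_LPCR id: only touch the whitelisted
 * low-order fields, leaving DPFD (in the upper 32 bits) alone. */
static uint64_t lpcr_set_32(uint64_t lpcr, uint32_t val, uint64_t user_mask)
{
    return (lpcr & ~user_mask) | ((uint64_t)val & user_mask);
}

/* New 64-bit KVM_REG_PPC_LPCR_64 id: the whole register is writable,
 * so userspace can set DPFD as well. */
static uint64_t lpcr_set_64(uint64_t lpcr, uint64_t val)
{
    (void)lpcr;
    return val;
}
```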
BZ: 111438
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
pci_get_slot() takes the PCI bus semaphore, so it is not safe to call
in interrupt context. However, we may check for EEH errors, and thus
call the function, in interrupt context. To avoid using pci_get_slot(),
we turn to the device tree to fetch the location code.
Otherwise, we might run into the WARN_ON() as the following messages indicate:
WARNING: at drivers/pci/search.c:223
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.16.0-rc3+ #72
task: c000000001367af0 ti: c000000001444000 task.ti: c000000001444000
NIP: c000000000497b70 LR: c000000000037530 CTR: 000000003003d114
REGS: c000000001446fa0 TRAP: 0700 Not tainted (3.16.0-rc3+)
MSR: 9000000000029032 <SF,HV,EE,ME,IR,DR,RI> CR: 48002422 XER: 20000000
CFAR: c00000000003752c SOFTE: 0
NIP [c000000000497b70] .pci_get_slot+0x40/0x110
LR [c000000000037530] .eeh_pe_loc_get+0x150/0x190
Call Trace:
.of_get_property+0x30/0x60 (unreliable)
.eeh_pe_loc_get+0x150/0x190
.eeh_dev_check_failure+0x1b4/0x550
.eeh_check_failure+0x90/0xf0
.lpfc_sli_check_eratt+0x504/0x7c0 [lpfc]
.lpfc_poll_eratt+0x64/0x100 [lpfc]
.call_timer_fn+0x64/0x190
.run_timer_softirq+0x2cc/0x3e0
Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
CMA is introduced to provide physically contiguous pages at runtime.
For this purpose, it reserves memory at boot time. Although the memory
is reserved, it can still be used to satisfy movable memory allocation
requests. This use case benefits systems that need the CMA reserved
memory only infrequently, and it is one of the main purposes of
introducing CMA.
But there is a problem in the current implementation: it behaves like
a plain reserved-memory approach. The pages in the CMA reserved area
are hardly ever used for movable memory allocation. This is caused by
the combination of allocation and reclaim policy.
The pages in the CMA reserved area are allocated only when there is no
other movable memory, that is, as a fallback allocation. So by the time
this fallback allocation starts, the system is under heavy memory
pressure. Even under that pressure, movable allocations succeed easily,
since there are many pages in the CMA reserved area. But this is not
the case for unmovable and reclaimable allocations, because they cannot
use pages in the CMA reserved area. For watermark checking, these
allocations regard the system's free memory as (free pages - free CMA
pages), that is, free unmovable pages + free reclaimable pages + free
movable pages. Because movable pages are already exhausted, the only
free pages left are of the unmovable and reclaimable types, and this is
a really small amount, so the watermark check fails. It wakes up kswapd
to make enough free memory for unmovable and reclaimable allocation,
and kswapd does so. So before we fully utilize the pages in the CMA
reserved area, kswapd starts to reclaim memory and tries to push free
memory over the high watermark. This watermark check by kswapd does not
account for free CMA pages, so many movable pages are reclaimed. After
that, we have a lot of movable pages again, so the fallback allocation
does not happen again. To conclude, the amount of free memory in
meminfo, which includes free CMA pages, hovers around 512 MB if I
reserve 512 MB of memory for CMA.
I found this problem with the following experiment.
4 CPUs, 1024 MB, VIRTUAL MACHINE
make -j16
CMA reserve: 0 MB 512 MB
Elapsed-time: 225.2 472.5
Average-MemFree: 322490 KB 630839 KB
To solve this problem, I can think of the following two possible solutions:
1. allocate the pages on CMA reserved memory first, and if they are
exhausted, allocate movable pages.
2. interleaved allocation: try to allocate specific amounts of memory
from CMA reserved memory and then allocate from free movable memory.
I tested approach #1 and found a problem. Although free memory in
meminfo can hover around the low watermark, it fluctuates heavily,
because too many pages are reclaimed when kswapd is invoked. The reason
for this behaviour is that successively allocated CMA pages sit on the
LRU list in that order and kswapd reclaims them in the same order.
Reclaiming this memory doesn't help the watermark check in kswapd, so
too many pages are reclaimed, I guess.
So, I implemented approach #2.
One thing I should note is that we should not switch the allocation
target (movable list or CMA) on every allocation attempt, since that
would prevent allocated pages from being physically contiguous, which
could hurt the performance of some I/O devices. To avoid this, I keep
the same allocation target for at least pageblock_nr_pages attempts,
and make this number reflect the ratio of free pages excluding free
CMA pages to free CMA pages. With this approach, the system works very
smoothly and fully utilizes the pages in the CMA reserved area.
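The batching idea can be sketched like this (names and the bookkeeping structure are illustrative, not the kernel's; real batch sizes come from the free-page counters):

```c
/* Stay on one allocation target for a whole batch, alternating between
 * normal movable memory and the CMA area in a ratio mirroring their
 * free-page counts, so allocations remain physically contiguous. */
enum target { TARGET_MOVABLE, TARGET_CMA };

struct cma_interleave {
    long movable_batch;  /* replenished from free_pages - free_cma */
    long cma_batch;      /* replenished from free_cma */
    long movable_left, cma_left;
};

static enum target choose_target(struct cma_interleave *s)
{
    if (s->movable_left <= 0 && s->cma_left <= 0) { /* both batches spent */
        s->movable_left = s->movable_batch;
        s->cma_left = s->cma_batch;
    }
    if (s->movable_left > 0) {
        s->movable_left--;
        return TARGET_MOVABLE;
    }
    s->cma_left--;
    return TARGET_CMA;
}
```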
Following is the experimental result of this patch.
4 CPUs, 1024 MB, VIRTUAL MACHINE
make -j16
<Before>
CMA reserve: 0 MB 512 MB
Elapsed-time: 225.2 472.5
Average-MemFree: 322490 KB 630839 KB
nr_free_cma: 0 131068
pswpin: 0 261666
pswpout: 75 1241363
<After>
CMA reserve: 0 MB 512 MB
Elapsed-time: 222.7 224
Average-MemFree: 325595 KB 393033 KB
nr_free_cma: 0 61001
pswpin: 0 6
pswpout: 44 502
There is no difference if we don't have CMA reserved memory (the 0 MB
case). But with CMA reserved memory (the 512 MB case), this patch lets
us fully utilize the reserved memory, and the system behaves as if it
didn't reserve any memory.
With this patch, we aggressively allocate pages from the CMA reserved
area, so CMA allocation latency can increase. Below are the
experimental results on latency.
4 CPUs, 1024 MB, VIRTUAL MACHINE
CMA reserve: 512 MB
Background Workload: make -jN
Real Workload: 8 MB CMA allocation/free 20 times with 5 sec interval
N: 1 4 8 16
Elapsed-time(Before): 4309.75 9511.09 12276.1 77103.5
Elapsed-time(After): 5391.69 16114.1 19380.3 34879.2
So generally we see a latency increase, and the ratio of the increase
is rather big, up to 70%. But under heavy workload it shows a latency
decrease of up to 55%. The increase may be a worst-case scenario, but
reducing it would be important for some systems, so I can say that
this patch has both advantages and disadvantages in terms of latency.
Although I think this patch is the right direction for CMA, it has a
side effect in the following case. If there is a small memory zone and
CMA occupies most of it, the LRU for that zone will hold many CMA
pages. When reclaim starts, these CMA pages will be reclaimed but not
counted for watermark checking, so too many CMA pages could be
reclaimed unnecessarily. Until now this couldn't happen, because free
CMA pages weren't used easily. But with this patch free CMA pages are
used easily, so this problem becomes possible. I will handle it in
another patchset after some investigation.
v2:
- In fastpath, just replenish counters. Calculation is done whenever
CMA area is varied
v3:
- Use unsigned type in adjust_managed_cma_page_count() (per Gioh)
- Fix +/- count when calling adjust_managed_cma_page_count() (per Gioh)
- Instead of implementing __rmqueue_cma() which has another
__rmqueue_smallest(), choose_rmqueue_migratetype() is implemented to
change original migratetype to MIGRATE_CMA according to criteria. It
helps not to violate layering. (per Minchan in offline discussion)
BZ: 111727
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
commit 3617f2ca6d0eba48114308532945a7f1577816a4 upstream.
When a CPU is hot removed we'll cancel all the delayed work items
via gov_cancel_work(). Normally this will just cancel a delayed
timer on each CPU that the policy is managing and the work won't
run, but if the work is already running the workqueue code will
wait for the work to finish before continuing to prevent the
work items from re-queuing themselves like they normally do. This
scheme will work most of the time, except for the case where the
work function determines that it should adjust the delay for all
other CPUs that the policy is managing. If this scenario occurs,
the canceling CPU will cancel its own work but queue up the other
CPUs works to run. For example:
CPU0 CPU1
---- ----
cpu_down()
...
__cpufreq_remove_dev()
cpufreq_governor_dbs()
case CPUFREQ_GOV_STOP:
gov_cancel_work(dbs_data, policy);
cpu0 work is canceled
timer is canceled
cpu1 work is canceled <work runs>
<waits for cpu1> od_dbs_timer()
gov_queue_work(*, *, true);
cpu0 work queued
cpu1 work queued
cpu2 work queued
...
cpu1 work is canceled
cpu2 work is canceled
...
At the end of the GOV_STOP case cpu0 still has a work queued to
run although the code is expecting all of the works to be
canceled. __cpufreq_remove_dev() will then proceed to
re-initialize all the other CPUs works except for the CPU that is
going down. The CPUFREQ_GOV_START case in cpufreq_governor_dbs()
will trample over the queued work and debugobjects will spit out
a warning:
WARNING: at lib/debugobjects.c:260 debug_print_object+0x94/0xbc()
ODEBUG: init active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x10
Modules linked in:
CPU: 0 PID: 1491 Comm: sh Tainted: G W 3.10.0 #19
[<c010c178>] (unwind_backtrace+0x0/0x11c) from [<c0109dec>] (show_stack+0x10/0x14)
[<c0109dec>] (show_stack+0x10/0x14) from [<c01904cc>] (warn_slowpath_common+0x4c/0x6c)
[<c01904cc>] (warn_slowpath_common+0x4c/0x6c) from [<c019056c>] (warn_slowpath_fmt+0x2c/0x3c)
[<c019056c>] (warn_slowpath_fmt+0x2c/0x3c) from [<c0388a7c>] (debug_print_object+0x94/0xbc)
[<c0388a7c>] (debug_print_object+0x94/0xbc) from [<c0388e34>] (__debug_object_init+0x2d0/0x340)
[<c0388e34>] (__debug_object_init+0x2d0/0x340) from [<c019e3b0>] (init_timer_key+0x14/0xb0)
[<c019e3b0>] (init_timer_key+0x14/0xb0) from [<c0635f78>] (cpufreq_governor_dbs+0x3e8/0x5f8)
[<c0635f78>] (cpufreq_governor_dbs+0x3e8/0x5f8) from [<c06325a0>] (__cpufreq_governor+0xdc/0x1a4)
[<c06325a0>] (__cpufreq_governor+0xdc/0x1a4) from [<c0633704>] (__cpufreq_remove_dev.isra.10+0x3b4/0x434)
[<c0633704>] (__cpufreq_remove_dev.isra.10+0x3b4/0x434) from [<c08989f4>] (cpufreq_cpu_callback+0x60/0x80)
[<c08989f4>] (cpufreq_cpu_callback+0x60/0x80) from [<c08a43c0>] (notifier_call_chain+0x38/0x68)
[<c08a43c0>] (notifier_call_chain+0x38/0x68) from [<c01938e0>] (__cpu_notify+0x28/0x40)
[<c01938e0>] (__cpu_notify+0x28/0x40) from [<c0892ad4>] (_cpu_down+0x7c/0x2c0)
[<c0892ad4>] (_cpu_down+0x7c/0x2c0) from [<c0892d3c>] (cpu_down+0x24/0x40)
[<c0892d3c>] (cpu_down+0x24/0x40) from [<c0893ea8>] (store_online+0x2c/0x74)
[<c0893ea8>] (store_online+0x2c/0x74) from [<c04519d8>] (dev_attr_store+0x18/0x24)
[<c04519d8>] (dev_attr_store+0x18/0x24) from [<c02a69d4>] (sysfs_write_file+0x100/0x148)
[<c02a69d4>] (sysfs_write_file+0x100/0x148) from [<c0255c18>] (vfs_write+0xcc/0x174)
[<c0255c18>] (vfs_write+0xcc/0x174) from [<c0255f70>] (SyS_write+0x38/0x64)
[<c0255f70>] (SyS_write+0x38/0x64) from [<c0106120>] (ret_fast_syscall+0x0/0x30)
BZ: 113137
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d8996f63abe5a9d9b24f7a4df2c8459659d0e76f)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
commit 95731ebb114c5f0c028459388560fc2a72fe5049 upstream.
Cpufreq governors' stop and start operations should be carried out
in sequence. Otherwise, there will be unexpected behavior, like in
the example below.
Suppose there are 4 CPUs and policy->cpu=CPU0, CPU1/2/3 are linked
to CPU0. The normal sequence is:
1) Current governor is userspace. An application tries to set the
governor to ondemand. It will call __cpufreq_set_policy() in
which it will stop the userspace governor and then start the
ondemand governor.
2) Current governor is userspace. The online of CPU3 runs on CPU0.
It will call cpufreq_add_policy_cpu() in which it will first
stop the userspace governor, and then start it again.
If the sequence of the above two cases interleaves, it becomes:
1) Application stops userspace governor
2) Hotplug stops userspace governor
which is a problem, because the governor shouldn't be stopped twice
in a row. What happens next is:
3) Application starts ondemand governor
4) Hotplug starts a governor
In step 4, the hotplug is supposed to start the userspace governor,
but now the governor has been changed by the application to ondemand,
so the ondemand governor is started once again, which is incorrect.
The solution is to prevent policy governors from being stopped
multiple times in a row. A governor should only be stopped once for
one policy. After it has been stopped, no more governor stop
operations should be executed.
Also add a mutex to serialize governor operations.
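The stop-once rule plus the serializing mutex can be modeled in userspace like this (names are illustrative; the kernel's version lives in the cpufreq core):

```c
#include <pthread.h>
#include <stdbool.h>

/* Per-policy state: a mutex serializes governor operations, and a
 * flag rejects a second GOV_STOP while the governor is already off. */
struct policy_model {
    pthread_mutex_t lock;
    bool governor_enabled;
};

static int gov_stop(struct policy_model *p)
{
    int ret = 0;
    pthread_mutex_lock(&p->lock);
    if (!p->governor_enabled)
        ret = -1;               /* already stopped: refuse double stop */
    else
        p->governor_enabled = false;
    pthread_mutex_unlock(&p->lock);
    return ret;
}

static int gov_start(struct policy_model *p)
{
    int ret = 0;
    pthread_mutex_lock(&p->lock);
    if (p->governor_enabled)
        ret = -1;               /* already running: refuse double start */
    else
        p->governor_enabled = true;
    pthread_mutex_unlock(&p->lock);
    return ret;
}
```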
BZ: 113137
[rjw: Changelog. And you owe me a beverage of my choice.]
Signed-off-by: Xiaoguang Chen <chenxg@marvell.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit ba17ca46b968001df16f672ffe694fd0a12512f2)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
We are seeing a lot of PMU warnings on POWER8:
Can't find PMC that caused IRQ
Looking closer, the active PMC is 0 at this point and we took a PMU
exception on the transition from negative to 0. Some versions of POWER8
have an issue where they edge detect and not level detect PMC overflows.
A number of places program the PMC with (0x80000000 - period_left),
where period_left can be negative. We can either fix all of these or
just ensure that period_left is always >= 1.
This patch takes the second option.
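The clamp can be sketched as a small userspace model (the actual write to the PMC register is omitted):

```c
#include <stdint.h>

/* Some POWER8 revisions edge-detect rather than level-detect PMC
 * overflow, so the counter must never be programmed at or past the
 * overflow point. Clamping period_left to >= 1 guarantees the value
 * written is strictly below 0x80000000. */
static uint32_t pmc_program_value(int64_t period_left)
{
    if (period_left < 1)
        period_left = 1;
    return (uint32_t)(0x80000000LL - period_left);
}
```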
Cc: <stable@vger.kernel.org>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
On POWER8 when switching to a KVM guest we set bits in MMCR2 to freeze
the PMU counters. Aside from on boot they are then never reset,
resulting in stuck perf counters for any user in the guest or host.
We now set MMCR2 to 0 whenever enabling the PMU, which provides a sane
state for perf to use the PMU counters under either the guest or the
host.
This was manifesting as a bug with ppc64_cpu --frequency:
$ sudo ppc64_cpu --frequency
WARNING: couldn't run on cpu 0
WARNING: couldn't run on cpu 8
...
WARNING: couldn't run on cpu 144
WARNING: couldn't run on cpu 152
min: 18446744073.710 GHz (cpu -1)
max: 0.000 GHz (cpu -1)
avg: 0.000 GHz
The command uses a perf counter to measure CPU cycles over a fixed
amount of time, in order to approximate the frequency of the machine.
The counters were returning zero once a guest was started, regardless of
whether it was still running or had been shut down.
By dumping the value of MMCR2, it was observed that once a guest is
running MMCR2 is set to 1s - which stops counters from running:
$ sudo sh -c 'echo p > /proc/sysrq-trigger'
CPU: 0 PMU registers, ppmu = POWER8 n_counters = 6
PMC1: 5b635e38 PMC2: 00000000 PMC3: 00000000 PMC4: 00000000
PMC5: 1bf5a646 PMC6: 5793d378 PMC7: deadbeef PMC8: deadbeef
MMCR0: 0000000080000000 MMCR1: 000000001e000000 MMCRA: 0000040000000000
MMCR2: fffffffffffffc00 EBBHR: 0000000000000000
EBBRR: 0000000000000000 BESCR: 0000000000000000
SIAR: 00000000000a51cc SDAR: c00000000fc40000 SIER: 0000000001000000
This is done unconditionally in book3s_hv_interrupts.S upon entering the
guest, and the original value is only saved/restored if the host has
indicated it was using the PMU. This is okay, however the user of the
PMU needs to ensure that it is in a defined state when it starts using
it.
BZ: 112045
Fixes: e05b9b9e5c10 ("powerpc/perf: Power8 PMU support")
Cc: stable@vger.kernel.org
Signed-off-by: Joel Stanley <joel@jms.id.au>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Instead of separate bits for every POWER8 PMU feature, have a single one
for v2.07 of the architecture.
This saves us adding a MMCR2 define for a future patch.
BZ: 112045
Cc: stable@vger.kernel.org
Signed-off-by: Joel Stanley <joel@jms.id.au>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
These two registers are already saved in the block above. Aside from
being unnecessary, by the time we get down to the second save location
r8 no longer contains MMCR2, so we are clobbering the saved value with
PMC5.
MMCR2 primarily consists of counter freeze bits. So restoring the value
of PMC5 into MMCR2 will most likely have the effect of freezing
counters.
BZ: 112045
Fixes: 72cde5a88d37 ("KVM: PPC: Book3S HV: Save/restore host PMU registers that are new in POWER8")
Cc: stable@vger.kernel.org
Signed-off-by: Joel Stanley <joel@jms.id.au>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
On the PowerNV platform, we are holding an unnecessary refcount on a
pci_dev, which means the pci_dev is not destroyed when hotplugging a
PCI device.
This patch releases the unnecessary refcount.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
(cherry picked from commit 4966bfa1b3347ee75e6d93859a2e8ce9a662390c)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
During an EEH hotplug event, iommu_add_device() is invoked three
times, and two of those invocations trigger a warning or an error.
The three invocations of iommu_add_device() are:
pci_device_add
...
set_iommu_table_base_and_group <- 1st time, fail
device_add
...
tce_iommu_bus_notifier <- 2nd time, succeeds
pcibios_add_pci_devices
...
pcibios_setup_bus_devices <- 3rd time, re-attach
The first invocation fails, since dev->kobj->sd is not yet
initialized; it is initialized in device_add().
The third invocation's warning is triggered by the re-attach of the
iommu_group.
After applying this patch, the error
iommu_tce: 0003:05:00.0 has not been added, ret=-14
and the warning
[ 204.123609] ------------[ cut here ]------------
[ 204.123645] WARNING: at arch/powerpc/kernel/iommu.c:1125
[ 204.123680] Modules linked in: xt_CHECKSUM nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT bnep bluetooth 6lowpan_iphc rfkill xt_conntrack ebtable_nat ebtable_broute bridge stp llc mlx4_ib ib_sa ib_mad ib_core ib_addr ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw bnx2x tg3 mlx4_core nfsd ptp mdio ses libcrc32c nfs_acl enclosure be2net pps_core shpchp lockd kvm uinput sunrpc binfmt_misc lpfc scsi_transport_fc ipr scsi_tgt
[ 204.124356] CPU: 18 PID: 650 Comm: eehd Not tainted 3.14.0-rc5yw+ #102
[ 204.124400] task: c0000027ed485670 ti: c0000027ed50c000 task.ti: c0000027ed50c000
[ 204.124453] NIP: c00000000003cf80 LR: c00000000006c648 CTR: c00000000006c5c0
[ 204.124506] REGS: c0000027ed50f440 TRAP: 0700 Not tainted (3.14.0-rc5yw+)
[ 204.124558] MSR: 9000000000029032 <SF,HV,EE,ME,IR,DR,RI> CR: 88008084 XER: 20000000
[ 204.124682] CFAR: c00000000006c644 SOFTE: 1
GPR00: c00000000006c648 c0000027ed50f6c0 c000000001398380 c0000027ec260300
GPR04: c0000027ea92c000 c00000000006ad00 c0000000016e41b0 0000000000000110
GPR08: c0000000012cd4c0 0000000000000001 c0000027ec2602ff 0000000000000062
GPR12: 0000000028008084 c00000000fdca200 c0000000000d1d90 c0000027ec281a80
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000001
GPR24: 000000005342697b 0000000000002906 c000001fe6ac9800 c000001fe6ac9800
GPR28: 0000000000000000 c0000000016e3a80 c0000027ea92c090 c0000027ea92c000
[ 204.125353] NIP [c00000000003cf80] .iommu_add_device+0x30/0x1f0
[ 204.125399] LR [c00000000006c648] .pnv_pci_ioda_dma_dev_setup+0x88/0xb0
[ 204.125443] Call Trace:
[ 204.125464] [c0000027ed50f6c0] [c0000027ed50f750] 0xc0000027ed50f750 (unreliable)
[ 204.125526] [c0000027ed50f750] [c00000000006c648] .pnv_pci_ioda_dma_dev_setup+0x88/0xb0
[ 204.125588] [c0000027ed50f7d0] [c000000000069cc8] .pnv_pci_dma_dev_setup+0x78/0x340
[ 204.125650] [c0000027ed50f870] [c000000000044408] .pcibios_setup_device+0x88/0x2f0
[ 204.125712] [c0000027ed50f940] [c000000000046040] .pcibios_setup_bus_devices+0x60/0xd0
[ 204.125774] [c0000027ed50f9c0] [c000000000043acc] .pcibios_add_pci_devices+0xdc/0x1c0
[ 204.125837] [c0000027ed50fa50] [c00000000086f970] .eeh_reset_device+0x36c/0x4f0
[ 204.125939] [c0000027ed50fb20] [c00000000003a2d8] .eeh_handle_normal_event+0x448/0x480
[ 204.126068] [c0000027ed50fbc0] [c00000000003a35c] .eeh_handle_event+0x4c/0x340
[ 204.126192] [c0000027ed50fc80] [c00000000003a74c] .eeh_event_handler+0xfc/0x1b0
[ 204.126319] [c0000027ed50fd30] [c0000000000d1ea0] .kthread+0x110/0x130
[ 204.126430] [c0000027ed50fe30] [c00000000000a460] .ret_from_kernel_thread+0x5c/0x7c
[ 204.126556] Instruction dump:
[ 204.126610] 7c0802a6 fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff71 7c7e1b78 60000000
[ 204.126787] 60000000 e87e0298 3143ffff 7d2a1910 <0b090000> 2fa90000 40de00c8 ebfe0218
[ 204.126966] ---[ end trace 6e7aefd80add2973 ]---
are cleared.
This patch removes the iommu_add_device() call in
pnv_pci_ioda_dma_dev_setup(), which reverts part of the change in
commit d905c5df ("PPC: POWERNV: move iommu_add_device earlier").
It addresses bug#110805.
Upstream commit 3f28c5af3964c11e61e9a58df77cae5ebdb8209e
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
I am seeing an issue where a CPU running perf eventually hangs.
Traces show timer interrupts happening every 4 seconds even
when a userspace task is running on the CPU. /proc/timer_list
also shows pending hrtimers have not run in over an hour,
including the scheduler.
Looking closer, decrementers_next_tb is getting set to
0xffffffffffffffff, and at that point we will never take
a timer interrupt again.
In __timer_interrupt() we set decrementers_next_tb to
0xffffffffffffffff and rely on ->event_handler to update it:
*next_tb = ~(u64)0;
if (evt->event_handler)
evt->event_handler(evt);
In this case ->event_handler is hrtimer_interrupt. This will eventually
call back through the clockevents code with the next event to be
programmed:
static int decrementer_set_next_event(unsigned long evt,
struct clock_event_device *dev)
{
/* Don't adjust the decrementer if some irq work is pending */
if (test_irq_work_pending())
return 0;
__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
If irq work came in between these two points, we will return
before updating decrementers_next_tb and we never process a timer
interrupt again.
This looks to have been introduced by 0215f7d8c53f (powerpc: Fix races
with irq_work). Fix it by removing the early exit and relying on
code later on in the function to force an early decrementer:
/* We may have raced with new irq work */
if (test_irq_work_pending())
set_dec(1);
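The fixed ordering can be modeled in userspace like this (stand-in globals replace the per-CPU variables and the real set_dec()):

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the per-CPU state and the decrementer register */
static uint64_t decrementers_next_tb;
static uint64_t dec_value;
static bool irq_work_pending;

static bool test_irq_work_pending(void) { return irq_work_pending; }
static void set_dec(uint64_t v) { dec_value = v; }

/* Fixed shape: no early exit, so decrementers_next_tb is always
 * updated; racing irq work just forces an immediate decrementer. */
static int decrementer_set_next_event_model(uint64_t evt, uint64_t tb_now)
{
    decrementers_next_tb = tb_now + evt;
    set_dec(evt);
    /* We may have raced with new irq work */
    if (test_irq_work_pending())
        set_dec(1);
    return 0;
}
```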
Liu Ping Fan <pingfank@linux.vnet.ibm.com>: backport from upstream
commit 8050936caf125fbe54111ba5e696b68a360556ba to fix bug#104457.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@vger.kernel.org # 3.14+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This brings the permissions in line with the upstream implementation,
allowing users to see the state of the system.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Some backends call hvc_kick() to wakeup the HVC thread from its
slumber upon incoming characters. This however doesn't work
properly because it uses msleep_interruptible() which is mostly
immune to wake_up_process(). It will basically go back to sleep
until the timeout is expired (only signals can really wake it).
Replace it with a simple schedule_timeout_interruptible() instead,
which may wake up earlier every now and then, but we really don't
care in this case.
Backport of upstream 15a2743193b099f82657ca315dd2e1091be6c1d3
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Commit f5c57710dd62dd06f176934a8b4b8accbf00f9f8 ("powerpc/eeh: Use
partial hotplug for EEH unaware drivers") introduces eeh_rmv_device,
which may grab a reference to a driver, but not release it.
That prevents a driver from being removed after it has gone through EEH
recovery.
This patch drops the reference if it was taken.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Acked-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
(cherry picked from commit 8cc6b6cd8713457be80202fc4264f05d20bc5e1b)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Currently, when an off-line CPU wakes up from nap, we check for the
possible reasons for the wakeup (split-core mode change, or the CPU
needs to come online) before clearing the IPI that woke us. This
leaves a possible race in the situation where we wake up for some
unrelated reason, typically a leftover IPI from a KVM guest exit.
If some other CPU sets a flag and then sends an IPI at just the right
time, it is possible that we don't see the flag but then clear the
IPI that the other CPU sent, and therefore miss a wakeup.
To fix this, we clear the IPI first, and only check for any flags
after a barrier. That way if we miss the flag setting we are sure not
to have cleared the IPI that the other CPU sent.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
On the PowerNV platform, EEH errors are reported by IO accessors or by a
poller driven by interrupt. After the PE is isolated, we won't produce
another EEH event for the PE. The current implementation can lose an EEH
event in this way:
The interrupt handler queues one "special" event, which drives the poller.
The EEH thread hasn't picked up the special event yet. An IO accessor kicks
in, the frozen PE is marked as "isolated" and an EEH event is queued to the
list. The EEH thread runs because of the special event and purges all
existing EEH events. However, we never produce another EEH event for the
frozen PE. Eventually, the PE stays marked as "isolated" and we don't have
an EEH event to recover it.
The patch fixes the issue by keeping EEH events for PEs that have been
marked as "isolated", with the help of an additional "force" argument to
eeh_remove_event().
The problem was reported by Rolf Brudeseth and we don't have an open bug
tracking it. The patch has been merged into mainline Linux; this is a port
to Frobisher.
Reported-by: Rolf Brudeseth <rolfb@us.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
(cherry picked from commit cff261f6bd03612e792e4c8872c6ad049f743863)
Signed-off-by: Sanket Rathi <sanket@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
(cherry picked from commit 725dd399ae69d0703c0417f9ce0ce065d2a914d1)
Signed-off-by: Sanket Rathi <sanket@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
(cherry picked from commit 4902b381c6c99e5edaca1e2549f0a5149d90feec)
Signed-off-by: Sanket Rathi <sanket@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
(cherry picked from commit 164cecd1b9aed821d29ee9543ea4ad7435321823)
Signed-off-by: Sanket Rathi <sanket@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
getting aborted
Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
(cherry picked from commit afbd8d8884325bcc4fc4c12fcb2eccbf9356feca)
Signed-off-by: Sanket Rathi <sanket@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
(cherry picked from commit 3be30e0e4486b3568044efe27caf405296d7845a)
Signed-off-by: Sanket Rathi <sanket@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
(cherry picked from commit 91f32d01d9fff7f5f15f3ad136e55dc42d02f9ff)
Signed-off-by: Sanket Rathi <sanket@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: James Smart <james.smart@emulex.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
(cherry picked from commit 3bf41ba9376cda911e908dca36fe016293ad8fef)
Signed-off-by: Sanket Rathi <sanket@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
commit 621b5060e823301d0cba4cb52a7ee3491922d291 upstream.
When we fork/clone we currently don't copy any of the TM state to the new
thread. This results in a TM bad thing (program check) when the new process is
switched in as the kernel does a tmrechkpt with TEXASR FS not set. Also, since
R1 is from userspace, we trigger the bad kernel stack pointer detection. So we
end up with something like this:
Bad kernel stack pointer 0 at c0000000000404fc
cpu 0x2: Vector: 700 (Program Check) at [c00000003ffefd40]
pc: c0000000000404fc: restore_gprs+0xc0/0x148
lr: 0000000000000000
sp: 0
msr: 9000000100201030
current = 0xc000001dd1417c30
paca = 0xc00000000fe00800 softe: 0 irq_happened: 0x01
pid = 0, comm = swapper/2
WARNING: exception is not recoverable, can't continue
The below fixes this by flushing the TM state before we copy the task_struct to
the clone. To do this we go through the tmreclaim path, which removes the
checkpointed registers from the CPU and transitions the CPU out of TM suspend
mode. Hence we need to call tmrechkpt after to restore the checkpointed state
and the TM mode for the current task.
To make this fail from userspace is simply:
tbegin
li r0, 2
sc
<boom>
Kudos to Adhemerval Zanella Neto for finding this.
Signed-off-by: Michael Neuling <mikey@neuling.org>
cc: Adhemerval Zanella Neto <azanella@br.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[Backported to 3.10: context adjust]
Signed-off-by: Xue Liu <liuxueliu.liu@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit aece4fa7368debd14ac07ebaf569587ff02cc596)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
commit e6b8fd028b584ffca7a7255b8971f254932c9fce upstream.
We can't take an IRQ when we're about to do a trechkpt as our GPR state is set
to user GPR values.
We've hit this when running some IBM Java stress tests in the lab resulting in
the following dump:
cpu 0x3f: Vector: 700 (Program Check) at [c000000007eb3d40]
pc: c000000000050074: restore_gprs+0xc0/0x148
lr: 00000000b52a8184
sp: ac57d360
msr: 8000000100201030
current = 0xc00000002c500000
paca = 0xc000000007dbfc00 softe: 0 irq_happened: 0x00
pid = 34535, comm = Pooled Thread #
R00 = 00000000b52a8184 R16 = 00000000b3e48fda
R01 = 00000000ac57d360 R17 = 00000000ade79bd8
R02 = 00000000ac586930 R18 = 000000000fac9bcc
R03 = 00000000ade60000 R19 = 00000000ac57f930
R04 = 00000000f6624918 R20 = 00000000ade79be8
R05 = 00000000f663f238 R21 = 00000000ac218a54
R06 = 0000000000000002 R22 = 000000000f956280
R07 = 0000000000000008 R23 = 000000000000007e
R08 = 000000000000000a R24 = 000000000000000c
R09 = 00000000b6e69160 R25 = 00000000b424cf00
R10 = 0000000000000181 R26 = 00000000f66256d4
R11 = 000000000f365ec0 R27 = 00000000b6fdcdd0
R12 = 00000000f66400f0 R28 = 0000000000000001
R13 = 00000000ada71900 R29 = 00000000ade5a300
R14 = 00000000ac2185a8 R30 = 00000000f663f238
R15 = 0000000000000004 R31 = 00000000f6624918
pc = c000000000050074 restore_gprs+0xc0/0x148
cfar= c00000000004fe28 dont_restore_vec+0x1c/0x1a4
lr = 00000000b52a8184
msr = 8000000100201030 cr = 24804888
ctr = 0000000000000000 xer = 0000000000000000 trap = 700
This moves tm_recheckpoint to a C function and moves the tm_restore_sprs into
that function. It then adds IRQ disabling over the trechkpt critical section.
It also sets the TEXASR FS in the signals code to ensure this is never set now
that we explicitly write the TM SPRs in tm_recheckpoint.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b2b708cf2f9c51bf5a75845eb0b2f2390707957c)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
The adapter is freed before we check its flags. This was caused
by commit 144be3d ("net/cxgb4: Avoid disabling PCI device for
twice"). The problem was reported by Intel's "0-day" tool.
The patch fixes it while avoiding a revert of commit 144be3d. It's
in response to bug#110450.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
|
|
We can possibly retrieve the adapter's statistics during EEH recovery,
and that should be disallowed. Otherwise, it can incur a repeated
EEH error, and EEH recovery would eventually fail.
The patch reuses the statistics lock and checks that the net_device is
attached before retrieving statistics, so that the problem can be
avoided.
It's in response to bug#110450.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
|
|
If an EEH error happens on the adapter and we have to remove
it from the system for some reason (e.g. more than 5 EEH errors
detected on the device in the last hour), the adapter will be disabled
twice, separately by eeh_err_detected() and remove_one(), which
incurs the following unexpected backtrace. The patch avoids
that.
It's in response to bug#110450.
WARNING: at drivers/pci/pci.c:1431
CPU: 12 PID: 121 Comm: eehd Not tainted 3.13.0-rc7+ #1
task: c0000001823a3780 ti: c00000018240c000 task.ti: c00000018240c000
NIP: c0000000003c1e40 LR: c0000000003c1e3c CTR: 0000000001764c5c
REGS: c00000018240f470 TRAP: 0700 Not tainted (3.13.0-rc7+)
MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 28000024 XER: 00000004
CFAR: c000000000706528 SOFTE: 1
GPR00: c0000000003c1e3c c00000018240f6f0 c0000000010fe1f8 0000000000000035
GPR04: 0000000000000000 0000000000000000 00000000003ae509 0000000000000000
GPR08: 000000000000346f 0000000000000000 0000000000000000 0000000000003fef
GPR12: 0000000028000022 c00000000ec93000 c0000000000c11b0 c000000184ac3e40
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24: 0000000000000000 c0000000009398d8 c00000000101f9c0 c0000001860ae000
GPR28: c000000182ba0000 00000000000001f0 c0000001860ae6f8 c0000001860ae000
NIP [c0000000003c1e40] .pci_disable_device+0xd0/0xf0
LR [c0000000003c1e3c] .pci_disable_device+0xcc/0xf0
Call Trace:
[c0000000003c1e3c] .pci_disable_device+0xcc/0xf0 (unreliable)
[d0000000073881c4] .remove_one+0x174/0x320 [cxgb4]
[c0000000003c57e0] .pci_device_remove+0x60/0x100
[c00000000046396c] .__device_release_driver+0x9c/0x120
[c000000000463a20] .device_release_driver+0x30/0x60
[c0000000003bcdb4] .pci_stop_bus_device+0x94/0xd0
[c0000000003bcf48] .pci_stop_and_remove_bus_device+0x18/0x30
[c00000000003f548] .pcibios_remove_pci_devices+0xa8/0x140
[c000000000035c00] .eeh_handle_normal_event+0xa0/0x3c0
[c000000000035f50] .eeh_handle_event+0x30/0x2b0
[c0000000000362c4] .eeh_event_handler+0xf4/0x1b0
[c0000000000c12b8] .kthread+0x108/0x130
[c00000000000a168] .ret_from_kernel_thread+0x5c/0x74
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
|
|
Occasional failures have been seen with split-core mode and migration
where the message "KVM: couldn't grab cpu" appears. This increases
the length of time that we wait from 1ms to 10ms, which seems to
work around the issue.
Fixes: BZ 110865
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This patch fixes the EEH recovery issue in bnx2x.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
LTC-Bugzilla: #110449
|
|
On the Tuleta system, HTX hits a data miscompare issue after EEH recovery.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
|
|
Add a memory barrier to ensure the valid bit is read before
any of the cqe payload is read. This fixes an issue seen
on Power where the cqe payload was getting loaded before
the valid bit. When this occurred, we saw an iotag out of
range error when a command completed, but since the iotag
looked invalid the command didn't get completed to scsi core.
Later we hit the command timeout, attempt to abort the command,
then wait for the aborted command to be returned. Since the
adapter has already returned the command, we time out waiting
and end up escalating EEH all the way to a host reset. This
patch fixes the issue.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
|
|
Pulled from 3.10.23 stable for bug 110340.
From abb5100737bba3f82b5514350fea89ca361ac66c Mon Sep 17 00:00:00 2001
From: Peter Hurley <peter@hurleysoftware.com>
Date: Sat, 3 May 2014 14:04:59 +0200
Subject: n_tty: Fix n_tty_write crash when echoing in raw mode
commit 4291086b1f081b869c6d79e5b7441633dc3ace00 upstream.
The tty atomic_write_lock does not provide an exclusion guarantee for
the tty driver if the termios settings are LECHO & !OPOST. And since
it is unexpected and not allowed to call TTY buffer helpers like
tty_insert_flip_string concurrently, this may lead to crashes when
concurrent writers call pty_write. In that case the following two
writers:
* the ECHOing from a workqueue and
* pty_write from the process
race and can overflow the corresponding TTY buffer like follows.
If we look into tty_insert_flip_string_fixed_flag, there is:
	int space = __tty_buffer_request_room(port, goal, flags);
	struct tty_buffer *tb = port->buf.tail;
	...
	memcpy(char_buf_ptr(tb, tb->used), chars, space);
	...
	tb->used += space;
so the race of the two can result in something like this:
A: __tty_buffer_request_room
B: __tty_buffer_request_room
A: memcpy(buf(tb->used), ...)
A: tb->used += space;
B: memcpy(buf(tb->used), ...) -> BOOM
B's memcpy is past the tty_buffer due to the previous A's tb->used
increment.
Since the N_TTY line discipline input processing can output
concurrently with a tty write, obtain the N_TTY ldisc output_lock to
serialize echo output with normal tty writes. This ensures the tty
buffer helper tty_insert_flip_string is not called concurrently and
everything is fine.
Note that this is nicely reproducible by an ordinary user using
forkpty and some setup around that (raw termios + ECHO). And it is
present in kernels at least after commit
d945cb9cce20ac7143c2de8d88b187f62db99bdc (pty: Rework the pty layer to
use the normal buffering logic) in 2.6.31-rc3.
js: add more info to the commit log
js: switch to bool
js: lock unconditionally
js: lock only the tty->ops->write call
References: CVE-2014-0196
Reported-and-tested-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Pulled from 3.10.23 stable for bug 110340.
From a9ded882d5168e2fd5c0c20e2874f85c56016b4b Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini@redhat.com>
Date: Fri, 28 Mar 2014 20:41:50 +0100
Subject: KVM: ioapic: fix assignment of ioapic->rtc_status.pending_eoi
(CVE-2014-0155)
commit 5678de3f15010b9022ee45673f33bcfc71d47b60 upstream.
QE reported that they got the BUG_ON in ioapic_service to trigger.
I cannot reproduce it, but there are two reasons why this could happen.
The less likely but also easiest one, is when kvm_irq_delivery_to_apic
does not deliver to any APIC and returns -1.
Because irqe.shorthand == 0, the kvm_for_each_vcpu loop in that
function is never reached. However, you can target the similar loop in
kvm_irq_delivery_to_apic_fast; just program a zero logical destination
address into the IOAPIC, or an out-of-range physical destination address.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
We have observed that on machines with all their memory in a single
node, it is possible to hit an out of memory situation where kernel
allocations (which can't use the CMA pool) fail, triggering the OOM
killer, yet reclaim doesn't start because there is still free memory
in the CMA pool. To alleviate this situation somewhat, this reduces
the default CMA pool size from 5% to 3% of system memory. The 3%
should still be enough in most situations, and if not, the user can
specify a different amount on the kernel command line.
This should help with BZ 110181.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
As Ben suggested, it's meaningful to dump the PE's location code
for site engineers when hitting EEH errors. The patch introduces
the function eeh_pe_loc_get() to retrieve the location code from the
device tree so that we can output it when hitting EEH errors.
If the primary PE bus is the root bus, the PHB's device node is
tried prior to the root port's device node. Otherwise, the upstream
bridge's device node of the primary PE bus is checked for the
location code directly.
This fixes BZ 109585. Please apply to the next build for GA.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The first bug is that we are testing the C (changed) bit in the hashed
page table without first doing a tlbie. The architecture allows the
update of the C bit to happen at any time up until we do a tlbie for
the page. However, we don't want to do a tlbie for every page on every
pass of a migration operation. Thus we do the tlbie if there are no
vcpus currently running, which would indicate the final phase of
migration. If any vcpus are running then reading the dirty log is
already racy because pages could get dirtied immediately after we
check them. Also, we don't need to do the tlbie if the HPT entry
doesn't allow writing, since in that case the C bit can not get set.
The second bug is that in the case where we see a dirty 16MB page
followed by a dirty 4kB page (both mapping to the same guest real
address), we return 1 rather than 16MB / PAGE_SIZE. The return value,
indicating the number of dirty pages, needs to reflect the largest
dirty page we come across, not the last dirty page we see.
Fixes: 109551 (this time for sure)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
The dirty map has one bit per system page (4K/64K), and when we populate
the dirty map, we reset the Change bit in the HPT, whose entry is expected
to map a page less than or equal to the system page size. This works until
we start using huge pages (16MB). In that case, we mark dirty just a single
system page and miss the rest of the 16MB page, which may be dirty as well.
This changes kvm_test_clear_dirty to return the actual number of pages
which is calculated from HPT entry.
This changes kvmppc_hv_get_dirty_log() to make pages dirty starting from
the rounded guest physical page number.
[paulus@samba.org - don't advance i in the loop to set dirty bits, so
that we make sure to clear C in all HPTEs.]
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
table.
We reserve 5% of total RAM for CMA allocation, and not using it can
result in us running out of NUMA node memory with certain
configurations. One caveat is that we may not get a node-local HPT with
a pinned-vcpu configuration. But currently libvirt also pins the vcpus
to a cpuset after creating the hash page table.
Reviewed-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
|
|
Commit 63fa7d4 ("powerpc/eeh: Escalate error on non-existing PE")
escalates the frozen state on a non-existing PE to a fenced PHB. It
was meant to improve kdump reliability. After that, commit 716a0e8
("powerpc/powernv: Reset PHB in kdump kernel") was introduced to
apply a complete reset on all PHBs to increase kdump reliability.
Commit 63fa7d4 is therefore no longer useful, and issuing a PHB reset
on a PHB that is not fenced at the hardware level would cause
unexpected problems. So I'd like to revert it.
It's in response to bug#109562.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
|
|
When we hit the corner case of a frozen parent and child PE at the
same time, we have to handle the frozen parent PE prior to the
child. Without clearing the frozen state on the parent PE, the child
PE can't be recovered successfully.
There are two ways (polling and interrupt) for a frozen PE to be
reported. If there is a frozen parent PE out there, we have to report
and handle it first.
It's in response to bug#109562.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
|
|
Since commit cb523e09 ("powerpc/eeh: Avoid I/O access during PE
reset"), the PE is kept in frozen state at the hardware level until
the PE reset is done completely. After that, we explicitly clear
the frozen state of the affected PE. However, there might be
frozen child PEs of the affected PE, and we need to clear their
frozen state as well. Otherwise, the recovery is going to fail.
It's in response to bug#109562.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
|
|
Currently we forward to the guest MCEs which have already been recovered,
and for unhandled errors we do not deliver the MCE to the guest. It looks
like, with no FWNMI support in qemu, the guest just panics whenever we
deliver the recovered MCEs to it. Also, the existing code used to return
to the host for unhandled errors, which was causing the guest to hang with
soft lockups inside the guest and made it difficult to recover the guest
instance.
This patch now forwards all fatal MCEs to the guest, causing the guest to
crash/panic. And for recovered errors we just go back to normal functioning
of the guest instead of returning to the host. This fixes the soft lockup
issues in the guest.
This patch also fixes an issue where guest MCE events were not logged to
host console.
This patch fixes bz108165 and bz108413
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
|
|
During split-core operations, one of the online CPUs is nominated as the
"master" and then stop_machine() is invoked to perform the split/unsplit
procedure. Between these 2 steps, if CPU hotplug occurs and takes the
just nominated "master" CPU offline, then the split/unsplit procedure
does not complete properly and leads to undesirable effects.
So protect the entire split-core operation with get/put_online_cpus()
to synchronize with CPU hotplug.
Fixes bz 105509.
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
On newer revisions (DD2.1 and higher), the hardware manages the resync
during split-core operations. So we don't need to call opal_resync_timebase()
on those systems.
Fixes bz 105856.
[Srivatsa: Added changelog]
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
We don't see the MCE counter getting incremented in /proc/interrupts, which
gives the false impression that no MCE occurred even when there were MCE
events.
The machine check early handling was added for PowerKVM and we missed
incrementing the MCE count in the early handler.
We also increment the MCE counters in the machine_check_exception call, but
in most cases where we handle the error the hypervisor never reaches there
unless the error is fatal and we want to crash. Only in a fatal situation
may we see a double increment of the MCE count. We need to fix that, but
for now it is always better to have some count incremented instead of zero.
This fixes the MCE count issue mentioned in bz108413
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
|
|
Without this, we get lockdep errors
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Currently machine check handler does not check for stack overflow for
nested machine check. If we hit another MCE while inside the machine check
handler repeatedly from the same address, then we risk a stack
overflow, which can cause huge memory corruption. This patch limits the
nested MCE level to 4 and panic when we cross level 4.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The size of the sysparam sysfs files is determined from the device tree
at boot. However the buffer is hard coded to 64 bytes. If we encounter a
parameter that is larger than 64 bytes, or mis-parse the device tree, the
buffer will overflow when reading or writing to the parameter.
Check it at discovery time, and if the parameter is too large, do not
create a sysfs entry for it.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The sysparam code currently uses the userspace supplied number of
bytes when memcpy()ing in to a local 64-byte buffer.
Limit the maximum number of bytes by the size of the buffer.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The OPAL calls are returning int64_t values, which the sysparam code
stores in an int, and the sysfs callback returns ssize_t. Make the code
easier to read by consistently using ssize_t.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
When a sysparam query in OPAL returned a negative value (error code),
sysfs would spew out a decent chunk of memory; almost 64K more than
expected. This was traced to a sign/unsigned mix up in the OPAL sysparam
sysfs code at sys_param_show.
The return value of sys_param_show is a ssize_t, calculated using
return ret ? ret : attr->param_size;
Alan Modra explains:
"attr->param_size" is an unsigned int, "ret" an int, so the overall
expression has type unsigned int. Result is that ret is cast to
unsigned int before being cast to ssize_t.
Instead of using the ternary operator, set ret to the param_size if an
error is not detected. The same bug exists in the sysfs write callback;
this patch fixes it in the same way.
A note on debugging this next time: on my system gcc will warn about
this if compiled with -Wsign-compare, which is not enabled by -Wall,
only -Wextra.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Today CPUs in fast sleep are being woken up to handle their timers
by the tick broadcast framework using a hrtimer queued on a nominated
broadcast CPU. The hrtimer is programmed for the earlier of the next
wakeup and a broadcast period which happens to be a jiffy. This
programming is being done incorrectly today. The current time
is noted, the tick broadcast interrupt handler is called, then the
time at which the hrtimer needs to be programmed is decided. By
then the noted current time would be stale and the hrtimer would
be forwarded much further ahead than required, leading to delayed
broadcast interrupts being delivered to sleeping cpus.
Fix this by noting the current time just before programming the hrtimer.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
|
|
Current code does not check for unhandled/unrecovered errors and returns
from the interrupt as if it were a recoverable exception, which in turn
triggers the same machine check exception in a loop, causing the
hypervisor to become unresponsive.
This patch fixes this situation and forces hypervisor to panic for
unhandled/unrecovered errors.
This patch also fixes another issue where unrecoverable_exception routine
was called in real mode in case of unrecoverable exception (MSR_RI = 0).
This causes another exception vector 0x300 (data access) during system crash
leading to confusion while debugging cause of the system crash.
With the above fixes we now throw correct console messages (see below) while
crashing the system in case of unhandled/unrecoverable machine checks.
--------------
Severe Machine check interrupt [Not recovered]
Initiator: CPU
Error type: UE [Instruction fetch]
Effective address: 0000000030002864
Oops: Machine check, sig: 7 [#1]
SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in: bork(O) bridge stp llc kvm [last unloaded: bork]
CPU: 36 PID: 55162 Comm: bash Tainted: G O 3.14.0mce #1
task: c000002d72d022d0 ti: c000000007ec0000 task.ti: c000002d72de4000
NIP: 0000000030002864 LR: 00000000300151a4 CTR: 000000003001518c
REGS: c000000007ec3d80 TRAP: 0200 Tainted: G O (3.14.0mce)
MSR: 9000000000041002 <SF,HV,ME,RI> CR: 28222848 XER: 20000000
CFAR: 0000000030002838 DAR: d0000000004d0000 DSISR: 00000000 SOFTE: 1
GPR00: 000000003001512c 0000000031f92cb0 0000000030078af0 0000000030002864
GPR04: d0000000004d0000 0000000000000000 0000000030002864 ffffffffffffffc9
GPR08: 0000000000000024 0000000030008af0 000000000000002c c00000000150e728
GPR12: 9000000000041002 0000000031f90000 0000000010142550 0000000040000000
GPR16: 0000000010143cdc 0000000000000000 00000000101306fc 00000000101424dc
GPR20: 00000000101424e0 000000001013c6f0 0000000000000000 0000000000000000
GPR24: 0000000010143ce0 00000000100f6440 c000002d72de7e00 c000002d72860250
GPR28: c000002d72860240 c000002d72ac0038 0000000000000008 0000000000040000
NIP [0000000030002864] 0x30002864
LR [00000000300151a4] 0x300151a4
Call Trace:
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
---[ end trace 7285f0beac1e29d3 ]---
Sending IPI to other CPUs
IPI complete
OPAL V3 detected !
--------------
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The PPC_MSG_TIMER IPI message slot was introduced for the tick broadcast
IPIs which are required to wakeup sleeping CPUs. The decrementer of the
CPUs that enter fast sleep stops as a consequence of entering the idle
state. Therefore such CPUs have to be woken up in time to handle their
timers by a broadcast CPU which sends the PPC_MSG_TIMER IPIs to them.
This IPI message is being parsed wrongly in smp_ipi_demux(). Thus the
tick broadcast interrupt handler is never executed on the sleeping CPU.
This could have led to unpleasant side effects like not handling timers
in time on the sleeping cpus. But since the sleeping CPUs still receive
the tick broadcast IPI, they are awoken from the idle state and their
decrementers are back in action.
As a result, it's possible that they manage to handle timers
before they go to sleep again.
Hence timers are being handled on the sleeping cpus, although the tick
broadcast interrupt handler, which is actually supposed to ensure that,
is never called today due to the wrong number of shift bits used while
parsing the tick broadcast IPI.
However we need to note that as a result of this discrepancy, timer
handling on the sleeping cpus may be unstable. This could be one of
the reasons we are observing some softlockups in the cpuidle wakeup path.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
PCI resets will attempt to take the device_lock for any device to be
reset. This is a problem if that lock is already held, for instance
in the device remove path. It's not sufficient to simply kill the
user process or skip the reset if called after .remove as a race could
result in the same deadlock. Instead, we handle all resets as "best
effort" using the PCI "try" reset interfaces. This prevents the user
from being able to induce a deadlock by triggering a reset.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
(cherry picked from commit 890ed578df82f5b7b5a874f9f2fa4f117305df5f)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
LTC-Bugzilla: #104951
|
|
When doing a function/slot/bus reset PCI grabs the device_lock for each
device to block things like suspend and driver probes, but call paths exist
where this lock may already be held. This creates an opportunity for
deadlock. For instance, vfio allows userspace to issue resets so long as
it owns the device(s). If a driver unbind .remove callback races with
userspace issuing a reset, we have a deadlock as userspace gets stuck
waiting on device_lock while another thread has device_lock and waits for
.remove to complete. To resolve this, we can make a version of the reset
interfaces which use trylock. With this, we can safely attempt a reset and
return error to userspace if there is contention.
[bhelgaas: the deadlock happens when A (userspace) has a file descriptor for
the device, and B waits in this path:
driver_detach
device_lock # take device_lock
__device_release_driver
pci_device_remove # pci_bus_type.remove
vfio_pci_remove # pci_driver .remove
vfio_del_group_dev
wait_event(vfio.release_q, !vfio_dev_present) # wait (holding device_lock)
Now B is stuck until A gives up the file descriptor. If A tries to acquire
device_lock for any reason, we deadlock because A is waiting for B to release
the lock, and B is waiting for A to release the file descriptor.]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
(cherry picked from commit 61cf16d8bd38c3dc52033ea75d5b1f8368514a17)
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
LTC-Bugzilla: #104951
|
|
When PCI_ERS_RESULT_CAN_RECOVER is returned from device drivers, the
EEH core should enable I/O and DMA for the affected PE. However,
enabling DMA was missed in eeh_handle_normal_event().
Besides, the frozen state of the affected PE should be cleared
after successful recovery, but we didn't do that.
The patch fixes both of the issues above, in response to
bug #105179.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
|
|
As pointed by Alexey, we're going to hit build failure without
exporting the functions when (CONFIG_VFIO_PCI == M). It should
be part of commit 9762b50 ("drivers/vfio/pci: Fix MSIx message
lost").
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
|
|
At present, if a PR guest on a POWER8 machine tries to access some
disabled functionality such as transactional memory, the result is
a facility-unavailable interrupt, which isn't handled in
kvmppc_handle_exit_pr(), resulting in a call to BUG(), crashing
the PR host kernel.
This adds code to handle the facility-unavailable interrupts and
give the guest an illegal instruction interrupt, instead of crashing
the PR host.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This provides basic support for the KVM_REG_PPC_ARCH_COMPAT register
in PR KVM. At present the value is sanity-checked when set, but
doesn't actually affect anything yet.
Implementing this makes it possible to use a qemu command-line
argument such as "-cpu host,compat=power7" on a POWER8 machine,
just as we would with HV KVM.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
The Power ISA states that an mtspr or mfspr to/from an unimplemented
SPR should be a no-op in privileged mode, rather than causing an
program interrupt (0x700 vector), with the exception of mtspr to SPR 0
and mfspr from SPRs 0, 4, 5 or 6.
Currently our SPR emulation code doesn't follow this rule. This
modifies the code in kvmppc_core_emulate_m[ft]spr_pr() to check
the PR bit in the MSR when we detect an unknown SPR number, and
only return EMULATE_FAIL (which results in a program interrupt)
if PR is set or the SPR number is one of the ones which are
specifically defined to cause a program interrupt.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
It's possible that the tick_broadcast_force_mask contains cpus which are not
in cpu_online_mask when a broadcast tick occurs. This could happen under the
following circumstance, assuming CPU1 is among the CPUs waiting for broadcast
and is the cpu being hotplugged out.
CPU0 CPU1
Run CPU_DOWN_PREPARE notifiers
Start stop_machine Gets woken up by IPI to run
stop_machine, sets itself in
tick_broadcast_force_mask if the
time of broadcast interrupt is around
the same time as this IPI.
Start stop_machine
set_cpu_online(cpu1, false)
End stop_machine End stop_machine
Broadcast interrupt
Finds that cpu1 in
tick_broadcast_force_mask is offline
and triggers the WARN_ON in
tick_handle_oneshot_broadcast()
Clears all broadcast masks
in CPU_DEAD stage.
While the hotplugged cpu clears its bit in the tick_broadcast_oneshot_mask
and tick_broadcast_pending mask during BROADCAST_EXIT, it *sets* its bit
in the tick_broadcast_force_mask if the broadcast interrupt is found to be
around the same time as the present time. Today we clear all the broadcast
masks and shutdown tick devices in the CPU_DEAD stage. But as shown above
the broadcast interrupt could occur before this stage is reached and the
WARN_ON() gets triggered when it is found that the tick_broadcast_force_mask
contains an offline cpu.
Please note that a scenario such as above will occur *only if the broadcast
interrupt is delayed under some circumstance*. Ideally the broadcast interrupt
in the above scenario should have occurred before we reach the irq_disabled
stage of stop_machine and should have seen a valid broadcast mask. But for
some reason that is yet to be understood it is getting delayed leading to the
above scenario.
Besides this, another point to notice is that for a small duration between
the CPU_DYING stage, where the hotplugged cpu clears its bit from the
cpu_online_mask, and the CPU_DEAD stage, where the broadcast_force_mask gets
cleared of the same, these two masks are out of sync with each other,
thus triggering the above scenario.
The temporary solution to this is to move the clearing of broadcast masks to
the CPU_DYING notification stage. The reason is, it is during this stage that
the hotplugged cpu clears itself from the cpu_online_mask and runs
notifications relevant to this stage including those to clear the broadcast masks
(with this patch).
All this, while the rest of the cpus are busy spinning in stop_machine to notice
this change. By the time this stage ends and all cpus resume work, the hotplugged
cpu would have cleared itself from the cpu_online_mask and the broadcast cpu mask
thus keeping them in sync with each other at such times when the rest of the cpus
can read these masks.
Since the above mentioned delay in the broadcast interrupt has not triggered
any soft lockups so far, we are assuming it's a non-fatal issue and have this
patch to prevent the warning from popping up in this case.
Suggested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@au1.ibm.com>
|
|
The change to increment _mapcount was added w.r.t. THP commit
3526741f0964c88bc2ce511e1078359052bf225b. It was later fixed
to handle the hugetlb case in 44518d2b32646e37b4b7a0813bbbe98dc21c7f8f.
Instead of backporting 44518, we can remove the _mapcount update, since
we don't support THP for the kvm host yet.
Fixes: bz# 108558
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
|
|
In the kdump scenario, the first kernel doesn't shut down PCI devices
and the kdump kernel cleans the PHB IODA table at early probe time.
That means the kdump kernel can't support PCI transactions piled
up by the first kernel. Otherwise, lots of EEH errors and frozen PEs
will be detected.
In order to avoid the EEH errors, the PHB is reset to drop all
PCI transactions from the first kernel. It looks good on P7, but needs
to be verified on P8.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The problem was initially reported by Wendy, who tried to pass through
an IPR adapter, which was connected to a PHB root port directly, to a KVM
based guest. When doing that, pci_reset_bridge_secondary_bus() was
called by the VFIO driver and linkDown was detected by the root port.
That caused all PEs to be frozen.
The patch fixes the issue by routing the reset for the secondary bus
of the root port to the underlying firmware. For that, one more weak function
pci_reset_secondary_bus() is introduced so that individual platforms
can override it and do a specific reset for the bridge's secondary bus.
Reported-by: Wendy Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Basically, we have 3 types of resets to fulfil PE reset: fundamental,
hot and PHB reset. For the latter 2 cases, we need the PCI bus reset hold
and settlement delays as specified by the PCI spec. The PowerNV and pSeries
platforms run on top of different firmware, and some of the
delays are already covered by the underlying firmware (PowerNV).
The patch unifies the delays to be done in the backend, instead of
the EEH core.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Resetting the root port involves more than resetting PCIe switch
ports, and resetting the root port should be done in firmware instead
of in the kernel itself. The problem was introduced by commit 5b2e198e
("powerpc/powernv: Rework EEH reset").
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
In pseries_eeh_get_state(), EEH_STATE_UNAVAILABLE is always
overwritten by EEH_STATE_NOT_SUPPORT because of a missing
"break" there. The patch fixes the issue.
Reported-by: Joe Perches <joe@perches.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Once a specific PE has been marked as EEH_PE_ISOLATED, it's in
the middle of recovery or removed permanently. We needn't report
the frozen PE again. Otherwise, we will have endless reporting
of the same frozen PE.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The issue was detected in a bit complicated test case where
we have multiple hierarchical PEs shown as following figure:
+-----------------+
| PE#3 p2p#0 |
| p2p#1 |
+-----------------+
|
+-----------------+
| PE#4 pdev#0 |
| pdev#1 |
+-----------------+
PE#4 (with 2 PCI devices) is the child of PE#3, which has 2 p2p
bridges. We accidentally hit a less-known scenario: PE#4 was removed
permanently from the system because of a permanent failure (e.g.
exceeding the max allowed failure count in the last hour), then we detected
EEH errors on PE#3 and tried to recover it. However, the eeh_dev instances
for pdev#0/1 were not detached from PE#4, which was still connected to
PE#3. All of that was because we rely on the count-based
pcibios_release_device(), which isn't reliable enough. When doing
recovery for PE#3, we still applied hotplug on PE#4 and pdev#0/1, which
were not valid any more. Eventually, we ran into a kernel crash.
The patch fixes the above issue from two aspects. For unplug, we simply
skip those permanently removed PEs, whose state is (EEH_PE_STATE_ISOLATED
&& !EEH_PE_STATE_RECOVERING) and whose frozen count is greater
than EEH_MAX_ALLOWED_FREEZES. For plug, we mark all permanently
removed EEH devices with EEH_DEV_REMOVED and return 0xFF's on reads
of their PCI config space so that the PCI core will omit them.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The patch introduces the boot argument "eeh=off" to disable EEH functionality.
Also, it creates /sys/kernel/debug/powerpc/eeh_enable to disable
or enable EEH functionality. By default, the functionality is
enabled.
For the PowerNV platform, we restore the conventional
mechanism of clearing frozen PEs during PCI config access if
EEH functionality is disabled. Conversely, if it is enabled, we
rely on EEH for error recovery.
The patch also fixes the issue that the case of disabled EEH
functionality wasn't covered in ioda_eeh_event(). Those
events driven by interrupt should be cleared to avoid endless
reporting.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
There're 2 EEH subsystem variables: eeh_subsystem_enabled and
eeh_probe_mode. We needn't maintain 2 variables and we can just
have one variable and introduce different flags. The patch also
introduces the additional flag EEH_FORCE_DISABLED, which will be used
to disable the EEH subsystem via the boot parameter ("eeh=off") in future.
Besides, the patch introduces the flag EEH_ENABLED, which can be
changed to disable or enable EEH functionality on the fly through
a debugfs entry in future.
With the patch applied, the criteria to check whether EEH
functionality is enabled becomes:
!EEH_FORCE_DISABLED && EEH_ENABLED : Enabled
Other cases : Disabled
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
When calling into eeh_gather_pci_data() on the pSeries platform, we
possibly don't have a pci_dev instance yet, but the eeh_dev is always
ready. So we use the cached capability from eeh_dev instead of pci_dev
for the log dump there. In order to keep things unified, we also cache
the PCI capability positions in eeh_dev for PowerNV as well.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The patch replaces printk(KERN_WARNING ...) with pr_warn() in the
function eeh_gather_pci_data().
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
We have suffered recursive frozen PEs a lot, caused
by IO accesses during PE reset. Ben came up with the good
idea to keep the PE frozen until recovery (BAR restore) is done.
With that, IO accesses during PE reset are dropped by hardware
and won't incur recursive frozen PEs any more.
The patch implements the idea. We don't clear the frozen state
until PE reset is done completely. During that period, the EEH
core expects an unfrozen state from the backend to keep going. So we
reuse the EEH_PE_RESET flag, which is set during PE
reset, to return the normal state from the backend. The side effect is
that we clear the frozen state twice (by PE reset and then
explicitly), but that's harmless.
We have some limitations on pHyp. pHyp doesn't allow enabling
IO or DMA for an unfrozen PE, so we don't enable them on unfrozen PEs
in eeh_pci_enable(). We have to enable IO before grabbing logs on
pHyp; otherwise, 0xFF's are always returned from PCI config space.
Also, we had a wrong return value from eeh_pci_enable() for the
EEH_OPT_THAW_DMA case. The patch fixes that too.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The EEH PowerNV backends need to use their own PCI config
accessors, as the normal ones could be blocked during PE reset.
The patch also removes the unnecessary parameter "hose" from the
function ioda_eeh_bridge_reset().
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
We've observed multiple PE reset failures because of PCI-CFG
access during that period. Potentially, some device drivers
can't support EEH very well and they can't put the device to
motionless state before PE reset. So those device drivers might
produce PCI-CFG accesses during PE reset. Also, we could have
PCI-CFG accesses from user space (e.g. "lspci"). Since access to
a frozen PE should return 0xFF's, we can block PCI-CFG access
during the period of PE reset so that we won't get recursive EEH
errors.
The patch adds flag EEH_PE_RESET, which is kept during PE reset.
The PowerNV/pSeries PCI-CFG accessors then use the flag to block
PCI-CFG accesses accordingly.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
When doing PE reset, EEH_PE_ISOLATED is cleared unconditionally.
However, we should only clear it if the PE reset has cleared the
frozen state successfully; otherwise, the flag should be kept.
The patch fixes the issue.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
For some fields (e.g. LEM, MMIO, DMA) in the PHB diag-data dump, it's
meaningless to print them merely because the corresponding mask
registers hold non-zero values, since the mask registers are always
non-zero. The patch only prints those fields if we have non-zero
values in the primary registers (e.g. LEM, MMIO, DMA status) so that
we can save a couple of lines. The patch also removes the unnecessary
blank line before "brdgCtl:" and the two leading spaces prefixing each
line, as Ben suggested.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Commit c7062d83fe7b ("powerpc/ppc64: Do not turn AIL (reloc-on
interrupts) too early") added code to set the AIL bit in the LPCR
without checking whether the kernel is running in hypervisor mode.
The result is that when the kernel is running as a guest (i.e.,
under PowerKVM or PowerVM), the processor takes a privileged
instruction interrupt at that point, causing a panic. The visible
result is that the kernel hangs after printing "returning from
prom_init".
This fixes it by checking for hypervisor mode being available
before setting LPCR. If we are not in hypervisor mode, we enable
relocation-on interrupts later in pSeries_setup_arch using the
H_SET_MODE hcall.
This fixes BZ 108728.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
When the guest cedes the vcpu, or the vcpu has no guest to
run, it naps. Clear the runlatch bit of the vcpu before
napping to indicate an idle cpu.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
The secondary threads in the core have their runlatch bits cleared since they
are offline. When the secondary threads are called in to start a guest their
runlatch bits need to be set to indicate that they are busy. The primary
thread has its runlatch bit set though, but there is no harm in setting this
bit once again. Hence set the runlatch bit for all threads before they start
a guest.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
Up until now we have been setting the runlatch bits for a busy CPU and
clearing it when a CPU enters idle state. The runlatch bit has thus
been consistent with the utilization of a CPU as long as the CPU is online.
However when a CPU is hotplugged out the runlatch bit is not cleared. It
needs to be cleared to indicate an unused CPU. OCC consumes the runlatch bit
to decide the utilization of a thread and ends up seeing the offline threads
as busy. Hence this patch has the runlatch bit cleared for an offline CPU
just before entering an idle state and sets it immediately after it exits
the idle state.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
The issue happens in a dual controller configuration. We got
sysfs warnings when rmmod'ing the ipr module.
enclosure_unregister() in drivers/misc/enclosure.c calls device_unregister()
for each component device; device_unregister() -> device_del() -> kobject_del()
-> sysfs_remove_dir(). In sysfs_remove_dir(), kobj->sd is set to NULL.
Then, for each component device, enclosure_component_release() ->
enclosure_remove_links() -> sysfs_remove_link() checks kobj->sd again, which
was already set to NULL during device_unregister(). So we see all these
sysfs WARNINGs.
sysfs: can not remove 'enclosure_device: P1-D1 2SS6', no directory
------------[ cut here ]------------
WARNING: at fs/sysfs/inode.c:325
Modules linked in: fuse loop dm_mod ses enclosure ipr(-) ipv6 ibmveth libata sg ext3 jbd mbcache sd_mod crc_t10dif crct10dif_common ibmvscsi scsi_transport_srp scsi_tgt scsi_dh_rdac scsi_dh_emc scsi_dh_hp_sw scsi_dh_alua scsi_dh scsi_mod
CPU: 0 PID: 4006 Comm: rmmod Not tainted 3.12.0-scsi-0.11-ppc64 #1
task: c0000000f769aba0 ti: c0000000f8f9c000 task.ti: c0000000f8f9c000
NIP: c0000000002b038c LR: c0000000002b0388 CTR: 0000000000000000
REGS: c0000000f8f9ee70 TRAP: 0700 Not tainted (3.12.0-scsi-0.11-ppc64)
MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI> CR: 28008444 XER: 20000000
SOFTE: 1
CFAR: c000000000736118
GPR00: c0000000002b0388 c0000000f8f9f0f0 c0000000010ed630 0000000000000047
GPR04: c000000001502628 c000000001513010 0000000000000689 652027656e636c6f
GPR08: 737572655f646576 c000000000ae2b7c 0000000000a20000 c000000000add630
GPR12: 0000000028008442 c000000007f20000 0000000000000000 0000000010146920
GPR16: 00000000100cb9d8 0000000010093088 0000000010146920 0000000000000000
GPR20: 0000000000000000 0000000010161900 00000000100ce458 0000000000000000
GPR24: 0000000010161940 0000000000000000 d0000000046ad440 0000000000000000
GPR28: c0000000f8f9f270 0000000000000000 c0000000fcb882c8 0000000000000000
NIP [c0000000002b038c] .sysfs_hash_and_remove+0xe4/0xf0
LR [c0000000002b0388] .sysfs_hash_and_remove+0xe0/0xf0
Call Trace:
[c0000000f8f9f0f0] [c0000000002b0388] .sysfs_hash_and_remove+0xe0/0xf0 (unreliable)
[c0000000f8f9f190] [c0000000002b4134] .sysfs_remove_link+0x24/0x60
[c0000000f8f9f200] [d000000004df037c] .enclosure_remove_links+0x64/0xa0 [enclosure]
[c0000000f8f9f2d0] [d000000004df0518] .enclosure_component_release+0x30/0x60 [enclosure]
[c0000000f8f9f350] [c000000000540068] .device_release+0x50/0xd8
[c0000000f8f9f3d0] [c0000000003b6f80] .kobject_cleanup+0xb8/0x230
[c0000000f8f9f460] [c00000000053f404] .put_device+0x1c/0x30
[c0000000f8f9f4d0] [d000000004df0db0] .enclosure_unregister+0xa0/0xe8 [enclosure]
[c0000000f8f9f560] [d000000004f90094] .ses_intf_remove_enclosure+0x8c/0xa8 [ses]
[c0000000f8f9f5f0] [c0000000005413ec] .device_del+0xf4/0x268
[c0000000f8f9f680] [c000000000541594] .device_unregister+0x34/0x88
[c0000000f8f9f700] [d000000001423d3c] .__scsi_remove_device+0x104/0x128 [scsi_mod]
[c0000000f8f9f780] [d00000000141eff8] .scsi_forget_host+0x70/0xa0 [scsi_mod]
[c0000000f8f9f800] [d000000001413dc0] .scsi_remove_host+0x88/0x178 [scsi_mod]
[c0000000f8f9f890] [d00000000469db5c] .ipr_remove+0x7c/0xf8 [ipr]
[c0000000f8f9f920] [c0000000003fe1f4] .pci_device_remove+0x64/0xf0
[c0000000f8f9f9b0] [c000000000544f10] .__device_release_driver+0xd0/0x158
[c0000000f8f9fa40] [c0000000005450d8] .driver_detach+0x140/0x148
[c0000000f8f9fae0] [c000000000543848] .bus_remove_driver+0xe0/0x188
[c0000000f8f9fb70] [c00000000054628c] .driver_unregister+0x3c/0x80
[c0000000f8f9fbf0] [c0000000003fe35c] .pci_unregister_driver+0x34/0xe8
[c0000000f8f9fc90] [d0000000046a5fb4] .ipr_exit+0x2c/0x44 [ipr]
[c0000000f8f9fd20] [c0000000001359dc] .SyS_delete_module+0x204/0x308
[c0000000f8f9fe30] [c000000000009f60] syscall_exit+0x0/0xa0
Instruction dump:
e8010010 eb81ffe0 7c0803a6 eba1ffe8 ebc1fff0 ebe1fff8 4e800020 3c62ff8a
7ca42b78 3863c388 48485d45 60000000 <0fe00000> 3860fffe 4bffff94 fba1ffe8
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
|
|
commit 789b5e0315284463617e106baad360cb9e8db3ac upstream.
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
register_cpu_notifier(&foobar_cpu_notifier);
put_online_cpus();
This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).
Interestingly, the raid5 code can actually prevent double initialization and
hence can use the following simplified form of callback registration:
register_cpu_notifier(&foobar_cpu_notifier);
get_online_cpus();
for_each_online_cpu(cpu)
init_cpu(cpu);
put_online_cpus();
A hotplug operation that occurs between registering the notifier and calling
get_online_cpus() won't disrupt anything, because the code takes care to
perform the memory allocations only once.
So reorganize the code in raid5 this way to fix the deadlock with callback
registration.
This fixes BZ 103213.
Cc: linux-raid@vger.kernel.org
Fixes: 36d1c6476be51101778882897b315bd928c8c7b5
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
[Srivatsa: Fixed the unregister_cpu_notifier() deadlock, added the
free_scratch_buffer() helper to condense code further and wrote the changelog.]
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
Commit 2775d6230 (md: Avoid deadlock in raid5_alloc_percpu) only partially
fixed the deadlock involving CPU hotplug notifiers. In particular, it fixed
the deadlock possibility in register_cpu_notifier(), but left the deadlock
in unregister_cpu_notifier() unfixed. So revert this commit so that we can
fix both the deadlocks properly, using the solution that was accepted
upstream.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
kvm_vfio_spapr_tce_release was spelled as ikvm_vfio_ispapr_tce_release
which caused compilation to break in case of CONFIG_KVM_VFIO=n. Fix it.
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
|
|
The global_invalidates() function contains a check that is intended
to tell whether we are currently executing in the context of a hypercall
issued by the guest. The reason is that the optimization of using a
local TLB invalidate instruction is only valid in that context. The
check was testing local_paca->kvm_hstate.kvm_vcore, which gets set
when entering the guest but no longer gets cleared when exiting the
guest. To fix this, we use the kvm_vcpu field instead, which does
get cleared when exiting the guest, by the kvmppc_release_hwthread()
calls inside kvmppc_run_core().
The effect of having the check wrong was that when kvmppc_do_h_remove()
got called from htab_write() on the destination machine during a
migration, it cleared the current cpu's bit in kvm->arch.need_tlb_flush.
This meant that when the guest started running in the destination VM,
it may miss out on doing a complete TLB flush, and therefore may end
up using stale TLB entries from a previous guest that used the same
LPID value.
This should make migration more reliable.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
The OPAL log is now accessed through sysfs at /sys/firmware/opal/msglog,
so remove the old and buggy debugfs file.
Signed-off-by: Joel Stanley <joel@jms.id.au>
|
|
Create a driver attribute named "cpuinfo_nominal_freq" which will in
turn create a read-only sysfs interface that will be used to export
the nominal frequency to the userspace. This will be necessary for
creating an optimal "performance" policy which should be running the
on-demand governor with "scaling_max_freq" to be set to the value
exported via "cpuinfo_max_freq" and "scaling_min_freq" to be set to
the nominal frequency exported via "cpuinfo_nominal_freq".
The patch caches the values of the max, min and nominal pstate ids and
nr_pstates queried from the DT during the initialization of the driver
so that they can be used in other places in the driver for
validation.
Also, it adds a helper method that returns the frequency corresponding to
a pstate id.
This has been backported from the version posted against mainline
which can be found here:
https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg76990.html
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
|
|
We had added the debug prints to confirm the idle state exit
by the cpus. This was mainly to test if fast sleep was working
fine. Now that we are confident about its functioning we
can get rid of these prints.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
|
|
This reworks the opal message log following upstream review. A bug was
fixed where wrapped logs were not read correctly, and locking was added
to reduce the impact of races between reading counters and the buffer
contents.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
In autogroup_create(), a tg is allocated and added to the task_groups
list. If CONFIG_RT_GROUP_SCHED is set, this tg is then modified while on
the list, without locking. This can race with someone walking the list,
like __enable_runtime() during CPU unplug, and result in a use-after-free
bug.
To fix this, move sched_online_group(), which adds the tg to the list,
to the end of the autogroup_create() function after the modification.
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1369411669-46971-2-git-send-email-gerald.schaefer@de.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
(cherry picked from commit 41261b6a832ea0e788627f6a8707854423f9ff49)
|
|
The firmware can notify us when new input data is available, so
let's make sure we wakeup the HVC thread in that case.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
opal_notifier_register() is missing a matching "unregister" variant
and should be exposed to modules.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Turn them on at the same time as we allow MSR_IR/DR in the paca
kernel MSR, ie, after the MMU has been setup enough to be able
to handle relocated access to the linear mapping.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
If we take an interrupt such as a trap caused by a BUG_ON before the
MMU has been set up, the interrupt handlers try to enable virtual mode
and cause a recursive crash, making the original problem very hard
to debug.
This fixes it by adjusting the "kernel_msr" value in the PACA so that
it only has MSR_IR and MSR_DR (translation for instruction and data)
set after the MMU has been initialized for the processor.
We may still not have a console yet but at least we don't get into
a recursive fault (and early debug console or memory dump via JTAG
of the kernel buffer *will* give us the proper error).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This call will not be understood by OPAL, and will cause it to add an error
to its log. Among other things, this is useful for testing the
behaviour of the log as it fills up.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
OPAL provides an in-memory circular buffer containing a message log
populated with various runtime messages produced by the firmware.
Provide a sysfs interface /sys/firmware/opal/messages for userspace to
view the messages.
Signed-off-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
We had a mix & match of flags used when creating legacy ports
depending on where we found them in the device-tree. Among others,
we were missing UPF_SKIP_TEST for some kinds of ISA ports, which is
a problem as quite a few UARTs out there don't support the loopback
test (such as a lot of BMCs).
Let's pick the set of flags used by the SoC code and generalize it
which means autoconf, no loopback test, irq maybe shared and fixed
port.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Helps debug funky firmware issues
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Commit b1022fbd293564de91596b8775340cf41ad5214c and subsequent ones
(in 3.10) introduced some preparatory changes for THP which consist
of trying to read the actual HPTE page size from the hash table to
perform the right variant of tlbie. However this has two issues:
- The hash entry can have been evicted and replaced by another
one with a different page size. This can in turn cause us to use
an impossible combination of psize and actual_psize, in turn
causing tlbie to be called with an invalid LP bit combination
causing a HW checkstop
- The whole business is unnecessary as in 3.10 we don't have THP
and thus always have psize == actual_psize
When THP was actually enabled in 3.11, we discovered that this wasn't
going to work and changed the code significantly to pass the proper
actual_psize from the upper layers rather than trying to deduce it
from the HPTE.
However, we didn't "fix" 3.10 as we didn't realize that the bug
introduced an exposure without THP being enabled.
If a user page was hashed as a 64k page, and later got evicted from
the hash and replaced with a 4k hash entry (due to a segment being
demoted to 4k, for example by subpage protection or because it's
an IO page), we could get into a situation where we tried to
do a tlbie with a psize of 64k and actual_psize of 4k which is
deadly.
This is a 3.10-only fix for this situation which essentially removes
the actual_psize business from the normal updatepp and invalidate
path in hash_native_64.c since we know on 3.10 that the psize coming
from the upper levels is always correct (no THP).
As such it's a partial revert of b1022fbd293564de91596b8775340cf41ad5214c
(we don't touch the bolted path etc... those should be fine and we
want to minimize churn).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Fill in asic family specific versions rather than
using the generic version. This lets us handle asic
specific differences more easily. In this case, we
disable sw swapping of the rptr writeback value on
r6xx+ since the hw does it for us. Fixes bogus
rptr readback on BE systems.
v2: remove missed cpu_to_le32(), add comments
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(backported from commit ea31bf697d27270188a93cd78cf9de4bc968aca3)
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
LTC-Bugzilla: #99530
|
|
Now that we have callbacks for [rw]ptr handling we can
remove the special handling for the DMA rings and use
the callbacks instead.
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(backported from commit 2e1e6dad6a6d437e4c40611fdcc4e6cd9e2f969e)
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
LTC-Bugzilla: #99530
|
|
The hardware just doesn't support this correctly.
Disable it before we accidentally write anywhere we shouldn't.
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(backported from commit 02c9f7fa4e7230fc4ae8bf26f64e45aa76011f9c)
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
LTC-Bugzilla: #99530
|
|
Give the ring functions a separate structure and let the asic
structure point to the ring specific functions. This simplifies
the code and allows us to make changes at only one point.
No change in functionality.
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(backported from commit 76a0df859defc53e6cb61f698a48ac7da92c8d84)
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
LTC-Bugzilla: #99530
|
|
Add callbacks to the radeon_asic struct to handle
rptr/wptr fetches and wptr updates.
We currently use one version for all rings, but this
allows us to override with ring-specific versions.
Needed for compute rings on CIK.
v2: update as per Christian's comments
v3: fix some rebase cruft
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit f93bdefe6269067afc85688d45c646cde350e0d8)
Signed-off-by: Kleber Sacilotto de Souza <klebers@linux.vnet.ibm.com>
LTC-Bugzilla: #99530
|
|
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
|
|
While checking the powersaving mode in the machine check handler at 0x200, we
clobber the CFAR register. Fix it by saving and restoring it across the beq/bgt.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
|
|
The branch target should be the function address, not the address of the
func_descr_t. So use ppc_function_entry() to generate the right target address.
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Fix up the 'sysfs' file duplication by passing the initialized
char array to the strncpy() function, as the result is not %NUL-terminated
if the source exceeds 'copy_length' bytes.
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
|
|
Add the appropriate definition and table entry for new hardware support.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch adds formatting error overlay 0x21 to improve debug capabilities.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
There is no need to call pci_disable_msi() or pci_disable_msix()
in case the call to pci_enable_msi() or pci_enable_msix() failed.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Acked-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
If, when the ipr driver loads, the adapter is in an EEH error state,
it will currently oops and not be able to recover, as it attempts
to access memory that has not yet been allocated. We've seen this
occur in some kexec scenarios. The following patch fixes the oops
and also allows the driver to recover from these probe time EEH errors.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Add the appropriate definition and table entry for new hardware support.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch removes the extended delay bit on GSCSI read/write ops;
performance will be significantly better.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Acked-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
|
|
Add the appropriate definitions and table entries for new adapter support.
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
|
|
Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
Acked-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
|
|
The 'ctl' field of the 'struct ata_taskfile' is not really dual purpose, i.e.
it is not intended for storing the alternate status register (which is mapped
at the same address in the legacy IDE controllers) in the qc_fill_rtf() method.
No other 'libata' driver except 'drivers/scsi/ipr.c' stores the alternate status
register's value in the 'ctl' field of 'qc->result_tf', hence this driver should
not do this as well...
Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
|
|
Currently we save the host PMU configuration, counter values, etc.,
when entering a guest, and restore it on return from the guest.
(We have to do this because the guest has control of the PMU while
it is executing.) However, we missed saving/restoring the SIAR and
SDAR registers, as well as the registers which are new on POWER8,
namely SIER and MMCR2.
This adds code to save the values of these registers when entering
the guest and restore them on exit. This also works around the bug
where setting PMAE with a counter already negative doesn't generate
an interrupt. This was already worked around for the guest PMU state
in an earlier commit, and is worked around for the host PMU state here.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This adds workarounds for two hardware bugs in the POWER8 performance
monitor unit (PMU), both related to interrupt generation. The effect
of these bugs is that PMU interrupts can get lost, leading to tools
such as perf reporting fewer counts and samples than they should.
The first bug relates to the PMAO (perf. mon. alert occurred) bit in
MMCR0; setting it should cause an interrupt, but doesn't. The other
bug relates to the PMAE (perf. mon. alert enable) bit in MMCR0.
Setting PMAE when a counter is negative and counter negative
conditions are enabled to cause alerts should cause an alert, but
doesn't.
The workaround for the first bug is to create conditions where a
counter will overflow, whenever we are about to restore a MMCR0
value that has PMAO set (and PMAO_SYNC clear). The workaround for
the second bug is to freeze all counters using MMCR2 before reading
MMCR0.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Somehow, the code that restores the guest transactional memory state
got put in the middle of the code sequence that restores the guest
PMU (performance monitor unit) state. This results in corruption of
the value written to MMCR0 if the guest is in transactional state.
This fixes it by moving the TM state-restoring code to come just before
the PMU state-restoring code. This comes out in the patch as the
first part of the PMU state-restoring code being moved down to just
before the second part of the PMU state-restoring code.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Some power8 revisions have a hardware bug where we can lose a PMU
exception, this commit adds a workaround to detect the bad condition and
rectify the situation.
See the comment in the commit for a full description.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Some power8 revisions have a hardware bug where we can lose a
Performance Monitor (PMU) exception under certain circumstances.
We will be adding a workaround for this case, see the next commit for
details. The observed behaviour is that writing PMAO doesn't cause an
exception as we would expect, hence the name of the feature.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This patch converts Event TRB's 3rd field, which has type le32, to CPU
byteorder before using it to retrieve the Slot ID with TRB_TO_SLOT_ID macro.
This bug was found using sparse.
Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
[Backport of 7e76ad431545d013911ddc744843118b43d01e89]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch converts TRB_CYCLE to le32 to update correctly the Cycle Bit in
'control' field of the link TRB.
This bug was found using sparse.
Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
[Backport of 587194873820a4a1b2eda260ac851394095afd77]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
In a kexec scenario, we failed to load the mlx4 driver in the
second kernel because the ownership bit was held by the first
kernel and never released correctly.
The patch adds a shutdown() interface so that the ownership can
be released correctly in the first kernel. It also helps avoid
EEH errors during the boot stage of the second kernel caused
by undesired traffic, which can't be handled by hardware during
that stage on the Power platform.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Tested-by: Wei Yang <weiyang@linux.vnet.ibm.com>
|
|
The problem is specific to the case of a BIST issued to an IPR adapter
on the guest side. The IPR driver does something like this:
pci_save_state(), BIST reset and then pci_restore_state(). We lose
everything in the MSIx table with the BIST reset and never have a chance
to restore the MSIx table in that case.
pci_restore_msix_state(), called by pci_restore_state(), masks all MSIx
vectors via the MSIx capability, restores the MSIx table, and then unmasks
all MSIx vectors. We force the host kernel to restore the MSIx
vectors in the step of unmasking all MSIx vectors to fix the issue.
The patch is under review in the Linux community at the moment. It would be
better to have acks from Ben and Alexey if we really want this in Frobisher.
It responds to bug#103589.
Reported-by: Wen Xiong <wenxiong@us.ibm.com>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
|
|
We may detect EEH errors during reboot, particularly in the kexec
path, but it's impossible for device drivers and the EEH core to handle
or recover from them properly.
The patch registers a reboot notifier for EEH and disables the EEH
subsystem during reboot. That means the EEH errors are going to be
cleared by hardware reset or by the second kernel during the early
stage of PCI probe.
It's backporting commit 66f9af83e56bfa12964d251df9d60fb571579913
("powerpc/eeh: Disable EEH on reboot") from 3.14 upstream for
bug#103590
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The patch cleans up the variable eeh_subsystem_enabled so that we needn't
refer to the variable directly from outside. Instead, we use the
functions eeh_enabled() and eeh_set_enable() to operate on the variable.
It's backporting 2ec5a0adf60c23bb6b0a95d3b96a8c1ff1e1aa5a ("powerpc/eeh:
Cleanup on eeh_subsystem_enabled") from 3.14 upstream for bug#103590
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
When doing reset in order to recover the affected PE, we issue
hot reset on PE primary bus if it's not root bus. Otherwise, we
issue hot or fundamental reset on root port or PHB accordingly.
For the latter case, we didn't cover the situation where the PE only
includes the root port, which potentially causes a kernel crash upon
an EEH error to the PE.
The patch reworks the logic of EEH reset to improve the code
readability and also avoid the kernel crash.
It's backporting commit 5b2e198e50f6ba57081586b853163ea1bb95f1a8
("powerpc/powernv: Rework EEH reset") from 3.14 upstream for
bug#103590
Cc: stable@vger.kernel.org
Reported-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
A malicious guest can register an IOMMU in KVM while a TCE request is
being passed from real to virtual mode. If vcpu->arch.tce_rm_fail
was previously used and not cleared because of a missing LIOBN entry in KVM,
this may cause an unwanted put_page() in the virtual mode handler.
This moves @tce_rm_fail earlier to avoid using the incorrect tce_rm_fail
flag value.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
The PCI core has function pci_reset_function() to do reset on the
specified PCI device. Before the reset starts, the state of the PCI
device is saved and it is restored after reset. The real reset work
could be routed to pcibios_set_pcie_reset_state() by quirks. However,
the PCI bus or PCI device isn't settled down fully for restore (PCI
config and MMIO for MSIx table) after reset and it would introduce
unnecessary frozen PE. Eventually, we're stopped from passing through
IPR adapter from host to KVM-based guest.
The patch adds a delay in pcibios_set_pcie_reset_state() so that the
PCI bus/device can settle down fully before restoring PCI device
states. It's part of the fixes regarding bug#103297 and bug#103589.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
|
|
In periodic mode we remove offline cpus from the broadcast propagation
mask. In oneshot mode we fail to do so. This was not a problem so far,
but the recent changes to the broadcast propagation introduced a
constellation which can result in a NULL pointer dereference.
What happens is:
CPU0 CPU1
idle()
arch_idle()
tick_broadcast_oneshot_control(OFF);
set cpu1 in tick_broadcast_force_mask
if (cpu_offline())
arch_cpu_dead()
cpu_dead_cleanup(cpu1)
cpu1 tickdevice pointer = NULL
broadcast interrupt
dereference cpu1 tickdevice pointer -> OOPS
We dereference the pointer because cpu1 is still set in
tick_broadcast_force_mask and tick_do_broadcast() expects a valid
cpumask and therefore lacks any further checks.
Remove the cpu from the tick_broadcast_force_mask before we set the
tick device pointer to NULL. Also add a sanity check to the oneshot
broadcast function, so we can detect such issues w/o crashing the
machine.
Reported-by: Prarit Bhargava <prarit@redhat.com>
Cc: athorlton@sgi.com
Cc: CAI Qian <caiqian@redhat.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1306261303260.4013@ionos.tec.linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit c9b5a266b103af873abb9ac03bc3d067702c8f4b)
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
|
|
Fast sleep can currently be enabled only after writing a value greater
than 1 into the proc interface /proc/sys/kernel/powersave-nap. Remove this
constraint, now that we have a stable framework to support fast sleep, so
that it is enabled by default at boot.
However the same proc interface is also used to convey if deep idle states
beyond snooze can be entered into or not. Hence retain the check on
powersave-nap in fast sleep to verify if this is the case.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
|
|
Add a configuration file to use when building the skiroot
(Sapphire bootloader) kernel.
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
As Ben suggested, the patch prints PHB diag-data with multiple
fields in one line and omits the line if the fields of that
line are all zero.
With the patch applied, the PHB3 diag-data dump looks like:
PHB3 PHB#3 Diag-data (Version: 1)
brdgCtl: 00000002
RootSts: 0000000f 00400000 b0830008 00100147 00002000
nFir: 0000000000000000 0030006e00000000 0000000000000000
PhbSts: 0000001c00000000 0000000000000000
Lem: 0000000000100000 42498e327f502eae 0000000000000000
InAErr: 8000000000000000 8000000000000000 0402030000000000 \
0000000000000000
PE[ 8] A/B: 8480002b00000000 8000000000000000
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The PHB diag-data is useful to help locating the root cause for
frozen PE or fenced PHB. However, EEH core enables IO path by clearing
part of HW registers before collecting it and eventually we got broken
PHB diag-data.
The patch intends to fix it by dumping the PHB diag-data immediately
when frozen/fenced state on PE or PHB is detected for the first time
in eeh_ops::get_state() or next_error() backend.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The flag PNV_EEH_STATE_ENABLED is put into pnv_phb::eeh_state,
which is protected by CONFIG_EEH. We don't need that. Instead, we
can have pnv_phb::flags and maintain all flags there, which is
the purpose of the patch. The patch also renames PNV_EEH_STATE_ENABLED
to PNV_PHB_FLAG_EEH.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The PHB state PNV_EEH_STATE_REMOVED maintained in pnv_phb isn't
so useful any more and duplicates EEH_PE_ISOLATED. The
patch replaces PNV_EEH_STATE_REMOVED with EEH_PE_ISOLATED.
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The PE state (for the eeh_pe instance) EEH_PE_PHB_DEAD is a duplicate of
EEH_PE_ISOLATED. Originally, those PHBs (PHB PEs) with EEH_PE_PHB_DEAD
would be removed from the system. However, it's safe to replace
that with EEH_PE_ISOLATED.
The patch also clears EEH_PE_RECOVERING after a fenced PHB has been handled,
either failure or success. It makes the PHB PE state consistent with:
PHB functions normally NONE
PHB has been removed EEH_PE_ISOLATED
PHB fenced, recovery in progress EEH_PE_ISOLATED | RECOVERING
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
Signed-off-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Clean up the code: remove an unnecessary enumeration, consolidate the
fragmented data structures and simplify some conditional checks in node
traversal in __init code.
This also fixes a sysfs file duplication bug.
Signed-off-by: Neelesh Gupta <neelegup@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This fixes memory corruption which happens when VFIO is used with
PR KVM.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
Firmware update on PowerNV platform takes several minutes. During
this time one CPU is stuck in FW and the kernel complains about "soft
lockups".
This patch returns all secondary CPUs to firmware before starting
firmware update process.
[ Reworked a bit and cleaned up -- BenH ]
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Cherry pick 3b3f89ac6614d6bc2e2edb32e49d4906d931c795, implementing the
error log reading code we're pushing upstream.
This changes the userspace interface for reading and acknowledging
error logs, so userspace code will have to change if it relied on the
old way.
Based on a patch by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
This patch adds support to read error logs from OPAL and export
them to userspace through a sysfs interface.
We export each log entry as a directory in /sys/firmware/opal/elog/
Currently, OPAL will buffer up to 128 error log records, we don't
need to have any knowledge of this limit on the Linux side as that
is actually largely transparent to us.
Each error log entry has the following files: id, type, acknowledge, raw.
Currently we just export the raw binary error log in the 'raw' attribute.
In a future patch, we may parse more of the error log to make it a bit
easier for userspace (e.g. to be able to display a brief summary in
petitboot without having to have a full parser).
If we have >128 logs from OPAL, we'll only be notified of 128 until
userspace starts acknowledging them. This limitation may be lifted in
the future and with this patch, that should "just work" from the linux side.
A userspace daemon should:
- wait for error log entries using normal mechanisms (we announce creation)
- read error log entry
- save error log entry safely to disk
- acknowledge the error log entry
- rinse, repeat.
On the Linux side, we read the error log when we're notified of it. This
possibly isn't ideal as it would be better to only read them on-demand.
However, this doesn't really work with the current OPAL interface, so we
read the error log immediately when notified at the moment.
I've tested this pretty extensively and am rather confident that the
linux side of things works rather well. There is currently an issue with
the service processor side of things for >128 error logs though.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Conflicts:
arch/powerpc/include/asm/opal.h
arch/powerpc/platforms/powernv/Makefile
arch/powerpc/platforms/powernv/opal-elog.c
|
|
This patch makes the sysfs interface match that of what's pushed upstream.
changes in kernel:
- fetch dump on-demand
- directory per dump
- in sysfs rather than debugfs
Userspace changes needed
- read from sysfs rather than debugfs.
This enables support for userspace to fetch and initiate FSP and
Platform dumps from the service processor (via firmware) through sysfs.
Based on original patch from Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Flow:
- We register for OPAL notification events.
- OPAL sends new dump available notification.
- We make information on dump available via sysfs
- Userspace requests dump contents
- We retrieve the dump via OPAL interface
- User copies the dump data
- userspace sends ack for dump
- We send ACK to OPAL.
sysfs files:
- We add the /sys/firmware/opal/dump directory
- echoing 1 (well, anything, but in future we may support
different dump types) to /sys/firmware/opal/dump/initiate_dump
will initiate a dump.
- Each dump that we've been notified of gets a directory
in /sys/firmware/opal/dump/ with a name of the dump type and ID (in hex,
as this is what's used elsewhere to identify the dump).
- Each dump has files: id, type, dump and acknowledge
dump is binary and is the dump itself.
echoing 'ack' to acknowledge (currently any string will do) will
acknowledge the dump and it will soon after disappear from sysfs.
OPAL APIs:
- opal_dump_init()
- opal_dump_info()
- opal_dump_read()
- opal_dump_ack()
- opal_dump_resend_notification()
Currently we are only ever notified for one dump at a time (until
the user explicitly acks the current dump, then we get a notification
of the next dump), but this kernel code should "just work" when OPAL
starts notifying us of all the dumps present.
Changes since v2:
- fix bug where we would free the dump buffer after userspace read it,
refetching if needed. Refetching doesn't currently work, so we must
keep the dump around for subsequent reads.
Changes since v1:
- Add support for getting dump type from OPAL through new OPAL call
(falling back to old OPAL_DUMP_INFO call if OPAL_DUMP_INFO2 isn't
supported)
- use dump type in directory name for dump
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Conflicts:
arch/powerpc/include/asm/opal.h
arch/powerpc/platforms/powernv/Makefile
arch/powerpc/platforms/powernv/opal-dump.c
arch/powerpc/platforms/powernv/opal-wrappers.S
arch/powerpc/platforms/powernv/opal.c
|
|
In copy_oldmem_page, the current check, which uses max_pfn and min_low_pfn to
decide if the page is backed or not, is not valid when the memory layout is
not continuous.
This happens when running as a QEMU/KVM guest, where RTAS is mapped higher
in the memory. In that case max_pfn points to the end of RTAS, and a hole
between the end of the kdump kernel and RTAS is not backed by PTEs. As a
consequence, the kdump kernel is crashing in copy_oldmem_page when accessing
in a direct way the pages in that hole.
This fix relies on the memblock's service memblock_is_region_memory to
check if the read page is part or not of the directly accessible memory.
This is a backport of upstream patch
https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-February/115569.html
This fixes LTC BUG #104729
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
|
|
Without a shutdown handler, T4 cards behave very badly after a kexec.
Some firmware calls return errors indicating allocation failures, for
example. This is probably because those resources were not released by
a BYE message to the firmware.
Using the remove handler guarantees we will use a well-tested path.
With this patch applied, I managed to use kexec multiple times, and
probe and iSCSI login worked every time.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
LTC-Bugzilla: #103241
(cherry picked from commit 687d705c031916b83953b714917b04d899e23cf5)
|
|
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
|
|
https://bugzilla.linux.ibm.com/show_bug.cgi?id=104249
https://bugzilla.linux.ibm.com/show_bug.cgi?id=104444
Signed-off-by: Wang Sen <wangsen@linux.vnet.ibm.com>
|
|
On p8 systems with the relocation-on-exception feature enabled, we are seeing
the kdump kernel hang at interrupt vector 0xc*4400. The reason is, with this
feature enabled, exceptions are raised with the MMU (IR=DR=1) ON at the
default offset of 0xc*4000. Since the exception is raised in virtual mode, it
requires the vector region to be executable, without which it fails to
fetch and execute instruction at 0xc*4xxx. For default kernel since kernel
is loaded at real 0, the htab mappings sets the entire kernel text region
executable. But for relocatable kernel (e.g. kdump case) we only copy
interrupt vectors down to real 0 and never mark that region as
executable, because on p7 and below we always take exceptions in real mode.
This patch fixes this issue by marking htab mapping range as executable
that overlaps with the interrupt vector region for relocatable kernel.
Thanks to Ben who helped me to debug this issue and find the root cause.
This is at least part of the fix for kdump failures that we are seeing
in bug 103693.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
(cherry picked from commit 429d2e8342954d337abe370d957e78291032d867)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Disable relocation on exceptions while going down, even in the kdump case.
This is because we are about to clear the htab mappings while kexec-ing into
the kdump kernel and we may run into issues if we still have AIL ON.
This is at least part of the fix for kdump failures that we are seeing
in bug 103693.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
(cherry picked from commit 3ec8b78fcc5aa7745026d8d85a4e9ab52c922765)
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This fixes a bug where we would get two events from OPAL with DUMP_AVAIL
set (which is valid for OPAL to do) and in the second run of extract_dump()
we would fail to free the memory previously allocated for the dump
(leaking ~6MB+) as well as on the second dump_read_data() call OPAL
would not retrieve the dump, leaving us with a dump in linux that was
the correct size but all zeros.
Changes since v1: fixed typo
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
LTC-Bugzilla: #104211
|
|
Commit 082fee36bd2c ("KVM: PPC: Book3S HV: Make physical thread 0 do
the MMU switching") reordered the guest entry/exit code so that most
of the guest register save/restore code happened in guest MMU context.
A side effect of that is that the timebase still contains the guest
timebase value at the point where we compute and use vcpu->arch.dec_expires,
and therefore that is now a guest timebase value rather than a host
timebase value. That in turn means that the timeouts computed in
kvmppc_set_timer() are wrong if the timebase offset for the guest is
non-zero. The consequence of that is things such as "sleep 1" in a
guest after migration may sleep for much longer than they should.
This fixes the problem by converting between guest and host timebase
values as necessary, by adding or subtracting the timebase offset.
This also fixes an incorrect comment.
This is part of the fix for many of the migration-related bug reports.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
In the kdump kernel we see a hang during subcore_init() at
unsplit_core()->wait_for_sync_step(). In the kdump kernel we always boot with
maxcpus=1 and all other cpus are waiting inside OPAL; hence, with 1 online
cpu, the master thread keeps waiting on secondary threads to set split_state
indefinitely. This is equally true for all cases where max_cpus is not aligned
with threads_per_core. This patch fixes the issue by disabling the
core split/unsplit feature if max_cpus is not aligned with threads_per_core.
This also fixes the kdump hang issue.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
|
|
Signed-off-by: Crístian Viana <vianac@linux.vnet.ibm.com>
|
|
This fixes one of the corner cases which produced a wrong backtrace
from put_page().
BZ: 103055
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
The icp_native_flush_interrupt() function is supposed to clear a pending
interrupt, like local_irq_enable(); local_irq_disable() would, but
without calling generic code. Unfortunately it missed clearing
the "IPI pending" flag in the PACA (local_paca->kvm_hstate.host_ipi).
The effect of this flag being set is that secondary CPU threads won't
go into the KVM guest, leading to messages like:
kvmppc_wait_for_nap timeout 0 1
when a KVM HV guest is run. This fixes it by adding a call to
kvmppc_set_host_ipi to clear the flag.
This fixes BZ 103513.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Commit 595e4f7e697e ("KVM: PPC: Book3S HV: Use load/store_fp_state
functions in HV guest entry/exit") changed the register usage in
kvmppc_save_fp() and kvmppc_load_fp() but omitted changing the
instructions that load and save VRSAVE. The result is that the
VRSAVE value was loaded from a constant address, and saved to a
location past the end of the vcpu struct, causing host kernel
memory corruption and various kinds of host kernel crashes.
This fixes the problem by using register r31, which contains the
vcpu pointer, instead of r3 and r4.
This should help resolve several bugzillas involving guest or host
crashes and hangs, including 98456, 102775, 103534, 100504, and
possibly others.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
register_cpu_notifier() can deadlock if called inside a
get/put_online_cpus block. To avoid this, move the call to
register_cpu_notifier before the get_online_cpus().
[paulus@samba.org - renamed alloc_xxx to alloc_percpu_areas, fixed
compile errors, made up patch description]
This fixes BZ 103213.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
change log:
PPC: KVM: fix to compile without VFIO
vfio: fix in-kernel and ioctl handlers
Fix a bug where asking for a POWER8 guest on a POWER7 system doesn't fail, but should
Fix and performance improvements for nested virtualization
LTC BZ 101114 CPU Build0.6: Host Cpu Offline/online leads to instruction dump and further cpu online/offline functions are not
PowerKVM Build 8 host platform support
Fix problems reported by the kernel RCU checking machinery and may help fix the memory corruption issues we have been seeing
LTC BZ 101123 Unable to bring up LE guest using libvirt/virsh
Fixes a bug with not resetting page struct pointer which caused bugs in calling code.
Fix one of the corner cases when the realmode handler fails to handle T_PUT_TCE_INDIRECT call and passes it further to the vir
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
|
|
The existing handler assumes that the first failed TCE entry's host
physical address is saved in the tce_tmp_hpas cache, but it is not, so
the virtmode handler has to read it from the TCE list again, which is
what this patch does.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
to config-powerpc64 and config-powerpc64p7
Signed-off-by: Eli Qiao <taget@linux.vnet.ibm.com>
|
|
The code in remove_cache_dir() is supposed to remove the "cache"
subdirectory from the sysfs directory for a CPU when that CPU is
being offlined. It tries to do this by calling kobject_put() on
the kobject for the subdirectory. However, the subdirectory only
gets removed once the last reference goes away, and the reference
being put here may well not be the last reference. That means
that the "cache" subdirectory may still exist when the offlining
operation has finished. If the same CPU subsequently gets onlined,
the code tries to add a new "cache" subdirectory. If the old
subdirectory has not yet been removed, we get a WARN_ON in the
sysfs code, with stack trace, and an error message printed on the
console. Further, we ultimately end up with an online cpu with no
"cache" subdirectory.
This fixes it by doing an explicit kobject_del() at the point where
we want the subdirectory to go away. kobject_del() removes the sysfs
directory even though the object still exists in memory. The object
will get freed at some point in the future. A subsequent onlining
operation can create a new sysfs directory, even if the old object
still exists in memory, without causing any problems.
This fixes BZ 101114.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This fixes a bug where the page struct pointer was not reset,
which caused bugs in calling code.
Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
This does for PR KVM what c9438092cae4 ("KVM: PPC: Book3S HV: Take SRCU
read lock around kvm_read_guest() call") did for HV KVM, that is,
eliminate a "suspicious rcu_dereference_check() usage!" warning by
taking the SRCU lock around the call to kvmppc_rtas_hcall().
It also fixes a return of RESUME_HOST to return EMULATE_FAIL instead,
since kvmppc_h_pr() is supposed to return EMULATE_* values.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Now that we have the vcpu floating-point and vector state stored in
the same type of struct as the main kernel uses, we can load that
state directly from the vcpu struct instead of having extra copies
to/from the thread_struct. Similarly, when the guest state needs to
be saved, we can have it saved directly to the vcpu struct by
setting the current->thread.fp_save_area and current->thread.vr_save_area
pointers. That also means that we don't need to back up and restore
userspace's FP/vector state. This all makes the code simpler and
faster.
Note that it's not necessary to save or modify current->thread.fpexc_mode,
since nothing in KVM uses or is affected by its value. Nor is it
necessary to touch used_vr or used_vsr.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
This fixes missing read/write TCE bits in VFIO map/unmap ioctls.
This fixes the real mode handler to switch to the virtual mode if
pte does not have "write" AND "dirty" bits set.
This fixes get_user_pages_fast() call in the virtual mode handler
to use correct write flag (used to be 0 always).
This adds a lock around a kvm_memory_slot struct use.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
(cherry picked from commit 754177ee49cd27c9380e7bb9c0de6f8488197ca3)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
This removes the code that handles the H_SET_MODE_RESOURCE_LE and
H_SET_MODE_RESOURCE_ADDR_TRANS_MODE subfunctions of the H_SET_MODE
hypercall from the kernel. Instead we now return H_TOO_HARD which
causes the hypercall to be sent up to userspace to be handled there.
In addition we now also send any other subfunction which we don't
recognize to userspace.
The reason for doing these two subfunctions in userspace is that they
need to modify LPCR across all vcpus of the guest. Modifying LPCR in
the kernel like this introduces a race between the kernel's
modification and any modification that userspace might be doing on
another vcpu. Therefore it's better to let userspace do all the
modifications, so it can do any necessary synchronization itself.
This also adds code to make sure that the MSR_LE bit in intr_msr
(the MSR value we set when synthesizing an interrupt for the guest)
is in sync with the ILE bit in the virtual core's LPCR value. This
is necessary for implementing the LE subfunction of H_SET_MODE in
userspace.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
The load_up_fpu and load_up_altivec functions were never intended to
be called from C, and do things like modifying the MSR value in their
callers' stack frames, which are assumed to be interrupt frames. In
addition, on 32-bit Book S they require the MMU to be off.
This makes KVM use the new load_fp_state() and load_vr_state() functions
instead of load_up_fpu/altivec. This means we can remove the assembler
glue in book3s_rmhandlers.S, and potentially fixes a bug on Book E,
where load_up_fpu was called directly from C.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
(cherry picked from commit 6a87e5da59bf1d1a4186bf27ad8aa5dc3b03dd63)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
With HV KVM, some high-frequency hypercalls such as H_ENTER are handled
in real mode, and need to access the memslots array for the guest.
Accessing the memslots array is safe, because we hold the SRCU read
lock for the whole time that a guest vcpu is running. However, the
checks that kvm_memslots() does when lockdep is enabled are potentially
unsafe in real mode, when only the linear mapping is available.
Furthermore, kvm_memslots() can be called from a secondary CPU thread,
which is an offline CPU from the point of view of the host kernel,
and is not running the task which holds the SRCU read lock.
To avoid false positives in the checks in kvm_memslots(), and to avoid
possible side effects from doing the checks in real mode, this replaces
kvm_memslots() with kvm_memslots_raw() in all the places that execute
in real mode. kvm_memslots_raw() is a new function that is like
kvm_memslots() but uses rcu_dereference_raw_notrace() instead of
rcu_dereference_check().
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Since the guest can read the machine's PVR (Processor Version Register)
directly and see the real value, we should disallow userspace from
setting any value for the guest's PVR other than the real host value.
Therefore this makes kvm_arch_vcpu_set_sregs_hv() check the supplied
PVR value and return an error if it is different from the host value,
which has been put into vcpu->arch.pvr at vcpu creation time.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
On PowerNV platforms, when a CPU is offline, we put it into nap mode.
It's possible that the CPU wakes up from nap mode while it is still
offline due to a stray IPI. A misdirected device interrupt could also
potentially cause it to wake up. In that circumstance, we need to clear
the interrupt so that the CPU can go back to nap mode.
In the past the clearing of the interrupt was accomplished by briefly
enabling interrupts and allowing the normal interrupt handling code
(do_IRQ() etc.) to handle the interrupt. This has the problem that
this code calls irq_enter() and irq_exit(), which call functions such
as account_system_vtime() which use RCU internally. Use of RCU is not
permitted on offline CPUs and will trigger errors if RCU checking is
enabled.
To avoid calling into any generic code which might use RCU, we adopt
a different method of clearing interrupts on offline CPUs. Since we
are on the PowerNV platform, we know that the system interrupt
controller is a XICS being driven directly (i.e. not via hcalls) by
the kernel. Hence this adds a new icp_native_flush_interrupt()
function to the native-mode XICS driver and arranges to call that
when an offline CPU is woken from nap. This new function reads the
interrupt from the XICS. If it is an IPI, it clears the IPI; if it
is a device interrupt, it prints a warning and disables the source.
Then it does the end-of-interrupt processing for the interrupt.
The other thing that briefly enabling interrupts did was to check and
clear the irq_happened flag in this CPU's PACA. Therefore, after
flushing the interrupt from the XICS, we also clear all bits except
the PACA_IRQ_HARD_DIS (interrupts are hard disabled) bit from the
irq_happened flag. The PACA_IRQ_HARD_DIS flag is set by power7_nap()
and is left set to indicate that interrupts are hard disabled. This
means we then have to ignore that flag in power7_nap(), which is
reasonable since it doesn't indicate that any interrupt event needs
servicing.
Signed-off-by: Paul Mackerras <paulus@samba.org>
|
|
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
|
|
The policy->cpus mask populated by the cpufreq driver is expected to be
hotplug invariant, since the cpufreq core copies this mask as-is to the
policy->related_cpus mask (which shouldn't vary upon hotplug). The
cpufreq core code later prunes the offline CPUs from the policy->cpus
mask.
At the moment, the powerpc cpufreq driver uses topology_thread_cpumask() to
populate policy->cpus during .init(), and hence this is NOT hotplug invariant.
Due to this, we hit the following bug:
1. Once we offline all threads of a core, say CPUs 8-15, and online
CPU 8 back, its related cpus mask shows:
$ cat /sys/devices/system/cpu/cpu8/cpufreq/related_cpus
8
[ It should have actually shown 8 9 10 11 12 13 14 15 ]
2. When we try to online the next sibling thread (CPU 9), it tries to do a fresh
initialization since it is not listed in the related_cpus mask of CPU 8.(Note
that for CPU 9, the cpufreq driver would have populated the related_cpus mask
as [ 8 9 ], since those are the 2 online threads in that core so far). During
CPU 9 init, it fails in the call to cpufreq_add_dev_symlink() because it
tries to initialize the sysfs files for CPU 8 as well (which had already been
initialized) while iterating through the policy->cpus.
As a result, we hit this bug while onlining CPU 9:
[ 1019.458183] sysfs: cannot create duplicate filename '/devices/system/cpu/cpu8/cpufreq'
[ 1019.458270] ------------[ cut here ]------------
[ 1019.458338] WARNING: at fs/sysfs/dir.c:530
[ 1019.458367] Modules linked in: xt_tcpudp ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4
xt_conntrack nf_conntrack iptable_filter ip_tables x_tables kvm binfmt_misc autofs4 lpfc
[ 1019.458543] CPU: 76 PID: 73014 Comm: bash Not tainted 3.10.11-cpufreq-10 #1
[ 1019.458590] task: c000000ff02c3200 ti: c000000fe7604000 task.ti: c000000fe7604000
[ 1019.458645] NIP: c000000000284634 LR: c000000000284630 CTR: c0000000005b5d10
[ 1019.458700] REGS: c000000fe7606fa0 TRAP: 0700 Not tainted (3.10.11-cpufreq-10)
[ 1019.458754] MSR: 9000000100029032 <SF,HV,EE,ME,IR,DR,RI> CR: 28222824 XER: 20000000
[ 1019.458883] SOFTE: 1
[ 1019.458903] CFAR: c000000000874d6c
[ 1019.458930]
GPR00: c000000000284630 c000000fe7607220 c000000000d9ab60 000000000000004a
GPR04: 0000000000000000 000000000000005a c000000000c82fb8 c000000004482448
GPR08: c000000000c7ab60 0000000000000000 0000000000000000 0000000000000000
GPR12: 0000000028222822 c00000000fe13000 0000000010142550 c000000000ce8d70
GPR16: 0000000000000001 c000000000f28c68 0000000000000000 c000000003c20030
GPR20: c000000ff6d91800 c000000000ce8fc8 c000000000b45340 c000000000e26858
GPR24: c000000000ce8d70 0000000000000000 0000000000000001 c000000ff6d91a70
GPR28: c000000fef1b2000 c000000fe7607320 c000000fc98087a0 ffffffffffffffef
[ 1019.459605] NIP [c000000000284634] .sysfs_add_one+0xe4/0x100
[ 1019.459653] LR [c000000000284630] .sysfs_add_one+0xe0/0x100
[ 1019.459689] PACATMSCRATCH [9000000100009032]
[ 1019.459726] Call Trace:
[ 1019.459747] [c000000fe7607220] [c000000000284630] .sysfs_add_one+0xe0/0x100 (unreliable)
[ 1019.459813] [c000000fe76072b0] [c0000000002854dc] .sysfs_do_create_link_sd+0x10c/0x320
[ 1019.459879] [c000000fe7607370] [c000000000718318] .cpufreq_add_dev_interface+0x2e8/0x410
[ 1019.459943] [c000000fe7607710] [c000000000718da0] .cpufreq_add_dev+0x590/0x6d0
[ 1019.460009] [c000000fe7607810] [c000000000899580] .cpufreq_cpu_callback+0x7c/0x94
[ 1019.460073] [c000000fe7607890] [c00000000086f40c] .notifier_call_chain+0x8c/0x100
[ 1019.460138] [c000000fe7607930] [c000000000091450] .cpu_notify+0x40/0xa0
[ 1019.460194] [c000000fe76079b0] [c00000000089696c] ._cpu_up+0x17c/0x1ec
[ 1019.460249] [c000000fe7607a70] [c000000000896b40] .cpu_up+0x164/0x194
[ 1019.460304] [c000000fe7607b00] [c000000000746edc] .store_online+0xbc/0xa60
[ 1019.460361] [c000000fe7607bb0] [c0000000004faf64] .dev_attr_store+0x64/0xa0
[ 1019.460417] [c000000fe7607c40] [c000000000282244] .sysfs_write_file+0xf4/0x1d0
[ 1019.460482] [c000000fe7607cf0] [c0000000001f1fa8] .vfs_write+0xe8/0x260
[ 1019.460537] [c000000fe7607d90] [c0000000001f2c44] .SyS_write+0x64/0xe0
[ 1019.460593] [c000000fe7607e30] [c000000000009d54] syscall_exit+0x0/0x98
[ 1019.460647] Instruction dump:
[ 1019.460675] 481b0b2d 60000000 e89e0010 7f83e378 38a01000 481b0b19 60000000 7f84e378
[ 1019.460774] 3c62ffd5 38632cf0 485f06dd 60000000 <0fe00000> 7f83e378 4bf5f8a5 60000000
[ 1019.460952] ---[ end trace 600f2280a5b2cd86 ]---
None of this would have occurred if related_cpus had remained unchanged during
hotplug, because in that case, CPU 9 would have done a light-weight init, thus
avoiding this duplication bug. So fix this by populating policy->cpus in a
hotplug invariant manner in the cpufreq driver.
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
The platform provides power data in watts, but hwmon expects
micro-watts.
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch is an incremental patch on top of commit af93eec4.
It adds support for resending the dump available notification and updates
the README file. It also fixes a few other minor issues.
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
Detect and recover from a machine check that occurs inside OPAL on
special SCOM load instructions. On specific SCOM reads via MMIO we may get
a machine check exception with SRR0 pointing inside OPAL. To recover from
an MC in this scenario, get a recovery instruction address and return to
it from the MC handler.
OPAL will export the machine check recoverable ranges through
device tree node mcheck-recoverable-ranges under ibm,opal:
# hexdump /proc/device-tree/ibm,opal/mcheck-recoverable-ranges
0000000 0000 0000 3000 2804 0000 000c 0000 0000
0000010 3000 2814 0000 0000 3000 27f0 0000 000c
0000020 0000 0000 3000 2814 xxxx xxxx xxxx xxxx
0000030 llll llll yyyy yyyy yyyy yyyy
...
...
#
where:
xxxx xxxx xxxx xxxx = Starting instruction address
llll llll = Length of the address range.
yyyy yyyy yyyy yyyy = recovery address
Each recoverable address range entry is a (start address, len,
recovery address) tuple, with 2 cells each for the start and recovery
addresses and 1 cell for the len, totalling 5 cells per entry. During
kernel boot time, build up the
recovery table with the list of recovery ranges from device-tree node which
will be used during machine check exception to recover from MMIO SCOM UE.
Changes in v2:
- As per Ben's comment, added mcheck-recoverable-ranges property under
ibm,opal node.
- Changed the format of the mcheck-recoverable-ranges list.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
This patch adds basic kernel enablement for reading power values, fan
speed (RPM) and temperature values on PowerNV platforms, which are
exported to user space through the /sys interface.
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|