aboutsummaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2018-08-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netHEADmasterLinus Torvalds60-352/+430
Pull networking fixes from David Miller: 1) Fix races in IPVS, from Tan Hu. 2) Missing unbind in matchall classifier, from Hangbin Liu. 3) Missing act_ife action release, from Vlad Buslov. 4) Cure lockdep splats in ila, from Cong Wang. 5) veth queue leak on link delete, from Toshiaki Makita. 6) Disable isdn's IIOCDBGVAR ioctl, it exposes kernel addresses. From Kees Cook. 7) RCU usage fixup in XDP, from Tariq Toukan. 8) Two TCP ULP fixes from Daniel Borkmann. 9) r8169 needs REALTEK_PHY as a Kconfig dependency, from Heiner Kallweit. 10) Always take tcf_lock with BH disabled, otherwise we can deadlock with rate estimator code paths. From Vlad Buslov. 11) Don't use MSI-X on RTL8106e r8169 chips, they don't resume properly. From Jian-Hong Pan. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits) ip6_vti: fix creating fallback tunnel device for vti6 ip_vti: fix a null pointer deferrence when create vti fallback tunnel r8169: don't use MSI-X on RTL8106e net: lan743x_ptp: convert to ktime_get_clocktai_ts64 net: sched: always disable bh when taking tcf_lock ip6_vti: simplify stats handling in vti6_xmit bpf: fix redirect to map under tail calls r8169: add missing Kconfig dependency tools/bpf: fix bpf selftest test_cgroup_storage failure bpf, sockmap: fix sock_map_ctx_update_elem race with exist/noexist bpf, sockmap: fix map elem deletion race with smap_stop_sock bpf, sockmap: fix leakage of smap_psock_map_entry tcp, ulp: fix leftover icsk_ulp_ops preventing sock from reattach tcp, ulp: add alias for all ulp modules bpf: fix a rcu usage warning in bpf_prog_array_copy_core() samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERM net/xdp: Fix suspicious RCU usage warning net/mlx5e: Delete unneeded function argument Documentation: networking: ti-cpsw: correct cbs parameters for Eth1 100Mb isdn: Disable IIOCDBGVAR ...
2018-08-19ip6_vti: fix creating fallback tunnel device for vti6Haishuang Yan1-0/+2
When set fb_tunnels_only_for_init_net to 1, don't create fallback tunnel device for vti6 when a new namespace is created. Tested: [root@builder2 ~]# modprobe ip6_tunnel [root@builder2 ~]# modprobe ip6_vti [root@builder2 ~]# echo 1 > /proc/sys/net/core/fb_tunnels_only_for_init_net [root@builder2 ~]# unshare -n [root@builder2 ~]# ip link 1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19ip_vti: fix a null pointer deferrence when create vti fallback tunnelHaishuang Yan1-1/+2
After set fb_tunnels_only_for_init_net to 1, the itn->fb_tunnel_dev will be NULL and will cause following crash: [ 2742.849298] BUG: unable to handle kernel NULL pointer dereference at 0000000000000941 [ 2742.851380] PGD 800000042c21a067 P4D 800000042c21a067 PUD 42aaed067 PMD 0 [ 2742.852818] Oops: 0002 [#1] SMP PTI [ 2742.853570] CPU: 7 PID: 2484 Comm: unshare Kdump: loaded Not tainted 4.18.0-rc8+ #2 [ 2742.855163] Hardware name: Fedora Project OpenStack Nova, BIOS seabios-1.7.5-11.el7 04/01/2014 [ 2742.856970] RIP: 0010:vti_init_net+0x3a/0x50 [ip_vti] [ 2742.858034] Code: 90 83 c0 48 c7 c2 20 a1 83 c0 48 89 fb e8 6e 3b f6 ff 85 c0 75 22 8b 0d f4 19 00 00 48 8b 93 00 14 00 00 48 8b 14 ca 48 8b 12 <c6> 82 41 09 00 00 04 c6 82 38 09 00 00 45 5b c3 66 0f 1f 44 00 00 [ 2742.861940] RSP: 0018:ffff9be28207fde0 EFLAGS: 00010246 [ 2742.863044] RAX: 0000000000000000 RBX: ffff8a71ebed4980 RCX: 0000000000000013 [ 2742.864540] RDX: 0000000000000000 RSI: 0000000000000013 RDI: ffff8a71ebed4980 [ 2742.866020] RBP: ffff8a71ea717000 R08: ffffffffc083903c R09: ffff8a71ea717000 [ 2742.867505] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a71ebed4980 [ 2742.868987] R13: 0000000000000013 R14: ffff8a71ea5b49c0 R15: 0000000000000000 [ 2742.870473] FS: 00007f02266c9740(0000) GS:ffff8a71ffdc0000(0000) knlGS:0000000000000000 [ 2742.872143] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2742.873340] CR2: 0000000000000941 CR3: 000000042bc20006 CR4: 00000000001606e0 [ 2742.874821] Call Trace: [ 2742.875358] ops_init+0x38/0xf0 [ 2742.876078] setup_net+0xd9/0x1f0 [ 2742.876789] copy_net_ns+0xb7/0x130 [ 2742.877538] create_new_namespaces+0x11a/0x1d0 [ 2742.878525] unshare_nsproxy_namespaces+0x55/0xa0 [ 2742.879526] ksys_unshare+0x1a7/0x330 [ 2742.880313] __x64_sys_unshare+0xe/0x20 [ 2742.881131] do_syscall_64+0x5b/0x180 [ 2742.881933] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reproduce: echo 1 > /proc/sys/net/core/fb_tunnels_only_for_init_net modprobe ip_vti unshare -n Fixes: 79134e6ce2c9 ("net: do not create fallback tunnels for non-default namespaces") Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19r8169: don't use MSI-X on RTL8106eJian-Hong Pan1-3/+6
Found the ethernet network on ASUS X441UAR doesn't come back on resume from suspend when using MSI-X. The chip is RTL8106e - version 39. [ 21.848357] libphy: r8169: probed [ 21.848473] r8169 0000:02:00.0 eth0: RTL8106e, 0c:9d:92:32:67:b4, XID 44900000, IRQ 127 [ 22.518860] r8169 0000:02:00.0 enp2s0: renamed from eth0 [ 29.458041] Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE) [ 63.227398] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full - flow control off [ 124.514648] Generic PHY r8169-200:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-200:00, irq=IGNORE) Here is the ethernet controller in detail: 02:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8101/2/6E PCI Express Fast/Gigabit Ethernet controller [10ec:8136] (rev 07) Subsystem: ASUSTeK Computer Inc. RTL810xE PCI Express Fast Ethernet controller [1043:200f] Flags: bus master, fast devsel, latency 0, IRQ 16 I/O ports at e000 [size=256] Memory at ef100000 (64-bit, non-prefetchable) [size=4K] Memory at e0000000 (64-bit, prefetchable) [size=16K] Capabilities: <access denied> Kernel driver in use: r8169 Kernel modules: r8169 Falling back to MSI fixes the issue. Fixes: 6c6aa15fdea5 ("r8169: improve interrupt handling") Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19net: lan743x_ptp: convert to ktime_get_clocktai_ts64Arnd Bergmann1-2/+1
timekeeping_clocktai64() has been renamed to ktime_get_clocktai_ts64() for consistency with the other ktime_get_* access functions. Rename the new caller that has come up as well. Question: this is the only ptp driver that sets the hardware time to the current system time in TAI. Why does it do that? Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19net: sched: always disable bh when taking tcf_lockVlad Buslov7-44/+47
Recently, ops->init() and ops->dump() of all actions were modified to always obtain tcf_lock when accessing private action state. Actions that don't depend on tcf_lock for synchronization with their data path use non-bh locking API. However, tcf_lock is also used to protect rate estimator stats in softirq context by timer callback. Change ops->init() and ops->dump() of all actions to disable bh when using tcf_lock to prevent deadlock reported by following lockdep warning: [ 105.470398] ================================ [ 105.475014] WARNING: inconsistent lock state [ 105.479628] 4.18.0-rc8+ #664 Not tainted [ 105.483897] -------------------------------- [ 105.488511] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage. [ 105.494871] swapper/16/0 [HC0[0]:SC1[1]:HE1:SE0] takes: [ 105.500449] 00000000f86c012e (&(&p->tcfa_lock)->rlock){+.?.}, at: est_fetch_counters+0x3c/0xa0 [ 105.509696] {SOFTIRQ-ON-W} state was registered at: [ 105.514925] _raw_spin_lock+0x2c/0x40 [ 105.519022] tcf_bpf_init+0x579/0x820 [act_bpf] [ 105.523990] tcf_action_init_1+0x4e4/0x660 [ 105.528518] tcf_action_init+0x1ce/0x2d0 [ 105.532880] tcf_exts_validate+0x1d8/0x200 [ 105.537416] fl_change+0x55a/0x268b [cls_flower] [ 105.542469] tc_new_tfilter+0x748/0xa20 [ 105.546738] rtnetlink_rcv_msg+0x56a/0x6d0 [ 105.551268] netlink_rcv_skb+0x18d/0x200 [ 105.555628] netlink_unicast+0x2d0/0x370 [ 105.559990] netlink_sendmsg+0x3b9/0x6a0 [ 105.564349] sock_sendmsg+0x6b/0x80 [ 105.568271] ___sys_sendmsg+0x4a1/0x520 [ 105.572547] __sys_sendmsg+0xd7/0x150 [ 105.576655] do_syscall_64+0x72/0x2c0 [ 105.580757] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 105.586243] irq event stamp: 489296 [ 105.590084] hardirqs last enabled at (489296): [<ffffffffb507e639>] _raw_spin_unlock_irq+0x29/0x40 [ 105.599765] hardirqs last disabled at (489295): [<ffffffffb507e745>] _raw_spin_lock_irq+0x15/0x50 [ 105.609277] softirqs last enabled at (489292): [<ffffffffb413a6a3>] irq_enter+0x83/0xa0 [ 105.618001] softirqs last disabled at (489293): [<ffffffffb413a800>] irq_exit+0x140/0x190 [ 105.626813] other info that might help us debug this: [ 105.633976] Possible unsafe locking scenario: [ 105.640526] CPU0 [ 105.643325] ---- [ 105.646125] lock(&(&p->tcfa_lock)->rlock); [ 105.650747] <Interrupt> [ 105.653717] lock(&(&p->tcfa_lock)->rlock); [ 105.658514] *** DEADLOCK *** [ 105.665349] 1 lock held by swapper/16/0: [ 105.669629] #0: 00000000a640ad99 ((&est->timer)){+.-.}, at: call_timer_fn+0x10b/0x550 [ 105.678200] stack backtrace: [ 105.683194] CPU: 16 PID: 0 Comm: swapper/16 Not tainted 4.18.0-rc8+ #664 [ 105.690249] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 105.698626] Call Trace: [ 105.701421] <IRQ> [ 105.703791] dump_stack+0x92/0xeb [ 105.707461] print_usage_bug+0x336/0x34c [ 105.711744] mark_lock+0x7c9/0x980 [ 105.715500] ? print_shortest_lock_dependencies+0x2e0/0x2e0 [ 105.721424] ? check_usage_forwards+0x230/0x230 [ 105.726315] __lock_acquire+0x923/0x26f0 [ 105.730597] ? debug_show_all_locks+0x240/0x240 [ 105.735478] ? mark_lock+0x493/0x980 [ 105.739412] ? check_chain_key+0x140/0x1f0 [ 105.743861] ? __lock_acquire+0x836/0x26f0 [ 105.748323] ? lock_acquire+0x12e/0x290 [ 105.752516] lock_acquire+0x12e/0x290 [ 105.756539] ? est_fetch_counters+0x3c/0xa0 [ 105.761084] _raw_spin_lock+0x2c/0x40 [ 105.765099] ? est_fetch_counters+0x3c/0xa0 [ 105.769633] est_fetch_counters+0x3c/0xa0 [ 105.773995] est_timer+0x87/0x390 [ 105.777670] ? est_fetch_counters+0xa0/0xa0 [ 105.782210] ? lock_acquire+0x12e/0x290 [ 105.786410] call_timer_fn+0x161/0x550 [ 105.790512] ? est_fetch_counters+0xa0/0xa0 [ 105.795055] ? del_timer_sync+0xd0/0xd0 [ 105.799249] ? __lock_is_held+0x93/0x110 [ 105.803531] ? mark_held_locks+0x20/0xe0 [ 105.807813] ? _raw_spin_unlock_irq+0x29/0x40 [ 105.812525] ? est_fetch_counters+0xa0/0xa0 [ 105.817069] ? est_fetch_counters+0xa0/0xa0 [ 105.821610] run_timer_softirq+0x3c4/0x9f0 [ 105.826064] ? lock_acquire+0x12e/0x290 [ 105.830257] ? __bpf_trace_timer_class+0x10/0x10 [ 105.835237] ? __lock_is_held+0x25/0x110 [ 105.839517] __do_softirq+0x11d/0x7bf [ 105.843542] irq_exit+0x140/0x190 [ 105.847208] smp_apic_timer_interrupt+0xac/0x3b0 [ 105.852182] apic_timer_interrupt+0xf/0x20 [ 105.856628] </IRQ> [ 105.859081] RIP: 0010:cpuidle_enter_state+0xd8/0x4d0 [ 105.864395] Code: 46 ff 48 89 44 24 08 0f 1f 44 00 00 31 ff e8 cf ec 46 ff 80 7c 24 07 00 0f 85 1d 02 00 00 e8 9f 90 4b ff fb 66 0f 1f 44 00 00 <4c> 8b 6c 24 08 4d 29 fd 0f 80 36 03 00 00 4c 89 e8 48 ba cf f7 53 [ 105.884288] RSP: 0018:ffff8803ad94fd20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13 [ 105.892494] RAX: 0000000000000000 RBX: ffffe8fb300829c0 RCX: ffffffffb41e19e1 [ 105.899988] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8803ad9358ac [ 105.907503] RBP: ffffffffb6636300 R08: 0000000000000004 R09: 0000000000000000 [ 105.914997] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004 [ 105.922487] R13: ffffffffb6636140 R14: ffffffffb66362d8 R15: 000000188d36091b [ 105.929988] ? trace_hardirqs_on_caller+0x141/0x2d0 [ 105.935232] do_idle+0x28e/0x320 [ 105.938817] ? arch_cpu_idle_exit+0x40/0x40 [ 105.943361] ? mark_lock+0x8c1/0x980 [ 105.947295] ? _raw_spin_unlock_irqrestore+0x32/0x60 [ 105.952619] cpu_startup_entry+0xc2/0xd0 [ 105.956900] ? cpu_in_idle+0x20/0x20 [ 105.960830] ? _raw_spin_unlock_irqrestore+0x32/0x60 [ 105.966146] ? trace_hardirqs_on_caller+0x141/0x2d0 [ 105.971391] start_secondary+0x2b5/0x360 [ 105.975669] ? set_cpu_sibling_map+0x1330/0x1330 [ 105.980654] secondary_startup_64+0xa5/0xb0 Taking tcf_lock in sample action with bh disabled causes lockdep to issue a warning regarding possible irq lock inversion dependency between tcf_lock, and psample_groups_lock that is taken when holding tcf_lock in sample init: [ 162.108959] Possible interrupt unsafe locking scenario: [ 162.116386] CPU0 CPU1 [ 162.121277] ---- ---- [ 162.126162] lock(psample_groups_lock); [ 162.130447] local_irq_disable(); [ 162.136772] lock(&(&p->tcfa_lock)->rlock); [ 162.143957] lock(psample_groups_lock); [ 162.150813] <Interrupt> [ 162.153808] lock(&(&p->tcfa_lock)->rlock); [ 162.158608] *** DEADLOCK *** In order to prevent potential lock inversion dependency between tcf_lock and psample_groups_lock, extract call to psample_group_get() from tcf_lock protected section in sample action init function. Fixes: 4e232818bd32 ("net: sched: act_mirred: remove dependency on rtnl lock") Fixes: 764e9a24480f ("net: sched: act_vlan: remove dependency on rtnl lock") Fixes: 729e01260989 ("net: sched: act_tunnel_key: remove dependency on rtnl lock") Fixes: d77284956656 ("net: sched: act_sample: remove dependency on rtnl lock") Fixes: e8917f437006 ("net: sched: act_gact: remove dependency on rtnl lock") Fixes: b6a2b971c0b0 ("net: sched: act_csum: remove dependency on rtnl lock") Fixes: 2142236b4584 ("net: sched: act_bpf: remove dependency on rtnl lock") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds53-726/+3110
Pull first set of KVM updates from Paolo Bonzini: "PPC: - minor code cleanups x86: - PCID emulation and CR3 caching for shadow page tables - nested VMX live migration - nested VMCS shadowing - optimized IPI hypercall - some optimizations ARM will come next week" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (85 commits) kvm: x86: Set highest physical address bits in non-present/reserved SPTEs KVM/x86: Use CC_SET()/CC_OUT in arch/x86/kvm/vmx.c KVM: X86: Implement PV IPIs in linux guest KVM: X86: Add kvm hypervisor init time platform setup callback KVM: X86: Implement "send IPI" hypercall KVM/x86: Move X86_CR4_OSXSAVE check into kvm_valid_sregs() KVM: x86: Skip pae_root shadow allocation if tdp enabled KVM/MMU: Combine flushing remote tlb in mmu_set_spte() KVM: vmx: skip VMWRITE of HOST_{FS,GS}_BASE when possible KVM: vmx: skip VMWRITE of HOST_{FS,GS}_SEL when possible KVM: vmx: always initialize HOST_{FS,GS}_BASE to zero during setup KVM: vmx: move struct host_state usage to struct loaded_vmcs KVM: vmx: compute need to reload FS/GS/LDT on demand KVM: nVMX: remove a misleading comment regarding vmcs02 fields KVM: vmx: rename __vmx_load_host_state() and vmx_save_host_state() KVM: vmx: add dedicated utility to access guest's kernel_gs_base KVM: vmx: track host_state.loaded using a loaded_vmcs pointer KVM: vmx: refactor segmentation code in vmx_save_host_state() kvm: nVMX: Fix fault priority for VMX operations kvm: nVMX: Fix fault vector for VMX operation at CPL > 0 ...
2018-08-19Merge tag 'riscv-for-linus-4.19-mw0' of ↵Linus Torvalds27-59/+625
git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux Pull RISC-V updates from Palmer Dabbelt: "This contains some major improvements to the RISC-V port, including the necessary interrupt controller and timer support to actually make it to userspace. Support for three devices has been added: - the ISA-mandated timers on RISC-V systems. - the ISA-mandated first-level interrupt controller on RISC-V systems, which is handled as part of our core arch code because it's very small and tightly tied to the ISA. - SiFive's platform-level interrupt controller, which talks to the actual devices. In addition to these new devices, there are a handful of cleanups all over the RISC-V tree: - build fixes for various configurations: * A fix to the vDSO build's makefile so it respects CFLAGS. * The addition of __lshrti3, a libgcc derived function necessary for some 32-bit configurations. * !SMP && PERF_EVENTS - Cleanups to the arch code to remove the remnants of old versions of the drivers that were just properly submitted. * Some dead code from the timer driver, most of which wasn't ever even compiled. * Cleanups of some interrupt #defines, which are now local to the interrupt handling code. - Fixes to ptrace(), which while not being sufficient to fully make GDB work are at least sufficient to get simple GDB tasks to work. - Early printk support via RISC-V's architecturally mandated SBI console device. - A fix to our early debug trap handler to ensure it's always aligned. These patches have all been through a fairly extensive review process, but as this enables a whole pile of functionality (ie, userspace) I'm confident we'll need to submit a few more patches. The only concrete issues I know about are the sys_riscv_flush_icache patches, but as I managed to screw those up on Friday I figured it'd be best to let them bake another week. This tag boots a Fedora root filesystem on QEMU's master branch for me, and before this morning's rebase (from 4.18-rc8 to 4.18) it booted on the HiFive Unleashed. Thanks to Christoph Hellwig and the other guys at WD for getting the new drivers in shape!" * tag 'riscv-for-linus-4.19-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux: dt-bindings: interrupt-controller: SiFive Plaform Level Interrupt Controller dt-bindings: interrupt-controller: RISC-V local interrupt controller RISC-V: Fix !CONFIG_SMP compilation error irqchip: add a SiFive PLIC driver RISC-V: Add the directive for alignment of stvec's value clocksource: new RISC-V SBI timer driver RISC-V: implement low-level interrupt handling RISC-V: add a definition for the SIE SEIE bit RISC-V: remove INTERRUPT_CAUSE_* defines from asm/irq.h RISC-V: simplify software interrupt / IPI code RISC-V: remove timer leftovers RISC-V: Add early printk support via the SBI console RISC-V: Don't increment sepc after breakpoint. RISC-V: implement __lshrti3. RISC-V: Use KBUILD_CFLAGS instead of KCFLAGS when building the vDSO
2018-08-19Merge tag 'char-misc-4.19-rc1' of ↵Linus Torvalds1-6/+2
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull UIO fix from Greg KH: "Here is a single UIO fix that I forgot to send before 4.18-final came out. It reverts a UIO patch that went in the 4.18 development window that was causing problems. This patch has been in linux-next for a while with no problems, I just forgot to send it earlier, or as part of the larger char/misc patch series from yesterday, my fault" * tag 'char-misc-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: Revert "uio: use request_threaded_irq instead"
2018-08-18Merge branch 'for-linus' of ↵Linus Torvalds99-967/+1468
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input updates from Dmitry Torokhov: - a new driver for Rohm BU21029 touch controller - new bitmap APIs: bitmap_alloc, bitmap_zalloc and bitmap_free - updates to Atmel, eeti. pxrc and iforce drivers - assorted driver cleanups and fixes. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (57 commits) MAINTAINERS: Add PhoenixRC Flight Controller Adapter Input: do not use WARN() in input_alloc_absinfo() Input: mark expected switch fall-throughs Input: raydium_i2c_ts - use true and false for boolean values Input: evdev - switch to bitmap API Input: gpio-keys - switch to bitmap_zalloc() Input: elan_i2c_smbus - cast sizeof to int for comparison bitmap: Add bitmap_alloc(), bitmap_zalloc() and bitmap_free() md: Avoid namespace collision with bitmap API dm: Avoid namespace collision with bitmap API Input: pm8941-pwrkey - add resin entry Input: pm8941-pwrkey - abstract register offsets and event code Input: iforce - reorganize joystick configuration lists Input: atmel_mxt_ts - move completion to after config crc is updated Input: atmel_mxt_ts - don't report zero pressure from T9 Input: atmel_mxt_ts - zero terminate config firmware file Input: atmel_mxt_ts - refactor config update code to add context struct Input: atmel_mxt_ts - config CRC may start at T71 Input: atmel_mxt_ts - remove unnecessary debug on ENOMEM Input: atmel_mxt_ts - remove duplicate setup of ABS_MT_PRESSURE ...
2018-08-18Merge tag 'hwlock-v4.19' of git://github.com/andersson/remoteprocLinus Torvalds3-9/+262
Pull hwspinlock updates from Bjorn Andersson: "This introduces devres helpers and an API to request a lock by name, then migrates the sprd SPI driver to use these" * tag 'hwlock-v4.19' of git://github.com/andersson/remoteproc: hwspinlock: Fix incorrect return pointers spi: sprd: Change to use devm_hwspin_lock_request_specific() spi: sprd: Replace of_hwspin_lock_get_id() with of_hwspin_lock_get_id_byname() hwspinlock: Fix one comment mistake hwspinlock: Remove redundant config hwspinlock: Add devm_xxx() APIs to register/unregister one hwlock controller hwspinlock: Add devm_xxx() APIs to request/free hwlock hwspinlock: Add one new API to support getting a specific hwlock by the name
2018-08-18Merge tag 'rpmsg-v4.19' of git://github.com/andersson/remoteprocLinus Torvalds5-22/+53
Pull rpmsg updates from Bjorn Andersson: "This fixes a few compile and kerneldoc warnings, allows rpmsg devices to handle power domains, allow for labeling GLINK edges and supports compat for rpmsg_char" * tag 'rpmsg-v4.19' of git://github.com/andersson/remoteproc: rpmsg: Add compat ioctl for rpmsg char driver rpmsg: glink: Store edge name for glink device dt-bindings: soc: qcom: Add label for GLINK bindings rpmsg: core: add support to power domains for devices rpmsg: smd: fix kerneldoc warnings rpmsg: glink: Fix various kerneldoc warnings. rpmsg: glink: correctly annotate intent members rpmsg: smd: Add missing include of sizes.h
2018-08-18Merge tag 'rproc-v4.19' of git://github.com/andersson/remoteprocLinus Torvalds18-361/+1188
Pull remoteproc updates from Bjorn Andersson: "This adds support for pre-start and post-shutdown hooks for remoteproc subdevices, refactors the Qualcomm Hexagon support to allow reuse between several drivers, makes authentication in the MDT file loader optional, migrates a few format strings to use %pK and migrates the Davinci driver to use the reset framework" * tag 'rproc-v4.19' of git://github.com/andersson/remoteproc: remoteproc/davinci: use the reset framework remoteproc/davinci: Mark error recovery as disabled remoteproc: st_slim: replace "%p" with "%pK" remoteproc: replace "%p" with "%pK" remoteproc: qcom: fix Q6V5_WCSS dependencies remoteproc: Reset table_ptr in rproc_start() failure paths remoteproc: qcom: q6v5-pil: fix modem hang on SDM845 after axis2 clk unvote remoteproc: qcom q6v5: fix modular build remoteproc: Introduce prepare and unprepare for subdevices remoteproc: rename subdev probe and remove functions remoteproc: Make client initialize ops in rproc_subdev remoteproc: Make start and stop in subdev optional remoteproc: Rename subdev functions to start/stop remoteproc: qcom: Introduce Hexagon V5 based WCSS driver remoteproc: qcom: q6v5-pil: Use common q6v5 helpers remoteproc: qcom: adsp: Use common q6v5 helpers remoteproc: q6v5: Extract common resource handling remoteproc: qcom: mdt_loader: Make the firmware authentication optional
2018-08-18Merge tag 'linux-watchdog-4.19-rc1' of ↵Linus Torvalds14-105/+351
git://www.linux-watchdog.org/linux-watchdog Pull watchdog updates from Wim Van Sebroeck: - add MEN 16z069 IP-Core driver - renesas-wdt: add support for the R8A77990 wdt - stm32_iwdg: Add stm32mp1 support and pclk feature - sp805_wdt, orion_wdt, sprd_wdt: several improvements - imx2_wdt, stmp3xxx: switch to SPDX identifier * tag 'linux-watchdog-4.19-rc1' of git://www.linux-watchdog.org/linux-watchdog: watchdog: fix dependencies of menz69_wdt.o watchdog: sp805: Add clock-frequency property watchdog: add driver for the MEN 16z069 IP-Core watchdog: sprd_wdt: Remove redundant dev_err call in sprd_wdt_probe() watchdog: stmp3xxx: Switch to SPDX identifier watchdog: imx2_wdt: Switch to SPDX identifier watchdog: sp805: set WDOG_HW_RUNNING when appropriate watchdog: sp805: add 'timeout-sec' DT property support dt-bindings: watchdog: Add optional 'timeout-sec' property for sp805 dt-bindings: watchdog: Consolidate SP805 binding docs watchdog: orion_wdt: Mark watchdog as active when running at probe watchdog: stm32: add pclk feature for stm32mp1 dt-bindings: watchdog: add stm32mp1 support dt-bindings: watchdog: renesas-wdt: Add support for the R8A77990 wdt
2018-08-18Merge tag 'dmaengine-4.19-rc1' of git://git.infradead.org/users/vkoul/slave-dmaLinus Torvalds26-307/+1600
Pull DMAengine updates from Vinod Koul: "This round brings couple of framework changes, a new driver and usual driver updates: - new managed helper for dmaengine framework registration - split dmaengine pause capability to pause and resume and allow drivers to report that individually - update dma_request_chan_by_mask() to handle deferred probing - move imx-sdma to use virt-dma - new driver for Actions Semi Owl family S900 controller - minor updates to intel, renesas, mv_xor, pl330 etc" * tag 'dmaengine-4.19-rc1' of git://git.infradead.org/users/vkoul/slave-dma: (46 commits) dmaengine: Add Actions Semi Owl family S900 DMA driver dt-bindings: dmaengine: Add binding for Actions Semi Owl SoCs dmaengine: sh: rcar-dmac: Should not stop the DMAC by rcar_dmac_sync_tcr() dmaengine: mic_x100_dma: use the new helper to simplify the code dmaengine: add a new helper dmaenginem_async_device_register dmaengine: imx-sdma: add memcpy interface dmaengine: imx-sdma: add SDMA_BD_MAX_CNT to replace '0xffff' dmaengine: dma_request_chan_by_mask() to handle deferred probing dmaengine: pl330: fix irq race with terminate_all dmaengine: Revert "dmaengine: mv_xor_v2: enable COMPILE_TEST" dmaengine: mv_xor_v2: use {lower,upper}_32_bits to configure HW descriptor address dmaengine: mv_xor_v2: enable COMPILE_TEST dmaengine: mv_xor_v2: move unmap to before callback dmaengine: mv_xor_v2: convert callback to helper function dmaengine: mv_xor_v2: kill the tasklets upon exit dmaengine: mv_xor_v2: explicitly freeup irq dmaengine: sh: rcar-dmac: Add dma_pause operation dmaengine: sh: rcar-dmac: add a new function to clear CHCR.DE with barrier dmaengine: idma64: Support dmaengine_terminate_sync() dmaengine: hsu: Support dmaengine_terminate_sync() ...
2018-08-18Merge tag 'mmc-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmcLinus Torvalds43-494/+1373
Pull MMC updates from Ulf Hansson: "Updates for MMC for v4.19. MMC core: - Add some fine-grained hooks to further support HS400 tuning - Improve error path for bus width setting for HS400es - Use a common method when checking R1 status MMC host: - renesas_sdhi: Add r8a77990 support - renesas_sdhi: Add eMMC HS400 mode support - tmio/renesas_sdhi: Improve tuning/clock management - tmio: Add eMMC HS400 mode support - sunxi: Add support for 3.3V eMMC DDR mode - mmci: Initial support to manage variant specific callbacks - sdhci: Don't try 3.3V I/O voltage if not supported - sdhci-pci-dwc-mshc: Add driver to support Synopsys dwc mshc SDHCI PCI - sdhci-of-dwcmshc: Add driver to support Synopsys DWC MSHC SDHCI - sdhci-msm: Add support for new version sdcc V5 - sdhci-pci-o2micro: Add support for O2 eMMC HS200 mode - sdhci-pci-o2micro: Add support for O2 hardware tuning - sdhci-pci-o2micro: Add MSI interrupt support for O2 SD host - sdhci-pci: Add support for Intel ICP - sdhci-tegra: Prevent ACMD23 and HS200 mode on Tegra 3 - sdhci-tegra: Fix eMMC DDR52 mode - sdhci-tegra: Improve clock management - dw_mmc-rockchip: Document compatible string for px30 - sdhci-esdhc-imx: Add support for 3.3V eMMC DDR mode - sdhci-of-esdhc: Set proper DMA mask for ls104x chips - sdhci-of-esdhc: Improve clock management - sdhci-of-arasan: Add a quirk to manage unstable clocks - dw_mmc-exynos: Address potential external abort during system resume - pxamci: Add support for common MMC DT bindings - pxamci: Several cleanups and improvements - pxamci: Merge immutable branch for pxa to switch to DMA slave maps" * tag 'mmc-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc: (56 commits) mmc: core: improve reasonableness of bus width setting for HS400es mmc: tmio: remove unneeded variable in tmio_mmc_start_command() mmc: renesas_sdhi: Fix sampling clock position selecting mmc: tmio: Fix tuning flow mmc: sunxi: remove output of virtual base address dt-bindings: mmc: rockchip-dw-mshc: add description for px30 mmc: renesas_sdhi: Add r8a77990 support mmc: sunxi: allow 3.3V DDR when DDR is available mmc: mmci: Add and implement a ->dma_setup() callback for qcom dml mmc: mmci: Initial support to manage variant specific callbacks mmc: tegra: Force correct divider calculation on DDR50/52 mmc: sdhci: Add MSI interrupt support for O2 SD host mmc: sdhci: Add support for O2 hardware tuning mmc: sdhci: Export sdhci tuning function symbol mmc: sdhci: Change O2 Host HS200 mode clock frequency to 200MHz mmc: sdhci: Add support for O2 eMMC HS200 mode mmc: tegra: Add and use tegra_sdhci_get_max_clock() mmc: sdhci-esdhc-imx: fix indent mmc: sdhci-esdhc-imx: disable clocks before changing frequency mmc: tegra: prevent ACMD23 on Tegra 3 ...
2018-08-18ip6_vti: simplify stats handling in vti6_xmitHaishuang Yan1-11/+3
Same as ip_vti, use iptunnel_xmit_stats to updates stats in tunnel xmit code path. Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-18pcmcia: remove long deprecated pcmcia_request_exclusive_irq() functionLinus Torvalds3-49/+0
This function was created as a deprecated fallback case back in 2010 by commit eb14120f743d ("pcmcia: re-work pcmcia_request_irq()") for legacy cases. Actual in-kernel users haven't been around for a long while. The last in-kernel user was apparently removed four years ago by commit 5f5316fcd08e ("am2150: Update nmclan_cs.c to use update PCMCIA API"). Just remove it entirely. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18deprecate the '__deprecated' attribute warnings entirely and for goodLinus Torvalds3-28/+2
We haven't had lots of deprecation warnings lately, but the rdma use of it made them flare up again. They are not useful. They annoy everybody, and nobody ever does anything about them, because it's always "somebody elses problem". And when people start thinking that warnings are normal, they stop looking at them, and the real warnings that mean something go unnoticed. If you want to get rid of a function, just get rid of it. Convert every user to the new world order. And if you can't do that, then don't annoy everybody else with your marking that says "I couldn't be bothered to fix this, so I'll just spam everybody elses build logs with warnings about my laziness". Make a kernelnewbies wiki page about things that could be cleaned up, write a blog post about it, or talk to people on the mailing lists. But don't add warnings to the kernel build about cleanup that you think should happen but you aren't doing yourself. Don't. Just don't. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-18Merge tag 'driver-core-4.19-rc1' of ↵Linus Torvalds26-161/+272
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here are all of the driver core and related patches for 4.19-rc1. Nothing huge here, just a number of small cleanups and the ability to now stop the deferred probing after init happens. All of these have been in linux-next for a while with only a merge issue reported" * tag 'driver-core-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (21 commits) base: core: Remove WARN_ON from link dependencies check drivers/base: stop new probing during shutdown drivers: core: Remove glue dirs from sysfs earlier driver core: remove unnecessary function extern declare sysfs.h: fix non-kernel-doc comment PM / Domains: Stop deferring probe at the end of initcall iommu: Remove IOMMU_OF_DECLARE iommu: Stop deferring probe at end of initcalls pinctrl: Support stopping deferred probe after initcalls dt-bindings: pinctrl: add a 'pinctrl-use-default' property driver core: allow stopping deferred probe after init driver core: add a debugfs entry to show deferred devices sysfs: Fix internal_create_group() for named group updates base: fix order of OF initialization linux/device.h: fix kernel-doc notation warning Documentation: update firmware loader fallback reference kobject: Replace strncpy with memcpy drivers: base: cacheinfo: use OF property_read_u32 instead of get_property,read_number kernfs: Replace strncpy with memcpy device: Add #define dev_fmt similar to #define pr_fmt ...
2018-08-18Merge tag 'char-misc-4.19-rc1' of ↵Linus Torvalds302-1559/+18142
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver updates from Greg KH: "Here is the bit set of char/misc drivers for 4.19-rc1 There is a lot here, much more than normal, seems like everyone is writing new driver subsystems these days... Anyway, major things here are: - new FSI driver subsystem, yet-another-powerpc low-level hardware bus - gnss, finally an in-kernel GPS subsystem to try to tame all of the crazy out-of-tree drivers that have been floating around for years, combined with some really hacky userspace implementations. This is only for GNSS receivers, but you have to start somewhere, and this is great to see. Other than that, there are new slimbus drivers, new coresight drivers, new fpga drivers, and loads of DT bindings for all of these and existing drivers. All of these have been in linux-next for a while with no reported issues" * tag 'char-misc-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (255 commits) android: binder: Rate-limit debug and userspace triggered err msgs fsi: sbefifo: Bump max command length fsi: scom: Fix NULL dereference misc: mic: SCIF Fix scif_get_new_port() error handling misc: cxl: changed asterisk position genwqe: card_base: Use true and false for boolean values misc: eeprom: assignment outside the if statement uio: potential double frees if __uio_register_device() fails eeprom: idt_89hpesx: clean up an error pointer vs NULL inconsistency misc: ti-st: Fix memory leak in the error path of probe() android: binder: Show extra_buffers_size in trace firmware: vpd: Fix section enabled flag on vpd_section_destroy platform: goldfish: Retire pdev_bus goldfish: Use dedicated macros instead of manual bit shifting goldfish: Add missing includes to goldfish.h mux: adgs1408: new driver for Analog Devices ADGS1408/1409 mux dt-bindings: mux: add adi,adgs1408 Drivers: hv: vmbus: Cleanup synic memory free path Drivers: hv: vmbus: Remove use of slow_virt_to_phys() Drivers: hv: vmbus: Reset the channel callback in vmbus_onoffer_rescind() ...
2018-08-18Merge tag 'staging-4.19-rc1' of ↵Linus Torvalds466-27574/+28168
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging Pull staging and IIO updates from Greg KH: "Here are the big staging/iio patches for 4.19-rc1. Lots of churn here, with tons of cleanups happening in staging drivers, a removal of an old crypto driver that no one was using (skein), and the addition of some new IIO drivers. Also added was a "gasket" driver from Google that needs loads of work and the erofs filesystem. Even with adding all of the new drivers and a new filesystem, we are only adding about 1000 lines overall to the kernel linecount, which shows just how much cleanup happened, and how big the unused crypto driver was. All of these have been in the linux-next tree for a while now with no reported issues" * tag 'staging-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (903 commits) staging:rtl8192u: Remove unused macro definitions - Style staging:rtl8192u: Add spaces around '+' operator - Style staging:rtl8192u: Remove stale comment - Style staging: rtl8188eu: remove unused mp_custom_oid.h staging: fbtft: Add spaces around / - Style staging: fbtft: Erases some repetitive usage of function name - Style staging: fbtft: Adjust some empty-line problems - Style staging: fbtft: Removes one nesting level to help readability - Style staging: fbtft: Changes gamma table to define. staging: fbtft: A bit more information on dev_err. staging: fbtft: Fixes some alignment issues - Style staging: fbtft: Puts macro arguments in parenthesis to avoid precedence issues - Style staging: rtl8188eu: remove unused array dB_Invert_Table staging: rtl8188eu: remove whitespace, add missing blank line staging: rtl8188eu: use is_multicast_ether_addr in rtw_sta_mgt.c staging: rtl8188eu: remove whitespace - style staging: rtl8188eu: cleanup block comment - style staging: rtl8188eu: use is_multicast_ether_addr in rtl8188eu_xmit.c staging: rtl8188eu: use is_multicast_ether_addr in recv_linux.c staging: rtlwifi: refactor rtl_get_tcb_desc ...
2018-08-18Merge tag 'tty-4.19-rc1' of ↵Linus Torvalds44-338/+1381
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty/serial driver updates from Greg KH: "Here is the big tty and serial driver pull request for 4.19-rc1. It's not all that big, just a number of small serial driver updates and fixes, along with some better vt handling for unicode characters for those using braille terminals. All of these patches have been in linux-next for a long time with no reported issues" * tag 'tty-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (73 commits) tty: serial: 8250: Revert NXP SC16C2552 workaround serial: 8250_exar: Read INT0 from slave device, too tty: rocket: Fix possible buffer overwrite on register_PCI serial: 8250_dw: Add ACPI support for uart on Broadcom SoC serial: 8250_dw: always set baud rate in dw8250_set_termios dt-bindings: serial: Add binding for uartlite tty: serial: uartlite: Add support for suspend and resume tty: serial: uartlite: Add clock adaptation tty: serial: uartlite: Add structure for private data serial: sh-sci: Improve support for separate TEI and DRI interrupts serial: sh-sci: Remove SCIx_RZ_SCIFA_REGTYPE serial: sh-sci: Allow for compressed SCIF address serial: sh-sci: Improve interrupts description serial: 8250: Use cached port name directly in messages serial: 8250_exar: Drop unused variable in pci_xr17v35x_setup() vt: drop unused struct vt_struct vt: avoid a VLA in the unicode screen scroll function vt: add /dev/vcsu* to devices.txt vt: coherence validation code for the unicode screen buffer vt: selection: take screen contents from uniscr if available ...
2018-08-18Merge tag 'usb-4.19-rc1' of ↵Linus Torvalds165-1813/+5847
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB/PHY updates from Greg KH: "Here is the big USB and phy driver patch set for 4.19-rc1. Nothing huge but there was a lot of work that happened this development cycle: - lots of type-c work, with drivers graduating out of staging, and displayport support being added. - new PHY drivers - the normal collection of gadget driver updates and fixes - code churn to work on the urb handling path, using irqsave() everywhere in anticipation of making this codepath a lot simpler in the future. - usbserial driver fixes and reworks - other misc changes All of these have been in linux-next with no reported issues for a while" * tag 'usb-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (159 commits) USB: serial: pl2303: add a new device id for ATEN usb: renesas_usbhs: Kconfig: convert to SPDX identifiers usb: dwc3: gadget: Check MaxPacketSize from descriptor usb: dwc2: Turn on uframe_sched on "stm32f4x9_fsotg" platforms usb: dwc2: Turn on uframe_sched on "amlogic" platforms usb: dwc2: Turn on uframe_sched on "his" platforms usb: dwc2: Turn on uframe_sched on "bcm" platforms usb: dwc2: gadget: ISOC's starting flow improvement usb: dwc2: Make dwc2_readl/writel functions endianness-agnostic. usb: dwc3: core: Enable AutoRetry feature in the controller usb: dwc3: Set default mode for dwc_usb31 usb: gadget: udc: renesas_usb3: Add register of usb role switch usb: dwc2: replace ioread32/iowrite32_rep with dwc2_readl/writel_rep usb: dwc2: Modify dwc2_readl/writel functions prototype usb: dwc3: pci: Intel Merrifield can be host usb: dwc3: pci: Supply device properties via driver data arm64: dts: dwc3: description of incr burst type usb: dwc3: Enable undefined length INCR burst type usb: dwc3: add global soc bus configuration reg0 usb: dwc3: Describe 'wakeup_work' field of struct dwc3_pci ...
2018-08-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller16-130/+123
Daniel Borkmann says: ==================== pull-request: bpf 2018-08-18 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix a BPF selftest failure in test_cgroup_storage due to rlimit restrictions, from Yonghong. 2) Fix a suspicious RCU rcu_dereference_check() warning triggered from removing a device's XDP memory allocator by using the correct rhashtable lookup function, from Tariq. 3) A batch of BPF sockmap and ULP fixes mainly fixing leaks and races as well as enforcing module aliases for ULPs. Another fix for BPF map redirect to make them work again with tail calls, from Daniel. 4) Fix XDP BPF samples to unload their programs upon SIGTERM, from Jesper. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller20-95/+163
Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for your net tree: 1) Infinite loop in IPVS when net namespace is released, from Tan Hu. 2) Do not show negative timeouts in ip_vs_conn by using the new jiffies_delta_to_msecs(), patches from Matteo Croce. 3) Set F_IFACE flag for linklocal addresses in ip6t_rpfilter, from Florian Westphal. 4) Fix overflow in set size allocation, from Taehee Yoo. 5) Use netlink_dump_start() from ctnetlink to fix memleak from the error path, again from Florian. 6) Register nfnetlink_subsys in last place, otherwise netns init path may lose race and see net->nft uninitialized data. This also reverts previous attempt to fix this by increase netns refcount, patches from Florian. 7) Remove conntrack entries on layer 4 protocol tracker module removal, from Florian. 8) Use GFP_KERNEL_ACCOUNT for xtables blob allocation, from Michal Hocko. 9) Get tproxy documentation in sync with existing codebase, from Mate Eckl. 10) Honor preset layer 3 protocol via ctx->family in the new nft_ct timeout infrastructure, from Harsha Sharma. 11) Let uapi nfnetlink_osf.h compile standalone with no errors, from Dmitry V. Levin. 12) Missing braces compilation warning in nft_tproxy, patch from Mate Eclk. 13) Disregard bogus check to bail out on non-anonymous sets from the dynamic set update extension. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-17Merge tag '9p-for-4.19-2' of git://github.com/martinetd/linuxLinus Torvalds11-116/+122
Pull 9p updates from Dominique Martinet: "This contains mostly fixes (6 to be backported to stable) and a few changes, here is the breakdown: - rework how fids are attributed by replacing some custom tracking in a list by an idr - for packet-based transports (virtio/rdma) validate that the packet length matches what the header says - a few race condition fixes found by syzkaller - missing argument check when NULL device is passed in sys_mount - a few virtio fixes - some spelling and style fixes" * tag '9p-for-4.19-2' of git://github.com/martinetd/linux: (21 commits) net/9p/trans_virtio.c: add null terminal for mount tag 9p/virtio: fix off-by-one error in sg list bounds check 9p: fix whitespace issues 9p: fix multiple NULL-pointer-dereferences fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed 9p: validate PDU length net/9p/trans_fd.c: fix race by holding the lock net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree() net/9p/virtio: Fix hard lockup in req_done net/9p/trans_virtio.c: fix some spell mistakes in comments 9p/net: Fix zero-copy path in the 9p virtio transport 9p: Embed wait_queue_head into p9_req_t 9p: Replace the fidlist with an IDR 9p: Change p9_fid_create calling convention 9p: Fix comment on smp_wmb net/9p/client.c: version pointer uninitialized fs/9p/v9fs.c: fix spelling mistake "Uknown" -> "Unknown" net/9p: fix error path of p9_virtio_probe 9p/net/protocol.c: return -ENOMEM when kmalloc() failed net/9p/client.c: add missing '\n' at the end of p9_debug() ...
2018-08-17Merge branch 'akpm' (patches from Andrew)Linus Torvalds152-1165/+2122
Merge updates from Andrew Morton: - a few misc things - a few Y2038 fixes - ntfs fixes - arch/sh tweaks - ocfs2 updates - most of MM * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits) mm/hmm.c: remove unused variables align_start and align_end fs/userfaultfd.c: remove redundant pointer uwq mm, vmacache: hash addresses based on pmd mm/list_lru: introduce list_lru_shrink_walk_irq() mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one() mm/list_lru.c: move locking from __list_lru_walk_one() to its caller mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node() mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP mm/sparse: delete old sparse_init and enable new one mm/sparse: add new sparse_init_nid() and sparse_init() mm/sparse: move buffer init/fini to the common place mm/sparse: use the new sparse buffer functions in non-vmemmap mm/sparse: abstract sparse buffer allocations mm/hugetlb.c: don't zero 1GiB bootmem pages mm, page_alloc: double zone's batchsize mm/oom_kill.c: document oom_lock mm/hugetlb: remove gigantic page support for HIGHMEM mm, oom: remove sleep from under oom_lock kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous() mm/cma: remove unsupported gfp_mask parameter from cma_alloc() ...
2018-08-17mm/hmm.c: remove unused variables align_start and align_endColin Ian King1-4/+1
Variables align_start and align_end are being assigned but are never used hence they are redundant and can be removed. Cleans up clang warnings: warning: variable 'align_start' set but not used [-Wunused-but-set-variable] warning: variable 'align_size' set but not used [-Wunused-but-set-variable] Link: http://lkml.kernel.org/r/20180714161124.3923-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17fs/userfaultfd.c: remove redundant pointer uwqColin Ian King1-3/+0
Pointer uwq is being assigned but is never used hence it is redundant and can be removed. Cleans up clang warning: warning: variable 'uwq' set but not used [-Wunused-but-set-variable] Link: http://lkml.kernel.org/r/20180717090802.18357-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm, vmacache: hash addresses based on pmdDavid Rientjes2-15/+29
When perf profiling a wide variety of different workloads, it was found that vmacache_find() had higher than expected cost: up to 0.08% of cpu utilization in some cases. This was found to rival other core VM functions such as alloc_pages_vma() with thp enabled and default mempolicy, and the conditionals in __get_vma_policy(). VMACACHE_HASH() determines which of the four per-task_struct slots a vma is cached for a particular address. This currently depends on the pfn, so pfn 5212 occupies a different vmacache slot than its neighboring pfn 5213. vmacache_find() iterates through all four of current's vmacache slots when looking up an address. Hashing based on pfn, an address has ~1/VMACACHE_SIZE chance of being cached in the first vmacache slot, or about 25%, *if* the vma is cached. This patch hashes an address by its pmd instead of pte to optimize for workloads with good spatial locality. This results in a higher probability of vmas being cached in the first slot that is checked: normally ~70% on the same workloads instead of 25%. [rientjes@google.com: various updates] Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1807231532290.109445@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1807091749150.114630@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/list_lru: introduce list_lru_shrink_walk_irq()Sebastian Andrzej Siewior3-6/+42
Provide list_lru_shrink_walk_irq() and let it behave like list_lru_walk_one() except that it locks the spinlock with spin_lock_irq(). This is used by scan_shadow_nodes() because its lock nests within the i_pages lock which is acquired with IRQ. This change allows to use proper locking promitives instead hand crafted lock_irq_disable() plus spin_lock(). There is no EXPORT_SYMBOL provided because the current user is in-kernel only. Add list_lru_shrink_walk_irq() which acquires the spinlock with the proper locking primitives. Link: http://lkml.kernel.org/r/20180716111921.5365-5-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/list_lru.c: pass struct list_lru_node* as an argument to ↵Sebastian Andrzej Siewior1-6/+6
__list_lru_walk_one() __list_lru_walk_one() is invoked with struct list_lru *lru, int nid as the first two argument. Those two are only used to retrieve struct list_lru_node. Since this is already done by the caller of the function for the locking, we can pass struct list_lru_node* directly and avoid the dance around it. Link: http://lkml.kernel.org/r/20180716111921.5365-4-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/list_lru.c: move locking from __list_lru_walk_one() to its callerSebastian Andrzej Siewior1-5/+13
Move the locking inside __list_lru_walk_one() to its caller. This is a preparation step in order to introduce list_lru_walk_one_irq() which does spin_lock_irq() instead of spin_lock() for the locking. Link: http://lkml.kernel.org/r/20180716111921.5365-3-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()Sebastian Andrzej Siewior1-2/+2
Patch series "mm/list_lru: Add list_lru_shrink_walk_irq() and a user". This series removes the local_irq_disable() around list_lru_shrink_walk() (as used by mm/workingset) by adding list_lru_shrink_walk_irq(). Vladimir Davydov preferred this over `irq' argument which I added to struct list_lru. The initial post (of this series) received a Reviewed-by tag by Vladimir Davydov which I added to each patch of the series. The series applies on top of akpm's tree which has Kirill's shrink_slab series and does not clash with it (akpm asked me to wait a week or so and repost it then). I tested the code paths by triggering the OOM-killer via memory over commit and lockdep did not complain (nor did I see any warnings). This patch (of 4): list_lru_walk_node() invokes __list_lru_walk_one() with -1 as the memcg_idx parameter. The same can be achieved by list_lru_walk_one() and passing NULL as memcg argument which then gets converted into -1. This is a preparation step when the spin_lock() function is lifted to the caller of __list_lru_walk_one(). Invoke list_lru_walk_one() instead __list_lru_walk_one() when possible. Link: http://lkml.kernel.org/r/20180716111921.5365-2-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAPHuang Ying1-2/+3
CONFIG_THP_SWAP should depend on CONFIG_SWAP, because it's unreasonable to optimize swapping for THP (Transparent Huge Page) without basic swapping support. In original code, when CONFIG_SWAP=n and CONFIG_THP_SWAP=y, split_swap_cluster() will not be built because it is in swapfile.c, but it will be called in huge_memory.c. This doesn't trigger a build error in practice because the call site is enclosed by PageSwapCache(), which is defined to be constant 0 when CONFIG_SWAP=n. But this is fragile and should be fixed. The comments are fixed too to reflect the latest progress. Link: http://lkml.kernel.org/r/20180713021228.439-1-ying.huang@intel.com Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse: delete old sparse_init and enable new onePavel Tatashin4-267/+1
Rename new_sparse_init() to sparse_init() which enables it. Delete old sparse_init() and all the code that became obsolete with. [pasha.tatashin@oracle.com: remove unused sparse_mem_maps_populate_node()] Link: http://lkml.kernel.org/r/20180716174447.14529-6-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20180712203730.8703-6-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Tested-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse: add new sparse_init_nid() and sparse_init()Pavel Tatashin1-0/+85
sparse_init() requires to temporary allocate two large buffers: usemap_map and map_map. Baoquan He has identified that these buffers are so large that Linux is not bootable on small memory machines, such as a kdump boot. The buffers are especially large when CONFIG_X86_5LEVEL is set, as they are scaled to the maximum physical memory size. Baoquan provided a fix, which reduces these sizes of these buffers, but it is much better to get rid of them entirely. Add a new way to initialize sparse memory: sparse_init_nid(), which only operates within one memory node, and thus allocates memory either in large contiguous block or allocates section by section. This eliminates the need for use of temporary buffers. For simplified bisecting and review temporarly call sparse_init() new_sparse_init(), the new interface is going to be enabled as well as old code removed in the next patch. Link: http://lkml.kernel.org/r/20180712203730.8703-5-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Oscar Salvador <osalvador@suse.de> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse: move buffer init/fini to the common placePavel Tatashin3-12/+7
Now that both variants of sparse memory use the same buffers to populate memory map, we can move sparse_buffer_init()/sparse_buffer_fini() to the common place. Link: http://lkml.kernel.org/r/20180712203730.8703-4-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Tested-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse: use the new sparse buffer functions in non-vmemmapPavel Tatashin1-27/+14
non-vmemmap sparse also allocated large contiguous chunk of memory, and if fails falls back to smaller allocations. Use the same functions to allocate buffer as the vmemmap-sparse Link: http://lkml.kernel.org/r/20180712203730.8703-3-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse: abstract sparse buffer allocationsPavel Tatashin3-35/+54
Patch series "sparse_init rewrite", v6. In sparse_init() we allocate two large buffers to temporary hold usemap and memmap for the whole machine. However, we can avoid doing that if we changed sparse_init() to operated on per-node bases instead of doing it on the whole machine beforehand. As shown by Baoquan http://lkml.kernel.org/r/20180628062857.29658-1-bhe@redhat.com The buffers are large enough to cause machine stop to boot on small memory systems. Another benefit of these changes is that they also obsolete CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER. This patch (of 5): When struct pages are allocated for sparse-vmemmap VA layout, we first try to allocate one large buffer, and than if that fails allocate struct pages for each section as we go. The code that allocates buffer is uses global variables and is spread across several call sites. Cleanup the code by introducing three functions to handle the global buffer: sparse_buffer_init() initialize the buffer sparse_buffer_fini() free the remaining part of the buffer sparse_buffer_alloc() alloc from the buffer, and if buffer is empty return NULL Define these functions in sparse.c instead of sparse-vmemmap.c because later we will use them for non-vmemmap sparse allocations as well. [akpm@linux-foundation.org: use PTR_ALIGN()] [akpm@linux-foundation.org: s/BUG_ON/WARN_ON/] Link: http://lkml.kernel.org/r/20180712203730.8703-2-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Reviewed-by: Oscar Salvador <osalvador@suse.de> Tested-by: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/hugetlb.c: don't zero 1GiB bootmem pagesCannon Matthews1-1/+2
When using 1GiB pages during early boot, use the new memblock_virt_alloc_try_nid_raw() to allocate memory without zeroing it. Zeroing out hundreds or thousands of GiB in a single core memset() call is very slow, and can make early boot last upwards of 20-30 minutes on multi TiB machines. The memory does not need to be zero'd as the hugetlb pages are always zero'd on page fault. Tested: Booted with ~3800 1G pages, and it booted successfully in roughly the same amount of time as with 0, as opposed to the 25+ minutes it would take before. Link: http://lkml.kernel.org/r/20180711213313.92481-1-cannonmatthews@google.com Signed-off-by: Cannon Matthews <cannonmatthews@google.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Peter Feiner <pfeiner@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm, page_alloc: double zone's batchsizeAaron Lu1-5/+4
To improve page allocator's performance for order-0 pages, each CPU has a Per-CPU-Pageset(PCP) per zone. Whenever an order-0 page is needed, PCP will be checked first before asking pages from Buddy. When PCP is used up, a batch of pages will be fetched from Buddy to improve performance and the size of batch can affect performance. zone's batch size gets doubled last time by commit ba56e91c9401("mm: page_alloc: increase size of per-cpu-pages") over ten years ago. Since then, CPU has envolved a lot and CPU's cache sizes also increased. Dave Hansen is concerned the current batch size doesn't fit well with modern hardware and suggested me to do two things: first, use a page allocator intensive benchmark, e.g. will-it-scale/page_fault1 to find out how performance changes with different batch sizes on various machines and then choose a new default batch size; second, see how this new batch size work with other workloads. In the first test, we saw performance gains on high-core-count systems and little to no effect on older systems with more modest core counts. In this phase's test data, two candidates: 63 and 127 are chosen. In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability and more will-it-scale sub-tests are tested to see how these two candidates work with these workloads and decides a new default according to their results. Most test results are flat. will-it-scale/page_fault2 process mode has 10%-18% performance increase on 4-sockets Skylake and Broadwell. vm-scalability/lru-file-mmap-read has 17%-47% performance increase for 4-sockets servers while for 2-sockets servers, it caused 3%-8% performance drop. Further analysis showed that, with a larger pcp->batch and thus larger pcp->high(the relationship of pcp->high=6 * pcp->batch is maintained in this patch), zone lock contention shifted to LRU add side lock contention and that caused performance drop. This performance drop might be mitigated by others' work on optimizing LRU lock. Another downside of increasing pcp->batch is, when PCP is used up and need to fetch a batch of pages from Buddy, since batch is increased, that time can be longer than before. My understanding is, this doesn't affect slowpath where direct reclaim and compaction dominates. For fastpath, throughput is a win(according to will-it-scale/page_fault1) but worst latency can be larger now. Overall, I think double the batch size from 31 to 63 is relatively safe and provide good performance boost for high-core-count systems. The two phase's test results are listed below(all tests are done with THP disabled). Phase one(will-it-scale/page_fault1) test results: Skylake-EX: increased batch size has a good effect on zone->lock contention, though LRU contention will rise at the same time and limited the final performance increase. batch score change zone_contention lru_contention total_contention 31 15345900 +0.00% 64% 8% 72% 53 17903847 +16.67% 32% 38% 70% 63 17992886 +17.25% 24% 45% 69% 73 18022825 +17.44% 10% 61% 71% 119 18023401 +17.45% 4% 66% 70% 127 18029012 +17.48% 3% 66% 69% 137 18036075 +17.53% 4% 66% 70% 165 18035964 +17.53% 2% 67% 69% 188 18101105 +17.95% 2% 67% 69% 223 18130951 +18.15% 2% 67% 69% 255 18118898 +18.07% 2% 67% 69% 267 18101559 +17.96% 2% 67% 69% 299 18160468 +18.34% 2% 68% 70% 320 18139845 +18.21% 2% 67% 69% 393 18160869 +18.34% 2% 68% 70% 424 18170999 +18.41% 2% 68% 70% 458 18144868 +18.24% 2% 68% 70% 467 18142366 +18.22% 2% 68% 70% 498 18154549 +18.30% 1% 68% 69% 511 18134525 +18.17% 1% 69% 70% Broadwell-EX: similar pattern as Skylake-EX. batch score change zone_contention lru_contention total_contention 31 16703983 +0.00% 67% 7% 74% 53 18195393 +8.93% 43% 28% 71% 63 18288885 +9.49% 38% 33% 71% 73 18344329 +9.82% 35% 37% 72% 119 18535529 +10.96% 24% 46% 70% 127 18513596 +10.83% 23% 48% 71% 137 18514327 +10.84% 23% 48% 71% 165 18511840 +10.82% 22% 49% 71% 188 18593478 +11.31% 17% 53% 70% 223 18601667 +11.36% 17% 52% 69% 255 18774825 +12.40% 12% 58% 70% 267 18754781 +12.28% 9% 60% 69% 299 18892265 +13.10% 7% 63% 70% 320 18873812 +12.99% 8% 62% 70% 393 18891174 +13.09% 6% 64% 70% 424 18975108 +13.60% 6% 64% 70% 458 18932364 +13.34% 8% 62% 70% 467 18960891 +13.51% 5% 65% 70% 498 18944526 +13.41% 5% 64% 69% 511 18960839 +13.51% 5% 64% 69% Skylake-EP: although increased batch reduced zone->lock contention, but the effect is not as good as EX: zone->lock contention is still as high as 20% with a very high batch value instead of 1% on Skylake-EX or 5% on Broadwell-EX. Also, total_contention actually decreased with a higher batch but that doesn't translate to performance increase. batch score change zone_contention lru_contention total_contention 31 9554867 +0.00% 66% 3% 69% 53 9855486 +3.15% 63% 3% 66% 63 9980145 +4.45% 62% 4% 66% 73 10092774 +5.63% 62% 5% 67% 119 10310061 +7.90% 45% 19% 64% 127 10342019 +8.24% 42% 19% 61% 137 10358182 +8.41% 42% 21% 63% 165 10397060 +8.81% 37% 24% 61% 188 10341808 +8.24% 34% 26% 60% 223 10349135 +8.31% 31% 27% 58% 255 10327189 +8.08% 28% 29% 57% 267 10344204 +8.26% 27% 29% 56% 299 10325043 +8.06% 25% 30% 55% 320 10310325 +7.91% 25% 31% 56% 393 10293274 +7.73% 21% 31% 52% 424 10311099 +7.91% 21% 32% 53% 458 10321375 +8.02% 21% 32% 53% 467 10303881 +7.84% 21% 32% 53% 498 10332462 +8.14% 20% 33% 53% 511 10325016 +8.06% 20% 32% 52% Broadwell-EP: zone->lock and lru lock had an agreement to make sure performance doesn't increase and they successfully managed to keep total contention at 70%. batch score change zone_contention lru_contention total_contention 31 10121178 +0.00% 19% 50% 69% 53 10142366 +0.21% 6% 63% 69% 63 10117984 -0.03% 11% 58% 69% 73 10123330 +0.02% 7% 63% 70% 119 10108791 -0.12% 2% 67% 69% 127 10166074 +0.44% 3% 66% 69% 137 10141574 +0.20% 3% 66% 69% 165 10154499 +0.33% 2% 68% 70% 188 10124921 +0.04% 2% 67% 69% 223 10137399 +0.16% 2% 67% 69% 255 10143289 +0.22% 0% 68% 68% 267 10123535 +0.02% 1% 68% 69% 299 10140952 +0.20% 0% 68% 68% 320 10163170 +0.41% 0% 68% 68% 393 10000633 -1.19% 0% 69% 69% 424 10087998 -0.33% 0% 69% 69% 458 10187116 +0.65% 0% 69% 69% 467 10146790 +0.25% 0% 69% 69% 498 10197958 +0.76% 0% 69% 69% 511 10152326 +0.31% 0% 69% 69% Haswell-EP: similar to Broadwell-EP. batch score change zone_contention lru_contention total_contention 31 10442205 +0.00% 14% 48% 62% 53 10442255 +0.00% 5% 57% 62% 63 10452059 +0.09% 6% 57% 63% 73 10482349 +0.38% 5% 59% 64% 119 10454644 +0.12% 3% 60% 63% 127 10431514 -0.10% 3% 59% 62% 137 10423785 -0.18% 3% 60% 63% 165 10481216 +0.37% 2% 61% 63% 188 10448755 +0.06% 2% 61% 63% 223 10467144 +0.24% 2% 61% 63% 255 10480215 +0.36% 2% 61% 63% 267 10484279 +0.40% 2% 61% 63% 299 10466450 +0.23% 2% 61% 63% 320 10452578 +0.10% 2% 61% 63% 393 10499678 +0.55% 1% 62% 63% 424 10481454 +0.38% 1% 62% 63% 458 10473562 +0.30% 1% 62% 63% 467 10484269 +0.40% 0% 62% 62% 498 10505599 +0.61% 0% 62% 62% 511 10483395 +0.39% 0% 62% 62% Westmere-EP: contention is pretty small so not interesting. Note too high a batch value could hurt performance. batch score change zone_contention lru_contention total_contention 31 4831523 +0.00% 2% 3% 5% 53 4834086 +0.05% 2% 4% 6% 63 4834262 +0.06% 2% 3% 5% 73 4832851 +0.03% 2% 4% 6% 119 4830534 -0.02% 1% 3% 4% 127 4827461 -0.08% 1% 4% 5% 137 4827459 -0.08% 1% 3% 4% 165 4820534 -0.23% 0% 4% 4% 188 4817947 -0.28% 0% 3% 3% 223 4809671 -0.45% 0% 3% 3% 255 4802463 -0.60% 0% 4% 4% 267 4801634 -0.62% 0% 3% 3% 299 4798047 -0.69% 0% 3% 3% 320 4793084 -0.80% 0% 3% 3% 393 4785877 -0.94% 0% 3% 3% 424 4782911 -1.01% 0% 3% 3% 458 4779346 -1.08% 0% 3% 3% 467 4780306 -1.06% 0% 3% 3% 498 4780589 -1.05% 0% 3% 3% 511 4773724 -1.20% 0% 3% 3% Skylake-Desktop: similar to Westmere-EP, nothing interesting. batch score change zone_contention lru_contention total_contention 31 3906608 +0.00% 2% 3% 5% 53 3940164 +0.86% 2% 3% 5% 63 3937289 +0.79% 2% 3% 5% 73 3940201 +0.86% 2% 3% 5% 119 3933240 +0.68% 2% 3% 5% 127 3930514 +0.61% 2% 4% 6% 137 3938639 +0.82% 0% 3% 3% 165 3908755 +0.05% 0% 3% 3% 188 3905621 -0.03% 0% 3% 3% 223 3903015 -0.09% 0% 4% 4% 255 3889480 -0.44% 0% 3% 3% 267 3891669 -0.38% 0% 4% 4% 299 3898728 -0.20% 0% 4% 4% 320 3894547 -0.31% 0% 4% 4% 393 3875137 -0.81% 0% 4% 4% 424 3874521 -0.82% 0% 3% 3% 458 3880432 -0.67% 0% 4% 4% 467 3888715 -0.46% 0% 3% 3% 498 3888633 -0.46% 0% 4% 4% 511 3875305 -0.80% 0% 5% 5% Haswell-Desktop: zone->lock is pretty low as other desktops, though lru contention is higher than other desktops. batch score change zone_contention lru_contention total_contention 31 3511158 +0.00% 2% 5% 7% 53 3555445 +1.26% 2% 6% 8% 63 3561082 +1.42% 2% 6% 8% 73 3547218 +1.03% 2% 6% 8% 119 3571319 +1.71% 1% 7% 8% 127 3549375 +1.09% 0% 6% 6% 137 3560233 +1.40% 0% 6% 6% 165 3555176 +1.25% 2% 6% 8% 188 3551501 +1.15% 0% 8% 8% 223 3531462 +0.58% 0% 7% 7% 255 3570400 +1.69% 0% 7% 7% 267 3532235 +0.60% 1% 8% 9% 299 3562326 +1.46% 0% 6% 6% 320 3553569 +1.21% 0% 8% 8% 393 3539519 +0.81% 0% 7% 7% 424 3549271 +1.09% 0% 8% 8% 458 3528885 +0.50% 0% 8% 8% 467 3526554 +0.44% 0% 7% 7% 498 3525302 +0.40% 0% 9% 9% 511 3527556 +0.47% 0% 8% 8% Sandybridge-Desktop: the 0% contention isn't accurate but caused by dropped fractional part. Since multiple contention path's contentions are all under 1% here, with some arithmetic operations like add, the final deviation could be as large as 3%. batch score change zone_contention lru_contention total_contention 31 1744495 +0.00% 0% 0% 0% 53 1755341 +0.62% 0% 0% 0% 63 1758469 +0.80% 0% 0% 0% 73 1759626 +0.87% 0% 0% 0% 119 1770417 +1.49% 0% 0% 0% 127 1768252 +1.36% 0% 0% 0% 137 1767848 +1.34% 0% 0% 0% 165 1765088 +1.18% 0% 0% 0% 188 1766918 +1.29% 0% 0% 0% 223 1767866 +1.34% 0% 0% 0% 255 1768074 +1.35% 0% 0% 0% 267 1763187 +1.07% 0% 0% 0% 299 1765620 +1.21% 0% 0% 0% 320 1767603 +1.32% 0% 0% 0% 393 1764612 +1.15% 0% 0% 0% 424 1758476 +0.80% 0% 0% 0% 458 1758593 +0.81% 0% 0% 0% 467 1757915 +0.77% 0% 0% 0% 498 1753363 +0.51% 0% 0% 0% 511 1755548 +0.63% 0% 0% 0% Phase two test results: Note: all percent change is against base(batch=31). ebizzy.throughput (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 2410037±7% 2600451±2% +7.9% 2602878 +8.0% lkp-bdw-ex1 1493328 1489243 -0.3% 1492145 -0.1% lkp-skl-2sp2 1329674 1345891 +1.2% 1351056 +1.6% lkp-bdw-ep2 711511 711511 0.0% 710708 -0.1% lkp-wsm-ep2 75750 75528 -0.3% 75441 -0.4% lkp-skl-d01 264126 262791 -0.5% 264113 +0.0% lkp-hsw-d01 176601 176328 -0.2% 176368 -0.1% lkp-sb02 98937 98937 +0.0% 99030 +0.1% kbuild.buildtime (less is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 107.00 107.67 +0.6% 107.11 +0.1% lkp-bdw-ex1 97.33 97.33 +0.0% 97.42 +0.1% lkp-skl-2sp2 180.00 179.83 -0.1% 179.83 -0.1% lkp-bdw-ep2 178.17 179.17 +0.6% 177.50 -0.4% lkp-wsm-ep2 737.00 738.00 +0.1% 738.00 +0.1% lkp-skl-d01 642.00 653.00 +1.7% 653.00 +1.7% lkp-hsw-d01 1310.00 1316.00 +0.5% 1311.00 +0.1% netperf/TCP_STREAM.Throughput_total_Mbps (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 948790 947144 -0.2% 948333 -0.0% lkp-bdw-ex1 904224 904366 +0.0% 904926 +0.1% lkp-skl-2sp2 239731 239607 -0.1% 239565 -0.1% lk-bdw-ep2 365764 365933 +0.0% 365951 +0.1% lkp-wsm-ep2 93736 93803 +0.1% 93808 +0.1% lkp-skl-d01 77314 77303 -0.0% 77375 +0.1% lkp-hsw-d01 58617 60387 +3.0% 60208 +2.7% lkp-sb02 29990 30137 +0.5% 30103 +0.4% oltp.transactions (higer is better) machine batch=31 batch=63 batch=127 lkp-bdw-ex1 9073276 9100377 +0.3% 9036344 -0.4% lkp-skl-2sp2 8898717 8852054 -0.5% 8894459 -0.0% lkp-bdw-ep2 13426155 13384654 -0.3% 13333637 -0.7% lkp-hsw-ep2 13146314 13232784 +0.7% 13193163 +0.4% lkp-wsm-ep2 5035355 5019348 -0.3% 5033418 -0.0% lkp-skl-d01 418485 4413339 -0.1% 4419039 +0.0% lkp-hsw-d01 3517817±5% 3396120±3% -3.5% 3455138±3% -1.8% pigz.throughput (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1.513e+08 1.507e+08 -0.4% 1.511e+08 -0.2% lkp-bdw-ex1 2.060e+08 2.052e+08 -0.4% 2.044e+08 -0.8% lkp-skl-2sp2 8.836e+08 8.845e+08 +0.1% 8.836e+08 -0.0% lkp-bdw-ep2 8.275e+08 8.464e+08 +2.3% 8.330e+08 +0.7% lkp-wsm-ep2 2.224e+08 2.221e+08 -0.2% 2.218e+08 -0.3% lkp-skl-d01 1.177e+08 1.177e+08 -0.0% 1.176e+08 -0.1% lkp-hsw-d01 1.154e+08 1.154e+08 +0.1% 1.154e+08 -0.0% lkp-sb02 0.633e+08 0.633e+08 +0.1% 0.633e+08 +0.0% will-it-scale.malloc1.processes (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 620181 620484 +0.0% 620240 +0.0% lkp-bdw-ex1 1403610 1401201 -0.2% 1417900 +1.0% lkp-skl-2sp2 1288097 1284145 -0.3% 1283907 -0.3% lkp-bdw-ep2 1427879 1427675 -0.0% 1428266 +0.0% lkp-hsw-ep2 1362546 1353965 -0.6% 1354759 -0.6% lkp-wsm-ep2 2099657 2107576 +0.4% 2100226 +0.0% lkp-skl-d01 1476835 1476358 -0.0% 1474487 -0.2% lkp-hsw-d01 1308810 1303429 -0.4% 1301299 -0.6% lkp-sb02 589286 589284 -0.0% 588101 -0.2% will-it-scale.malloc1.threads (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 21289 21125 -0.8% 21241 -0.2% lkp-bdw-ex1 28114 28089 -0.1% 28007 -0.4% lkp-skl-2sp2 91866 91946 +0.1% 92723 +0.9% lkp-bdw-ep2 37637 37501 -0.4% 37317 -0.9% lkp-hsw-ep2 43673 43590 -0.2% 43754 +0.2% lkp-wsm-ep2 28577 28298 -1.0% 28545 -0.1% lkp-skl-d01 175277 173343 -1.1% 173082 -1.3% lkp-hsw-d01 130303 129566 -0.6% 129250 -0.8% lkp-sb02 113742±3% 116911 +2.8% 116417±3% +2.4% will-it-scale.malloc2.processes (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1.206e+09 1.206e+09 -0.0% 1.206e+09 +0.0% lkp-bdw-ex1 1.319e+09 1.319e+09 -0.0% 1.319e+09 +0.0% lkp-skl-2sp2 8.000e+08 8.021e+08 +0.3% 7.995e+08 -0.1% lkp-bdw-ep2 6.582e+08 6.634e+08 +0.8% 6.513e+08 -1.1% lkp-hsw-ep2 6.671e+08 6.669e+08 -0.0% 6.665e+08 -0.1% lkp-wsm-ep2 1.805e+08 1.806e+08 +0.0% 1.804e+08 -0.1% lkp-skl-d01 1.611e+08 1.611e+08 -0.0% 1.610e+08 -0.0% lkp-hsw-d01 1.333e+08 1.332e+08 -0.0% 1.332e+08 -0.0% lkp-sb02 82485104 82478206 -0.0% 82473546 -0.0% will-it-scale.malloc2.threads (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1.574e+09 1.574e+09 -0.0% 1.574e+09 -0.0% lkp-bdw-ex1 1.737e+09 1.737e+09 +0.0% 1.737e+09 -0.0% lkp-skl-2sp2 9.161e+08 9.162e+08 +0.0% 9.181e+08 +0.2% lkp-bdw-ep2 7.856e+08 8.015e+08 +2.0% 8.113e+08 +3.3% lkp-hsw-ep2 6.908e+08 6.904e+08 -0.1% 6.907e+08 -0.0% lkp-wsm-ep2 2.409e+08 2.409e+08 +0.0% 2.409e+08 -0.0% lkp-skl-d01 1.199e+08 1.199e+08 -0.0% 1.199e+08 -0.0% lkp-hsw-d01 1.029e+08 1.029e+08 -0.0% 1.029e+08 +0.0% lkp-sb02 68081213 68061423 -0.0% 68076037 -0.0% will-it-scale.page_fault2.processes (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 14509125±4% 16472364 +13.5% 17123117 +18.0% lkp-bdw-ex1 14736381 16196588 +9.9% 16364011 +11.0% lkp-skl-2sp2 6354925 6435444 +1.3% 6436644 +1.3% lkp-bdw-ep2 8749584 8834422 +1.0% 8827179 +0.9% lkp-hsw-ep2 8762591 8845920 +1.0% 8825697 +0.7% lkp-wsm-ep2 3036083 3030428 -0.2% 3021741 -0.5% lkp-skl-d01 2307834 2304731 -0.1% 2286142 -0.9% lkp-hsw-d01 1806237 1800786 -0.3% 1795943 -0.6% lkp-sb02 842616 837844 -0.6% 833921 -1.0% will-it-scale.page_fault2.threads machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1623294 1615132±2% -0.5% 1656777 +2.1% lkp-bdw-ex1 1995714 2025948 +1.5% 2113753±3% +5.9% lkp-skl-2sp2 2346708 2415591 +2.9% 2416919 +3.0% lkp-bdw-ep2 2342564 2344882 +0.1% 2300206 -1.8% lkp-hsw-ep2 1820658 1831681 +0.6% 1844057 +1.3% lkp-wsm-ep2 1725482 1733774 +0.5% 1740517 +0.9% lkp-skl-d01 1832833 1823628 -0.5% 1806489 -1.4% lkp-hsw-d01 1427913 1427287 -0.0% 1420226 -0.5% lkp-sb02 750626 748615 -0.3% 746621 -0.5% will-it-scale.page_fault3.processes (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 24382726 24400317 +0.1% 24668774 +1.2% lkp-bdw-ex1 35399750 35683124 +0.8% 35829492 +1.2% lkp-skl-2sp2 28136820 28068248 -0.2% 28147989 +0.0% lkp-bdw-ep2 37269077 37459490 +0.5% 37373073 +0.3% lkp-hsw-ep2 36224967 36114085 -0.3% 36104908 -0.3% lkp-wsm-ep2 16820457 16911005 +0.5% 16968596 +0.9% lkp-skl-d01 7721138 7725904 +0.1% 7756740 +0.5% lkp-hsw-d01 7611979 7650928 +0.5% 7651323 +0.5% lkp-sb02 3781546 3796502 +0.4% 3796827 +0.4% will-it-scale.page_fault3.threads (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1865820±3% 1900917±2% +1.9% 1826245±4% -2.1% lkp-bdw-ex1 3094060 3148326 +1.8% 3150036 +1.8% lkp-skl-2sp2 3952940 3953898 +0.0% 3989360 +0.9% lkp-bdw-ep2 3420373±3% 3643964 +6.5% 3644910±5% +6.6% lkp-hsw-ep2 2609635±2% 2582310±3% -1.0% 2780459 +6.5% lkp-wsm-ep2 4395001 4417196 +0.5% 4432499 +0.9% lkp-skl-d01 5363977 5400003 +0.7% 5411370 +0.9% lkp-hsw-d01 5274131 5311294 +0.7% 5319359 +0.9% lkp-sb02 2917314 2913004 -0.1% 2935286 +0.6% will-it-scale.read1.processes (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 73762279±14% 69322519±10% -6.0% 69349855±13% -6.0% (result unstable) lkp-bdw-ex1 1.701e+08 1.704e+08 +0.1% 1.705e+08 +0.2% lkp-skl-2sp2 63111570 63113953 +0.0% 63836573 +1.1% lkp-bdw-ep2 79247409 79424610 +0.2% 78012656 -1.6% lkp-hsw-ep2 67677026 68308800 +0.9% 67539106 -0.2% lkp-wsm-ep2 13339630 13939817 +4.5% 13766865 +3.2% lkp-skl-d01 10969487 10972650 +0.0% no data lkp-hsw-d01 9857342±2% 10080592±2% +2.3% 10131560 +2.8% lkp-sb02 5189076 5197473 +0.2% 5163253 -0.5% will-it-scale.read1.threads (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 62468045±12% 73666726±7% +17.9% 79553123±12% +27.4% (result unstable) lkp-bdw-ex1 1.62e+08 1.624e+08 +0.3% 1.614e+08 -0.3% lkp-skl-2sp2 58319780 59181032 +1.5% 59821353 +2.6% lkp-bdw-ep2 74057992 75698171 +2.2% 74990869 +1.3% lkp-hsw-ep2 63672959 63639652 -0.1% 64387051 +1.1% lkp-wsm-ep2 13489943 13526058 +0.3% 13259032 -1.7% lkp-skl-d01 10297906 10338796 +0.4% 10407328 +1.1% lkp-hsw-d01 9636721 9667376 +0.3% 9341147 -3.1% lkp-sb02 4801938 4804496 +0.1% 4802290 +0.0% will-it-scale.write1.processes (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1.111e+08 1.104e+08±2% -0.7% 1.122e+08±2% +1.0% lkp-bdw-ex1 1.392e+08 1.399e+08 +0.5% 1.397e+08 +0.4% lkp-skl-2sp2 59369233 58994841 -0.6% 58715168 -1.1% lkp-bdw-ep2 61820979 CPU throttle 63593123 +2.9% lkp-hsw-ep2 57897587 57435605 -0.8% 56347450 -2.7% lkp-wsm-ep2 7814203 7918017±2% +1.3% 7669068 -1.9% lkp-skl-d01 8886557 8971422 +1.0% 8818366 -0.8% lkp-hsw-d01 9171001±5% 9189915 +0.2% 9483909 +3.4% lkp-sb02 4475406 4475294 -0.0% 4501756 +0.6% will-it-scale.write1.threads (higer is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1.058e+08 1.055e+08±2% -0.2% 1.065e+08 +0.7% lkp-bdw-ex1 1.316e+08 1.300e+08 -1.2% 1.308e+08 -0.6% lkp-skl-2sp2 54492421 56086678 +2.9% 55975657 +2.7% lkp-bdw-ep2 59360449 59003957 -0.6% 58101262 -2.1% lkp-hsw-ep2 53346346±2% 52530876 -1.5% 52902487 -0.8% lkp-wsm-ep2 7774006 7800092±2% +0.3% 7558833 -2.8% lkp-skl-d01 8346174 8235695 -1.3% no data lkp-hsw-d01 8636244 8655731 +0.2% 8658868 +0.3% lkp-sb02 4181820 4204107 +0.5% 4182992 +0.0% vm-scalability.anon-r-rand.throughput (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 11933873±3% 12356544±2% +3.5% 12188624 +2.1% lkp-bdw-ex1 7114424±2% 7330949±2% +3.0% 7392419 +3.9% lkp-skl-2sp2 6773277±5% 6492332±8% -4.1% 6543962 -3.4% lkp-bdw-ep2 7133846±4% 7233508 +1.4% 7013518±3% -1.7% lkp-hsw-ep2 4576626 4527098 -1.1% 4551679 -0.5% lkp-wsm-ep2 2583599 2592492 +0.3% 2588039 +0.2% lkp-hsw-d01 998199±2% 1028311 +3.0% 1006460±2% +0.8% lkp-sb02 570572 567854 -0.5% 568449 -0.4% vm-scalability.anon-r-rand-mt.throughput (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1789419 1787830 -0.1% 1788208 -0.1% lkp-bdw-ex1 3492595±2% 3554966±2% +1.8% 3558835±3% +1.9% lkp-skl-2sp2 3856238±2% 3975403±4% +3.1% 3994600 +3.6% lkp-bdw-ep2 3726963±11% 3809292±6% +2.2% 3871924±4% +3.9% lkp-hsw-ep2 2131760±3% 2033578±4% -4.6% 2130727±6% -0.0% lkp-wsm-ep2 2369731 2368384 -0.1% 2370252 +0.0% lkp-skl-d01 1207128 1206220 -0.1% 1205801 -0.1% lkp-hsw-d01 964317 992329±2% +2.9% 992099±2% +2.9% lkp-sb02 567137 567346 +0.0% 566144 -0.2% vm-scalability.lru-file-mmap-read.throughput (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 19560469±6% 23018999 +17.7% 23418800 +19.7% lkp-bdw-ex1 17769135±14% 26141676±3% +47.1% 26284723±5% +47.9% lkp-skl-2sp2 14056512 13578884 -3.4% 13146214 -6.5% lkp-bdw-ep2 15336542 14737654 -3.9% 14088159 -8.1% lkp-hsw-ep2 16275498 15756296 -3.2% 15018090 -7.7% lkp-wsm-ep2 11272160 11237231 -0.3% 11310047 +0.3% lkp-skl-d01 7322119 7324569 +0.0% 7184148 -1.9% lkp-hsw-d01 6449234 6404542 -0.7% 6356141 -1.4% lkp-sb02 3517943 3520668 +0.1% 3527309 +0.3% vm-scalability.lru-file-mmap-read-rand.throughput (higher is better) machine batch=31 batch=63 batch=127 lkp-skl-4sp1 1689052 1697553 +0.5% 1698726 +0.6% lkp-bdw-ex1 1675246 1699764 +1.5% 1712226 +2.2% lkp-skl-2sp2 1800533 1799749 -0.0% 1800581 +0.0% lkp-bdw-ep2 1807422 1807758 +0.0% 1804932 -0.1% lkp-hsw-ep2 1809807 1808781 -0.1% 1807811 -0.1% lkp-wsm-ep2 1800198 1802434 +0.1% 1801236 +0.1% lkp-skl-d01 696689 695537 -0.2% 694106 -0.4% lkp-hsw-d01 698364 698666 +0.0% 696686 -0.2% lkp-sb02 258939 258787 -0.1% 258199 -0.3% Link: http://lkml.kernel.org/r/20180711055855.29072-1-aaron.lu@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Suggested-by: Dave Hansen <dave.hansen@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Kemi Wang <kemi.wang@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/oom_kill.c: document oom_lockMichal Hocko1-0/+8
Add comments describing oom_lock's scope. Requested-by: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/20180711120121.25635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/hugetlb: remove gigantic page support for HIGHMEMMike Kravetz2-11/+1
This reverts ee8f248d266e ("hugetlb: add phys addr to struct huge_bootmem_page"). At one time powerpc used this field and supporting code. However that was removed with commit 79cc38ded1e1 ("powerpc/mm/hugetlb: Add support for reserving gigantic huge pages via kernel command line"). There are no users of this field and supporting code, so remove it. Link: http://lkml.kernel.org/r/20180711195913.1294-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Cannon Matthews <cannonmatthews@google.com> Cc: Becky Bruce <beckyb@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm, oom: remove sleep from under oom_lockMichal Hocko1-7/+1
Tetsuo has pointed out that since 27ae357fa82b ("mm, oom: fix concurrent munlock and oom reaper unmap, v3") we have a strong synchronization between the oom_killer and victim's exiting because both have to take the oom_lock. Therefore the original heuristic to sleep for a short time in out_of_memory doesn't serve the original purpose. Moreover Tetsuo has noticed that the short sleep can be more harmful than actually useful. Hammering the system with many processes can lead to a starvation when the task holding the oom_lock can block for a long time (minutes) and block any further progress because the oom_reaper depends on the oom_lock as well. Drop the short sleep from out_of_memory when we hold the lock. Keep the sleep when the trylock fails to throttle the concurrent OOM paths a bit. This should be solved in a more reasonable way (e.g. sleep proportional to the time spent in the active reclaiming etc.) but this is much more complex thing to achieve. This is a quick fixup to remove a stale code. Link: http://lkml.kernel.org/r/20180709074706.30635-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17kernel/dma: remove unsupported gfp_mask parameter from ↵Marek Szyprowski8-14/+16
dma_alloc_from_contiguous() The CMA memory allocator doesn't support standard gfp flags for memory allocation, so there is no point having it as a parameter for dma_alloc_from_contiguous() function. Replace it by a boolean no_warn argument, which covers all the underlaying cma_alloc() function supports. This will help to avoid giving false feeling that this function supports standard gfp flags and callers can pass __GFP_ZERO to get zeroed buffer, what has already been an issue: see commit dd65a941f6ba ("arm64: dma-mapping: clear buffers allocated with FORCE_CONTIGUOUS flag"). Link: http://lkml.kernel.org/r/20180709122020eucas1p21a71b092975cb4a3b9954ffc63f699d1~-sqUFoa-h2939329393eucas1p2Y@eucas1p2.samsung.com Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Acked-by: Michał Nazarewicz <mina86@mina86.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Laura Abbott <labbott@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/cma: remove unsupported gfp_mask parameter from cma_alloc()Marek Szyprowski7-10/+11
cma_alloc() doesn't really support gfp flags other than __GFP_NOWARN, so convert gfp_mask parameter to boolean no_warn parameter. This will help to avoid giving false feeling that this function supports standard gfp flags and callers can pass __GFP_ZERO to get zeroed buffer, what has already been an issue: see commit dd65a941f6ba ("arm64: dma-mapping: clear buffers allocated with FORCE_CONTIGUOUS flag"). Link: http://lkml.kernel.org/r/20180709122019eucas1p2340da484acfcc932537e6014f4fd2c29~-sqTPJKij2939229392eucas1p2j@eucas1p2.samsung.com Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Michał Nazarewicz <mina86@mina86.com> Acked-by: Laura Abbott <labbott@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17Revert "mm: always flush VMA ranges affected by zap_page_range"Rik van Riel1-13/+1
There was a bug in Linux that could cause madvise (and mprotect?) system calls to return to userspace without the TLB having been flushed for all the pages involved. This could happen when multiple threads of a process made simultaneous madvise and/or mprotect calls. This was noticed in the summer of 2017, at which time two solutions were created: 56236a59556c ("mm: refactor TLB gathering API") 99baac21e458 ("mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem") and 4647706ebeee ("mm: always flush VMA ranges affected by zap_page_range") We need only one of these solutions, and the former appears to be a little more efficient than the latter, so revert that one. This reverts 4647706ebeee6e50 ("mm: always flush VMA ranges affected by zap_page_range") Link: http://lkml.kernel.org/r/20180706131019.51e3a5f0@imladris.surriel.com Signed-off-by: Rik van Riel <riel@surriel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse: optimize memmap allocation during sparse_init()Baoquan He2-11/+41
In sparse_init(), two temporary pointer arrays, usemap_map and map_map are allocated with the size of NR_MEM_SECTIONS. They are used to store each memory section's usemap and mem map if marked as present. With the help of these two arrays, continuous memory chunk is allocated for usemap and memmap for memory sections on one node. This avoids too many memory fragmentations. Like below diagram, '1' indicates the present memory section, '0' means absent one. The number 'n' could be much smaller than NR_MEM_SECTIONS on most of systems. |1|1|1|1|0|0|0|0|1|1|0|0|...|1|0||1|0|...|1||0|1|...|0| ------------------------------------------------------- 0 1 2 3 4 5 i i+1 n-1 n If we fail to populate the page tables to map one section's memmap, its ->section_mem_map will be cleared finally to indicate that it's not present. After use, these two arrays will be released at the end of sparse_init(). In 4-level paging mode, each array costs 4M which can be ignorable. While in 5-level paging, they costs 256M each, 512M altogether. Kdump kernel Usually only reserves very few memory, e.g 256M. So, even thouth they are temporarily allocated, still not acceptable. In fact, there's no need to allocate them with the size of NR_MEM_SECTIONS. Since the ->section_mem_map clearing has been deferred to the last, the number of present memory sections are kept the same during sparse_init() until we finally clear out the memory section's ->section_mem_map if its usemap or memmap is not correctly handled. Thus in the middle whenever for_each_present_section_nr() loop is taken, the i-th present memory section is always the same one. Here only allocate usemap_map and map_map with the size of 'nr_present_sections'. For the i-th present memory section, install its usemap and memmap to usemap_map[i] and mam_map[i] during allocation. Then in the last for_each_present_section_nr() loop which clears the failed memory section's ->section_mem_map, fetch usemap and memmap from usemap_map[] and map_map[] array and set them into mem_section[] accordingly. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20180628062857.29658-5-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Oscar Salvador <osalvador@techadventures.net> Cc: Pankaj Gupta <pagupta@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse.c: add a new parameter 'data_unit_size' for alloc_usemap_and_memmapBaoquan He1-3/+7
It's used to pass the size of map data unit into alloc_usemap_and_memmap, and is preparation for next patch. Link: http://lkml.kernel.org/r/20180228032657.32385-4-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparsemem.c: defer the ms->section_mem_map clearingBaoquan He2-8/+8
In sparse_init(), if CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y, system will allocate one continuous memory chunk for mem maps on one node and populate the relevant page tables to map memory section one by one. If fail to populate for a certain mem section, print warning and its ->section_mem_map will be cleared to cancel the marking of being present. Like this, the number of mem sections marked as present could become less during sparse_init() execution. Here just defer the ms->section_mem_map clearing if failed to populate its page tables until the last for_each_present_section_nr() loop. This is in preparation for later optimizing the mem map allocation. [akpm@linux-foundation.org: remove now-unused local `ms', per Oscar] Link: http://lkml.kernel.org/r/20180228032657.32385-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Pankaj Gupta <pagupta@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse.c: add a static variable nr_present_sectionsBaoquan He1-0/+7
Patch series "mm/sparse: Optimize memmap allocation during sparse_init()", v6. In sparse_init(), two temporary pointer arrays, usemap_map and map_map are allocated with the size of NR_MEM_SECTIONS. They are used to store each memory section's usemap and mem map if marked as present. In 5-level paging mode, this will cost 512M memory though they will be released at the end of sparse_init(). System with few memory, like kdump kernel which usually only has about 256M, will fail to boot because of allocation failure if CONFIG_X86_5LEVEL=y. In this patchset, optimize the memmap allocation code to only use usemap_map and map_map with the size of nr_present_sections. This makes kdump kernel boot up with normal crashkernel='' setting when CONFIG_X86_5LEVEL=y. This patch (of 5): nr_present_sections is used to record how many memory sections are marked as present during system boot up, and will be used in the later patch. Link: http://lkml.kernel.org/r/20180228032657.32385-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Dave Hansen <dave.hansen@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Pankaj Gupta <pagupta@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm: use special value SHRINKER_REGISTERING instead of list_empty() checkKirill Tkhai1-22/+21
The patch introduces a special value SHRINKER_REGISTERING to use instead of list_empty() to differ a registering shrinker from unregistered shrinker. Why we need that at all? Shrinker registration is split in two parts. The first one is prealloc_shrinker(), which allocates shrinker memory and reserves ID in shrinker_idr. This function can fail. The second is register_shrinker_prepared(), and it finalizes the registration. This function actually makes shrinker available to be used from shrink_slab(), and it can't fail. One shrinker may be based on more then one LRU lists. So, we never clear the bit in memcg shrinker maps, when (one of) corresponding LRU list becomes empty, since other LRU lists may be not empty. See superblock shrinker for example: it is based on two LRU lists: s_inode_lru and s_dentry_lru. We do not want to clear shrinker bit, when there are no inodes in s_inode_lru, as s_dentry_lru may contain dentries. Instead of that, we use special algorithm to detect shrinkers having no elements at all its LRU lists, and this is made in shrink_slab_memcg(). See the comment in this function for the details. Also, in shrink_slab_memcg() we clear shrinker bit in the map, when we meet unregistered shrinker (bit is set, while there is no a shrinker in IDR). Otherwise, we would have done that at the moment of shrinker unregistration for all memcgs (and this looks worse, since iteration over all memcg may take much time). Also this would have imposed restrictions on shrinker unregistration order for its users: they would have had to guarantee, there are no new elements after unregister_shrinker() (otherwise, a new added element would have set a bit). So, if we meet a set bit in map and no shrinker in IDR when we're iterating over the map in shrink_slab_memcg(), this means the corresponding shrinker is unregistered, and we must clear the bit. Another case is shrinker registration. We want two things there: 1) do_shrink_slab() can be called only for completely registered shrinkers; 2) shrinker internal lists may be populated in any order with register_shrinker_prepared() (let's talk on the example with sb). Both of: a)list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu0] memcg_set_shrinker_bit(); [cpu0] ... register_shrinker_prepared(); [cpu1] and b)register_shrinker_prepared(); [cpu0] ... list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru); [cpu1] memcg_set_shrinker_bit(); [cpu1] are legitimate. We don't want to impose restriction here and to force people to use only (b) variant. We don't want to force people to care, there is no elements in LRU lists before the shrinker is completely registered. Internal users of LRU lists and shrinker code are two different subsystems, and they have to be closed in themselves each other. In (a) case we have the bit set before shrinker is completely registered. We don't want do_shrink_slab() is called at this moment, so we have to detect such the registering shrinkers. Before this patch list_empty() (shrinker is not linked to the list) check was used for that. So, in (a) there could be a bit set, but we don't call do_shrink_slab() unless shrinker is linked to the list. It's just an indicator, I just overloaded linking to the list. This was not the best solution, since it's better not to touch the shrinker memory from shrink_slab_memcg() before it's completely registered (this also will be useful in the future to make shrink_slab() completely lockless). So, this patch introduces better way to detect registering shrinker, which allows not to dereference shrinker memory. It's just a ~0UL value, which we insert into the IDR during ID allocation. After shrinker is ready to be used, we insert actual shrinker pointer in the IDR, and it becomes available to shrink_slab_memcg(). We can't use NULL instead of this new value for this purpose as: shrink_slab_memcg() already uses NULL to detect unregistered shrinkers, and we don't want the function sees NULL and clears the bit, otherwise (a) won't work. This is the only thing the patch makes: the better way to detect registering shrinker. Nothing else this patch makes. Also this gives a better assembler, but it's minor side of the patch: Before: callq <idr_find> mov %rax,%r15 test %rax,%rax je <shrink_slab_memcg+0x1d5> mov 0x20(%rax),%rax lea 0x20(%r15),%rdx cmp %rax,%rdx je <shrink_slab_memcg+0xbd> mov 0x8(%rsp),%edx mov %r15,%rsi lea 0x10(%rsp),%rdi callq <do_shrink_slab> After: callq <idr_find> mov %rax,%r15 lea -0x1(%rax),%rax cmp $0xfffffffffffffffd,%rax ja <shrink_slab_memcg+0x1cd> mov 0x8(%rsp),%edx mov %r15,%rsi lea 0x10(%rsp),%rdi callq ffffffff810cefd0 <do_shrink_slab> [ktkhai@virtuozzo.com: add #ifdef CONFIG_MEMCG_KMEM around idr_replace()] Link: http://lkml.kernel.org/r/758b8fec-7573-47eb-b26a-7b2847ae7b8c@virtuozzo.com Link: http://lkml.kernel.org/r/153355467546.11522.4518015068123480218.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Matthew Wilcox <willy@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/vmscan.c: move check for SHRINKER_NUMA_AWARE to do_shrink_slab()Kirill Tkhai1-3/+3
In case of shrink_slab_memcg() we do not zero nid, when shrinker is not numa-aware. This is not a real problem, since currently all memcg-aware shrinkers are numa-aware too (we have two: super_block shrinker and workingset shrinker), but something may change in the future. Link: http://lkml.kernel.org/r/153320759911.18959.8842396230157677671.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Matthew Wilcox <willy@infradead.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/vmscan.c: clear shrinker bit if there are no objects related to memcgKirill Tkhai2-2/+26
To avoid further unneed calls of do_shrink_slab() for shrinkers, which already do not have any charged objects in a memcg, their bits have to be cleared. This patch introduces a lockless mechanism to do that without races without parallel list lru add. After do_shrink_slab() returns SHRINK_EMPTY the first time, we clear the bit and call it once again. Then we restore the bit, if the new return value is different. Note, that single smp_mb__after_atomic() in shrink_slab_memcg() covers two situations: 1)list_lru_add() shrink_slab_memcg list_add_tail() for_each_set_bit() <--- read bit do_shrink_slab() <--- missed list update (no barrier) <MB> <MB> set_bit() do_shrink_slab() <--- seen list update This situation, when the first do_shrink_slab() sees set bit, but it doesn't see list update (i.e., race with the first element queueing), is rare. So we don't add <MB> before the first call of do_shrink_slab() instead of this to do not slow down generic case. Also, it's need the second call as seen in below in (2). 2)list_lru_add() shrink_slab_memcg() list_add_tail() ... set_bit() ... ... for_each_set_bit() do_shrink_slab() do_shrink_slab() clear_bit() ... ... ... list_lru_add() ... list_add_tail() clear_bit() <MB> <MB> set_bit() do_shrink_slab() The barriers guarantee that the second do_shrink_slab() in the right side task sees list update if really cleared the bit. This case is drawn in the code comment. [Results/performance of the patchset] After the whole patchset applied the below test shows signify increase of performance: $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy $mkdir /sys/fs/cgroup/memory/ct $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i; echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs; mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file; done Then, 5 sequential calls of drop caches: $time echo 3 > /proc/sys/vm/drop_caches 1)Before: 0.00user 13.78system 0:13.78elapsed 99%CPU 0.00user 5.59system 0:05.60elapsed 99%CPU 0.00user 5.48system 0:05.48elapsed 99%CPU 0.00user 8.35system 0:08.35elapsed 99%CPU 0.00user 8.34system 0:08.35elapsed 99%CPU 2)After 0.00user 1.10system 0:01.10elapsed 99%CPU 0.00user 0.00system 0:00.01elapsed 64%CPU 0.00user 0.01system 0:00.01elapsed 82%CPU 0.00user 0.00system 0:00.01elapsed 64%CPU 0.00user 0.01system 0:00.01elapsed 82%CPU The results show the performance increases at least in 548 times. Shakeel Butt tested this patchset with fork-bomb on his configuration: > I created 255 memcgs, 255 ext4 mounts and made each memcg create a > file containing few KiBs on corresponding mount. Then in a separate > memcg of 200 MiB limit ran a fork-bomb. > > I ran the "perf record -ag -- sleep 60" and below are the results: > > Without the patch series: > Samples: 4M of event 'cycles', Event count (approx.): 3279403076005 > + 36.40% fb.sh [kernel.kallsyms] [k] shrink_slab > + 18.97% fb.sh [kernel.kallsyms] [k] list_lru_count_one > + 6.75% fb.sh [kernel.kallsyms] [k] super_cache_count > + 0.49% fb.sh [kernel.kallsyms] [k] down_read_trylock > + 0.44% fb.sh [kernel.kallsyms] [k] mem_cgroup_iter > + 0.27% fb.sh [kernel.kallsyms] [k] up_read > + 0.21% fb.sh [kernel.kallsyms] [k] osq_lock > + 0.13% fb.sh [kernel.kallsyms] [k] shmem_unused_huge_count > + 0.08% fb.sh [kernel.kallsyms] [k] shrink_node_memcg > + 0.08% fb.sh [kernel.kallsyms] [k] shrink_node > > With the patch series: > Samples: 4M of event 'cycles', Event count (approx.): 2756866824946 > + 47.49% fb.sh [kernel.kallsyms] [k] down_read_trylock > + 30.72% fb.sh [kernel.kallsyms] [k] up_read > + 9.51% fb.sh [kernel.kallsyms] [k] mem_cgroup_iter > + 1.69% fb.sh [kernel.kallsyms] [k] shrink_node_memcg > + 1.35% fb.sh [kernel.kallsyms] [k] mem_cgroup_protected > + 1.05% fb.sh [kernel.kallsyms] [k] queued_spin_lock_slowpath > + 0.85% fb.sh [kernel.kallsyms] [k] _raw_spin_lock > + 0.78% fb.sh [kernel.kallsyms] [k] lruvec_lru_size > + 0.57% fb.sh [kernel.kallsyms] [k] shrink_node > + 0.54% fb.sh [kernel.kallsyms] [k] queue_work_on > + 0.46% fb.sh [kernel.kallsyms] [k] shrink_slab_memcg [ktkhai@virtuozzo.com: v9] Link: http://lkml.kernel.org/r/153112561772.4097.11011071937553113003.stgit@localhost.localdomain Link: http://lkml.kernel.org/r/153063070859.1818.11870882950920963480.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm: add SHRINK_EMPTY shrinker methods return valueKirill Tkhai4-5/+20
We need to distinguish the situations when shrinker has very small amount of objects (see vfs_pressure_ratio() called from super_cache_count()), and when it has no objects at all. Currently, in the both of these cases, shrinker::count_objects() returns 0. The patch introduces new SHRINK_EMPTY return value, which will be used for "no objects at all" case. It's is a refactoring mostly, as SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in this patch, and all the magic will happen in further. Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/vmscan.c: generalize shrink_slab() calls in shrink_node()Vladimir Davydov1-15/+6
The patch makes shrink_slab() be called for root_mem_cgroup in the same way as it's called for the rest of cgroups. This simplifies the logic and improves the readability. [ktkhai@virtuozzo.com: wrote changelog] Link: http://lkml.kernel.org/r/153063068338.1818.11496084754797453962.stgit@localhost.localdomain Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/vmscan.c: iterate only over charged shrinkers during memcg shrink_slab()Kirill Tkhai1-9/+75
Using the preparations made in previous patches, in case of memcg shrink, we may avoid shrinkers, which are not set in memcg's shrinkers bitmap. To do that, we separate iterations over memcg-aware and !memcg-aware shrinkers, and memcg-aware shrinkers are chosen via for_each_set_bit() from the bitmap. In case of big nodes, having many isolated environments, this gives significant performance growth. See next patches for the details. Note that the patch does not respect to empty memcg shrinkers, since we never clear the bitmap bits after we set it once. Their shrinkers will be called again, with no shrinked objects as result. This functionality is provided by next patches. [ktkhai@virtuozzo.com: v9] Link: http://lkml.kernel.org/r/153112558507.4097.12713813335683345488.stgit@localhost.localdomain Link: http://lkml.kernel.org/r/153063066653.1818.976035462801487910.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item ↵Kirill Tkhai3-2/+37
appearance Introduce set_shrinker_bit() function to set shrinker-related bit in memcg shrinker bitmap, and set the bit after the first item is added and in case of reparenting destroyed memcg's items. This will allow next patch to make shrinkers be called only, in case of they have charged objects at the moment, and to improve shrink_slab() performance. [ktkhai@virtuozzo.com: v9] Link: http://lkml.kernel.org/r/153112557572.4097.17315791419810749985.stgit@localhost.localdomain Link: http://lkml.kernel.org/r/153063065671.1818.15914674956134687268.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/memcontrol.c: export mem_cgroup_is_root()Kirill Tkhai2-5/+10
This will be used in next patch. Link: http://lkml.kernel.org/r/153063064347.1818.1987011484100392706.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/list_lru.c: pass lru argument to memcg_drain_list_lru_node()Kirill Tkhai1-2/+3
This is just refactoring to allow next patches to have lru pointer in memcg_drain_list_lru_node(). Link: http://lkml.kernel.org/r/153063063164.1818.55009531386089350.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/list_lru: pass dst_memcg argument to memcg_drain_list_lru_node()Kirill Tkhai3-7/+8
This is just refactoring to allow the next patches to have dst_memcg pointer in memcg_drain_list_lru_node(). Link: http://lkml.kernel.org/r/153063062118.1818.2761273817739499749.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/list_lru.c: add memcg argument to list_lru_from_kmem()Kirill Tkhai1-8/+17
This is just refactoring to allow the next patches to have memcg pointer in list_lru_from_kmem(). Link: http://lkml.kernel.org/r/153063060664.1818.9541345386733498582.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17fs: propagate shrinker::id to list_lruKirill Tkhai4-9/+23
Add list_lru::shrinker_id field and populate it by registered shrinker id. This will be used to set correct bit in memcg shrinkers map by lru code in next patches, after there appeared the first related to memcg element in list_lru. Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17fs/super.c: refactor alloc_super()Kirill Tkhai1-4/+4
Do two list_lru_init_memcg() calls after prealloc_super(). destroy_unused_super() in fail path is OK with this. Next patch needs such the order. Link: http://lkml.kernel.org/r/153063058712.1818.3382490999719078571.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/workingset.c: refactor workingset_init()Kirill Tkhai1-3/+4
Use prealloc_shrinker()/register_shrinker_prepared() instead of register_shrinker(). This will be used in next patch. [ktkhai@virtuozzo.com: v9] Link: http://lkml.kernel.org/r/153112550112.4097.16606173020912323761.stgit@localhost.localdomain Link: http://lkml.kernel.org/r/153063057666.1818.17625951186610808734.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm, memcg: assign memcg-aware shrinkers bitmap to memcgKirill Tkhai3-1/+145
Imagine a big node with many cpus, memory cgroups and containers. Let we have 200 containers, every container has 10 mounts, and 10 cgroups. All container tasks don't touch foreign containers mounts. If there is intensive pages write, and global reclaim happens, a writing task has to iterate over all memcgs to shrink slab, before it's able to go to shrink_page_list(). Iteration over all the memcg slabs is very expensive: the task has to visit 200 * 10 = 2000 shrinkers for every memcg, and since there are 2000 memcgs, the total calls are 2000 * 2000 = 4000000. So, the shrinker makes 4 million do_shrink_slab() calls just to try to isolate SWAP_CLUSTER_MAX pages in one of the actively writing memcg via shrink_page_list(). I've observed a node spending almost 100% in kernel, making useless iteration over already shrinked slab. This patch adds bitmap of memcg-aware shrinkers to memcg. The size of the bitmap depends on bitmap_nr_ids, and during memcg life it's maintained to be enough to fit bitmap_nr_ids shrinkers. Every bit in the map is related to corresponding shrinker id. Next patches will maintain set bit only for really charged memcg. This will allow shrink_slab() to increase its performance in significant way. See the last patch for the numbers. [ktkhai@virtuozzo.com: v9] Link: http://lkml.kernel.org/r/153112549031.4097.3576147070498769979.stgit@localhost.localdomain [ktkhai@virtuozzo.com: add comment to mem_cgroup_css_online()] Link: http://lkml.kernel.org/r/521f9e5f-c436-b388-fe83-4dc870bfb489@virtuozzo.com Link: http://lkml.kernel.org/r/153063056619.1818.12550500883688681076.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/memcontrol.c: move up for_each_mem_cgroup{, _tree} definesKirill Tkhai1-15/+15
Next patch requires these defines are above their current position, so here they are moved to declarations. Link: http://lkml.kernel.org/r/153063055665.1818.5200425793649695598.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm: assign id to every memcg-aware shrinkerKirill Tkhai2-0/+67
Introduce shrinker::id number, which is used to enumerate memcg-aware shrinkers. The number start from 0, and the code tries to maintain it as small as possible. This will be used to represent a memcg-aware shrinkers in memcg shrinkers map. Since all memcg-aware shrinkers are based on list_lru, which is per-memcg in case of !CONFIG_MEMCG_KMEM only, the new functionality will be under this config option. [ktkhai@virtuozzo.com: v9] Link: http://lkml.kernel.org/r/153112546435.4097.10607140323811756557.stgit@localhost.localdomain Link: http://lkml.kernel.org/r/153063054586.1818.6041047871606697364.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm: introduce CONFIG_MEMCG_KMEM as combination of CONFIG_MEMCG && !CONFIG_SLOBKirill Tkhai9-26/+31
Introduce new config option, which is used to replace repeating CONFIG_MEMCG && !CONFIG_SLOB pattern. Next patches add a little more memcg+kmem related code, so let's keep the defines more clearly. Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <linux@roeck-us.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <jbacik@fb.com> Cc: Li RongQing <lirongqing@baidu.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Roman Gushchin <guro@fb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/list_lru.c: combine code under the same defineKirill Tkhai1-10/+8
Patch series "Improve shrink_slab() scalability (old complexity was O(n^2), new is O(n))", v8. This patcheset solves the problem with slow shrink_slab() occuring on the machines having many shrinkers and memory cgroups (i.e., with many containers). The problem is complexity of shrink_slab() is O(n^2) and it grows too fast with the growth of containers numbers. Let us have 200 containers, and every container has 10 mounts and 10 cgroups. All container tasks are isolated, and they don't touch foreign containers mounts. In case of global reclaim, a task has to iterate all over the memcgs and to call all the memcg-aware shrinkers for all of them. This means, the task has to visit 200 * 10 = 2000 shrinkers for every memcg, and since there are 2000 memcgs, the total calls of do_shrink_slab() are 2000 * 2000 = 4000000. 4 million calls are not a number operations, which can takes 1 cpu cycle. E.g., super_cache_count() accesses at least two lists, and makes arifmetical calculations. Even, if there are no charged objects, we do these calculations, and replaces cpu caches by read memory. I observed nodes spending almost 100% time in kernel, in case of intensive writing and global reclaim. The writer consumes pages fast, but it's need to shrink_slab() before the reclaimer reached shrink pages function (and frees SWAP_CLUSTER_MAX pages). Even if there is no writing, the iterations just waste the time, and slows reclaim down. Let's see the small test below: $echo 1 > /sys/fs/cgroup/memory/memory.use_hierarchy $mkdir /sys/fs/cgroup/memory/ct $echo 4000M > /sys/fs/cgroup/memory/ct/memory.kmem.limit_in_bytes $for i in `seq 0 4000`; do mkdir /sys/fs/cgroup/memory/ct/$i; echo $$ > /sys/fs/cgroup/memory/ct/$i/cgroup.procs; mkdir -p s/$i; mount -t tmpfs $i s/$i; touch s/$i/file; done Then, let's see drop caches time (5 sequential calls): $time echo 3 > /proc/sys/vm/drop_caches 0.00user 13.78system 0:13.78elapsed 99%CPU 0.00user 5.59system 0:05.60elapsed 99%CPU 0.00user 5.48system 0:05.48elapsed 99%CPU 0.00user 8.35system 0:08.35elapsed 99%CPU 0.00user 8.34system 0:08.35elapsed 99%CPU The last four calls don't actually shrink anything. So, the iterations over slab shrinkers take 5.48 seconds. Not so good for scalability. The patchset solves the problem by making shrink_slab() of O(n) complexity. There are following functional actions: 1) Assign id to every registered memcg-aware shrinker. 2) Maintain per-memcgroup bitmap of memcg-aware shrinkers, and set a shrinker-related bit after the first element is added to lru list (also, when removed child memcg elements are reparanted). 3) Split memcg-aware shrinkers and !memcg-aware shrinkers, and call a shrinker if its bit is set in memcg's shrinker bitmap. (Also, there is a functionality to clear the bit, after last element is shrinked). This gives significant performance increase. The result after patchset is applied: $time echo 3 > /proc/sys/vm/drop_caches 0.00user 1.10system 0:01.10elapsed 99%CPU 0.00user 0.00system 0:00.01elapsed 64%CPU 0.00user 0.01system 0:00.01elapsed 82%CPU 0.00user 0.00system 0:00.01elapsed 64%CPU 0.00user 0.01system 0:00.01elapsed 82%CPU The results show the performance increases at least in 548 times. So, the patchset makes shrink_slab() of less complexity and improves the performance in such types of load I pointed. This will give a profit in case of !global reclaim case, since there also will be less do_shrink_slab() calls. This patch (of 17): These two pairs of blocks of code are under the same #ifdef #else #endif. Link: http://lkml.kernel.org/r/153063052519.1818.9393587113056959488.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Tested-by: Shakeel Butt <shakeelb@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Sahitya Tummala <stummala@codeaurora.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Roman Gushchin <guro@fb.com> Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Waiman Long <longman@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Josef Bacik <jbacik@fb.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Li RongQing <lirongqing@baidu.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/memblock.c: replace u64 with phys_addr_t where appropriateMike Rapoport1-23/+23
Most functions in memblock already use phys_addr_t to represent a physical address with __memblock_free_late() being an exception. This patch replaces u64 with phys_addr_t in __memblock_free_late() and switches several format strings from %llx to %pa to avoid casting from phys_addr_t to u64. Link: http://lkml.kernel.org/r/1530637506-1256-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/sparse.c: make sparse_init_one_section void and remove checkOscar Salvador1-9/+4
sparse_init_one_section() is being called from two sites: sparse_init() and sparse_add_one_section(). The former calls it from a for_each_present_section_nr() loop, and the latter marks the section as present before calling it. This means that when sparse_init_one_section() gets called, we already know that the section is present. So there is no point to double check that in the function. This removes the check and makes the function void. [ross.zwisler@linux.intel.com: fix error path in sparse_add_one_section] Link: http://lkml.kernel.org/r/20180706190658.6873-1-ross.zwisler@linux.intel.com [ross.zwisler@linux.intel.com: simplification suggested by Oscar] Link: http://lkml.kernel.org/r/20180706223358.742-1-ross.zwisler@linux.intel.com Link: http://lkml.kernel.org/r/20180702154325.12196-1-osalvador@techadventures.net Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17memcg, oom: move out_of_memory back to the charge pathMichal Hocko4-26/+71
Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM") has changed the ENOMEM semantic of memcg charges. Rather than invoking the oom killer from the charging context it delays the oom killer to the page fault path (pagefault_out_of_memory). This in turn means that many users (e.g. slab or g-u-p) will get ENOMEM when the corresponding memcg hits the hard limit and the memcg is is OOM. This is behavior is inconsistent with !memcg case where the oom killer is invoked from the allocation context and the allocator keeps retrying until it succeeds. The difference in the behavior is user visible. mmap(MAP_POPULATE) might result in not fully populated ranges while the mmap return code doesn't tell that to the userspace. Random syscalls might fail with ENOMEM etc. The primary motivation of the different memcg oom semantic was the deadlock avoidance. Things have changed since then, though. We have an async oom teardown by the oom reaper now and so we do not have to rely on the victim to tear down its memory anymore. Therefore we can return to the original semantic as long as the memcg oom killer is not handed over to the users space. There is still one thing to be careful about here though. If the oom killer is not able to make any forward progress - e.g. because there is no eligible task to kill - then we have to bail out of the charge path to prevent from same class of deadlocks. We have basically two options here. Either we fail the charge with ENOMEM or force the charge and allow overcharge. The first option has been considered more harmful than useful because rare inconsistencies in the ENOMEM behavior is hard to test for and error prone. Basically the same reason why the page allocator doesn't fail allocations under such conditions. The later might allow runaways but those should be really unlikely unless somebody misconfigures the system. E.g. allowing to migrate tasks away from the memcg to a different unlimited memcg with move_charge_at_immigrate disabled. Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm: make DEFERRED_STRUCT_PAGE_INIT explicitly depend on SPARSEMEMMike Rapoport1-1/+1
The deferred memory initialization relies on section definitions, e.g PAGES_PER_SECTION, that are only available when CONFIG_SPARSEMEM=y on most architectures. Initially DEFERRED_STRUCT_PAGE_INIT depended on explicit ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT configuration option, but since the commit 2e3ca40f03bb13709df4 ("mm: relax deferred struct page requirements") this requirement was relaxed and now it is possible to enable DEFERRED_STRUCT_PAGE_INIT on architectures that support DISCONTINGMEM and NO_BOOTMEM which causes build failures. For instance, setting SMP=y and DEFERRED_STRUCT_PAGE_INIT=y on arc causes the following build failure: CC mm/page_alloc.o mm/page_alloc.c: In function 'update_defer_init': mm/page_alloc.c:321:14: error: 'PAGES_PER_SECTION' undeclared (first use in this function); did you mean 'USEC_PER_SEC'? (pfn & (PAGES_PER_SECTION - 1)) == 0) { ^~~~~~~~~~~~~~~~~ USEC_PER_SEC mm/page_alloc.c:321:14: note: each undeclared identifier is reported only once for each function it appears in In file included from include/linux/cache.h:5:0, from include/linux/printk.h:9, from include/linux/kernel.h:14, from include/asm-generic/bug.h:18, from arch/arc/include/asm/bug.h:32, from include/linux/bug.h:5, from include/linux/mmdebug.h:5, from include/linux/mm.h:9, from mm/page_alloc.c:18: mm/page_alloc.c: In function 'deferred_grow_zone': mm/page_alloc.c:1624:52: error: 'PAGES_PER_SECTION' undeclared (first use in this function); did you mean 'USEC_PER_SEC'? unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION); ^ include/uapi/linux/kernel.h:11:47: note: in definition of macro '__ALIGN_KERNEL_MASK' #define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask)) ^~~~ include/linux/kernel.h:58:22: note: in expansion of macro '__ALIGN_KERNEL' #define ALIGN(x, a) __ALIGN_KERNEL((x), (a)) ^~~~~~~~~~~~~~ mm/page_alloc.c:1624:34: note: in expansion of macro 'ALIGN' unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION); ^~~~~ In file included from include/asm-generic/bug.h:18:0, from arch/arc/include/asm/bug.h:32, from include/linux/bug.h:5, from include/linux/mmdebug.h:5, from include/linux/mm.h:9, from mm/page_alloc.c:18: mm/page_alloc.c: In function 'free_area_init_node': mm/page_alloc.c:6379:50: error: 'PAGES_PER_SECTION' undeclared (first use in this function); did you mean 'USEC_PER_SEC'? pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION, ^ include/linux/kernel.h:812:22: note: in definition of macro '__typecheck' (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1))) ^ include/linux/kernel.h:836:24: note: in expansion of macro '__safe_cmp' __builtin_choose_expr(__safe_cmp(x, y), \ ^~~~~~~~~~ include/linux/kernel.h:904:27: note: in expansion of macro '__careful_cmp' #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <) ^~~~~~~~~~~~~ mm/page_alloc.c:6379:29: note: in expansion of macro 'min_t' pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION, ^~~~~ include/linux/kernel.h:836:2: error: first argument to '__builtin_choose_expr' not a constant __builtin_choose_expr(__safe_cmp(x, y), \ ^ include/linux/kernel.h:904:27: note: in expansion of macro '__careful_cmp' #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <) ^~~~~~~~~~~~~ mm/page_alloc.c:6379:29: note: in expansion of macro 'min_t' pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION, ^~~~~ scripts/Makefile.build:317: recipe for target 'mm/page_alloc.o' failed Let's make the DEFERRED_STRUCT_PAGE_INIT explicitly depend on SPARSEMEM as the systems that support DISCONTIGMEM do not seem to have that huge amounts of memory that would make DEFERRED_STRUCT_PAGE_INIT relevant. Link: http://lkml.kernel.org/r/1530279308-24988-1-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Randy Dunlap <rdunlap@infradead.org> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17kernel/memremap, kasan: make ZONE_DEVICE with work with KASANAndrey Ryabinin3-14/+325
KASAN learns about hotadded memory via the memory hotplug notifier. devm_memremap_pages() intentionally skips calling memory hotplug notifiers. So KASAN doesn't know anything about new memory added by devm_memremap_pages(). This causes a crash when KASAN tries to access non-existent shadow memory: BUG: unable to handle kernel paging request at ffffed0078000000 RIP: 0010:check_memory_region+0x82/0x1e0 Call Trace: memcpy+0x1f/0x50 pmem_do_bvec+0x163/0x720 pmem_make_request+0x305/0xac0 generic_make_request+0x54f/0xcf0 submit_bio+0x9c/0x370 submit_bh_wbc+0x4c7/0x700 block_read_full_page+0x5ef/0x870 do_read_cache_page+0x2b8/0xb30 read_dev_sector+0xbd/0x3f0 read_lba.isra.0+0x277/0x670 efi_partition+0x41a/0x18f0 check_partition+0x30d/0x5e9 rescan_partitions+0x18c/0x840 __blkdev_get+0x859/0x1060 blkdev_get+0x23f/0x810 __device_add_disk+0x9c8/0xde0 pmem_attach_disk+0x9a8/0xf50 nvdimm_bus_probe+0xf3/0x3c0 driver_probe_device+0x493/0xbd0 bus_for_each_drv+0x118/0x1b0 __device_attach+0x1cd/0x2b0 bus_probe_device+0x1ac/0x260 device_add+0x90d/0x1380 nd_async_device_register+0xe/0x50 async_run_entry_fn+0xc3/0x5d0 process_one_work+0xa0a/0x1810 worker_thread+0x87/0xe80 kthread+0x2d7/0x390 ret_from_fork+0x3a/0x50 Add kasan_add_zero_shadow()/kasan_remove_zero_shadow() - post mm_init() interface to map/unmap kasan_zero_page at requested virtual addresses. And use it to add/remove the shadow memory for hotplugged/unplugged device memory. Link: http://lkml.kernel.org/r/20180629164932.740-1-aryabinin@virtuozzo.com Fixes: 41e94a851304 ("add devm_memremap_pages") Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reported-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Dan Williams <dan.j.williams@intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm: thp: pass correct vm_flags to hugepage_vma_check()Song Liu1-7/+8
khugepaged_enter_vma_merge() passes a stale vma->vm_flags to hugepage_vma_check(). The argument vm_flags contains the latest value. Therefore, it is necessary to pass this vm_flags into hugepage_vma_check(). With this bug, madvise(MADV_HUGEPAGE) for mmap files in shmem fails to put memory in huge pages. Here is an example of failed madvise(): /* mount /dev/shm with huge=advise: * mount -o remount,huge=advise /dev/shm */ /* create file /dev/shm/huge */ #define HUGE_FILE "/dev/shm/huge" fd = open(HUGE_FILE, O_RDONLY); ptr = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, fd, 0); ret = madvise(ptr, FILE_SIZE, MADV_HUGEPAGE); madvise() will return 0, but this memory region is never put in huge page (check from /proc/meminfo: ShmemHugePages). Link: http://lkml.kernel.org/r/20180629181752.792831-1-songliubraving@fb.com Fixes: 02b75dc8160d ("mm: thp: register mm for khugepaged when merging vma for shmem") Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/fadvise.c: fix signed overflow UBSAN complaintAndrey Ryabinin1-2/+6
Signed integer overflow is undefined according to the C standard. The overflow in ksys_fadvise64_64() is deliberate, but since it is signed overflow, UBSAN complains: UBSAN: Undefined behaviour in mm/fadvise.c:76:10 signed integer overflow: 4 + 9223372036854775805 cannot be represented in type 'long long int' Use unsigned types to do math. Unsigned overflow is defined so UBSAN will not complain about it. This patch doesn't change generated code. [akpm@linux-foundation.org: add comment explaining the casts] Link: http://lkml.kernel.org/r/20180629184453.7614-1-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reported-by: <icytxw@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/swap_slots.c: make swap_slots_cache_mutex and ↵Colin Ian King1-2/+2
swap_slots_cache_enable_mutex static The mutexes swap_slots_cache_mutex and swap_slots_cache_enable_mutex are local to the source and do not need to be in global scope, so make them static. Cleans up sparse warnings: symbol 'swap_slots_cache_mutex' was not declared. Should it be static? symbol 'swap_slots_cache_enable_mutex' was not declared. Should it be static? Link: http://lkml.kernel.org/r/20180624182536.4937-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/zsmalloc.c: make several functions and a struct staticColin Ian King1-18/+18
The functions zs_page_isolate, zs_page_migrate, zs_page_putback, lock_zspage, trylock_zspage and structure zsmalloc_aops are local to source and do not need to be in global scope, so make them static. Cleans up sparse warnings: symbol 'zs_page_isolate' was not declared. Should it be static? symbol 'zs_page_migrate' was not declared. Should it be static? symbol 'zs_page_putback' was not declared. Should it be static? symbol 'zsmalloc_aops' was not declared. Should it be static? symbol 'lock_zspage' was not declared. Should it be static? symbol 'trylock_zspage' was not declared. Should it be static? [arnd@arndb.de: hide unused lock_zspage] Link: http://lkml.kernel.org/r/20180706130924.3891230-1-arnd@arndb.de Link: http://lkml.kernel.org/r/20180624213322.13776-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/page-writeback.c: update stale account_page_redirty() commentGreg Thelen1-2/+2
Commit 93f78d882865 ("writeback: move backing_dev_info->bdi_stat[] into bdi_writeback") replaced BDI_DIRTIED with WB_DIRTIED in account_page_redirty(). Update comment to track that change. BDI_DIRTIED => WB_DIRTIED BDI_WRITTEN => WB_WRITTEN Link: http://lkml.kernel.org/r/20180625171526.173483-1-gthelen@google.com Signed-off-by: Greg Thelen <gthelen@google.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17fs, mm: account buffer_head to kmemcgShakeel Butt3-2/+39
The buffer_head can consume a significant amount of system memory and is directly related to the amount of page cache. In our production environment we have observed that a lot of machines are spending a significant amount of memory as buffer_head and can not be left as system memory overhead. Charging buffer_head is not as simple as adding __GFP_ACCOUNT to the allocation. The buffer_heads can be allocated in a memcg different from the memcg of the page for which buffer_heads are being allocated. One concrete example is memory reclaim. The reclaim can trigger I/O of pages of any memcg on the system. So, the right way to charge buffer_head is to extract the memcg from the page for which buffer_heads are being allocated and then use targeted memcg charging API. [shakeelb@google.com: use __GFP_ACCOUNT for directed memcg charging] Link: http://lkml.kernel.org/r/20180702220208.213380-1-shakeelb@google.com Link: http://lkml.kernel.org/r/20180627191250.209150-3-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17fs: fsnotify: account fsnotify metadata to kmemcgShakeel Butt12-18/+123
Patch series "Directed kmem charging", v8. The Linux kernel's memory cgroup allows limiting the memory usage of the jobs running on the system to provide isolation between the jobs. All the kernel memory allocated in the context of the job and marked with __GFP_ACCOUNT will also be included in the memory usage and be limited by the job's limit. The kernel memory can only be charged to the memcg of the process in whose context kernel memory was allocated. However there are cases where the allocated kernel memory should be charged to the memcg different from the current processes's memcg. This patch series contains two such concrete use-cases i.e. fsnotify and buffer_head. The fsnotify event objects can consume a lot of system memory for large or unlimited queues if there is either no or slow listener. The events are allocated in the context of the event producer. However they should be charged to the event consumer. Similarly the buffer_head objects can be allocated in a memcg different from the memcg of the page for which buffer_head objects are being allocated. To solve this issue, this patch series introduces mechanism to charge kernel memory to a given memcg. In case of fsnotify events, the memcg of the consumer can be used for charging and for buffer_head, the memcg of the page can be charged. For directed charging, the caller can use the scope API memalloc_[un]use_memcg() to specify the memcg to charge for all the __GFP_ACCOUNT allocations within the scope. This patch (of 2): A lot of memory can be consumed by the events generated for the huge or unlimited queues if there is either no or slow listener. This can cause system level memory pressure or OOMs. So, it's better to account the fsnotify kmem caches to the memcg of the listener. However the listener can be in a different memcg than the memcg of the producer and these allocations happen in the context of the event producer. This patch introduces remote memcg charging API which the producer can use to charge the allocations to the memcg of the listener. There are seven fsnotify kmem caches and among them allocations from dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and inotify_inode_mark_cachep happens in the context of syscall from the listener. So, SLAB_ACCOUNT is enough for these caches. The objects from fsnotify_mark_connector_cachep are not accounted as they are small compared to the notification mark or events and it is unclear whom to account connector to since it is shared by all events attached to the inode. The allocations from the event caches happen in the context of the event producer. For such caches we will need to remote charge the allocations to the listener's memcg. Thus we save the memcg reference in the fsnotify_group structure of the listener. This patch has also moved the members of fsnotify_group to keep the size same, at least for 64 bit build, even with additional member by filling the holes. [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it] Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Greg Thelen <gthelen@google.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm: introduce mem_cgroup_put() helperRoman Gushchin1-0/+9
Introduce the mem_cgroup_put() helper, which helps to eliminate guarding memcg css release with "#ifdef CONFIG_MEMCG" in multiple places. Link: http://lkml.kernel.org/r/20180623000600.5818-2-guro@fb.com Signed-off-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm: provide a fallback for PAGE_KERNEL_EXEC for architecturesLuis R. Rodriguez3-8/+4
Some architectures just don't have PAGE_KERNEL_EXEC. The mm/nommu.c and mm/vmalloc.c code have been using PAGE_KERNEL as a fallback for years. Move this fallback to asm-generic. Link: http://lkml.kernel.org/r/20180510185507.2439-3-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm: provide a fallback for PAGE_KERNEL_RO for architecturesLuis R. Rodriguez2-5/+14
Some architectures do not define certain PAGE_KERNEL_* flags, this is either because: a) The way to implement some of these flags is *not yet ported*, or b) The architecture *has no way* to describe them Over time we have accumulated a few PAGE_KERNEL_* fallback workarounds for architectures in the kernel which do not define them using *relatively safe* equivalents. Move these scattered fallback hacks into asm-generic. We start off with PAGE_KERNEL_RO using PAGE_KERNEL as a fallback. This has been in place on the firmware loader for years. Move the fallback into the respective asm-generic header. Link: http://lkml.kernel.org/r/20180510185507.2439-2-mcgrof@kernel.org Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/memory_hotplug.c: drop unnecessary checks from register_mem_sect_under_node()Oscar Salvador1-5/+0
Callers of register_mem_sect_under_node() are always passing a valid memory_block (not NULL), so we can safely drop the check for NULL. In the same way, register_mem_sect_under_node() is only called in case the node is online, so we can safely remove that check as well. Link: http://lkml.kernel.org/r/20180622111839.10071-5-osalvador@techadventures.net Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of ↵Oscar Salvador3-47/+14
walk_memory_range() link_mem_sections() and walk_memory_range() share most of the code, so we can use convert link_mem_sections() into a dummy function that calls walk_memory_range() with a callback to register_mem_sect_under_node(). This patch converts register_mem_sect_under_node() in order to match a walk_memory_range's callback, getting rid of the check_nid argument and checking instead if the system is still boothing, since we only have to check for the nid if the system is in such state. Link: http://lkml.kernel.org/r/20180622111839.10071-4-osalvador@techadventures.net Signed-off-by: Oscar Salvador <osalvador@suse.de> Suggested-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/memory_hotplug.c: call register_mem_sect_under_node()Oscar Salvador2-23/+11
When hotplugging memory, it is possible that two calls are being made to register_mem_sect_under_node(). One comes from __add_section()->hotplug_memory_register() and the other from add_memory_resource()->link_mem_sections() if we had to register a new node. In case we had to register a new node, hotplug_memory_register() will only handle/allocate the memory_block's since register_mem_sect_under_node() will return right away because the node it is not online yet. I think it is better if we leave hotplug_memory_register() to handle/allocate only memory_block's and make link_mem_sections() to call register_mem_sect_under_node(). So this patch removes the call to register_mem_sect_under_node() from hotplug_memory_register(), and moves the call to link_mem_sections() out of the condition, so it will always be called. In this way we only have one place where the memory sections are registered. Link: http://lkml.kernel.org/r/20180622111839.10071-3-osalvador@techadventures.net Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/memory_hotplug.c: make add_memory_resource use __try_online_nodeOscar Salvador1-28/+39
This is a small cleanup for the memhotplug code. A lot more could be done, but it is better to start somewhere. I tried to unify/remove duplicated code. The following is what this patchset does: 1) add_memory_resource() has code to allocate a node in case it was offline. Since try_online_node has some code for that as well, I just made add_memory_resource() to use that so we can remove duplicated code.. This is better explained in patch 1/4. 2) register_mem_sect_under_node() will be called only from link_mem_sections() 3) Make register_mem_sect_under_node() a callback of walk_memory_range() 4) Drop unnecessary checks from register_mem_sect_under_node() I have done some tests and I could not see anything broken because of this patchset. add_memory_resource() contains code to allocate a new node in case it is necessary. Since try_online_node() also has some code for this purpose, let us make use of that and remove duplicate code. This introduces __try_online_node(), which is called by add_memory_resource() and try_online_node(). __try_online_node() has two new parameters, start_addr of the node, and if the node should be onlined and registered right away. This is always wanted if we are calling from do_cpu_up(), but not when we are calling from memhotplug code. Nothing changes from the point of view of the users of try_online_node(), since try_online_node passes start_addr=0 and online_node=true to __try_online_node(). Link: http://lkml.kernel.org/r/20180622111839.10071-2-osalvador@techadventures.net Signed-off-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com> Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/list_lru.c: fold __list_lru_count_one() into its callerAndrew Morton1-9/+3
__list_lru_count_one() has a single callsite. Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm: workingset: make shadow_lru_isolate() use locking suffixSebastian Andrzej Siewior1-5/+3
shadow_lru_isolate() disables interrupts and acquires a lock. It could use spin_lock_irq() instead. It also uses local_irq_enable() while it could use spin_unlock_irq()/xa_unlock_irq(). Use proper suffix for lock/unlock in order to enable/disable interrupts during release/acquire of a lock. Link: http://lkml.kernel.org/r/20180622151221.28167-3-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm: workingset: remove local_irq_disable() from count_shadow_nodes()Sebastian Andrzej Siewior1-3/+0
Patch series "mm: use irq locking suffix instead local_irq_disable()". A small series which avoids using local_irq_disable()/local_irq_enable() but instead does spin_lock_irq()/spin_unlock_irq() so it is within the context of the lock which it belongs to. Patch #1 is a cleanup where local_irq_.*() remained after the lock was removed. This patch (of 2): In 0c7c1bed7e13 ("mm: make counting of list_lru_one::nr_items lockless") the spin_lock(&nlru->lock); statement was replaced with rcu_read_lock(); in __list_lru_count_one(). The comment in count_shadow_nodes() says that the local_irq_disable() is required because the lock must be acquired with disabled interrupts and (spin_lock()) does not do so. Since the lock is replaced with rcu_read_lock() the local_irq_disable() is no longer needed. The code path is list_lru_shrink_count() -> list_lru_count_one() -> __list_lru_count_one() -> rcu_read_lock() -> list_lru_from_memcg_idx() -> rcu_read_unlock() Remove the local_irq_disable() statement. Link: http://lkml.kernel.org/r/20180622151221.28167-2-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm: drop VM_BUG_ON from __get_free_pagesMichal Hocko1-8/+4
There is no real reason to blow up just because the caller doesn't know that __get_free_pages cannot return highmem pages. Simply fix that up silently. Even if we have some confused users such a fixup will not be harmful. [akpm@linux-foundation.org: mask off __GFP_HIGHMEM] Link: http://lkml.kernel.org/r/20180622162841.25114-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jiankang Chen <chenjiankang1@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Yisheng Xie <xieyisheng1@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm, hugetlbfs: pass fault address to cow handlerHuang Ying1-4/+5
This is to take better advantage of the general huge page copying optimization. Where, the target subpage will be copied last to avoid the cache lines of target subpage to be evicted when copying other subpages. This works better if the address of the target subpage is available when copying huge page. So hugetlbfs page fault handlers are changed to pass that information to hugetlb_cow(). This will benefit workloads which don't access the begin of the hugetlbfs huge page after the page fault under heavy cache contention. Link: http://lkml.kernel.org/r/20180524005851.4079-5-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Christopher Lameter <cl@linux.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Punit Agrawal <punit.agrawal@arm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm, hugetlbfs: rename address to haddr in hugetlb_cow()Huang Ying1-14/+12
To take better advantage of general huge page copying optimization, the target subpage address will be passed to hugetlb_cow(), then copy_user_huge_page(). So we will use both target subpage address and huge page size aligned address in hugetlb_cow(). To distinguish between them, "haddr" is used for huge page size aligned address to be consistent with Transparent Huge Page naming convention. Now, only huge page size aligned address is used in hugetlb_cow(), so the "address" is renamed to "haddr" in hugetlb_cow() in this patch. Next patch will use target subpage address in hugetlb_cow() too. The patch is just code cleanup without any functionality changes. Link: http://lkml.kernel.org/r/20180524005851.4079-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Suggested-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Christopher Lameter <cl@linux.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Punit Agrawal <punit.agrawal@arm.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm, huge page: copy target sub-page last when copy huge pageHuang Ying3-9/+27
Huge page helps to reduce TLB miss rate, but it has higher cache footprint, sometimes this may cause some issue. For example, when copying huge page on x86_64 platform, the cache footprint is 4M. But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC (last level cache). That is, in average, there are 2.5M LLC for each core and 1.25M LLC for each thread. If the cache contention is heavy when copying the huge page, and we copy the huge page from the begin to the end, it is possible that the begin of huge page is evicted from the cache after we finishing copying the end of the huge page. And it is possible for the application to access the begin of the huge page after copying the huge page. In c79b57e462b5d ("mm: hugetlb: clear target sub-page last when clearing huge page"), to keep the cache lines of the target subpage hot, the order to clear the subpages in the huge page in clear_huge_page() is changed to clearing the subpage which is furthest from the target subpage firstly, and the target subpage last. The similar order changing helps huge page copying too. That is implemented in this patch. Because we have put the order algorithm into a separate function, the implementation is quite simple. The patch is a generic optimization which should benefit quite some workloads, not for a specific use case. To demonstrate the performance benefit of the patch, we tested it with vm-scalability run on transparent huge page. With this patch, the throughput increases ~16.6% in vm-scalability anon-cow-seq test case with 36 processes on a 2 socket Xeon E5 v3 2699 system (36 cores, 72 threads). The test case set /sys/kernel/mm/transparent_hugepage/enabled to be always, mmap() a big anonymous memory area and populate it, then forked 36 child processes, each writes to the anonymous memory area from the begin to the end, so cause copy on write. For each child process, other child processes could be seen as other workloads which generate heavy cache pressure. At the same time, the IPC (instruction per cycle) increased from 0.63 to 0.78, and the time spent in user space is reduced ~7.2%. Link: http://lkml.kernel.org/r/20180524005851.4079-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Christopher Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm, clear_huge_page: move order algorithm into a separate functionHuang Ying1-34/+56
Patch series "mm, huge page: Copy target sub-page last when copy huge page", v2. Huge page helps to reduce TLB miss rate, but it has higher cache footprint, sometimes this may cause some issue. For example, when copying huge page on x86_64 platform, the cache footprint is 4M. But on a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC (last level cache). That is, in average, there are 2.5M LLC for each core and 1.25M LLC for each thread. If the cache contention is heavy when copying the huge page, and we copy the huge page from the begin to the end, it is possible that the begin of huge page is evicted from the cache after we finishing copying the end of the huge page. And it is possible for the application to access the begin of the huge page after copying the huge page. In c79b57e462b5d ("mm: hugetlb: clear target sub-page last when clearing huge page"), to keep the cache lines of the target subpage hot, the order to clear the subpages in the huge page in clear_huge_page() is changed to clearing the subpage which is furthest from the target subpage firstly, and the target subpage last. The similar order changing helps huge page copying too. That is implemented in this patchset. The patchset is a generic optimization which should benefit quite some workloads, not for a specific use case. To demonstrate the performance benefit of the patchset, we have tested it with vm-scalability run on transparent huge page. With this patchset, the throughput increases ~16.6% in vm-scalability anon-cow-seq test case with 36 processes on a 2 socket Xeon E5 v3 2699 system (36 cores, 72 threads). The test case set /sys/kernel/mm/transparent_hugepage/enabled to be always, mmap() a big anonymous memory area and populate it, then forked 36 child processes, each writes to the anonymous memory area from the begin to the end, so cause copy on write. For each child process, other child processes could be seen as other workloads which generate heavy cache pressure. At the same time, the IPC (instruction per cycle) increased from 0.63 to 0.78, and the time spent in user space is reduced ~7.2%. This patch (of 4): In c79b57e462b5d ("mm: hugetlb: clear target sub-page last when clearing huge page"), to keep the cache lines of the target subpage hot, the order to clear the subpages in the huge page in clear_huge_page() is changed to clearing the subpage which is furthest from the target subpage firstly, and the target subpage last. This optimization could be applied to copying huge page too with the same order algorithm. To avoid code duplication and reduce maintenance overhead, in this patch, the order algorithm is moved out of clear_huge_page() into a separate function: process_huge_page(). So that we can use it for copying huge page too. This will change the direct calls to clear_user_highpage() into the indirect calls. But with the proper inline support of the compilers, the indirect call will be optimized to be the direct call. Our tests show no performance change with the patch. This patch is a code cleanup without functionality change. Link: http://lkml.kernel.org/r/20180524005851.4079-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Hugh Dickins <hughd@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Shaohua Li <shli@fb.com> Cc: Christopher Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17ext4: readpages() should submit IO as read-aheadJens Axboe3-5/+7
a_ops->readpages() is only ever used for read-ahead. Ensure that we pass this information down to the block layer. Link: http://lkml.kernel.org/r/20180621010725.17813-5-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17btrfs: readpages() should submit IO as read-aheadJens Axboe1-1/+1
a_ops->readpages() is only ever used for read-ahead. Ensure that we pass this information down to the block layer. Link: http://lkml.kernel.org/r/20180621010725.17813-4-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mpage: mpage_readpages() should submit IO as read-aheadJens Axboe3-10/+28
a_ops->readpages() is only ever used for read-ahead, yet we don't flag the IO being submitted as such. Fix that up. Any file system that uses mpage_readpages() as its ->readpages() implementation will now get this right. Since we're passing in whether the IO is read-ahead or not, we don't need to pass in the 'gfp' separately, as it is dependent on the IO being read-ahead. Kill off that member. Add some documentation notes on ->readpages() being purely for read-ahead. Link: http://lkml.kernel.org/r/20180621010725.17813-3-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <clm@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mpage: add argument structure for do_mpage_readpage()Jens Axboe1-52/+57
Patch series "Submit ->readpages() IO as read-ahead", v4. The only caller of ->readpages() is from read-ahead, yet we don't submit IO flagged with REQ_RAHEAD. This means we don't see it in blktrace, for instance, which is a shame. Additionally, it's preventing further functional changes in the block layer for deadling with read-ahead more intelligently. We already make assumptions about ->readpages() just being for read-ahead in the mpage implementation, using readahead_gfp_mask(mapping) as out GFP mask of choice. This small series fixes up mpage_readpages() to submit with REQ_RAHEAD, which takes care of file systems using mpage_readpages(). The first patch is a prep patch, that makes do_mpage_readpage() take an argument structure. This patch (of 4): We're currently passing 8 arguments to this function, clean it up a bit by packing the arguments in an args structure we pass to it. No intentional functional changes in this patch. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20180621010725.17813-2-axboe@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Chris Mason <clm@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm: thp: inc counter for collapsed shmem THPYang Shi1-0/+2
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed is used to record the counter of collapsed THP, but it just gets inc'ed in anonymous THP collapse path, do this for shmem THP collapse too. Link: http://lkml.kernel.org/r/1529622949-75504-2-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm: thp: register mm for khugepaged when merging vma for shmemYang Shi1-27/+26
When merging anonymous page vma, if the size of the vma can fit in at least one hugepage, the mm will be registered for khugepaged for collapsing THP in the future. But it skips shmem vmas. Do so for shmem also, but not for file-private mappings when merging a vma in order to increase the odds of collapsing a hugepage via khugepaged. hugepage_vma_check() sounds like a good fit to do the check. And move the definition of it before khugepaged_enter_vma_merge() to avoid a build error. Link: http://lkml.kernel.org/r/1529697791-6950-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/mempool.c: remove unused argument in kasan_unpoison_element() and ↵Jia-Ju Bai1-6/+6
remove_element() The argument "gfp_t flags" is not used in kasan_unpoison_element() and remove_element(), so remove it. Link: http://lkml.kernel.org/r/20180621070332.16633-1-baijiaju1990@gmail.com Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Reviewed-by: Matthew Wilcox <willy@infradead.org> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/vmscan.c: condense scan_controlGreg Thelen1-12/+20
Use smaller scan_control fields for order, priority, and reclaim_idx. Convert fields from int => s8. All easily fit within a byte: - allocation order range: 0..MAX_ORDER(64?) - priority range: 0..12(DEF_PRIORITY) - reclaim_idx range: 0..6(__MAX_NR_ZONES) Since 6538b8ea886e ("x86_64: expand kernel stack to 16K") x86_64 stack overflows are not an issue. But it's inefficient to use ints. Use s8 (signed byte) rather than u8 to allow for loops like: do { ... } while (--sc.priority >= 0); Add BUILD_BUG_ON to verify that s8 is capable of storing max values. This reduces sizeof(struct scan_control): - 96 => 80 bytes (x86_64) - 68 => 56 bytes (i386) scan_control structure field order is changed to utilize padding. After this patch there is 1 bit of scan_control padding. akpm: makes my vmscan.o's .text 572 bytes smaller as well. Link: http://lkml.kernel.org/r/20180530061212.84915-1-gthelen@google.com Signed-off-by: Greg Thelen <gthelen@google.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm/page_ext.c: constify lookup_page_ext() argumentKirill A. Shutemov2-4/+4
lookup_page_ext() finds 'struct page_ext' for a given page. It requires only read access to the given struct page. Current implemnentation takes 'struct page *' as an argument. It makes compiler complain when 'const struct page *' passed. Change the argument to 'const struct page *'. Link: http://lkml.kernel.org/r/20180531135457.20167-3-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17include/linux/page_ext.h: drop definition of unused PAGE_EXT_DEBUG_POISONKirill A. Shutemov1-11/+0
After commit bd33ef368135 ("mm: enable page poisoning early at boot") PAGE_EXT_DEBUG_POISON is not longer used. Remove it. Link: http://lkml.kernel.org/r/20180531135457.20167-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17shmem: use monotonic time for i_generationArnd Bergmann1-1/+2
get_seconds() is deprecated because it will lead to a 32-bit overflow in 2038 or 2106. We don't need the i_generation to be strictly monotonic anyway, and other file systems like ext4 and xfs just use prandom_u32(), so let's use the same one here. If this is considered too slow, we could also use ktime_get_seconds() or ktime_get_real_seconds() to keep the previous behavior. Both of these return a time64_t and are not deprecated, but only return a unique value once per second, and are predictable. Link: http://lkml.kernel.org/r/20180620082556.581543-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm, page_alloc: actually ignore mempolicies for high priority allocationsVlastimil Babka1-3/+4
__alloc_pages_slowpath() has for a long time contained code to ignore node restrictions from memory policies for high priority allocations. The current code that resets the zonelist iterator however does effectively nothing after commit 7810e6781e0f ("mm, page_alloc: do not break __GFP_THISNODE by zonelist reset") removed a buggy zonelist reset. Even before that commit, mempolicy restrictions were still not ignored, as they are passed in ac->nodemask which is untouched by the code. We can either remove the code, or make it work as intended. Since ac->nodemask can be set from task's mempolicy via alloc_pages_current() and thus also alloc_pages(), it may indeed affect kernel allocations, and it makes sense to ignore it to allow progress for high priority allocations. Thus, this patch resets ac->nodemask to NULL in such cases. This assumes all callers can handle it (i.e. there are no guarantees as in the case of __GFP_THISNODE) which seems to be the case. The same assumption is already present in check_retry_cpuset() for some time. The expected effect is that high priority kernel allocations in the context of userspace tasks (e.g. OOM victims) restricted by mempolicies will have higher chance to succeed if they are restricted to nodes with depleted memory, while there are other nodes with free memory left. It's not a new intention, but for the first time the code will match the intention, AFAICS. It was intended by commit 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK") in v3.6 but I think it never really worked, as mempolicy restriction was already encoded in nodemask, not zonelist, at that time. So originally that was for ALLOC_NO_WATERMARK only. Then it was adjusted by e46e7b77c909 ("mm, page_alloc: recalculate the preferred zoneref if the context can ignore memory policies") and cd04ae1e2dc8 ("mm, oom: do not rely on TIF_MEMDIE for memory reserves access") to the current state. So even GFP_ATOMIC would now ignore mempolicies after the initial attempts fail - if the code worked as people thought it does. Link: http://lkml.kernel.org/r/20180612122624.8045-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17tools/vm/page-types.c: add support for idle page trackingChristian Hansen2-1/+51
Add a flag which causes page-types to use the kernels's idle page tracking to mark pages idle. As the tool already prints the idle flag if set, subsequent runs will show which pages have been accessed since last run. [akpm@linux-foundation.org: simplify mark_page_idle()] [chansen3@cisco.com: reorganize mark_page_idle() logic, add docs] Link: http://lkml.kernel.org/r/20180706172237.21691-1-chansen3@cisco.com Link: http://lkml.kernel.org/r/20180612153223.13174-1-chansen3@cisco.com Signed-off-by: Christian Hansen <chansen3@cisco.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17tools/vm/page-types.c: include shared map countsChristian Hansen2-14/+62
Add a new flag that will read kpagecount for each PFN and print out the number of times the page is mapped along with the flags in the listing view. This information is useful in understanding and optimizing memory usage. Identifying pages which are not shared allows us to focus on adjusting the memory layout or access patterns for the sole owning process. Knowing the number of processes that share a page tells us how many other times we must make the same adjustments or how many processes to potentially disable. Truncated sample output: voffset map-cnt offset len flags 561a3591e 1 15fe8 1 ___U_lA____Ma_b___________________________ 561a3591f 1 2b103 1 ___U_lA____Ma_b___________________________ 561a36ca4 1 2cc78 1 ___U_lA____Ma_b___________________________ 7f588bb4e 14 2273c 1 __RU_lA____M______________________________ [akpm@linux-foundation.org: coding-style fixes] [chansen3@cisco.com: add documentation, tweak whitespace] Link: http://lkml.kernel.org/r/20180705181204.5529-1-chansen3@cisco.com Link: http://lkml.kernel.org/r/20180612153205.12879-1-chansen3@cisco.com Signed-off-by: Christian Hansen <chansen3@cisco.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17thp: use mm_file_counter to determine update which rss counterYang Shi2-3/+3
Since commit eca56ff906bd ("mm, shmem: add internal shmem resident memory accounting"), MM_SHMEMPAGES is added to separate the shmem accounting from regular files. So, all shmem pages should be accounted to MM_SHMEMPAGES instead of MM_FILEPAGES. And, normal 4K shmem pages have been accounted to MM_SHMEMPAGES, so shmem thp pages should be not treated differently. Account them to MM_SHMEMPAGES via mm_counter_file() since shmem pages are swap backed to keep consistent with normal 4K shmem pages. This will not change the rss counter of processes since shmem pages are still a part of it. The /proc/pid/status and /proc/pid/statm counters will however be more accurate wrt shmem usage, as originally intended. And as eca56ff906bd ("mm, shmem: add internal shmem resident memory accounting") mentioned, oom also could report more accurate "shmem-rss". Link: http://lkml.kernel.org/r/1529442518-17398-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm: skip invalid pages block at a time in zero_resv_unresv()Pavel Tatashin1-1/+4
The role of zero_resv_unavail() is to make sure that every struct page that is allocated but is not backed by memory that is accessible by kernel is zeroed and not in some uninitialized state. Since struct pages are allocated in blocks (2M pages in x86 case), we can skip pageblock_nr_pages at a time, when the first one is found to be invalid. This optimization may help since now on x86 every hole in e820 maps is marked as reserved in memblock, and thus will go through this function. This function is called before sched_clock() is initialized, so I used my x86 early boot clock patches to measure the performance improvement. With 1T hole on i7-8700 currently we would take 0.606918s of boot time, but with this optimization 0.001103s. Link: http://lkml.kernel.org/r/20180615155733.1175-1-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm: convert return type of handle_mm_fault() caller to vm_fault_tSouptick Joarder32-51/+69
Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t") In this patch all the caller of handle_mm_fault() are changed to return vm_fault_t type. Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Tony Luck <tony.luck@intel.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Michal Simek <monstr@monstr.eu> Cc: James Hogan <jhogan@kernel.org> Cc: Ley Foon Tan <lftan@altera.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: James E.J. Bottomley <jejb@parisc-linux.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: David S. Miller <davem@davemloft.net> Cc: Richard Weinberger <richard@nod.at> Cc: Guan Xuetao <gxt@pku.edu.cn> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17mm, slub: restore the original intention of prefetch_freepointer()Vlastimil Babka1-2/+1
In SLUB, prefetch_freepointer() is used when allocating an object from cache's freelist, to make sure the next object in the list is cache-hot, since it's probable it will be allocated soon. Commit 2482ddec670f ("mm: add SLUB free list pointer obfuscation") has unintentionally changed the prefetch in a way where the prefetch is turned to a real fetch, and only the next->next pointer is prefetched. In case there is not a stream of allocations that would benefit from prefetching, the extra real fetch might add a useless cache miss to the allocation. Restore the previous behavior. Link: http://lkml.kernel.org/r/20180809085245.22448-1-vbabka@suse.cz Fixes: 2482ddec670f ("mm: add SLUB free list pointer obfuscation") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Kees Cook <keescook@chromium.org> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17fs/seq_file.c: simplify seq_file iteration code and interfaceNeilBrown2-54/+63
The documentation for seq_file suggests that it is necessary to be able to move the iterator to a given offset, however that is not the case. If the iterator is stored in the private data and is stable from one read() syscall to the next, it is only necessary to support first/next interactions. Implementing this in a client is a little clumsy. - if ->start() is given a pos of zero, it should go to start of sequence. - if ->start() is given the name pos that was given to the most recent next() or start(), it should restore the iterator to state just before that last call - if ->start is given another number, it should set the iterator one beyond the start just before the last ->start or ->next call. Also, the documentation says that the implementation can interpret the pos however it likes (other than zero meaning start), but seq_file increments the pos sometimes which does impose on the implementation. This patch simplifies the interface for first/next iteration and simplifies the code, while maintaining complete backward compatability. Now: - if ->start() is given a pos of zero, it should return an iterator placed at the start of the sequence - if ->start() is given a non-zero pos, it should return the iterator in the same state it was after the last ->start or ->next. This is particularly useful for interators which walk the multiple chains in a hash table, e.g. using rhashtable_walk*. See fs/gfs2/glock.c and drivers/staging/lustre/lustre/llite/vvp_dev.c A large part of achieving this is to *always* call ->next after ->show has successfully stored all of an entry in the buffer. Never just increment the index instead. Also: - always pass &m->index to ->start() and ->next(), never a temp variable - don't clear ->from when ->count is zero, as ->from is dead when ->count is zero. Some ->next functions do not increment *pos when they return NULL. To maintain compatability with this, we still need to increment m->index in one place, if ->next didn't increment it. Note that such ->next functions are buggy and should be fixed. A simple demonstration is dd if=/proc/swaps bs=1000 skip=1 Choose any block size larger than the size of /proc/swaps. This will always show the whole last line of /proc/swaps. This patch doesn't work around buggy next() functions for this case. [neilb@suse.com: ensure ->from is valid] Link: http://lkml.kernel.org/r/87601ryb8a.fsf@notabene.neil.brown.name Signed-off-by: NeilBrown <neilb@suse.com> Acked-by: Jonathan Corbet <corbet@lwn.net> [docs] Tested-by: Jann Horn <jannh@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17vfs: discard ATTR_ATTR_FLAGNeilBrown2-2/+1
This flag was introduce in 2.1.37pre1 and the only place it was tested was removed in 2.1.43pre1. The flag was never set. Let's discard it properly. Link: http://lkml.kernel.org/r/877en0hewz.fsf@notabene.neil.brown.name Signed-off-by: NeilBrown <neilb@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17fs/dcache.c: fix kmemcheck splat at take_dentry_name_snapshot()Tetsuo Handa1-1/+2
Since only dentry->d_name.len + 1 bytes out of DNAME_INLINE_LEN bytes are initialized at __d_alloc(), we can't copy the whole size unconditionally. WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff8fa27465ac50) 636f6e66696766732e746d70000000000010000000000000020000000188ffff i i i i i i i i i i i i i u u u u u u u u u u i i i i i u u u u ^ RIP: 0010:take_dentry_name_snapshot+0x28/0x50 RSP: 0018:ffffa83000f5bdf8 EFLAGS: 00010246 RAX: 0000000000000020 RBX: ffff8fa274b20550 RCX: 0000000000000002 RDX: ffffa83000f5be40 RSI: ffff8fa27465ac50 RDI: ffffa83000f5be60 RBP: ffffa83000f5bdf8 R08: ffffa83000f5be48 R09: 0000000000000001 R10: ffff8fa27465ac00 R11: ffff8fa27465acc0 R12: ffff8fa27465ac00 R13: ffff8fa27465acc0 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f79737ac8c0(0000) GS:ffffffff8fc30000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8fa274c0b000 CR3: 0000000134aa7002 CR4: 00000000000606f0 take_dentry_name_snapshot+0x28/0x50 vfs_rename+0x128/0x870 SyS_rename+0x3b2/0x3d0 entry_SYSCALL_64_fastpath+0x1a/0xa4 0xffffffffffffffff Link: http://lkml.kernel.org/r/201709131912.GBG39012.QMJLOVFSFFOOtH@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17ocfs2: make several functions and variables static (and some const)Colin Ian King3-10/+10
There are a variety of functions and variables that are local to the source and do not need to be in global scope, so make them static. Also make a couple of char arrays static const. Cleans up sparse warnings: symbol 'o2hb_heartbeat_mode_desc' was not declared. Should it be static? symbol 'o2hb_heartbeat_mode' was not declared. Should it be static? symbol 'o2hb_dependent_users' was not declared. Should it be static? symbol 'o2hb_region_dec_user' was not declared. Should it be static? symbol 'o2nm_fence_method_desc' was not declared. Should it be static? symbol 'lockdep_keys' was not declared. Should it be static? Link: http://lkml.kernel.org/r/20180628131659.12133-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17ocfs2: clean up some unnecessary codewangyan3-19/+5
Several functions have some unnecessary code, clean up these code. Link: http://lkml.kernel.org/r/5B14DF72.5020800@huawei.com Signed-off-by: Yan Wang <wangyan122@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17ocfs2: return -EROFS when filesystem becomes read-onlyJun Piao3-37/+30
We should return -EROFS rather than other errno if filesystem becomes read-only. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/5B191B26.9010501@huawei.com Signed-off-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17sh: prefer _THIS_IP_ to current_text_addrNick Desaulniers2-2/+3
As part of the effort to reduce the code duplication between _THIS_IP_ and current_text_addr(), let's consolidate callers of current_text_addr() to use _THIS_IP_. Link: http://lkml.kernel.org/r/20180801185331.39535-1-ndesaulniers@google.com Signed-off-by: Nick Desaulniers <ndesaulniers@google.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17sh: make use of for_each_node_by_type()Dmitry Torokhov1-3/+4
Instead of open-coding the loop, let's use canned macro. Also make sure we are not leaking "cpus" node reference. Link: http://lkml.kernel.org/r/20180624224252.GA220395@dtor-ws Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17ntfs: mft: remove VLA usageKees Cook1-2/+10
In the quest to remove all stack VLA usage from the kernel[1], this allocates the maximum size stack buffer. Existing checks already require that blocksize >= NTFS_BLOCK_SIZE and mft_record_size <= PAGE_SIZE, so max_bhs can be at most PAGE_SIZE / NTFS_BLOCK_SIZE. Sanity checks are added for robustness. [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com Link: http://lkml.kernel.org/r/20180626172909.41453-4-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Anton Altaparmakov <anton@tuxera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17ntfs: decompress: remove VLA usageKees Cook1-12/+16
In the quest to remove all stack VLA usage from the kernel[1], this moves the stack buffer used during decompression to be allocated externally. The existing "dest_max_index" used in the VLA is bounded by cb_max_page. cb_max_page is bounded by max_page, and max_page is bounded by nr_pages. Since nr_pages is used for the "pages" allocation, it can similarly be used for the "completed_pages" allocation and passed into the decompression function. The error paths are updated to free the new allocation. [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com Link: http://lkml.kernel.org/r/20180626172909.41453-3-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Anton Altaparmakov <anton@tuxera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17ntfs: aops: remove VLA usageKees Cook1-1/+4
In the quest to remove all stack VLA usage from the kernel[1], this uses the maximum size needed on the stack and adds a sanity check for robustness: index.block_size cannot be larger than PAGE_SIZE nor less than NTFS_BLOCK_SIZE. [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com Link: http://lkml.kernel.org/r/20180626172909.41453-2-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Anton Altaparmakov <anton@tuxera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17fs/ntfs/aops.c: don't disable interrupts during kmap_atomic()Sebastian Andrzej Siewior1-4/+0
ntfs_end_buffer_async_read() disables interrupts around kmap_atomic(). This is a leftover from the old kmap_atomic() implementation which relied on fixed mapping slots, so the caller had to make sure that the same slot could not be reused from an interrupting context. kmap_atomic() was changed to dynamic slots long ago and commit 1ec9c5ddc17a ("include/linux/highmem.h: remove the second argument of k[un]map_atomic()") removed the slot assignements, but the callers were not checked for now redundant interrupt disabling. Remove the conditional interrupt disable. Link: http://lkml.kernel.org/r/20180611144913.gln5mklhqcrfsoom@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Anton Altaparmakov <anton@tuxera.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17scripts: add Python 3 compatibility to spdxcheck.pyJeremy Cline1-2/+5
"dict.has_key(key)" on dictionaries has been replaced with "key in dict". Additionally, when run under Python 3 some files don't decode with the default encoding (tested with UTF-8). To handle that, don't open the file in text mode and decode text line-by-line, ignoring encoding errors. This remains compatible with Python 2 and should have no functional change. Link: http://lkml.kernel.org/r/20180717190635.29467-1-jcline@redhat.com Signed-off-by: Jeremy Cline <jcline@redhat.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17scripts/spdxcheck.py: work with current HEAD LICENSES/ directoryJoe Perches1-3/+1
Depending on how old your -next tree is, it may not have a master that has the LICENSES directory. Change the lookup to HEAD and find whatever LICENSE directory files are used in that branch. Miscellanea: - Remove the checkpatch test as it will have its own SPDX license identifier. Link: http://lkml.kernel.org/r/7eeefc862194930c773e662cb2152e178441d3b8.camel@perches.com Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17fs/hpfs: extend gmt_to_local() conversion to 64-bit timesArnd Bergmann2-9/+16
The VFS timestamps are all 64-bit now, the only missing piece for hpfs is the internal conversion function. One interesting bit about hpfs is that it can already deal with moving the 136 year window of its timestamps to support a much wider range than other file systems with 32-bit timestamps. It also treats the timestamps as 'unsigned' on 64-bit architectures (but signed on 32-bit, because time_t always around to negative numbers in 2038). Changing the conversion to use time64_t makes 32-bit architectures behave the same way as 64-bit. For completeness, this also adds a clamp_t call for each conversion, so we don't wrap the timestamps but instead stay within the [0..U32_MAX] range of the on-disk timestamps. Link: http://lkml.kernel.org/r/20180718115017.742609-3-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17fs/ntfs: use timespec64 directly for timestamp conversionArnd Bergmann2-18/+21
Now that the VFS has been converted from timespec to timespec64 timestamps, only the conversion to/from ntfs timestamps uses 32-bit seconds. This changes that last missing piece to get the ntfs implementation y2038 safe on 32-bit architectures. Link: http://lkml.kernel.org/r/20180718115017.742609-2-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Anton Altaparmakov <anton@tuxera.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17fs/ufs: use ktime_get_real_seconds for sb and cg timestampsArnd Bergmann4-5/+19
get_seconds() is deprecated because of the 32-bit overflow and will be removed. All callers in ufs also truncate to a 32-bit number, so nothing changes during the conversion, but this should be harmless as the superblock and cylinder group timestamps are not visible to user space, except for checking the fs-dirty state, wich works fine across the overflow. This moves the call to get_seconds() into a new inline function, with a comment explaining the constraints, while converting it to ktime_get_real_seconds(). Link: http://lkml.kernel.org/r/20180718115017.742609-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17firewire: use 64-bit time_t based interfacesArnd Bergmann1-4/+4
32-bit CLOCK_REALTIME timestamps overflow in year 2038, so all such interfaces are deprecated now. For the FW_CDEV_IOC_GET_CYCLE_TIMER2 ioctl, we already support 64-bit timestamps, but the implementation still uses timespec. This changes the code to use timespec64 instead with the appropriate accessor functions. Link: http://lkml.kernel.org/r/20180711124456.1023039-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Stefan Richter <stefanr@s5r6.in-berlin.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17dax: remove VM_MIXEDMAP for fsdax and device daxDave Jiang11-14/+27
This patch is reworked from an earlier patch that Dan has posted: https://patchwork.kernel.org/patch/10131727/ VM_MIXEDMAP is used by dax to direct mm paths like vm_normal_page() that the memory page it is dealing with is not typical memory from the linear map. The get_user_pages_fast() path, since it does not resolve the vma, is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we use that as a VM_MIXEDMAP replacement in some locations. In the cases where there is no pte to consult we fallback to using vma_is_dax() to detect the VM_MIXEDMAP special case. Now that we have explicit driver pfn_t-flag opt-in/opt-out for get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This also means we no longer need to worry about safely manipulating vm_flags in a future where we support dynamically changing the dax mode of a file. DAX should also now be supported with madvise_behavior(), vma_merge(), and copy_page_range(). This patch has been tested against ndctl unit test. It has also been tested against xfstests commit: 625515d using fake pmem created by memmap and no additional issues have been observed. Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17bitfield: avoid gcc-8 -Wint-in-bool-context warningArnd Bergmann1-1/+1
Passing an enum into FIELD_GET() produces a long but harmless warning on newer compilers: from include/linux/linkage.h:7, from include/linux/kernel.h:7, from include/linux/skbuff.h:17, from include/linux/if_ether.h:23, from include/linux/etherdevice.h:25, from drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c:63: drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c: In function 'iwl_mvm_rx_mpdu_mq': include/linux/bitfield.h:56:20: error: enum constant in boolean context [-Werror=int-in-bool-context] BUILD_BUG_ON_MSG(!(_mask), _pfx "mask is zero"); \ ^ ... include/linux/bitfield.h:103:3: note: in expansion of macro '__BF_FIELD_CHECK' __BF_FIELD_CHECK(_mask, _reg, 0U, "FIELD_GET: "); \ ^~~~~~~~~~~~~~~~ drivers/net/wireless/intel/iwlwifi/mvm/rxmq.c:1025:21: note: in expansion of macro 'FIELD_GET' le16_encode_bits(FIELD_GET(IWL_RX_HE_PHY_SIBG_SYM_OR_USER_NUM_MASK, The problem here is that the caller has no idea how the macro gets expanding, leading to a false-positive. It can be trivially avoided by doing a comparison against zero. This only recently started appearing as the iwlwifi driver was patched to use FIELD_GET. Link: http://lkml.kernel.org/r/20180813220950.194841-1-arnd@arndb.de Fixes: 514c30696fbc ("iwlwifi: add support for IEEE802.11ax") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: David Laight <David.Laight@ACULAB.COM> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-179p: add Dominique Martinet to MAINTAINERSDominique Martinet1-0/+2
Link: http://lkml.kernel.org/r/1533869305-29325-1-git-send-email-asmadeus@codewreck.org Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr> Acked-by: Andrew Morton <akpm@linux-foundation.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Cc: Ron Minnich <rminnich@sandia.gov> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-179p: remove Ron Minnich from MAINTAINERSDominique Martinet2-1/+5
Ron Minnich has left Sandia in 2011, and has not been involved in any 9p commit in recent years. Also add a CREDITS entry to record his contributions. Link: http://lkml.kernel.org/r/1534486244-1055-1-git-send-email-asmadeus@codewreck.org Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Ronald G. Minnich <rminnich@gmail.com> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17bpf: fix redirect to map under tail callsDaniel Borkmann7-63/+38
Commits 109980b894e9 ("bpf: don't select potentially stale ri->map from buggy xdp progs") and 7c3001313396 ("bpf: fix ri->map_owner pointer on bpf_prog_realloc") tried to mitigate that buggy programs using bpf_redirect_map() helper call do not leave stale maps behind. Idea was to add a map_owner cookie into the per CPU struct redirect_info which was set to prog->aux by the prog making the helper call as a proof that the map is not stale since the prog is implicitly holding a reference to it. This owner cookie could later on get compared with the program calling into BPF whether they match and therefore the redirect could proceed with processing the map safely. In (obvious) hindsight, this approach breaks down when tail calls are involved since the original caller's prog->aux pointer does not have to match the one from one of the progs out of the tail call chain, and therefore the xdp buffer will be dropped instead of redirected. A way around that would be to fix the issue differently (which also allows to remove related work in fast path at the same time): once the life-time of a redirect map has come to its end we use it's map free callback where we need to wait on synchronize_rcu() for current outstanding xdp buffers and remove such a map pointer from the redirect info if found to be present. At that time no program is using this map anymore so we simply invalidate the map pointers to NULL iff they previously pointed to that instance while making sure that the redirect path only reads out the map once. Fixes: 97f91a7cf04f ("bpf: add bpf_redirect_map helper routine") Fixes: 109980b894e9 ("bpf: don't select potentially stale ri->map from buggy xdp progs") Reported-by: Sebastiano Miano <sebastiano.miano@polito.it> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-17Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds233-7059/+12380
Pull rdma updates from Jason Gunthorpe: "This has been a large cycle for RDMA, with several major patch series reworking parts of the core code. - Rework the so-called 'gid cache' and internal APIs to use a kref'd pointer to a struct instead of copying, push this upwards into the callers and add more stuff to the struct. The new design avoids some ugly races the old one suffered with. This is part of the namespace enablement work as the new struct is learning to be namespace aware. - Various uapi cleanups, moving more stuff to include/uapi and fixing some long standing bugs that have recently been discovered. - Driver updates for mlx5, mlx4 i40iw, rxe, cxgb4, hfi1, usnic, pvrdma, and hns - Provide max_send_sge and max_recv_sge attributes to better support HW where these values are asymmetric. - mlx5 user API 'devx' allows sending commands directly to the device FW, instead of trying to cram every wild and niche feature into the common API. Sort of like what GPU does. - Major write() and ioctl() API rework to cleanly support PCI device hot unplug and advance the ioctl conversion work - Sparse and compile warning cleanups - Add 'const' to the ib_poll_cq() signature, and permit a NULL 'bad_wr', which is the common use case - Various patches to avoid high order allocations across the stack - SRQ support for cxgb4, hns and qedr - Changes to IPoIB to better follow the netdev model for working with struct net_device liftime" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (312 commits) Revert "net/smc: Replace ib_query_gid with rdma_get_gid_attr" RDMA/hns: Fix usage of bitmap allocation functions return values IB/core: Change filter function return type from int to bool IB/core: Update GID entries for netdevice whose mac address changes IB/core: Add default GIDs of the bond master netdev IB/core: Consider adding default GIDs of bond device IB/core: Delete lower netdevice default GID entries in bonding scenario IB/core: Avoid confusing del_netdev_default_ips IB/core: Add comment for change upper netevent handling qedr: Add user space support for SRQ qedr: Add support for kernel mode SRQ's qedr: Add wrapping generic structure for qpidr and adjust idr routines. IB/mlx5: Fix leaking stack memory to userspace Update the e-mail address of Bart Van Assche IB/ucm: Fix compiling ucm.c IB/uverbs: Do not check for device disassociation during ioctl IB/uverbs: Remove struct uverbs_root_spec and all supporting code IB/uverbs: Use uverbs_api to unmarshal ioctl commands IB/uverbs: Use uverbs_alloc for allocations IB/uverbs: Add a simple allocator to uverbs_attr_bundle ...
2018-08-17r8169: add missing Kconfig dependencyHeiner Kallweit1-0/+1
Now that we switched the r8169 driver to use phylib, there's a dependency on the Realtek PHY drivers. This dependency was missing in Kconfig. Reported-by: Jouni Mettälä <jtmettala@gmail.com> Fixes: f1e911d5d0df ("r8169: add basic phylib support") Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-17tools/bpf: fix bpf selftest test_cgroup_storage failureYonghong Song1-0/+1
The bpf selftest test_cgroup_storage failed in one of our production test servers. # sudo ./test_cgroup_storage Failed to create map: Operation not permitted It turns out this is due to insufficient locked memory with system default 16KB. Similar to other self tests, let us arm the process with unlimited locked memory. With this change, the test passed. # sudo ./test_cgroup_storage test_cgroup_storage:PASS Fixes: 68cfa3ac6b8d ("selftests/bpf: add a cgroup storage test") Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Roman Gushchin <guro@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-17Merge tag 'drm-next-2018-08-17-1' of git://anongit.freedesktop.org/drm/drmLinus Torvalds31-303/+9133
Pull drm msm support for adreno a6xx from Dave Airlie: "This is the support for new Qualcomm Snapdragon SoCs with the A6xx core. Userspace support is in mesa now" * tag 'drm-next-2018-08-17-1' of git://anongit.freedesktop.org/drm/drm: drm/msm: a6xx: fix spelling mistake: "initalization" -> "initialization" drm/msm: Add A6XX device support drm/msm: update generated headers drm/msm/adreno: Load the firmware before bringing up the hardware drm/msm: Add a helper function to parse clock names
2018-08-17Merge tag 'drm-next-2018-08-17' of git://anongit.freedesktop.org/drm/drmLinus Torvalds44-156/+346
Pull drm fixes from Dave Airlie: "First round of fixes for -rc1. I'll follow this up with the msm new hw support pull request. This just has three sets of fixes, some for msm before the new hw, a bunch of AMD fixes (includiing some required firmware changes for new hw), and a set of i915 (+gvt) fixes" * tag 'drm-next-2018-08-17' of git://anongit.freedesktop.org/drm/drm: (30 commits) drm/amdgpu: Use kvmalloc for allocating UVD/VCE/VCN BO backup memory drm/i915: set DP Main Stream Attribute for color range on DDI platforms drm/i915/selftests: Hold rpm for unparking drm/i915: Restore user forcewake domains across suspend drm/i915: Unmask user interrupts writes into HWSP on snb/ivb/vlv/hsw drm/i915/gvt: fix memory leak in intel_vgpu_ioctl() drm/i915/gvt: Off by one in intel_vgpu_write_fence() drm/i915/kvmgt: Fix potential Spectre v1 drm/i915/gvt: return error on cmd access drm/i915/gvt: initialize dmabuf mutex in vgpu_create drm/i915/gvt: fix cleanup sequence in intel_gvt_clean_device drm/amd/display: Guard against null crtc in CRC IRQ drm/amd/display: Pass connector id when executing VBIOS CT drm/amd/display: Check if clock source in use before disabling drm/amd/display: Allow clock sharing b/w HDMI and DVI drm/amd/display: Fix warning observed in mode change on Vega drm/amd/display: fix single link DVI has no display drm/amdgpu/vce: VCE entity initialization relies on ring initializtion drm/amdgpu/uvd: UVD entity initialization relys on ring initialization drm/amdgpu:add VCN booting with firmware loaded by PSP ...
2018-08-17Merge tag 'arm64-fixes' of ↵Linus Torvalds2-2/+6
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "A couple of arm64 fixes - Fix boot on Hikey-960 by avoiding an IPI with interrupts disabled - Fix address truncation in pfn_valid() implementation" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64: mm: check for upper PAGE_SHIFT bits in pfn_valid() arm64: Avoid calling stop_machine() when patching jump labels
2018-08-17Merge tag 'powerpc-4.19-1' of ↵Linus Torvalds363-5250/+5246
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "Notable changes: - A fix for a bug in our page table fragment allocator, where a page table page could be freed and reallocated for something else while still in use, leading to memory corruption etc. The fix reuses pt_mm in struct page (x86 only) for a powerpc only refcount. - Fixes to our pkey support. Several are user-visible changes, but bring us in to line with x86 behaviour and/or fix outright bugs. Thanks to Florian Weimer for reporting many of these. - A series to improve the hvc driver & related OPAL console code, which have been seen to cause hardlockups at times. The hvc driver changes in particular have been in linux-next for ~month. - Increase our MAX_PHYSMEM_BITS to 128TB when SPARSEMEM_VMEMMAP=y. - Remove Power8 DD1 and Power9 DD1 support, neither chip should be in use anywhere other than as a paper weight. - An optimised memcmp implementation using Power7-or-later VMX instructions - Support for barrier_nospec on some NXP CPUs. - Support for flushing the count cache on context switch on some IBM CPUs (controlled by firmware), as a Spectre v2 mitigation. - A series to enhance the information we print on unhandled signals to bring it into line with other arches, including showing the offending VMA and dumping the instructions around the fault. Thanks to: Aaro Koskinen, Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Alexey Spirkov, Alistair Popple, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Bartosz Golaszewski, Benjamin Herrenschmidt, Bharat Bhushan, Bjoern Noetel, Boqun Feng, Breno Leitao, Bryant G. Ly, Camelia Groza, Christophe Leroy, Christoph Hellwig, Cyril Bur, Dan Carpenter, Daniel Klamt, Darren Stevens, Dave Young, David Gibson, Diana Craciun, Finn Thain, Florian Weimer, Frederic Barrat, Gautham R. Shenoy, Geert Uytterhoeven, Geoff Levand, Guenter Roeck, Gustavo Romero, Haren Myneni, Hari Bathini, Joel Stanley, Jonathan Neuschäfer, Kees Cook, Madhavan Srinivasan, Mahesh Salgaonkar, Markus Elfring, Mathieu Malaterre, Mauro S. M. Rodrigues, Michael Hanselmann, Michael Neuling, Michael Schmitz, Mukesh Ojha, Murilo Opsfelder Araujo, Nicholas Piggin, Parth Y Shah, Paul Mackerras, Paul Menzel, Ram Pai, Randy Dunlap, Rashmica Gupta, Reza Arbab, Rodrigo R. Galvao, Russell Currey, Sam Bobroff, Scott Wood, Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stan Johnson, Thiago Jung Bauermann, Tyrel Datwyler, Vaibhav Jain, Vasant Hegde, Venkat Rao, zhong jiang" * tag 'powerpc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (234 commits) powerpc/mm/book3s/radix: Add mapping statistics powerpc/uaccess: Enable get_user(u64, *p) on 32-bit powerpc/mm/hash: Remove unnecessary do { } while(0) loop powerpc/64s: move machine check SLB flushing to mm/slb.c powerpc/powernv/idle: Fix build error powerpc/mm/tlbflush: update the mmu_gather page size while iterating address range powerpc/mm: remove warning about ‘type’ being set powerpc/32: Include setup.h header file to fix warnings powerpc: Move `path` variable inside DEBUG_PROM powerpc/powermac: Make some functions static powerpc/powermac: Remove variable x that's never read cxl: remove a dead branch powerpc/powermac: Add missing include of header pmac.h powerpc/kexec: Use common error handling code in setup_new_fdt() powerpc/xmon: Add address lookup for percpu symbols powerpc/mm: remove huge_pte_offset_and_shift() prototype powerpc/lib: Use patch_site to patch copy_32 functions once cache is enabled powerpc/pseries: Fix endianness while restoring of r3 in MCE handler. powerpc/fadump: merge adjacent memory ranges to reduce PT_LOAD segements powerpc/fadump: handle crash memory ranges array index overflow ...
2018-08-17Merge tag 'modules-for-v4.19' of ↵Linus Torvalds5-85/+100
git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux Pull modules updates from Jessica Yu: "Summary of modules changes for the 4.19 merge window: - Fix modules kallsyms for livepatch. Livepatch modules can have SHN_UNDEF symbols in their module symbol tables for later symbol resolution, but kallsyms shouldn't be returning these symbols - Some code cleanups and minor reshuffling in load_module() were done to log the module name when module signature verification fails" * tag 'modules-for-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux: kernel/module: Use kmemdup to replace kmalloc+memcpy ARM: module: fix modsign build error modsign: log module name in the event of an error module: replace VMLINUX_SYMBOL_STR() with __stringify() or string literal module: print sensible error code module: setup load info before module_sig_check() module: make it clear when we're handling the module copy in info->hdr module: exclude SHN_UNDEF symbols from kallsyms api
2018-08-17Merge tag 'vla-leftovers-v4.19-rc1' of ↵Linus Torvalds2-2/+10
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull VLA removal leftovers from Kees Cook: - bus/imx-weim: Use maximum register count to avoid VLA - drm/i2c/tda9950: Use maximum CEC message size to avoid VLA * tag 'vla-leftovers-v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: bus: imx-weim: Remove VLA usage drm/i2c: tda9950: Remove VLA usage
2018-08-17x86/speculation/l1tf: Exempt zeroed PTEs from inversionSean Christopherson1-1/+10
It turns out that we should *not* invert all not-present mappings, because the all zeroes case is obviously special. clear_page() does not undergo the XOR logic to invert the address bits, i.e. PTE, PMD and PUD entries that have not been individually written will have val=0 and so will trigger __pte_needs_invert(). As a result, {pte,pmd,pud}_pfn() will return the wrong PFN value, i.e. all ones (adjusted by the max PFN mask) instead of zero. A zeroed entry is ok because the page at physical address 0 is reserved early in boot specifically to mitigate L1TF, so explicitly exempt them from the inversion when reading the PFN. Manifested as an unexpected mprotect(..., PROT_NONE) failure when called on a VMA that has VM_PFNMAP and was mmap'd to as something other than PROT_NONE but never used. mprotect() sends the PROT_NONE request down prot_none_walk(), which walks the PTEs to check the PFNs. prot_none_pte_entry() gets the bogus PFN from pte_pfn() and returns -EACCES because it thinks mprotect() is trying to adjust a high MMIO address. [ This is a very modified version of Sean's original patch, but all credit goes to Sean for doing this and also pointing out that sometimes the __pte_needs_invert() function only gets the protection bits, not the full eventual pte. But zero remains special even in just protection bits, so that's ok. - Linus ] Fixes: f22cc87f6c1f ("x86/speculation/l1tf: Invert all not present mappings") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Acked-by: Andi Kleen <ak@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17Merge tag 'for-4.19/dm-changes' of ↵Linus Torvalds15-334/+690
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - A couple stable fixes for the DM writecache target. - A stable fix for the DM cache target that fixes the potential for data corruption after an unclean shutdown of a cache device using writeback mode. - Update DM integrity target to allow the metadata to be stored on a separate device from data. - Fix DM kcopyd and the snapshot target to cond_resched() where appropriate and be more efficient with processing completed work. - A few fixes and improvements for DM crypt. - Add DM delay target feature to configure delay of flushes independent of writes. - Update DM thin-provisioning target to include metadata_low_watermark threshold in pool status. - Fix stale DM thin-provisioning Documentation. * tag 'for-4.19/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (26 commits) dm writecache: fix a crash due to reading past end of dirty_bitmap dm crypt: don't decrease device limits dm cache metadata: set dirty on all cache blocks after a crash dm snapshot: remove stale FIXME in snapshot_map() dm snapshot: improve performance by switching out_of_order_list to rbtree dm kcopyd: avoid softlockup in run_complete_job dm cache metadata: save in-core policy_hint_size to on-disk superblock dm thin: stop no_space_timeout worker when switching to write-mode dm kcopyd: return void from dm_kcopyd_copy() dm thin: include metadata_low_watermark threshold in pool status dm writecache: report start_sector in status line dm crypt: convert essiv from ahash to shash dm crypt: use wake_up_process() instead of a wait queue dm integrity: recalculate checksums on creation dm integrity: flush journal on suspend when using separate metadata device dm integrity: use version 2 for separate metadata dm integrity: allow separate metadata device dm integrity: add ic->start in get_data_sector() dm integrity: report provided data sectors in the status dm integrity: implement fair range locks ...
2018-08-17Merge tag 'fsnotify_for_v4.19-rc1' of ↵Linus Torvalds9-138/+157
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "fsnotify cleanups from Amir and a small inotify improvement" * tag 'fsnotify_for_v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: inotify: Add flag IN_MASK_CREATE for inotify_add_watch() fanotify: factor out helpers to add/remove mark fsnotify: add helper to get mask from connector fsnotify: let connector point to an abstract object fsnotify: pass connp and object type to fsnotify_add_mark() fsnotify: use typedef fsnotify_connp_t for brevity
2018-08-17Merge tag 'for_v4.19-rc1' of ↵Linus Torvalds9-45/+33
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull UDF and ext2 update from Jan Kara. * tag 'for_v4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext2: use ktime_get_real_seconds for timestamps udf: convert inode stamps to timespec64
2018-08-17Merge branch 'topic/pl330' into for-linusVinod Koul1-6/+6
2018-08-17Merge branch 'topic/imx' into for-linusVinod Koul2-199/+380
2018-08-17Merge branch 'topic/xilinx' into for-linusVinod Koul3-0/+26
2018-08-17Merge branch 'topic/ste' into for-linusVinod Koul3-8/+19
2018-08-17Merge branch 'topic/renesas' into for-linusVinod Koul2-66/+47
2018-08-17Merge branch 'topic/owl' into for-linusVinod Koul4-0/+1027
2018-08-17Merge branch 'topic/nbpfaxi' into for-linusVinod Koul1-0/+1
2018-08-17Merge branch 'topic/mv_xor' into for-linusVinod Koul1-6/+10
2018-08-17Merge branch 'topic/ioat' into for-linusVinod Koul1-0/+6
2018-08-17Merge branch 'topic/intel' into for-linusVinod Koul2-0/+16
2018-08-17Merge branch 'topic/async_tx' into for-linusVinod Koul2-5/+9
2018-08-17arm64: mm: check for upper PAGE_SHIFT bits in pfn_valid()Greg Hackmann1-1/+5
ARM64's pfn_valid() shifts away the upper PAGE_SHIFT bits of the input before seeing if the PFN is valid. This leads to false positives when some of the upper bits are set, but the lower bits match a valid PFN. For example, the following userspace code looks up a bogus entry in /proc/kpageflags: int pagemap = open("/proc/self/pagemap", O_RDONLY); int pageflags = open("/proc/kpageflags", O_RDONLY); uint64_t pfn, val; lseek64(pagemap, [...], SEEK_SET); read(pagemap, &pfn, sizeof(pfn)); if (pfn & (1UL << 63)) { /* valid PFN */ pfn &= ((1UL << 55) - 1); /* clear flag bits */ pfn |= (1UL << 55); lseek64(pageflags, pfn * sizeof(uint64_t), SEEK_SET); read(pageflags, &val, sizeof(val)); } On ARM64 this causes the userspace process to crash with SIGSEGV rather than reading (1 << KPF_NOPAGE). kpageflags_read() treats the offset as valid, and stable_page_flags() will try to access an address between the user and kernel address ranges. Fixes: c1cc1552616d ("arm64: MMU initialisation") Cc: stable@vger.kernel.org Signed-off-by: Greg Hackmann <ghackmann@google.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-08-17arm64: Avoid calling stop_machine() when patching jump labelsWill Deacon1-1/+1
Patching a jump label involves patching a single instruction at a time, swizzling between a branch and a NOP. The architecture treats these instructions specially, so a concurrently executing CPU is guaranteed to see either the NOP or the branch, rather than an amalgamation of the two instruction encodings. However, in order to guarantee that the new instruction is visible, it is necessary to send an IPI to the concurrently executing CPU so that it discards any previously fetched instructions from its pipeline. This operation therefore cannot be completed from a context with IRQs disabled, but this is exactly what happens on the jump label path where the hotplug lock is held and irqs are subsequently disabled by stop_machine_cpuslocked(). This results in a deadlock during boot on Hikey-960. Due to the architectural guarantees around patching NOPs and branches, we don't actually need to stop_machine() at all on the jump label path, so we can avoid the deadlock by using the "nosync" variant of our instruction patching routine. Fixes: 693350a79980 ("arm64: insn: Don't fallback on nosync path for general insn patching") Reported-by: Tuomas Tynkkynen <tuomas.tynkkynen@iki.fi> Reported-by: John Stultz <john.stultz@linaro.org> Tested-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: Tuomas Tynkkynen <tuomas@tuxera.com> Tested-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Will Deacon <will.deacon@arm.com>
2018-08-17Merge tag 'drm-msm-next-2018-08-10' of ↵Dave Airlie31-303/+9133
git://people.freedesktop.org/~robclark/linux into drm-next An optional follow-on PR for 4.19, on top of previous -fixes PR, which brings in a6xx support. These patches have been on list since earlier in the year (mostly waiting for userspace). They have been in linux-next since earlier in the week, now that we have freedreno userspace working on a6xx[1][2]. So far glmark2, Chromium/ChromiumOS, gnome-shell, glamor, xonotic, etc, are working. And a healthy chuck of deqp works, and I've been busy fixing things. The needed libdrm changes (no new uapi changes needed) are already on master, and the 2nd branch is rebased on that. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Rob Clark <robdclark@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/CAF6AEGuCKekZ2Dho80qxODT1BEUGg4hbq33ACUy5VXs3dHbDLA@mail.gmail.com
2018-08-16remoteproc/davinci: use the reset frameworkBartosz Golaszewski1-5/+29
Switch to using the reset framework instead of handcoded reset routines we used so far. Reviewed-by: Sekhar Nori <nsekhar@ti.com> Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de> Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
2018-08-17Merge tag 'drm-intel-next-fixes-2018-08-16-1' of ↵Dave Airlie15-53/+117
git://anongit.freedesktop.org/drm/drm-intel into drm-next Fixes for: - DP full color range. - selftest for gem_object - forcewake on suspend - GPU reset This also include accumulated fixes from GVT: - Fix an error code in gvt_dma_map_page() (Dan) - Fix off by one error in intel_vgpu_write_fence() (Dan) - Fix potential Spectre v1 (Gustavo) - Fix workload free in vgpu release (Henry) - Fix cleanup sequence in intel_gvt_clean_device (Henry) - dmabuf mutex init place fix (Henry) - possible memory leak in intel_vgpu_ioctl() err path (Yi) - return error on cmd access check failure (Yan) Signed-off-by: Dave Airlie <airlied@redhat.com> From: Rodrigo Vivi <rodrigo.vivi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180816190335.GA7765@intel.com
2018-08-17Merge branch 'drm-next-4.19' of git://people.freedesktop.org/~agd5f/linux ↵Dave Airlie23-65/+189
into drm-next Fixes for 4.19: - Add VCN PSP FW loading for RV (this is required on upcoming parts) - Fix scheduler setup ordering for VCE and UVD - Few misc display fixes Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexdeucher@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/20180816181840.2786-1-alexander.deucher@amd.com
2018-08-17Merge tag 'drm-msm-fixes-2018-08-10' of ↵Dave Airlie6-38/+40
git://people.freedesktop.org/~robclark/linux into drm-next Some small msm fixes. Signed-off-by: Dave Airlie <airlied@redhat.com> From: Rob Clark <robdclark@gmail.com> Link: https://patchwork.freedesktop.org/patch/msgid/CAF6AEGuZE0VEpatrtxGZtUB6FaQYr6Gf07UVpMsD15ook+5_WQ@mail.gmail.com
2018-08-16Merge branch 'sockmap-ulp-fixes'Alexei Starovoitov4-53/+76
Daniel Borkmann says: ==================== Batch of various fixes related to BPF sockmap and ULP, including adding module alias to restrict module requests, races and memory leaks in sockmap code. For details please refer to the individual patches. Thanks! ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-16bpf, sockmap: fix sock_map_ctx_update_elem race with exist/noexistDaniel Borkmann1-49/+57
The current code in sock_map_ctx_update_elem() allows for BPF_EXIST and BPF_NOEXIST map update flags. While on array-like maps this approach is rather uncommon, e.g. bpf_fd_array_map_update_elem() and others enforce map update flags to be BPF_ANY such that xchg() can be used directly, the current implementation in sock map does not guarantee that such operation with BPF_EXIST / BPF_NOEXIST is atomic. The initial test does a READ_ONCE(stab->sock_map[i]) to fetch the socket from the slot which is then tested for NULL / non-NULL. However later after __sock_map_ctx_update_elem(), the actual update is done through osock = xchg(&stab->sock_map[i], sock). Problem is that in the meantime a different CPU could have updated / deleted a socket on that specific slot and thus flag contraints won't hold anymore. I've been thinking whether best would be to just break UAPI and do an enforcement of BPF_ANY to check if someone actually complains, however trouble is that already in BPF kselftest we use BPF_NOEXIST for the map update, and therefore it might have been copied into applications already. The fix to keep the current behavior intact would be to add a map lock similar to the sock hash bucket lock only for covering the whole map. Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-16bpf, sockmap: fix map elem deletion race with smap_stop_sockDaniel Borkmann1-1/+4
The smap_start_sock() and smap_stop_sock() are each protected under the sock->sk_callback_lock from their call-sites except in the case of sock_map_delete_elem() where we drop the old socket from the map slot. This is racy because the same sock could be part of multiple sock maps, so we run smap_stop_sock() in parallel, and given at that point psock->strp_enabled might be true on both CPUs, we might for example wrongly restore the sk->sk_data_ready / sk->sk_write_space. Therefore, hold the sock->sk_callback_lock as well on delete. Looks like 2f857d04601a ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support") had this right, but later on e9db4ef6bf4c ("bpf: sockhash fix omitted bucket lock in sock_close") removed it again from delete leaving this smap_stop_sock() instance unprotected. Fixes: e9db4ef6bf4c ("bpf: sockhash fix omitted bucket lock in sock_close") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-16bpf, sockmap: fix leakage of smap_psock_map_entryDaniel Borkmann1-2/+7
While working on sockmap I noticed that we do not always kfree the struct smap_psock_map_entry list elements which track psocks attached to maps. In the case of sock_hash_ctx_update_elem(), these map entries are allocated outside of __sock_map_ctx_update_elem() with their linkage to the socket hash table filled. In the case of sock array, the map entries are allocated inside of __sock_map_ctx_update_elem() and added with their linkage to the psock->maps. Both additions are under psock->maps_lock each. Now, we drop these elements from their psock->maps list in a few occasions: i) in sock array via smap_list_map_remove() when an entry is either deleted from the map from user space, or updated via user space or BPF program where we drop the old socket at that map slot, or the sock array is freed via sock_map_free() and drops all its elements; ii) for sock hash via smap_list_hash_remove() in exactly the same occasions as just described for sock array; iii) in the bpf_tcp_close() where we remove the elements from the list via psock_map_pop() and iterate over them dropping themselves from either sock array or sock hash; and last but not least iv) once again in smap_gc_work() which is a callback for deferring the work once the psock refcount hit zero and thus the socket is being destroyed. Problem is that the only case where we kfree() the list entry is in case iv), which at that point should have an empty list in normal cases. So in cases from i) to iii) we unlink the elements without freeing where they go out of reach from us. Hence fix is to properly kfree() them as well to stop the leakage. Given these are all handled under psock->maps_lock there is no need for deferred RCU freeing. I later also ran with kmemleak detector and it confirmed the finding as well where in the state before the fix the object goes unreferenced while after the patch no kmemleak report related to BPF showed up. [...] unreferenced object 0xffff880378eadae0 (size 64): comm "test_sockmap", pid 2225, jiffies 4294720701 (age 43.504s) hex dump (first 32 bytes): 00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de ................ 50 4d 75 5d 03 88 ff ff 00 00 00 00 00 00 00 00 PMu]............ backtrace: [<000000005225ac3c>] sock_map_ctx_update_elem.isra.21+0xd8/0x210 [<0000000045dd6d3c>] bpf_sock_map_update+0x29/0x60 [<00000000877723aa>] ___bpf_prog_run+0x1e1f/0x4960 [<000000002ef89e83>] 0xffffffffffffffff unreferenced object 0xffff880378ead240 (size 64): comm "test_sockmap", pid 2225, jiffies 4294720701 (age 43.504s) hex dump (first 32 bytes): 00 01 00 00 00 00 ad de 00 02 00 00 00 00 ad de ................ 00 44 75 5d 03 88 ff ff 00 00 00 00 00 00 00 00 .Du]............ backtrace: [<000000005225ac3c>] sock_map_ctx_update_elem.isra.21+0xd8/0x210 [<0000000030e37a3a>] sock_map_update_elem+0x125/0x240 [<000000002e5ce36e>] map_update_elem+0x4eb/0x7b0 [<00000000db453cc9>] __x64_sys_bpf+0x1f9/0x360 [<0000000000763660>] do_syscall_64+0x9a/0x300 [<00000000422a2bb2>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [<000000002ef89e83>] 0xffffffffffffffff [...] Fixes: e9db4ef6bf4c ("bpf: sockhash fix omitted bucket lock in sock_close") Fixes: 54fedb42c653 ("bpf: sockmap, fix smap_list_map_remove when psock is in many maps") Fixes: 2f857d04601a ("bpf: sockmap, remove STRPARSER map_flags and add multi-map support") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-16tcp, ulp: fix leftover icsk_ulp_ops preventing sock from reattachDaniel Borkmann1-0/+2
I found that in BPF sockmap programs once we either delete a socket from the map or we updated a map slot and the old socket was purged from the map that these socket can never get reattached into a map even though their related psock has been dropped entirely at that point. Reason is that tcp_cleanup_ulp() leaves the old icsk->icsk_ulp_ops intact, so that on the next tcp_set_ulp_id() the kernel returns an -EEXIST thinking there is still some active ULP attached. BPF sockmap is the only one that has this issue as the other user, kTLS, only calls tcp_cleanup_ulp() from tcp_v4_destroy_sock() whereas sockmap semantics allow dropping the socket from the map with all related psock state being cleaned up. Fixes: 1aa12bdf1bfb ("bpf: sockmap, add sock close() hook to remove socks") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-16tcp, ulp: add alias for all ulp modulesDaniel Borkmann3-1/+6
Lets not turn the TCP ULP lookup into an arbitrary module loader as we only intend to load ULP modules through this mechanism, not other unrelated kernel modules: [root@bar]# cat foo.c #include <sys/types.h> #include <sys/socket.h> #include <linux/tcp.h> #include <linux/in.h> int main(void) { int sock = socket(PF_INET, SOCK_STREAM, 0); setsockopt(sock, IPPROTO_TCP, TCP_ULP, "sctp", sizeof("sctp")); return 0; } [root@bar]# gcc foo.c -O2 -Wall [root@bar]# lsmod | grep sctp [root@bar]# ./a.out [root@bar]# lsmod | grep sctp sctp 1077248 4 libcrc32c 16384 3 nf_conntrack,nf_nat,sctp [root@bar]# Fix it by adding module alias to TCP ULP modules, so probing module via request_module() will be limited to tcp-ulp-[name]. The existing modules like kTLS will load fine given tcp-ulp-tls alias, but others will fail to load: [root@bar]# lsmod | grep sctp [root@bar]# ./a.out [root@bar]# lsmod | grep sctp [root@bar]# Sockmap is not affected from this since it's either built-in or not. Fixes: 734942cc4ea6 ("tcp: ULP infrastructure") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-16Merge branch 'linus/master' into rdma.git for-nextJason Gunthorpe7051-128656/+339123
rdma.git merge resolution for the 4.19 merge window Conflicts: drivers/infiniband/core/rdma_core.c - Use the rdma code and revise with the new spelling for atomic_fetch_add_unless drivers/nvme/host/rdma.c - Replace max_sge with max_send_sge in new blk code drivers/nvme/target/rdma.c - Use the blk code and revise to use NULL for ib_post_recv when appropriate - Replace max_sge with max_recv_sge in new blk code net/rds/ib_send.c - Use the net code and revise to use NULL for ib_post_recv when appropriate Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-16Revert "net/smc: Replace ib_query_gid with rdma_get_gid_attr"Jason Gunthorpe3-23/+21
This reverts commit ddb457c6993babbcdd41fca638b870d2a2fc3941. The include rdma/ib_cache.h is kept, and we have to add a memset to the compat wrapper to avoid compiler warnings in gcc-7 This revert is done to avoid extensive merge conflicts with SMC changes in netdev during the 4.19 merge window. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-16bpf: fix a rcu usage warning in bpf_prog_array_copy_core()Yonghong Song1-1/+1
Commit 394e40a29788 ("bpf: extend bpf_prog_array to store pointers to the cgroup storage") refactored the bpf_prog_array_copy_core() to accommodate new structure bpf_prog_array_item which contains bpf_prog array itself. In the old code, we had perf_event_query_prog_array(): mutex_lock(...) bpf_prog_array_copy_call(): prog = rcu_dereference_check(array, 1)->progs bpf_prog_array_copy_core(prog, ...) mutex_unlock(...) With the above commit, we had perf_event_query_prog_array(): mutex_lock(...) bpf_prog_array_copy_call(): bpf_prog_array_copy_core(array, ...): item = rcu_dereference(array)->items; ... mutex_unlock(...) The new code will trigger a lockdep rcu checking warning. The fix is to change rcu_dereference() to rcu_dereference_check() to prevent such a warning. Reported-by: syzbot+6e72317008eef84a216b@syzkaller.appspotmail.com Fixes: 394e40a29788 ("bpf: extend bpf_prog_array to store pointers to the cgroup storage") Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Roman Gushchin <guro@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-16samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERMJesper Dangaard Brouer2-2/+4
It is common XDP practice to unload/deattach the XDP bpf program, when the XDP sample program is Ctrl-C interrupted (SIGINT) or killed (SIGTERM). The samples/bpf programs xdp_redirect_cpu and xdp_rxq_info, forgot to trap signal SIGTERM (which is the default signal used by the kill command). This was discovered by Red Hat QA, which automated scripts depend on killing the XDP sample program after a timeout period. Fixes: fad3917e361b ("samples/bpf: add cpumap sample program xdp_redirect_cpu") Fixes: 0fca931a6f21 ("samples/bpf: program demonstrating access to xdp_rxq_info") Reported-by: Jean-Tsung Hsiao <jhsiao@redhat.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-16net/xdp: Fix suspicious RCU usage warningTariq Toukan1-11/+3
Fix the warning below by calling rhashtable_lookup_fast. Also, make some code movements for better quality and human readability. [ 342.450870] WARNING: suspicious RCU usage [ 342.455856] 4.18.0-rc2+ #17 Tainted: G O [ 342.462210] ----------------------------- [ 342.467202] ./include/linux/rhashtable.h:481 suspicious rcu_dereference_check() usage! [ 342.476568] [ 342.476568] other info that might help us debug this: [ 342.476568] [ 342.486978] [ 342.486978] rcu_scheduler_active = 2, debug_locks = 1 [ 342.495211] 4 locks held by modprobe/3934: [ 342.500265] #0: 00000000e23116b2 (mlx5_intf_mutex){+.+.}, at: mlx5_unregister_interface+0x18/0x90 [mlx5_core] [ 342.511953] #1: 00000000ca16db96 (rtnl_mutex){+.+.}, at: unregister_netdev+0xe/0x20 [ 342.521109] #2: 00000000a46e2c4b (&priv->state_lock){+.+.}, at: mlx5e_close+0x29/0x60 [mlx5_core] [ 342.531642] #3: 0000000060c5bde3 (mem_id_lock){+.+.}, at: xdp_rxq_info_unreg+0x93/0x6b0 [ 342.541206] [ 342.541206] stack backtrace: [ 342.547075] CPU: 12 PID: 3934 Comm: modprobe Tainted: G O 4.18.0-rc2+ #17 [ 342.556621] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015 [ 342.565606] Call Trace: [ 342.568861] dump_stack+0x78/0xb3 [ 342.573086] xdp_rxq_info_unreg+0x3f5/0x6b0 [ 342.578285] ? __call_rcu+0x220/0x300 [ 342.582911] mlx5e_free_rq+0x38/0xc0 [mlx5_core] [ 342.588602] mlx5e_close_channel+0x20/0x120 [mlx5_core] [ 342.594976] mlx5e_close_channels+0x26/0x40 [mlx5_core] [ 342.601345] mlx5e_close_locked+0x44/0x50 [mlx5_core] [ 342.607519] mlx5e_close+0x42/0x60 [mlx5_core] [ 342.613005] __dev_close_many+0xb1/0x120 [ 342.617911] dev_close_many+0xa2/0x170 [ 342.622622] rollback_registered_many+0x148/0x460 [ 342.628401] ? __lock_acquire+0x48d/0x11b0 [ 342.633498] ? unregister_netdev+0xe/0x20 [ 342.638495] rollback_registered+0x56/0x90 [ 342.643588] unregister_netdevice_queue+0x7e/0x100 [ 342.649461] unregister_netdev+0x18/0x20 [ 342.654362] mlx5e_remove+0x2a/0x50 [mlx5_core] [ 342.659944] mlx5_remove_device+0xe5/0x110 [mlx5_core] [ 342.666208] mlx5_unregister_interface+0x39/0x90 [mlx5_core] [ 342.673038] cleanup+0x5/0xbfc [mlx5_core] [ 342.678094] __x64_sys_delete_module+0x16b/0x240 [ 342.683725] ? do_syscall_64+0x1c/0x210 [ 342.688476] do_syscall_64+0x5a/0x210 [ 342.693025] entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: 8d5d88527587 ("xdp: rhashtable with allocator ID to pointer mapping") Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Suggested-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-16net/mlx5e: Delete unneeded function argumentYuval Shaia1-2/+2
priv argument is not used by the function, delete it. Fixes: a89842811ea98 ("net/mlx5e: Merge per priority stats groups") Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-16Documentation: networking: ti-cpsw: correct cbs parameters for Eth1 100MbIvan Khoronzhuk1-5/+6
If set cbs parameters calculated for 1000Mb, but use on 100Mb port w/o h/w offload (for cpsw offload it doesn't matter), it works incorrectly. According to the example and testing board, second port is 100Mb interface. Correct them on recalculated for 100Mb interface. It allows to use the same command for CBS software implementation for board in example. Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-16isdn: Disable IIOCDBGVARKees Cook1-7/+1
It was possible to directly leak the kernel address where the isdn_dev structure pointer was stored. This is a kernel ASLR bypass for anyone with access to the ioctl. The code had been present since the beginning of git history, though this shouldn't ever be needed for normal operation, therefore remove it. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Karsten Keil <isdn@linux-pingi.de> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-16net: dsa: add support for ksz9897 ethernet switchLad, Prabhakar3-1/+13
ksz9477 is superset of ksz9xx series, driver just works out of the box for ksz9897 chip with this patch. Signed-off-by: Lad, Prabhakar <prabhakar.csengg@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-16veth: Free queues on link deleteToshiaki Makita1-37/+33
David Ahern reported memory leak in veth. ======================================================================= $ cat /sys/kernel/debug/kmemleak unreferenced object 0xffff8800354d5c00 (size 1024): comm "ip", pid 836, jiffies 4294722952 (age 25.904s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<(____ptrval____)>] kmemleak_alloc+0x70/0x94 [<(____ptrval____)>] slab_post_alloc_hook+0x42/0x52 [<(____ptrval____)>] __kmalloc+0x101/0x142 [<(____ptrval____)>] kmalloc_array.constprop.20+0x1e/0x26 [veth] [<(____ptrval____)>] veth_newlink+0x147/0x3ac [veth] ... unreferenced object 0xffff88002e009c00 (size 1024): comm "ip", pid 836, jiffies 4294722958 (age 25.898s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<(____ptrval____)>] kmemleak_alloc+0x70/0x94 [<(____ptrval____)>] slab_post_alloc_hook+0x42/0x52 [<(____ptrval____)>] __kmalloc+0x101/0x142 [<(____ptrval____)>] kmalloc_array.constprop.20+0x1e/0x26 [veth] [<(____ptrval____)>] veth_newlink+0x219/0x3ac [veth] ======================================================================= veth_rq allocated in veth_newlink() was not freed on dellink. We need to free up them after veth_close() so that any packets will not reference the queues afterwards. Thus free them in veth_dev_free() in the same way as freeing stats structure (vstats). Also move queues allocation to veth_dev_init() to be in line with stats allocation. Fixes: 638264dc90227 ("veth: Support per queue XDP ring") Reported-by: David Ahern <dsahern@gmail.com> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Reviewed-by: David Ahern <dsahern@gmail.com> Tested-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-16ila: make lockdep happy againCong Wang2-7/+21
Previously, alloc_ila_locks() and bucket_table_alloc() call spin_lock_init() separately, therefore they have two different lock names and lock class keys. However, after commit b893281715ab ("ila: Call library function alloc_bucket_locks") they both call helper alloc_bucket_spinlocks() which now only has one lock name and lock class key. This causes a few bogus lockdep warnings as reported by syzbot. Fix this by making alloc_bucket_locks() a macro and pass declaration name as lock name and a static lock class key inside the macro. Fixes: b893281715ab ("ila: Call library function alloc_bucket_locks") Reported-by: <syzbot+b66a5a554991a8ed027c@syzkaller.appspotmail.com> Cc: Tom Herbert <tom@quantonium.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-16net: sched: act_ife: always release ife action on init errorVlad Buslov1-6/+2
Action init API was changed to always take reference to action, even when overwriting existing action. Substitute conditional action release, which was executed only if action is newly created, with unconditional release in tcf_ife_init() error handling code to prevent double free or memory leak in case of overwrite. Fixes: 4e8ddd7f1758 ("net: sched: don't release reference on action overwrite") Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-16Merge tag 'v4.18' into rdma.git for-nextJason Gunthorpe1684-9534/+18047
Resolve merge conflicts from the -rc cycle against the rdma.git tree: Conflicts: drivers/infiniband/core/uverbs_cmd.c - New ifs added to ib_uverbs_ex_create_flow in -rc and for-next - Merge removal of file->ucontext in for-next with new code in -rc drivers/infiniband/core/uverbs_main.c - for-next removed code from ib_uverbs_write() that was modified in for-rc Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-16dt-bindings: net: ravb: Add support for r8a774a1 SoCFabrizio Castro1-1/+2
Document RZ/G2M (R8A774A1) SoC bindings. Signed-off-by: Fabrizio Castro <fabrizio.castro@bp.renesas.com> Reviewed-by: Biju Das <biju.das@bp.renesas.com> Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-16cls_matchall: fix tcf_unbind_filter missingHangbin Liu1-0/+2
Fix tcf_unbind_filter missing in cls_matchall as this will trigger WARN_ON() in cbq_destroy_class(). Fixes: fd62d9f5c575f ("net/sched: matchall: Fix configuration race") Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-16Merge branch 'next' into for-linusDmitry Torokhov488-2894/+6367
Prepare input updates for 4.19 merge window.
2018-08-16drm/amdgpu: Use kvmalloc for allocating UVD/VCE/VCN BO backup memoryMichel Dänzer3-8/+8
The allocated size can be (at least?) as large as megabytes, and there's no need for it to be physically contiguous. May avoid spurious failures to initialize / suspend the corresponding block while there's memory pressure. Bugzilla: https://bugs.freedesktop.org/107432 Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Michel Dänzer <michel.daenzer@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2018-08-16Merge tag 'for-linus-4.19-ofs1' of ↵Linus Torvalds2-12/+10
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs updates from Mike Marshall: "Orangefs: one cleanup and Souptick's vm_fault_t patch: - add new return type vm_fault_t (Souptick Joarder) - remove redundant pointer (Colin Ian King)" * tag 'for-linus-4.19-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: orangefs: remove redundant pointer orangefs_inode orangefs: Adding new return type vm_fault_t
2018-08-16dm writecache: fix a crash due to reading past end of dirty_bitmapMikulas Patocka1-1/+1
wc->dirty_bitmap_size is in bytes so must multiply it by 8, not by BITS_PER_LONG, to get number of bitmap_bits. Fixes crash in find_next_bit() that was reported: https://bugzilla.kernel.org/show_bug.cgi?id=200819 Reported-by: edo.rus@gmail.com Fixes: 48debafe4f2f ("dm: add writecache target") Cc: stable@vger.kernel.org # 4.18 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
2018-08-16netfilter: nft_dynset: allow dynamic updates of non-anonymous setPablo Neira Ayuso1-2/+0
This check is superfluous since it breaks valid configurations, remove it. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-16netfilter: nft_tproxy: Fix missing-braces warningMáté Eckl1-1/+3
This patch fixes a warning reported by the kbuild test robot (from linux-next tree): net/netfilter/nft_tproxy.c: In function 'nft_tproxy_eval_v6': >> net/netfilter/nft_tproxy.c:85:9: warning: missing braces around initializer [-Wmissing-braces] struct in6_addr taddr = {0}; ^ net/netfilter/nft_tproxy.c:85:9: warning: (near initialization for 'taddr.in6_u') [-Wmissing-braces] This warning is actually caused by a gcc bug already resolved in newer versions (kbuild used 4.9) so this kind of initialization is omitted and memset is used instead. Fixes: 4ed8eb6570a4 ("netfilter: nf_tables: Add native tproxy support") Signed-off-by: Máté Eckl <ecklm94@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-16netfilter: uapi: fix linux/netfilter/nf_osf.h userspace compilation errorsDmitry V. Levin2-2/+2
Move inclusion of <linux/ip.h> and <linux/tcp.h> from linux/netfilter/xt_osf.h to linux/netfilter/nf_osf.h to fix the following linux/netfilter/nf_osf.h userspace compilation errors: /usr/include/linux/netfilter/nf_osf.h:59:24: error: 'MAX_IPOPTLEN' undeclared here (not in a function) struct nf_osf_opt opt[MAX_IPOPTLEN]; /usr/include/linux/netfilter/nf_osf.h:64:17: error: field 'ip' has incomplete type struct iphdr ip; /usr/include/linux/netfilter/nf_osf.h:65:18: error: field 'tcp' has incomplete type struct tcphdr tcp; Fixes: bfb15f2a95cb ("netfilter: extract Passive OS fingerprint infrastructure from xt_osf") Signed-off-by: Dmitry V. Levin <ldv@altlinux.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-16netfilter: nft_ct: make l3 protocol field optional for timeout objectHarsha Sharma1-3/+4
If l3 protocol value is not specified for ct timeout object then use the value from nft_ctx protocol family. Signed-off-by: Harsha Sharma <harshasharmaiitr@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>