aboutsummaryrefslogtreecommitdiffstats
path: root/ipc
AgeCommit message (Collapse)AuthorFilesLines
2018-06-04Merge branch 'timers-2038-for-linus' of ↵Linus Torvalds6-65/+99
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull time/Y2038 updates from Thomas Gleixner: - Consolidate SySV IPC UAPI headers - Convert SySV IPC to the new COMPAT_32BIT_TIME mechanism - Cleanup the core interfaces and standardize on the ktime_get_* naming convention. - Convert the X86 platform ops to timespec64 - Remove the ugly temporary timespec64 hack * 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits) x86: Convert x86_platform_ops to timespec64 timekeeping: Add more coarse clocktai/boottime interfaces timekeeping: Add ktime_get_coarse_with_offset timekeeping: Standardize on ktime_get_*() naming timekeeping: Clean up ktime_get_real_ts64 timekeeping: Remove timespec64 hack y2038: ipc: Redirect ipc(SEMTIMEDOP, ...) to compat_ksys_semtimedop y2038: ipc: Enable COMPAT_32BIT_TIME y2038: ipc: Use __kernel_timespec y2038: ipc: Report long times to user space y2038: ipc: Use ktime_get_real_seconds consistently y2038: xtensa: Extend sysvipc data structures y2038: powerpc: Extend sysvipc data structures y2038: sparc: Extend sysvipc data structures y2038: parisc: Extend sysvipc data structures y2038: mips: Extend sysvipc data structures y2038: arm64: Extend sysvipc compat data structures y2038: s390: Remove unneeded ipc uapi header files y2038: ia64: Remove unneeded ipc uapi header files y2038: alpha: Remove unneeded ipc uapi header files ...
2018-05-25ipc/shm: fix shmat() nil address after round-down when remappingDavidlohr Bueso1-2/+10
shmat()'s SHM_REMAP option forbids passing a nil address for; this is in fact the very first thing we check for. Andrea reported that for SHM_RND|SHM_REMAP cases we can end up bypassing the initial addr check, but we need to check again if the address was rounded down to nil. As of this patch, such cases will return -EINVAL. Link: http://lkml.kernel.org/r/20180503204934.kk63josdu6u53fbd@linux-n805 Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-05-25Revert "ipc/shm: Fix shmat mmap nil-page protection"Davidlohr Bueso1-7/+2
Patch series "ipc/shm: shmat() fixes around nil-page". These patches fix two issues reported[1] a while back by Joe and Andrea around how shmat(2) behaves with nil-page. The first reverts a commit that it was incorrectly thought that mapping nil-page (address=0) was a no no with MAP_FIXED. This is not the case, with the exception of SHM_REMAP; which is address in the second patch. I chose two patches because it is easier to backport and it explicitly reverts bogus behaviour. Both patches ought to be in -stable and ltp testcases need updated (the added testcase around the cve can be modified to just test for SHM_RND|SHM_REMAP). [1] lkml.kernel.org/r/20180430172152.nfa564pvgpk3ut7p@linux-n805 This patch (of 2): Commit 95e91b831f87 ("ipc/shm: Fix shmat mmap nil-page protection") worked on the idea that we should not be mapping as root addr=0 and MAP_FIXED. However, it was reported that this scenario is in fact valid, thus making the patch both bogus and breaks userspace as well. For example X11's libint10.so relies on shmat(1, SHM_RND) for lowmem initialization[1]. [1] https://cgit.freedesktop.org/xorg/xserver/tree/hw/xfree86/os-support/linux/int10/linux.c#n347 Link: http://lkml.kernel.org/r/20180503203243.15045-2-dave@stgolabs.net Fixes: 95e91b831f87 ("ipc/shm: Fix shmat mmap nil-page protection") Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reported-by: Joe Lawrence <joe.lawrence@redhat.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-20y2038: ipc: Redirect ipc(SEMTIMEDOP, ...) to compat_ksys_semtimedopArnd Bergmann1-3/+10
32-bit architectures implementing 64BIT_TIME and COMPAT_32BIT_TIME need to have the traditional semtimedop() behavior with 32-bit timestamps for sys_ipc() by calling compat_ksys_semtimedop(), while those that are not yet converted need to keep using ksys_semtimedop() like 64-bit architectures do. Note that I chose to not implement a new SEMTIMEDOP64 function that corresponds to the new sys_semtimedop() with 64-bit timeouts. The reason here is that sys_ipc() should no longer be used for new system calls, and libc should just call the semtimedop syscall directly. One open question remain to whether we want to completely avoid the sys_ipc() system call for architectures that do not yet have all the individual calls as they get converted to 64-bit time_t. Doing that would require adding several extra system calls on m68k, mips, powerpc, s390, sh, sparc, and x86-32. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-20y2038: ipc: Enable COMPAT_32BIT_TIMEArnd Bergmann3-41/+44
Three ipc syscalls (mq_timedsend, mq_timedreceive and and semtimedop) take a timespec argument. After we move 32-bit architectures over to useing 64-bit time_t based syscalls, we need seperate entry points for the old 32-bit based interfaces. This changes the #ifdef guards for the existing 32-bit compat syscalls to check for CONFIG_COMPAT_32BIT_TIME instead, which will then be enabled on all existing 32-bit architectures. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-20y2038: ipc: Use __kernel_timespecArnd Bergmann3-6/+6
This is a preparatation for changing over __kernel_timespec to 64-bit times, which involves assigning new system call numbers for mq_timedsend(), mq_timedreceive() and semtimedop() for compatibility with future y2038 proof user space. The existing ABIs will remain available through compat code. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-20y2038: ipc: Report long times to user spaceArnd Bergmann3-9/+33
The shmid64_ds/semid64_ds/msqid64_ds data structures have been extended to contain extra fields for storing the upper bits of the time stamps, this patch does the other half of the job and and fills the new fields on 32-bit architectures as well as 32-bit tasks running on a 64-bit kernel in compat mode. There should be no change for native 64-bit tasks. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-20y2038: ipc: Use ktime_get_real_seconds consistentlyArnd Bergmann2-6/+6
In some places, we still used get_seconds() instead of ktime_get_real_seconds(), and I'm changing the remaining ones now to all use ktime_get_real_seconds() so we use the full available range for timestamps instead of overflowing the 'unsigned long' return value in year 2106 on 32-bit kernels. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-04-13ipc/shm: fix use-after-free of shm file via remap_file_pages()Eric Biggers1-3/+20
syzbot reported a use-after-free of shm_file_data(file)->file->f_op in shm_get_unmapped_area(), called via sys_remap_file_pages(). Unfortunately it couldn't generate a reproducer, but I found a bug which I think caused it. When remap_file_pages() is passed a full System V shared memory segment, the memory is first unmapped, then a new map is created using the ->vm_file. Between these steps, the shm ID can be removed and reused for a new shm segment. But, shm_mmap() only checks whether the ID is currently valid before calling the underlying file's ->mmap(); it doesn't check whether it was reused. Thus it can use the wrong underlying file, one that was already freed. Fix this by making the "outer" shm file (the one that gets put in ->vm_file) hold a reference to the real shm file, and by making __shm_open() require that the file associated with the shm ID matches the one associated with the "outer" file. Taking the reference to the real shm file is needed to fully solve the problem, since otherwise sfd->file could point to a freed file, which then could be reallocated for the reused shm ID, causing the wrong shm segment to be mapped (and without the required permission checks). Commit 1ac0b6dec656 ("ipc/shm: handle removed segments gracefully in shm_mmap()") almost fixed this bug, but it didn't go far enough because it didn't consider the case where the shm ID is reused. The following program usually reproduces this bug: #include <stdlib.h> #include <sys/shm.h> #include <sys/syscall.h> #include <unistd.h> int main() { int is_parent = (fork() != 0); srand(getpid()); for (;;) { int id = shmget(0xF00F, 4096, IPC_CREAT|0700); if (is_parent) { void *addr = shmat(id, NULL, 0); usleep(rand() % 50); while (!syscall(__NR_remap_file_pages, addr, 4096, 0, 0, 0)); } else { usleep(rand() % 50); shmctl(id, IPC_RMID, NULL); } } } It causes the following NULL pointer dereference due to a 'struct file' being used while it's being freed. (I couldn't actually get a KASAN use-after-free splat like in the syzbot report. But I think it's possible with this bug; it would just take a more extraordinary race...) BUG: unable to handle kernel NULL pointer dereference at 0000000000000058 PGD 0 P4D 0 Oops: 0000 [#1] SMP NOPTI CPU: 9 PID: 258 Comm: syz_ipc Not tainted 4.16.0-05140-gf8cf2f16a7c95 #189 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014 RIP: 0010:d_inode include/linux/dcache.h:519 [inline] RIP: 0010:touch_atime+0x25/0xd0 fs/inode.c:1724 [...] Call Trace: file_accessed include/linux/fs.h:2063 [inline] shmem_mmap+0x25/0x40 mm/shmem.c:2149 call_mmap include/linux/fs.h:1789 [inline] shm_mmap+0x34/0x80 ipc/shm.c:465 call_mmap include/linux/fs.h:1789 [inline] mmap_region+0x309/0x5b0 mm/mmap.c:1712 do_mmap+0x294/0x4a0 mm/mmap.c:1483 do_mmap_pgoff include/linux/mm.h:2235 [inline] SYSC_remap_file_pages mm/mmap.c:2853 [inline] SyS_remap_file_pages+0x232/0x310 mm/mmap.c:2769 do_syscall_64+0x64/0x1a0 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 [ebiggers@google.com: add comment] Link: http://lkml.kernel.org/r/20180410192850.235835-1-ebiggers3@gmail.com Link: http://lkml.kernel.org/r/20180409043039.28915-1-ebiggers3@gmail.com Reported-by: syzbot+d11f321e7f1923157eac80aa990b446596f46439@syzkaller.appspotmail.com Fixes: c8d78c1823f4 ("mm: replace remap_file_pages() syscall with emulation") Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11ipc/shm.c: shm_split(): remove unneeded test for NULL shm_file_data.vm_opsAndrew Morton1-1/+1
This was added by the recent "ipc/shm.c: add split function to shm_vm_ops", but it is not necessary. Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11ipc/msg: introduce msgctl(MSG_STAT_ANY)Davidlohr Bueso1-5/+12
There is a permission discrepancy when consulting msq ipc object metadata between /proc/sysvipc/msg (0444) and the MSG_STAT shmctl command. The later does permission checks for the object vs S_IRUGO. As such there can be cases where EACCESS is returned via syscall but the info is displayed anyways in the procfs files. While this might have security implications via info leaking (albeit no writing to the msq metadata), this behavior goes way back and showing all the objects regardless of the permissions was most likely an overlook - so we are stuck with it. Furthermore, modifying either the syscall or the procfs file can cause userspace programs to break (ie ipcs). Some applications require getting the procfs info (without root privileges) and can be rather slow in comparison with a syscall -- up to 500x in some reported cases for shm. This patch introduces a new MSG_STAT_ANY command such that the msq ipc object permissions are ignored, and only audited instead. In addition, I've left the lsm security hook checks in place, as if some policy can block the call, then the user has no other choice than just parsing the procfs file. Link: http://lkml.kernel.org/r/20180215162458.10059-4-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reported-by: Robert Kettler <robert.kettler@outlook.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11ipc/sem: introduce semctl(SEM_STAT_ANY)Davidlohr Bueso1-5/+12
There is a permission discrepancy when consulting shm ipc object metadata between /proc/sysvipc/sem (0444) and the SEM_STAT semctl command. The later does permission checks for the object vs S_IRUGO. As such there can be cases where EACCESS is returned via syscall but the info is displayed anyways in the procfs files. While this might have security implications via info leaking (albeit no writing to the sma metadata), this behavior goes way back and showing all the objects regardless of the permissions was most likely an overlook - so we are stuck with it. Furthermore, modifying either the syscall or the procfs file can cause userspace programs to break (ie ipcs). Some applications require getting the procfs info (without root privileges) and can be rather slow in comparison with a syscall -- up to 500x in some reported cases for shm. This patch introduces a new SEM_STAT_ANY command such that the sem ipc object permissions are ignored, and only audited instead. In addition, I've left the lsm security hook checks in place, as if some policy can block the call, then the user has no other choice than just parsing the procfs file. Link: http://lkml.kernel.org/r/20180215162458.10059-3-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reported-by: Robert Kettler <robert.kettler@outlook.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11ipc/shm: introduce shmctl(SHM_STAT_ANY)Davidlohr Bueso1-5/+18
Patch series "sysvipc: introduce STAT_ANY commands", v2. The following patches adds the discussed (see [1]) new command for shm as well as for sems and msq as they are subject to the same discrepancies for ipc object permission checks between the syscall and via procfs. These new commands are justified in that (1) we are stuck with this semantics as changing syscall and procfs can break userland; and (2) some users can benefit from performance (for large amounts of shm segments, for example) from not having to parse the procfs interface. Once merged, I will submit the necesary manpage updates. But I'm thinking something like: : diff --git a/man2/shmctl.2 b/man2/shmctl.2 : index 7bb503999941..bb00bbe21a57 100644 : --- a/man2/shmctl.2 : +++ b/man2/shmctl.2 : @@ -41,6 +41,7 @@ : .\" 2005-04-25, mtk -- noted aberrant Linux behavior w.r.t. new : .\" attaches to a segment that has already been marked for deletion. : .\" 2005-08-02, mtk: Added IPC_INFO, SHM_INFO, SHM_STAT descriptions. : +.\" 2018-02-13, dbueso: Added SHM_STAT_ANY description. : .\" : .TH SHMCTL 2 2017-09-15 "Linux" "Linux Programmer's Manual" : .SH NAME : @@ -242,6 +243,18 @@ However, the : argument is not a segment identifier, but instead an index into : the kernel's internal array that maintains information about : all shared memory segments on the system. : +.TP : +.BR SHM_STAT_ANY " (Linux-specific)" : +Return a : +.I shmid_ds : +structure as for : +.BR SHM_STAT . : +However, the : +.I shm_perm.mode : +is not checked for read access for : +.IR shmid , : +resembing the behaviour of : +/proc/sysvipc/shm. : .PP : The caller can prevent or allow swapping of a shared : memory segment with the following \fIcmd\fP values: : @@ -287,7 +300,7 @@ operation returns the index of the highest used entry in the : kernel's internal array recording information about all : shared memory segments. : (This information can be used with repeated : -.B SHM_STAT : +.B SHM_STAT/SHM_STAT_ANY : operations to obtain information about all shared memory segments : on the system.) : A successful : @@ -328,7 +341,7 @@ isn't accessible. : \fIshmid\fP is not a valid identifier, or \fIcmd\fP : is not a valid command. : Or: for a : -.B SHM_STAT : +.B SHM_STAT/SHM_STAT_ANY : operation, the index value specified in : .I shmid : referred to an array slot that is currently unused. This patch (of 3): There is a permission discrepancy when consulting shm ipc object metadata between /proc/sysvipc/shm (0444) and the SHM_STAT shmctl command. The later does permission checks for the object vs S_IRUGO. As such there can be cases where EACCESS is returned via syscall but the info is displayed anyways in the procfs files. While this might have security implications via info leaking (albeit no writing to the shm metadata), this behavior goes way back and showing all the objects regardless of the permissions was most likely an overlook - so we are stuck with it. Furthermore, modifying either the syscall or the procfs file can cause userspace programs to break (ie ipcs). Some applications require getting the procfs info (without root privileges) and can be rather slow in comparison with a syscall -- up to 500x in some reported cases. This patch introduces a new SHM_STAT_ANY command such that the shm ipc object permissions are ignored, and only audited instead. In addition, I've left the lsm security hook checks in place, as if some policy can block the call, then the user has no other choice than just parsing the procfs file. [1] https://lkml.org/lkml/2017/12/19/220 Link: http://lkml.kernel.org/r/20180215162458.10059-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Robert Kettler <robert.kettler@outlook.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-11proc: move /proc/sysvipc creation to where it belongsAlexey Dobriyan1-0/+1
Move the proc_mkdir() call within the sysvipc subsystem such that we avoid polluting proc_root_init() with petty cpp. [dave@stgolabs.net: contributed changelog] Link: http://lkml.kernel.org/r/20180216161732.GA10297@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-04-03Merge branch 'userns-linus' of ↵Linus Torvalds5-80/+152
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "There was a lot of work this cycle fixing bugs that were discovered after the merge window and getting everything ready where we can reasonably support fully unprivileged fuse. The bug fixes you already have and much of the unprivileged fuse work is coming in via other trees. Still left for fully unprivileged fuse is figuring out how to cleanly handle .set_acl and .get_acl in the legacy case, and properly handling of evm xattrs on unprivileged mounts. Included in the tree is a cleanup from Alexely that replaced a linked list with a statically allocated fix sized array for the pid caches, which simplifies and speeds things up. Then there is are some cleanups and fixes for the ipc namespace. The motivation was that in reviewing other code it was discovered that access ipc objects from different pid namespaces recorded pids in such a way that when asked the wrong pids were returned. In the worst case there has been a measured 30% performance impact for sysvipc semaphores. Other test cases showed no measurable performance impact. Manfred Spraul and Davidlohr Bueso who tend to work on sysvipc performance both gave the nod that this is good enough. Casey Schaufler and James Morris have given their approval to the LSM side of the changes. I simplified the types and the code dealing with sysvipc to pass just kern_ipc_perm for all three types of ipc. Which reduced the header dependencies throughout the kernel and simplified the lsm code. Which let me work on the pid fixes without having to worry about trivial changes causing complete kernel recompiles" * 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ipc/shm: Fix pid freeing. ipc/shm: fix up for struct file no longer being available in shm.h ipc/smack: Tidy up from the change in type of the ipc security hooks ipc: Directly call the security hook in ipc_ops.associate ipc/sem: Fix semctl(..., GETPID, ...) between pid namespaces ipc/msg: Fix msgctl(..., IPC_STAT, ...) between pid namespaces ipc/shm: Fix shmctl(..., IPC_STAT, ...) between pid namespaces. ipc/util: Helpers for making the sysvipc operations pid namespace aware ipc: Move IPCMNI from include/ipc.h into ipc/util.h msg: Move struct msg_queue into ipc/msg.c shm: Move struct shmid_kernel into ipc/shm.c sem: Move struct sem and struct sem_array into ipc/sem.c msg/security: Pass kern_ipc_perm not msg_queue into the msg_queue security hooks shm/security: Pass kern_ipc_perm not shmid_kernel into the shm security hooks sem/security: Pass kern_ipc_perm not sem_array into the sem security hooks pidns: simpler allocation of pid_* caches
2018-04-02Merge branch 'syscalls-next' of ↵Linus Torvalds5-49/+172
git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux Pull removal of in-kernel calls to syscalls from Dominik Brodowski: "System calls are interaction points between userspace and the kernel. Therefore, system call functions such as sys_xyzzy() or compat_sys_xyzzy() should only be called from userspace via the syscall table, but not from elsewhere in the kernel. At least on 64-bit x86, it will likely be a hard requirement from v4.17 onwards to not call system call functions in the kernel: It is better to use use a different calling convention for system calls there, where struct pt_regs is decoded on-the-fly in a syscall wrapper which then hands processing over to the actual syscall function. This means that only those parameters which are actually needed for a specific syscall are passed on during syscall entry, instead of filling in six CPU registers with random user space content all the time (which may cause serious trouble down the call chain). Those x86-specific patches will be pushed through the x86 tree in the near future. Moreover, rules on how data may be accessed may differ between kernel data and user data. This is another reason why calling sys_xyzzy() is generally a bad idea, and -- at most -- acceptable in arch-specific code. This patchset removes all in-kernel calls to syscall functions in the kernel with the exception of arch/. On top of this, it cleans up the three places where many syscalls are referenced or prototyped, namely kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h" * 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: (109 commits) bpf: whitelist all syscalls for error injection kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions kernel/sys_ni: sort cond_syscall() entries syscalls/x86: auto-create compat_sys_*() prototypes syscalls: sort syscall prototypes in include/linux/compat.h net: remove compat_sys_*() prototypes from net/compat.h syscalls: sort syscall prototypes in include/linux/syscalls.h kexec: move sys_kexec_load() prototype to syscalls.h x86/sigreturn: use SYSCALL_DEFINE0 x86: fix sys_sigreturn() return type to be long, not unsigned long x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm() mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead() mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff() mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64() fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate() fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate() fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid() kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare() ...
2018-04-02ipc: add msgsnd syscall/compat_syscall wrappersDominik Brodowski3-6/+22
Provide ksys_msgsnd() and compat_ksys_msgsnd() wrappers to avoid in-kernel calls to these syscalls. The ksys_ prefix denotes that these functions are meant as a drop-in replacement for the syscalls. In particular, they use the same calling convention as sys_msgsnd() and compat_sys_msgsnd(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02ipc: add msgrcv syscall/compat_syscall wrappersDominik Brodowski3-7/+24
Provide ksys_msgrcv() and compat_ksys_msgrcv() wrappers to avoid in-kernel calls to these syscalls. The ksys_ prefix denotes that these functions are meant as a drop-in replacement for the syscalls. In particular, they use the same calling convention as sys_msgrcv() and compat_sys_msgrcv(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02ipc: add msgctl syscall/compat_syscall wrappersDominik Brodowski3-4/+17
Provide ksys_msgctl() and compat_ksys_msgctl() wrappers to avoid in-kernel calls to these syscalls. The ksys_ prefix denotes that these functions are meant as a drop-in replacement for the syscalls. In particular, they use the same calling convention as sys_msgctl() and compat_sys_msgctl(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02ipc: add shmctl syscall/compat_syscall wrappersDominik Brodowski3-4/+16
Provide ksys_shmctl() and compat_ksys_shmctl() wrappers to avoid in-kernel calls to these syscalls. The ksys_ prefix denotes that these functions are meant as a drop-in replacement for the syscalls. In particular, they use the same calling convention as sys_shmctl() and compat_sys_shmctl(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02ipc: add shmdt syscall wrapperDominik Brodowski3-3/+9
Provide ksys_shmdt() wrapper to avoid in-kernel calls to this syscall. The ksys_ prefix denotes that this function is meant as a drop-in replacement for the syscall. In particular, it uses the same calling convention as sys_shmdt(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02ipc: add shmget syscall wrapperDominik Brodowski3-3/+9
Provide ksys_shmget() wrapper to avoid in-kernel calls to this syscall. The ksys_ prefix denotes that this function is meant as a drop-in replacement for the syscall. In particular, it uses the same calling convention as sys_shmget(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02ipc: add msgget syscall wrapperDominik Brodowski3-3/+9
Provide ksys_msgget() wrapper to avoid in-kernel calls to this syscall. The ksys_ prefix denotes that this function is meant as a drop-in replacement for the syscall. In particular, it uses the same calling convention as sys_msgget(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02ipc: add semctl syscall/compat_syscall wrappersDominik Brodowski3-4/+16
Provide ksys_semctl() and compat_ksys_semctl() wrappers to avoid in-kernel calls to these syscalls. The ksys_ prefix denotes that these functions are meant as a drop-in replacement for the syscalls. In particular, they use the same calling convention as sys_semctl() and compat_sys_semctl(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02ipc: add semget syscall wrapperDominik Brodowski3-3/+9
Provide ksys_semget() wrapper to avoid in-kernel calls to this syscall. The ksys_ prefix denotes that this function is meant as a drop-in replacement for the syscall. In particular, it uses the same calling convention as sys_semget(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-04-02ipc: add semtimedop syscall/compat_syscall wrappersDominik Brodowski3-12/+41
Provide ksys_semtimedop() and compat_ksys_semtimedop() wrappers to avoid in-kernel calls to these syscalls. The ksys_ prefix denotes that these functions are meant as a drop-in replacement for the syscalls. In particular, they use the same calling convention as sys_semtimedop() and compat_sys_semtimedop(). This patch is part of a series which removes in-kernel calls to syscalls. On this basis, the syscall entry path can be streamlined. For details, see http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
2018-03-28ipc/shm.c: add split function to shm_vm_opsMike Kravetz1-0/+12
If System V shmget/shmat operations are used to create a hugetlbfs backed mapping, it is possible to munmap part of the mapping and split the underlying vma such that it is not huge page aligned. This will untimately result in the following BUG: kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310! Oops: Exception in kernel mode, sig: 5 [#1] LE SMP NR_CPUS=2048 NUMA PowerNV Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G C E 4.15.0-10-generic #11-Ubuntu NIP: c00000000036e764 LR: c00000000036ee48 CTR: 0000000000000009 REGS: c000003fbcdcf810 TRAP: 0700 Tainted: G C E (4.15.0-10-generic) MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002222 XER: 20040000 CFAR: c00000000036ee44 SOFTE: 1 NIP __unmap_hugepage_range+0xa4/0x760 LR __unmap_hugepage_range_final+0x28/0x50 Call Trace: 0x7115e4e00000 (unreliable) __unmap_hugepage_range_final+0x28/0x50 unmap_single_vma+0x11c/0x190 unmap_vmas+0x94/0x140 exit_mmap+0x9c/0x1d0 mmput+0xa8/0x1d0 do_exit+0x360/0xc80 do_group_exit+0x60/0x100 SyS_exit_group+0x24/0x30 system_call+0x58/0x6c ---[ end trace ee88f958a1c62605 ]--- This bug was introduced by commit 31383c6865a5 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct"). A split function was added to vm_operations_struct to determine if a mapping can be split. This was mostly for device-dax and hugetlbfs mappings which have specific alignment constraints. Mappings initiated via shmget/shmat have their original vm_ops overwritten with shm_vm_ops. shm_vm_ops functions will call back to the original vm_ops if needed. Add such a split function to shm_vm_ops. Link: http://lkml.kernel.org/r/20180321161314.7711-1-mike.kravetz@oracle.com Fixes: 31383c6865a5 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Reviewed-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-03-28ipc/shm: Fix pid freeing.Eric W. Biederman1-2/+2
The 0day kernel test build report reported an oops: > > IP: put_pid+0x22/0x5c > PGD 19efa067 P4D 19efa067 PUD 0 > Oops: 0000 [#1] > CPU: 0 PID: 727 Comm: trinity Not tainted 4.16.0-rc2-00010-g98f929b #1 > RIP: 0010:put_pid+0x22/0x5c > RSP: 0018:ffff986719f73e48 EFLAGS: 00010202 > RAX: 00000006d765f710 RBX: ffff98671a4fa4d0 RCX: ffff986719f73d40 > RDX: 000000006f6e6125 RSI: 0000000000000000 RDI: ffffffffa01e6d21 > RBP: ffffffffa0955fe0 R08: 0000000000000020 R09: 0000000000000000 > R10: 0000000000000078 R11: ffff986719f73e76 R12: 0000000000001000 > R13: 00000000ffffffea R14: 0000000054000fb0 R15: 0000000000000000 > FS: 00000000028c2880(0000) GS:ffffffffa06ad000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000000677846439 CR3: 0000000019fc1005 CR4: 00000000000606b0 > Call Trace: > ? ipc_update_pid+0x36/0x3e > ? newseg+0x34c/0x3a6 > ? ipcget+0x5d/0x528 > ? entry_SYSCALL_64_after_hwframe+0x52/0xb7 > ? SyS_shmget+0x5a/0x84 > ? do_syscall_64+0x194/0x1b3 > ? entry_SYSCALL_64_after_hwframe+0x42/0xb7 > Code: ff 05 e7 20 9b 03 58 c9 c3 48 ff 05 85 21 9b 03 48 85 ff 74 4f 8b 47 04 8b 17 48 ff 05 7c 21 9b 03 48 83 c0 03 48 c1 e0 04 ff ca <48> 8b 44 07 08 74 1f 48 ff 05 6c 21 9b 03 ff 0f 0f 94 c2 48 ff > RIP: put_pid+0x22/0x5c RSP: ffff986719f73e48 > CR2: 0000000677846439 > ---[ end trace ab8c5cb4389d37c5 ]--- > Kernel panic - not syncing: Fatal exception In newseg when changing shm_cprid and shm_lprid from pid_t to struct pid* I misread the kvmalloc as kvzalloc and thought shp was initialized to 0. As that is not the case it is not safe to for the error handling to address shm_cprid and shm_lprid before they are initialized. Therefore move the cleanup of shm_cprid and shm_lprid from the no_file error cleanup path to the no_id error cleanup path. Ensuring that an early error exit won't cause the oops above. Reported-by: kernel test robot <fengguang.wu@intel.com> Reviewed-by: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2018-03-27ipc: Directly call the security hook in ipc_ops.associateEric W. Biederman3-27/+3
After the last round of cleanups the shm, sem, and msg associate operations just became trivial wrappers around the appropriate security method. Simplify things further by just calling the security method directly. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-27ipc/sem: Fix semctl(..., GETPID, ...) between pid namespacesEric W. Biederman1-10/+12
Today the last process to update a semaphore is remembered and reported in the pid namespace of that process. If there are processes in any other pid namespace querying that process id with GETPID the result will be unusable nonsense as it does not make any sense in your own pid namespace. Due to ipc_update_pid I don't think you will be able to get System V ipc semaphores into a troublesome cache line ping-pong. Using struct pids from separate process are not a problem because they do not share a cache line. Using struct pid from different threads of the same process are unlikely to be a problem as the reference count update can be avoided. Further linux futexes are a much better tool for the job of mutual exclusion between processes than System V semaphores. So I expect programs that are performance limited by their interprocess mutual exclusion primitive will be using futexes. So while it is possible that enhancing the storage of the last rocess of a System V semaphore from an integer to a struct pid will cause a performance regression because of the effect of frequently updating the pid reference count. I don't expect that to happen in practice. This change updates semctl(..., GETPID, ...) to return the process id of the last process to update a semphore inthe pid namespace of the calling process. Fixes: b488893a390e ("pid namespaces: changes to show virtual ids to user") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-27ipc/msg: Fix msgctl(..., IPC_STAT, ...) between pid namespacesEric W. Biederman1-10/+13
Today msg_lspid and msg_lrpid are remembered in the pid namespace of the creator and the processes that last send or received a sysvipc message. If you have processes in multiple pid namespaces that is just wrong. The process ids reported will not make the least bit of sense. This fix is slightly more susceptible to a performance problem than the related fix for System V shared memory. By definition the pids are updated by msgsnd and msgrcv, the fast path of System V message queues. The only concern over the previous implementation is the incrementing and decrementing of the pid reference count. As that is the only difference and multiple updates by of the task_tgid by threads in the same process have been shown in af_unix sockets to create a cache line ping-pong between cpus of the same processor. In this case I don't expect cache lines holding pid reference counts to ping pong between cpus. As senders and receivers update different pids there is a natural separation there. Further if multiple threads of the same process either send or receive messages the pid will be updated to the same value and ipc_update_pid will avoid the reference count update. Which means in the common case I expect msg_lspid and msg_lrpid to remain constant, and reference counts not to be updated when messages are sent. In rare cases it may be possible to trigger the issue which was observed for af_unix sockets, but it will require multiple processes with multiple threads to be either sending or receiving messages. It just does not feel likely that anyone would do that in practice. This change updates msgctl(..., IPC_STAT, ...) to return msg_lspid and msg_lrpid in the pid namespace of the process calling stat. This change also updates cat /proc/sysvipc/msg to return print msg_lspid and msg_lrpid in the pid namespace of the process that opened the proc file. Fixes: b488893a390e ("pid namespaces: changes to show virtual ids to user") Reviewed-by: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-27ipc/shm: Fix shmctl(..., IPC_STAT, ...) between pid namespaces.Eric W. Biederman1-10/+15
Today shm_cpid and shm_lpid are remembered in the pid namespace of the creator and the processes that last touched a sysvipc shared memory segment. If you have processes in multiple pid namespaces that is just wrong, and I don't know how this has been over-looked for so long. As only creation and shared memory attach and shared memory detach update the pids I do not expect there to be a repeat of the issues when struct pid was attached to each af_unix skb, which in some notable cases cut the performance in half. The problem was threads of the same process updating same struct pid from different cpus causing the cache line to be highly contended and bounce between cpus. As creation, attach, and detach are expected to be rare operations for sysvipc shared memory segments I do not expect that kind of cache line ping pong to cause probems. In addition because the pid is at a fixed location in the structure instead of being dynamic on a skb, the reference count of the pid does not need to be updated on each operation if the pid is the same. This ability to simply skip the pid reference count changes if the pid is unchanging further reduces the likelihood of the a cache line holding a pid reference count ping-ponging between cpus. Fixes: b488893a390e ("pid namespaces: changes to show virtual ids to user") Reviewed-by: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-24Revert "mqueue: switch to on-demand creation of internal mount"Eric W. Biederman1-55/+19
This reverts commit 36735a6a2b5e042db1af956ce4bcc13f3ff99e21. Aleksa Sarai <asarai@suse.de> writes: > [REGRESSION v4.16-rc6] [PATCH] mqueue: forbid unprivileged user access to internal mount > > Felix reported weird behaviour on 4.16.0-rc6 with regards to mqueue[1], > which was introduced by 36735a6a2b5e ("mqueue: switch to on-demand > creation of internal mount"). > > Basically, the reproducer boils down to being able to mount mqueue if > you create a new user namespace, even if you don't unshare the IPC > namespace. > > Previously this was not possible, and you would get an -EPERM. The mount > is the *host* mqueue mount, which is being cached and just returned from > mqueue_mount(). To be honest, I'm not sure if this is safe or not (or if > it was intentional -- since I'm not familiar with mqueue). > > To me it looks like there is a missing permission check. I've included a > patch below that I've compile-tested, and should block the above case. > Can someone please tell me if I'm missing something? Is this actually > safe? > > [1]: https://github.com/docker/docker/issues/36674 The issue is a lot deeper than a missing permission check. sb->s_user_ns was is improperly set as well. So in addition to the filesystem being mounted when it should not be mounted, so things are not allow that should be. We are practically to the release of 4.16 and there is no agreement between Al Viro and myself on what the code should looks like to fix things properly. So revert the code to what it was before so that we can take our time and discuss this properly. Fixes: 36735a6a2b5e ("mqueue: switch to on-demand creation of internal mount") Reported-by: Felix Abecassis <fabecassis@nvidia.com> Reported-by: Aleksa Sarai <asarai@suse.de> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-24ipc/util: Helpers for making the sysvipc operations pid namespace awareEric W. Biederman2-0/+20
Capture the pid namespace when /proc/sysvipc/msg /proc/sysvipc/shm and /proc/sysvipc/sem are opened, and make it available through the new helper ipc_seq_pid_ns. This makes it possible to report the pids in these files in the pid namespace of the opener of the files. Implement ipc_update_pid. A simple impline helper that will only update a struct pid pointer if the new value does not equal the old value. This removes the need for wordy code sequences like: old = object->pid; object->pid = new; put_pid(old); and old = object->pid; if (old != new) { object->pid = new; put_pid(old); } Allowing the following to be written instead: ipc_update_pid(&object->pid, new); Which is easier to read and ensures that the pid reference count is not touched the old and the new values are the same. Not touching the reference count in this case is important to help avoid issues like af_unix experienced, where multiple threads of the same process managed to bounce the struct pid between cpu cache lines, but updating the pids reference count. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-24ipc: Move IPCMNI from include/ipc.h into ipc/util.hEric W. Biederman1-0/+1
The definition IPCMNI is only used in ipc/util.h and ipc/util.c. So there is no reason to keep it in a header file that the whole kernel can see. Move it into util.h to simplify future maintenance. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-24msg: Move struct msg_queue into ipc/msg.cEric W. Biederman1-0/+17
All of the users are now in ipc/msg.c so make the definition local to that file to make code maintenance easier. AKA to prevent rebuilding the entire kernel when struct msg_queue changes. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-24shm: Move struct shmid_kernel into ipc/shm.cEric W. Biederman1-0/+22
All of the users are now in ipc/shm.c so make the definition local to that file to make code maintenance easier. AKA to prevent rebuilding the entire kernel when struct shmid_kernel changes. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-22sem: Move struct sem and struct sem_array into ipc/sem.cEric W. Biederman1-0/+34
All of the users are now in ipc/sem.c so make the definitions local to that file to make code maintenance easier. AKA to prevent rebuilding the entire kernel when one of these files is changed. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-22msg/security: Pass kern_ipc_perm not msg_queue into the msg_queue security hooksEric W. Biederman1-10/+8
All of the implementations of security hooks that take msg_queue only access q_perm the struct kern_ipc_perm member. This means the dependencies of the msg_queue security hooks can be simplified by passing the kern_ipc_perm member of msg_queue. Making this change will allow struct msg_queue to become private to ipc/msg.c. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-22shm/security: Pass kern_ipc_perm not shmid_kernel into the shm security hooksEric W. Biederman1-10/+7
All of the implementations of security hooks that take shmid_kernel only access shm_perm the struct kern_ipc_perm member. This means the dependencies of the shm security hooks can be simplified by passing the kern_ipc_perm member of shmid_kernel.. Making this change will allow struct shmid_kernel to become private to ipc/shm.c. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-03-22sem/security: Pass kern_ipc_perm not sem_array into the sem security hooksEric W. Biederman1-11/+8
All of the implementations of security hooks that take sem_array only access sem_perm the struct kern_ipc_perm member. This means the dependencies of the sem security hooks can be simplified by passing the kern_ipc_perm member of sem_array. Making this change will allow struct sem and struct sem_array to become private to ipc/sem.c. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-02-11vfs: do bulk POLL* -> EPOLL* replacementLinus Torvalds1-2/+2
This is the mindless scripted replacement of kernel use of POLL* variables as described by Al, done by this script: for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'` for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done done with de-mangling cleanups yet to come. NOTE! On almost all architectures, the EPOLL* constants have the same values as the POLL* constants do. But they keyword here is "almost". For various bad reasons they aren't the same, and epoll() doesn't actually work quite correctly in some cases due to this on Sparc et al. The next patch from Al will sort out the final differences, and we should be all done. Scripted-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-06ipc/mqueue.c: have RT tasks queue in by priority in wq_add()Jonathan Haws1-1/+1
Previous behavior added tasks to the work queue using the static_prio value instead of the dynamic priority value in prio. This caused RT tasks to be added to the work queue in a FIFO manner rather than by priority. Normal tasks were handled by priority. This fix utilizes the dynamic priority of the task to ensure that both RT and normal tasks are added to the work queue in priority order. Utilizing the dynamic priority (prio) rather than the base priority (normal_prio) was chosen to ensure that if a task had a boosted priority when it was added to the work queue, it would be woken sooner to to ensure that it releases any other locks it may be holding in a more timely manner. It is understood that the task could have a lower priority when it wakes than when it was added to the queue in this (unlikely) case. Link: http://lkml.kernel.org/r/1513006652-7014-1-git-send-email-jhaws@sdl.usu.edu Signed-off-by: Jonathan Haws <jhaws@sdl.usu.edu> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-02-06ipc: fix ipc data structures inconsistencyPhilippe Mikoyan4-12/+43
As described in the title, this patch fixes <ipc>id_ds inconsistency when <ipc>ctl_stat executes concurrently with some ds-changing function, e.g. shmat, msgsnd or whatever. For instance, if shmctl(IPC_STAT) is running concurrently with shmat, following data structure can be returned: {... shm_lpid = 0, shm_nattch = 1, ...} Link: http://lkml.kernel.org/r/20171202153456.6514-1-philippe.mikoyan@skat.systems Signed-off-by: Philippe Mikoyan <philippe.mikoyan@skat.systems> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-01-30Merge branch 'work.mqueue' of ↵Linus Torvalds1-130/+111
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull mqueue/bpf vfs cleanups from Al Viro: "mqueue and bpf go through rather painful and similar contortions to create objects in their dentry trees. Provide a primitive for doing that without abusing ->mknod(), switch bpf and mqueue to it. Another mqueue-related thing that has ended up in that branch is on-demand creation of internal mount (based upon the work of Giuseppe Scrivano)" * 'work.mqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: mqueue: switch to on-demand creation of internal mount tidy do_mq_open() up a bit mqueue: clean prepare_open() up do_mq_open(): move all work prior to dentry_open() into a helper mqueue: fold mq_attr_ok() into mqueue_get_inode() move dentry_open() calls up into do_mq_open() mqueue: switch to vfs_mkobj(), quit abusing ->d_fsdata bpf_obj_do_pin(): switch to vfs_mkobj(), quit abusing ->mknod() new primitive: vfs_mkobj()
2018-01-30Merge branch 'misc.poll' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull poll annotations from Al Viro: "This introduces a __bitwise type for POLL### bitmap, and propagates the annotations through the tree. Most of that stuff is as simple as 'make ->poll() instances return __poll_t and do the same to local variables used to hold the future return value'. Some of the obvious brainos found in process are fixed (e.g. POLLIN misspelled as POLL_IN). At that point the amount of sparse warnings is low and most of them are for genuine bugs - e.g. ->poll() instance deciding to return -EINVAL instead of a bitmap. I hadn't touched those in this series - it's large enough as it is. Another problem it has caught was eventpoll() ABI mess; select.c and eventpoll.c assumed that corresponding POLL### and EPOLL### were equal. That's true for some, but not all of them - EPOLL### are arch-independent, but POLL### are not. The last commit in this series separates userland POLL### values from the (now arch-independent) kernel-side ones, converting between them in the few places where they are copied to/from userland. AFAICS, this is the least disruptive fix preserving poll(2) ABI and making epoll() work on all architectures. As it is, it's simply broken on sparc - try to give it EPOLLWRNORM and it will trigger only on what would've triggered EPOLLWRBAND on other architectures. EPOLLWRBAND and EPOLLRDHUP, OTOH, are never triggered at all on sparc. With this patch they should work consistently on all architectures" * 'misc.poll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits) make kernel-side POLL... arch-independent eventpoll: no need to mask the result of epi_item_poll() again eventpoll: constify struct epoll_event pointers debugging printk in sg_poll() uses %x to print POLL... bitmap annotate poll(2) guts 9p: untangle ->poll() mess ->si_band gets POLL... bitmap stored into a user-visible long field ring_buffer_poll_wait() return value used as return value of ->poll() the rest of drivers/*: annotate ->poll() instances media: annotate ->poll() instances fs: annotate ->poll() instances ipc, kernel, mm: annotate ->poll() instances net: annotate ->poll() instances apparmor: annotate ->poll() instances tomoyo: annotate ->poll() instances sound: annotate ->poll() instances acpi: annotate ->poll() instances crypto: annotate ->poll() instances block: annotate ->poll() instances x86: annotate ->poll() instances ...
2018-01-12signal: Ensure generic siginfos the kernel sends have all bits initializedEric W. Biederman1-0/+1
Call clear_siginfo to ensure stack allocated siginfos are fully initialized before being passed to the signal sending functions. This ensures that if there is the kind of confusion documented by TRAP_FIXME, FPE_FIXME, or BUS_FIXME the kernel won't send unitialized data to userspace when the kernel generates a signal with SI_USER but the copy to userspace assumes it is a different kind of signal, and different fields are initialized. This also prepares the way for turning copy_siginfo_to_user into a copy_to_user, by removing the need in many cases to perform a field by field copy simply to skip the uninitialized fields. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-05mqueue: switch to on-demand creation of internal mountAl Viro1-19/+55
Instead of doing that upon each ipcns creation, we do that the first time mq_open(2) or mqueue mount is done in an ipcns. What's more, doing that allows to get rid of mount_ns() use - we can go with considerably cheaper mount_nodev(), avoiding the loop over all mqueue superblock instances; ipcns->mq_mnt is used to locate preexisting instance in O(1) time instead of O(instances) mount_ns() would've cost us. Based upon the version by Giuseppe Scrivano <gscrivan@redhat.com>; I've added handling of userland mqueue mounts (original had been broken in that area) and added a switch to mount_nodev(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-05tidy do_mq_open() up a bitAl Viro1-6/+3
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-05mqueue: clean prepare_open() upAl Viro1-18/+11
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-05do_mq_open(): move all work prior to dentry_open() into a helperAl Viro1-46/+31
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-05mqueue: fold mq_attr_ok() into mqueue_get_inode()Al Viro1-50/+21
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-05move dentry_open() calls up into do_mq_open()Al Viro1-24/+20
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-01-05mqueue: switch to vfs_mkobj(), quit abusing ->d_fsdataAl Viro1-8/+11
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27ipc, kernel, mm: annotate ->poll() instancesAl Viro1-2/+2
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-11-27Rename superblock flags (MS_xyz -> SB_xyz)Linus Torvalds1-1/+1
This is a pure automated search-and-replace of the internal kernel superblock flags. The s_flags are now called SB_*, with the names and the values for the moment mirroring the MS_* flags that they're equivalent to. Note how the MS_xyz flags are the ones passed to the mount system call, while the SB_xyz flags are what we then use in sb->s_flags. The script to do this was: # places to look in; re security/*: it generally should *not* be # touched (that stuff parses mount(2) arguments directly), but # there are two places where we really deal with superblock flags. FILES="drivers/mtd drivers/staging/lustre fs ipc mm \ include/linux/fs.h include/uapi/linux/bfs_fs.h \ security/apparmor/apparmorfs.c security/apparmor/include/lib.h" # the list of MS_... constants SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \ DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \ POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \ I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \ ACTIVE NOUSER" SED_PROG= for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done # we want files that contain at least one of MS_..., # with fs/namespace.c and fs/pnode.c excluded. L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c') for f in $L; do sed -i $f $SED_PROG; done Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17Merge branch 'akpm' (patches from Andrew)Linus Torvalds4-50/+75
Merge more updates from Andrew Morton: - a bit more MM - procfs updates - dynamic-debug fixes - lib/ updates - checkpatch - epoll - nilfs2 - signals - rapidio - PID management cleanup and optimization - kcov updates - sysvipc updates - quite a few misc things all over the place * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (94 commits) EXPERT Kconfig menu: fix broken EXPERT menu include/asm-generic/topology.h: remove unused parent_node() macro arch/tile/include/asm/topology.h: remove unused parent_node() macro arch/sparc/include/asm/topology_64.h: remove unused parent_node() macro arch/sh/include/asm/topology.h: remove unused parent_node() macro arch/ia64/include/asm/topology.h: remove unused parent_node() macro drivers/pcmcia/sa1111_badge4.c: avoid unused function warning mm: add infrastructure for get_user_pages_fast() benchmarking sysvipc: make get_maxid O(1) again sysvipc: properly name ipc_addid() limit parameter sysvipc: duplicate lock comments wrt ipc_addid() sysvipc: unteach ids->next_id for !CHECKPOINT_RESTORE initramfs: use time64_t timestamps drivers/watchdog: make use of devm_register_reboot_notifier() kernel/reboot.c: add devm_register_reboot_notifier() kcov: update documentation Makefile: support flag -fsanitizer-coverage=trace-cmp kcov: support comparison operands collection kcov: remove pointless current != NULL check kernel/panic.c: add TAINT_AUX ...
2017-11-17sysvipc: make get_maxid O(1) againDavidlohr Bueso2-33/+31
For a custom microbenchmark on a 3.30GHz Xeon SandyBridge, which calls IPC_STAT over and over, it was calculated that, on avg the cost of ipc_get_maxid() for increasing amounts of keys was: 10 keys: ~900 cycles 100 keys: ~15000 cycles 1000 keys: ~150000 cycles 10000 keys: ~2100000 cycles This is unsurprising as maxid is currently O(n). By having the max_id available in O(1) we save all those cycles for each semctl(_STAT) command, the idr_find can be expensive -- which some real (customer) workloads actually poll on. Note that this used to be the case, until commit 7ca7e564e04 ("ipc: store ipcs into IDRs"). The cost is the extra idr_find when doing RMIDs, but we simply go backwards, and should not take too many iterations to find the new value. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170831172049.14576-5-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17sysvipc: properly name ipc_addid() limit parameterDavidlohr Bueso1-5/+5
This is better understood as a limit, instead of size; exactly like the function comment indicates. Rename it. Link: http://lkml.kernel.org/r/20170831172049.14576-4-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17sysvipc: duplicate lock comments wrt ipc_addid()Davidlohr Bueso2-0/+2
The comment in msgqueues when using ipc_addid() is quite useful imo. Duplicate it for shm and semaphores. Link: http://lkml.kernel.org/r/20170831172049.14576-3-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17sysvipc: unteach ids->next_id for !CHECKPOINT_RESTOREDavidlohr Bueso2-20/+45
Patch series "sysvipc: ipc-key management improvements". Here are a few improvements I spotted while eyeballing Guillaume's rhashtable implementation for ipc keys. The first and fourth patches are the interesting ones, the middle two are trivial. This patch (of 4): The next_id object-allocation functionality was introduced in commit 03f595668017 ("ipc: add sysctl to specify desired next object id"). Given that these new entries are _only_ exported under the CONFIG_CHECKPOINT_RESTORE option, there is no point for the common case to even know about ->next_id. As such rewrite ipc_buildid() such that it can do away with the field as well as unnecessary branches when adding a new identifier. The end result also better differentiates both cases, so the code ends up being cleaner; albeit the small duplications regarding the default case. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170831172049.14576-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-11-17Merge branch 'work.misc' of ↵Linus Torvalds4-7/+7
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull misc vfs updates from Al Viro: "Assorted stuff, really no common topic here" * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: grab the lock instead of blocking in __fd_install during resizing vfs: stop clearing close on exec when closing a fd include/linux/fs.h: fix comment about struct address_space fs: make fiemap work from compat_ioctl coda: fix 'kernel memory exposure attempt' in fsync pstore: remove unneeded unlikely() vfs: remove unneeded unlikely() stubs for mount_bdev() and kill_block_super() in !CONFIG_BLOCK case make vfs_ustat() static do_handle_open() should be static elf_fdpic: fix unused variable warning fold destroy_super() into __put_super() new helper: destroy_unused_super() fix address space warnings in ipc/ acct.h: get rid of detritus
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman9-0/+9
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-11fix address space warnings in ipc/Linus Torvalds4-7/+7
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-25fix a typo in put_compat_shm_info()Al Viro1-1/+1
"uip" misspelled as "up"; unfortunately, the latter happens to be a function and gcc is happy to convert it to void *... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-20ipc/shm: Fix order of parameters when calling copy_compat_shmid_to_userWill Deacon1-1/+1
Commit 553f770ef71b ("ipc: move compat shmctl to native") moved the compat IPC syscall handling into ipc/shm.c and refactored the struct accessors in the process. Unfortunately, the call to copy_compat_shmid_to_user when handling a compat {IPC,SHM}_STAT command gets the arguments the wrong way round, passing a kernel stack address as the user buffer (destination) and the user buffer as the kernel stack address (source). This patch fixes the parameter ordering so the buffers are accessed correctly. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-14Merge branch 'work.ipc' of ↵Linus Torvalds7-1120/+1011
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull ipc compat cleanup and 64-bit time_t from Al Viro: "IPC copyin/copyout sanitizing, including 64bit time_t work from Deepa Dinamani" * 'work.ipc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: utimes: Make utimes y2038 safe ipc: shm: Make shmid_kernel timestamps y2038 safe ipc: sem: Make sem_array timestamps y2038 safe ipc: msg: Make msg_queue timestamps y2038 safe ipc: mqueue: Replace timespec with timespec64 ipc: Make sys_semtimedop() y2038 safe get rid of SYSVIPC_COMPAT on ia64 semtimedop(): move compat to native shmat(2): move compat to native msgrcv(2), msgsnd(2): move compat to native ipc(2): move compat to native ipc: make use of compat ipc_perm helpers semctl(): move compat to native semctl(): separate all layout-dependent copyin/copyout msgctl(): move compat to native msgctl(): split the actual work from copyin/copyout ipc: move compat shmctl to native shmctl: split the work from copyin/copyout
2017-09-08ipc: optimize semget/shmget/msgget for lots of keysGuillaume Knispel6-50/+124
ipc_findkey() used to scan all objects to look for the wanted key. This is slow when using a high number of keys. This change adds an rhashtable of kern_ipc_perm objects in ipc_ids, so that one lookup cease to be O(n). This change gives a 865% improvement of benchmark reaim.jobs_per_min on a 56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G memory [1] Other (more micro) benchmark results, by the author: On an i5 laptop, the following loop executed right after a reboot took, without and with this change: for (int i = 0, k=0x424242; i < KEYS; ++i) semget(k++, 1, IPC_CREAT | 0600); total total max single max single KEYS without with call without call with 1 3.5 4.9 µs 3.5 4.9 10 7.6 8.6 µs 3.7 4.7 32 16.2 15.9 µs 4.3 5.3 100 72.9 41.8 µs 3.7 4.7 1000 5,630.0 502.0 µs * * 10000 1,340,000.0 7,240.0 µs * * 31900 17,600,000.0 22,200.0 µs * * *: unreliable measure: high variance The duration for a lookup-only usage was obtained by the same loop once the keys are present: total total max single max single KEYS without with call without call with 1 2.1 2.5 µs 2.1 2.5 10 4.5 4.8 µs 2.2 2.3 32 13.0 10.8 µs 2.3 2.8 100 82.9 25.1 µs * 2.3 1000 5,780.0 217.0 µs * * 10000 1,470,000.0 2,520.0 µs * * 31900 17,400,000.0 7,810.0 µs * * Finally, executing each semget() in a new process gave, when still summing only the durations of these syscalls: creation: total total KEYS without with 1 3.7 5.0 µs 10 32.9 36.7 µs 32 125.0 109.0 µs 100 523.0 353.0 µs 1000 20,300.0 3,280.0 µs 10000 2,470,000.0 46,700.0 µs 31900 27,800,000.0 219,000.0 µs lookup-only: total total KEYS without with 1 2.5 2.7 µs 10 25.4 24.4 µs 32 106.0 72.6 µs 100 591.0 352.0 µs 1000 22,400.0 2,250.0 µs 10000 2,510,000.0 25,700.0 µs 31900 28,200,000.0 115,000.0 µs [1] http://lkml.kernel.org/r/20170814060507.GE23258@yexl-desktop Link: http://lkml.kernel.org/r/20170815194954.ck32ta2z35yuzpwp@debix Signed-off-by: Guillaume Knispel <guillaume.knispel@supersonicimagine.com> Reviewed-by: Marc Pardo <marc.pardo@supersonicimagine.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Kees Cook <keescook@chromium.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Serge Hallyn <serge@hallyn.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: Guillaume Knispel <guillaume.knispel@supersonicimagine.com> Cc: Marc Pardo <marc.pardo@supersonicimagine.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08ipc/sem: play nicer with large nsops allocationsDavidlohr Bueso1-2/+2
Replacing semop()'s kmalloc for kvmalloc was originally proposed by Manfred on the premise that it can be called for large (than order-1) sizes. For example, while Oracle recommends setting SEMOPM to a _minimum_ of 100, some distros[1] encourage the setting to be a factor of the amount of db tasks (PROCESSES), which can get fishy for large systems (easily going beyond 1000). [1] An Example of Semaphore Settings https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Tuning_and_Optimizing_Red_Hat_Enterprise_Linux_for_Oracle_9i_and_10g_Databases/sect-Oracle_9i_and_10g_Tuning_Guide-Setting_Semaphores-An_Example_of_Semaphore_Settings.html So let's just convert this to kvmalloc, just like the rest of the allocations we do in ipc. While the fallback vmalloc obviously involves more overhead, this by far the uncommon path, and it's better for the user than just erroring out with kmalloc. Link: http://lkml.kernel.org/r/20170803184136.13855-2-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08ipc/sem: drop sem_checkid helperDavidlohr Bueso1-2/+0
... 'tis not used. Link: http://lkml.kernel.org/r/20170803184136.13855-1-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08ipc: convert kern_ipc_perm.refcount from atomic_t to refcount_tElena Reshetova1-3/+3
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Link: http://lkml.kernel.org/r/1499417992-3238-4-git-send-email-elena.reshetova@intel.com Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: <arozansk@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08ipc: convert sem_undo_list.refcnt from atomic_t to refcount_tElena Reshetova1-4/+4
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Link: http://lkml.kernel.org/r/1499417992-3238-3-git-send-email-elena.reshetova@intel.com Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: <arozansk@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08ipc: convert ipc_namespace.count from atomic_t to refcount_tElena Reshetova2-3/+3
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Link: http://lkml.kernel.org/r/1499417992-3238-2-git-send-email-elena.reshetova@intel.com Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: <arozansk@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-03ipc: shm: Make shmid_kernel timestamps y2038 safeDeepa Dinamani1-5/+5
time_t is not y2038 safe. Replace all uses of time_t by y2038 safe time64_t. Similarly, replace the calls to get_seconds() with y2038 safe ktime_get_real_seconds(). Note that this preserves fast access on 64 bit systems, but 32 bit systems need sequence counters. The syscall interfaces themselves are not changed as part of the patch. They will be part of a different series. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-03ipc: sem: Make sem_array timestamps y2038 safeDeepa Dinamani1-9/+9
time_t is not y2038 safe. Replace all uses of time_t by y2038 safe time64_t. Similarly, replace the calls to get_seconds() with y2038 safe ktime_get_real_seconds(). Note that this preserves fast access on 64 bit systems, but 32 bit systems need sequence counters. The syscall interface themselves are not changed as part of the patch. They will be part of a different series. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-03ipc: msg: Make msg_queue timestamps y2038 safeDeepa Dinamani1-3/+3
time_t is not y2038 safe. Replace all uses of time_t by y2038 safe time64_t. Similarly, replace the calls to get_seconds() with y2038 safe ktime_get_real_seconds(). Note that this preserves fast access on 64 bit systems, but 32 bit systems need sequence counters. The syscall interfaces themselves are not changed as part of the patch. They will be part of a different series. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-03ipc: mqueue: Replace timespec with timespec64Deepa Dinamani1-14/+14
struct timespec is not y2038 safe. Replace all uses of timespec by y2038 safe struct timespec64. Even though timespec is used here to represent timeouts, replace these with timespec64 so that it facilitates in verification by creating a y2038 safe kernel image that is free of timespec. The syscall interfaces themselves are not changed as part of the patch. They will be part of a different series. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Richard Guy Briggs <rgb@redhat.com> Reviewed-by: Richard Guy Briggs <rgb@redhat.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-09-03ipc: Make sys_semtimedop() y2038 safeDeepa Dinamani1-6/+6
struct timespec is not y2038 safe on 32 bit machines. Replace timespec with y2038 safe struct timespec64. Note that the patch only changes the internals without modifying the syscall interface. This will be part of a separate series. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-08-21Merge branch 'for-mingo' of ↵Ingo Molnar1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Removal of spin_unlock_wait() - SRCU updates - Torture-test updates - Documentation updates - Miscellaneous fixes - CPU-hotplug fixes - Miscellaneous non-RCU fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-17ipc: Replace spin_unlock_wait() with lock/unlock pairPaul E. McKenney1-1/+2
There is no agreed-upon definition of spin_unlock_wait()'s semantics, and it appears that all callers could do just as well with a lock/unlock pair. This commit therefore replaces the spin_unlock_wait() call in exit_sem() with spin_lock() followed immediately by spin_unlock(). This should be safe from a performance perspective because exit_sem() is rarely invoked in production. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Will Deacon <will.deacon@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Andrea Parri <parri.andrea@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Manfred Spraul <manfred@colorfullife.com>
2017-08-02ipc: add missing container_of()s for randstructKees Cook3-3/+7
When building with the randstruct gcc plugin, the layout of the IPC structs will be randomized, which requires any sub-structure accesses to use container_of(). The proc display handlers were missing the needed container_of()s since the iterator is passing in the top-level struct kern_ipc_perm. This would lead to crashes when running the "lsipc" program after the system had IPC registered (e.g. after starting up Gnome): general protection fault: 0000 [#1] PREEMPT SMP ... RIP: 0010:shm_add_rss_swap.isra.1+0x13/0xa0 ... Call Trace: sysvipc_shm_proc_show+0x5e/0x150 sysvipc_proc_show+0x1a/0x30 seq_read+0x2e9/0x3f0 ... Link: http://lkml.kernel.org/r/20170730205950.GA55841@beast Fixes: 3859a271a003 ("randstruct: Mark various structs for randomization") Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Dominik Brodowski <linux@dominikbrodowski.net> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-15semtimedop(): move compat to nativeAl Viro2-21/+33
... and finally kill the sodding compat_convert_timespec() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-15shmat(2): move compat to nativeAl Viro2-16/+19
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-15msgrcv(2), msgsnd(2): move compat to nativeAl Viro2-39/+43
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-15ipc(2): move compat to nativeAl Viro2-95/+93
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-15ipc: make use of compat ipc_perm helpersAl Viro2-50/+8
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-15semctl(): move compat to nativeAl Viro3-197/+133
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-15semctl(): separate all layout-dependent copyin/copyoutAl Viro1-98/+94
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-15msgctl(): move compat to nativeAl Viro2-132/+133
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-15msgctl(): split the actual work from copyin/copyoutAl Viro1-106/+96
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-15ipc: move compat shmctl to nativeAl Viro3-235/+231
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-15shmctl: split the work from copyin/copyoutAl Viro1-175/+172
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-07-12ipc/util.h: update documentation for ipc_getref() and ipc_putref()Manfred Spraul1-0/+3
Now that ipc_rcu_alloc() and ipc_rcu_free() are removed, document when it is valid to use ipc_getref() and ipc_putref(). Link: http://lkml.kernel.org/r/20170525185107.12869-21-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/sem: drop __sem_free()Kees Cook1-7/+2
The remaining users of __sem_free() can simply call kvfree() instead for better readability. [manfred@colorfullife.com: Rediff to keep rcu protection for security_sem_alloc()] Link: http://lkml.kernel.org/r/20170525185107.12869-20-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/msg: remove special msg_alloc/freeKees Cook1-20/+4
There is nothing special about the msg_alloc/free routines any more, so remove them to make code more readable. [manfred@colorfullife.com: Rediff to keep rcu protection for security_msg_queue_alloc()] Link: http://lkml.kernel.org/r/20170525185107.12869-19-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/shm: remove special shm_alloc/freeKees Cook1-20/+4
There is nothing special about the shm_alloc/free routines any more, so remove them to make code more readable. [manfred@colorfullife.com: Rediff, to continue to keep rcu for free calls after a successful security_shm_alloc()] Link: http://lkml.kernel.org/r/20170525185107.12869-18-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc: move atomic_set() to where it is neededKees Cook4-5/+1
Only after ipc_addid() has succeeded will refcounting be used, so move initialization into ipc_addid() and remove from open-coded *_alloc() routines. Link: http://lkml.kernel.org/r/20170525185107.12869-17-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/msg.c: avoid ipc_rcu_putref for failed ipc_addid()Manfred Spraul1-5/+5
Loosely based on a patch from Kees Cook <keescook@chromium.org>: - id and retval can be merged - if ipc_addid() fails, then use call_rcu() directly. The difference is that call_rcu is used for failed ipc_addid() calls, to continue to guaranteed an rcu delay for security_msg_queue_free(). Link: http://lkml.kernel.org/r/20170525185107.12869-16-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/shm.c: avoid ipc_rcu_putref for failed ipc_addid()Manfred Spraul1-6/+3
Loosely based on a patch from Kees Cook <keescook@chromium.org>: - id and error can be merged - if operations before ipc_addid() fail, then use call_rcu() directly. The difference is that call_rcu is used for failures after security_shm_alloc(), to continue to guaranteed an rcu delay for security_sem_free(). Link: http://lkml.kernel.org/r/20170525185107.12869-15-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/sem.c: avoid ipc_rcu_putref for failed ipc_addid()Manfred Spraul1-5/+4
Loosely based on a patch from Kees Cook <keescook@chromium.org>: - id and retval can be merged - if ipc_addid() fails, then use call_rcu() directly. The difference is that call_rcu is used for failed ipc_addid() calls, to continue to guaranteed an rcu delay for security_sem_free(). Link: http://lkml.kernel.org/r/20170525185107.12869-14-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Kees Cook <keescook@chromium.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/util: drop ipc_rcu_alloc()Kees Cook2-24/+0
No callers remain for ipc_rcu_alloc(). Drop the function. [manfred@colorfullife.com: Rediff because the memset was temporarily inside ipc_rcu_free()] Link: http://lkml.kernel.org/r/20170525185107.12869-13-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/msg: avoid ipc_rcu_alloc()Kees Cook1-4/+14
Instead of using ipc_rcu_alloc() which only performs the refcount bump, open code it. This also allows for msg_queue structure layout to be randomized in the future. Link: http://lkml.kernel.org/r/20170525185107.12869-12-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/shm: avoid ipc_rcu_alloc()Kees Cook1-4/+14
Instead of using ipc_rcu_alloc() which only performs the refcount bump, open code it. This also allows for shmid_kernel structure layout to be randomized in the future. Link: http://lkml.kernel.org/r/20170525185107.12869-11-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/sem: avoid ipc_rcu_alloc()Kees Cook1-5/+20
Instead of using ipc_rcu_alloc() which only performs the refcount bump, open code it to perform better sem-specific checks. This also allows for sem_array structure layout to be randomized in the future. [manfred@colorfullife.com: Rediff, because the memset was temporarily inside ipc_rcu_alloc()] Link: http://lkml.kernel.org/r/20170525185107.12869-10-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/util: drop ipc_rcu_free()Kees Cook2-8/+0
There are no more callers of ipc_rcu_free(), so remove it. Link: http://lkml.kernel.org/r/20170525185107.12869-9-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/msg: do not use ipc_rcu_free()Kees Cook1-2/+7
Avoid using ipc_rcu_free, since it just re-finds the original structure pointer. For the pre-list-init failure path, there is no RCU needed, since it was just allocated. It can be directly freed. Link: http://lkml.kernel.org/r/20170525185107.12869-8-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/shm: do not use ipc_rcu_free()Kees Cook1-2/+7
Avoid using ipc_rcu_free, since it just re-finds the original structure pointer. For the pre-list-init failure path, there is no RCU needed, since it was just allocated. It can be directly freed. Link: http://lkml.kernel.org/r/20170525185107.12869-7-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/sem: do not use ipc_rcu_free()Kees Cook1-2/+7
Avoid using ipc_rcu_free, since it just re-finds the original structure pointer. For the pre-list-init failure path, there is no RCU needed, since it was just allocated. It can be directly freed. Link: http://lkml.kernel.org/r/20170525185107.12869-6-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc: drop non-RCU allocationKees Cook3-33/+6
The only users of ipc_alloc() were ipc_rcu_alloc() and the on-heap sem_io fall-back memory. Better to just open-code these to make things easier to read. [manfred@colorfullife.com: Rediff due to inclusion of memset() into ipc_rcu_alloc()] Link: http://lkml.kernel.org/r/20170525185107.12869-5-manfred@colorfullife.com Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc: merge ipc_rcu and kern_ipc_permManfred Spraul5-61/+63
ipc has two management structures that exist for every id: - struct kern_ipc_perm, it contains e.g. the permissions. - struct ipc_rcu, it contains the rcu head for rcu handling and the refcount. The patch merges both structures. As a bonus, we may save one cacheline, because both structures are cacheline aligned. In addition, it reduces the number of casts, instead most codepaths can use container_of. To simplify code, the ipc_rcu_alloc initializes the allocation to 0. [manfred@colorfullife.com: really include the memset() into ipc_alloc_rcu()] Link: http://lkml.kernel.org/r/564f8612-0601-b267-514f-a9f650ec9b32@colorfullife.com Link: http://lkml.kernel.org/r/20170525185107.12869-3-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12ipc/sem.c: remove sem_base, embed struct semManfred Spraul1-54/+34
sma->sem_base is initialized with sma->sem_base = (struct sem *) &sma[1]; The current code has four problems: - There is an unnecessary pointer dereference - sem_base is not needed. - Alignment for struct sem only works by chance. - The current code causes false positive for static code analysis. - This is a cast between different non-void types, which the future randstruct GCC plugin warns on. And, as bonus, the code size gets smaller: Before: 0 .text 00003770 After: 0 .text 0000374e [manfred@colorfullife.com: s/[0]/[]/, per hch] Link: http://lkml.kernel.org/r/20170525185107.12869-2-manfred@colorfullife.com Link: http://lkml.kernel.org/r/20170515171912.6298-2-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Kees Cook <keescook@chromium.org> Cc: Kees Cook <keescook@chromium.org> Cc: <1vier1@web.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Fabian Frederick <fabf@skynet.be> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-09mqueue: fix a use-after-free in sys_mq_notify()Cong Wang1-1/+3
The retry logic for netlink_attachskb() inside sys_mq_notify() is nasty and vulnerable: 1) The sock refcnt is already released when retry is needed 2) The fd is controllable by user-space because we already release the file refcnt so we when retry but the fd has been just closed by user-space during this small window, we end up calling netlink_detachskb() on the error path which releases the sock again, later when the user-space closes this socket a use-after-free could be triggered. Setting 'sock' to NULL here should be sufficient to fix it. Reported-by: GeneBlue <geneblue.mail@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-07Merge tag 'for-linus-v4.13-1' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull Writeback error handling fixes from Jeff Layton: "The main rationale for all of these changes is to tighten up writeback error reporting to userland. There are many ways now that writeback errors can be lost, such that fsync/fdatasync/msync return 0 when writeback actually failed. This pile contains a small set of cleanups and writeback error handling fixes that I was able to break off from the main pile (#2). Two of the patches in this pile are trivial. The exceptions are the patch to fix up error handling in write_one_page, and the patch to make JFS pay attention to write_one_page errors" * tag 'for-linus-v4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: fs: remove call_fsync helper function mm: clean up error handling in write_one_page JFS: do not ignore return code from write_one_page() mm: drop "wait" parameter from write_one_page()
2017-07-05fs: remove call_fsync helper functionJeff Layton1-1/+1
Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-04mqueue: move compat syscalls to native onesAl Viro3-228/+262
... and stop messing with compat_alloc_user_space() and friends [braino fix from Colin King folded in] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-08mm: introduce kv[mz]alloc helpersMichal Hocko1-6/+1
Patch series "kvmalloc", v5. There are many open coded kmalloc with vmalloc fallback instances in the tree. Most of them are not careful enough or simply do not care about the underlying semantic of the kmalloc/page allocator which means that a) some vmalloc fallbacks are basically unreachable because the kmalloc part will keep retrying until it succeeds b) the page allocator can invoke a really disruptive steps like the OOM killer to move forward which doesn't sound appropriate when we consider that the vmalloc fallback is available. As it can be seen implementing kvmalloc requires quite an intimate knowledge if the page allocator and the memory reclaim internals which strongly suggests that a helper should be implemented in the memory subsystem proper. Most callers, I could find, have been converted to use the helper instead. This is patch 6. There are some more relying on __GFP_REPEAT in the networking stack which I have converted as well and Eric Dumazet was not opposed [2] to convert them as well. [1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org [2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com This patch (of 9): Using kmalloc with the vmalloc fallback for larger allocations is a common pattern in the kernel code. Yet we do not have any common helper for that and so users have invented their own helpers. Some of them are really creative when doing so. Let's just add kv[mz]alloc and make sure it is implemented properly. This implementation makes sure to not make a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also to not warn about allocation failures. This also rules out the OOM killer as the vmalloc is a more approapriate fallback than a disruptive user visible action. This patch also changes some existing users and removes helpers which are specific for them. In some cases this is not possible (e.g. ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and require GFP_NO{FS,IO} context which is not vmalloc compatible in general (note that the page table allocation is GFP_KERNEL). Those need to be fixed separately. While we are at it, document that __vmalloc{_node} about unsupported gfp mask because there seems to be a lot of confusion out there. kvmalloc_node will warn about GFP_KERNEL incompatible (which are not superset) flags to catch new abusers. Existing ones would have to die slowly. [sfr@canb.auug.org.au: f2fs fixup] Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Reviewed-by: Andreas Dilger <adilger@dilger.ca> [ext4 part] Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: John Hubbard <jhubbard@nvidia.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-08ipc/shm: some shmat cleanupsDavidlohr Bueso1-9/+7
Clean up early flag and address some minutia. Link: http://lkml.kernel.org/r/1486673582-6979-3-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-05Merge branch 'for-linus' of ↵Linus Torvalds1-2/+0
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "This is a set of small fixes that were mostly stumbled over during more significant development. This proc fix and the fix to posix-timers are the most significant of the lot. There is a lot of good development going on but unfortunately it didn't quite make the merge window" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: Fix unbalanced hard link numbers signal: Make kill_proc_info static rlimit: Properly call security_task_setrlimit signal: Remove unused definition of sig_user_definied ia64: Remove unused IA64_TASK_SIGHAND_OFFSET and IA64_SIGHAND_SIGLOCK_OFFSET ipc: Remove unused declaration of recompute_msgmni posix-timers: Correct sanity check in posix_cpu_nsleep sysctl: Remove dead register_sysctl_root
2017-04-17ipc: Remove unused declaration of recompute_msgmniEric W. Biederman1-2/+0
The function recompute_msgmni was removed a while ago but it is still declared in a header file remove it. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2017-04-02kernel-api.rst: fix a series of errors when parsing C filesmchehab@s-opensource.com1-5/+7
./lib/string.c:134: WARNING: Inline emphasis start-string without end-string. ./mm/filemap.c:522: WARNING: Inline interpreted text or phrase reference start-string without end-string. ./mm/filemap.c:1283: ERROR: Unexpected indentation. ./mm/filemap.c:3003: WARNING: Inline interpreted text or phrase reference start-string without end-string. ./mm/vmalloc.c:1544: WARNING: Inline emphasis start-string without end-string. ./mm/page_alloc.c:4245: ERROR: Unexpected indentation. ./ipc/util.c:676: ERROR: Unexpected indentation. ./drivers/pci/irq.c:35: WARNING: Block quote ends without a blank line; unexpected unindent. ./security/security.c:109: ERROR: Unexpected indentation. ./security/security.c:110: WARNING: Definition list ends without a blank line; unexpected unindent. ./block/genhd.c:275: WARNING: Inline strong start-string without end-string. ./block/genhd.c:283: WARNING: Inline strong start-string without end-string. ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string. ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string. ./ipc/util.c:477: ERROR: Unknown target name: "s". Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2017-03-03Merge branch 'WIP.sched-core-for-linus' of ↵Linus Torvalds4-1/+7
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull sched.h split-up from Ingo Molnar: "The point of these changes is to significantly reduce the <linux/sched.h> header footprint, to speed up the kernel build and to have a cleaner header structure. After these changes the new <linux/sched.h>'s typical preprocessed size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K lines), which is around 40% faster to build on typical configs. Not much changed from the last version (-v2) posted three weeks ago: I eliminated quirks, backmerged fixes plus I rebased it to an upstream SHA1 from yesterday that includes most changes queued up in -next plus all sched.h changes that were pending from Andrew. I've re-tested the series both on x86 and on cross-arch defconfigs, and did a bisectability test at a number of random points. I tried to test as many build configurations as possible, but some build breakage is probably still left - but it should be mostly limited to architectures that have no cross-compiler binaries available on kernel.org, and non-default configurations" * 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits) sched/headers: Clean up <linux/sched.h> sched/headers: Remove #ifdefs from <linux/sched.h> sched/headers: Remove the <linux/topology.h> include from <linux/sched.h> sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h> sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h> sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h> sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h> sched/headers: Remove <linux/sched.h> from <linux/sched/init.h> sched/core: Remove unused prefetch_stack() sched/headers: Remove <linux/rculist.h> from <linux/sched.h> sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h> sched/headers: Remove <linux/signal.h> from <linux/sched.h> sched/headers: Remove <linux/rwsem.h> from <linux/sched.h> sched/headers: Remove the runqueue_is_locked() prototype sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h> sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h> sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h> sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h> ...
2017-03-02Merge branch 'for-linus' of ↵Linus Torvalds1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile two from Al Viro: - orangefs fix - series of fs/namei.c cleanups from me - VFS stuff coming from overlayfs tree * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: orangefs: Use RCU for destroy_inode vfs: use helper for calling f_op->fsync() mm: use helper for calling f_op->mmap() vfs: use helpers for calling f_op->{read,write}_iter() vfs: pass type instead of fn to do_{loop,iter}_readv_writev() vfs: extract common parts of {compat_,}do_readv_writev() vfs: wrap write f_ops with file_{start,end}_write() vfs: deny copy_file_range() for non regular files vfs: deny fallocate() on directory vfs: create vfs helper vfs_tmpfile() namei.c: split unlazy_walk() namei.c: fold the check for DCACHE_OP_REVALIDATE into d_revalidate() lookup_fast(): clean up the logics around the fallback to non-rcu mode namei: fold unlazy_link() into its sole caller
2017-03-02Merge remote-tracking branch 'ovl/for-viro' into for-linusAl Viro1-2/+2
Overlayfs-related series from Miklos and Amir
2017-03-02sched/headers: Move the wake-queue types and interfaces from sched.h into ↵Ingo Molnar1-1/+0
<linux/sched/wake_q.h> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare to move the task_lock()/unlock() APIs to ↵Ingo Molnar1-0/+1
<linux/sched/task.h> But first update the code that uses these facilities with the new header. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h>Ingo Molnar1-0/+1
Add #include <linux/cred.h> dependencies to all .c files rely on sched.h doing that for them. Note that even if the count where we need to add extra headers seems high, it's still a net win, because <linux/sched.h> is included in over 2,200 files ... Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar1-0/+1
<linux/sched/user.h> We are going to split <linux/sched/user.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/user.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar1-0/+1
<linux/sched/signal.h> We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/signal.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar3-0/+3
<linux/sched/wake_q.h> We are going to split <linux/sched/wake_q.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/wake_q.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-27ipc/shm: Fix shmat mmap nil-page protectionDavidlohr Bueso1-4/+9
The issue is described here, with a nice testcase: https://bugzilla.kernel.org/show_bug.cgi?id=192931 The problem is that shmat() calls do_mmap_pgoff() with MAP_FIXED, and the address rounded down to 0. For the regular mmap case, the protection mentioned above is that the kernel gets to generate the address -- arch_get_unmapped_area() will always check for MAP_FIXED and return that address. So by the time we do security_mmap_addr(0) things get funky for shmat(). The testcase itself shows that while a regular user crashes, root will not have a problem attaching a nil-page. There are two possible fixes to this. The first, and which this patch does, is to simply allow root to crash as well -- this is also regular mmap behavior, ie when hacking up the testcase and adding mmap(... |MAP_FIXED). While this approach is the safer option, the second alternative is to ignore SHM_RND if the rounded address is 0, thus only having MAP_SHARED flags. This makes the behavior of shmat() identical to the mmap() case. The downside of this is obviously user visible, but does make sense in that it maintains semantics after the round-down wrt 0 address and mmap. Passes shm related ltp tests. Link: http://lkml.kernel.org/r/1486050195-18629-1-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reported-by: Gareth Evans <gareth.evans@contextis.co.uk> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27ipc/mqueue: add missing sparse annotationLuc Van Oostenryck1-0/+1
Link: http://lkml.kernel.org/r/20170128235704.45302-1-luc.vanoostenryck@gmail.com Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27ipc/sem: add hysteresisManfred Spraul1-25/+61
sysv sem has two lock modes: One with per-semaphore locks, one lock mode with a single global lock for the whole array. When switching from the per-semaphore locks to the global lock, all per-semaphore locks must be scanned for ongoing operations. The patch adds a hysteresis for switching from the global lock to the per semaphore locks. This reduces how often the per-semaphore locks must be scanned. Compared to the initial patch, this is a simplified solution: Setting USE_GLOBAL_LOCK_HYSTERESIS to 1 restores the current behavior. In theory, a workload with exactly 10 simple sops and then one complex op now scales a bit worse, but this is pure theory: If there is concurrency, the it won't be exactly 10:1:10:1:10:1:... If there is no concurrency, then there is no need for scalability. Link: http://lkml.kernel.org/r/1476851896-3590-3-git-send-email-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: <1vier1@web.de> Cc: kernel test robot <xiaolong.ye@intel.com> Cc: <felixh@informatik.uni-bremen.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27ipc/sem.c: avoid using spin_unlock_wait()Manfred Spraul1-22/+3
a) The ACQUIRE in spin_lock() applies to the read, not to the store, at least for powerpc. This forces to add a smp_mb() into the fast path. b) The memory barrier provided by spin_unlock_wait() is right now arch dependent. Therefore: Use spin_lock()/spin_unlock() instead of spin_unlock_wait(). Advantage: faster single op semop calls(), observed +8.9% on x86. (the other solution would be arch dependencies in ipc/sem). Disadvantage: slower complex op semop calls, if (and only if) there are no sleeping operations. The next patch adds hysteresis, this further reduces the probability that the slow path is used. Link: http://lkml.kernel.org/r/1476851896-3590-2-git-send-email-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: H. Peter Anvin <hpa@zytor.com> Cc: <1vier1@web.de> Cc: kernel test robot <xiaolong.ye@intel.com> Cc: <felixh@informatik.uni-bremen.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24userfaultfd: non-cooperative: add event for memory unmapsMike Rapoport1-4/+4
When a non-cooperative userfaultfd monitor copies pages in the background, it may encounter regions that were already unmapped. Addition of UFFD_EVENT_UNMAP allows the uffd monitor to track precisely changes in the virtual memory layout. Since there might be different uffd contexts for the affected VMAs, we first should create a temporary representation for the unmap event for each uffd context and then notify them one by one to the appropriate userfault file descriptors. The event notification occurs after the mmap_sem has been released. [arnd@arndb.de: fix nommu build] Link: http://lkml.kernel.org/r/20170203165141.3665284-1-arnd@arndb.de [mhocko@suse.com: fix nommu build] Link: http://lkml.kernel.org/r/20170202091503.GA22823@dhcp22.suse.cz Link: http://lkml.kernel.org/r/1485542673-24387-3-git-send-email-rppt@linux.vnet.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmfDave Jiang1-3/+3
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to take a vma and vmf parameter when the vma already resides in vmf. Remove the vma parameter to simplify things. [arnd@arndb.de: fix ARM build] Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-20vfs: use helper for calling f_op->fsync()Miklos Szeredi1-1/+1
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-02-20mm: use helper for calling f_op->mmap()Miklos Szeredi1-1/+1
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2017-01-10ipc/sem.c: fix incorrect sem_lock pairingManfred Spraul1-1/+1
Based on the syzcaller test case from dvyukov: https://gist.githubusercontent.com/dvyukov/d0e5efefe4d7d6daed829f5c3ca26a40/raw/08d0a261fe3c987bed04fbf267e08ba04bd533ea/gistfile1.txt The slow (i.e.: failure to acquire) syscall exit from semtimedop() incorrectly assumed that the the same lock is acquired as it was at the initial syscall entry. This is wrong: - thread A: single semop semop(), sleeps - thread B: multi semop semop(), sleeps - thread A: woken up by signal/timeout With this sequence, the initial sem_lock() call locks the per-semaphore spinlock, and it is unlocked with sem_unlock(). The call at the syscall return locks the global spinlock. Because locknum is not updated, the following sem_unlock() call unlocks the per-semaphore spinlock, which is actually not locked. The fix is trivial: Use the return value from sem_lock. Fixes: 370b262c896e ("ipc/sem: avoid idr tree lookup for interrupted semop") Link: http://lkml.kernel.org/r/1482215645-22328-1-git-send-email-manfred@colorfullife.com Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Johanna Abrahamsson <johanna@mjao.org> Tested-by: Johanna Abrahamsson <johanna@mjao.org> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: avoid idr tree lookup for interrupted semopDavidlohr Bueso1-32/+5
We can avoid the idr tree lookup (albeit possibly avoiding idr_find_fast()) when being awoken in EINTR, as the semid will not change in this context while blocked. Use the sma pointer directly and take the sem_lock, then re-check for RMID races. We continue to re-check the queue.status with the lock held such that we can detect situations where we where are dealing with a spurious wakeup but another task that holds the sem_lock updated the queue.status while we were spinning for it. Once we take the lock it obviously won't change again. Being the only caller, get rid of sem_obtain_lock() altogether. Link: http://lkml.kernel.org/r/1478708774-28826-3-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: simplify wait-wake loopDavidlohr Bueso1-56/+52
Instead of using the reverse goto, we can simplify the flow and make it more language natural by just doing do-while instead. One would hope this is the standard way (or obviously just with a while bucle) that we do wait/wakeup handling in the kernel. The exact same logic is kept, just more indented. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1478708774-28826-2-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: use proper list api for pending_list wakeupsDavidlohr Bueso1-25/+13
... saves some LoC and looks cleaner than re-implementing the calls. Link: http://lkml.kernel.org/r/1474225896-10066-6-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: explicitly inline check_restartDavidlohr Bueso1-1/+1
The compiler already does this, but make it explicit. This helper is really small and also used in update_queue's main loop, which is O(N^2) scanning. Inline and avoid the function overhead. Link: http://lkml.kernel.org/r/1474225896-10066-5-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: optimize perform_atomic_semop()Davidlohr Bueso1-10/+102
This is the main workhorse that deals with semop user calls such that the waitforzero or semval update operations, on the set, can complete on not as the sma currently stands. Currently, the set is iterated twice (setting semval, then backwards for the sempid value). Slowpaths, and particularly SEM_UNDO calls, must undo any altered sem when it is detected that the caller must block or has errored-out. With larger sets, there can occur situations where this involves a lot of cycles and can obviously be a suboptimal use of cached resources in shared memory. Ie, discarding CPU caches that are also calling semop and have the sembuf cached (and can complete), while the current lock holder doing the semop will block, error, or does a waitforzero operation. This patch proposes still iterating the set twice, but the first scan is read-only, and we perform the actual updates afterward, once we know that the call will succeed. In order to not suffer from the overhead of dealing with sops that act on the same sem_num, such (rare) cases use perform_atomic_semop_slow(), which is exactly what we have now. Duplicates are detected before grabbing sem_lock, and uses simple a 32/64-bit hash array variable to based on the sem_num we are working on. In addition add some comments to when we expect to the caller to block. [akpm@linux-foundation.org: coding-style fixes] [colin.king@canonical.com: ensure we left shift a ULL rather than a 32 bit integer] Link: http://lkml.kernel.org/r/20161028181129.7311-1-colin.king@canonical.com Link: http://lkml.kernel.org/r/20160921194603.GB21438@linux-80c1.suse Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: rework task wakeupsDavidlohr Bueso1-180/+86
Our sysv sems have been using the notion of lockless wakeups for a while, ever since commit 0a2b9d4c7967 ("ipc/sem.c: move wake_up_process out of the spinlock section"), in order to reduce the sem_lock hold times. This in-house pending queue can be replaced by wake_q (just like all the rest of ipc now), in that it provides the following advantages: o Simplifies and gets rid of unnecessary code. o We get rid of the IN_WAKEUP complexities. Given that wake_q_add() grabs reference to the task, if awoken due to an unrelated event, between the wake_q_add() and wake_up_q() window, we cannot race with sys_exit and the imminent call to wake_up_process(). o By not spinning IN_WAKEUP, we no longer need to disable preemption. In consequence, the wakeup paths (after schedule(), that is) must acknowledge an external signal/event, as well spurious wakeup occurring during the pending wakeup window. Obviously no changes in semantics that could be visible to the user. The fastpath is _only_ for when we know for sure that we were awoken due to a the waker's successful semop call (queue.status is not -EINTR). On a 48-core Haswell, running the ipcscale 'waitforzero' test, the following is seen with increasing thread counts: v4.8-rc5 v4.8-rc5 semopv2 Hmean sembench-sem-2 574733.00 ( 0.00%) 578322.00 ( 0.62%) Hmean sembench-sem-8 811708.00 ( 0.00%) 824689.00 ( 1.59%) Hmean sembench-sem-12 842448.00 ( 0.00%) 845409.00 ( 0.35%) Hmean sembench-sem-21 933003.00 ( 0.00%) 977748.00 ( 4.80%) Hmean sembench-sem-48 935910.00 ( 0.00%) 1004759.00 ( 7.36%) Hmean sembench-sem-79 937186.00 ( 0.00%) 983976.00 ( 4.99%) Hmean sembench-sem-234 974256.00 ( 0.00%) 1060294.00 ( 8.83%) Hmean sembench-sem-265 975468.00 ( 0.00%) 1016243.00 ( 4.18%) Hmean sembench-sem-296 991280.00 ( 0.00%) 1042659.00 ( 5.18%) Hmean sembench-sem-327 975415.00 ( 0.00%) 1029977.00 ( 5.59%) Hmean sembench-sem-358 1014286.00 ( 0.00%) 1049624.00 ( 3.48%) Hmean sembench-sem-389 972939.00 ( 0.00%) 1043127.00 ( 7.21%) Hmean sembench-sem-420 981909.00 ( 0.00%) 1056747.00 ( 7.62%) Hmean sembench-sem-451 990139.00 ( 0.00%) 1051609.00 ( 6.21%) Hmean sembench-sem-482 965735.00 ( 0.00%) 1040313.00 ( 7.72%) [akpm@linux-foundation.org: coding-style fixes] [sfr@canb.auug.org.au: merge fix for WAKE_Q to DEFINE_WAKE_Q rename] Link: http://lkml.kernel.org/r/20161122210410.5eca9fc2@canb.auug.org.au Link: http://lkml.kernel.org/r/1474225896-10066-3-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/sem: do not call wake_sem_queue_do() prematurely ... as this call should ↵Davidlohr Bueso1-7/+12
obviously be paired with its _prepare() counterpart. At least whenever possible, as there is no harm in calling it bogusly as we do now in a few places. Immediate error semop(2) paths that are far from ever having the task block can be simplified and avoid a few unnecessary loads on their way out of the call as it is not deeply nested. Link: http://lkml.kernel.org/r/1474225896-10066-2-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc/shm.c: coding style fixesShailesh Pandey1-3/+10
This patch fixes below warnings: WARNING: Missing a blank line after declarations WARNING: Block comments use a trailing */ on a separate line ERROR: spaces required around that '=' (ctx:WxV) Above warnings were reported by checkpatch.pl Link: http://lkml.kernel.org/r/1478604980-18062-1-git-send-email-p.shailesh@samsung.com Signed-off-by: Shailesh Pandey <p.shailesh@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-14ipc: msg, make msgrcv work with LONG_MINJiri Slaby1-1/+4
When LONG_MIN is passed to msgrcv, one would expect to recieve any message. But convert_mode does *msgtyp = -*msgtyp and -LONG_MIN is undefined. In particular, with my gcc -LONG_MIN produces -LONG_MIN again. So handle this case properly by assigning LONG_MAX to *msgtyp if LONG_MIN was specified as msgtyp to msgrcv. This code: long msg[] = { 100, 200 }; int m = msgget(IPC_PRIVATE, IPC_CREAT | 0644); msgsnd(m, &msg, sizeof(msg), 0); msgrcv(m, &msg, sizeof(msg), LONG_MIN, 0); produces currently nothing: msgget(IPC_PRIVATE, IPC_CREAT|0644) = 65538 msgsnd(65538, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, 0) = 0 msgrcv(65538, ... Except a UBSAN warning: UBSAN: Undefined behaviour in ipc/msg.c:745:13 negation of -9223372036854775808 cannot be represented in type 'long int': With the patch, I see what I expect: msgget(IPC_PRIVATE, IPC_CREAT|0644) = 0 msgsnd(0, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, 0) = 0 msgrcv(0, {100, "\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 16, -9223372036854775808, 0) = 16 Link: http://lkml.kernel.org/r/20161024082633.10148-1-jslaby@suse.cz Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-21sched/wake_q: Rename WAKE_Q to DEFINE_WAKE_QWaiman Long2-6/+6
Currently the wake_q data structure is defined by the WAKE_Q() macro. This macro, however, looks like a function doing something as "wake" is a verb. Even checkpatch.pl was confused as it reported warnings like WARNING: Missing a blank line after declarations #548: FILE: kernel/futex.c:3665: + int ret; + WAKE_Q(wake_q); This patch renames the WAKE_Q() macro to DEFINE_WAKE_Q() which clarifies what the macro is doing and eliminates the checkpatch.pl warnings. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1479401198-1765-1-git-send-email-longman@redhat.com [ Resolved conflict and added missing rename. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-27ipc: account for kmem usage on mqueue and msgAristeu Rozanski1-2/+2
When kmem accounting switched from account by default to only account if flagged by __GFP_ACCOUNT, IPC mqueue and messages was left out. The production use case at hand is that mqueues should be customizable via sysctls in Docker containers in a Kubernetes cluster. This can only be safely allowed to the users of the cluster (without the risk that they can cause resource shortage on a node, influencing other users' containers) if all resources they control are bounded, i.e. accounted for. Link: http://lkml.kernel.org/r/1476806075-1210-1-git-send-email-arozansk@redhat.com Signed-off-by: Aristeu Rozanski <arozansk@redhat.com> Reported-by: Stefan Schimanski <sttts@redhat.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Stefan Schimanski <sttts@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11ipc/sem.c: add cond_resched in exit_smeNikolay Borisov1-0/+2
In CONFIG_PREEMPT=n kernel a softlockup was observed while the for loop in exit_sem. Apparently it's possible for the loop to take quite a long time and it doesn't have a scheduling point in it. Since the codes is executing under an rcu read section this may also cause rcu stalls, which in turn block synchronize_rcu operations, which more or less de-stabilises the whole system. Fix this by introducing a cond_resched() at the beginning of the loop. So this patch fixes the following: NMI watchdog: BUG: soft lockup - CPU#10 stuck for 23s! [httpd:18119] CPU: 10 PID: 18119 Comm: httpd Tainted: G O 4.4.20-clouder2 #6 Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015 task: ffff88348d695280 ti: ffff881c95550000 task.ti: ffff881c95550000 RIP: 0010:[<ffffffff81614bc7>] [<ffffffff81614bc7>] _raw_spin_lock+0x17/0x30 RSP: 0018:ffff881c95553e40 EFLAGS: 00000246 RAX: 0000000000000000 RBX: ffff883161b1eea8 RCX: 000000000000000d RDX: 0000000000000001 RSI: 000000000000000e RDI: ffff883161b1eea4 RBP: ffff881c95553ea0 R08: ffff881c95553e68 R09: ffff883fef376f88 R10: ffff881fffb58c20 R11: ffffea0072556600 R12: ffff883161b1eea0 R13: ffff88348d695280 R14: ffff883dec427000 R15: ffff8831621672a0 FS: 0000000000000000(0000) GS:ffff881fffb40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f3b3723e020 CR3: 0000000001c0a000 CR4: 00000000001406e0 Call Trace: ? exit_sem+0x7c/0x280 do_exit+0x338/0xb40 do_group_exit+0x43/0xd0 SyS_exit_group+0x14/0x20 entry_SYSCALL_64_fastpath+0x16/0x6e Link: http://lkml.kernel.org/r/1475154992-6363-1-git-send-email-kernel@kyup.com Signed-off-by: Nikolay Borisov <kernel@kyup.com> Cc: Herton R. Krzesinski <herton@redhat.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11ipc/msg: avoid waking sender upon full queueDavidlohr Bueso1-10/+43
Blocked tasks queued in q_senders waiting for their message to fit in the queue are blindly awoken every time we think there's a remote chance this might happen. This could cause numerous (and expensive -- thundering herd-ish) bogus wakeups if the queue is still really full. Adding to the scheduling cost/overhead, there's also the fact that we need to take the ipc object lock and requeue ourselves in the q_senders list. By keeping track of the blocked sender's message size, we can know previously if the wakeup ought to occur or not. Otherwise, to maintain the current wakeup order we just move it to the tail. This is exactly what occurs right now if the sender needs to go back to sleep. The case of EIDRM is left completely untouched, as we need to wakeup all the tasks, and shouldn't be playing games in the first place. This patch was seen to save on the 'msgctl10' ltp testcase ~15% in context switches (avg out of ten runs). Although these tests are really about functionality (as opposed to performance), is does show the direct benefits of the optimization. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1469748819-19484-6-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11ipc/msg: make ss_wakeup() kill arg booleanDavidlohr Bueso1-4/+4
... 'tis annoying. Link: http://lkml.kernel.org/r/1469748819-19484-4-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11ipc/msg: batch queue sender wakeupsDavidlohr Bueso1-10/+20
Currently the use of wake_qs in sysv msg queues are only for the receiver tasks that are blocked on the queue. But blocked sender tasks (due to queue size constraints) still are awoken with the ipc object lock held, which can be a problem particularly for small sized queues and far from gracious for -rt (just like it was for the receiver side). The paths that actually wakeup a sender are obviously related to when we are either getting rid of the queue or after (some) space is freed-up after a receiver takes the msg (msgrcv). Furthermore, with the exception of msgrcv, we can always piggy-back on expunge_all that has its own tasks lined-up for waking. Finally, upon unlinking the message, it should be no problem delaying the wakeups a bit until after we've released the lock. Link: http://lkml.kernel.org/r/1469748819-19484-3-git-send-email-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11ipc/msg: implement lockless pipelined wakeupsSebastian Andrzej Siewior1-93/+40
This patch moves the wakeup_process() invocation so it is not done under the ipc global lock by making use of a lockless wake_q. With this change, the waiter is woken up once the message has been assigned and it does not need to loop on SMP if the message points to NULL. In the signal case we still need to check the pointer under the lock to verify the state. This change should also avoid the introduction of preempt_disable() in -RT which avoids a busy-loop which pools for the NULL -> !NULL change if the waiter has a higher priority compared to the waker. By making use of wake_qs, the logic of sysv msg queues is greatly simplified (and very well suited as we can batch lockless wakeups), particularly around the lockless receive algorithm. This has been tested with Manred's pmsg-shared tool on a "AMD A10-7800 Radeon R7, 12 Compute Cores 4C+8G": test | before | after | diff -----------------|------------|------------|---------- pmsg-shared 8 60 | 19,347,422 | 30,442,191 | + ~57.34 % pmsg-shared 4 60 | 21,367,197 | 35,743,458 | + ~67.28 % pmsg-shared 2 60 | 22,884,224 | 24,278,200 | + ~6.09 % Link: http://lkml.kernel.org/r/1469748819-19484-2-git-send-email-dave@stgolabs.net Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11ipc/sem.c: fix complex_count vs. simple op raceManfred Spraul1-55/+83
Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race: sem_lock has a fast path that allows parallel simple operations. There are two reasons why a simple operation cannot run in parallel: - a non-simple operations is ongoing (sma->sem_perm.lock held) - a complex operation is sleeping (sma->complex_count != 0) As both facts are stored independently, a thread can bypass the current checks by sleeping in the right positions. See below for more details (or kernel bugzilla 105651). The patch fixes that by creating one variable (complex_mode) that tracks both reasons why parallel operations are not possible. The patch also updates stale documentation regarding the locking. With regards to stable kernels: The patch is required for all kernels that include the commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") (3.10?) The alternative is to revert the patch that introduced the race. The patch is safe for backporting, i.e. it makes no assumptions about memory barriers in spin_unlock_wait(). Background: Here is the race of the current implementation: Thread A: (simple op) - does the first "sma->complex_count == 0" test Thread B: (complex op) - does sem_lock(): This includes an array scan. But the scan can't find Thread A, because Thread A does not own sem->lock yet. - the thread does the operation, increases complex_count, drops sem_lock, sleeps Thread A: - spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock) - sleeps before the complex_count test Thread C: (complex op) - does sem_lock (no array scan, complex_count==1) - wakes up Thread B. - decrements complex_count Thread A: - does the complex_count test Bug: Now both thread A and thread C operate on the same array, without any synchronization. Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") Link: http://lkml.kernel.org/r/1469123695-5661-1-git-send-email-manfred@colorfullife.com Reported-by: <felixh@informatik.uni-bremen.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: <1vier1@web.de> Cc: <stable@vger.kernel.org> [3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-10Merge branch 'for-linus' of ↵Linus Torvalds1-9/+9
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more vfs updates from Al Viro: ">rename2() work from Miklos + current_time() from Deepa" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: Replace current_fs_time() with current_time() fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps fs: Replace CURRENT_TIME with current_time() for inode timestamps fs: proc: Delete inode time initializations in proc_alloc_inode() vfs: Add current_time() api vfs: add note about i_op->rename changes to porting fs: rename "rename2" i_op to "rename" vfs: remove unused i_op->rename fs: make remaining filesystems use .rename2 libfs: support RENAME_NOREPLACE in simple_rename() fs: support RENAME_NOREPLACE for local filesystems ncpfs: fix unused variable warning
2016-09-27fs: Replace CURRENT_TIME with current_time() for inode timestampsDeepa Dinamani1-9/+9
CURRENT_TIME macro is not appropriate for filesystems as it doesn't use the right granularity for filesystem timestamps. Use current_time() instead. CURRENT_TIME is also not y2038 safe. This is also in preparation for the patch that transitions vfs timestamps to use 64 bit time and hence make them y2038 safe. As part of the effort current_time() will be extended to do range checks. Hence, it is necessary for all file system timestamps to use current_time(). Also, current_time() will be transitioned along with vfs to be y2038 safe. Note that whenever a single call to current_time() is used to change timestamps in different inodes, it is because they share the same time granularity. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Felipe Balbi <balbi@kernel.org> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-22Merge branch 'nsfs-ioctls' into HEADEric W. Biederman1-0/+6
From: Andrey Vagin <avagin@openvz.org> Each namespace has an owning user namespace and now there is not way to discover these relationships. Pid and user namepaces are hierarchical. There is no way to discover parent-child relationships too. Why we may want to know relationships between namespaces? One use would be visualization, in order to understand the running system. Another would be to answer the question: what capability does process X have to perform operations on a resource governed by namespace Y? One more use-case (which usually called abnormal) is checkpoint/restart. In CRIU we are going to dump and restore nested namespaces. There [1] was a discussion about which interface to choose to determing relationships between namespaces. Eric suggested to add two ioctl-s [2]: > Grumble, Grumble. I think this may actually a case for creating ioctls > for these two cases. Now that random nsfs file descriptors are bind > mountable the original reason for using proc files is not as pressing. > > One ioctl for the user namespace that owns a file descriptor. > One ioctl for the parent namespace of a namespace file descriptor. Here is an implementaions of these ioctl-s. $ man man7/namespaces.7 ... Since Linux 4.X, the following ioctl(2) calls are supported for namespace file descriptors. The correct syntax is: fd = ioctl(ns_fd, ioctl_type); where ioctl_type is one of the following: NS_GET_USERNS Returns a file descriptor that refers to an owning user names‐ pace. NS_GET_PARENT Returns a file descriptor that refers to a parent namespace. This ioctl(2) can be used for pid and user namespaces. For user namespaces, NS_GET_PARENT and NS_GET_USERNS have the same meaning. In addition to generic ioctl(2) errors, the following specific ones can occur: EINVAL NS_GET_PARENT was called for a nonhierarchical namespace. EPERM The requested namespace is outside of the current namespace scope. [1] https://lkml.org/lkml/2016/7/6/158 [2] https://lkml.org/lkml/2016/7/9/101 Changes for v2: * don't return ENOENT for init_user_ns and init_pid_ns. There is nothing outside of the init namespace, so we can return EPERM in this case too. > The fewer special cases the easier the code is to get > correct, and the easier it is to read. // Eric Changes for v3: * rename ns->get_owner() to ns->owner(). get_* usually means that it grabs a reference. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: "W. Trevor King" <wking@tremily.us> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Serge Hallyn <serge.hallyn@canonical.com>
2016-09-22kernel: add a helper to get an owning user namespace for a namespaceAndrey Vagin1-0/+6
Return -EPERM if an owning user namespace is outside of a process current user namespace. v2: In a first version ns_get_owner returned ENOENT for init_user_ns. This special cases was removed from this version. There is nothing outside of init_user_ns, so we can return EPERM. v3: rename ns->get_owner() to ns->owner(). get_* usually means that it grabs a reference. Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Andrei Vagin <avagin@openvz.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2016-09-22userns: When the per user per user namespace limit is reached return ENOSPCEric W. Biederman1-1/+1
The current error codes returned when a the per user per user namespace limit are hit (EINVAL, EUSERS, and ENFILE) are wrong. I asked for advice on linux-api and it we made clear that those were the wrong error code, but a correct effor code was not suggested. The best general error code I have found for hitting a resource limit is ENOSPC. It is not perfect but as it is unambiguous it will serve until someone comes up with a better error code. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-08-08ipcns: Add a limit on the number of ipc namespacesEric W. Biederman1-11/+34
Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-08-02ipc: delete "nr_ipc_ns"Alexey Dobriyan2-4/+0
Write-only variable. Link: http://lkml.kernel.org/r/20160708214356.GA6785@p183.telecom.by Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02sysv, ipc: fix security-layer leakingFabian Frederick2-7/+7
Commit 53dad6d3a8e5 ("ipc: fix race with LSMs") updated ipc_rcu_putref() to receive rcu freeing function but used generic ipc_rcu_free() instead of msg_rcu_free() which does security cleaning. Running LTP msgsnd06 with kmemleak gives the following: cat /sys/kernel/debug/kmemleak unreferenced object 0xffff88003c0a11f8 (size 8): comm "msgsnd06", pid 1645, jiffies 4294672526 (age 6.549s) hex dump (first 8 bytes): 1b 00 00 00 01 00 00 00 ........ backtrace: kmemleak_alloc+0x23/0x40 kmem_cache_alloc_trace+0xe1/0x180 selinux_msg_queue_alloc_security+0x3f/0xd0 security_msg_queue_alloc+0x2e/0x40 newque+0x4e/0x150 ipcget+0x159/0x1b0 SyS_msgget+0x39/0x40 entry_SYSCALL_64_fastpath+0x13/0x8f Manfred Spraul suggested to fix sem.c as well and Davidlohr Bueso to only use ipc_rcu_free in case of security allocation failure in newary() Fixes: 53dad6d3a8e ("ipc: fix race with LSMs") Link: http://lkml.kernel.org/r/1470083552-22966-1-git-send-email-fabf@skynet.be Signed-off-by: Fabian Frederick <fabf@skynet.be> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-29Merge branch 'for-linus' of ↵Linus Torvalds2-13/+12
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull userns vfs updates from Eric Biederman: "This tree contains some very long awaited work on generalizing the user namespace support for mounting filesystems to include filesystems with a backing store. The real world target is fuse but the goal is to update the vfs to allow any filesystem to be supported. This patchset is based on a lot of code review and testing to approach that goal. While looking at what is needed to support the fuse filesystem it became clear that there were things like xattrs for security modules that needed special treatment. That the resolution of those concerns would not be fuse specific. That sorting out these general issues made most sense at the generic level, where the right people could be drawn into the conversation, and the issues could be solved for everyone. At a high level what this patchset does a couple of simple things: - Add a user namespace owner (s_user_ns) to struct super_block. - Teach the vfs to handle filesystem uids and gids not mapping into to kuids and kgids and being reported as INVALID_UID and INVALID_GID in vfs data structures. By assigning a user namespace owner filesystems that are mounted with only user namespace privilege can be detected. This allows security modules and the like to know which mounts may not be trusted. This also allows the set of uids and gids that are communicated to the filesystem to be capped at the set of kuids and kgids that are in the owning user namespace of the filesystem. One of the crazier corner casees this handles is the case of inodes whose i_uid or i_gid are not mapped into the vfs. Most of the code simply doesn't care but it is easy to confuse the inode writeback path so no operation that could cause an inode write-back is permitted for such inodes (aka only reads are allowed). This set of changes starts out by cleaning up the code paths involved in user namespace permirted mounts. Then when things are clean enough adds code that cleanly sets s_user_ns. Then additional restrictions are added that are possible now that the filesystem superblock contains owner information. These changes should not affect anyone in practice, but there are some parts of these restrictions that are changes in behavior. - Andy's restriction on suid executables that does not honor the suid bit when the path is from another mount namespace (think /proc/[pid]/fd/) or when the filesystem was mounted by a less privileged user. - The replacement of the user namespace implicit setting of MNT_NODEV with implicitly setting SB_I_NODEV on the filesystem superblock instead. Using SB_I_NODEV is a stronger form that happens to make this state user invisible. The user visibility can be managed but it caused problems when it was introduced from applications reasonably expecting mount flags to be what they were set to. There is a little bit of work remaining before it is safe to support mounting filesystems with backing store in user namespaces, beyond what is in this set of changes. - Verifying the mounter has permission to read/write the block device during mount. - Teaching the integrity modules IMA and EVM to handle filesystems mounted with only user namespace root and to reduce trust in their security xattrs accordingly. - Capturing the mounters credentials and using that for permission checks in d_automount and the like. (Given that overlayfs already does this, and we need the work in d_automount it make sense to generalize this case). Furthermore there are a few changes that are on the wishlist: - Get all filesystems supporting posix acls using the generic posix acls so that posix_acl_fix_xattr_from_user and posix_acl_fix_xattr_to_user may be removed. [Maintainability] - Reducing the permission checks in places such as remount to allow the superblock owner to perform them. - Allowing the superblock owner to chown files with unmapped uids and gids to something that is mapped so the files may be treated normally. I am not considering even obvious relaxations of permission checks until it is clear there are no more corner cases that need to be locked down and handled generically. Many thanks to Seth Forshee who kept this code alive, and putting up with me rewriting substantial portions of what he did to handle more corner cases, and for his diligent testing and reviewing of my changes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (30 commits) fs: Call d_automount with the filesystems creds fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns evm: Translate user/group ids relative to s_user_ns when computing HMAC dquot: For now explicitly don't support filesystems outside of init_user_ns quota: Handle quota data stored in s_user_ns in quota_setxquota quota: Ensure qids map to the filesystem vfs: Don't create inodes with a uid or gid unknown to the vfs vfs: Don't modify inodes with a uid or gid unknown to the vfs cred: Reject inodes with invalid ids in set_create_file_as() fs: Check for invalid i_uid in may_follow_link() vfs: Verify acls are valid within superblock's s_user_ns. userns: Handle -1 in k[ug]id_has_mapping when !CONFIG_USER_NS fs: Refuse uid/gid changes which don't map into s_user_ns selinux: Add support for unprivileged mounts from user namespaces Smack: Handle labels consistently in untrusted mounts Smack: Add support for unprivileged mounts from user namespaces fs: Treat foreign mounts as nosuid fs: Limit file caps to the user namespace of the super block userns: Remove the now unnecessary FS_USERNS_DEV_MOUNT flag userns: Remove implicit MNT_NODEV fragility. ...
2016-07-26shmem: make shmem_inode_info::lock irq-safeKirill A. Shutemov1-2/+2
We are going to need to call shmem_charge() under tree_lock to get accoutning right on collapse of small tmpfs pages into a huge one. The problem is that tree_lock is irq-safe and lockdep is not happy, that we take irq-unsafe lock under irq-safe[1]. Let's convert the lock to irq-safe. [1] https://gist.github.com/kiryl/80c0149e03ed35dfaf26628b8e03cdbc Link: http://lkml.kernel.org/r/1466021202-61880-34-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26shmem: get_unmapped_area align huge pageHugh Dickins1-2/+4
Provide a shmem_get_unmapped_area method in file_operations, called at mmap time to decide the mapping address. It could be conditional on CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by making it unconditional. shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area (which we treat as a black box, highly dependent on architecture and config and executable layout). Lots of conditions, and in most cases it just goes with the address that chose; but when our huge stars are rightly aligned, yet that did not provide a suitable address, go back to ask for a larger arena, within which to align the mapping suitably. There have to be some direct calls to shmem_get_unmapped_area(), not via the file_operations: because of the way shmem_zero_setup() is called to create a shmem object late in the mmap sequence, when MAP_SHARED is requested with MAP_ANONYMOUS or /dev/zero. Though this only matters when /proc/sys/vm/shmem_huge has been set. Link: http://lkml.kernel.org/r/1466021202-61880-29-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-23vfs: Generalize filesystem nodev handling.Eric W. Biederman1-1/+1
Introduce a function may_open_dev that tests MNT_NODEV and a new superblock flab SB_I_NODEV. Use this new function in all of the places where MNT_NODEV was previously tested. Add the new SB_I_NODEV s_iflag to proc, sysfs, and mqueuefs as those filesystems should never support device nodes, and a simple superblock flags makes that very hard to get wrong. With SB_I_NODEV set if any device nodes somehow manage to show up on on a filesystem those device nodes will be unopenable. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23ipc/mqueue: The mqueue filesystem should never contain executablesEric W. Biederman1-0/+1
Set SB_I_NOEXEC on mqueuefs to ensure small implementation mistakes do not result in executable on mqueuefs by accident. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23vfs: Pass data, ns, and ns->userns to mount_nsEric W. Biederman1-11/+8
Today what is normally called data (the mount options) is not passed to fill_super through mount_ns. Pass the mount options and the namespace separately to mount_ns so that filesystems such as proc that have mount options, can use mount_ns. Pass the user namespace to mount_ns so that the standard permission check that verifies the mounter has permissions over the namespace can be performed in mount_ns instead of in each filesystems .mount method. Thus removing the duplication between mqueuefs and proc in terms of permission checks. The extra permission check does not currently affect the rpc_pipefs filesystem and the nfsd filesystem as those filesystems do not currently allow unprivileged mounts. Without unpvileged mounts it is guaranteed that the caller has already passed capable(CAP_SYS_ADMIN) which guarantees extra permission check will pass. Update rpc_pipefs and the nfsd filesystem to ensure that the network namespace reference is always taken in fill_super and always put in kill_sb so that the logic is simpler and so that errors originating inside of fill_super do not cause a network namespace leak. Acked-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-23ipc: Initialize ipc_namespace->user_ns early.Eric W. Biederman1-2/+3
Allow the ipc namespace initialization code to depend on ns->user_ns being set during initialization. In particular this allows mq_init_ns to use ns->user_ns for permission checks and initializating s_user_ns while the the mq filesystem is being mounted. Acked-by: Seth Forshee <seth.forshee@canonical.com> Suggested-by: Seth Forshee <seth.forshee@canonical.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2016-06-14locking/spinlock: Update spin_unlock_wait() usersPeter Zijlstra1-1/+0
With the modified semantics of spin_unlock_wait() a number of explicit barriers can be removed. Also update the comment for the do_exit() usecase, as that was somewhat stale/obscure. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-14locking/barriers: Introduce smp_acquire__after_ctrl_dep()Peter Zijlstra1-12/+2
Introduce smp_acquire__after_ctrl_dep(), this construct is not uncommon, but the lack of this barrier is. Use it to better express smp_rmb() uses in WRITE_ONCE(), the IPC semaphore code and the qspinlock code. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-23ipc, shm: make shmem attach/detach wait for mmap_sem killableMichal Hocko1-2/+7
shmat and shmdt rely on mmap_sem for write. If the waiting task gets killed by the oom killer it would block oom_reaper from asynchronous address space reclaim and reduce the chances of timely OOM resolving. Wait for the lock in the killable mode and return with EINTR if the task got killed while waiting. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov1-2/+2
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-22ipc/sem: make semctl setting sempid consistentDavidlohr Bueso1-2/+11
As indicated by bug#112271, Linux sets the sempid value upon semctl, and not only for semop calls. However, within semctl we only do this for SETVAL, leaving SETALL without updating the field, and therefore rather inconsistent behavior when compared to other Unices. There is really no documentation regarding this and therefore users should not make assumptions. With this patch, along with updating semctl.2 manpages, this scenario should become less ambiguous As such, set sempid on SETALL cmd. Also update some in-code documentation, specifying where the sempid is set. Passes ltp and custom testcase where a child (fork) does SETALL to the set. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reported-by: Philip Semanchuk <linux_kernel.20.ick@spamgourmet.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: PrasannaKumar Muralidharan <prasannatsmkumar@gmail.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Herton R. Krzesinski <herton@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-18ipc/shm: handle removed segments gracefully in shm_mmap()Kirill A. Shutemov1-10/+43
remap_file_pages(2) emulation can reach file which represents removed IPC ID as long as a memory segment is mapped. It breaks expectations of IPC subsystem. Test case (rewritten to be more human readable, originally autogenerated by syzkaller[1]): #define _GNU_SOURCE #include <stdlib.h> #include <sys/ipc.h> #include <sys/mman.h> #include <sys/shm.h> #define PAGE_SIZE 4096 int main() { int id; void *p; id = shmget(IPC_PRIVATE, 3 * PAGE_SIZE, 0); p = shmat(id, NULL, 0); shmctl(id, IPC_RMID, NULL); remap_file_pages(p, 3 * PAGE_SIZE, 0, 7, 0); return 0; } The patch changes shm_mmap() and code around shm_lock() to propagate locking error back to caller of shm_mmap(). [1] http://github.com/google/syzkaller Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-23Merge branch 'for-linus' of ↵Linus Torvalds1-4/+4
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull final vfs updates from Al Viro: - The ->i_mutex wrappers (with small prereq in lustre) - a fix for too early freeing of symlink bodies on shmem (they need to be RCU-delayed) (-stable fodder) - followup to dedupe stuff merged this cycle * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: abort dedupe loop if fatal signals are pending make sure that freeing shmem fast symlinks is RCU-delayed wrappers for ->i_mutex access lustre: remove unused declaration
2016-01-22tree wide: use kvfree() than conditional kfree()/vfree()Tetsuo Handa3-10/+5
There are many locations that do if (memory_was_allocated_by_vmalloc) vfree(ptr); else kfree(ptr); but kvfree() can handle both kmalloc()ed memory and vmalloc()ed memory using is_vmalloc_addr(). Unless callers have special reasons, we can replace this branch with kvfree(). Please check and reply if you found problems. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Jan Kara <jack@suse.com> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net> Acked-by: David Rientjes <rientjes@google.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Oleg Drokin <oleg.drokin@intel.com> Cc: Boris Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-22wrappers for ->i_mutex accessAl Viro1-4/+4
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested}, inode_foo(inode) being mutex_foo(&inode->i_mutex). Please, use those for access to ->i_mutex; over the coming cycle ->i_mutex will become rwsem, with ->lookup() done with it held only shared. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-20ipc/shm.c: is_file_shm_hugepages() can be booleanYaowei Bai1-1/+1
Make is_file_shm_hugepages() return bool to improve readability due to this particular function only using either one or zero as its return value. No functional change. Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14kmemcg: account certain kmem allocations to memcgVladimir Davydov1-1/+1
Mark those kmem allocations that are known to be easily triggered from userspace as __GFP_ACCOUNT/SLAB_ACCOUNT, which makes them accounted to memcg. For the list, see below: - threadinfo - task_struct - task_delay_info - pid - cred - mm_struct - vm_area_struct and vm_region (nommu) - anon_vma and anon_vma_chain - signal_struct - sighand_struct - fs_struct - files_struct - fdtable and fdtable->full_fds_bits - dentry and external_name - inode for all filesystems. This is the most tedious part, because most filesystems overwrite the alloc_inode method. The list is far from complete, so feel free to add more objects. Nevertheless, it should be close to "account everything" approach and keep most workloads within bounds. Malevolent users will be able to breach the limit, but this was possible even with the former "account everything" approach (simply because it did not account everything in fact). [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-06ipc,msg: drop dst nil validation in copy_msgDavidlohr Bueso1-1/+0
d0edd8528362 ("ipc: convert invalid scenarios to use WARN_ON") relaxed the nil dst parameter check, originally being a full BUG_ON. However, this check seems quite unnecessary when the only purpose is for ceckpoint/restore (MSG_COPY flag): o The copy variable is set initially to nil, apparently as a way of ensuring that prepare_copy is previously called. Which is in fact done, unconditionally at the beginning of do_msgrcv. o There is no concurrency with 'copy' (stack allocated in do_msgrcv). Furthermore, any errors in 'copy' (and thus prepare_copy/copy_msg) should always handled by IS_ERR() family. Therefore remove this check altogether as it can never occur with the current users. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-30Initialize msg/shm IPC objects before doing ipc_addid()Linus Torvalds3-17/+18
As reported by Dmitry Vyukov, we really shouldn't do ipc_addid() before having initialized the IPC object state. Yes, we initialize the IPC object in a locked state, but with all the lockless RCU lookup work, that IPC object lock no longer means that the state cannot be seen. We already did this for the IPC semaphore code (see commit e8577d1f0329: "ipc/sem.c: fully initialize sem_array before making it visible") but we clearly forgot about msg and shm. Reported-by: Dmitry Vyukov <dvyukov@google.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10ipc: convert invalid scenarios to use WARN_ONDavidlohr Bueso2-3/+3
Considering Linus' past rants about the (ab)use of BUG in the kernel, I took a look at how we deal with such calls in ipc. Given that any errors or corruption in ipc code are most likely contained within the set of processes participating in the broken mechanisms, there aren't really many strong fatal system failure scenarios that would require a BUG call. Also, if something is seriously wrong, ipc might not be the place for such a BUG either. 1. For example, recently, a customer hit one of these BUG_ONs in shm after failing shm_lock(). A busted ID imho does not merit a BUG_ON, and WARN would have been better. 2. MSG_COPY functionality of posix msgrcv(2) for checkpoint/restore. I don't see how we can hit this anyway -- at least it should be IS_ERR. The 'copy' arg from do_msgrcv is always set by calling prepare_copy() first and foremost. We could also probably drop this check altogether. Either way, it does not merit a BUG_ON. 3. No ->fault() callback for the fs getting the corresponding page -- seems selfish to make the system unusable. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-14ipc/sem.c: update/correct memory barriersManfred Spraul1-4/+14
sem_lock() did not properly pair memory barriers: !spin_is_locked() and spin_unlock_wait() are both only control barriers. The code needs an acquire barrier, otherwise the cpu might perform read operations before the lock test. As no primitive exists inside <include/spinlock.h> and since it seems noone wants another primitive, the code creates a local primitive within ipc/sem.c. With regards to -stable: The change of sem_wait_array() is a bugfix, the change to sem_lock() is a nop (just a preprocessor redefinition to improve the readability). The bugfix is necessary for all kernels that use sem_wait_array() (i.e.: starting from 3.10). Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: <stable@vger.kernel.org> [3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-14ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()Herton R. Krzesinski1-2/+4
After we acquire the sma->sem_perm lock in exit_sem(), we are protected against a racing IPC_RMID operation. Also at that point, we are the last user of sem_undo_list. Therefore it isn't required that we acquire or use ulp->lock. Signed-off-by: Herton R. Krzesinski <herton@redhat.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Rafael Aquini <aquini@redhat.com> CC: Aristeu Rozanski <aris@redhat.com> Cc: David Jeffery <djeffery@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-14ipc,sem: fix use after free on IPC_RMID after a task using same semaphore ↵Herton R. Krzesinski1-6/+17
set exits The current semaphore code allows a potential use after free: in exit_sem we may free the task's sem_undo_list while there is still another task looping through the same semaphore set and cleaning the sem_undo list at freeary function (the task called IPC_RMID for the same semaphore set). For example, with a test program [1] running which keeps forking a lot of processes (which then do a semop call with SEM_UNDO flag), and with the parent right after removing the semaphore set with IPC_RMID, and a kernel built with CONFIG_SLAB, CONFIG_SLAB_DEBUG and CONFIG_DEBUG_SPINLOCK, you can easily see something like the following in the kernel log: Slab corruption (Not tainted): kmalloc-64 start=ffff88003b45c1c0, len=64 000: 6b 6b 6b 6b 6b 6b 6b 6b 00 6b 6b 6b 6b 6b 6b 6b kkkkkkkk.kkkkkkk 010: ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff ....kkkk........ Prev obj: start=ffff88003b45c180, len=64 000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ 010: ff ff ff ff ff ff ff ff c0 fb 01 37 00 88 ff ff ...........7.... Next obj: start=ffff88003b45c200, len=64 000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ 010: ff ff ff ff ff ff ff ff 68 29 a7 3c 00 88 ff ff ........h).<.... BUG: spinlock wrong CPU on CPU#2, test/18028 general protection fault: 0000 [#1] SMP Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib] CPU: 2 PID: 18028 Comm: test Not tainted 4.2.0-rc5+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014 RIP: spin_dump+0x53/0xc0 Call Trace: spin_bug+0x30/0x40 do_raw_spin_unlock+0x71/0xa0 _raw_spin_unlock+0xe/0x10 freeary+0x82/0x2a0 ? _raw_spin_lock+0xe/0x10 semctl_down.clone.0+0xce/0x160 ? __do_page_fault+0x19a/0x430 ? __audit_syscall_entry+0xa8/0x100 SyS_semctl+0x236/0x2c0 ? syscall_trace_leave+0xde/0x130 entry_SYSCALL_64_fastpath+0x12/0x71 Code: 8b 80 88 03 00 00 48 8d 88 60 05 00 00 48 c7 c7 a0 2c a4 81 31 c0 65 8b 15 eb 40 f3 7e e8 08 31 68 00 4d 85 e4 44 8b 4b 08 74 5e <45> 8b 84 24 88 03 00 00 49 8d 8c 24 60 05 00 00 8b 53 04 48 89 RIP [<ffffffff810d6053>] spin_dump+0x53/0xc0 RSP <ffff88003750fd68> ---[ end trace 783ebb76612867a0 ]--- NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [test:18053] Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib] CPU: 3 PID: 18053 Comm: test Tainted: G D 4.2.0-rc5+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014 RIP: native_read_tsc+0x0/0x20 Call Trace: ? delay_tsc+0x40/0x70 __delay+0xf/0x20 do_raw_spin_lock+0x96/0x140 _raw_spin_lock+0xe/0x10 sem_lock_and_putref+0x11/0x70 SYSC_semtimedop+0x7bf/0x960 ? handle_mm_fault+0xbf6/0x1880 ? dequeue_task_fair+0x79/0x4a0 ? __do_page_fault+0x19a/0x430 ? kfree_debugcheck+0x16/0x40 ? __do_page_fault+0x19a/0x430 ? __audit_syscall_entry+0xa8/0x100 ? do_audit_syscall_entry+0x66/0x70 ? syscall_trace_enter_phase1+0x139/0x160 SyS_semtimedop+0xe/0x10 SyS_semop+0x10/0x20 entry_SYSCALL_64_fastpath+0x12/0x71 Code: 47 10 83 e8 01 85 c0 89 47 10 75 08 65 48 89 3d 1f 74 ff 7e c9 c3 0f 1f 44 00 00 55 48 89 e5 e8 87 17 04 00 66 90 c9 c3 0f 1f 00 <55> 48 89 e5 0f 31 89 c1 48 89 d0 48 c1 e0 20 89 c9 48 09 c8 c9 Kernel panic - not syncing: softlockup: hung tasks I wasn't able to trigger any badness on a recent kernel without the proper config debugs enabled, however I have softlockup reports on some kernel versions, in the semaphore code, which are similar as above (the scenario is seen on some servers running IBM DB2 which uses semaphore syscalls). The patch here fixes the race against freeary, by acquiring or waiting on the sem_undo_list lock as necessary (exit_sem can race with freeary, while freeary sets un->semid to -1 and removes the same sem_undo from list_proc or when it removes the last sem_undo). After the patch I'm unable to reproduce the problem using the test case [1]. [1] Test case used below: #include <stdio.h> #include <sys/types.h> #include <sys/ipc.h> #include <sys/sem.h> #include <sys/wait.h> #include <stdlib.h> #include <time.h> #include <unistd.h> #include <errno.h> #define NSEM 1 #define NSET 5 int sid[NSET]; void thread() { struct sembuf op; int s; uid_t pid = getuid(); s = rand() % NSET; op.sem_num = pid % NSEM; op.sem_op = 1; op.sem_flg = SEM_UNDO; semop(sid[s], &op, 1); exit(EXIT_SUCCESS); } void create_set() { int i, j; pid_t p; union { int val; struct semid_ds *buf; unsigned short int *array; struct seminfo *__buf; } un; /* Create and initialize semaphore set */ for (i = 0; i < NSET; i++) { sid[i] = semget(IPC_PRIVATE , NSEM, 0644 | IPC_CREAT); if (sid[i] < 0) { perror("semget"); exit(EXIT_FAILURE); } } un.val = 0; for (i = 0; i < NSET; i++) { for (j = 0; j < NSEM; j++) { if (semctl(sid[i], j, SETVAL, un) < 0) perror("semctl"); } } /* Launch threads that operate on semaphore set */ for (i = 0; i < NSEM * NSET * NSET; i++) { p = fork(); if (p < 0) perror("fork"); if (p == 0) thread(); } /* Free semaphore set */ for (i = 0; i < NSET; i++) { if (semctl(sid[i], NSEM, IPC_RMID)) perror("IPC_RMID"); } /* Wait for forked processes to exit */ while (wait(NULL)) { if (errno == ECHILD) break; }; } int main(int argc, char **argv) { pid_t p; srand(time(NULL)); while (1) { p = fork(); if (p < 0) { perror("fork"); exit(EXIT_FAILURE); } if (p == 0) { create_set(); goto end; } /* Wait for forked processes to exit */ while (wait(NULL)) { if (errno == ECHILD) break; }; } end: return 0; } [akpm@linux-foundation.org: use normal comment layout] Signed-off-by: Herton R. Krzesinski <herton@redhat.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Rafael Aquini <aquini@redhat.com> CC: Aristeu Rozanski <aris@redhat.com> Cc: David Jeffery <djeffery@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07ipc: use private shmem or hugetlbfs inodes for shm segments.Stephen Smalley1-1/+1
The shm implementation internally uses shmem or hugetlbfs inodes for shm segments. As these inodes are never directly exposed to userspace and only accessed through the shm operations which are already hooked by security modules, mark the inodes with the S_PRIVATE flag so that inode security initialization and permission checking is skipped. This was motivated by the following lockdep warning: ====================================================== [ INFO: possible circular locking dependency detected ] 4.2.0-0.rc3.git0.1.fc24.x86_64+debug #1 Tainted: G W ------------------------------------------------------- httpd/1597 is trying to acquire lock: (&ids->rwsem){+++++.}, at: shm_close+0x34/0x130 but task is already holding lock: (&mm->mmap_sem){++++++}, at: SyS_shmdt+0x4b/0x180 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&mm->mmap_sem){++++++}: lock_acquire+0xc7/0x270 __might_fault+0x7a/0xa0 filldir+0x9e/0x130 xfs_dir2_block_getdents.isra.12+0x198/0x1c0 [xfs] xfs_readdir+0x1b4/0x330 [xfs] xfs_file_readdir+0x2b/0x30 [xfs] iterate_dir+0x97/0x130 SyS_getdents+0x91/0x120 entry_SYSCALL_64_fastpath+0x12/0x76 -> #2 (&xfs_dir_ilock_class){++++.+}: lock_acquire+0xc7/0x270 down_read_nested+0x57/0xa0 xfs_ilock+0x167/0x350 [xfs] xfs_ilock_attr_map_shared+0x38/0x50 [xfs] xfs_attr_get+0xbd/0x190 [xfs] xfs_xattr_get+0x3d/0x70 [xfs] generic_getxattr+0x4f/0x70 inode_doinit_with_dentry+0x162/0x670 sb_finish_set_opts+0xd9/0x230 selinux_set_mnt_opts+0x35c/0x660 superblock_doinit+0x77/0xf0 delayed_superblock_init+0x10/0x20 iterate_supers+0xb3/0x110 selinux_complete_init+0x2f/0x40 security_load_policy+0x103/0x600 sel_write_load+0xc1/0x750 __vfs_write+0x37/0x100 vfs_write+0xa9/0x1a0 SyS_write+0x58/0xd0 entry_SYSCALL_64_fastpath+0x12/0x76 ... Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Reported-by: Morten Stevens <mstevens@fedoraproject.org> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Paul Moore <paul@paul-moore.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-07ipc: modify message queue accounting to not take kernel data structures into ↵Marcus Gelderie1-5/+0
account A while back, the message queue implementation in the kernel was improved to use btrees to speed up retrieval of messages, in commit d6629859b36d ("ipc/mqueue: improve performance of send/recv"). That patch introducing the improved kernel handling of message queues (using btrees) has, as a by-product, changed the meaning of the QSIZE field in the pseudo-file created for the queue. Before, this field reflected the size of the user-data in the queue. Since, it also takes kernel data structures into account. For example, if 13 bytes of user data are in the queue, on my machine the file reports a size of 61 bytes. There was some discussion on this topic before (for example https://lkml.org/lkml/2014/10/1/115). Commenting on a th lkml, Michael Kerrisk gave the following background (https://lkml.org/lkml/2015/6/16/74): The pseudofiles in the mqueue filesystem (usually mounted at /dev/mqueue) expose fields with metadata describing a message queue. One of these fields, QSIZE, as originally implemented, showed the total number of bytes of user data in all messages in the message queue, and this feature was documented from the beginning in the mq_overview(7) page. In 3.5, some other (useful) work happened to break the user-space API in a couple of places, including the value exposed via QSIZE, which now includes a measure of kernel overhead bytes for the queue, a figure that renders QSIZE useless for its original purpose, since there's no way to deduce the number of overhead bytes consumed by the implementation. (The other user-space breakage was subsequently fixed.) This patch removes the accounting of kernel data structures in the queue. Reporting the size of these data-structures in the QSIZE field was a breaking change (see Michael's comment above). Without the QSIZE field reporting the total size of user-data in the queue, there is no way to deduce this number. It should be noted that the resource limit RLIMIT_MSGQUEUE is counted against the worst-case size of the queue (in both the old and the new implementation). Therefore, the kernel overhead accounting in QSIZE is not necessary to help the user understand the limitations RLIMIT imposes on the processes. Signed-off-by: Marcus Gelderie <redmnic@gmail.com> Acked-by: Doug Ledford <dledford@redhat.com> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Davidlohr Bueso <dbueso@suse.de> Cc: David Howells <dhowells@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: John Duffy <jb_duffy@btinternet.com> Cc: Arto Bendiken <arto@bendiken.net> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30ipc,sysv: return -EINVAL upon incorrect id/seqnumDavidlohr Bueso1-1/+1
In ipc_obtain_object_check we return -EIDRM when a bogus sequence number is detected via ipc_checkid, while the ipc manpages state the following return codes for such errors: EIDRM <ID> points to a removed identifier. EINVAL Invalid <ID> value, or unaligned, etc. EIDRM should only be returned upon a RMID call (->deleted check), and thus return EINVAL for wrong seq. This difference in semantics has also caused real bugs, ie: https://bugzilla.redhat.com/show_bug.cgi?id=246509 Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30ipc,sysv: make return -EIDRM when racing with RMID consistentDavidlohr Bueso1-5/+8
The ipc_lock helper is used by all forms of sysv ipc to acquire the ipc object's spinlock. Upon error (bogus identifier), we always return -EINVAL, whether the problem be in the idr path or because we raced with a task performing RMID. For the later, however, all ipc related manpages, state the that for: EIDRM <ID> points to a removed identifier. And return: EINVAL Invalid <ID> value, or unaligned, etc. Which (EINVAL) should only return once the ipc resource is deleted. For all types of ipc this is done immediately upon a RMID command. However, shared memory behaves slightly different as it can merely mark a segment for deletion, and delay the actual freeing until there are no more active consumers. Per shmctl(IPC_RMID) manpage: "" Mark the segment to be destroyed. The segment will only actually be destroyed after the last process detaches it (i.e., when the shm_nattch member of the associated structure shmid_ds is zero). "" Unlike ipc_lock, paths that behave "correctly", at least per the manpage, involve controlling the ipc resource via *ctl(), doing the exact same validity check as ipc_lock after right acquiring the spinlock: if (!ipc_valid_object()) { err = -EIDRM; goto out_unlock; } Thus make ipc_lock consistent with the rest of ipc code and return -EIDRM in ipc_lock when !ipc_valid_object(). Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30ipc: rename ipc_obtain_objectDavidlohr Bueso5-9/+9
... to ipc_obtain_object_idr, which is more meaningful and makes the code slightly easier to follow. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30ipc,msg: provide barrier pairings for lockless receiveDavidlohr Bueso1-10/+38
We currently use a full barrier on the sender side to to avoid receiver tasks disappearing on us while still performing on the sender side wakeup. We lack however, the proper CPU-CPU interactions pairing on the receiver side which busy-waits for the message. Similarly, we do not need a full smp_mb, and can relax the semantics for the writer and reader sides of the message. This is safe as we are only ordering loads and stores to r_msg. And in both smp_wmb and smp_rmb, there are no stores after the calls _anyway_. This obviously applies for pipelined_send and expunge_all, for EIRDM when destroying a queue. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30ipc,shm: move BUG_ON check into shm_lockDavidlohr Bueso1-5/+5
Upon every shm_lock call, we BUG_ON if an error was returned, indicating racing either in idr or in shm_destroy. Move this logic into the locking. [akpm@linux-foundation.org: simplify code] Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-30ipc/util.c: use kvfree() in ipc_rcu_free()Pekka Enberg1-4/+1
Use kvfree() instead of open-coding it. Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-05-08ipc/mqueue: Implement lockless pipelined wakeupsDavidlohr Bueso1-21/+33
This patch moves the wakeup_process() invocation so it is not done under the info->lock by making use of a lockless wake_q. With this change, the waiter is woken up once it is STATE_READY and it does not need to loop on SMP if it is still in STATE_PENDING. In the timeout case we still need to grab the info->lock to verify the state. This change should also avoid the introduction of preempt_disable() in -rt which avoids a busy-loop which pools for the STATE_PENDING -> STATE_READY change if the waiter has a higher priority compared to the waker. Additionally, this patch micro-optimizes wq_sleep by using the cheaper cousin of set_current_state(TASK_INTERRUPTABLE) as we will block no matter what, thus get rid of the implied barrier. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: George Spelvin <linux@horizon.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Chris Mason <clm@fb.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: dave@stgolabs.net Link: http://lkml.kernel.org/r/1430748166.1940.17.camel@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-26Merge branch 'for-linus' of ↵Linus Torvalds2-12/+12
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fourth vfs update from Al Viro: "d_inode() annotations from David Howells (sat in for-next since before the beginning of merge window) + four assorted fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RCU pathwalk breakage when running into a symlink overmounting something fix I_DIO_WAKEUP definition direct-io: only inc/dec inode->i_dio_count for file systems fs/9p: fix readdir() VFS: assorted d_backing_inode() annotations VFS: fs/inode.c helpers: d_inode() annotations VFS: fs/cachefiles: d_backing_inode() annotations VFS: fs library helpers: d_inode() annotations VFS: assorted weird filesystems: d_inode() annotations VFS: normal filesystems (and lustre): d_inode() annotations VFS: security/: d_inode() annotations VFS: security/: d_backing_inode() annotations VFS: net/: d_inode() annotations VFS: net/unix: d_backing_inode() annotations VFS: kernel/: d_inode() annotations VFS: audit: d_backing_inode() annotations VFS: Fix up some ->d_inode accesses in the chelsio driver VFS: Cachefiles should perform fs modifications on the top layer only VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-15ipc: remove use of seq_printf return valueJoe Perches4-50/+58
The seq_printf return value, because it's frequently misused, will eventually be converted to void. See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to seq_has_overflowed() and make public") Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15VFS: assorted weird filesystems: d_inode() annotationsDavid Howells2-12/+12
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-17ipc,sem: use current->state helpersDavidlohr Bueso1-1/+1
Call __set_current_state() instead of assigning the new state directly. These interfaces also aid CONFIG_DEBUG_ATOMIC_SLEEP environments, keeping track of who changed the state. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>