path: root/io_uring
Age | Commit message | Author | Files | Lines
2024-04-08io_uring/net: restore msg_control on sendzc retryPavel Begunkov1-0/+1
Commit cac9e4418f4cb ("io_uring/net: save msghdr->msg_control for retries") reinstates msg_control before every __sys_sendmsg_sock(), since the function can overwrite the value in msghdr. We need to do the same for zerocopy sendmsg. Cc: stable@vger.kernel.org Fixes: 493108d95f146 ("io_uring/net: zerocopy sendmsg") Link: https://github.com/axboe/liburing/issues/1067 Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cc1d5d9df0576fa66ddad4420d240a98a020b267.1712596179.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-05io_uring: Fix io_cqring_wait() not restoring sigmask on get_timespec64() failureAlexey Izbyshev1-13/+13
This bug was introduced in commit 950e79dd7313 ("io_uring: minor io_cqring_wait() optimization"), which was made in preparation for adc8682ec690 ("io_uring: Add support for napi_busy_poll"). The latter got reverted in cb3182167325 ("Revert "io_uring: Add support for napi_busy_poll""), so simply undo the former as well. Cc: stable@vger.kernel.org Fixes: 950e79dd7313 ("io_uring: minor io_cqring_wait() optimization") Signed-off-by: Alexey Izbyshev <izbyshev@ispras.ru> Link: https://lore.kernel.org/r/20240405125551.237142-1-izbyshev@ispras.ru Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02io_uring/kbuf: hold io_buffer_list reference over mmapJens Axboe3-14/+36
If we look up the kbuf, ensure that it doesn't get unregistered until after we're done with it. Since we're inside mmap, we cannot safely use the io_uring lock. Rely on the fact that we can lookup the buffer list under RCU now and grab a reference to it, preventing it from being unregistered until we're done with it. The lookup returns the io_buffer_list directly with it referenced. Cc: stable@vger.kernel.org # v6.4+ Fixes: 5cf4f52e6d8a ("io_uring: free io_buffer_list entries via RCU") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02io_uring/kbuf: protect io_buffer_list teardown with a referenceJens Axboe2-4/+13
No functional changes in this patch, just in preparation for being able to keep the buffer list alive outside of the ctx->uring_lock. Cc: stable@vger.kernel.org # v6.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02io_uring/kbuf: get rid of bl->is_readyJens Axboe2-10/+0
Now that xarray is being exclusively used for the buffer_list lookup, this check is no longer needed. Get rid of it and the is_ready member. Cc: stable@vger.kernel.org # v6.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02io_uring/kbuf: get rid of lower BGID listsJens Axboe2-64/+8
Just rely on the xarray for any kind of bgid. This simplifies things, and the special casing of lower BGIDs really doesn't bring us much, if anything. Cc: stable@vger.kernel.org # v6.4+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-02io_uring: use private workqueue for exit workJens Axboe1-1/+4
Rather than use the system unbound event workqueue, use an io_uring specific one. This avoids dependencies with the tty, which also uses the system_unbound_wq, and issues flushes of said workqueue from inside its poll handling. Cc: stable@vger.kernel.org Reported-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com> Tested-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com> Tested-by: Iskren Chernev <me@iskren.info> Link: https://github.com/axboe/liburing/issues/1113 Signed-off-by: Jens Axboe <axboe@kernel.dk>
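The core of this fix is simply a dedicated unbound workqueue for ring exit work. A minimal kernel-side sketch is below; the queue name "iou_exit" and the max_active value of 64 are illustrative assumptions, not a verbatim copy of the patch.

```c
#include <linux/workqueue.h>

static struct workqueue_struct *iou_wq;

/* Create a dedicated unbound workqueue so ring exit work no longer shares
 * system_unbound_wq with other users such as the tty layer. */
static int __init io_uring_exit_wq_init(void)
{
	iou_wq = alloc_workqueue("iou_exit", WQ_UNBOUND, 64);
	return iou_wq ? 0 : -ENOMEM;
}

/* Exit work is then queued to the private queue instead of the system one:
 *	queue_work(iou_wq, &ctx->exit_work);
 */
```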
2024-04-01io_uring: disable io-wq execution of multishot NOWAIT requestsJens Axboe1-4/+9
Do the same check for direct io-wq execution for multishot requests that commit 2a975d426c82 did for the inline execution, and disable multishot mode (and revert to single shot) if the file type doesn't support NOWAIT, and isn't opened in O_NONBLOCK mode. For multishot to work properly, it's a requirement that nonblocking read attempts can be done. Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-01io_uring/rw: don't allow multishot reads without NOWAIT supportJens Axboe1-1/+8
Supporting multishot reads requires support for NOWAIT, as the alternative would be always having io-wq execute the work item whenever the poll readiness triggered. Any fast file type will have NOWAIT support (eg it understands both O_NONBLOCK and IOCB_NOWAIT). If the given file type does not, then simply resort to single shot execution. Cc: stable@vger.kernel.org Fixes: fc68fcda04910 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Jens Axboe <axboe@kernel.dk>
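For context, a hedged userspace sketch of arming a multishot read is shown below, assuming liburing's io_uring_prep_read_multishot() helper (helper name/signature assumed from recent liburing); the fd must be a pollable, NOWAIT-capable file and buffers come from a provided buffer group.

```c
#include <liburing.h>

/* Arm a multishot read on a pipe/socket fd; buffers are taken from
 * provided buffer group 0, len == 0 means "use the selected buffer size". */
static int arm_multishot_read(struct io_uring *ring, int fd)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -EBUSY;
	io_uring_prep_read_multishot(sqe, fd, 0, 0, 0);
	return io_uring_submit(ring);
}
```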
2024-03-18io_uring/sqpoll: early exit thread if task_context wasn't allocatedJens Axboe1-1/+5
Ideally we'd want to simply kill the task rather than wake it, but for now let's just add a startup check that causes the thread to exit. This can only happen if io_uring_alloc_task_context() fails, which generally requires fault injection. Reported-by: Ubisectech Sirius <bugreport@ubisectech.com> Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-16io_uring: clear opcode specific data for an early failureJens Axboe1-9/+16
If failure happens before the opcode prep handler is called, ensure that we clear the opcode specific area of the request, which holds data specific to that request type. This prevents errors where opcode handlers never get a chance to clear their per-request private data, since prep isn't even called. Reported-and-tested-by: syzbot+f8e9a371388aa62ecab4@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-16io_uring/net: ensure async prep handlers always initialize ->done_ioJens Axboe1-1/+8
If we get a request with IOSQE_ASYNC set, then we first run the prep async handlers. But if we then fail setting it up and want to post a CQE with -EINVAL, we use ->done_io. This was previously guarded with REQ_F_PARTIAL_IO, and the normal setup handlers do set it up before any potential errors, but we need to cover the async setup too. Fixes: 9817ad85899f ("io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-15io_uring/waitid: always remove waitid entry for cancel allJens Axboe1-6/+1
We know the request is either being removed, or already in the process of being removed through task_work, so we can delete it from our waitid list upfront. This is important for cancel-all conditions, as we would otherwise find it multiple times and stall cancelation progress. Remove the dead check in cancelation as well for the hash_node being empty or not. We already have a waitid reference check for ownership, so we don't need to check the list too. Cc: stable@vger.kernel.org Fixes: f31ecf671ddc ("io_uring: add IORING_OP_WAITID support") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-15io_uring/futex: always remove futex entry for cancel allJens Axboe1-0/+1
We know the request is either being removed, or already in the process of being removed through task_work, so we can delete it from our futex list upfront. This is important for cancel-all conditions, as we would otherwise find it multiple times and stall cancelation progress. Cc: stable@vger.kernel.org Fixes: 194bb58c6090 ("io_uring: add support for futex wake and wait") Fixes: 8f350194d5cf ("io_uring: add support for vectored futex waits") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-15io_uring: fix poll_remove stalled req completionPavel Begunkov1-2/+2
Taking the ctx lock is not enough to use the deferred request completion infrastructure: the request will get queued into the list, but no one would expect it there, so it will sit there until the next io_submit_flush_completions(). It's hard to care about the cancellation path, so complete it via tw. Fixes: ef7dfac51d8ed ("io_uring/poll: serialize poll linked timer start with poll removal") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c446740bc16858f8a2a8dcdce899812f21d15f23.1710514702.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13io_uring: Fix release of pinned pages when __io_uaddr_map failsGabriel Krisman Bertazi1-9/+13
Looking at the error path of __io_uaddr_map, if we fail after pinning the pages for any reason, ret will be set to -EINVAL and the error handler won't properly release the pinned pages. I didn't manage to trigger it without forcing a failure, but it can happen in real life when memory is heavily fragmented. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Fixes: 223ef4743164 ("io_uring: don't allow IORING_SETUP_NO_MMAP rings on highmem pages") Link: https://lore.kernel.org/r/20240313213912.1920-1-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13io_uring/kbuf: rename is_mappedPavel Begunkov2-11/+11
In buffer lists we have ->is_mapped as well as ->is_mmap, it's pretty hard to stay sane double checking which one means what, and in the long run there is a high chance of an eventual bug. Rename ->is_mapped into ->is_buf_ring. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c4838f4d8ad506ad6373f1c305aee2d2c1a89786.1710343154.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13io_uring: simplify io_pages_freePavel Begunkov1-5/+1
We never pass a null (top-level) pointer, remove the check. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0e1a46f9a5cd38e6876905e8030bdff9b0845e96.1710343154.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-12io_uring: clean rings on NO_MMAP alloc failPavel Begunkov1-2/+3
We make a few cancellation judgements based on ctx->rings, so let's zero it after deallocation for IORING_SETUP_NO_MMAP just like it's done with the mmap case. Likely, it's not a real problem, but zeroing is safer and better tested. Cc: stable@vger.kernel.org Fixes: 03d89a2de25bbc ("io_uring: support for user allocated memory for rings/sqes") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9ff6cdf91429b8a51699c210e1f6af6ea3f8bdcf.1710255382.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-12io_uring/rw: return IOU_ISSUE_SKIP_COMPLETE for multishot retryJens Axboe1-0/+2
If read multishot is being invoked from the poll retry handler, then we should return IOU_ISSUE_SKIP_COMPLETE rather than -EAGAIN. If not, then a CQE will be posted with -EAGAIN rather than triggering the retry when the file is flagged as readable again. Cc: stable@vger.kernel.org Reported-by: Sargun Dhillon <sargun@meta.com> Fixes: fc68fcda04910 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-11io_uring: don't save/restore iowait stateJens Axboe1-3/+2
This kind of state is per-syscall, and since the waiting is done from within the io_uring_enter(2) syscall, there's no way that iowait can already be set for this case. Simplify it by setting it if we need to, and always clearing it to 0 when done. Fixes: 7b72d661f1f2 ("io_uring: gate iowait schedule on having pending requests") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-11Merge tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linuxLinus Torvalds23-395/+1067
Pull io_uring updates from Jens Axboe: - Make running of task_work internal loops more fair, and unify how the different methods deal with them (me) - Support for per-ring NAPI. The two minor networking patches are in a shared branch with netdev (Stefan) - Add support for truncate (Tony) - Export SQPOLL utilization stats (Xiaobing) - Multishot fixes (Pavel) - Fix for a race in manipulating the request flags via poll (Pavel) - Cleanup the multishot checking by making it generic, moving it out of opcode handlers (Pavel) - Various tweaks and cleanups (me, Kunwu, Alexander) * tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux: (53 commits) io_uring: Fix sqpoll utilization check racing with dying sqpoll io_uring/net: dedup io_recv_finish req completion io_uring: refactor DEFER_TASKRUN multishot checks io_uring: fix mshot io-wq checks io_uring/net: add io_req_msg_cleanup() helper io_uring/net: simplify msghd->msg_inq checking io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLE io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io io_uring/net: correctly handle multishot recvmsg retry setup io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handler io_uring: fix io_queue_proc modifying req->flags io_uring: fix mshot read defer taskrun cqe posting io_uring/net: fix overflow check in io_recvmsg_mshot_prep() io_uring/net: correct the type of variable io_uring/sqpoll: statistics of the true utilization of sq threads io_uring/net: move recv/recvmsg flags out of retry loop io_uring/kbuf: flag request if buffer pool is empty after buffer pick io_uring/net: improve the usercopy for sendmsg/recvmsg io_uring/net: move receive multishot out of the generic msghdr path io_uring/net: unify how recvmsg and sendmsg copy in the msghdr ...
2024-03-09io_uring: Fix sqpoll utilization check racing with dying sqpollGabriel Krisman Bertazi1-5/+12
Commit 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads"), currently in Jens for-next branch, peeks at io_sq_data->thread to report utilization statistics. But, If io_uring_show_fdinfo races with sqpoll terminating, even though we hold the ctx lock, sqd->thread might be NULL and we hit the Oops below. Note that we could technically just protect the getrusage() call and the sq total/work time calculations. But showing some sq information (pid/cpu) and not other information (utilization) is more confusing than not reporting anything, IMO. So let's hide it all if we happen to race with a dying sqpoll. This can be triggered consistently in my vm setup running sqpoll-cancel-hang.t in a loop. BUG: kernel NULL pointer dereference, address: 00000000000007b0 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 16587 Comm: systemd-coredum Not tainted 6.8.0-rc3-g3fcb9d17206e-dirty #69 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022 RIP: 0010:getrusage+0x21/0x3e0 Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89 RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282 RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60 RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000 R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000 R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000 FS: 00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> ? __die_body+0x1a/0x60 ? page_fault_oops+0x154/0x440 ? srso_alias_return_thunk+0x5/0xfbef5 ? do_user_addr_fault+0x174/0x7c0 ? srso_alias_return_thunk+0x5/0xfbef5 ? exc_page_fault+0x63/0x140 ? asm_exc_page_fault+0x22/0x30 ? getrusage+0x21/0x3e0 ? seq_printf+0x4e/0x70 io_uring_show_fdinfo+0x9db/0xa10 ? srso_alias_return_thunk+0x5/0xfbef5 ? vsnprintf+0x101/0x4d0 ? srso_alias_return_thunk+0x5/0xfbef5 ? seq_vprintf+0x34/0x50 ? srso_alias_return_thunk+0x5/0xfbef5 ? seq_printf+0x4e/0x70 ? seq_show+0x16b/0x1d0 ? __pfx_io_uring_show_fdinfo+0x10/0x10 seq_show+0x16b/0x1d0 seq_read_iter+0xd7/0x440 seq_read+0x102/0x140 vfs_read+0xae/0x320 ? srso_alias_return_thunk+0x5/0xfbef5 ? 
__do_sys_newfstat+0x35/0x60 ksys_read+0xa5/0xe0 do_syscall_64+0x50/0x110 entry_SYSCALL_64_after_hwframe+0x6e/0x76 RIP: 0033:0x7f786ec1db4d Code: e8 46 e3 01 00 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 80 3d d9 ce 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec RSP: 002b:00007ffcb361a4b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 000055a4c8fe42f0 RCX: 00007f786ec1db4d RDX: 0000000000000400 RSI: 000055a4c8fe48a0 RDI: 0000000000000006 RBP: 00007f786ecfb0b0 R08: 00007f786ecfb2a8 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f786ecfaf60 R13: 000055a4c8fe42f0 R14: 0000000000000000 R15: 00007ffcb361a628 </TASK> Modules linked in: CR2: 00000000000007b0 ---[ end trace 0000000000000000 ]--- RIP: 0010:getrusage+0x21/0x3e0 Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89 RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282 RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60 RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000 R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000 R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000 FS: 00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0 PKRU: 55555554 Kernel panic - not syncing: Fatal exception Kernel Offset: 0x1ce00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Fixes: 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240309003256.358-1-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring/net: dedup io_recv_finish req completionPavel Begunkov1-12/+4
There are two blocks in io_recv_finish() completing the request, which we can combine, removing the jump. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0e338dcb33c88de83809fda021cba9e7c9681620.1709905727.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring: refactor DEFER_TASKRUN multishot checksPavel Begunkov3-23/+20
We disallow DEFER_TASKRUN multishots from running by io-wq, which is checked by individual opcodes in the issue path. We can consolidate it all in io_wq_submit_work(), at the same time moving the checks out of the hot path. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e492f0f11588bb5aa11d7d24e6f53b7c7628afdb.1709905727.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring: fix mshot io-wq checksPavel Begunkov1-1/+1
When checking for concurrent CQE posting, we're not only interested in requests running from the poll handler but also in stray requests that ended up in normal io-wq execution. We're disallowing multishots from io-wq in general, not only when they came in a certain way. Cc: stable@vger.kernel.org Fixes: 17add5cea2bba ("io_uring: force multishot CQEs into task context") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d8c5b36a39258036f93301cd60d3cd295e40653d.1709905727.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring/net: add io_req_msg_cleanup() helperJens Axboe1-12/+15
For the fast inline path, we manually recycle the io_async_msghdr and free the iovec, and then clear the REQ_F_NEED_CLEANUP flag to avoid that needing to be done in the slower path. We already do this in two spots, and in preparation for adding more, add a helper and use it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring/net: simplify msghd->msg_inq checkingJens Axboe1-2/+2
Just check for larger than zero rather than for non-zero and not -1. This is easier to read, and also protects against any errant < 0 values that aren't -1. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLEJens Axboe4-32/+13
We only use the flag for this purpose, so rename it accordingly. This also discourages various other use cases of it, keeping it clean and consistent. Then we can also check it in one spot, when a recycle is attempted, and remove some dead code in io_kbuf_recycle_ring(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_ioJens Axboe1-5/+7
Ensure that prep handlers always initialize sr->done_io before any potential failure conditions, and with that, we know it's always been set, even for the failure case. With that, we don't need to use the REQ_F_PARTIAL_IO flag to gate on that. Additionally, we should not overwrite req->cqe.res unless sr->done_io is actually positive. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07io_uring/net: correctly handle multishot recvmsg retry setupJens Axboe1-1/+2
If we loop for multishot receive on the initial attempt, and then abort later on to wait for more, we miss a case where we should be copying the io_async_msghdr from the stack to stable storage. This leads to the next retry potentially failing, if the application had the msghdr on the stack. Cc: stable@vger.kernel.org Fixes: 9bb66906f23e ("io_uring: support multishot in recvmsg") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handlerJens Axboe1-0/+1
This flag should not be persistent across retries, so ensure we clear it before potentially attempting a retry. Fixes: c3f9109dbc9e ("io_uring/kbuf: flag request if buffer pool is empty after buffer pick") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07io_uring: fix io_queue_proc modifying req->flagsPavel Begunkov1-8/+11
With multiple poll entries, __io_queue_proc() might be running in parallel with poll handlers and possibly task_work, so we should not be carelessly modifying req->flags there. io_poll_double_prepare() handles a similar case with locking, but it's much easier to move it into __io_arm_poll_handler(). Cc: stable@vger.kernel.org Fixes: 595e52284d24a ("io_uring/poll: don't enable lazy wake for POLLEXCLUSIVE") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/455cc49e38cf32026fa1b49670be8c162c2cb583.1709834755.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07io_uring: fix mshot read defer taskrun cqe postingPavel Begunkov1-0/+2
We can't post CQEs from io-wq with DEFER_TASKRUN set; normal completions are handled, but aux CQEs should be explicitly disallowed by opcode handlers. Cc: stable@vger.kernel.org Fixes: fc68fcda04910 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6fb7cba6f5366da25f4d3eb95273f062309d97fa.1709740837.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-04io_uring/net: fix overflow check in io_recvmsg_mshot_prep()Dan Carpenter1-2/+2
The "controllen" variable is type size_t (unsigned long). Casting it to int could lead to an integer underflow. The check_add_overflow() function considers the type of the destination which is type int. If we add two positive values and the result cannot fit in an integer then that's counted as an overflow. However, if we cast "controllen" to an int and it turns negative, then negative values *can* fit into an int type so there is no overflow. Good: 100 + (unsigned long)-4 = 96 <-- overflow Bad: 100 + (int)-4 = 96 <-- no overflow I deleted the cast of the sizeof() as well. That's not a bug but the cast is unnecessary. Fixes: 9b0fc3c054ff ("io_uring: fix types in io_recvmsg_multishot_overflow") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/138bd2e2-ede8-4bcc-aa7b-f3d9de167a37@moroto.mountain Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-04io_uring/net: correct the type of variableMuhammad Usama Anjum1-1/+1
The namelen is of type int. It shouldn't be made size_t which is unsigned. The signed number is needed for error checking before use. Fixes: c55978024d12 ("io_uring/net: move receive multishot out of the generic msghdr path") Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Link: https://lore.kernel.org/r/20240301144349.2807544-1-usama.anjum@collabora.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01io_uring/sqpoll: statistics of the true utilization of sq threadsXiaobing Li3-1/+24
Count the running time and actual IO processing time of the sqpoll thread, and output the statistical data to fdinfo. Variable description: "work_time" in the code represents the sum of the jiffies of the sq thread actually processing IO, that is, how many milliseconds it actually takes to process IO. "total_time" represents the total time that the sq thread has elapsed from the beginning of the loop to the current time point, that is, how many milliseconds it has spent in total. The test tool is fio, and its parameters are as follows:

    [global]
    ioengine=io_uring
    direct=1
    group_reporting
    bs=128k
    norandommap=1
    randrepeat=0
    refill_buffers
    ramp_time=30s
    time_based
    runtime=1m
    clocksource=clock_gettime
    overwrite=1
    log_avg_msec=1000
    numjobs=1

    [disk0]
    filename=/dev/nvme0n1
    rw=read
    iodepth=16
    hipri
    sqthread_poll=1

The test results are as follows:

    Every 2.0s: cat /proc/9230/fdinfo/6 | grep -E Sq
    SqMask:        0x3
    SqHead:        3197153
    SqTail:        3197153
    CachedSqHead:  3197153
    SqThread:      9231
    SqThreadCpu:   11
    SqTotalTime:   18099614
    SqWorkTime:    16748316

The test results corresponding to different iodepths are as follows:

    |-----------|-------|-------|-------|-------|-------|
    | iodepth   |   1   |   4   |   8   |  16   |  64   |
    |-----------|-------|-------|-------|-------|-------|
    |utilization| 2.9%  | 8.8%  | 10.9% | 92.9% | 84.4% |
    |-----------|-------|-------|-------|-------|-------|
    | idle      | 97.1% | 91.2% | 89.1% | 7.1%  | 15.6% |
    |-----------|-------|-------|-------|-------|-------|

Signed-off-by: Xiaobing Li <xiaobing.li@samsung.com> Link: https://lore.kernel.org/r/20240228091251.543383-1-xiaobing.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
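For reference, the utilization figures in the table are just the ratio of the two new fdinfo fields. A trivial userspace sketch using the sample values above (the values are copied from the fdinfo output, nothing else is assumed):

```c
#include <stdio.h>

int main(void)
{
	/* SqTotalTime / SqWorkTime values from the fdinfo sample above */
	unsigned long long total = 18099614, work = 16748316;

	/* ~92.5%, in line with the ~93% reported for the iodepth=16 run */
	printf("sqpoll utilization: %.1f%%\n", 100.0 * work / total);
	return 0;
}
```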
2024-03-01io_uring/net: move recv/recvmsg flags out of retry loopJens Axboe1-7/+8
The flags don't change, so just initialize them once rather than on every loop iteration for multishot. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27io_uring/kbuf: flag request if buffer pool is empty after buffer pickJens Axboe1-2/+8
Normally we do an extra roundtrip for retries even if the buffer pool has been depleted, as we don't check that upfront. Rather than add this check, have the buffer selection methods mark the request with REQ_F_BL_EMPTY if the used buffer group is out of buffers after this selection. This is very cheap to do once we're all the way inside there anyway, and it gives the caller a chance to make better decisions on how to proceed. For example, recv/recvmsg multishot could check this flag when it decides whether to keep receiving or not. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27io_uring/net: improve the usercopy for sendmsg/recvmsgJens Axboe1-7/+22
We're spending a considerable amount of the sendmsg/recvmsg time just copying in the message header. And for provided buffers, the known single entry iovec. Be a bit smarter about it and enable/disable user access around our copying. In a test case that does both sendmsg and recvmsg, the runtime before this change (averaged over multiple runs, very stable times however):

    Kernel        Time        Diff
    ====================================
    -git          4720 usec
    -git+commit   4311 usec   -8.7%

and looking at a profile diff, we see the following:

    0.25%   +9.33%   [kernel.kallsyms]   [k] _copy_from_user
    4.47%   -3.32%   [kernel.kallsyms]   [k] __io_msg_copy_hdr.constprop.0

where we drop more than 9% of _copy_from_user() time, and consequently add time to __io_msg_copy_hdr() where the copies are now attributed to, but with a net win of 6%. In comparison, the same test case with send/recv runs in 3745 usec, which is (expectedly) still quite a bit faster. But at least sendmsg/recvmsg is now only ~13% slower, where it was ~21% slower before. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27io_uring/net: move receive multishot out of the generic msghdr pathJens Axboe1-70/+91
Move the actual user_msghdr / compat_msghdr into the send and receive sides, respectively, so we can move the uaddr receive handling into its own handler, and ditto the multishot with buffer selection logic. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27io_uring/net: unify how recvmsg and sendmsg copy in the msghdrJens Axboe1-129/+142
For recvmsg, we roll our own since we support buffer selections. This isn't the case for sendmsg right now, but in preparation for doing so, make the recvmsg copy helpers generic so we can call them from the sendmsg side as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-15io_uring/napi: enable even with a timeout of 0Jens Axboe1-4/+5
1 usec is not as short as it used to be, and it makes sense to allow 0 for a busy poll timeout - this means just do one loop to check if we have anything available. Add a separate ->napi_enabled to check if napi has been enabled or not. While at it, move the writing of the ctx napi values after we've copied the old values back to userspace. This ensures that if the call fails, we'll be in the same state as we were before, rather than some indeterminate state. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-15io_uring: kill stale comment for io_cqring_overflow_kill()Jens Axboe1-1/+0
This function now deals only with discarding overflow entries on ring free and exit, and it no longer returns whether we successfully flushed all entries as there's no CQE posting involved anymore. Kill the outdated comment. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-14io_uring/net: fix multishot accept overflow handlingJens Axboe1-2/+3
If we hit CQ ring overflow when attempting to post a multishot accept completion, we don't properly save the result or return code. This results in losing the accepted fd value. Instead, we return the result from the poll operation that triggered the accept retry. This is generally POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND which is 0xc3, or 195, which looks like a valid file descriptor, but it really has no connection to that. Handle this like we do for other multishot completions - assign the result, and return IOU_STOP_MULTISHOT to cancel any further completions from this request when overflow is hit. This preserves the result, as we should, and tells the application that the request needs to be re-armed. Cc: stable@vger.kernel.org Fixes: 515e26961295 ("io_uring: revert "io_uring fix multishot accept ordering"") Link: https://github.com/axboe/liburing/issues/1062 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-14io_uring/sqpoll: use the correct check for pending task_workJens Axboe1-1/+8
A previous commit moved to using just the private task_work list for SQPOLL, but it neglected to update the check for whether we have pending task_work. Normally this is fine as we'll attempt to run it unconditionally, but if we race with going to sleep AND task_work being added, then we certainly need the right check here. Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-14io_uring: wake SQPOLL task when task_work is added to an empty queueJens Axboe1-1/+6
If there's no current work on the list, we still need to potentially wake the SQPOLL task if it is sleeping. This is ordered with the wait queue addition in sqpoll, which adds to the wait queue before checking for pending work items. Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-14io_uring/napi: ensure napi polling is aborted when work is availableJens Axboe3-12/+12
While testing io_uring NAPI with DEFER_TASKRUN, I ran into slowdowns and stalls in packet delivery. Turns out that while io_napi_busy_loop_should_end() aborts appropriately on regular task_work, it does not abort if we have local task_work pending. Move io_has_work() into the private io_uring.h header, and gate whether we should continue polling on that as well. This makes NAPI polling on send/receive work as designed with IORING_SETUP_DEFER_TASKRUN as well. Fixes: 8d0c12a80cde ("io-uring: add napi busy poll support") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-12io_uring: Don't include af_unix.h.Kuniyuki Iwashima4-3/+2
Changes to AF_UNIX trigger rebuild of io_uring, but io_uring does not use AF_UNIX anymore. Let's not include af_unix.h and instead include necessary headers. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240212234236.63714-1-kuniyu@amazon.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09io_uring: add register/unregister napi functionStefan Roesch3-0/+76
This adds an api to register and unregister the napi for io-uring. If the arg value is specified when unregistering, the current napi setting for the busy poll timeout is copied into the user structure. If this is not required, NULL can be passed as the arg value. Signed-off-by: Stefan Roesch <shr@devkernel.io> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230608163839.2891748-7-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
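As a rough userspace illustration of the API, the sketch below assumes liburing's io_uring_register_napi() helper and the struct io_uring_napi layout (busy_poll_to in microseconds plus a prefer_busy_poll flag); both the helper and the field names are assumptions, so check your liburing/kernel headers.

```c
#include <liburing.h>

/* Enable NAPI busy polling for sockets used on this ring. */
static int enable_napi_busy_poll(struct io_uring *ring)
{
	struct io_uring_napi napi = {
		.busy_poll_to = 50,	/* busy poll timeout in usec */
		.prefer_busy_poll = 1,
	};

	return io_uring_register_napi(ring, &napi);
}
```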
2024-02-09io-uring: add sqpoll support for napi busy pollStefan Roesch3-1/+33
This adds the sqpoll support to the io-uring napi. Signed-off-by: Stefan Roesch <shr@devkernel.io> Suggested-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/20230608163839.2891748-6-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09io-uring: add napi busy poll supportStefan Roesch6-0/+359
This adds the napi busy polling support in io_uring.c. It adds a new napi_list to the io_ring_ctx structure. This list contains the list of napi_id's that are currently enabled for busy polling. The list is synchronized by the new napi_lock spin lock. The current default napi busy polling time is stored in napi_busy_poll_to. If napi busy polling is not enabled, the value is 0. In addition there is also a hash table. The hash table stores the napi id and the pointer to the above list nodes. The hash table is used to speed up the lookup of the list elements. The hash table is synchronized with rcu. The NAPI_TIMEOUT is stored as a timeout to make sure that the time a napi entry is stored in the napi list is limited. The busy poll timeout is also stored as part of the io_wait_queue. This is necessary as for sq polling the poll interval needs to be adjusted and the napi callback only allows one value to be passed in. This has been tested with two simple programs from the liburing library repository: the napi client and the napi server program. The client sends a request, which has a timestamp in its payload, and the server replies with the same payload. The client calculates the roundtrip time and stores it to calculate the results. The client is running on host1 and the server is running on host2 (in the same rack). The measured times below are roundtrip times. They are average times over 5 runs each. Each run measures 1 million roundtrips.

                                                  no rx coal   rx coal: frames=88,usecs=33
    Default                                          57us         56us
    client_poll=100us                                47us         46us
    server_poll=100us                                51us         46us
    client_poll=100us + server_poll=100us            40us         40us
    client_poll=100us + server_poll=100us,
      prefer napi busy poll on client                41us         39us
    client_poll=100us + server_poll=100us,
      prefer napi busy poll on server                41us         39us
    client_poll=100us + server_poll=100us,
      prefer napi busy poll on client + server       41us         39us

Signed-off-by: Stefan Roesch <shr@devkernel.io> Suggested-by: Olivier Langlois <olivier@trillion01.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230608163839.2891748-5-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09io-uring: move io_wait_queue definition to header fileStefan Roesch2-21/+22
This moves the definition of the io_wait_queue structure to the header file so it can be also used from other files. Signed-off-by: Stefan Roesch <shr@devkernel.io> Link: https://lore.kernel.org/r/20230608163839.2891748-4-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09io_uring: add support for ftruncateTony Solomonik4-1/+63
Adds support for doing truncate through io_uring, eliminating the need for applications to roll their own thread pool or offload mechanism to be able to do non-blocking truncates. Signed-off-by: Tony Solomonik <tony.solomonik@gmail.com> Link: https://lore.kernel.org/r/20240202121724.17461-3-tony.solomonik@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
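A hedged userspace sketch of what this enables is below, assuming liburing grew a matching io_uring_prep_ftruncate() helper (the helper name/signature is an assumption based on recent liburing releases):

```c
#include <liburing.h>
#include <sys/types.h>

/* Queue an asynchronous truncate via IORING_OP_FTRUNCATE. */
static int queue_ftruncate(struct io_uring *ring, int fd, off_t len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -EBUSY;
	io_uring_prep_ftruncate(sqe, fd, len);
	return io_uring_submit(ring);
}
```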
2024-02-08io_uring: Simplify the allocation of slab cachesKunwu Chan1-3/+2
commit 0a31bd5f2bbb ("KMEM_CACHE(): simplify slab cache creation") introduces a new macro. Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Link: https://lore.kernel.org/r/20240130100247.81460-1-chentao@kylinos.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
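For illustration, KMEM_CACHE() simply derives the cache name, object size, and alignment from the struct type. The struct and flags below are illustrative, not necessarily the exact io_uring cache touched by this patch:

```c
/* Before: open-coded create, repeating the struct name, size and alignment */
cachep = kmem_cache_create("io_buffer", sizeof(struct io_buffer),
			   __alignof__(struct io_buffer), SLAB_ACCOUNT, NULL);

/* After: KMEM_CACHE() derives all of that from the type itself */
cachep = KMEM_CACHE(io_buffer, SLAB_ACCOUNT);
```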
2024-02-08io_uring/sqpoll: manage task_work privatelyJens Axboe3-17/+82
Decouple from task_work running, and cap the number of entries we process at a time. If we exceed that number, push remaining entries to a retry list that we'll process first next time. We cap the number of entries to process at 8, which is fairly random. We just want to get enough per-ctx batching here, while not processing endlessly. Since we manually run PF_IO_WORKER related task_work anyway as the task never exits to userspace, with this we no longer need to add an actual task_work item to the per-process list. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: pass in counter to handle_tw_list() rather than return itJens Axboe1-5/+3
No functional changes in this patch, just in preparation for returning something other than count from this helper. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: cleanup handle_tw_list() calling conventionJens Axboe1-16/+13
Now that we don't loop around task_work anymore, there's no point in maintaining the ring and locked state outside of handle_tw_list(). Get rid of passing in those pointers (and pointers to pointers) and just do the management internally in handle_tw_list(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring/poll: improve readability of poll reference decrementingJens Axboe1-2/+2
This overly long line is hard to read. Break it up by AND'ing the ref mask first, then perform the atomic_sub_return() with the value itself. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: remove unconditional looping in local task_work handlingJens Axboe1-15/+29
If we have a ton of notifications coming in, we can be looping in here for a long time. This can be problematic for various reasons, mostly because we can starve userspace. If the application is waiting on N events, then only re-run if we need more events. Fixes: c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: remove next io_kiocb fetch in task_work runningJens Axboe1-3/+0
We just reversed the task_work list and that will have touched requests as well, just get rid of this optimization as it should not make a difference anymore. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: handle traditional task_work in FIFO orderJens Axboe1-1/+1
For local task_work, which is used if IORING_SETUP_DEFER_TASKRUN is set, we reverse the order of the lockless list before processing the work. This is done to process items in the order in which they were queued, as the llist always adds to the head. Do the same for traditional task_work, so we have the same behavior for both types. Signed-off-by: Jens Axboe <axboe@kernel.dk>
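A minimal sketch of the pattern described here: llist_add() always pushes to the head of a lockless list (LIFO), so draining the list and calling llist_reverse_order() yields oldest-first processing. The per-item handler is hypothetical.

```c
#include <linux/llist.h>

static void run_task_work_fifo(struct llist_head *list)
{
	struct llist_node *node = llist_del_all(list);

	/* Entries were pushed to the head; reverse so the oldest queued
	 * item runs first. */
	node = llist_reverse_order(node);
	while (node) {
		struct llist_node *next = node->next;

		/* handle_one(node); -- hypothetical per-item handler */
		node = next;
	}
}
```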
2024-02-08io_uring: remove 'loops' argument from trace_io_uring_task_work_run()Jens Axboe1-1/+1
We no longer loop in task_work handling, hence delete the argument from the tracepoint as it's always 1 and hence not very informative. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: remove looping around handling traditional task_workJens Axboe1-38/+7
A previous commit added looping around handling traditional task_work as an optimization, and while that may seem like a good idea, it's also possible to run into application starvation doing so. If the task_work generation is bursty, we can get very deep task_work queues, and we can end up looping in here for a very long time. One immediately observable problem with that is handling network traffic using provided buffers, where flooding incoming traffic and looping task_work handling will very quickly lead to buffer starvation as we keep running task_work rather than returning to the application so it can handle the associated CQEs and also provide buffers back. Fixes: 3a0c037b0e16 ("io_uring: batch task_work") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring/kbuf: cleanup passing back cflagsJens Axboe2-24/+31
We have various functions calculating the CQE cflags we need to pass back, but it's all the same everywhere. Make a number of the putting functions void, and just have the two main helpers for this, io_put_kbuf() and io_put_kbuf_comp(), calculate the actual mask and pass it back. While at it, clean up how we put REQ_F_BUFFER_RING buffers. Before this change, we would call into __io_put_kbuf() only to go right back into the header defined functions. As clearing this type of buffer is just re-assigning the buf_index and incrementing the head, this is very wasteful. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring/rw: remove dead file == NULL checkJens Axboe1-1/+1
Any read/write opcode has needs_file == true, which means that we would've failed the request long before reaching the issue stage if we didn't successfully assign a file. This check has been dead forever, and is really a leftover from generic code. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: cleanup io_req_complete_post()Jens Axboe1-4/+4
Move the ctx declaration and assignment up to be generally available in the function, as we use req->ctx at the top anyway. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: mark the need to lock/unlock the ring as unlikelyJens Axboe1-2/+2
Any of the fast paths will already have this locked; this helper only exists to deal with io-wq invoking request issue, where we do not hold ctx->uring_lock already. Since any common or fast path will already have this locked, mark it as such. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: add io_file_can_poll() helperJens Axboe5-6/+18
This adds a flag to avoid having to dereference ->file and then ->f_op to figure out whether the file has a poll handler defined or not. We generally call this at least twice for networked workloads, and if using ring provided buffers, we do it on every buffer selection. Particularly the latter is troublesome, as it's otherwise a very fast operation. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring/cancel: don't default to setting req->work.cancel_seqJens Axboe4-8/+12
Just leave it unset by default, avoiding dipping into the last cacheline (which is otherwise untouched) for the fast path of using poll to drive networked traffic. Add a flag that tells us if the sequence is valid or not, and then we can defer actually assigning the flag and sequence until someone runs cancelations. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: expand main struct io_kiocb flags to 64-bitsJens Axboe2-5/+6
We're out of space here, and none of the flags are easily reclaimable. Bump it to 64-bits and re-arrange the struct a bit to avoid gaps. Add a specific bitwise type for the request flags, io_request_flags_t. This will help catch violations of casting this value to a smaller type on 32-bit archs, like unsigned int. This creates a hole in the io_kiocb, so move nr_tw up and rsrc_node down to retain needing only cacheline 0 and 1 for non-polled opcodes. No functional changes intended in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
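A minimal sketch of the kind of sparse-checked 64-bit flag type described here; the exact bit assignments are illustrative, but the io_request_flags_t name and the flag names follow the message above. The __bitwise/__force annotations make sparse warn if the value is silently truncated, e.g. by a cast to unsigned int on 32-bit.

```c
#include <linux/types.h>
#include <linux/bits.h>

typedef u64 __bitwise io_request_flags_t;

#define REQ_F_FIXED_FILE	((__force io_request_flags_t) BIT_ULL(0))
#define REQ_F_IO_DRAIN		((__force io_request_flags_t) BIT_ULL(1))
```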
2024-02-06io_uring: use file_mnt_idmap helperAlexander Mikhalitsyn1-1/+1
Let's use file_mnt_idmap() as we do that across the tree. No functional impact. Cc: Christian Brauner <brauner@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: io-uring@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Link: https://lore.kernel.org/r/20240129180024.219766-2-aleksandr.mikhalitsyn@canonical.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-01io_uring/net: fix sr->len for IORING_OP_RECV with MSG_WAITALL and buffersJens Axboe1-0/+1
If we use IORING_OP_RECV with provided buffers and pass in '0' as the length of the request, the length is retrieved from the selected buffer. If MSG_WAITALL is also set and we get a short receive, then we may hit the retry path which decrements sr->len and increments the buffer for a retry. However, the length is still zero at this point, which means that sr->len now becomes huge and import_ubuf() will cap it to MAX_RW_COUNT and subsequently return -EFAULT for the range as a whole. Fix this by always assigning sr->len once the buffer has been selected. Cc: stable@vger.kernel.org Fixes: 7ba89d2af17a ("io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-29io_uring/net: limit inline multishot retriesJens Axboe1-3/+20
If we have multiple clients and some/all are flooding the receives to such an extent that we can retry a LOT handling multishot receives, then we can be starving some clients and hence serving traffic in an imbalanced fashion. Limit multishot retry attempts to some arbitrary value, whose only purpose is to ensure that we don't keep serving a single connection for way too long. We default to 32 retries, which should be more than enough to provide fairness, yet not so small that we'll spend too much time requeuing rather than handling traffic. Cc: stable@vger.kernel.org Depends-on: 704ea888d646 ("io_uring/poll: add requeue return code from poll multishot handling") Depends-on: 1e5d765a82f ("io_uring/net: un-indent mshot retry path in io_recv_finish()") Depends-on: e84b01a880f6 ("io_uring/poll: move poll execution helpers higher up") Fixes: b3fdea6ecb55 ("io_uring: multishot recv") Fixes: 9bb66906f23e ("io_uring: support multishot in recvmsg") Link: https://github.com/axboe/liburing/issues/1043 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-29io_uring/poll: add requeue return code from poll multishot handlingJens Axboe2-2/+15
Since our poll handling is edge triggered, multishot handlers retry internally until they know that no more data is available. In preparation for limiting these retries, add an internal return code, IOU_REQUEUE, which can be used to inform the poll backend about the handler wanting to retry, but that this should happen through a normal task_work requeue rather than keep hammering on the issue side for this one request. No functional changes in this patch, nobody is using this return code just yet. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-29io_uring/net: un-indent mshot retry path in io_recv_finish()Jens Axboe1-16/+20
In preparation for putting some retry logic in there, have the done path just skip straight to the end rather than have too much nesting in here. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-29io_uring/poll: move poll execution helpers higher upJens Axboe1-20/+20
In preparation for calling __io_poll_execute() higher up, move the functions to avoid forward declarations. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-28io_uring/rw: ensure poll based multishot read retries appropriatelyJens Axboe2-1/+18
io_read_mshot() always relies on poll triggering retries, and this works fine as long as we do a retry per size of the buffer being read. The buffer size is given by the size of the buffer(s) in the given buffer group ID. But if we're reading less than what is available, then we don't always get to read everything that is available. For example, if the buffers available are 32 bytes and we have 64 bytes to read, then we'll correctly read the first 32 bytes and then wait for another poll trigger before we attempt the next read. This next poll trigger may never happen, in which case we just sit forever and never make progress, or it may trigger at some point in the future, and now we're just delivering the available data much later than we should have. io_read_mshot() could do retries itself, but that is wasteful as we'll be going through all of __io_read() again, and most likely in vain. Rather than do that, bump our poll reference count and have io_poll_check_events() do one more loop and check with vfs_poll() if we have more data to read. If we do, io_read_mshot() will get invoked again directly and we'll read the next chunk. io_poll_multishot_retry() must only get called from inside io_poll_issue(), which is our multishot retry handler, as we know we already "own" the request at this point. Cc: stable@vger.kernel.org Link: https://github.com/axboe/liburing/issues/1041 Fixes: fc68fcda0491 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-23io_uring: enable audit and restrict cred override for IORING_OP_FIXED_FD_INSTALLPaul Moore2-1/+4
We need to correct some aspects of the IORING_OP_FIXED_FD_INSTALL command to take into account the security implications of making an io_uring-private file descriptor generally accessible to a userspace task. The first change in this patch is to enable auditing of the FD_INSTALL operation as installing a file descriptor into a task's file descriptor table is a security relevant operation and something that admins/users may want to audit. The second change is to disable the io_uring credential override functionality, also known as io_uring "personalities", in the FD_INSTALL command. The credential override in FD_INSTALL is particularly problematic as it affects the credentials used in the security_file_receive() LSM hook. If a task were to request a credential override via REQ_F_CREDS on a FD_INSTALL operation, the LSM would incorrectly check to see if the overridden credentials of the io_uring were able to "receive" the file as opposed to the task's credentials. After discussions upstream, it's difficult to imagine a use case where we would want to allow a credential override on a FD_INSTALL operation so we are simply going to block REQ_F_CREDS on IORING_OP_FIXED_FD_INSTALL operations. Fixes: dc18b89ab113 ("io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALL") Signed-off-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/r/20240123215501.289566-2-paul@paul-moore.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
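A hedged userspace sketch of issuing FIXED_FD_INSTALL is below, assuming liburing's io_uring_prep_fixed_fd_install() helper (helper name/signature assumed from recent liburing); per the commit above, requests using a credential override (REQ_F_CREDS) are now rejected for this opcode.

```c
#include <liburing.h>

/* Install a registered (direct) descriptor into the normal fd table;
 * the resulting regular fd is returned in the CQE result. */
static int install_fixed_fd(struct io_uring *ring, int fixed_fd_index)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -EBUSY;
	io_uring_prep_fixed_fd_install(sqe, fixed_fd_index, 0);
	return io_uring_submit(ring);
}
```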
2024-01-18Merge tag 'for-6.8/io_uring-2024-01-18' of git://git.kernel.dk/linuxLinus Torvalds4-47/+86
Pull io_uring fixes from Jens Axboe: "Nothing major in here, just a few fixes and cleanups that arrived after the initial merge window pull request got finalized, as well as a fix for a patch that got merged earlier" * tag 'for-6.8/io_uring-2024-01-18' of git://git.kernel.dk/linux: io_uring: combine cq_wait_nr checks io_uring: clean *local_work_add var naming io_uring: clean up local tw add-wait sync io_uring: adjust defer tw counting io_uring/register: guard compat syscall with CONFIG_COMPAT io_uring/rsrc: improve code generation for fixed file assignment io_uring/rw: cleanup io_rw_done()
2024-01-17Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds1-1/+2
Pull kvm updates from Paolo Bonzini: "Generic: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures. - Clean up Kconfigs that all KVM architectures were selecting - New functionality around "guest_memfd", a new userspace API that creates an anonymous file and returns a file descriptor that refers to it. guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to switch a memory area between guest_memfd and regular anonymous memory. - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify per-page attributes for a given page of guest memory; right now the only attribute is whether the guest expects to access memory via guest_memfd or not, which in Confidential SVMs backed by SEV-SNP, TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM). x86: - Support for "software-protected VMs" that can use the new guest_memfd and page attributes infrastructure. This is mostly useful for testing, since there is no pKVM-like infrastructure to provide a meaningfully reduced TCB. - Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Use more generic lockdep assertions in paths that don't actually care about whether the caller is a reader or a writer. - let Xen guests opt out of having PV clock reported as "based on a stable TSC", because some of them don't expect the "TSC stable" bit (added to the pvclock ABI by KVM, but never set by Xen) to be set. - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - On AMD machines with vNMI, always rely on hardware instead of intercepting IRET in some cases to detect unmasking of NMIs - Support for virtualizing Linear Address Masking (LAM) - Fix a variety of vPMU bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow. - Turn off KVM_WERROR by default for all configs so that it's not inadvertantly enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. 
- Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds. - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code. - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time. ARM64: - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base granule sizes. Branch shared with the arm64 tree. - Large Fine-Grained Trap rework, bringing some sanity to the feature, although there is more to come. This comes with a prefix branch shared with the arm64 tree. - Some additional Nested Virtualization groundwork, mostly introducing the NV2 VNCR support and retargetting the NV support to that version of the architecture. - A small set of vgic fixes and associated cleanups. Loongarch: - Optimization for memslot hugepage checking - Cleanup and fix some HW/SW timer issues - Add LSX/LASX (128bit/256bit SIMD) support RISC-V: - KVM_GET_REG_LIST improvement for vector registers - Generate ISA extension reg_list using macros in get-reg-list selftest - Support for reporting steal time along with selftest s390: - Bugfixes Selftests: - Fix an annoying goof where the NX hugepage test prints out garbage instead of the magic token needed to run the test. - Fix build errors when a header is delete/moved due to a missing flag in the Makefile. - Detect if KVM bugged/killed a selftest's VM and print out a helpful message instead of complaining that a random ioctl() failed. - Annotate the guest printf/assert helpers with __printf(), and fix the various bugs that were lurking due to lack of said annotation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits) x86/kvm: Do not try to disable kvmclock if it was not enabled KVM: x86: add missing "depends on KVM" KVM: fix direction of dependency on MMU notifiers KVM: introduce CONFIG_KVM_COMMON KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache RISC-V: KVM: selftests: Add get-reg-list test for STA registers RISC-V: KVM: selftests: Add steal_time test support RISC-V: KVM: selftests: Add guest_sbi_probe_extension RISC-V: KVM: selftests: Move sbi_ecall to processor.c RISC-V: KVM: Implement SBI STA extension RISC-V: KVM: Add support for SBI STA registers RISC-V: KVM: Add support for SBI extension registers RISC-V: KVM: Add SBI STA info to vcpu_arch RISC-V: KVM: Add steal-update vcpu request RISC-V: KVM: Add SBI STA extension skeleton RISC-V: paravirt: Implement steal-time support RISC-V: Add SBI STA extension definitions RISC-V: paravirt: Add skeleton for pv-time support RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr() ...
2024-01-17io_uring: combine cq_wait_nr checksPavel Begunkov1-7/+27
Instead of explicitly checking ->cq_wait_nr for whether there are waiters, which is currently represented by 0, we can store there a large value and the nr_tw comparison will automatically filter out those cases. Add a named constant for that and for the wakeup bias value. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/38def30282654d980673976cd42fde9bab19b297.1705438669.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
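A minimal sketch of the scheme this describes; the constant names, the atomic type of ->cq_wait_nr and the wake-up call are illustrative assumptions, not necessarily the code that was merged:

    /* "no waiters" sentinel and a forced-wake bias, both larger than any
     * valid wait count, so a single nr_tw comparison filters them out */
    #define IO_CQ_WAKE_INIT     (-1U)
    #define IO_CQ_WAKE_FORCE    (IO_CQ_WAKE_INIT >> 1)

    nr_wait = atomic_read(&ctx->cq_wait_nr);   /* IO_CQ_WAKE_INIT if nobody waits */
    if (nr_tw >= nr_wait)                      /* never true for the sentinel     */
        wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);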
2024-01-17io_uring: clean *local_work_add var namingPavel Begunkov1-7/+7
if (!first) { ... } While it reads as "do something if it's not the first entry", it does exactly the opposite, because "first" here is a pointer to the first entry. Remove the confusion by renaming it to "head". Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3b8be483b52f58a524185bb88694b8a268e7e85d.1705438669.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-17io_uring: clean up local tw add-wait syncPavel Begunkov1-2/+8
Kill a smp_mb__after_atomic() right before wake_up; it's useless. Also add a comment explaining the implicit barriers from cmpxchg and the synchronisation around ->cq_wait_nr with the waiter. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3007f3c2d53c72b61de56919ef56b53158b8276f.1705438669.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-17io_uring: adjust defer tw countingPavel Begunkov1-1/+1
The UINT_MAX work item counting bias in io_req_local_work_add() in case of !IOU_F_TWQ_LAZY_WAKE works in the sense that we will not miss a wakeup, however it's still eerie. In particular, if we add a lazy work item after a non-lazy one, we'll increment it and get nr_tw==0, and subsequent adds may try to unnecessarily wake up the task, though that is not so likely to happen in real workloads. Halve the bias; it's still larger than any valid ->cq_wait_nr, which is limited by IORING_MAX_CQ_ENTRIES, while leaving enough headroom before it overflows. Fixes: 8751d15426a31 ("io_uring: reduce scheduling due to tw") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/108b971e958deaf7048342930c341ba90f75d806.1705438669.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-17io_uring/register: guard compat syscall with CONFIG_COMPATJens Axboe1-3/+5
Add compat.h include to avoid a potential build issue: io_uring/register.c:281:6: error: call to undeclared function 'in_compat_syscall'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration] if (in_compat_syscall()) { ^ 1 warning generated. io_uring/register.c:282:9: error: call to undeclared function 'compat_get_bitmap'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration] ret = compat_get_bitmap(cpumask_bits(new_mask), ^ Fixes: c43203154d8a ("io_uring/register: move io_uring_register(2) related code to register.c") Reported-by: Manu Bretelle <chantra@meta.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
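For illustration, the kind of change described — the exact hunk is an assumption, but the shape follows the compiler output quoted above (umask and ubits are placeholder variable names):

    #include <linux/compat.h>   /* declares in_compat_syscall() and compat_get_bitmap() */

        if (in_compat_syscall())
            ret = compat_get_bitmap(cpumask_bits(new_mask), umask, ubits);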
2024-01-11Merge tag 'for-6.8/io_uring-2024-01-08' of git://git.kernel.dk/linuxLinus Torvalds15-850/+752
Pull io_uring updates from Jens Axboe: "Mostly just some fixes and cleanups, but one feature as well. In detail: - Harden the check for handling IOPOLL based on return (Pavel) - Various minor optimizations (Pavel) - Drop remnants of SCM_RIGHTS fd passing support, now that it's no longer supported since 6.7 (me) - Fix for a case where bytes_done wasn't initialized properly on a failure condition for read/write requests (me) - Move the register related code to a separate file (me) - Add support for returning the provided ring buffer head (me) - Add support for adding a direct descriptor to the normal file table (me, Christian Brauner) - Fix for ensuring pending task_work for a ring with DEFER_TASKRUN is run even if we time out waiting (me)" * tag 'for-6.8/io_uring-2024-01-08' of git://git.kernel.dk/linux: io_uring: ensure local task_work is run on wait timeout io_uring/kbuf: add method for returning provided buffer ring head io_uring/rw: ensure io->bytes_done is always initialized io_uring: drop any code related to SCM_RIGHTS io_uring/unix: drop usage of io_uring socket io_uring/register: move io_uring_register(2) related code to register.c io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALL io_uring/cmd: inline io_uring_cmd_get_task io_uring/cmd: inline io_uring_cmd_do_in_task_lazy io_uring: split out cmd api into a separate header io_uring: optimise ltimeout for inline execution io_uring: don't check iopoll if request completes
2024-01-11Merge tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linuxLinus Torvalds1-1/+0
Pull block updates from Jens Axboe: "Pretty quiet round this time around. This contains: - NVMe updates via Keith: - nvme fabrics spec updates (Guixin, Max) - nvme target udpates (Guixin, Evan) - nvme attribute refactoring (Daniel) - nvme-fc numa fix (Keith) - MD updates via Song: - Fix/Cleanup RCU usage from conf->disks[i].rdev (Yu Kuai) - Fix raid5 hang issue (Junxiao Bi) - Add Yu Kuai as Reviewer of the md subsystem - Remove deprecated flavors (Song Liu) - raid1 read error check support (Li Nan) - Better handle events off-by-1 case (Alex Lyakas) - Efficiency improvements for passthrough (Kundan) - Support for mapping integrity data directly (Keith) - Zoned write fix (Damien) - rnbd fixes (Kees, Santosh, Supriti) - Default to a sane discard size granularity (Christoph) - Make the default max transfer size naming less confusing (Christoph) - Remove support for deprecated host aware zoned model (Christoph) - Misc fixes (me, Li, Matthew, Min, Ming, Randy, liyouhong, Daniel, Bart, Christoph)" * tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux: (78 commits) block: Treat sequential write preferred zone type as invalid block: remove disk_clear_zoned sd: remove the !ZBC && blk_queue_is_zoned case in sd_read_block_characteristics drivers/block/xen-blkback/common.h: Fix spelling typo in comment blk-cgroup: fix rcu lockdep warning in blkg_lookup() blk-cgroup: don't use removal safe list iterators block: floor the discard granularity to the physical block size mtd_blkdevs: use the default discard granularity bcache: use the default discard granularity zram: use the default discard granularity null_blk: use the default discard granularity nbd: use the default discard granularity ubd: use the default discard granularity block: default the discard granularity to sector size bcache: discard_granularity should not be smaller than a sector block: remove two comments in bio_split_discard block: rename and document BLK_DEF_MAX_SECTORS loop: don't abuse BLK_DEF_MAX_SECTORS aoe: don't abuse BLK_DEF_MAX_SECTORS null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORS ...
2024-01-11io_uring/rsrc: improve code generation for fixed file assignmentJens Axboe2-7/+12
For the normal read/write path, we have already locked the ring submission side when assigning the file. This causes branch mispredictions when we then check and try and lock again in io_req_set_rsrc_node(). As this is a very hot path, this matters. Add a basic helper that already assumes we already have it locked, and use that in io_file_get_fixed(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
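A hedged sketch of such a helper; the function and field names are illustrative, not necessarily the ones added by the patch:

    static inline void io_req_assign_rsrc_locked(struct io_kiocb *req,
                                                 struct io_ring_ctx *ctx)
    {
        /* submission path: uring_lock is already held, so no trylock branch */
        lockdep_assert_held(&ctx->uring_lock);
        req->rsrc_node = ctx->rsrc_node;
        ctx->rsrc_node->refs++;    /* reference count protected by uring_lock */
    }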
2024-01-10io_uring/rw: cleanup io_rw_done()Jens Axboe1-21/+27
This originally came from the aio side, and it's laid out rather oddly. The common case here is that we either get -EIOCBQUEUED from submitting an async request, or that we complete the request correctly with the given number of bytes. Handling the odd internal restart error codes is not a common operation. Lay it out a bit more optimally so that it better explains the normal flow, and switch to avoiding the indirect call completely, as this is our kiocb and we know the completion handler can only be one of two possible variants. While at it, move it to where it belongs in the file, with fellow end IO helpers. Outside of being easier to read, this also reduces the text size of the function by 24 bytes for me on arm64. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-01-09Merge tag 'mm-stable-2024-01-08-15-31' of ↵Linus Torvalds1-3/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Peng Zhang has done some mapletree maintainance work in the series 'maple_tree: add mt_free_one() and mt_attr() helpers' 'Some cleanups of maple tree' - In the series 'mm: use memmap_on_memory semantics for dax/kmem' Vishal Verma has altered the interworking between memory-hotplug and dax/kmem so that newly added 'device memory' can more easily have its memmap placed within that newly added memory. - Matthew Wilcox continues folio-related work (including a few fixes) in the patch series 'Add folio_zero_tail() and folio_fill_tail()' 'Make folio_start_writeback return void' 'Fix fault handler's handling of poisoned tail pages' 'Convert aops->error_remove_page to ->error_remove_folio' 'Finish two folio conversions' 'More swap folio conversions' - Kefeng Wang has also contributed folio-related work in the series 'mm: cleanup and use more folio in page fault' - Jim Cromie has improved the kmemleak reporting output in the series 'tweak kmemleak report format'. - In the series 'stackdepot: allow evicting stack traces' Andrey Konovalov to permits clients (in this case KASAN) to cause eviction of no longer needed stack traces. - Charan Teja Kalla has fixed some accounting issues in the page allocator's atomic reserve calculations in the series 'mm: page_alloc: fixes for high atomic reserve caluculations'. - Dmitry Rokosov has added to the samples/ dorectory some sample code for a userspace memcg event listener application. See the series 'samples: introduce cgroup events listeners'. - Some mapletree maintanance work from Liam Howlett in the series 'maple_tree: iterator state changes'. - Nhat Pham has improved zswap's approach to writeback in the series 'workload-specific and memory pressure-driven zswap writeback'. - DAMON/DAMOS feature and maintenance work from SeongJae Park in the series 'mm/damon: let users feed and tame/auto-tune DAMOS' 'selftests/damon: add Python-written DAMON functionality tests' 'mm/damon: misc updates for 6.8' - Yosry Ahmed has improved memcg's stats flushing in the series 'mm: memcg: subtree stats flushing and thresholds'. - In the series 'Multi-size THP for anonymous memory' Ryan Roberts has added a runtime opt-in feature to transparent hugepages which improves performance by allocating larger chunks of memory during anonymous page faults. - Matthew Wilcox has also contributed some cleanup and maintenance work against eh buffer_head code int he series 'More buffer_head cleanups'. - Suren Baghdasaryan has done work on Andrea Arcangeli's series 'userfaultfd move option'. UFFDIO_MOVE permits userspace heap compaction algorithms to move userspace's pages around rather than UFFDIO_COPY'a alloc/copy/free. - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm: Add ksm advisor'. This is a governor which tunes KSM's scanning aggressiveness in response to userspace's current needs. - Chengming Zhou has optimized zswap's temporary working memory use in the series 'mm/zswap: dstmem reuse optimizations and cleanups'. - Matthew Wilcox has performed some maintenance work on the writeback code, both code and within filesystems. The series is 'Clean up the writeback paths'. - Andrey Konovalov has optimized KASAN's handling of alloc and free stack traces for secondary-level allocators, in the series 'kasan: save mempool stack traces'. 
- Andrey also performed some KASAN maintenance work in the series 'kasan: assorted clean-ups'. - David Hildenbrand has gone to town on the rmap code. Cleanups, more pte batching, folio conversions and more. See the series 'mm/rmap: interface overhaul'. - Kinsey Ho has contributed some maintenance work on the MGLRU code in the series 'mm/mglru: Kconfig cleanup'. - Matthew Wilcox has contributed lruvec page accounting code cleanups in the series 'Remove some lruvec page accounting functions'" * tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits) mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER mm, treewide: introduce NR_PAGE_ORDERS selftests/mm: add separate UFFDIO_MOVE test for PMD splitting selftests/mm: skip test if application doesn't has root privileges selftests/mm: conform test to TAP format output selftests: mm: hugepage-mmap: conform to TAP format output selftests/mm: gup_test: conform test to TAP format output mm/selftests: hugepage-mremap: conform test to TAP format output mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large mm/memcontrol: remove __mod_lruvec_page_state() mm/khugepaged: use a folio more in collapse_file() slub: use a folio in __kmalloc_large_node slub: use folio APIs in free_large_kmalloc() slub: use alloc_pages_node() in alloc_slab_page() mm: remove inc/dec lruvec page state functions mm: ratelimit stat flush from workingset shrinker kasan: stop leaking stack trace handles mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE mm/mglru: add dummy pmd_dirty() ...
2024-01-08Merge tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds1-2/+2
Pull vfs rw updates from Christian Brauner: "This contains updates from Amir for read-write backing file helpers for stacking filesystems such as overlayfs: - Fanotify is currently in the process of introducing pre content events. Roughly, a new permission event will be added indicating that it is safe to write to the file being accessed. These events are used by hierarchical storage managers to e.g., fill the content of files on first access. During that work we noticed that our current permission checking is inconsistent in rw_verify_area() and remap_verify_area(). Especially in the splice code permission checking is done multiple times. For example, one time for the whole range and then again for partial ranges inside the iterator. In addition, we mostly do permission checking before we call file_start_write() except for a few places where we call it after. For pre-content events we need such permission checking to be done before file_start_write(). So this is a nice reason to clean this all up. After this series, all permission checking is done before file_start_write(). As part of this cleanup we also massaged the splice code a bit. We got rid of a few helpers because we are alredy drowning in special read-write helpers. We also cleaned up the return types for splice helpers. - Introduce generic read-write helpers for backing files. This lifts some overlayfs code to common code so it can be used by the FUSE passthrough work coming in over the next cycles. Make Amir and Miklos the maintainers for this new subsystem of the vfs" * tag 'vfs-6.8.rw' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (30 commits) fs: fix __sb_write_started() kerneldoc formatting fs: factor out backing_file_mmap() helper fs: factor out backing_file_splice_{read,write}() helpers fs: factor out backing_file_{read,write}_iter() helpers fs: prepare for stackable filesystems backing file helpers fsnotify: optionally pass access range in file permission hooks fsnotify: assert that file_start_write() is not held in permission hooks fsnotify: split fsnotify_perm() into two hooks fs: use splice_copy_file_range() inline helper splice: return type ssize_t from all helpers fs: use do_splice_direct() for nfsd/ksmbd server-side-copy fs: move file_start_write() into direct_splice_actor() fs: fork splice_file_range() from do_splice_direct() fs: create {sb,file}_write_not_started() helpers fs: create file_write_started() helper fs: create __sb_write_started() helper fs: move kiocb_start_write() into vfs_iocb_iter_write() fs: move permission hook out of do_iter_read() fs: move permission hook out of do_iter_write() fs: move file_start_write() into vfs_iter_write() ...
2024-01-08Merge tag 'vfs-6.8.misc' of ↵Linus Torvalds2-3/+3
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Add Jan Kara as VFS reviewer - Show correct device and inode numbers in proc/<pid>/maps for vma files on stacked filesystems. This is now easily doable thanks to the backing file work from the last cycles. This comes with selftests Cleanups: - Remove a redundant might_sleep() from wait_on_inode() - Initialize pointer with NULL, not 0 - Clarify comment on access_override_creds() - Rework and simplify eventfd_signal() and eventfd_signal_mask() helpers - Process aio completions in batches to avoid needless wakeups - Completely decouple struct mnt_idmap from namespaces. We now only keep the actual idmapping around and don't stash references to namespaces - Reformat maintainer entries to indicate that a given subsystem belongs to fs/ - Simplify fput() for files that were never opened - Get rid of various pointless file helpers - Rename various file helpers - Rename struct file members after SLAB_TYPESAFE_BY_RCU switch from last cycle - Make relatime_need_update() return bool - Use GFP_KERNEL instead of GFP_USER when allocating superblocks - Replace deprecated ida_simple_*() calls with their current ida_*() counterparts Fixes: - Fix comments on user namespace id mapping helpers. They aren't kernel doc comments so they shouldn't be using /** - s/Retuns/Returns/g in various places - Add missing parameter documentation on can_move_mount_beneath() - Rename i_mapping->private_data to i_mapping->i_private_data - Fix a false-positive lockdep warning in pipe_write() for watch queues - Improve __fget_files_rcu() code generation to improve performance - Only notify writer that pipe resizing has finished after setting pipe->max_usage otherwise writers are never notified that the pipe has been resized and hang - Fix some kernel docs in hfsplus - s/passs/pass/g in various places - Fix kernel docs in ntfs - Fix kcalloc() arguments order reported by gcc 14 - Fix uninitialized value in reiserfs" * tag 'vfs-6.8.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (36 commits) reiserfs: fix uninit-value in comp_keys watch_queue: fix kcalloc() arguments order ntfs: dir.c: fix kernel-doc function parameter warnings fs: fix doc comment typo fs tree wide selftests/overlayfs: verify device and inode numbers in /proc/pid/maps fs/proc: show correct device and inode numbers in /proc/pid/maps eventfd: Remove usage of the deprecated ida_simple_xx() API fs: super: use GFP_KERNEL instead of GFP_USER for super block allocation fs/hfsplus: wrapper.c: fix kernel-doc warnings fs: add Jan Kara as reviewer fs/inode: Make relatime_need_update return bool pipe: wakeup wr_wait after setting max_usage file: remove __receive_fd() file: stop exposing receive_fd_user() fs: replace f_rcuhead with f_task_work file: remove pointless wrapper file: s/close_fd_get_file()/file_close_fd()/g Improve __fget_files_rcu() code generation (and thus __fget_light()) file: massage cleanup of files that failed to open fs/pipe: Fix lockdep false-positive in watchqueue pipe_write() ...
2024-01-04io_uring: ensure local task_work is run on wait timeoutJens Axboe1-2/+12
A previous commit added an earlier break condition here, which is fine if we're using non-local task_work as it'll be run on return to userspace. However, if DEFER_TASKRUN is used, then we could be leaving local task_work that is ready to process in the ctx list until next time that we enter the kernel to wait for events. Move the break condition to _after_ we have run task_work. Cc: stable@vger.kernel.org Fixes: 846072f16eed ("io_uring: mimimise io_cqring_wait_schedule") Signed-off-by: Jens Axboe <axboe@kernel.dk>
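The resulting ordering, sketched with placeholder helper names (not the literal io_cqring_wait() code):

    do {
        run_ready_local_task_work(ctx);        /* drain DEFER_TASKRUN work first  */
        if (stop_waiting(ret, iowq))           /* timeout/signal/enough CQEs,     */
            break;                             /* checked only after the drain    */
        ret = sleep_for_completions(ctx, iowq);
    } while (1);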
2024-01-02Merge tag 'loongarch-kvm-6.8' of ↵Paolo Bonzini12-98/+256
git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.8 1. Optimization for memslot hugepage checking. 2. Cleanup and fix some HW/SW timer issues. 3. Add LSX/LASX (128bit/256bit SIMD) support.
2023-12-29io_uring: use mempool KASAN hookAndrey Konovalov1-1/+1
Use the proper kasan_mempool_unpoison_object hook for unpoisoning cached objects. A future change might also update io_uring to check the return value of kasan_mempool_poison_object to prevent double-free and invalid-free bugs. This proves to be non-trivial with the current way io_uring caches objects, so this is left out-of-scope of this series. Link: https://lkml.kernel.org/r/eca18d6cbf676ed784f1a1f209c386808a8087c5.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-29kasan: rename kasan_slab_free_mempool to kasan_mempool_poison_objectAndrey Konovalov1-2/+1
Patch series "kasan: save mempool stack traces". This series updates KASAN to save alloc and free stack traces for secondary-level allocators that cache and reuse allocations internally instead of giving them back to the underlying allocator (e.g. mempool). As a part of this change, introduce and document a set of KASAN hooks: bool kasan_mempool_poison_pages(struct page *page, unsigned int order); void kasan_mempool_unpoison_pages(struct page *page, unsigned int order); bool kasan_mempool_poison_object(void *ptr); void kasan_mempool_unpoison_object(void *ptr, size_t size); and use them in the mempool code. Besides mempool, skbuff and io_uring also cache allocations and already use KASAN hooks to poison those. Their code is updated to use the new mempool hooks. The new hooks save alloc and free stack traces (for normal kmalloc and slab objects; stack traces for large kmalloc objects and page_alloc are not supported by KASAN yet), improve the readability of the users' code, and also allow the users to prevent double-free and invalid-free bugs; see the patches for the details. This patch (of 21): Rename kasan_slab_free_mempool to kasan_mempool_poison_object. kasan_slab_free_mempool is a slightly confusing name: it is unclear whether this function poisons the object when it is freed into mempool or does something when the object is freed from mempool to the underlying allocator. The new name also aligns with other mempool-related KASAN hooks added in the following patches in this series. Link: https://lkml.kernel.org/r/cover.1703024586.git.andreyknvl@google.com Link: https://lkml.kernel.org/r/c5618685abb7cdbf9fb4897f565e7759f601da84.1703024586.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Lobakin <alobakin@pm.me> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Breno Leitao <leitao@debian.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-12-21io_uring/kbuf: add method for returning provided buffer ring headJens Axboe3-0/+33
The tail of the provided ring buffer is shared between the kernel and the application, but the head is private to the kernel as the application doesn't need to see it. However, this also prevents the application from knowing how many buffers the kernel has consumed. Usually this is fine, as the information is inherently racy in that the kernel could be consuming buffers continually, but for cleanup purposes it may be relevant to know how many buffers are still left in the ring. Add IORING_REGISTER_PBUF_STATUS which will return status for a given provided buffer ring. Right now it just returns the head, but space is reserved for more information later in, if needed. Link: https://github.com/axboe/liburing/discussions/1020 Signed-off-by: Jens Axboe <axboe@kernel.dk>
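A rough userspace sketch of querying the head with the raw register syscall; it assumes a 6.8-era <linux/io_uring.h> that defines IORING_REGISTER_PBUF_STATUS and a struct io_uring_buf_status with buf_group/head fields, as described above:

    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/io_uring.h>

    /* Fetch the kernel's consumer head for buffer group 'bgid'; 0 on success. */
    static int pbuf_ring_head(int ring_fd, unsigned int bgid, unsigned int *head)
    {
        struct io_uring_buf_status bs;

        memset(&bs, 0, sizeof(bs));
        bs.buf_group = bgid;
        if (syscall(__NR_io_uring_register, ring_fd,
                    IORING_REGISTER_PBUF_STATUS, &bs, 1) < 0)
            return -1;
        *head = bs.head;    /* how far the kernel has consumed the ring */
        return 0;
    }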
2023-12-21io_uring/rw: ensure io->bytes_done is always initializedJens Axboe1-3/+7
If IOSQE_ASYNC is set and we fail importing an iovec for a readv or writev request, then we leave ->bytes_done uninitialized and hence the eventual failure CQE posted can potentially have a random res value rather than the expected -EINVAL. Setup ->bytes_done before potentially failing, so we have a consistent value if we fail the request early. Cc: stable@vger.kernel.org Reported-by: xingwei lee <xrivendell7@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-21Merge branch 'vfs.file'Christian Brauner1-1/+1
Bring in the changes to the file infrastructure for this cycle. Mostly cleanups and some performance tweaks. * file: remove __receive_fd() * file: stop exposing receive_fd_user() * fs: replace f_rcuhead with f_task_work * file: remove pointless wrapper * file: s/close_fd_get_file()/file_close_fd()/g * Improve __fget_files_rcu() code generation (and thus __fget_light()) * file: massage cleanup of files that failed to open Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-19io_uring: drop any code related to SCM_RIGHTSJens Axboe4-217/+10
This is dead code after we dropped support for passing io_uring fds over SCM_RIGHTS, get rid of it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19io_uring/unix: drop usage of io_uring socketJens Axboe2-14/+0
Since we no longer allow sending io_uring fds over SCM_RIGHTS, move to using io_is_uring_fops() to detect whether this is an io_uring fd or not. With that done, kill off io_uring_get_socket() as nobody calls it anymore. This is in preparation for yanking out the rest of the unix gc handling related to io_uring. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19io_uring/register: move io_uring_register(2) related code to register.cJens Axboe5-580/+619
Most of this code is basically self contained, move it out of the core io_uring file to bring a bit more separation to the registration related bits. This moves another ~10% of the code into register.c. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-14io_uring/cmd: fix breakage in SOCKET_URING_OP_SIOC* implementationAl Viro1-1/+1
In 8e9fad0e70b7 "io_uring: Add io_uring command support for sockets" you've got an include of asm-generic/ioctls.h done in io_uring/uring_cmd.c. That had been done for the sake of this chunk - + ret = prot->ioctl(sk, SIOCINQ, &arg); + if (ret) + return ret; + return arg; + case SOCKET_URING_OP_SIOCOUTQ: + ret = prot->ioctl(sk, SIOCOUTQ, &arg); SIOC{IN,OUT}Q are defined to symbols (FIONREAD and TIOCOUTQ) that come from ioctls.h, all right, but the values vary by the architecture. FIONREAD is 0x467F on mips 0x4004667F on alpha, powerpc and sparc 0x8004667F on sh and xtensa 0x541B everywhere else TIOCOUTQ is 0x7472 on mips 0x40047473 on alpha, powerpc and sparc 0x80047473 on sh and xtensa 0x5411 everywhere else ->ioctl() expects the same values it would've gotten from userland; all places where we compare with SIOC{IN,OUT}Q are using asm/ioctls.h, so they pick the correct values. io_uring_cmd_sock(), OTOH, ends up passing the default ones. Fixes: 8e9fad0e70b7 ("io_uring: Add io_uring command support for sockets") Cc: <stable@vger.kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20231214213408.GT1674809@ZenIV Signed-off-by: Jens Axboe <axboe@kernel.dk>
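Given that analysis, the one-line stat (1 file changed, one line replaced) suggests the fix amounts to switching the include so the per-architecture values are used; a sketch in the same diff style:

    -#include <asm-generic/ioctls.h>   /* generic FIONREAD/TIOCOUTQ values */
    +#include <asm/ioctls.h>           /* the values ->ioctl() actually compares against */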
2023-12-13io_uring/poll: don't enable lazy wake for POLLEXCLUSIVEJens Axboe1-3/+17
There are a few quirks around using lazy wake for poll unconditionally, and one of them is related to EPOLLEXCLUSIVE. Those may trigger exclusive wakeups, which wake a limited number of entries in the wait queue. If that wake number is less than the number of entries someone is waiting for (and that someone is also using DEFER_TASKRUN), then we can get stuck waiting for more entries while we should be processing the ones we already got. If we're doing exclusive poll waits, flag the request as not being compatible with lazy wakeups. Reported-by: Pavel Begunkov <asml.silence@gmail.com> Fixes: 6ce4a93dbb5b ("io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-12splice: return type ssize_t from all helpersAmir Goldstein1-2/+2
Not sure why some splice helpers return long, maybe historic reasons. Change them all to return ssize_t to conform to the splice methods and to the rest of the helpers. Suggested-by: Christian Brauner <brauner@kernel.org> Link: https://lore.kernel.org/r/20231208-horchen-helium-d3ec1535ede5@brauner/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Link: https://lore.kernel.org/r/20231212094440.250945-2-amir73il@gmail.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-12io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALLJens Axboe3-0/+56
io_uring can currently open/close regular files or fixed/direct descriptors. Or you can instantiate a fixed descriptor from a regular one, and then close the regular descriptor. But you currently can't turn a purely fixed/direct descriptor into a regular file descriptor. IORING_OP_FIXED_FD_INSTALL adds support for installing a direct descriptor into the normal file table, just like receiving a file descriptor or opening a new file would do. This is all nicely abstracted into receive_fd(), and hence adding support for this is truly trivial. Since direct descriptors are only usable within io_uring itself, it can be useful to turn them into real file descriptors if they ever need to be accessed via normal syscalls. This can either be a transitory thing, or just a permanent transition for a given direct descriptor. By default, new fds are installed with O_CLOEXEC set. The application can disable O_CLOEXEC by setting IORING_FIXED_FD_NO_CLOEXEC in the sqe->install_fd_flags member. Suggested-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
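A hedged userspace sketch using raw SQE fields; IORING_OP_FIXED_FD_INSTALL, IOSQE_FIXED_FILE, install_fd_flags and IORING_FIXED_FD_NO_CLOEXEC come from the description above, while the surrounding liburing boilerplate is assumed:

    #include <string.h>
    #include <liburing.h>

    /* Install direct-descriptor slot 'slot' into the normal file table;
     * on success the new regular fd comes back in cqe->res. */
    static int install_fixed_fd(struct io_uring *ring, int slot)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        int ret;

        memset(sqe, 0, sizeof(*sqe));
        sqe->opcode = IORING_OP_FIXED_FD_INSTALL;
        sqe->fd = slot;                    /* index into the fixed file table */
        sqe->flags = IOSQE_FIXED_FILE;     /* 'fd' refers to a direct descriptor */
        sqe->install_fd_flags = 0;         /* or IORING_FIXED_FD_NO_CLOEXEC */

        io_uring_submit(ring);
        ret = io_uring_wait_cqe(ring, &cqe);
        if (ret < 0)
            return ret;
        ret = cqe->res;                    /* the newly installed normal fd */
        io_uring_cqe_seen(ring, cqe);
        return ret;
    }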
2023-12-12io_uring/cmd: inline io_uring_cmd_get_taskPavel Begunkov1-6/+0
With io_uring_types.h we see all required definitions to inline io_uring_cmd_get_task(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/aa8e317f09e651a5f3e72f8c0ad3902084c1f930.1701391955.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-12io_uring/cmd: inline io_uring_cmd_do_in_task_lazyPavel Begunkov2-17/+0
Now as we can easily include io_uring_types.h, move IOU_F_TWQ_LAZY_WAKE and inline io_uring_cmd_do_in_task_lazy(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/2ec9fb31dd192d1c5cf26d0a2dec5657d88a8e48.1701391955.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-12io_uring: split out cmd api into a separate headerPavel Begunkov3-2/+3
linux/io_uring.h is slowly becoming a rubbish bin where we put anything exposed to other subsystems. For instance, the task exit hooks and io_uring cmd infra are completely orthogonal and don't need each other's definitions. Start cleaning it up by splitting out all command bits into a new header file. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7ec50bae6e21f371d3850796e716917fc141225a.1701391955.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-12io_uring: optimise ltimeout for inline executionPavel Begunkov1-10/+9
At one point in time we had an optimisation that would not spin up a linked timeout timer when the master request successfully completes inline (during the first nowait execution attempt). We somehow lost it, so this patch restores it. Note that it's fine to use io_arm_ltimeout() after io_issue_sqe() completes the request, because of delayed completion, but that adds unwanted overhead. Reported-by: Christian Mazakas <christian.mazakas@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8bf69c2a4beec14c565c85c86edb871ca8b8bcc8.1701390926.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-12io_uring: don't check iopoll if request completesPavel Begunkov1-1/+5
An IOPOLL request should never return IOU_OK, so the iopoll queueing check that follows in io_issue_sqe() after getting IOU_OK doesn't make any sense, as it would never turn true. Let's optimise on that and return a bit earlier. It's also much more resilient to potential bugs from misbehaving iopoll implementations. Cc: <stable@vger.kernel.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/2f8690e2fa5213a2ff292fac29a7143c036cdd60.1701390926.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-12Merge branch 'vfs.file' of ↵Jens Axboe1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs into for-6.8/io_uring Merge vfs.file from the VFS tree to avoid conflicts with receive_fd() now having 3 arguments rather than just 2. * 'vfs.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: file: remove __receive_fd() file: stop exposing receive_fd_user() fs: replace f_rcuhead with f_task_work file: remove pointless wrapper file: s/close_fd_get_file()/file_close_fd()/g Improve __fget_files_rcu() code generation (and thus __fget_light()) file: massage cleanup of files that failed to open
2023-12-12file: remove pointless wrapperChristian Brauner1-1/+1
Only io_uring uses __close_fd_get_file(). All it does is hide current->files but io_uring accesses files_struct directly right now anyway so it's a bit pointless. Just rename pick_file() to file_close_fd_locked() and let io_uring use it. Add a lockdep assert in there that we expect the caller to hold file_lock while we're at it. Link: https://lore.kernel.org/r/20231130-vfs-files-fixes-v1-2-e73ca6f4ea83@kernel.org Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-12-07io_uring/af_unix: disable sending io_uring over socketsPavel Begunkov1-7/+0
File reference cycles have caused lots of problems for io_uring in the past, and it still doesn't work exactly right and races with unix_stream_read_generic(). The safest fix would be to completely disallow sending io_uring files over sockets via SCM_RIGHTS, so there are no possible cycles involving registered files, thus rendering SCM accounting on the io_uring side unnecessary. Cc: <stable@vger.kernel.org> Fixes: 0091bfc81741b ("io_uring/af_unix: defer registered files gc to io_uring release") Reported-and-suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c716c88321939156909cfa1bd8b0faaf1c804103.1701868795.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-05io_uring/kbuf: check for buffer list readiness after NULL checkJens Axboe1-2/+2
Move the buffer list 'is_ready' check below the validity check for the buffer list for a given group. Fixes: 5cf4f52e6d8a ("io_uring: free io_buffer_list entries via RCU") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-05io_uring/kbuf: Fix an NULL vs IS_ERR() bug in io_alloc_pbuf_ring()Dan Carpenter1-2/+2
The io_mem_alloc() function returns error pointers, not NULL. Update the check accordingly. Fixes: b10b73c102a2 ("io_uring/kbuf: recycle freed mapped buffer ring entries") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/5ed268d3-a997-4f64-bd71-47faa92101ab@moroto.mountain Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-03io_uring: fix mutex_unlock with unreferenced ctxPavel Begunkov1-6/+3
Callers of mutex_unlock() have to make sure that the mutex stays alive for the whole duration of the function call. For io_uring that means that the following pattern is not valid unless we ensure that the context outlives the mutex_unlock() call. mutex_lock(&ctx->uring_lock); req_put(req); // typically via io_req_task_submit() mutex_unlock(&ctx->uring_lock); Most contexts are fine: io-wq pins requests, syscalls hold the file, task works are taking ctx references and so on. However, the task work fallback path doesn't follow the rule. Cc: <stable@vger.kernel.org> Fixes: 04fc6c802d ("io_uring: save ctx put/get for task_work submit") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/io-uring/CAG48ez3xSoYb+45f1RLtktROJrpiDQ1otNvdR+YLQf7m+Krj5Q@mail.gmail.com/ Signed-off-by: Jens Axboe <axboe@kernel.dk>
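One way to make the quoted pattern safe, sketched on the assumption that ctx->refs is the context's percpu refcount as used elsewhere in io_uring:

    /* pin the ctx so the mutex stays alive across the unlock */
    percpu_ref_get(&ctx->refs);
    mutex_lock(&ctx->uring_lock);
    req_put(req);                      /* may drop the request's "natural" ctx pin */
    mutex_unlock(&ctx->uring_lock);    /* ctx still valid here: we hold a ref      */
    percpu_ref_put(&ctx->refs);        /* may now free the ctx                     */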
2023-12-01io_uring: remove uring_cmd cookieKeith Busch1-1/+0
No more users of this field. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20231130215309.2923568-5-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28io_uring: use fget/fput consistentlyJens Axboe2-23/+24
Normally within a syscall it's fine to use fdget/fdput for grabbing a file from the file table, and it's fine within io_uring as well. We do that via io_uring_enter(2), io_uring_register(2), and then also for cancel which is invoked from the latter. io_uring cannot close its own file descriptors as that is explicitly rejected, and for the cancel side of things, the file itself is just used as a lookup cookie. However, it is more prudent to ensure that full references are always grabbed. For anything threaded, either explicitly in the application itself or through use of the io-wq worker threads, this is what happens anyway. Generalize it and use fget/fput throughout. Also see the below link for more details. Link: https://lore.kernel.org/io-uring/CAG48ez1htVSO3TqmrF8QcX2WFuYTRM-VZ_N10i-VZgbtg=NNqw@mail.gmail.com/ Suggested-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28io_uring: free io_buffer_list entries via RCUJens Axboe3-15/+56
mmap_lock nests under uring_lock out of necessity, as we may be doing user copies with uring_lock held. However, for mmap of provided buffer rings, we attempt to grab uring_lock with mmap_lock already held from do_mmap(). This makes lockdep, rightfully, complain: WARNING: possible circular locking dependency detected 6.7.0-rc1-00009-gff3337ebaf94-dirty #4438 Not tainted ------------------------------------------------------ buf-ring.t/442 is trying to acquire lock: ffff00020e1480a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_uring_validate_mmap_request.isra.0+0x4c/0x140 but task is already holding lock: ffff0000dc226190 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x124/0x264 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&mm->mmap_lock){++++}-{3:3}: __might_fault+0x90/0xbc io_register_pbuf_ring+0x94/0x488 __arm64_sys_io_uring_register+0x8dc/0x1318 invoke_syscall+0x5c/0x17c el0_svc_common.constprop.0+0x108/0x130 do_el0_svc+0x2c/0x38 el0_svc+0x4c/0x94 el0t_64_sync_handler+0x118/0x124 el0t_64_sync+0x168/0x16c -> #0 (&ctx->uring_lock){+.+.}-{3:3}: __lock_acquire+0x19a0/0x2d14 lock_acquire+0x2e0/0x44c __mutex_lock+0x118/0x564 mutex_lock_nested+0x20/0x28 io_uring_validate_mmap_request.isra.0+0x4c/0x140 io_uring_mmu_get_unmapped_area+0x3c/0x98 get_unmapped_area+0xa4/0x158 do_mmap+0xec/0x5b4 vm_mmap_pgoff+0x158/0x264 ksys_mmap_pgoff+0x1d4/0x254 __arm64_sys_mmap+0x80/0x9c invoke_syscall+0x5c/0x17c el0_svc_common.constprop.0+0x108/0x130 do_el0_svc+0x2c/0x38 el0_svc+0x4c/0x94 el0t_64_sync_handler+0x118/0x124 el0t_64_sync+0x168/0x16c From that mmap(2) path, we really just need to ensure that the buffer list doesn't go away from underneath us. For the lower indexed entries, they never go away until the ring is freed and we can always sanely reference those as long as the caller has a file reference. For the higher indexed ones in our xarray, we just need to ensure that the buffer list remains valid while we return the address of it. Free the higher indexed io_buffer_list entries via RCU. With that we can avoid needing ->uring_lock inside mmap(2), and simply hold the RCU read lock around the buffer list lookup and address check. To ensure that the arrayed lookup either returns a valid fully formulated entry via RCU lookup, add an 'is_ready' flag that we access with store and release memory ordering. This isn't needed for the xarray lookups, but doesn't hurt either. Since this isn't a fast path, retain it across both types. Similarly, for the allocated array inside the ctx, ensure we use the proper load/acquire as setup could in theory be running in parallel with mmap. While in there, add a few lockdep checks for documentation purposes. Cc: stable@vger.kernel.org Fixes: c56e022c0a27 ("io_uring: add support for user mapped provided buffer ring") Signed-off-by: Jens Axboe <axboe@kernel.dk>
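A sketch of the mmap-side lookup this describes; the xarray name and the 'is_ready' flag are taken from the text above, the rest is illustrative rather than the exact code:

    struct io_buffer_list *bl;
    void *ring_addr = NULL;

    rcu_read_lock();
    bl = xa_load(&ctx->io_bl_xa, bgid);
    /* entries are freed via RCU, so 'bl' cannot vanish inside this section */
    if (bl && smp_load_acquire(&bl->is_ready))
        ring_addr = bl->buf_ring;
    rcu_read_unlock();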
2023-11-28io_uring/kbuf: prune deferred locked cache when tearing downJens Axboe1-0/+8
We used to just use our page list for final teardown, which would ensure that we got all the buffers, even the ones that were not on the normal cached list. But while moving to slab for the io_buffers, we now only prune this list, not the deferred locked list that we have. This can cause a memory leak if the workload ends up using the intermediate locked list. Fix this by always pruning both lists when tearing down. Fixes: b3a4dbc89d40 ("io_uring/kbuf: Use slab for struct io_buffer objects") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28io_uring/kbuf: recycle freed mapped buffer ring entriesJens Axboe1-11/+66
Right now we stash any potentially mmap'ed provided ring buffer range for freeing at release time, regardless of when they get unregistered. Since we're keeping track of these ranges anyway, keep track of their registration state as well, and use that to recycle ranges when appropriate rather than always allocate new ones. The lookup is a basic scan of entries, checking for the best matching free entry. Fixes: c392cbecd8ec ("io_uring/kbuf: defer release of mapped buffer rings") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28io_uring/kbuf: defer release of mapped buffer ringsJens Axboe3-5/+43
If a provided buffer ring is setup with IOU_PBUF_RING_MMAP, then the kernel allocates the memory for it and the application is expected to mmap(2) this memory. However, io_uring uses remap_pfn_range() for this operation, so we cannot rely on normal munmap/release on freeing them for us. Stash an io_buf_free entry away for each of these, if any, and provide a helper to free them post ->release(). Cc: stable@vger.kernel.org Fixes: c56e022c0a27 ("io_uring: add support for user mapped provided buffer ring") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-28eventfd: simplify eventfd_signal_mask()Christian Brauner1-2/+2
The eventfd_signal_mask() helper was introduced for io_uring and similar to eventfd_signal() it always passed 1 for @n. So don't bother with that argument at all. Link: https://lore.kernel.org/r/20231122-vfs-eventfd-signal-v2-3-bd549b14ce0c@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-11-27io_uring: enable io_mem_alloc/free to be used in other partsJens Axboe2-2/+5
In preparation for using these helpers, make them non-static and add them to our internal header. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-27io_uring: don't guard IORING_OFF_PBUF_RING with SETUP_NO_MMAPJens Axboe1-4/+6
This flag only applies to the SQ and CQ rings, it's perfectly valid to use a mmap approach for the provided ring buffers. Move the check into where it belongs. Cc: stable@vger.kernel.org Fixes: 03d89a2de25b ("io_uring: support for user allocated memory for rings/sqes") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-27io_uring: don't allow discontig pages for IORING_SETUP_NO_MMAPJens Axboe1-18/+21
io_sqes_map() is used rather than io_mem_alloc(), if the application passes in memory for mapping rather than having the kernel allocate it and then mmap(2) the ranges. This then calls __io_uaddr_map() to perform the page mapping and pinning, which checks if we end up with the same pages, if more than one page is mapped. But this check is incorrect and only checks if the first and last pages are the same, whereas it really should be checking whether the mapped pages are contiguous. This allows mapping a single normal page, or a huge page range. Down the line we can add support for remapping pages to be virtually contiguous, which is really all that io_uring cares about. Cc: stable@vger.kernel.org Fixes: 03d89a2de25b ("io_uring: support for user allocated memory for rings/sqes") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
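A sketch of the stricter check implied here (an illustrative helper, not the actual __io_uaddr_map() code): every pinned page must directly follow its predecessor, rather than only comparing the first and last pages.

    static bool pages_contiguous(struct page **pages, int nr_pages)
    {
        int i;

        for (i = 1; i < nr_pages; i++) {
            /* adjacent struct pages imply adjacent physical pages here */
            if (pages[i] != pages[i - 1] + 1)
                return false;
        }
        return true;
    }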
2023-11-20io_uring: fix off-by one bvec indexKeith Busch1-1/+1
If the offset equals the bv_len of the first registered bvec, then the request does not include any of that first bvec. Skip it so that drivers don't have to deal with a zero length bvec, which was observed to break NVMe's PRP list creation. Cc: stable@vger.kernel.org Fixes: bd11b3a391e3 ("io_uring: don't use iov_iter_advance() for fixed buffers") Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20231120221831.2646460-1-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
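A small self-contained model of the fix (plain C, not the kernel's bvec iterator): when the starting offset consumes the whole first segment, skip to the next one instead of emitting a zero-length segment.

    #include <stddef.h>

    struct seg { size_t len; };

    /* Return the index of the first segment the request actually touches,
     * leaving the remaining in-segment offset in *off. */
    static size_t first_used_seg(const struct seg *segs, size_t nr, size_t *off)
    {
        size_t i = 0;

        /* '>=' is the point: offset == len means the segment is fully skipped */
        while (i < nr && *off >= segs[i].len) {
            *off -= segs[i].len;
            i++;
        }
        return i;
    }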
2023-11-20io_uring/fs: consider link->flags when getting path for LINKATCharles Mirabile1-1/+1
In order for `AT_EMPTY_PATH` to work as expected, the fact that the user wants that behavior needs to make it to `getname_flags` or it will return ENOENT. Fixes: cf30da90bc3a ("io_uring: add support for IORING_OP_LINKAT") Cc: <stable@vger.kernel.org> Link: https://github.com/axboe/liburing/issues/995 Signed-off-by: Charles Mirabile <cmirabil@redhat.com> Link: https://lore.kernel.org/r/20231120105545.1209530-1-cmirabil@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-15io_uring/fdinfo: remove need for sqpoll lock for thread/pid retrievalJens Axboe2-9/+12
A previous commit added a trylock for getting the SQPOLL thread info via fdinfo, but this introduced a regression where we often fail to get it if the thread is busy. For that case, we end up not printing the current CPU and PID info. Rather than rely on this lock, just print the pid we already stored in the io_sq_data struct, and ensure we update the current CPU every time we've slept or potentially rescheduled. The latter won't potentially be 100% accurate, but that wasn't the case before either as the task can get migrated at any time unless it has been pinned at creation time. We retain keeping the io_sq_data dereference inside the ctx->uring_lock, as it has always been, as destruction of the thread and data happen below that. We could make this RCU safe, but there's little point in doing that. With this, we always print the last valid information we had, rather than have spurious outputs with missing information. Fixes: 7644b1a1c9a7 ("io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pid") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-14Merge branch 'kvm-guestmemfd' into HEADPaolo Bonzini1-1/+2
Introduce several new KVM uAPIs to ultimately create a guest-first memory subsystem within KVM, a.k.a. guest_memfd. Guest-first memory allows KVM to provide features, enhancements, and optimizations that are kludgly or outright impossible to implement in a generic memory subsystem. The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which similar to the generic memfd_create(), creates an anonymous file and returns a file descriptor that refers to it. Again like "regular" memfd files, guest_memfd files live in RAM, have volatile storage, and are automatically released when the last reference is dropped. The key differences between memfd files (and every other memory subystem) is that guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to convert a guest memory area between the shared and guest-private states. A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to specify attributes for a given page of guest memory. In the long term, it will likely be extended to allow userspace to specify per-gfn RWX protections, including allowing memory to be writable in the guest without it also being writable in host userspace. The immediate and driving use case for guest_memfd are Confidential (CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM. For such use cases, being able to map memory into KVM guests without requiring said memory to be mapped into the host is a hard requirement. While SEV+ and TDX prevent untrusted software from reading guest private data by encrypting guest memory, pKVM provides confidentiality and integrity *without* relying on memory encryption. In addition, with SEV-SNP and especially TDX, accessing guest private memory can be fatal to the host, i.e. KVM must be prevent host userspace from accessing guest memory irrespective of hardware behavior. Long term, guest_memfd may be useful for use cases beyond CoCo VMs, for example hardening userspace against unintentional accesses to guest memory. As mentioned earlier, KVM's ABI uses userspace VMA protections to define the allow guest protection (with an exception granted to mapping guest memory executable), and similarly KVM currently requires the guest mapping size to be a strict subset of the host userspace mapping size. Decoupling the mappings sizes would allow userspace to precisely map only what is needed and with the required permissions, without impacting guest performance. A guest-first memory subsystem also provides clearer line of sight to things like a dedicated memory pool (for slice-of-hardware VMs) and elimination of "struct page" (for offload setups where userspace _never_ needs to DMA from or into guest memory). guest_memfd is the result of 3+ years of development and exploration; taking on memory management responsibilities in KVM was not the first, second, or even third choice for supporting CoCo VMs. But after many failed attempts to avoid KVM-specific backing memory, and looking at where things ended up, it is quite clear that of all approaches tried, guest_memfd is the simplest, most robust, and most extensible, and the right thing to do for KVM and the kernel at-large. The "development cycle" for this version is going to be very short; ideally, next week I will merge it as is in kvm/next, taking this through the KVM tree for 6.8 immediately after the end of the merge window. 
The series is still based on 6.6 (plus KVM changes for 6.7) so it will require a small fixup for changes to get_file_rcu() introduced in 6.7 by commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU"). The fixup will be done as part of the merge commit, and most of the text above will become the commit message for the merge. Pending post-merge work includes: - hugepage support - looking into using the restrictedmem framework for guest memory - introducing a testing mechanism to poison memory, possibly using the same memory attributes introduced here - SNP and TDX support There are two non-KVM patches buried in the middle of this series: fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure() mm: Add AS_UNMOVABLE to mark mapping as completely unmovable The first is small and mostly suggested-by Christian Brauner; the second a bit less so but it was written by an mm person (Vlastimil Babka).
2023-11-14fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()Paolo Bonzini1-1/+2
The call to the inode_init_security_anon() LSM hook is not the sole reason to use anon_inode_getfile_secure() or anon_inode_getfd_secure(). For example, the functions also allow one to create a file with non-zero size, without needing a full-blown filesystem. In this case, you don't need a "secure" version, just unique inodes; the current name of the functions is confusing and does not explain well the difference with the more "standard" anon_inode_getfile() and anon_inode_getfd(). Of course, there is another side of the coin; neither io_uring nor userfaultfd strictly speaking need distinct inodes, and it is not that clear anymore that anon_inode_create_get{file,fd}() allow the LSM to intercept and block the inode's creation. If one was so inclined, anon_inode_getfile_secure() and anon_inode_getfd_secure() could be kept, using the shared inode or a new one depending on CONFIG_SECURITY. However, this is probably overkill, and potentially a cause of bugs in different configurations. Therefore, just add a comment to io_uring and userfaultfd explaining the choice of the function. While at it, remove the export for what is now anon_inode_create_getfd(). There is no in-tree module that uses it, and the old name is gone anyway. If anybody actually needs the symbol, they can ask or they can just use anon_inode_create_getfile(), which will be exported very soon for use in KVM. Suggested-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2023-11-06io_uring: do not clamp read length for multishot readDylan Yudaken1-1/+8
When doing a multishot read, the code path reuses the old read paths. However this breaks an assumption built into those paths, namely that struct io_rw::len is available for reuse by __io_import_iovec. For multishot this results in len being set for the first receive call, and then subsequent calls are clamped to that buffer length incorrectly. Instead keep len as zero after recycling buffers, to reuse the full buffer size of the next selected buffer. Fixes: fc68fcda0491 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Dylan Yudaken <dyudaken@gmail.com> Link: https://lore.kernel.org/r/20231106203909.197089-4-dyudaken@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
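The effect of the fix, sketched against the io_rw::len field named above (illustrative, not the exact patch):

    /* after recycling the selected buffer, forget its length so the next
     * multishot iteration imports the full size of the next buffer */
    io_kbuf_recycle(req, issue_flags);
    rw->len = 0;    /* 0 means "no clamp" for the next __io_import_iovec() */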
2023-11-06io_uring: do not allow multishot read to set addr or lenDylan Yudaken1-0/+4
For addr: this field is not used, since buffer select is forced. But by forcing it to be zero it leaves open future uses of the field. len is actually usable, you could imagine that you want to receive multishot up to a certain length. However right now this is not how it is implemented, and it seems safer to force this to be zero. Fixes: fc68fcda0491 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Dylan Yudaken <dyudaken@gmail.com> Link: https://lore.kernel.org/r/20231106203909.197089-3-dyudaken@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-06io_uring: indicate if io_kbuf_recycle did recycle anythingDylan Yudaken2-8/+11
It can be useful to know if io_kbuf_recycle did actually recycle the buffer on the request, or if it left the request alone. Signed-off-by: Dylan Yudaken <dyudaken@gmail.com> Link: https://lore.kernel.org/r/20231106203909.197089-2-dyudaken@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-06io_uring/rw: add separate prep handler for fixed read/writeJens Axboe3-14/+21
Rather than sprinkle opcode checks in the generic read/write prep handler, have a separate prep handler for the fixed read/write operation. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-06io_uring/rw: add separate prep handler for readv/writevJens Axboe3-9/+18
Rather than sprinkle opcode checks in the generic read/write prep handler, have a separate prep handler for the vectored readv/writev operation. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-03io_uring/net: ensure socket is marked connected on connect retryJens Axboe1-13/+11
io_uring does non-blocking connection attempts, which can yield some unexpected results if a connect request is re-attempted by an application. This is equivalent to the following sync syscall sequence: sock = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, IPPROTO_TCP); connect(sock, &addr, sizeof(addr)); ret == -1 and errno == EINPROGRESS expected here. Now poll for POLLOUT on sock, and when that returns, we expect the socket to be connected. But if we follow that procedure with: connect(sock, &addr, sizeof(addr)); you'd expect ret == -1 and errno == EISCONN here, but you actually get ret == 0. If we attempt the connection one more time, then we get EISCONN as expected. io_uring used to do this, but it turns out that bluetooth fails with EBADFD if you attempt to re-connect. Also looks like EISCONN _could_ occur with this sequence. Retain the ->in_progress logic, but work around a potential EISCONN or EBADFD error and only in those cases look at the sock_error(). This should work in general and avoid the odd sequence of a repeated connect request returning success when the socket is already connected. This is all a side effect of the socket state being in a CONNECTING state when we get EINPROGRESS, and only a re-connect or other related operation will turn that into CONNECTED. Cc: stable@vger.kernel.org Fixes: 3fb1bd688172 ("io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT") Link: https://github.com/axboe/liburing/issues/980 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-03io_uring/rw: don't attempt to allocate async data if opcode doesn't need itJens Axboe1-0/+7
The new read multishot method doesn't need to allocate async data ever, as it doesn't do vectored IO and it must only be used with provided buffers. While it doesn't have ->prep_async() set, it also sets ->async_size to 0, which is different from any other read/write type we otherwise support. If it's used on a file type that isn't pollable, we do try and allocate this async data, and then try and use that data. But since we passed in a size of 0 for the data, we get a NULL back on data allocation. We then proceed to dereference that to copy state, and that obviously won't end well. Add a check in io_setup_async_rw() for this condition, and avoid copying state. Also add a check for whether or not buffer selection is specified in prep while at it. Fixes: fc68fcda0491 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Link: https://bugzilla.kernel.org/show_bug.cgi?id=218101 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-01Merge tag 'io_uring-futex-2023-10-30' of git://git.kernel.dk/linuxLinus Torvalds7-0/+473
Pull io_uring futex support from Jens Axboe: "This adds support for using futexes through io_uring - first futex wake and wait, and then the vectored variant of waiting, futex waitv. For both wait/wake/waitv, we support the bitset variant, as the 'normal' variants can be easily implemented on top of that. PI and requeue are not supported through io_uring, just the above mentioned parts. This may change in the future, but in the spirit of keeping this small (and based on what people have been asking for), this is what we currently have. Wake support is pretty straightforward, most of the thought has gone into the wait side to avoid needing to offload wait operations to a blocking context. Instead, we rely on the usual callbacks to retry and post a completion event, when appropriate. As far as I can recall, the first request for futex support with io_uring came from Andres Freund, working on postgres. His aio rework of postgres was one of the early adopters of io_uring, and futex support was a natural extension for that. This is relevant both from a usability point of view and for efficiency and performance. In Andres's words, for the former: Futex wait support in io_uring makes it a lot easier to avoid deadlocks in concurrent programs that have their own buffer pool: Obviously pages in the application buffer pool have to be locked during IO. If the initiator of IO A needs to wait for a held lock B, the holder of lock B might wait for the IO A to complete. The ability to wait for a lock and IO completions at the same time provides an efficient way to avoid such deadlocks and in terms of efficiency, even without unlocking the full potential yet, Andres says: Futex wake support in io_uring is useful because it allows for more efficient directed wakeups. For some "locks" postgres has queues implemented in userspace, with wakeup logic that cannot easily be implemented with FUTEX_WAKE_BITSET on a single "futex word" (imagine waiting for journal flushes to have completed up to a certain point). Thus a "lock release" sometimes needs to wake up many processes in a row. A quick-and-dirty conversion to doing these wakeups via io_uring led to a 3% throughput increase, with 12% fewer context switches, albeit in a fairly extreme workload" * tag 'io_uring-futex-2023-10-30' of git://git.kernel.dk/linux: io_uring: add support for vectored futex waits futex: make the vectored futex operations available futex: make futex_parse_waitv() available as a helper futex: add wake_data to struct futex_q io_uring: add support for futex wake and wait futex: abstract out a __futex_wake_mark() helper futex: factor out the futex wake handling futex: move FUTEX2_VALID_MASK to futex.h
2023-11-01Merge tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linuxLinus Torvalds1-0/+53
Pull io_uring {get,set}sockopt support from Jens Axboe: "This adds support for using getsockopt and setsockopt via io_uring. The main use case for this is to enable use of direct descriptors, rather than first instantiating a normal file descriptor, doing the option tweaking needed, then turning it into a direct descriptor. With this support, we can avoid needing a regular file descriptor completely. The net and bpf bits have been signed off on their side" * tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linux: selftests/bpf/sockopt: Add io_uring support io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT io_uring/cmd: return -EOPNOTSUPP if net is disabled selftests/net: Extract uring helpers to be reusable tools headers: Grab copy of io_uring.h io_uring/cmd: Pass compat mode in issue_flags net/socket: Break down __sys_getsockopt net/socket: Break down __sys_setsockopt bpf: Add sockptr support for setsockopt bpf: Add sockptr support for getsockopt
2023-11-01Merge tag 'for-6.7/io_uring-2023-10-30' of git://git.kernel.dk/linuxLinus Torvalds14-56/+649
Pull io_uring updates from Jens Axboe: "This contains the core io_uring updates, of which there are not many, and adds support for using WAITID through io_uring and hence not needing to block on these kinds of events. Outside of that, tweaks to the legacy provided buffer handling and some cleanups related to cancelations for uring_cmd support" * tag 'for-6.7/io_uring-2023-10-30' of git://git.kernel.dk/linux: io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeups io_uring/kbuf: Use slab for struct io_buffer objects io_uring/kbuf: Allow the full buffer id space for provided buffers io_uring/kbuf: Fix check of BID wrapping in provided buffers io_uring/rsrc: cleanup io_pin_pages() io_uring: cancelable uring_cmd io_uring: retain top 8bits of uring_cmd flags for kernel internal use io_uring: add IORING_OP_WAITID support exit: add internal include file with helpers exit: add kernel_waitid_prepare() helper exit: move core of do_wait() into helper exit: abstract out should_wake helper for child_wait_callback() io_uring/rw: add support for IORING_OP_READ_MULTISHOT io_uring/rw: mark readv/writev as vectored in the opcode definition io_uring/rw: split io_read() into a helper
2023-10-30Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfsLinus Torvalds1-8/+1
Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual fses. Features: - Rename and export helpers that get write access to a mount. They are used in overlayfs to get write access to the upper mount. - Print the pretty name of the root device on boot failure. This helps in scenarios where we would usually only print "unknown-block(1,2)". - Add an internal SB_I_NOUMASK flag. This is another part in the endless POSIX ACL saga in a way. When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip the umask because if the relevant inode has POSIX ACLs set it might take the umask from there. But if the inode doesn't have any POSIX ACLs set then we apply the umask in the filesytem itself. So we end up with: (1) no SB_POSIXACL -> strip umask in vfs (2) SB_POSIXACL -> strip umask in filesystem The umask semantics associated with SB_POSIXACL allowed filesystems that don't even support POSIX ACLs at all to raise SB_POSIXACL purely to avoid umask stripping. That specifically means NFS v4 and Overlayfs. NFS v4 does it because it delegates this to the server and Overlayfs because it needs to delegate umask stripping to the upper filesystem, i.e., the filesystem used as the writable layer. This went so far that SB_POSIXACL is raised eve on kernels that don't even have POSIX ACL support at all. Stop this blatant abuse and add SB_I_NOUMASK which is an internal superblock flag that filesystems can raise to opt out of umask handling. That should really only be the two mentioned above. It's not that we want any filesystems to do this. Ideally we have all umask handling always in the vfs. - Make overlayfs use SB_I_NOUMASK too. - Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in IS_POSIXACL() if the kernel doesn't have support for it. This is a very old patch but it's only possible to do this now with the wider cleanup that was done. - Follow-up work on fake path handling from last cycle. Citing mostly from Amir: When overlayfs was first merged, overlayfs files of regular files and directories, the ones that are installed in file table, had a "fake" path, namely, f_path is the overlayfs path and f_inode is the "real" inode on the underlying filesystem. In v6.5, we took another small step by introducing of the backing_file container and the file_real_path() helper. This change allowed vfs and filesystem code to get the "real" path of an overlayfs backing file. With this change, we were able to make fsnotify work correctly and report events on the "real" filesystem objects that were accessed via overlayfs. This method works fine, but it still leaves the vfs vulnerable to new code that is not aware of files with fake path. A recent example is commit db1d1e8b9867 ("IMA: use vfs_getattr_nosec to get the i_version"). This commit uses direct referencing to f_path in IMA code that otherwise uses file_inode() and file_dentry() to reference the filesystem objects that it is measuring. This contains work to switch things around: instead of having filesystem code opt-in to get the "real" path, have generic code opt-in for the "fake" path in the few places that it is needed. Is it far more likely that new filesystems code that does not use the file_dentry() and file_real_path() helpers will end up causing crashes or averting LSM/audit rules if we keep the "fake" path exposed by default. 
This change already makes file_dentry() moot, but for now we did not change this helper just added a WARN_ON() in ovl_d_real() to catch if we have made any wrong assumptions. After the dust settles on this change, we can make file_dentry() a plain accessor and we can drop the inode argument to ->d_real(). - Switch struct file to SLAB_TYPESAFE_BY_RCU. This looks like a small change but it really isn't and I would like to see everyone on their tippie toes for any possible bugs from this work. Essentially we've been doing most of what SLAB_TYPESAFE_BY_RCU for files since a very long time because of the nasty interactions between the SCM_RIGHTS file descriptor garbage collection. So extending it makes a lot of sense but it is a subtle change. There are almost no places that fiddle with file rcu semantics directly and the ones that did mess around with struct file internal under rcu have been made to stop doing that because it really was always dodgy. I forgot to put in the link tag for this change and the discussion in the commit so adding it into the merge message: https://lore.kernel.org/r/20230926162228.68666-1-mjguzik@gmail.com Cleanups: - Various smaller pipe cleanups including the removal of a spin lock that was only used to protect against writes without pipe_lock() from O_NOTIFICATION_PIPE aka watch queues. As that was never implemented remove the additional locking from pipe_write(). - Annotate struct watch_filter with the new __counted_by attribute. - Clarify do_unlinkat() cleanup so that it doesn't look like an extra iput() is done that would cause issues. - Simplify file cleanup when the file has never been opened. - Use module helper instead of open-coding it. - Predict error unlikely for stale retry. - Use WRITE_ONCE() for mount expiry field instead of just commenting that one hopes the compiler doesn't get smart. Fixes: - Fix readahead on block devices. - Fix writeback when layztime is enabled and inodes whose timestamp is the only thing that changed reside on wb->b_dirty_time. This caused excessively large zombie memory cgroup when lazytime was enabled as such inodes weren't handled fast enough. - Convert BUG_ON() to WARN_ON_ONCE() in open_last_lookups()" * tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (26 commits) file, i915: fix file reference for mmap_singleton() vfs: Convert BUG_ON to WARN_ON_ONCE in open_last_lookups writeback, cgroup: switch inodes with dirty timestamps to release dying cgwbs chardev: Simplify usage of try_module_get() ovl: rely on SB_I_NOUMASK fs: fix umask on NFS with CONFIG_FS_POSIX_ACL=n fs: store real path instead of fake path in backing file f_path fs: create helper file_user_path() for user displayed mapped file path fs: get mnt_writers count for an open backing file's real path vfs: stop counting on gcc not messing with mnt_expiry_mark if not asked vfs: predict the error in retry_estale as unlikely backing file: free directly vfs: fix readahead(2) on block devices io_uring: use files_lookup_fd_locked() file: convert to SLAB_TYPESAFE_BY_RCU vfs: shave work on failed file open fs: simplify misleading code to remove ambiguity regarding ihold()/iput() watch_queue: Annotate struct watch_filter with __counted_by fs/pipe: use spinlock in pipe_read() only if there is a watch_queue fs/pipe: remove unnecessary spinlock from pipe_write() ...
2023-10-27Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds1-1/+1
Pull misc filesystem fixes from Al Viro: "Assorted fixes all over the place: literally nothing in common, could have been three separate pull requests. All are simple regression fixes, but not for anything from this cycle" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ceph_wait_on_conflict_unlink(): grab reference before dropping ->d_lock io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() failed sparc32: fix a braino in fault handling in csum_and_copy_..._user()
2023-10-27io_uring: kiocb_done() should *not* trust ->ki_pos if ->{read,write}_iter() ↵Al Viro1-1/+1
failed ->ki_pos value is unreliable in such cases. For an obvious example, consider O_DSYNC write - we feed the data to page cache and start IO, then we make sure it's completed. Update of ->ki_pos is dealt with by the first part; failure in the second ends up with negative value returned _and_ ->ki_pos left advanced as if sync had been successful. In the same situation write(2) does not advance the file position at all. Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2023-10-25io_uring/rw: disable IOCB_DIO_CALLER_COMPJens Axboe1-9/+0
If an application does O_DIRECT writes with io_uring and the file system supports IOCB_DIO_CALLER_COMP, then completions of the dio write side is done from the task_work that will post the completion event for said write as well. Whenever a dio write is done against a file, the inode i_dio_count is elevated. This enables other callers to use inode_dio_wait() to wait for previous writes to complete. If we defer the full dio completion to task_work, we are dependent on that task_work being run before the inode i_dio_count can be decremented. If the same task that issues io_uring dio writes with IOCB_DIO_CALLER_COMP performs a synchronous system call that calls inode_dio_wait(), then we can deadlock as we're blocked sleeping on the event to become true, but not processing the completions that will result in the inode i_dio_count being decremented. Until we can guarantee that this is the case, then disable the deferred caller completions. Fixes: 099ada2c8726 ("io_uring/rw: add write support for IOCB_DIO_CALLER_COMP") Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-25io_uring/fdinfo: lock SQ thread while retrieving thread cpu/pidJens Axboe1-6/+12
We could race with SQ thread exit, and if we do, we'll hit a NULL pointer dereference when the thread is cleared. Grab the SQPOLL data lock before attempting to get the task cpu and pid for fdinfo, this ensures we have a stable view of it. Cc: stable@vger.kernel.org Link: https://bugzilla.kernel.org/show_bug.cgi?id=218032 Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-19io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPTBreno Leitao1-0/+21
Add initial support for SOCKET_URING_OP_SETSOCKOPT. This new command is similar to setsockopt. This implementation leverages the function do_sock_setsockopt(), which is shared with the setsockopt() system call path. Important to say that userspace needs to keep the pointer's memory alive until the operation is completed. I.e., the memory cannot be deallocated before the CQE is returned to userspace. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20231016134750.1381153-11-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
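As an illustration only (not part of the commit), a minimal userspace sketch of issuing such a setsockopt; it assumes liburing's io_uring_prep_cmd_sock() helper with this shape, an already initialized ring, and an arbitrary option choice:

    /* Sketch: set SO_REUSEADDR through SOCKET_URING_OP_SETSOCKOPT.
     * 'val' must stay alive until the CQE for this request is posted. */
    static int val = 1;
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    io_uring_prep_cmd_sock(sqe, SOCKET_URING_OP_SETSOCKOPT, sockfd,
                           SOL_SOCKET, SO_REUSEADDR, &val, sizeof(val));
    io_uring_submit(&ring);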
2023-10-19io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPTBreno Leitao1-0/+28
Add support for getsockopt command (SOCKET_URING_OP_GETSOCKOPT), where level is SOL_SOCKET. This is leveraging the sockptr_t infrastructure, where a sockptr_t is either userspace or kernel space, and handled as such. Unlike getsockopt(2), the optlen field is not a userspace pointer. In getsockopt(2), userspace provides an optlen pointer, which is overwritten by the kernel. In this implementation, userspace passes a u32, and the new value is returned in cqe->res. I.e., optlen is not a pointer. Important to say that userspace needs to keep the pointer alive until the CQE is completed. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20231016134750.1381153-10-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
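A matching sketch for the getsockopt side (again an assumption-laden illustration, not from the commit, reusing the hypothetical io_uring_prep_cmd_sock() helper): optlen is passed by value and the resulting length comes back in cqe->res:

    /* Sketch: read back SO_RCVBUF through SOCKET_URING_OP_GETSOCKOPT. */
    int optval = 0;
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    struct io_uring_cqe *cqe;

    io_uring_prep_cmd_sock(sqe, SOCKET_URING_OP_GETSOCKOPT, sockfd,
                           SOL_SOCKET, SO_RCVBUF, &optval, sizeof(optval));
    io_uring_submit(&ring);
    io_uring_wait_cqe(&ring, &cqe);
    /* cqe->res < 0 is an error; otherwise it is the updated optlen */
    io_uring_cqe_seen(&ring, cqe);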
2023-10-19io_uring/cmd: return -EOPNOTSUPP if net is disabledBreno Leitao1-0/+2
Protect io_uring_cmd_sock() from being called if CONFIG_NET is not set. If networking is not enabled, but io_uring is, then we want to return -EOPNOTSUPP for any possible socket operation. This is helpful because io_uring_cmd_sock() can now call functions that only exist if CONFIG_NET is enabled, without having #ifdef CONFIG_NET inside the function itself. Signed-off-by: Breno Leitao <leitao@debian.org> Link: https://lore.kernel.org/r/20231016134750.1381153-9-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-19io_uring/cmd: Pass compat mode in issue_flagsBreno Leitao1-0/+2
Create a new flag to track if the operation is running in compat mode. This basically checks the context's ->compat and passes it in the issue_flags, so it can be queried later in the callbacks. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20231016134750.1381153-6-leitao@debian.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-19io_uring/poll: use IOU_F_TWQ_LAZY_WAKE for wakeupsJens Axboe1-1/+1
With poll triggered retries, each event trigger will cause a task_work item to be added for processing. If the ring is setup with IORING_SETUP_DEFER_TASKRUN and a task is waiting on multiple events to complete, any task_work addition will wake the task for processing these items. This can cause more context switches than we would like, if the application is deliberately waiting on multiple items to increase efficiency. For example, if an application has receive multishot armed for sockets and wants to wait for N to complete within M usec of time, we should not be waking up and processing these items until we have all the events we asked for. By switching the poll trigger to lazy wake, we'll process them when they are all ready, in one swoop, rather than wake multiple times only to process one and then go back to sleep. At some point we probably want to look at just making the lazy wake the default, but for now, let's just selectively enable it where it makes sense. Signed-off-by: Jens Axboe <axboe@kernel.dk>
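For illustration (assuming liburing; ring size and N are made-up values), the waiting pattern described above looks roughly like this; the lazy wake only pays off when the task genuinely waits for a batch of completions:

    /* Sketch: DEFER_TASKRUN ring, wake the waiter only once N CQEs are ready. */
    struct io_uring ring;

    io_uring_queue_init(256, &ring,
                        IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN);
    /* ... arm multishot poll/receive requests ... */
    io_uring_submit_and_wait(&ring, N);   /* return once N completions are posted */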
2023-10-19io_uring: use files_lookup_fd_locked()Christian Brauner1-8/+1
While valid, we don't need to open-code RCU dereferences if we're acquiring the file_lock anyway. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20231010030615.GO800259@ZenIV Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-10-18io_uring: fix crash with IORING_SETUP_NO_MMAP and invalid SQ ring addressJens Axboe1-0/+6
If we specify a valid CQ ring address but an invalid SQ ring address, we'll correctly spot this and free the allocated pages and clear them to NULL. However, we don't clear the ring page count, and hence will attempt to free the pages again. We've already cleared the address of the page array when freeing them, but we don't check for that. This causes the following crash: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 Oops [#1] Modules linked in: CPU: 0 PID: 20 Comm: kworker/u2:1 Not tainted 6.6.0-rc5-dirty #56 Hardware name: ucbbar,riscvemu-bare (DT) Workqueue: events_unbound io_ring_exit_work epc : io_pages_free+0x2a/0x58 ra : io_rings_free+0x3a/0x50 epc : ffffffff808811a2 ra : ffffffff80881406 sp : ffff8f80000c3cd0 status: 0000000200000121 badaddr: 0000000000000000 cause: 000000000000000d [<ffffffff808811a2>] io_pages_free+0x2a/0x58 [<ffffffff80881406>] io_rings_free+0x3a/0x50 [<ffffffff80882176>] io_ring_exit_work+0x37e/0x424 [<ffffffff80027234>] process_one_work+0x10c/0x1f4 [<ffffffff8002756e>] worker_thread+0x252/0x31c [<ffffffff8002f5e4>] kthread+0xc4/0xe0 [<ffffffff8000332a>] ret_from_fork+0xa/0x1c Check for a NULL array in io_pages_free(), but also clear the page counts when we free them to be on the safer side. Reported-by: rtm@csail.mit.edu Fixes: 03d89a2de25b ("io_uring: support for user allocated memory for rings/sqes") Cc: stable@vger.kernel.org Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-05io-wq: fully initialize wqe before calling cpuhp_state_add_instance_nocalls()Jeff Moyer1-6/+4
I received a bug report with the following signature: [ 1759.937637] BUG: unable to handle page fault for address: ffffffffffffffe8 [ 1759.944564] #PF: supervisor read access in kernel mode [ 1759.949732] #PF: error_code(0x0000) - not-present page [ 1759.954901] PGD 7ab615067 P4D 7ab615067 PUD 7ab617067 PMD 0 [ 1759.960596] Oops: 0000 1 PREEMPT SMP PTI [ 1759.964804] CPU: 15 PID: 109 Comm: cpuhp/15 Kdump: loaded Tainted: G X ------- — 5.14.0-362.3.1.el9_3.x86_64 #1 [ 1759.976609] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 06/20/2018 [ 1759.985181] RIP: 0010:io_wq_for_each_worker.isra.0+0x24/0xa0 [ 1759.990877] Code: 90 90 90 90 90 90 0f 1f 44 00 00 41 56 41 55 41 54 55 48 8d 6f 78 53 48 8b 47 78 48 39 c5 74 4f 49 89 f5 49 89 d4 48 8d 58 e8 <8b> 13 85 d2 74 32 8d 4a 01 89 d0 f0 0f b1 0b 75 5c 09 ca 78 3d 48 [ 1760.009758] RSP: 0000:ffffb6f403603e20 EFLAGS: 00010286 [ 1760.015013] RAX: 0000000000000000 RBX: ffffffffffffffe8 RCX: 0000000000000000 [ 1760.022188] RDX: ffffb6f403603e50 RSI: ffffffffb11e95b0 RDI: ffff9f73b09e9400 [ 1760.029362] RBP: ffff9f73b09e9478 R08: 000000000000000f R09: 0000000000000000 [ 1760.036536] R10: ffffffffffffff00 R11: ffffb6f403603d80 R12: ffffb6f403603e50 [ 1760.043712] R13: ffffffffb11e95b0 R14: ffffffffb28531e8 R15: ffff9f7a6fbdf548 [ 1760.050887] FS: 0000000000000000(0000) GS:ffff9f7a6fbc0000(0000) knlGS:0000000000000000 [ 1760.059025] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1760.064801] CR2: ffffffffffffffe8 CR3: 00000007ab610002 CR4: 00000000007706e0 [ 1760.071976] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1760.079150] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1760.086325] PKRU: 55555554 [ 1760.089044] Call Trace: [ 1760.091501] <TASK> [ 1760.093612] ? show_trace_log_lvl+0x1c4/0x2df [ 1760.097995] ? show_trace_log_lvl+0x1c4/0x2df [ 1760.102377] ? __io_wq_cpu_online+0x54/0xb0 [ 1760.106584] ? __die_body.cold+0x8/0xd [ 1760.110356] ? page_fault_oops+0x134/0x170 [ 1760.114479] ? kernelmode_fixup_or_oops+0x84/0x110 [ 1760.119298] ? exc_page_fault+0xa8/0x150 [ 1760.123247] ? asm_exc_page_fault+0x22/0x30 [ 1760.127458] ? __pfx_io_wq_worker_affinity+0x10/0x10 [ 1760.132453] ? __pfx_io_wq_worker_affinity+0x10/0x10 [ 1760.137446] ? io_wq_for_each_worker.isra.0+0x24/0xa0 [ 1760.142527] __io_wq_cpu_online+0x54/0xb0 [ 1760.146558] cpuhp_invoke_callback+0x109/0x460 [ 1760.151029] ? __pfx_io_wq_cpu_offline+0x10/0x10 [ 1760.155673] ? __pfx_smpboot_thread_fn+0x10/0x10 [ 1760.160320] cpuhp_thread_fun+0x8d/0x140 [ 1760.164266] smpboot_thread_fn+0xd3/0x1a0 [ 1760.168297] kthread+0xdd/0x100 [ 1760.171457] ? 
__pfx_kthread+0x10/0x10 [ 1760.175225] ret_from_fork+0x29/0x50 [ 1760.178826] </TASK> [ 1760.181022] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc vfat fat dm_multipath intel_rapl_msr intel_rapl_common isst_if_common ipmi_ssif nfit libnvdimm mgag200 i2c_algo_bit ioatdma drm_shmem_helper drm_kms_helper acpi_ipmi syscopyarea x86_pkg_temp_thermal sysfillrect ipmi_si intel_powerclamp sysimgblt ipmi_devintf coretemp acpi_power_meter ipmi_msghandler rapl pcspkr dca intel_pch_thermal intel_cstate ses lpc_ich intel_uncore enclosure hpilo mei_me mei acpi_tad fuse drm xfs sd_mod sg bnx2x nvme nvme_core crct10dif_pclmul crc32_pclmul nvme_common ghash_clmulni_intel smartpqi tg3 t10_pi mdio uas libcrc32c crc32c_intel scsi_transport_sas usb_storage hpwdt wmi dm_mirror dm_region_hash dm_log dm_mod [ 1760.248623] CR2: ffffffffffffffe8 A cpu hotplug callback was issued before wq->all_list was initialized. This results in a null pointer dereference. The fix is to fully setup the io_wq before calling cpuhp_state_add_instance_nocalls(). Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Link: https://lore.kernel.org/r/x49y1ghnecs.fsf@segfault.boston.devel.redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-05io_uring/kbuf: Use slab for struct io_buffer objectsGabriel Krisman Bertazi3-22/+30
The allocation of struct io_buffer for metadata of provided buffers is done through a custom allocator that directly gets pages and fragments them. But slab would do just fine, as this is not a hot path (in fact, it is a deprecated feature) and, by keeping a custom allocator implementation, we lose benefits like tracking, poisoning, and sanitizers. Finally, the custom code is more complex and requires keeping the list of pages in struct ctx for no good reason. This patch cleans this path up and just uses slab. I microbenchmarked it by forcing the allocation of a large number of objects with the least number of io_uring commands possible (keeping nbufs=USHRT_MAX), with and without the patch. There is a slight increase in time spent in the allocation with slab, of course, but even when allocating until system resource exhaustion, which is not very realistic and happened around 1/2 billion provided buffers for me, it wasn't a significant hit in system time. Especially if we think of a real-world scenario, an application doing register/unregister of provided buffers will hit ctx->io_buffers_cache more often than actually going to slab. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20231005000531.30800-4-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-05io_uring/kbuf: Allow the full buffer id space for provided buffersGabriel Krisman Bertazi1-4/+7
nbufs tracks the number of buffers and not the last bgid. In 16-bit, we have 2^16 valid buffers, but the check mistakenly rejects the last bid. Let's fix it to make the interface consistent with the documentation. Fixes: ddf0322db79c ("io_uring: add IORING_OP_PROVIDE_BUFFERS") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20231005000531.30800-3-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-05io_uring/kbuf: Fix check of BID wrapping in provided buffersGabriel Krisman Bertazi1-1/+1
Commit 3851d25c75ed0 ("io_uring: check for rollover of buffer ID when providing buffers") introduced a check to prevent wrapping the BID counter when sqe->off is provided, but it's off by one and too restrictive, rejecting the last possible BID (65534). I.e., the following fails with -EINVAL: io_uring_prep_provide_buffers(sqe, addr, size, 0xFFFF, 0, 0); Fixes: 3851d25c75ed ("io_uring: check for rollover of buffer ID when providing buffers") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20231005000531.30800-2-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-03io_uring: don't allow IORING_SETUP_NO_MMAP rings on highmem pagesJens Axboe1-1/+15
On at least arm32, but presumably any arch with highmem, if the application passes in memory that resides in highmem for the rings, then we should fail that ring creation. We fail it with -EINVAL, which is what kernels that don't support IORING_SETUP_NO_MMAP will do as well. Cc: stable@vger.kernel.org Fixes: 03d89a2de25b ("io_uring: support for user allocated memory for rings/sqes") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-03io_uring: ensure io_lockdep_assert_cq_locked() handles disabled ringsJens Axboe1-14/+27
io_lockdep_assert_cq_locked() checks that locking is correctly done when a CQE is posted. If the ring is setup in a disabled state with IORING_SETUP_R_DISABLED, then ctx->submitter_task isn't assigned until the ring is later enabled. We generally don't post CQEs in this state, as no SQEs can be submitted. However it is possible to generate a CQE if tagged resources are being updated. If this happens and PROVE_LOCKING is enabled, then the locking check helper will dereference ctx->submitter_task, which hasn't been set yet. Fixup io_lockdep_assert_cq_locked() to handle this case correctly. While at it, convert it to a static inline as well, so that generated line offsets will actually reflect which condition failed, rather than just the line offset for io_lockdep_assert_cq_locked() itself. Reported-and-tested-by: syzbot+efc45d4e7ba6ab4ef1eb@syzkaller.appspotmail.com Fixes: f26cc9593581 ("io_uring: lockdep annotate CQ locking") Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-03io_uring/kbuf: don't allow registered buffer rings on highmem pagesJens Axboe1-8/+19
syzbot reports that registering a mapped buffer ring on arm32 can trigger an OOPS. Registered buffer rings have two modes, one of them is the application passing in the memory that the buffer ring should reside in. Once those pages are mapped, we use page_address() to get a virtual address. This will obviously fail on highmem pages, which aren't mapped. Add a check if we have any highmem pages after mapping, and fail the attempt to register a provided buffer ring if we do. This will return the same error as kernels that don't support provided buffer rings to begin with. Link: https://lore.kernel.org/io-uring/000000000000af635c0606bcb889@google.com/ Fixes: c56e022c0a27 ("io_uring: add support for user mapped provided buffer ring") Cc: stable@vger.kernel.org Reported-by: syzbot+2113e61b8848fa7951d8@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-10-02io_uring/rsrc: cleanup io_pin_pages()Jens Axboe1-20/+17
This function is overly convoluted with a goto error path, and checks under the mmap_read_lock() that don't need to be at all. Rearrange it a bit so the checks and errors fall out naturally, rather than needing to jump around for it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-29io_uring/fs: remove sqe->rw_flags checking from LINKATJens Axboe1-1/+1
This is unionized with the actual link flags, so they can of course be set and they will be evaluated further down. If not we fail any LINKAT that has to set option flags. Fixes: cf30da90bc3a ("io_uring: add support for IORING_OP_LINKAT") Cc: stable@vger.kernel.org Reported-by: Thomas Leonard <talex5@gmail.com> Link: https://github.com/axboe/liburing/issues/955 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-29io_uring: add support for vectored futex waitsJens Axboe3-9/+173
This adds support for IORING_OP_FUTEX_WAITV, which allows registering a notification for a number of futexes at once. If one of the futexes are woken, then the request will complete with the index of the futex that got woken as the result. This is identical to what the normal vectored futex waitv operation does. Use like IORING_OP_FUTEX_WAIT, except sqe->addr must now contain a pointer to a struct futex_waitv array, and sqe->off must now contain the number of elements in that array. As flags are passed in the futex_vector array, and likewise for the value and futex address(es), sqe->addr2 and sqe->addr3 are also reserved for IORING_OP_FUTEX_WAITV. For cancelations, FUTEX_WAITV does not rely on the futex_unqueue() return value as we're dealing with multiple futexes. Instead, a separate per io_uring request atomic is used to claim ownership of the request. Waiting on N futexes could be done with IORING_OP_FUTEX_WAIT as well, but that punts a lot of the work to the application: 1) Application would need to submit N IORING_OP_FUTEX_WAIT requests, rather than just a single IORING_OP_FUTEX_WAITV. 2) When one futex is woken, application would need to cancel the remaining N-1 requests that didn't trigger. While this is of course doable, having a single vectored futex wait makes for much simpler application code. Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
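A hand-rolled sketch of such an SQE, using only the field mapping stated above (illustrative, not from the commit; the waiters array, its size, and the initialized ring are placeholders/assumptions):

    /* Sketch: vectored futex wait; per-futex flags, values and addresses live
     * in the struct futex_waitv entries themselves. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_FUTEX_WAITV;
    sqe->addr = (unsigned long)waiters;   /* struct futex_waitv array */
    sqe->off = nr_waiters;                /* number of elements in the array */
    /* addr2 and addr3 are reserved for this opcode */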
2023-09-29io_uring: add support for futex wake and waitJens Axboe7-0/+309
Add support for FUTEX_WAKE/WAIT primitives. IORING_OP_FUTEX_WAKE is a mix of FUTEX_WAKE and FUTEX_WAKE_BITSET, as it does support passing in a bitset. Similarly, IORING_OP_FUTEX_WAIT is a mix of FUTEX_WAIT and FUTEX_WAIT_BITSET. For both of them, they are using the futex2 interface. FUTEX_WAKE is straightforward, as those can always be done directly from the io_uring submission without needing async handling. For FUTEX_WAIT, things are a bit more complicated. If the futex isn't ready, then we rely on a callback via futex_queue->wake() when someone wakes up the futex. From that callback, we queue up task_work with the original task, which will post a CQE and wake it, if necessary. Cancelations are supported, both from the application's point of view and to be able to cancel pending waits if the ring exits before all events have occurred. The return value of futex_unqueue() is used to gate who wins the potential race between cancelation and futex wakeups. Whoever gets a 'ret == 1' return from that claims ownership of the io_uring futex request. This is just the barebones wait/wake support. PI or REQUEUE support is not added at this point, unclear if we might look into that later. Likewise, explicit timeouts are not supported either. It is expected that users who need timeouts would do so via the usual io_uring mechanism, using linked timeouts. The SQE format is as follows: `addr` Address of futex `fd` futex2(2) FUTEX2_* flags `futex_flags` io_uring specific command flags. None valid now. `addr2` Value of futex `addr3` Mask to wake/wait Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
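Read literally, the format above translates into something like the following hand-rolled wait SQE (a sketch, not from the commit; futex_word, expected_value and the initialized ring are placeholders, and a wake request would fill the same fields for IORING_OP_FUTEX_WAKE):

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_FUTEX_WAIT;
    sqe->addr = (unsigned long)&futex_word;  /* address of the futex */
    sqe->fd = FUTEX2_SIZE_U32;               /* futex2(2) FUTEX2_* flags */
    sqe->futex_flags = 0;                    /* io_uring flags: none defined yet */
    sqe->addr2 = expected_value;             /* value of the futex */
    sqe->addr3 = FUTEX_BITSET_MATCH_ANY;     /* mask to wait on */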
2023-09-28io_uring: cancelable uring_cmdMing Lei2-0/+80
A uring_cmd may never complete, such as in ublk, where a uring cmd isn't completed until a new block request comes in from the ublk block device. Add cancelable uring_cmd to provide a mechanism for a driver to cancel pending commands in its own way. Add the io_uring_cmd_mark_cancelable() API for a driver to mark one command as cancelable; io_uring will then cancel this command in io_uring_cancel_generic(). The ->uring_cmd() callback is reused for canceling the command in the driver's way, so the driver gets notified of the cancelation by io_uring. Add the io_uring_cmd_get_task() API to help the driver's cancel handler deal with the cancelation. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-28io_uring: retain top 8bits of uring_cmd flags for kernel internal useMing Lei2-1/+4
Retain the top 8 bits of the uring_cmd flags for kernel-internal use, so that we can move IORING_URING_CMD_POLLED out of the uapi header. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-21io_uring: add IORING_OP_WAITID supportJens Axboe6-1/+406
This adds support for a fully async version of waitid(2). If an event isn't immediately available, wait for a callback to trigger a retry. The format of the sqe is as follows: sqe->len The 'which', the idtype being queried/waited for. sqe->fd The 'pid' (or id) being waited for. sqe->file_index The 'options' being set. sqe->addr2 A pointer to siginfo_t, if any, being filled in. buf_index, addr3, and waitid_flags are reserved/unused for now. waitid_flags will be used for options for this request type. One interesting use case may be to add multi-shot support, so that the request stays armed and posts a notification every time a monitored process state change occurs. Note that this does not support rusage, on Arnd's recommendation. See the waitid(2) man page for details on the arguments. Signed-off-by: Jens Axboe <axboe@kernel.dk>
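Following that field mapping, a minimal hand-rolled example might look like the sketch below (illustrative only; child_pid and the initialized ring are placeholders):

    siginfo_t si;
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_WAITID;
    sqe->len = P_PID;                   /* 'which': the idtype being waited on */
    sqe->fd = child_pid;                /* the id, here a pid */
    sqe->file_index = WEXITED;          /* waitid(2) options */
    sqe->addr2 = (unsigned long)&si;    /* siginfo_t to fill in, may be 0 */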
2023-09-21io_uring/rw: add support for IORING_OP_READ_MULTISHOTJens Axboe3-1/+81
This behaves like IORING_OP_READ, except: 1) It only supports pollable files (eg pipes, sockets, etc). Note that for sockets, you probably want to use recv/recvmsg with multishot instead. 2) It supports multishot mode, meaning it will repeatedly trigger a read and fill a buffer when data is available. This allows similar use to recv/recvmsg but on non-sockets, where a single request will repeatedly post a CQE whenever data is read from it. 3) Because of #2, it must be used with provided buffers. This is uniformly true across any request type that supports multishot and transfers data, with the reason being that it's obviously not possible to pass in a single buffer for the data, as multiple reads may very well trigger before an application has a chance to process previous CQEs and the data passed from them. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
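On the completion side, a sketch of consuming such multishot CQEs (assuming liburing and a registered provided-buffer group, not taken from the commit): each CQE names the buffer used, and the request stays armed while IORING_CQE_F_MORE is set:

    struct io_uring_cqe *cqe;

    io_uring_wait_cqe(&ring, &cqe);
    if (cqe->flags & IORING_CQE_F_BUFFER) {
        int bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
        /* cqe->res bytes of data landed in provided buffer 'bid' */
    }
    if (!(cqe->flags & IORING_CQE_F_MORE)) {
        /* the multishot request terminated; re-arm it if still needed */
    }
    io_uring_cqe_seen(&ring, cqe);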
2023-09-21io_uring/rw: mark readv/writev as vectored in the opcode definitionJens Axboe3-4/+10
This is cleaner than gating on the opcode type, particularly as more read/write type opcodes may be added. Then we can use that for the data import, and for __io_read() on whether or not we need to copy state. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-21io_uring/rw: split io_read() into a helperJens Axboe1-2/+13
Add __io_read() which does the grunt of the work, leaving the completion side to the new io_read(). No functional changes in this patch. Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-14io_uring/net: fix iter retargeting for selected bufPavel Begunkov1-0/+5
When using the selected buffer feature, io_uring delays data iter setup until later. If io_setup_async_msg() is called before that, it might see an iterator that isn't correctly set up yet. Pre-init nr_segs and judge from its state whether the iterator needs repointing. Cc: stable@vger.kernel.org Reported-by: syzbot+a4c6e5ef999b68b26ed1@syzkaller.appspotmail.com Fixes: 0455d4ccec548 ("io_uring: add POLL_FIRST support for send/sendmsg and recv/recvmsg") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0000000000002770be06053c7757@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-07Revert "io_uring: fix IO hang in io_wq_put_and_exit from do_exit()"Jens Axboe1-32/+0
This reverts commit b484a40dc1f16edb58e5430105a021e1916e6f27. This commit cancels all requests with io-wq, not just the ones from the originating task. This breaks use cases that have thread pools, or just multiple tasks issuing requests on the same ring. The liburing regression test for this also shows that problem: $ test/thread-exit.t cqe->res=-125, Expected 512 where an IO thread gets its request canceled rather than complete successfully. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-07io_uring: fix unprotected iopoll overflowPavel Begunkov1-2/+2
[ 71.490669] WARNING: CPU: 3 PID: 17070 at io_uring/io_uring.c:769 io_cqring_event_overflow+0x47b/0x6b0 [ 71.498381] Call Trace: [ 71.498590] <TASK> [ 71.501858] io_req_cqe_overflow+0x105/0x1e0 [ 71.502194] __io_submit_flush_completions+0x9f9/0x1090 [ 71.503537] io_submit_sqes+0xebd/0x1f00 [ 71.503879] __do_sys_io_uring_enter+0x8c5/0x2380 [ 71.507360] do_syscall_64+0x39/0x80 We decoupled CQ locking from ->task_complete but haven't fixed up places forcing locking for CQ overflows. Fixes: ec26c225f06f5 ("io_uring: merge iopoll and normal completion paths") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-07io_uring: break out of iowq iopoll on teardownPavel Begunkov3-0/+13
io-wq will retry iopoll even when it failed with -EAGAIN. If that races with task exit, which sets TIF_NOTIFY_SIGNAL for all its workers, such workers might potentially infinitely spin retrying iopoll again and again and each time failing on some allocation / waiting / etc. Don't keep spinning if io-wq is dying. Fixes: 561fb04a6a225 ("io_uring: replace workqueue usage with io-wq") Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-05io_uring: add a sysctl to disable io_uring system-wideMatteo Rizzo1-0/+50
Introduce a new sysctl (io_uring_disabled) which can be either 0, 1, or 2. When 0 (the default), all processes are allowed to create io_uring instances, which is the current behavior. When 1, io_uring creation is disabled (io_uring_setup() will fail with -EPERM) for unprivileged processes not in the kernel.io_uring_group group. When 2, calls to io_uring_setup() fail with -EPERM regardless of privilege. Signed-off-by: Matteo Rizzo <matteorizzo@google.com> [JEM: modified to add io_uring_group] Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Link: https://lore.kernel.org/r/x49y1i42j1z.fsf@segfault.boston.devel.redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
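From an application's point of view, the only observable effect is io_uring_setup() failing with -EPERM; a small sketch of a liburing-based fallback (illustrative, not from the commit):

    struct io_uring ring;
    int ret = io_uring_queue_init(64, &ring, 0);

    if (ret == -EPERM) {
        /* io_uring is administratively disabled via the sysctl,
         * fall back to epoll/poll based I/O */
    }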
2023-09-01io_uring/fdinfo: only print ->sq_array[] if it's thereJens Axboe1-0/+2
If a ring is setup with IORING_SETUP_NO_SQARRAY, then we don't have the SQ array. Don't try to dump info from it through fdinfo if that is the case. Reported-by: syzbot+216e2ea6e0bf4a0acdd7@syzkaller.appspotmail.com Fixes: 2af89abda7d9 ("io_uring: add option to remove SQ indirection") Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-09-01io_uring: fix IO hang in io_wq_put_and_exit from do_exit()Ming Lei1-0/+32
io_wq_put_and_exit() is called from do_exit(), but all FIXED_FILE requests in io_wq aren't canceled in io_uring_cancel_generic() called from do_exit(). Meanwhile, the io_wq IO code path may share resources with the normal iopoll code path. So if any HIPRI request is submitted via io_wq, this request may not get the resources to move on, given iopoll isn't possible in io_wq_put_and_exit(). The issue can be triggered when terminating 't/io_uring -n4 /dev/nullb0' with default null_blk parameters. Fix it by always cancelling all requests in io_wq, adding an io_uring_cancel_wq() helper; this is reasonable because io_wq destruction immediately follows canceling the requests. Closes: https://lore.kernel.org/linux-block/3893581.1691785261@warthog.procyon.org.uk/ Reported-by: David Howells <dhowells@redhat.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20230901134916.2415386-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-30io_uring: Don't set affinity on a dying sqpoll threadGabriel Krisman Bertazi1-1/+3
Syzbot reported a null-ptr-deref of sqd->thread inside io_sqpoll_wq_cpu_affinity. It turns out the sqd->thread can go away from under us during io_uring_register, in case the process gets a fatal signal during io_uring_register. It is not particularly hard to hit the race, and while I am not sure this is the exact case hit by syzbot, it solves it. Finally, checking ->thread is enough to close the race because we locked sqd while "parking" the thread, thus preventing it from going away. I reproduced it fairly consistently with a program that does: int main(void) { ... io_uring_queue_init(RING_LEN, &ring1, IORING_SETUP_SQPOLL); while (1) { io_uring_register_iowq_aff(ring, 1, &mask); } } Executed in a loop with timeout to trigger SIGTERM: while true; do timeout 1 /a.out ; done This will hit the following BUG() in very few attempts. BUG: kernel NULL pointer dereference, address: 00000000000007a8 PGD 800000010e949067 P4D 800000010e949067 PUD 10e46e067 PMD 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 15715 Comm: dead-sqpoll Not tainted 6.5.0-rc7-next-20230825-g193296236fa0-dirty #23 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:io_sqpoll_wq_cpu_affinity+0x27/0x70 Code: 90 90 90 0f 1f 44 00 00 55 53 48 8b 9f 98 03 00 00 48 85 db 74 4f 48 89 df 48 89 f5 e8 e2 f8 ff ff 48 8b 43 38 48 85 c0 74 22 <48> 8b b8 a8 07 00 00 48 89 ee e8 ba b1 00 00 48 89 df 89 c5 e8 70 RSP: 0018:ffffb04040ea7e70 EFLAGS: 00010282 RAX: 0000000000000000 RBX: ffff93c010749e40 RCX: 0000000000000001 RDX: 0000000000000000 RSI: ffffffffa7653331 RDI: 00000000ffffffff RBP: ffffb04040ea7eb8 R08: 0000000000000000 R09: c0000000ffffdfff R10: ffff93c01141b600 R11: ffffb04040ea7d18 R12: ffff93c00ea74840 R13: 0000000000000011 R14: 0000000000000000 R15: ffff93c00ea74800 FS: 00007fb7c276ab80(0000) GS:ffff93c36f200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000007a8 CR3: 0000000111634003 CR4: 0000000000370ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? __die_body+0x1a/0x60 ? page_fault_oops+0x154/0x440 ? do_user_addr_fault+0x174/0x7b0 ? exc_page_fault+0x63/0x140 ? asm_exc_page_fault+0x22/0x30 ? io_sqpoll_wq_cpu_affinity+0x27/0x70 __io_register_iowq_aff+0x2b/0x60 __io_uring_register+0x614/0xa70 __x64_sys_io_uring_register+0xaa/0x1a0 do_syscall_64+0x3a/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0033:0x7fb7c226fec9 Code: 2e 00 b8 ca 00 00 00 0f 05 eb a5 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 97 7f 2d 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe2c0674f8 EFLAGS: 00000246 ORIG_RAX: 00000000000001ab RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb7c226fec9 RDX: 00007ffe2c067530 RSI: 0000000000000011 RDI: 0000000000000003 RBP: 00007ffe2c0675d0 R08: 00007ffe2c067550 R09: 00007ffe2c067550 R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffe2c067750 R14: 0000000000000000 R15: 0000000000000000 </TASK> Modules linked in: CR2: 00000000000007a8 ---[ end trace 0000000000000000 ]--- Reported-by: syzbot+c74fea926a78b8a91042@syzkaller.appspotmail.com Fixes: ebdfefc09c6d ("io_uring/sqpoll: fix io-wq affinity when IORING_SETUP_SQPOLL is used") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/87v8cybuo6.fsf@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-29Merge tag 'for-6.6/io_uring-2023-08-28' of git://git.kernel.dk/linuxLinus Torvalds17-261/+339
Pull io_uring updates from Jens Axboe: "Fairly quiet round in terms of features, mostly just improvements all over the map for existing code. In detail: - Initial support for socket operations through io_uring. Latter half of this will likely land with the 6.7 kernel, then allowing things like get/setsockopt (Breno) - Cleanup of the cancel code, and then adding support for canceling requests with the opcode as the key (me) - Improvements for the io-wq locking (me) - Fix affinity setting for SQPOLL based io-wq (me) - Remove the io_uring userspace code. These were added initially as copies from liburing, but all of them have since bitrotted and are way out of date at this point. Rather than attempt to keep them in sync, just get rid of them. People will have liburing available anyway for these examples. (Pavel) - Series improving the CQ/SQ ring caching (Pavel) - Misc fixes and cleanups (Pavel, Yue, me)" * tag 'for-6.6/io_uring-2023-08-28' of git://git.kernel.dk/linux: (47 commits) io_uring: move iopoll ctx fields around io_uring: move multishot cqe cache in ctx io_uring: separate task_work/waiting cache line io_uring: banish non-hot data to end of io_ring_ctx io_uring: move non aligned field to the end io_uring: add option to remove SQ indirection io_uring: compact SQ/CQ heads/tails io_uring: force inline io_fill_cqe_req io_uring: merge iopoll and normal completion paths io_uring: reorder cqring_flush and wakeups io_uring: optimise extra io_get_cqe null check io_uring: refactor __io_get_cqe() io_uring: simplify big_cqe handling io_uring: cqe init hardening io_uring: improve cqe !tracing hot path io_uring/rsrc: Annotate struct io_mapped_ubuf with __counted_by io_uring/sqpoll: fix io-wq affinity when IORING_SETUP_SQPOLL is used io_uring: simplify io_run_task_work_sig return io_uring/rsrc: keep one global dummy_ubuf io_uring: never overflow io_aux_cqe ...
2023-08-29Merge tag 'mm-stable-2023-08-28-18-26' of ↵Linus Torvalds2-10/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which reduces the special-case code for handling hugetlb pages in GUP. It also speeds up GUP handling of transparent hugepages. - Peng Zhang provides some maple tree speedups ("Optimize the fast path of mas_store()"). - Sergey Senozhatsky has improved te performance of zsmalloc during compaction (zsmalloc: small compaction improvements"). - Domenico Cerasuolo has developed additional selftest code for zswap ("selftests: cgroup: add zswap test program"). - xu xin has doe some work on KSM's handling of zero pages. These changes are mainly to enable the user to better understand the effectiveness of KSM's treatment of zero pages ("ksm: support tracking KSM-placed zero-pages"). - Jeff Xu has fixes the behaviour of memfd's MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED"). - David Howells has fixed an fscache optimization ("mm, netfs, fscache: Stop read optimisation when folio removed from pagecache"). - Axel Rasmussen has given userfaultfd the ability to simulate memory poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD"). - Miaohe Lin has contributed some routine maintenance work on the memory-failure code ("mm: memory-failure: remove unneeded PageHuge() check"). - Peng Zhang has contributed some maintenance work on the maple tree code ("Improve the validation for maple tree and some cleanup"). - Hugh Dickins has optimized the collapsing of shmem or file pages into THPs ("mm: free retracted page table by RCU"). - Jiaqi Yan has a patch series which permits us to use the healthy subpages within a hardware poisoned huge page for general purposes ("Improve hugetlbfs read on HWPOISON hugepages"). - Kemeng Shi has done some maintenance work on the pagetable-check code ("Remove unused parameters in page_table_check"). - More folioification work from Matthew Wilcox ("More filesystem folio conversions for 6.6"), ("Followup folio conversions for zswap"). And from ZhangPeng ("Convert several functions in page_io.c to use a folio"). - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext"). - Baoquan He has converted some architectures to use the GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take GENERIC_IOREMAP way"). - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support batched/deferred tlb shootdown during page reclamation/migration"). - Better maple tree lockdep checking from Liam Howlett ("More strict maple tree lockdep"). Liam also developed some efficiency improvements ("Reduce preallocations for maple tree"). - Cleanup and optimization to the secondary IOMMU TLB invalidation, from Alistair Popple ("Invalidate secondary IOMMU TLB on permission upgrade"). - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes for arm64"). - Kemeng Shi provides some maintenance work on the compaction code ("Two minor cleanups for compaction"). - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most file-backed faults under the VMA lock"). - Aneesh Kumar contributes code to use the vmemmap optimization for DAX on ppc64, under some circumstances ("Add support for DAX vmemmap optimization for ppc64"). - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client data in page_ext"), ("minor cleanups to page_ext header"). 
- Some zswap cleanups from Johannes Weiner ("mm: zswap: three cleanups"). - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan"). - VMA handling cleanups from Kefeng Wang ("mm: convert to vma_is_initial_heap/stack()"). - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes: implement DAMOS tried total bytes file"), ("Extend DAMOS filters for address ranges and DAMON monitoring targets"). - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction"). - Liam Howlett has improved the maple tree node replacement code ("maple_tree: Change replacement strategy"). - ZhangPeng has a general code cleanup - use the K() macro more widely ("cleanup with helper macro K()"). - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap on memory feature on ppc64"). - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list in page_alloc"), ("Two minor cleanups for get pageblock migratetype"). - Vishal Moola introduces a memory descriptor for page table tracking, "struct ptdesc" ("Split ptdesc from struct page"). - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups for vm.memfd_noexec"). - MM include file rationalization from Hugh Dickins ("arch: include asm/cacheflush.h in asm/hugetlb.h"). - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text output"). - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use object_cache instead of kmemleak_initialized"). - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor and _folio_order"). - A VMA locking scalability improvement from Suren Baghdasaryan ("Per-VMA lock support for swap and userfaults"). - pagetable handling cleanups from Matthew Wilcox ("New page table range API"). - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop using page->private on tail pages for THP_SWAP + cleanups"). - Cleanups and speedups to the hugetlb fault handling from Matthew Wilcox ("Change calling convention for ->huge_fault"). - Matthew Wilcox has also done some maintenance work on the MM subsystem documentation ("Improve mm documentation"). * tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits) maple_tree: shrink struct maple_tree maple_tree: clean up mas_wr_append() secretmem: convert page_is_secretmem() to folio_is_secretmem() nios2: fix flush_dcache_page() for usage from irq context hugetlb: add documentation for vma_kernel_pagesize() mm: add orphaned kernel-doc to the rst files. mm: fix clean_record_shared_mapping_range kernel-doc mm: fix get_mctgt_type() kernel-doc mm: fix kernel-doc warning from tlb_flush_rmaps() mm: remove enum page_entry_size mm: allow ->huge_fault() to be called without the mmap_lock held mm: move PMD_ORDER to pgtable.h mm: remove checks for pte_index memcg: remove duplication detection for mem_cgroup_uncharge_swap mm/huge_memory: work on folio->swap instead of page->private when splitting folio mm/swap: inline folio_set_swap_entry() and folio_swap_entry() mm/swap: use dedicated entry for swap in folio mm/swap: stop using page->private on tail pages for THP_SWAP selftests/mm: fix WARNING comparing pointer to 0 selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check ...
2023-08-28Merge tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds1-3/+24
Pull iomap updates from Darrick Wong: "We've got some big changes for this release -- I'm very happy to be landing willy's work to enable large folios for the page cache for general read and write IOs when the fs can make contiguous space allocations, and Ritesh's work to track sub-folio dirty state to eliminate the write amplification problems inherent in using large folios. As a bonus, io_uring can now process write completions in the caller's context instead of bouncing through a workqueue, which should reduce io latency dramatically. IOWs, XFS should see a nice performance bump for both IO paths. Summary: - Make large writes to the page cache fill sparse parts of the cache with large folios, then use large memcpy calls for the large folio. - Track the per-block dirty state of each large folio so that a buffered write to a single byte on a large folio does not result in a (potentially) multi-megabyte writeback IO. - Allow some directio completions to be performed in the initiating task's context instead of punting through a workqueue. This will reduce latency for some io_uring requests" * tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits) iomap: support IOCB_DIO_CALLER_COMP io_uring/rw: add write support for IOCB_DIO_CALLER_COMP fs: add IOCB flags related to passing back dio completions iomap: add IOMAP_DIO_INLINE_COMP iomap: only set iocb->private for polled bio iomap: treat a write through cache the same as FUA iomap: use an unsigned type for IOMAP_DIO_* defines iomap: cleanup up iomap_dio_bio_end_io() iomap: Add per-block dirty state tracking to improve performance iomap: Allocate ifs in ->write_begin() early iomap: Refactor iomap_write_delalloc_punch() function out iomap: Use iomap_punch_t typedef iomap: Fix possible overflow condition in iomap_write_delalloc_scan iomap: Add some uptodate state handling helpers for ifs state bitmap iomap: Drop ifs argument from iomap_set_range_uptodate() iomap: Rename iomap_page to iomap_folio_state and others iomap: Copy larger chunks from userspace iomap: Create large folios in the buffered write path filemap: Allow __filemap_get_folio to allocate large folios filemap: Add fgf_t typedef ...
2023-08-28Merge tag 'v6.6-vfs.misc' of ↵Linus Torvalds1-24/+9
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual filesystems. Features: - Block mode changes on symlinks and rectify our broken semantics - Report file modifications via fsnotify() for splice - Allow specifying an explicit timeout for the "rootwait" kernel command line option. This allows timing out and rebooting instead of waiting indefinitely for the root device to show up - Use synchronous fput for the close system call Cleanups: - Get rid of open-coded lockdep workarounds for async io submitters and replace them all with a single consolidated helper - Simplify epoll allocation helper - Convert simple_write_begin and simple_write_end to use a folio - Convert page_cache_pipe_buf_confirm() to use a folio - Simplify __range_close to avoid pointless locking - Disable per-cpu buffer head cache for isolated cpus - Port ecryptfs to kmap_local_page() api - Remove redundant initialization of pointer buf in pipe code - Unexport the d_genocide() function which is only used within core vfs - Replace printk(KERN_ERR) and WARN_ON() with WARN() Fixes: - Fix various kernel-doc issues - Fix refcount underflow for eventfds when used as EFD_SEMAPHORE - Fix a mainly theoretical issue in devpts - Check the return value of __getblk() in reiserfs - Fix a racy assert in i_readcount_dec - Fix integer conversion issues in various functions - Fix LSM security context handling during automounts that prevented NFS superblock sharing" * tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits) cachefiles: use kiocb_{start,end}_write() helpers ovl: use kiocb_{start,end}_write() helpers aio: use kiocb_{start,end}_write() helpers io_uring: use kiocb_{start,end}_write() helpers fs: create kiocb_{start,end}_write() helpers fs: add kerneldoc to file_{start,end}_write() helpers io_uring: rename kiocb_end_write() local helper splice: Convert page_cache_pipe_buf_confirm() to use a folio libfs: Convert simple_write_begin and simple_write_end to use a folio fs/dcache: Replace printk and WARN_ON by WARN fs/pipe: remove redundant initialization of pointer buf fs: Fix kernel-doc warnings devpts: Fix kernel-doc warnings doc: idmappings: fix an error and rephrase a paragraph init: Add support for rootwait timeout parameter vfs: fix up the assert in i_readcount_dec fs: Fix one kernel-doc comment docs: filesystems: idmappings: clarify from where idmappings are taken fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing ...
2023-08-24io_uring: move multishot cqe cache in ctxPavel Begunkov1-3/+3
We cache multishot CQEs before flushing them to the CQ in submit_state.cqe. It's a 16-entry cache totalling 256 bytes in the middle of the io_submit_state structure. Move it out of there; that should help CPU caching of the submission state, and shouldn't affect the cached CQEs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/dbe1f39c043ee23da918836be44fcec252ce6711.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
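A minimal layout sketch of the move, using hypothetical, simplified type and field names (the real io_ring_ctx and io_submit_state have many more members):

/* Hypothetical, simplified types -- for illustration only. */
struct cqe_sketch {
	unsigned long long	user_data;
	int			res;
	unsigned int		flags;
};

struct submit_state_sketch {
	/* hot per-submission fields stay packed together ... */
	unsigned short		free_count;
	unsigned short		to_submit;
	/* the 16-entry, 256-byte CQE cache used to live here, in the
	 * middle of the submission-side hot data */
};

struct ring_ctx_sketch {
	struct submit_state_sketch	submit_state;
	/* ... other context fields ... */
	struct cqe_sketch		completion_cqes[16];	/* cache moved here */
};

The cached CQEs are only touched on the completion side, so keeping them out of the submission-state cache lines is a pure win for the submitter.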
2023-08-24io_uring: add option to remove SQ indirectionPavel Begunkov1-20/+32
Not many are aware, but the io_uring submission queue has two levels. The first level usually appears as sq_array and stores indexes into the actual SQ. To my knowledge, no one has ever seriously used it, nor does liburing expose it to users. Add IORING_SETUP_NO_SQARRAY; when it's set we don't bother creating or using the sq_array, and the SQ head/tail will point directly into the SQ. This improves the memory footprint, in terms of both allocations and cache usage, and should also make io_get_sqe() less branchy in the end. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0ffa3268a5ef61d326201ff43a233315c96312e0.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
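A rough sketch of the lookup difference, with hypothetical, simplified structures rather than the kernel's actual layout; the point is only the removed level of indirection:

/* Hypothetical, simplified SQ ring -- not the kernel's real io_ring_ctx. */
struct sqe_sketch { int opcode; /* rest of the 64-byte SQE elided */ };

struct sq_ring_sketch {
	unsigned		head;
	unsigned		ring_mask;
	unsigned		*sq_array;	/* legacy indirection table */
	struct sqe_sketch	*sqes;		/* the actual SQEs */
	int			no_sqarray;	/* models IORING_SETUP_NO_SQARRAY */
};

static struct sqe_sketch *get_sqe_sketch(struct sq_ring_sketch *r)
{
	unsigned idx = r->head & r->ring_mask;

	if (r->no_sqarray)
		return &r->sqes[idx];		/* head indexes the SQ directly */

	return &r->sqes[r->sq_array[idx]];	/* extra dependent load via sq_array */
}

With the flag set, the indirection table never has to be allocated or touched, which is where the memory and cache savings come from.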
2023-08-24io_uring: force inline io_fill_cqe_reqPavel Begunkov1-1/+2
There are only 2 callers of io_fill_cqe_req left, and one of them is extremely hot. Force inline the function. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ffce4fc5e3521966def848a4d930586dfe33ae11.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
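For reference, a sketch of what "force inline" means here; the fallback define mirrors how the kernel maps __always_inline to the compiler attribute, and the body is a stand-in rather than the real CQE-filling code:

#ifndef __always_inline
#define __always_inline	inline __attribute__((__always_inline__))
#endif

/* Sketch only: overrides the compiler's size heuristics for a hot caller. */
static __always_inline int add_one(int x)
{
	return x + 1;	/* stand-in body; the real function copies a CQE */
}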
2023-08-24io_uring: merge iopoll and normal completion pathsPavel Begunkov3-26/+18
io_do_iopoll() and io_submit_flush_completions() are pretty similar: both fill CQEs and then free a list of requests. Don't duplicate that; make iopoll use __io_submit_flush_completions(), which also helps with inlining and other optimisations. For that, we need to first find all completed iopoll requests, splice them off the iopoll list, and then pass that list down. This adds one extra list traversal, which should be fine as requests will stay hot in cache. CQ locking is already conditional; introduce ->lockless_cq and skip locking for IOPOLL, as it's protected by ->uring_lock. We also add a wakeup optimisation for IOPOLL to __io_cq_unlock_post(), so it works just like io_cqring_ev_posted_iopoll(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3840473f5e8a960de35b77292026691880f6bdbc.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
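The ->lockless_cq idea, reduced to a self-contained sketch with stand-in names; a pthread mutex stands in for the kernel's completion lock, and the function names are illustrative only:

#include <pthread.h>
#include <stdbool.h>

/* Stand-in for the ring context; only the locking decision is modelled. */
struct cq_sketch {
	pthread_mutex_t	completion_lock;
	bool		lockless_cq;	/* IOPOLL: caller already holds the ring lock */
};

static void cq_lock(struct cq_sketch *cq)
{
	if (!cq->lockless_cq)
		pthread_mutex_lock(&cq->completion_lock);
}

static void cq_unlock_post(struct cq_sketch *cq)
{
	if (!cq->lockless_cq)
		pthread_mutex_unlock(&cq->completion_lock);
	/* wake CQ waiters here in both cases */
}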
2023-08-24io_uring: reorder cqring_flush and wakeupsPavel Begunkov2-12/+4
Unlike in the past, io_commit_cqring_flush() doesn't do anything that may need io_cqring_wake() to be issued afterwards; all requests it completes will go via task_work. Do io_commit_cqring_flush() after io_cqring_wake() to clean up __io_cq_unlock_post(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ed32dcfeec47e6c97bd6b18c152ddce5b218403f.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-24io_uring: optimise extra io_get_cqe null checkPavel Begunkov2-15/+12
If the cached cqe check passes in io_get_cqe*(), it already means that the cqe we return is valid and non-zero; however, the compiler is unable to optimise away null checks like the one in io_fill_cqe_req(). Do a bit of trickery: return a success/fail boolean from io_get_cqe*() and store the cqe through the cqe parameter. That makes it do the right thing, erasing the check together with the introduced indirection. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/322ea4d3377d3d4efd8ae90ab8ed28a99f518210.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
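The compiler-friendliness trick, shown as a stand-alone sketch with hypothetical names; the kernel's io_get_cqe() is more involved, but the before/after shape is the same:

#include <stdbool.h>
#include <stddef.h>

struct cqe_sketch { unsigned long long user_data; int res; unsigned flags; };

/* Before: callers get a pointer and must re-check it for NULL. */
static struct cqe_sketch *get_cqe_ptr(struct cqe_sketch *cached, unsigned avail)
{
	return avail ? cached : NULL;
}

/*
 * After: success/failure is a bool and the CQE comes back through an
 * out-parameter, so a caller that branches on the bool never needs a
 * separate NULL check that the compiler cannot eliminate.
 */
static bool get_cqe(struct cqe_sketch *cached, unsigned avail,
		    struct cqe_sketch **ret)
{
	if (!avail)
		return false;
	*ret = cached;
	return true;
}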
2023-08-24io_uring: refactor __io_get_cqe()Pavel Begunkov2-20/+16
Make __io_get_cqe simpler by not grabbing the cqe from the refilled cache, but letting io_get_cqe() do it for us. That's cleaner and removes some duplication. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/74dc8fdf2657e438b2e05e1d478a3596924604e9.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-24io_uring: simplify big_cqe handlingPavel Begunkov3-20/+8
Don't keep the big_cqe bits of a req in a union with hash_node; find separate space for them. It's a bit safer, and if we keep them always initialised, we can get rid of the ugly REQ_F_CQE32_INIT handling. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/447aa1b2968978c99e655ba88db536e903df0fe9.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
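A simplified sketch of the layout change; the real struct io_kiocb has many more fields and uses a struct hlist_node for the hash entry, so the names below are illustrative only:

struct big_cqe_sketch { unsigned long long extra1, extra2; };
struct hash_node_sketch { void *next, *pprev; };

/* Before: the extra CQE words shared storage with the poll hash node,
 * so they were only meaningful when REQ_F_CQE32_INIT said so. */
struct req_before_sketch {
	union {
		struct hash_node_sketch	hash_node;
		struct big_cqe_sketch	big_cqe;
	};
};

/* After: a dedicated, always-initialised field -- no flag dance needed. */
struct req_after_sketch {
	struct hash_node_sketch	hash_node;
	struct big_cqe_sketch	big_cqe;
};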
2023-08-24io_uring: cqe init hardeningPavel Begunkov1-1/+1
io_kiocb::cqe stores the completion info which we'll memcpy to userspace, and we rely on callbacks and other later steps to populate it with the right values. We have never had problems with that, but it would still be safer to zero it on allocation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b16a3b64dde678686460d3c3792c3ba6d3d1bc7a.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-08-24io_uring: improve cqe !tracing hot pathPavel Begunkov1-4/+5
While looking at io_fill_cqe_req()'s asm I stumbled on our trace points turning into the chunk below: trace_io_uring_complete(req->ctx, req, req->cqe.user_data, req->cqe.res, req->cqe.flags, req->extra1, req->extra2); io_uring/io_uring.c:898: trace_io_uring_complete(req->ctx, req, req->cqe.user_data, movq 232(%rbx), %rdi # req_44(D)->big_cqe.extra2, _5 movq 224(%rbx), %rdx # req_44(D)->big_cqe.extra1, _6 movl 84(%rbx), %r9d # req_44(D)->cqe.D.81184.flags, _7 movl 80(%rbx), %r8d # req_44(D)->cqe.res, _8 movq 72(%rbx), %rcx # req_44(D)->cqe.user_data, _9 movq 88(%rbx), %rsi # req_44(D)->ctx, _10 ./arch/x86/include/asm/jump_label.h:27: asm_volatile_goto("1:" 1:jmp .L1772 # objtool NOPs this # ... It does a jump_label for actual tracing, but those 6 moves will stay there in the hottest io_uring path. As an optimisation, add a trace_io_uring_complete_enabled() check, which also uses jump_labels and tricks the compiler into behaving. It removes the junk without changing anything else in the hot path. Note: apparently, it's not only me noticing it, and people are also working around it. We should remove the check when it's solved generically or rework tracing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/555d8312644b3776f4be7e23f9b92943875c4bc7.1692916914.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
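The pattern, as a self-contained userspace analogue; trace_complete_enabled() is a stand-in for the tracepoint-generated *_enabled() helper, which in the kernel is backed by the same jump_label as the tracepoint itself:

#include <stdbool.h>
#include <stdio.h>

struct req_sketch { unsigned long long user_data; int res; unsigned flags; };

static bool tracing_on;	/* stand-in for the tracepoint's static key */

static bool trace_complete_enabled(void) { return tracing_on; }

static void trace_complete(unsigned long long data, int res, unsigned flags)
{
	printf("complete: data=%llu res=%d flags=%x\n", data, res, flags);
}

static void fill_cqe_sketch(struct req_sketch *req)
{
	/*
	 * Guarding the call keeps the argument loads out of the hot path
	 * when tracing is off; without the guard, the compiler has to
	 * marshal all the arguments before it even reaches the jump label.
	 */
	if (trace_complete_enabled())
		trace_complete(req->user_data, req->res, req->flags);

	/* ... copy req's CQE into the CQ ring ... */
}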
2023-08-21io_uring: stop calling free_compound_page()Matthew Wilcox (Oracle)2-10/+2
Patch series "Remove _folio_dtor and _folio_order", v2. This patch (of 13): folio_put() is the standard way to write this, and it's not appreciably slower. This is an enabling patch for removing free_compound_page() entirely. Link: https://lkml.kernel.org/r/20230816151201.3655946-1-willy@infradead.org Link: https://lkml.kernel.org/r/20230816151201.3655946-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2023-08-21io_uring: use kiocb_{start,end}_write() helpersAmir Goldstein1-19/+4
Use the helpers instead of the open-coded dance to silence lockdep warnings. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Message-Id: <20230817141337.1025891-5-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-21io_uring: rename kiocb_end_write() local helperAmir Goldstein1-5/+5
This helper does not take a kiocb as input and we want to create a common helper by that name that takes a kiocb as input. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jens Axboe <axboe@kernel.dk> Message-Id: <20230817141337.1025891-2-amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2023-08-17io_uring/rsrc: Annotate struct io_mapped_ubuf with __counted_byKees Cook1-1/+1
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct io_mapped_ubuf. [1] https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: io-uring@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: "Gustavo A. R. Silva" <gustavoars@kernel.org> Link: https://lore.kernel.org/r/20230817212146.never.853-kees@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
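What the annotation looks like on a flexible array member, as a sketch; the struct below is a simplified stand-in for io_mapped_ubuf, and the fallback define only approximates what the kernel provides for compilers without the attribute:

#ifndef __counted_by
#define __counted_by(member)	/* no-op when the compiler lacks support */
#endif

struct bvec_sketch { void *page; unsigned int len, offset; };

struct mapped_ubuf_sketch {
	unsigned long		ubuf;
	unsigned long		ubuf_end;
	unsigned int		nr_bvecs;	/* element count for bvec[] below */
	unsigned long		acct_pages;
	struct bvec_sketch	bvec[] __counted_by(nr_bvecs);
};

With the attribute honoured, an instrumented build can flag any access to bvec[i] where i is not below nr_bvecs at the time of the access.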
2023-08-16io_uring/sqpoll: fix io-wq affinity when IORING_SETUP_SQPOLL is usedJens Axboe5-15/+41
If we set up the ring with SQPOLL, then that polling thread has its own io-wq setup. This means that if the application uses IORING_REGISTER_IOWQ_AFF to set the io-wq affinity, we should not be setting it for the invoking task, but rather for the sqpoll task. Add an sqpoll helper that parks the thread and updates the affinity, and use that one if we're using SQPOLL. Fixes: fe76421d1da1 ("io_uring: allow user configurable IO thread CPU affinity") Cc: stable@vger.kernel.org # 5.10+ Link: https://github.com/axboe/liburing/discussions/884 Signed-off-by: Jens Axboe <axboe@kernel.dk>
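The shape of such a helper, sketched with some guessed details: io_sq_thread_park()/io_sq_thread_unpark() are existing io_uring functions, while set_iowq_affinity() is a hypothetical name standing in for the actual io-wq affinity setter, whose signature is paraphrased here.

/* Sketch only -- field and helper names may differ from the actual fix. */
static int sqpoll_wq_cpu_affinity_sketch(struct io_ring_ctx *ctx,
					 cpumask_var_t mask)
{
	struct io_sq_data *sqd = ctx->sq_data;
	int ret = -EINVAL;

	if (sqd) {
		io_sq_thread_park(sqd);		/* quiesce the SQPOLL thread */
		if (sqd->thread)		/* skip a dying thread */
			ret = set_iowq_affinity(sqd->thread, mask);	/* hypothetical */
		io_sq_thread_unpark(sqd);
	}

	return ret;
}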
2023-08-11io_uring: simplify io_run_task_work_sig returnPavel Begunkov1-2/+2
Nobody cares about io_run_task_work_sig returning 1; we only check for negative errors. Simplify by keeping to 0/-error returns. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3aec8a532c003d6e50739b969a82989402696170.1691757663.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
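A minimal stand-alone sketch of the return-convention change, with stub helpers in place of the real task-work and signal machinery:

#include <errno.h>
#include <stdbool.h>

/* Stubs standing in for the real task-work/signal helpers. */
static bool run_pending_task_work(void) { return false; }
static bool signal_is_pending(void)     { return false; }

/* Before (sketch): 1 meant "did some work", but no caller ever used it. */
static int run_task_work_sig_old(void)
{
	if (run_pending_task_work())
		return 1;
	if (signal_is_pending())
		return -EINTR;
	return 0;
}

/* After: 0 on success, negative error otherwise -- nothing else to check. */
static int run_task_work_sig_new(void)
{
	run_pending_task_work();
	if (signal_is_pending())
		return -EINTR;
	return 0;
}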