aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/nvme
AgeCommit message (Collapse)AuthorFilesLines
6 daysMerge tag 'for-6.10/block-20240511' of git://git.kernel.dk/linuxLinus Torvalds2-8/+4
Pull block updates from Jens Axboe: - Add a partscan attribute in sysfs, fixing an issue with systemd relying on an internal interface that went away. - Attempt #2 at making long running discards interruptible. The previous attempt went into 6.9, but we ended up mostly reverting it as it had issues. - Remove old ida_simple API in bcache - Support for zoned write plugging, greatly improving the performance on zoned devices. - Remove the old throttle low interface, which has been experimental since 2017 and never made it beyond that and isn't being used. - Remove page->index debugging checks in brd, as it hasn't caught anything and prepares us for removing in struct page. - MD pull request from Song - Don't schedule block workers on isolated CPUs * tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux: (84 commits) blk-throttle: delay initialization until configuration blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW block: fix that util can be greater than 100% block: support to account io_ticks precisely block: add plug while submitting IO bcache: fix variable length array abuse in btree_iter bcache: Remove usage of the deprecated ida_simple_xx() API md: Revert "md: Fix overflow in is_mddev_idle" blk-lib: check for kill signal in ioctl BLKDISCARD block: add a bio_await_chain helper block: add a blk_alloc_discard_bio helper block: add a bio_chain_and_submit helper block: move discard checks into the ioctl handler block: remove the discard_granularity check in __blkdev_issue_discard block/ioctl: prefer different overflow check null_blk: Fix the WARNING: modpost: missing MODULE_DESCRIPTION() block: fix and simplify blkdevparts= cmdline parsing block: refine the EOF check in blkdev_iomap_begin block: add a partscan sysfs attribute for disks block: add a disk_has_partscan helper ...
6 daysMerge tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linuxLinus Torvalds1-4/+11
Pull io_uring updates from Jens Axboe: - Greatly improve send zerocopy performance, by enabling coalescing of sent buffers. MSG_ZEROCOPY already does this with send(2) and sendmsg(2), but the io_uring side did not. In local testing, the crossover point for send zerocopy being faster is now around 3000 byte packets, and it performs better than the sync syscall variants as well. This feature relies on a shared branch with net-next, which was pulled into both branches. - Unification of how async preparation is done across opcodes. Previously, opcodes that required extra memory for async retry would allocate that as needed, using on-stack state until that was the case. If async retry was needed, the on-stack state was adjusted appropriately for a retry and then copied to the allocated memory. This led to some fragile and ugly code, particularly for read/write handling, and made storage retries more difficult than they needed to be. Allocate the memory upfront, as it's cheap from our pools, and use that state consistently both initially and also from the retry side. - Move away from using remap_pfn_range() for mapping the rings. This is really not the right interface to use and can cause lifetime issues or leaks. Additionally, it means the ring sq/cq arrays need to be physically contigious, which can cause problems in production with larger rings when services are restarted, as memory can be very fragmented at that point. Move to using vm_insert_page(s) for the ring sq/cq arrays, and apply the same treatment to mapped ring provided buffers. This also helps unify the code we have dealing with allocating and mapping memory. Hard to see in the diffstat as we're adding a few features as well, but this kills about ~400 lines of code from the codebase as well. - Add support for bundles for send/recv. When used with provided buffers, bundles support sending or receiving more than one buffer at the time, improving the efficiency by only needing to call into the networking stack once for multiple sends or receives. - Tweaks for our accept operations, supporting both a DONTWAIT flag for skipping poll arm and retry if we can, and a POLLFIRST flag that the application can use to skip the initial accept attempt and rely purely on poll for triggering the operation. Both of these have identical flags on the receive side already. - Make the task_work ctx locking unconditional. We had various code paths here that would do a mix of lock/trylock and set the task_work state to whether or not it was locked. All of that goes away, we lock it unconditionally and get rid of the state flag indicating whether it's locked or not. The state struct still exists as an empty type, can go away in the future. - Add support for specifying NOP completion values, allowing it to be used for error handling testing. - Use set/test bit for io-wq worker flags. Not strictly needed, but also doesn't hurt and helps silence a KCSAN warning. - Cleanups for io-wq locking and work assignments, closing a tiny race where cancelations would not be able to find the work item reliably. - Misc fixes, cleanups, and improvements * tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux: (97 commits) io_uring: support to inject result for NOP io_uring: fail NOP if non-zero op flags is passed in io_uring/net: add IORING_ACCEPT_POLL_FIRST flag io_uring/net: add IORING_ACCEPT_DONTWAIT flag io_uring/filetable: don't unnecessarily clear/reset bitmap io_uring/io-wq: Use set_bit() and test_bit() at worker->flags io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring io_uring: Require zeroed sqe->len on provided-buffers send io_uring/notif: disable LAZY_WAKE for linked notifs io_uring/net: fix sendzc lazy wake polling io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it io_uring/rw: reinstate thread check for retries io_uring/notif: implement notification stacking io_uring/notif: simplify io_notif_flush() net: add callback for setting a ubuf_info to skb net: extend ubuf_info callback to ops structure io_uring/net: support bundles for recv io_uring/net: support bundles for send io_uring/kbuf: add helpers for getting/peeking multiple buffers io_uring/net: add provided buffer support for IORING_OP_SEND ...
11 daysnvmet-rdma: fix possible bad dereference when freeing rspsSagi Grimberg1-12/+4
It is possible that the host connected and saw a cm established event and started sending nvme capsules on the qp, however the ctrl did not yet see an established event. This is why the rsp_wait_list exists (for async handling of these cmds, we move them to a pending list). Furthermore, it is possible that the ctrl cm times out, resulting in a connect-error cm event. in this case we hit a bad deref [1] because in nvmet_rdma_free_rsps we assume that all the responses are in the free list. We are freeing the cmds array anyways, so don't even bother to remove the rsp from the free_list. It is also guaranteed that we are not racing anything when we are releasing the queue so no other context accessing this array should be running. [1]: -- Workqueue: nvmet-free-wq nvmet_rdma_free_queue_work [nvmet_rdma] [...] pc : nvmet_rdma_free_rsps+0x78/0xb8 [nvmet_rdma] lr : nvmet_rdma_free_queue_work+0x88/0x120 [nvmet_rdma] Call trace: nvmet_rdma_free_rsps+0x78/0xb8 [nvmet_rdma] nvmet_rdma_free_queue_work+0x88/0x120 [nvmet_rdma] process_one_work+0x1ec/0x4a0 worker_thread+0x48/0x490 kthread+0x158/0x160 ret_from_fork+0x10/0x18 -- Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
11 daysnvmet: prevent sprintf() overflow in nvmet_subsys_nsid_exists()Dan Carpenter1-3/+2
The nsid value is a u32 that comes from nvmet_req_find_ns(). It's endian data and we're on an error path and both of those raise red flags. So let's make this safer. 1) Make the buffer large enough for any u32. 2) Remove the unnecessary initialization. 3) Use snprintf() instead of sprintf() for even more safety. 4) The sprintf() function returns the number of bytes printed, not counting the NUL terminator. It is impossible for the return value to be <= 0 so delete that. Fixes: 505363957fad ("nvmet: fix nvme status code when namespace is disabled") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
12 daysnvmet: make nvmet_wq unboundSagi Grimberg1-1/+2
When deleting many controllers one-by-one, it takes a very long time as these work elements may serialize as they are scheduled on the executing cpu instead of spreading. In general nvmet_wq can definitely be used for long standing work elements so its better to make it unbound regardless. Signed-off-by: Sagi Grimberg <sagi.grimberg@vastdata.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
12 daysnvmet-auth: return the error code to the nvmet_auth_ctrl_hash() callersMaurizio Lombardi1-1/+1
If nvmet_auth_ctrl_hash() fails, return the error code to its callers Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
12 daysnvme-pci: Add quirk for broken MSIsSean Anderson2-3/+16
Sandisk SN530 NVMe drives have broken MSIs. On systems without MSI-X support, all commands time out resulting in the following message: nvme nvme0: I/O tag 12 (100c) QID 0 timeout, completion polled These timeouts cause the boot to take an excessively-long time (over 20 minutes) while the initial command queue is flushed. Address this by adding a quirk for drives with buggy MSIs. The lspci output for this device (recorded on a system with MSI-X support) is: 02:00.0 Non-Volatile memory controller: Sandisk Corp Device 5008 (rev 01) (prog-if 02 [NVM Express]) Subsystem: Sandisk Corp Device 5008 Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0 Memory at f7e00000 (64-bit, non-prefetchable) [size=16K] Memory at f7e04000 (64-bit, non-prefetchable) [size=256] Capabilities: [80] Power Management version 3 Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+ Capabilities: [b0] MSI-X: Enable+ Count=17 Masked- Capabilities: [c0] Express Endpoint, MSI 00 Capabilities: [100] Advanced Error Reporting Capabilities: [150] Device Serial Number 00-00-00-00-00-00-00-00 Capabilities: [1b8] Latency Tolerance Reporting Capabilities: [300] Secondary PCI Express Capabilities: [900] L1 PM Substates Kernel driver in use: nvme Kernel modules: nvme Cc: <stable@vger.kernel.org> Signed-off-by: Sean Anderson <sean.anderson@linux.dev> Reviewed-by: Christoph Hellwig <hch@lst.de>
2024-05-01nvme-tcp: strict pdu pacing to avoid send stalls on TLSHannes Reinecke1-2/+8
TLS requires a strict pdu pacing via MSG_EOR to signal the end of a record and subsequent encryption. If we do not set MSG_EOR at the end of a sequence the record won't be closed, encryption doesn't start, and we end up with a send stall as the message will never be passed on to the TCP layer. So do not check for the queue status when TLS is enabled but rather make the MSG_MORE setting dependent on the current request only. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01nvmet: fix nvme status code when namespace is disabledSagi Grimberg3-1/+18
If the user disabled a nvmet namespace, it is removed from the subsystem namespaces list. When nvmet processes a command directed to an nsid that was disabled, it cannot differentiate between a nsid that is disabled vs. a non-existent namespace, and resorts to return NVME_SC_INVALID_NS with the dnr bit set. This translates to a non-retryable status for the host, which translates to a user error. We should expect disabled namespaces to not cause an I/O error in a multipath environment. Address this by searching a configfs item for the namespace nvmet failed to find, and if we found one, conclude that the namespace is disabled (perhaps temporarily). Return NVME_SC_INTERNAL_PATH_ERROR in this case and keep DNR bit cleared. Reported-by: Jirong Feng <jirong.feng@easystack.cn> Tested-by: Jirong Feng <jirong.feng@easystack.cn> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01nvmet-tcp: fix possible memory leak when tearing down a controllerSagi Grimberg1-7/+4
When we teardown the controller, we wait for pending I/Os to complete (sq->ref on all queues to drop to zero) and then we go over the commands, and free their command buffers in case they are still fetching data from the host (e.g. processing nvme writes) and have yet to take a reference on the sq. However, we may miss the case where commands have failed before executing and are queued for sending a response, but will never occur because the queue socket is already down. In this case we may miss deallocating command buffers. Solve this by freeing all commands buffers as nvmet_tcp_free_cmd_buffers is idempotent anyways. Reported-by: Yi Zhang <yi.zhang@redhat.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01nvme: cancel pending I/O if nvme controller is in terminal stateNilay Shroff3-22/+28
While I/O is running, if the pci bus error occurs then in-flight I/O can not complete. Worst, if at this time, user (logically) hot-unplug the nvme disk then the nvme_remove() code path can't forward progress until in-flight I/O is cancelled. So these sequence of events may potentially hang hot-unplug code path indefinitely. This patch helps cancel the pending/in-flight I/O from the nvme request timeout handler in case the nvme controller is in the terminal (DEAD/DELETING/DELETING_NOIO) state and that helps nvme_remove() code path forward progress and finish successfully. Link: https://lore.kernel.org/all/199be893-5dfa-41e5-b6f2-40ac90ebccc4@linux.ibm.com/ Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01nvmet-auth: replace pr_debug() with pr_err() to report an error.Maurizio Lombardi1-3/+3
In nvmet_auth_host_hash(), if a mismatch is detected in the hash length the kernel should print an error. Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01nvmet-auth: return the error code to the nvmet_auth_host_hash() callersMaurizio Lombardi1-1/+1
If the nvmet_auth_host_hash() function fails, the error code should be returned to its callers. Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-05-01nvme: find numa distance only if controller has valid numa idNilay Shroff1-1/+2
On system where native nvme multipath is configured and iopolicy is set to numa but the nvme controller numa node id is undefined or -1 (NUMA_NO_NODE) then avoid calculating node distance for finding optimal io path. In such case we may access numa distance table with invalid index and that may potentially refer to incorrect memory. So this patch ensures that if the nvme controller numa node id is -1 then instead of calculating node distance for finding optimal io path, we set the numa node distance of such controller to default 10 (LOCAL_DISTANCE). Link: https://lore.kernel.org/all/20240413090614.678353-1-nilay@linux.ibm.com/ Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-04-17block: Simplify blk_revalidate_disk_zones() interfaceDamien Le Moal1-1/+1
The only user of blk_revalidate_disk_zones() second argument was the SCSI disk driver (sd). Now that this driver does not require this update_driver_data argument, remove it to simplify the interface of blk_revalidate_disk_zones(). Also update the function kdoc comment to be more accurate (i.e. there is no gendisk ->revalidate method). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-21-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-17nvmet: zns: Do not reference the gendisk conv_zones_bitmapDamien Le Moal1-7/+3
The gendisk conventional zone bitmap is going away. So to check for the presence of conventional zones on a zoned target device, always use report zones. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Tested-by: Hans Holmberg <hans.holmberg@wdc.com> Tested-by: Dennis Maisenbacher <dennis.maisenbacher@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240408014128.205141-19-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-15nvme/io_uring: use helper for polled completionsJens Axboe1-4/+11
NVMe is making up issue_flags, which is a no-no in general, and to make matters worse, they are completely the wrong ones. For a pure polled request, which it does check for, we're already inside the ctx->uring_lock when the completions are run off io_do_iopoll(). Hence the correct flag would be '0' rather than IO_URING_F_UNLOCKED. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-04-11nvme: fix warn output about shared namespaces without CONFIG_NVME_MULTIPATHYi Zhang1-1/+1
Move the stray '.' that is currently at the end of the line after newline '\n' to before newline character which is the right position. Fixes: ce8d78616a6b ("nvme: warn about shared namespaces without CONFIG_NVME_MULTIPATH") Signed-off-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-04-04nvme-fc: rename free_ctrl callback to match name patternDaniel Wagner1-2/+2
Rename nvme_fc_nvme_ctrl_freed to nvme_fc_free_ctrl to match the name pattern for the callback. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-04-04nvmet-fc: move RCU read lock to nvmet_fc_assoc_existsDaniel Wagner1-7/+10
The RCU lock is only needed for the lookup loop and not for list_ad_tail_rcu call. Thus move it down the call chain into nvmet_fc_assoc_exists. While at it also fix the name typo of the function. Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-04-04nvmet: implement unique discovery NQNHannes Reinecke2-0/+54
Unique discovery NQNs allow to differentiate between discovery services from (typically physically separate) NVMe-oF subsystems. This is required for establishing secured connections as otherwise the credentials won't be unique and the integrity of the connection cannot be guaranteed. This patch adds a configfs attribute 'discovery_nqn' in the 'nvmet' configfs directory to specify the unique discovery NQN. Signed-off-by: Hannes Reinecke <hare@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-04-04nvme: don't create a multipath node for zero capacity devicesChristoph Hellwig1-1/+1
Apparently there are nvme controllers around that report namespaces in the namespace list which have zero capacity. Return -ENXIO instead of -ENODEV from nvme_update_ns_info_block so we don't create a hidden multipath node for these namespaces but entirely ignore them. Fixes: 46e7422cda84 ("nvme: move common logic into nvme_update_ns_info") Reported-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-04-02nvme: split nvme_update_zone_infoChristoph Hellwig3-23/+41
nvme_update_zone_info does (admin queue) I/O to the device and can fail. We fail to abort the queue limits update if that happen, but really should avoid with the frozen I/O queue as much as possible anyway. Split the logic into a helper to query the information that can be called on an unfrozen queue and one to apply it to the queue limits. Fixes: 9b130d681443 ("nvme: use the atomic queue limits update API") Reported-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-04-02nvme-multipath: don't inherit LBA-related fields for the multipath nodeChristoph Hellwig1-0/+20
Linux 6.9 made the nvme multipath nodes not properly pick up changes when the LBA size goes smaller after an nvme format. This is because we now try to inherit the queue settings for the multipath node entirely from the individual paths. That is the right thing to do for I/O size limitations, which make up most of the queue limits, but it is wrong for changes to the namespace configuration, where we do want to pick up the new format, which will eventually show up on all paths once they are re-queried. Fix this by not inheriting the block size and related fields and always for updating them. Fixes: 8f03cfa117e0 ("nvme: don't use nvme_update_disk_info for the multipath disk") Reported-by: Nilay Shroff <nilay@linux.ibm.com> Tested-by: Nilay Shroff <nilay@linux.ibm.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-21Merge tag 'nvme-6.9-2024-03-21' of git://git.infradead.org/nvme into block-6.9Jens Axboe11-33/+233
Pull NVMe fixes from Keith: "nvme updates for Linux 6.9 - Make an informative message less ominous (Keith) - Enhanced trace decoding (Guixin) - TCP updates (Hannes, Li) - Fabrics connect deadlock fix (Chunguang) - Platform API migration update (Uwe) - A new device quirk (Jiawei)" * tag 'nvme-6.9-2024-03-21' of git://git.infradead.org/nvme: nvmet-rdma: remove NVMET_RDMA_REQ_INVALIDATE_RKEY flag nvme: remove redundant BUILD_BUG_ON check nvme/tcp: Add wq_unbound modparam for nvme_tcp_wq nvme-tcp: Export the nvme_tcp_wq to sysfs drivers/nvme: Add quirks for device 126f:2262 nvme: parse format command's lbafu when tracing nvme: add tracing of reservation commands nvme: parse zns command's zsa and zrasf to string nvme: use nvme_disk_is_ns_head helper nvme: fix reconnection fail due to reserved tag allocation nvmet: add tracing of zns commands nvmet: add tracing of authentication commands nvme-apple: Convert to platform remove callback returning void nvmet-tcp: do not continue for invalid icreq nvme: change shutdown timeout setting message
2024-03-21nvmet-rdma: remove NVMET_RDMA_REQ_INVALIDATE_RKEY flagGuixin Liu1-5/+3
We can simply use invalidate_rkey to check instead of adding a flag. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-21nvme: remove redundant BUILD_BUG_ON checkGuixin Liu1-3/+0
Remove redundant BUILD_BUG_ON check of struct nvme_dsm_range, it's already checked in nvme_init_ctrl(). Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-18nvme/tcp: Add wq_unbound modparam for nvme_tcp_wqLi Feng1-3/+18
The default nvme_tcp_wq will use all CPUs to process tasks. Sometimes it is necessary to set CPU affinity to improve performance. A new module parameter wq_unbound is added here. If set to true, users can configure cpu affinity through /sys/devices/virtual/workqueue/nvme_tcp_wq/cpumask. Signed-off-by: Li Feng <fengli@smartx.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-18nvme-tcp: Export the nvme_tcp_wq to sysfsLi Feng1-1/+1
Make the workqueue userspace visible for easy viewing and configuration. Signed-off-by: Li Feng <fengli@smartx.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-18drivers/nvme: Add quirks for device 126f:2262Jiawei Fu (iBug)1-0/+3
This commit adds NVME_QUIRK_NO_DEEPEST_PS and NVME_QUIRK_BOGUS_NID for device [126f:2262], which appears to be a generic VID:PID pair used for many SSDs based on the Silicon Motion SM2262/SM2262EN controller. Two of my SSDs with this VID:PID pair exhibit the same behavior: * They frequently have trouble exiting the deepest power state (5), resulting in the entire disk unresponsive. Verified by setting nvme_core.default_ps_max_latency_us=10000 and observing them behaving normally. * They produce all-zero nguid and eui64 with `nvme id-ns` command. The offending products are: * HP SSD EX950 1TB * HIKVISION C2000Pro 2TB Signed-off-by: Jiawei Fu <i@ibugone.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-14nvme: parse format command's lbafu when tracingGuixin Liu1-1/+4
Add the parse of format command's lbafu to calculate lbaf. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-14nvme: add tracing of reservation commandsGuixin Liu1-0/+62
Add detailed parsing of reservation commands to make the trace log more consistent and human-readable. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-14nvme: parse zns command's zsa and zrasf to stringGuixin Liu1-3/+35
Parse zone mgmt send commands's zsa and receive command's zrasf to string to make the trace log more human-readable. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-14nvme: use nvme_disk_is_ns_head helperGuixin Liu2-4/+2
Use nvme_disk_is_ns_head helper instead of check fops directly, and also drop CONFIG_NVME_MULTIPATH check. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-14nvme: fix reconnection fail due to reserved tag allocationChunguang Xu2-9/+4
We found a issue on production environment while using NVMe over RDMA, admin_q reconnect failed forever while remote target and network is ok. After dig into it, we found it may caused by a ABBA deadlock due to tag allocation. In my case, the tag was hold by a keep alive request waiting inside admin_q, as we quiesced admin_q while reset ctrl, so the request maked as idle and will not process before reset success. As fabric_q shares tagset with admin_q, while reconnect remote target, we need a tag for connect command, but the only one reserved tag was held by keep alive command which waiting inside admin_q. As a result, we failed to reconnect admin_q forever. In order to fix this issue, I think we should keep two reserved tags for admin queue. Fixes: ed01fee283a0 ("nvme-fabrics: only reserve a single tag") Signed-off-by: Chunguang Xu <chunguang.xu@shopee.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-12Merge tag 'net-next-6.9' of ↵Linus Torvalds2-9/+2
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Large effort by Eric to lower rtnl_lock pressure and remove locks: - Make commonly used parts of rtnetlink (address, route dumps etc) lockless, protected by RCU instead of rtnl_lock. - Add a netns exit callback which already holds rtnl_lock, allowing netns exit to take rtnl_lock once in the core instead of once for each driver / callback. - Remove locks / serialization in the socket diag interface. - Remove 6 calls to synchronize_rcu() while holding rtnl_lock. - Remove the dev_base_lock, depend on RCU where necessary. - Support busy polling on a per-epoll context basis. Poll length and budget parameters can be set independently of system defaults. - Introduce struct net_hotdata, to make sure read-mostly global config variables fit in as few cache lines as possible. - Add optional per-nexthop statistics to ease monitoring / debug of ECMP imbalance problems. - Support TCP_NOTSENT_LOWAT in MPTCP. - Ensure that IPv6 temporary addresses' preferred lifetimes are long enough, compared to other configured lifetimes, and at least 2 sec. - Support forwarding of ICMP Error messages in IPSec, per RFC 4301. - Add support for the independent control state machine for bonding per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled control state machine. - Add "network ID" to MCTP socket APIs to support hosts with multiple disjoint MCTP networks. - Re-use the mono_delivery_time skbuff bit for packets which user space wants to be sent at a specified time. Maintain the timing information while traversing veth links, bridge etc. - Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets. - Simplify many places iterating over netdevs by using an xarray instead of a hash table walk (hash table remains in place, for use on fastpaths). - Speed up scanning for expired routes by keeping a dedicated list. - Speed up "generic" XDP by trying harder to avoid large allocations. - Support attaching arbitrary metadata to netconsole messages. Things we sprinkled into general kernel code: - Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce VM_SPARSE kind and vm_area_[un]map_pages (used by bpf_arena). - Rework selftest harness to enable the use of the full range of ksft exit code (pass, fail, skip, xfail, xpass). Netfilter: - Allow userspace to define a table that is exclusively owned by a daemon (via netlink socket aliveness) without auto-removing this table when the userspace program exits. Such table gets marked as orphaned and a restarting management daemon can re-attach/regain ownership. - Speed up element insertions to nftables' concatenated-ranges set type. Compact a few related data structures. BPF: - Add BPF token support for delegating a subset of BPF subsystem functionality from privileged system-wide daemons such as systemd through special mount options for userns-bound BPF fs to a trusted & unprivileged application. - Introduce bpf_arena which is sparse shared memory region between BPF program and user space where structures inside the arena can have pointers to other areas of the arena, and pointers work seamlessly for both user-space programs and BPF programs. - Introduce may_goto instruction that is a contract between the verifier and the program. The verifier allows the program to loop assuming it's behaving well, but reserves the right to terminate it. - Extend the BPF verifier to enable static subprog calls in spin lock critical sections. - Support registration of struct_ops types from modules which helps projects like fuse-bpf that seeks to implement a new struct_ops type. - Add support for retrieval of cookies for perf/kprobe multi links. - Support arbitrary TCP SYN cookie generation / validation in the TC layer with BPF to allow creating SYN flood handling in BPF firewalls. - Add code generation to inline the bpf_kptr_xchg() helper which improves performance when stashing/popping the allocated BPF objects. Wireless: - Add SPP (signaling and payload protected) AMSDU support. - Support wider bandwidth OFDMA, as required for EHT operation. Driver API: - Major overhaul of the Energy Efficient Ethernet internals to support new link modes (2.5GE, 5GE), share more code between drivers (especially those using phylib), and encourage more uniform behavior. Convert and clean up drivers. - Define an API for querying per netdev queue statistics from drivers. - IPSec: account in global stats for fully offloaded sessions. - Create a concept of Ethernet PHY Packages at the Device Tree level, to allow parameterizing the existing PHY package code. - Enable Rx hashing (RSS) on GTP protocol fields. Misc: - Improvements and refactoring all over networking selftests. - Create uniform module aliases for TC classifiers, actions, and packet schedulers to simplify creating modprobe policies. - Address all missing MODULE_DESCRIPTION() warnings in networking. - Extend the Netlink descriptions in YAML to cover message encapsulation or "Netlink polymorphism", where interpretation of nested attributes depends on link type, classifier type or some other "class type". Drivers: - Ethernet high-speed NICs: - Add a new driver for Marvell's Octeon PCI Endpoint NIC VF. - Intel (100G, ice, idpf): - support E825-C devices - nVidia/Mellanox: - support devices with one port and multiple PCIe links - Broadcom (bnxt): - support n-tuple filters - support configuring the RSS key - Wangxun (ngbe/txgbe): - implement irq_domain for TXGBE's sub-interrupts - Pensando/AMD: - support XDP - optimize queue submission and wakeup handling (+17% bps) - optimize struct layout, saving 28% of memory on queues - Ethernet NICs embedded and virtual: - Google cloud vNIC: - refactor driver to perform memory allocations for new queue config before stopping and freeing the old queue memory - Synopsys (stmmac): - obey queueMaxSDU and implement counters required by 802.1Qbv - Renesas (ravb): - support packet checksum offload - suspend to RAM and runtime PM support - Ethernet switches: - nVidia/Mellanox: - support for nexthop group statistics - Microchip: - ksz8: implement PHY loopback - add support for KSZ8567, a 7-port 10/100Mbps switch - PTP: - New driver for RENESAS FemtoClock3 Wireless clock generator. - Support OCP PTP cards designed and built by Adva. - CAN: - Support recvmsg() flags for own, local and remote traffic on CAN BCM sockets. - Support for esd GmbH PCIe/402 CAN device family. - m_can: - Rx/Tx submission coalescing - wake on frame Rx - WiFi: - Intel (iwlwifi): - enable signaling and payload protected A-MSDUs - support wider-bandwidth OFDMA - support for new devices - bump FW API to 89 for AX devices; 90 for BZ/SC devices - MediaTek (mt76): - mt7915: newer ADIE version support - mt7925: radio temperature sensor support - Qualcomm (ath11k): - support 6 GHz station power modes: Low Power Indoor (LPI), Standard Power) SP and Very Low Power (VLP) - QCA6390 & WCN6855: support 2 concurrent station interfaces - QCA2066 support - Qualcomm (ath12k): - refactoring in preparation for Multi-Link Operation (MLO) support - 1024 Block Ack window size support - firmware-2.bin support - support having multiple identical PCI devices (firmware needs to have ATH12K_FW_FEATURE_MULTI_QRTR_ID) - QCN9274: support split-PHY devices - WCN7850: enable Power Save Mode in station mode - WCN7850: P2P support - RealTek: - rtw88: support for more rtw8811cu and rtw8821cu devices - rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL - rtlwifi: speed up USB firmware initialization - rtwl8xxxu: - RTL8188F: concurrent interface support - Channel Switch Announcement (CSA) support in AP mode - Broadcom (brcmfmac): - per-vendor feature support - per-vendor SAE password setup - DMI nvram filename quirk for ACEPC W5 Pro" * tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2255 commits) nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y nexthop: Fix out-of-bounds access during attribute validation nexthop: Only parse NHA_OP_FLAGS for dump messages that require it nexthop: Only parse NHA_OP_FLAGS for get messages that require it bpf: move sleepable flag from bpf_prog_aux to bpf_prog bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes() selftests/bpf: Add kprobe multi triggering benchmarks ptp: Move from simple ida to xarray vxlan: Remove generic .ndo_get_stats64 vxlan: Do not alloc tstats manually devlink: Add comments to use netlink gen tool nfp: flower: handle acti_netdevs allocation failure net/packet: Add getsockopt support for PACKET_COPY_THRESH net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSID selftests/bpf: Add bpf_arena_htab test. selftests/bpf: Add bpf_arena_list test. selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages bpf: Add helper macro bpf_addr_space_cast() libbpf: Recognize __arena global variables. bpftool: Recognize arena map type ...
2024-03-11Merge tag 'for-6.9/block-20240310' of git://git.kernel.dk/linuxLinus Torvalds18-290/+361
Pull block updates from Jens Axboe: - MD pull requests via Song: - Cleanup redundant checks (Yu Kuai) - Remove deprecated headers (Marc Zyngier, Song Liu) - Concurrency fixes (Li Lingfeng) - Memory leak fix (Li Nan) - Refactor raid1 read_balance (Yu Kuai, Paul Luse) - Clean up and fix for md_ioctl (Li Nan) - Other small fixes (Gui-Dong Han, Heming Zhao) - MD atomic limits (Christoph) - NVMe pull request via Keith: - RDMA target enhancements (Max) - Fabrics fixes (Max, Guixin, Hannes) - Atomic queue_limits usage (Christoph) - Const use for class_register (Ricardo) - Identification error handling fixes (Shin'ichiro, Keith) - Improvement and cleanup for cached request handling (Christoph) - Moving towards atomic queue limits. Core changes and driver bits so far (Christoph) - Fix UAF issues in aoeblk (Chun-Yi) - Zoned fix and cleanups (Damien) - s390 dasd cleanups and fixes (Jan, Miroslav) - Block issue timestamp caching (me) - noio scope guarding for zoned IO (Johannes) - block/nvme PI improvements (Kanchan) - Ability to terminate long running discard loop (Keith) - bdev revalidation fix (Li) - Get rid of old nr_queues hack for kdump kernels (Ming) - Support for async deletion of ublk (Ming) - Improve IRQ bio recycling (Pavel) - Factor in CPU capacity for remote vs local completion (Qais) - Add shared_tags configfs entry for null_blk (Shin'ichiro - Fix for a regression in page refcounts introduced by the folio unification (Tony) - Misc fixes and cleanups (Arnd, Colin, John, Kunwu, Li, Navid, Ricardo, Roman, Tang, Uwe) * tag 'for-6.9/block-20240310' of git://git.kernel.dk/linux: (221 commits) block: partitions: only define function mac_fix_string for CONFIG_PPC_PMAC block/swim: Convert to platform remove callback returning void cdrom: gdrom: Convert to platform remove callback returning void block: remove disk_stack_limits md: remove mddev->queue md: don't initialize queue limits md/raid10: use the atomic queue limit update APIs md/raid5: use the atomic queue limit update APIs md/raid1: use the atomic queue limit update APIs md/raid0: use the atomic queue limit update APIs md: add queue limit helpers md: add a mddev_is_dm helper md: add a mddev_add_trace_msg helper md: add a mddev_trace_remap helper bcache: move calculation of stripe_size and io_opt into bcache_device_init virtio_blk: Do not use disk_set_max_open/active_zones() aoe: fix the potential use-after-free problem in aoecmd_cfg_pkts block: move capacity validation to blkpg_do_ioctl() block: prevent division by zero in blk_rq_stat_sum() drbd: atomically update queue limits in drbd_reconsider_queue_parameters ...
2024-03-11Merge tag 'vfs-6.9.super' of ↵Linus Torvalds2-9/+9
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull block handle updates from Christian Brauner: "Last cycle we changed opening of block devices, and opening a block device would return a bdev_handle. This allowed us to implement support for restricting and forbidding writes to mounted block devices. It was accompanied by converting and adding helpers to operate on bdev_handles instead of plain block devices. That was already a good step forward but ultimately it isn't necessary to have special purpose helpers for opening block devices internally that return a bdev_handle. Fundamentally, opening a block device internally should just be equivalent to opening files. So now all internal opens of block devices return files just as a userspace open would. Instead of introducing a separate indirection into bdev_open_by_*() via struct bdev_handle bdev_file_open_by_*() is made to just return a struct file. Opening and closing a block device just becomes equivalent to opening and closing a file. This all works well because internally we already have a pseudo fs for block devices and so opening block devices is simple. There's a few places where we needed to be careful such as during boot when the kernel is supposed to mount the rootfs directly without init doing it. Here we need to take care to ensure that we flush out any asynchronous file close. That's what we already do for opening, unpacking, and closing the initramfs. So nothing new here. The equivalence of opening and closing block devices to regular files is a win in and of itself. But it also has various other advantages. We can remove struct bdev_handle completely. Various low-level helpers are now private to the block layer. Other helpers were simply removable completely. A follow-up series that is already reviewed build on this and makes it possible to remove bdev->bd_inode and allows various clean ups of the buffer head code as well. All places where we stashed a bdev_handle now just stash a file and use simple accessors to get to the actual block device which was already the case for bdev_handle" * tag 'vfs-6.9.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (35 commits) block: remove bdev_handle completely block: don't rely on BLK_OPEN_RESTRICT_WRITES when yielding write access bdev: remove bdev pointer from struct bdev_handle bdev: make struct bdev_handle private to the block layer bdev: make bdev_{release, open_by_dev}() private to block layer bdev: remove bdev_open_by_path() reiserfs: port block device access to file ocfs2: port block device access to file nfs: port block device access to files jfs: port block device access to file f2fs: port block device access to files ext4: port block device access to file erofs: port device access to file btrfs: port device access to file bcachefs: port block device access to file target: port block device access to file s390: port block device access to file nvme: port block device access to file block2mtd: port device access to files bcache: port block device access to files ...
2024-03-08nvmet: add tracing of zns commandsGuixin Liu1-0/+66
Add nvme_cmd_zone_append, nvme_cmd_zone_mgmt_send and nvme_cmd_zone_mgmt_recv parse to nvme target tracing. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-08nvmet: add tracing of authentication commandsGuixin Liu1-0/+32
Add nvme_fabrics_type_auth_send and nvme_fabrics_type_auth_receive to the nvme target's tracing facility. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-08nvme-apple: Convert to platform remove callback returning voidUwe Kleine-König1-4/+2
The .remove() callback for a platform driver returns an int which makes many driver authors wrongly assume it's possible to do error handling by returning an error code. However the value returned is ignored (apart from emitting a warning) and this typically results in resource leaks. To improve here there is a quest to make the remove callback return void. In the first step of this quest all drivers are converted to .remove_new(), which already returns void. Eventually after all drivers are converted, .remove_new() will be renamed to .remove(). Trivially convert this driver from always returning zero in the remove callback to the void returning variant. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-08nvmet-tcp: do not continue for invalid icreqHannes Reinecke1-0/+1
When the length check for an icreq sqe fails we should not continue processing but rather return immediately as all other contents of that sqe cannot be relied on. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-08nvme: change shutdown timeout setting messageKeith Busch1-1/+1
User visible messages containing the word "timeout" can be alarming. This one from nvme is just reporting a potentially informative device configuration, and everything is working as designed. Change the text to report the less concerning "D3 entry latency", which is where this value comes from anyway. Reported-by: Len Brown <lenb@kernel.org> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-06nvme: clear caller pointer on identify failureKeith Busch1-1/+4
The memory allocated for the identification is freed on failure. Set it to NULL so the caller doesn't have a pointer to that freed address. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-06nvme: host: fix double-free of struct nvme_id_ns in ns_update_nuse()Shin'ichiro Kawasaki1-5/+2
When nvme_identify_ns() fails, it frees the pointer to the struct nvme_id_ns before it returns. However, ns_update_nuse() calls kfree() for the pointer even when nvme_identify_ns() fails. This results in KASAN double-free, which was observed with blktests nvme/045 with proposed patches [1] on the kernel v6.8-rc7. Fix the double-free by skipping kfree() when nvme_identify_ns() fails. Link: https://lore.kernel.org/linux-block/20240304161303.19681-1-dwagner@suse.de/ [1] Fixes: a1a825ab6a60 ("nvme: add csi, ms and nuse to sysfs") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-05nvme: fcloop: make fcloop_class constantRicardo B. Marliere1-8/+9
Since commit 43a7206b0963 ("driver core: class: make class_register() take a const *"), the driver core allows for struct class to be in read-only memory, so move the fcloop_class structure to be declared at build time placing it into read-only memory, instead of having to be dynamically allocated at boot time. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-05nvme: fabrics: make nvmf_class constantRicardo B. Marliere1-9/+11
Since commit 43a7206b0963 ("driver core: class: make class_register() take a const *"), the driver core allows for struct class to be in read-only memory, so move the nvmf_class structure to be declared at build time placing it into read-only memory, instead of having to be dynamically allocated at boot time. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-05nvme: core: constify struct class usageRicardo B. Marliere1-25/+28
Since commit 43a7206b0963 ("driver core: class: make class_register() take a const *"), the driver core allows for struct class to be in read-only memory, so move the structures nvme_class, nvme_subsys_class and nvme_ns_chr_class to be declared at build time placing them into read-only memory, instead of having to be dynamically allocated at boot time. Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Ricardo B. Marliere <ricardo@marliere.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-05net: introduce page_frag_cache_drain()Yunsheng Lin2-9/+2
When draining a page_frag_cache, most user are doing the similar steps, so introduce an API to avoid code duplication. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-04nvme-fabrics: typo in nvmf_parse_key()Hannes Reinecke1-1/+1
Of course we should use the key if there is no error ... Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme-multipath: use atomic queue limits API for stacking limitsChristoph Hellwig1-3/+6
Switch to the queue_limits_* helpers to stack the bdev limits, which also includes updating the readahead settings. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme-multipath: pass queue_limits to blk_alloc_diskChristoph Hellwig1-6/+7
The multipath disk starts out with the stacking default limits. The one interesting part here is that blk_set_stacking_limits sets the max_zone_append_sectorts to UINT_MAX, which fails the validation for non-zoned devices. With the old one call per limit scheme this was fine because no one verified this weird mismatch and it was fixed by blk_stack_limits a little later before I/O could be issued. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: use the atomic queue limits update APIChristoph Hellwig3-79/+80
Changes the callchains that update queue_limits to build an on-stack queue_limits and update it atomically. Note that for now only the admin queue actually passes it to the queue allocation function. Doing the same for the gendisks used for the namespaces will require a little more work. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: cleanup nvme_configure_metadataChristoph Hellwig1-28/+19
Fold nvme_init_ms into nvme_configure_metadata after splitting up a little helper to deal with the extended LBA formats. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: don't query identify data in configure_metadataChristoph Hellwig1-30/+19
Move reading the Identify Namespace Data Structure, NVM Command Set out of configure_metadata into the caller. This allows doing the identify call outside the frozen I/O queues, and prepares for using data from the Identify data structure for other purposes. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: split out a nvme_identify_ns_nvm helperChristoph Hellwig1-12/+26
Split the logic to query the Identify Namespace Data Structure, NVM Command Set into a separate helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: move common logic into nvme_update_ns_infoChristoph Hellwig1-42/+42
nvme_update_ns_info_generic and nvme_update_ns_info_block share a fair amount of logic related to not fully supported namespace formats and updating the multipath information. Move this logic into the common caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: move setting the write cache flags out of nvme_set_queue_limitsChristoph Hellwig1-3/+2
nvme_set_queue_limits is used on the admin queue and all gendisks including hidden ones that don't support block I/O. The write cache setting on the other hand only makes sense for block I/O. Move the blk_queue_write_cache call to nvme_update_ns_info_block instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: move a few things out of nvme_update_disk_infoChristoph Hellwig1-20/+26
Move setting up the integrity profile and setting the disk capacity out of nvme_update_disk_info to get nvme_update_disk_info into a shape where it just sets queue_limits eventually. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: don't use nvme_update_disk_info for the multipath diskChristoph Hellwig1-1/+2
Currently nvme_update_ns_info_block calls nvme_update_disk_info both for the namespace attached disk, and the multipath one (if it exists). This is very different from how other stacking drivers work, and leads to a lot of complexity. Switch to setting the disk capacity and initializing the integrity profile, and let blk_stack_limits which already is called just below deal with updating the other limits. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: move blk_integrity_unregister into nvme_init_integrityChristoph Hellwig1-2/+2
Move uneregistering the existing integrity profile into the helper dealing with all the other integrity / metadata setup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: cleanup the nvme_init_integrity calling conventionsChristoph Hellwig1-14/+15
Handle the no metadata support case in nvme_init_integrity as well to simplify the calling convention and prepare for future changes in the area. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: move max_integrity_segments handling out of nvme_init_integrityChristoph Hellwig1-7/+4
max_integrity_segments is just a hardware limit and doesn't need to be in nvme_init_integrity with the PI setup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: remove nvme_revalidate_zonesChristoph Hellwig3-12/+3
Handle setting the zone size / chunk_sectors and max_append_sectors limits together with the other ZNS limits, and just open code the call to blk_revalidate_zones in the current place. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: move NVME_QUIRK_DEALLOCATE_ZEROES out of nvme_config_discardChristoph Hellwig1-5/+6
Move the handling of the NVME_QUIRK_DEALLOCATE_ZEROES quirk out of nvme_config_discard so that it is combined with the normal write_zeroes limit handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-04nvme: set max_hw_sectors unconditionallyChristoph Hellwig1-8/+8
All transports set a max_hw_sectors value in the nvme_ctrl, so make the code using it unconditional and clean it up using a little helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02nvme-fabrics: check max outstanding commandsGuixin Liu1-0/+5
Maxcmd is mandatory for fabrics, check it early to identify the root cause instead of waiting for it to propagate to "sqsize" and "allocing queue". By the way, change nvme_check_ctrl_fabric_info() to nvmf_validate_identify_ctrl(). Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02nvmet-rdma: set max_queue_size for RDMA transportMax Gurtovoy1-0/+8
A new port configuration was added to set max_queue_size. Clamp user configuration to RDMA transport limits. Increase the maximal queue size of RDMA controllers from 128 to 256 (the default size stays 128 same as before). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Israel Rukshin <israelr@nvidia.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02nvmet: introduce new max queue size configuration entryMax Gurtovoy3-3/+46
Using this port configuration, one will be able to set the maximal queue size to be used for any controller that will be associated to the configured port. The default value stayed 1024 but each transport will be able to set the its own values before enabling the port. Introduce lower limit of 16 for minimal queue depth (same as we use in the host fabrics drivers). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Guixin Liu <kanie@linux.alibaba.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02nvme-rdma: clamp queue size according to ctrl capMax Gurtovoy1-4/+10
If a controller is configured with metadata support, clamp the maximal queue size to be 128 since there are more resources that are needed for metadata operations. Otherwise, clamp it to 256. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02nvme-rdma: introduce NVME_RDMA_MAX_METADATA_QUEUE_SIZE definitionMax Gurtovoy1-0/+2
This definition will be used by controllers that are configured with metadata support. For now, both regular and metadata controllers have the same maximal queue size but later commit will increase the maximal queue size for regular RDMA controllers to 256. We'll keep the maximal queue size for metadata controllers to be 128 since there are more resources that are needed for metadata operations and 128 is the optimal size found for metadata controllers base on testing. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Israel Rukshin <israelr@nvidia.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02nvmet: set ctrl pi_support cap before initializing cap regMax Gurtovoy2-2/+1
This is a preparation for setting the maximal queue size of a controller that supports PI. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02nvmet: set maxcmd to be per controllerMax Gurtovoy4-4/+4
This is a preparation for having a dynamic configuration of max queue size for a controller. Make sure that the maxcmd field stays the same as the MQES (+1) value as we do today. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-03-02nvmet: compare mqes and sqsize only for IO SQMax Gurtovoy1-1/+2
According to the NVMe Spec: " MQES: This field indicates the maximum individual queue size that the controller supports. For NVMe over PCIe implementations, this value applies to the I/O Submission Queues and I/O Completion Queues that the host creates. For NVMe over Fabrics implementations, this value applies to only the I/O Submission Queues that the host creates. " Align the target code to compare mqes and sqsize as mentioned in the NVMe Spec. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-25nvme: port block device access to fileChristian Brauner2-9/+9
Link: https://lore.kernel.org/r/20240123-vfs-bdev-file-v2-15-adbd023e19cc@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-02-19block: pass a queue_limits argument to blk_alloc_diskChristoph Hellwig1-3/+3
Pass a queue_limits to blk_alloc_disk and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Also change blk_alloc_disk to return an ERR_PTR instead of just NULL which can't distinguish errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20240215071055.2201424-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-13nvmet: remove superfluous initializationChaitanya Kulkarni1-2/+2
Remove superfluous initialization of status variable in nvmet_execute_admin_connect() and nvmet_execute_io_connect(), since it will get overwritten by nvmet_copy_from_sgl(). Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-13nvme: implement support for relaxed effectsKeith Busch1-0/+4
NVM Express TP4167 provides a way for controllers to report a relaxed execution constraint. Specifically, it notifies of exclusivity for IO vs. admin commands instead of grouping these together. If set, then we don't need to freeze IO in order to execute that admin command. The freezing distrupts IO processes, so it's nice to avoid that if the controller tells us it's not necessary. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-13nvme-fabrics: fix I/O connect error handlingChaitanya Kulkarni1-0/+1
In nvmf_connect_io_queue(), if connect I/O command fails, we log the error and continue for authentication. This overrides error captured from __nvme_submit_sync_cmd(), causing wrong return value. Add goto out_free_data after logging connect error to fix the issue. Fixes: f50fff73d620c ("nvme: implement In-Band authentication") Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-13block: pass a queue_limits argument to blk_mq_alloc_diskChristoph Hellwig1-1/+1
Pass a queue_limits to blk_mq_alloc_disk and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-13block: pass a queue_limits argument to blk_mq_init_queueChristoph Hellwig2-4/+4
Pass a queue_limits to blk_mq_init_queue and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later. Also rename the function to blk_mq_alloc_queue as that is a much better name for a function that allocates a queue and always pass the queuedata argument instead of having a separate version for the extra argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240213073425.1621680-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-12nvme: allow integrity when PI is not in first bytesKanchan Joshi2-1/+8
NVM command set 1.0 (or later) mandates PI to be in the last bytes of metadata. But this was not supported in the block-layer, and driver registered a nop profile. Since block-integrity can now handle flexible PI offset, change the driver to support this configuration. Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240201130126.211402-4-joshi.k@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-12block: remove gfp_flags from blkdev_zone_mgmtJohannes Thumshirn1-3/+2
Now that all callers pass in GFP_KERNEL to blkdev_zone_mgmt() and use memalloc_no{io,fs}_{save,restore}() to define the allocation scope, we can drop the gfp_mask parameter from blkdev_zone_mgmt() as well as blkdev_zone_reset_all() and blkdev_zone_reset_all_emulated(). Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20240128-zonefs_nofs-v3-5-ae3b7c8def61@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-07nvme: use ns->head->pi_size instead of t10_pi_tuple structure sizeFrancis Pravin1-1/+1
Currently kernel supports 8 byte and 16 byte protection information. So, use ns->head->pi_size instead of sizeof(struct t10_pi_tuple). Signed-off-by: Francis Pravin <francis.p@samsung.com> Signed-off-by: Sathyavathi M <sathya.m@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-07nvme-core: fix comment to reflect right functionsChaitanya Kulkarni1-2/+2
The functions and the attribute listed in the comment doesn't exists in the code, (ns->logging_enabled, nvme_passthru_err_log_enabled_store() and nvme_passthru_err_log_enabled_show()) Update the comment with right function names and a comment ns->head->passthru_err_log_enabled, nvme_io_passthru_err_log_enabled_store() and nvme_io_passthru_err_log_enabled_show(). Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Alan Adamson <alan.adamson@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-07nvme: move passthrough logging attribute to headKeith Busch3-18/+17
The namespace does not have attributes, but the head does. Move the new logging attribute to that structure instead of dereferencing the wrong type. And while we're here, fix the reverse-tree coding style. Fixes: 9f079dda14339e ("nvme: allow passthru cmd error logging") Reported-by: Tasmiya Nalatwad <tasmiya@linux.vnet.ibm.com> Tested-by: Tasmiya Nalatwad <tasmiya@linux.vnet.ibm.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Alan Adamson <alan.adamson@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvme-host: fix the updating of the firmware versionMaurizio Lombardi1-2/+5
The original code didn't update the firmware version if the "next slot" of the AFI register isn't zero or if the "current slot" field is zero; in those cases it assumed that a reset was needed. However, the NVMe specification doesn't exclude the possibility that the "next slot" value is equal to the "current slot" value, meaning that the same firmware slot will be activated after performing a controller level reset; in this case a reset is clearly not necessary and we can safely update the firmware version. Modify the code so the kernel will report that a Controller Level Reset is needed only in the following cases: 1) If the "current slot" field is zero. This is invalid and means that something is wrong, a reset is needed. or 2) if the "next slot" field isn't zero AND it's not equal to the "current slot" value. This means that at the next reset a different firmware slot will be activated. Fixes: 983a338b96c8 ("nvme: update firmware version after commit") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvme: allow passthru cmd error loggingAlan Adamson3-7/+113
Commit d7ac8dca938c ("nvme: quiet user passthrough command errors") disabled error logging for user passthrough commands. This commit adds the ability to opt-in to passthrough admin error logging. IO commands initiated as passthrough will always be logged. The logging output for passthrough commands (Admin and IO) has been changed to include CDWXX fields. nvme0n1: Read(0x2), LBA Out of Range (sct 0x0 / sc 0x80) DNR cdw10=0x0 cdw11=0x1 cdw12=0x70000 cdw13=0x0 cdw14=0x0 cdw15=0x0 Add a helper function nvme_log_err_passthru() which allows us to log error for passthru commands by decoding cdw10-cdw15 values of nvme command. Add a new sysfs attr passthru_err_log_enabled that allows user to conditionally enable passthrough command logging for either passthrough Admin commands sent to the controller or passthrough IO commands sent to a namespace. By default, passthrough error logging is disabled. To enable passthrough admin error logging: echo 1 > /sys/class/nvme/nvme0/passthru_err_log_enabled To disable passthrough admin error logging: echo 0 > /sys/class/nvme/nvme0/passthru_err_log_enabled To enable passthrough io error logging: echo 1 > /sys/class/nvme/nvme0/nvme0n1/passthru_err_log_enabled To disable passthrough io error logging: echo 0 > /sys/class/nvme/nvme0/nvme0n1/passthru_err_log_enabled Signed-off-by: Alan Adamson <alan.adamson@oracle.com> Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvme-fc: show hostnqn when connecting to fc targetNitin U. Yewale1-2/+2
Log hostnqn when connecting to nvme target. As hostnqn could be changed, logging this information in syslog at appropriate time may help in troubleshooting. Signed-off-by: Nitin U. Yewale <nyewale@redhat.com> Reviewed-by: John Meneghini <jmeneghi@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvme-rdma: show hostnqn when connecting to rdma targetNitin U. Yewale1-2/+2
Log hostnqn when connecting to nvme target. As hostnqn could be changed, logging this information in syslog at appropriate time may help in troubleshooting. Signed-off-by: Nitin U. Yewale <nyewale@redhat.com> Reviewed-by: John Meneghini <jmeneghi@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvme-tcp: show hostnqn when connecting to tcp targetNitin U. Yewale1-2/+2
Log hostnqn when connecting to nvme target. As hostnqn could be changed, logging this information in syslog at appropriate time may help in troubleshooting. Signed-off-by: Nitin U. Yewale <nyewale@redhat.com> Reviewed-by: John Meneghini <jmeneghi@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvmet-fc: use RCU list iterator for assoc_listDaniel Wagner1-12/+22
The assoc_list is a RCU protected list, thus use the RCU flavor of list functions. Let's use this opportunity and refactor this code and move the lookup into a helper and give it a descriptive name. Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvmet-fc: take ref count on tgtport before delete assocDaniel Wagner1-8/+23
We have to ensure that the tgtport is not going away before be have remove all the associations. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvmet-fc: avoid deadlock on delete association pathDaniel Wagner1-3/+13
When deleting an association the shutdown path is deadlocking because we try to flush the nvmet_wq nested. Avoid this by deadlock by deferring the put work into its own work item. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvmet-fc: abort command when there is no bindingDaniel Wagner1-2/+6
When the target port has not active port binding, there is no point in trying to process the command as it has to fail anyway. Instead adding checks to all commands abort the command early. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvmet-fc: do not tack refs on tgtports from assocDaniel Wagner1-7/+1
The association life time is tied to the life time of the target port. That means we should not take extra a refcount when creating a association. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvmet-fc: remove null hostport pointer checkDaniel Wagner1-4/+2
An association has always a valid hostport pointer. Remove useless null pointer check. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvmet-fc: hold reference on hostport matchDaniel Wagner1-2/+0
The hostport data structure is shared between the association, this why we keep track of the users via a refcount. So we should not decrement the refcount on a match and free the hostport several times. Reported by KASAN. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvmet-fc: free queue and assoc directlyDaniel Wagner1-4/+2
Neither struct nvmet_fc_tgt_queue nor struct nvmet_fc_tgt_assoc are data structure which are used in a RCU context. So there is no reason to delay the free operation. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvmet-fc: defer cleanup using RCU properlyDaniel Wagner1-46/+37
When the target executes a disconnect and the host triggers a reconnect immediately, the reconnect command still finds an existing association. The reconnect crashes later on because nvmet_fc_delete_target_assoc blindly removes resources while the reconnect code wants to use it. To address this, nvmet_fc_find_target_assoc should not be able to lookup an association which is being removed. The association list is already under RCU lifetime management, so let's properly use it and remove the association from the list and wait for a grace period before cleaning up all. This means we also can drop the RCU management on the queues, because this is now handled via the association itself. A second step split the execution context so that the initial disconnect command can complete without running the reconnect code in the same context. As usual, this is done by deferring the ->done to a workqueue. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvmet-fc: release reference on target portDaniel Wagner1-1/+2
In case we return early out of __nvmet_fc_finish_ls_req() we still have to release the reference on the target port. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvmet-fcloop: swap the list_add_tail argumentsDaniel Wagner1-3/+3
The first argument of list_add_tail function is the new element which should be added to the list which is the second argument. Swap the arguments to allow processing more than one element at a time. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-02-01nvme-fc: do not wait in vain when unloading moduleDaniel Wagner1-41/+6
The module exit path has race between deleting all controllers and freeing 'left over IDs'. To prevent double free a synchronization between nvme_delete_ctrl and ida_destroy has been added by the initial commit. There is some logic around trying to prevent from hanging forever in wait_for_completion, though it does not handling all cases. E.g. blktests is able to reproduce the situation where the module unload hangs forever. If we completely rely on the cleanup code executed from the nvme_delete_ctrl path, all IDs will be freed eventually. This makes calling ida_destroy unnecessary. We only have to ensure that all nvme_delete_ctrl code has been executed before we leave nvme_fc_exit_module. This is done by flushing the nvme_delete_wq workqueue. While at it, remove the unused nvme_fc_wq workqueue too. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31nvme-fc: log human-readable opcode on timeoutCaleb Sander1-3/+5
The fc transport logs the opcode and fctype on command timeout. This is sufficient information to identify the command issued, but not very human-readable. Use the nvme_fabrics_opcode_str() helper to also log the name of the command, as rdma and tcp already do. Signed-off-by: Caleb Sander <csander@purestorage.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31nvme: split out fabrics version of nvme_opcode_str()Caleb Sander4-11/+17
nvme_opcode_str() currently supports admin, IO, and fabrics commands. However, fabrics commands aren't allowed for the pci transport. Currently the pci caller passes 0 as the fctype, which means any fabrics command would be displayed as "Property Set". Move fabrics command support into a function nvme_fabrics_opcode_str() and remove the fctype argument to nvme_opcode_str(). This way, a fabrics command will display as "Unknown" for pci. Convert the rdma and tcp transports to use nvme_fabrics_opcode_str(). Signed-off-by: Caleb Sander <csander@purestorage.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31nvme: remove redundant status maskCaleb Sander1-1/+1
In nvme_get_error_status_str(), the status code is already masked with 0x7ff at the beginning of the function. Don't bother masking it again when indexing nvme_statuses. Signed-off-by: Caleb Sander <csander@purestorage.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31nvme: return string as char *, not unsigned char *Caleb Sander2-13/+13
The functions in drivers/nvme/host/constants.c returning human-readable status and opcode strings currently use type "const unsigned char *". Typically string constants use type "const char *", so remove "unsigned" from the return types. This is a purely cosmetic change to clarify that the functions return text strings instead of an array of bytes, for example. Signed-off-by: Caleb Sander <csander@purestorage.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31nvme-common: add module descriptionChaitanya Kulkarni2-0/+2
Add MODULE_DESCRIPTION() in order to remove warnings & get clean build:- WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/common/nvme-auth.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/common/nvme-keyring.o Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31nvme: enable retries for authentication commandsHannes Reinecke3-1/+5
Authentication commands might trigger a lengthy computation on the controller or even a callout to an external entity. In these cases the controller might return a status without the DNR bit set, indicating that the command should be retried. This patch enables retries for authentication commands by setting NVME_SUBMIT_RETRY for __nvme_submit_sync_cmd(). Reported-by: Martin George <marting@netapp.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31nvme: change __nvme_submit_sync_cmd() calling conventionsHannes Reinecke4-20/+41
Combine the two arguments 'flags' and 'at_head' from __nvme_submit_sync_cmd() into a single 'flags' argument and use function-specific values to indicate what should be set within the function. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-31nvme-auth: open-code single-use macrosHannes Reinecke1-7/+7
No point in having macros just for a single function nvme_auth_submit(). Open-code them into the caller. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-29nvme: use ctrl state accessorKeith Busch8-30/+37
The ctrl->state value is updated in another thread using WRITE_ONCE, so ensure all the readers use the appropriate accessor. Reviewed-by: Sagi Grimberg <sagi@grmberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-26nvmet-tcp: fix nvme tcp ida memory leakGuixin Liu1-0/+1
The nvmet_tcp_queue_ida should be destroy when the nvmet-tcp module exit. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-24nvme-rdma: Fix transfer length when write_generate/read_verify are 0Israel Rukshin1-3/+8
When the block layer doesn't generate/verify metadata, the SG length is smaller than the transfer length. This is because the SG length doesn't include the metadata length that is added by the HW on the wire. The target failes those commands with "Data SGL Length Invalid" by comparing the transfer length and the SG length. Fix it by adding the metadata length to the transfer length when there is no metadata SGL. The bug reproduces when setting read_verify/write_generate configs to 0 at the child multipath device or at the primary device when NVMe multipath is disabled. Note that setting those configs to 0 on the multipath device (ns_head) doesn't have any impact on the I/Os. Fixes: 5ec5d3bddc6b ("nvme-rdma: add metadata/T10-PI support") Signed-off-by: Israel Rukshin <israelr@nvidia.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-24nvmet: add module description to stop warningsChaitanya Kulkarni6-0/+6
Add MODULE_DESCRIPTION() in order to remove warnings & get clean build:- WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/target/nvmet.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/target/nvme-loop.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/target/nvmet-rdma.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/target/nvmet-fc.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/target/nvme-fcloop.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/target/nvmet-tcp.o Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-24nvme: add module description to stop warningsChaitanya Kulkarni6-0/+6
Add MODULE_DESCRIPTION() in order to remove warnings & get clean build:- WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-core.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-fabrics.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-rdma.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-fc.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvme/host/nvme-tcp.o Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-23nvmet: unify aer type enumGuixin Liu2-3/+3
The host and target use two definition of aer type, unify them into a single one. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-18Merge tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linuxLinus Torvalds14-87/+146
Pull block fixes from Jens Axboe: - NVMe pull request via Keith: - tcp, fc, and rdma target fixes (Maurizio, Daniel, Hannes, Christoph) - discard fixes and improvements (Christoph) - timeout debug improvements (Keith, Max) - various cleanups (Daniel, Max, Giuxen) - trace event string fixes (Arnd) - shadow doorbell setup on reset fix (William) - a write zeroes quirk for SK Hynix (Jim) - MD pull request via Song: - Sparse warning since v6.0 (Bart) - /proc/mdstat regression since v6.7 (Yu Kuai) - Use symbolic error value (Christian) - IO Priority documentation update (Christian) - Fix for accessing queue limits without having entered the queue (Christoph, me) - Fix for loop dio support (Christoph) - Move null_blk off deprecated ida interface (Christophe) - Ensure nbd initializes full msghdr (Eric) - Fix for a regression with the folio conversion, which is now easier to hit because of an unrelated change (Matthew) - Remove redundant check in virtio-blk (Li) - Fix for a potential hang in sbitmap (Ming) - Fix for partial zone appending (Damien) - Misc changes and fixes (Bart, me, Kemeng, Dmitry) * tag 'for-6.8/block-2024-01-18' of git://git.kernel.dk/linux: (45 commits) Documentation: block: ioprio: Update schedulers loop: fix the the direct I/O support check when used on top of block devices blk-mq: Remove the hctx 'run' debugfs attribute nbd: always initialize struct msghdr completely block: Fix iterating over an empty bio with bio_for_each_folio_all block: bio-integrity: fix kcalloc() arguments order virtio_blk: remove duplicate check if queue is broken in virtblk_done sbitmap: remove stale comment in sbq_calc_wake_batch block: Correct a documentation comment in blk-cgroup.c null_blk: Remove usage of the deprecated ida_simple_xx() API block: ensure we hold a queue reference when using queue limits blk-mq: rename blk_mq_can_use_cached_rq block: print symbolic error name instead of error code blk-mq: fix IO hang from sbitmap wakeup race nvmet-rdma: avoid circular locking dependency on install_queue() nvmet-tcp: avoid circular locking dependency on install_queue() nvme-pci: set doorbell config before unquiescing block: fix partial zone append completion handling in req_bio_endio() block/iocost: silence warning on 'last_period' potentially being unused md/raid1: Use blk_opf_t for read and write operations ...
2024-01-11Merge tag 'for-6.8/io_uring-2024-01-08' of git://git.kernel.dk/linuxLinus Torvalds1-1/+1
Pull io_uring updates from Jens Axboe: "Mostly just come fixes and cleanups, but one feature as well. In detail: - Harden the check for handling IOPOLL based on return (Pavel) - Various minor optimizations (Pavel) - Drop remnants of SCM_RIGHTS fd passing support, now that it's no longer supported since 6.7 (me) - Fix for a case where bytes_done wasn't initialized properly on a failure condition for read/write requests (me) - Move the register related code to a separate file (me) - Add support for returning the provided ring buffer head (me) - Add support for adding a direct descriptor to the normal file table (me, Christian Brauner) - Fix for ensuring pending task_work for a ring with DEFER_TASKRUN is run even if we timeout waiting (me)" * tag 'for-6.8/io_uring-2024-01-08' of git://git.kernel.dk/linux: io_uring: ensure local task_work is run on wait timeout io_uring/kbuf: add method for returning provided buffer ring head io_uring/rw: ensure io->bytes_done is always initialized io_uring: drop any code related to SCM_RIGHTS io_uring/unix: drop usage of io_uring socket io_uring/register: move io_uring_register(2) related code to register.c io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALL io_uring/cmd: inline io_uring_cmd_get_task io_uring/cmd: inline io_uring_cmd_do_in_task_lazy io_uring: split out cmd api into a separate header io_uring: optimise ltimeout for inline execution io_uring: don't check iopoll if request completes
2024-01-11Merge tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linuxLinus Torvalds11-325/+307
Pull block updates from Jens Axboe: "Pretty quiet round this time around. This contains: - NVMe updates via Keith: - nvme fabrics spec updates (Guixin, Max) - nvme target udpates (Guixin, Evan) - nvme attribute refactoring (Daniel) - nvme-fc numa fix (Keith) - MD updates via Song: - Fix/Cleanup RCU usage from conf->disks[i].rdev (Yu Kuai) - Fix raid5 hang issue (Junxiao Bi) - Add Yu Kuai as Reviewer of the md subsystem - Remove deprecated flavors (Song Liu) - raid1 read error check support (Li Nan) - Better handle events off-by-1 case (Alex Lyakas) - Efficiency improvements for passthrough (Kundan) - Support for mapping integrity data directly (Keith) - Zoned write fix (Damien) - rnbd fixes (Kees, Santosh, Supriti) - Default to a sane discard size granularity (Christoph) - Make the default max transfer size naming less confusing (Christoph) - Remove support for deprecated host aware zoned model (Christoph) - Misc fixes (me, Li, Matthew, Min, Ming, Randy, liyouhong, Daniel, Bart, Christoph)" * tag 'for-6.8/block-2024-01-08' of git://git.kernel.dk/linux: (78 commits) block: Treat sequential write preferred zone type as invalid block: remove disk_clear_zoned sd: remove the !ZBC && blk_queue_is_zoned case in sd_read_block_characteristics drivers/block/xen-blkback/common.h: Fix spelling typo in comment blk-cgroup: fix rcu lockdep warning in blkg_lookup() blk-cgroup: don't use removal safe list iterators block: floor the discard granularity to the physical block size mtd_blkdevs: use the default discard granularity bcache: use the default discard granularity zram: use the default discard granularity null_blk: use the default discard granularity nbd: use the default discard granularity ubd: use the default discard granularity block: default the discard granularity to sector size bcache: discard_granularity should not be smaller than a sector block: remove two comments in bio_split_discard block: rename and document BLK_DEF_MAX_SECTORS loop: don't abuse BLK_DEF_MAX_SECTORS aoe: don't abuse BLK_DEF_MAX_SECTORS null_blk: don't cap max_hw_sectors to BLK_DEF_MAX_SECTORS ...
2024-01-10nvmet-rdma: avoid circular locking dependency on install_queue()Hannes Reinecke1-3/+16
nvmet_rdma_install_queue() is driven from the ->io_work workqueue function, but will call flush_workqueue() which might trigger ->release_work() which in itself calls flush_work on ->io_work. To avoid that check for pending queue in disconnecting status, and return 'controller busy' when we reached a certain threshold. Signed-off-by: Hannes Reinecke <hare@suse.de> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-10nvmet-tcp: avoid circular locking dependency on install_queue()Hannes Reinecke1-3/+15
nvmet_tcp_install_queue() is driven from the ->io_work workqueue function, but will call flush_workqueue() which might trigger ->release_work() which in itself calls flush_work on ->io_work. To avoid that check for pending queue in disconnecting status, and return 'controller busy' when we reached a certain threshold. Signed-off-by: Hannes Reinecke <hare@suse.de> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-10Merge tag 'hardening-v6.8-rc1' of ↵Linus Torvalds2-6/+6
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: - Introduce the param_unknown_fn type and other clean ups (Andy Shevchenko) - Various __counted_by annotations (Christophe JAILLET, Gustavo A. R. Silva, Kees Cook) - Add KFENCE test to LKDTM (Stephen Boyd) - Various strncpy() refactorings (Justin Stitt) - Fix qnx4 to avoid writing into the smaller of two overlapping buffers - Various strlcpy() refactorings * tag 'hardening-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: qnx4: Use get_directory_fname() in qnx4_match() qnx4: Extract dir entry filename processing into helper atags_proc: Add __counted_by for struct buffer and use struct_size() tracing/uprobe: Replace strlcpy() with strscpy() params: Fix multi-line comment style params: Sort headers params: Use size_add() for kmalloc() params: Do not go over the limit when getting the string length params: Introduce the param_unknown_fn type lkdtm: Add kfence read after free crash type nvme-fc: replace deprecated strncpy with strscpy nvdimm/btt: replace deprecated strncpy with strscpy nvme-fabrics: replace deprecated strncpy with strscpy drm/modes: replace deprecated strncpy with strscpy_pad afs: Add __counted_by for struct afs_acl and use struct_size() VMCI: Annotate struct vmci_handle_arr with __counted_by i40e: Annotate struct i40e_qvlist_info with __counted_by HID: uhid: replace deprecated strncpy with strscpy samples: Replace strlcpy() with strscpy() SUNRPC: Replace strlcpy() with strscpy()
2024-01-10nvme-pci: set doorbell config before unquiescingWilliam Butler1-1/+1
During resets, if queues are unquiesced first, then the host can submit IOs to the controller using shadow doorbell logic but the controller won't be aware. This can lead to necessary MMIO doorbells from being not issued, causing requests to be delayed and timed-out. Signed-off-by: William Butler <wab@google.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-08nvmet-tcp: Fix the H2C expected PDU len calculationMaurizio Lombardi1-3/+7
The nvmet_tcp_handle_h2c_data_pdu() function should take into consideration the possibility that the header digest and/or the data digests are enabled when calculating the expected PDU length, before comparing it to the value stored in cmd->pdu_len. Fixes: efa56305908b ("nvmet-tcp: Fix a kernel panic when host sends an invalid H2C PDU length") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-08nvme-tcp: enhance timeout kernel logMax Gurtovoy1-3/+3
Print the command_id along side blk-mq's tag to help match commands with protocol wire traces and logs. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-08nvme-rdma: enhance timeout kernel logMax Gurtovoy1-3/+8
Print the command_id along side blk-mq's tag to help match commands with protocol wire traces and logs. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-08nvme-pci: enhance timeout kernel logKeith Busch1-10/+13
Kernel configs don't necessarily have opcode decoding, and some opcodes are not even decodable. It is still interesting for debugging SSD issues to know what opcode is timing out, what request type it came from, and the data size (if applicable). Also print the command_id along side blk-mq's tag to help match commands with protocol wire traces and firmware logs, Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-05nvme: trace: avoid memcpy overflow warningArnd Bergmann1-1/+1
A previous patch introduced a struct_group() in nvme_common_command to help stringop fortification figure out the length of the fields, but one function is not currently using them: In file included from drivers/nvme/target/core.c:7: In file included from include/linux/string.h:254: include/linux/fortify-string.h:592:4: error: call to '__read_overflow2_field' declared with 'warning' attribute: detected read beyond size of field (2nd parameter); maybe use struct_group()? [-Werror,-Wattribute-warning] __read_overflow2_field(q_size_field, size); ^ Change this one to use the correct field name to avoid the warning. Fixes: 5c629dc9609dc ("nvme: use struct group for generic command dwords") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-05nvmet: re-fix tracing strncpy() warningArnd Bergmann1-2/+1
An earlier patch had tried to address a warning about a string copy with missing zero termination: drivers/nvme/target/trace.h:52:3: warning: ‘strncpy’ specified bound 32 equals destination size [-Wstringop-truncation] The new version causes a different warning with some compiler versions, notably gcc-9 and gcc-10, and also misses the zero padding that was apparently done intentionally in the original code: drivers/nvme/target/trace.h:56:2: error: 'strncpy' specified bound depends on the length of the source argument [-Werror=stringop-overflow=] Change it to use strscpy_pad() with the original length, which will give a properly padded and zero-terminated string as well as avoiding the warning. Fixes: d86481e924a7 ("nvmet: use min of device_path and disk len") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-05nvme: introduce nvme_disk_is_ns_head helperGuixin Liu3-6/+17
We currently rely on gendisk's file operations (fops) to distinguish between a namespace head (ns_head) and a regular namespace. To enhance code readability, introduce a helper function. Additionally, we must ensure that the device is not an ns_head before calling nvme_get_ns_from_dev(). To enforce this, add a WARN_ON check within the nvme_get_ns_from_dev(). Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Liu Song <liusong@linux.alibaba.com> [include fix: https://lore.kernel.org/oe-kbuild-all/202401031943.0N72Tkji-lkp@intel.com/] Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-05nvme-pci: disable write zeroes for SK Hynix BC901Jim.Lin1-0/+2
SK Hynix BC901 drive write zero will cause Chromebook takes more than 20 mins to switch to developer mode "disable write zeroes" can fix this issue and Sk Hynix has been verified. Signed-off-by: Jim.Lin <jim.lin@siliconmotion.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-05nvmet-fcloop: Remove remote port from list when unlinkingDaniel Wagner1-5/+2
The remote port is removed too late from fcloop_nports list. Remove it when port is unregistered. This prevents a busy loop in fcloop_exit, because it is possible the remote port is found in the list and thus we will never progress. The kernel log will be spammed with nvme_fcloop: fcloop_exit: Failed deleting remote port nvme_fcloop: fcloop_exit: Failed deleting target port Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03nvmet-trace: avoid dereferencing pointer too earlyDaniel Wagner2-14/+20
The first command issued from the host to the target is the fabrics connect command. At this point, neither the target queue nor the controller have been allocated. But we already try to trace this command in nvmet_req_init. Reported by KASAN. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03nvmet-fc: remove unnecessary bracketDaniel Wagner1-1/+1
There is no need for the bracket around the identifier. Remove it. Signed-off-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03nvme: simplify the max_discard_segments calculationChristoph Hellwig2-9/+6
Just stash away the DMRL value in the nvme_ctrl struture, and leave all interpretation to nvme_config_discard, where we know DSM is supported by the time we're configuring the number of segments. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03nvme: fix max_discard_sectors calculationChristoph Hellwig2-12/+9
ctrl->max_discard_sectors stores a value that is potentially based of the DMRSL field in Identify Controller, which is in units of LBAs and thus dependent on the Format of a namespace. Fix this by moving the calculation of max_discard_sectors entirely into nvme_config_discard and replacing the ctrl->max_discard_sectors value with a local variable so that the calculation is always namespace-specific. Fixes: 1a86924e4f46 ("nvme: fix interpretation of DMRSL") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03nvme: also skip discard granularity updates in nvme_config_discardChristoph Hellwig1-3/+1
Don't just skip the discard sectors and segments but also the granularity if a value was already set before. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03nvme: update the explanation for not updating the limits in nvme_config_discardChristoph Hellwig1-1/+7
Expeand the comment a bit to explain what is going on. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03nvmet-tcp: fix a missing endianess conversion in nvmet_tcp_try_peek_pduChristoph Hellwig1-1/+1
No, a __le32 cast doesn't magically byteswap on big-endian systems.. Fixes: 70525e5d82f6 ("nvmet-tcp: peek icreq before starting TLS") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03nvme-common: mark nvme_tls_psk_prio staticChristoph Hellwig1-1/+1
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-03nvme: tcp: remove unnecessary goto statementGuixin Liu1-3/+2
There is no requirement to call nvme_tcp_free_queue() for queue deallocation if the pskid is null or the queue allocation fails, as the NVME_TCP_Q_ALLOCATED flag would not be set in such scenarios. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-02nvmet-tcp: remove boilerplate codeMaurizio Lombardi1-8/+8
Simplify the nvmet_tcp_handle_h2c_data_pdu() function by removing boilerplate code. Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-02nvmet-tcp: fix a crash in nvmet_req_complete()Maurizio Lombardi1-2/+1
in nvmet_tcp_handle_h2c_data_pdu(), if the host sends a data_offset different from rbytes_done, the driver ends up calling nvmet_req_complete() passing a status error. The problem is that at this point cmd->req is not yet initialized, the kernel will crash after dereferencing a NULL pointer. Fix the bug by replacing the call to nvmet_req_complete() with nvmet_tcp_fatal_error(). Fixes: 872d26a391da ("nvmet-tcp: add NVMe over TCP target driver") Reviewed-by: Keith Busch <kbsuch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2024-01-02nvmet-tcp: Fix a kernel panic when host sends an invalid H2C PDU lengthMaurizio Lombardi1-1/+12
If the host sends an H2CData command with an invalid DATAL, the kernel may crash in nvmet_tcp_build_pdu_iovec(). Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000 lr : nvmet_tcp_io_work+0x6ac/0x718 [nvmet_tcp] Call trace: process_one_work+0x174/0x3c8 worker_thread+0x2d0/0x3e8 kthread+0x104/0x110 Fix the bug by raising a fatal error if DATAL isn't coherent with the packet size. Also, the PDU length should never exceed the MAXH2CDATA parameter which has been communicated to the host in nvmet_tcp_handle_icreq(). Fixes: 872d26a391da ("nvmet-tcp: add NVMe over TCP target driver") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-21Merge tag 'nvme-6.8-2023-12-21' of git://git.infradead.org/nvme into ↵Jens Axboe11-156/+277
for-6.8/block Pull NVMe updates from Keith: "nvme updates for Linux 6.8 - nvme fabrics spec updates (Guixin, Max) - nvme target udpates (Guixin, Evan) - nvme attribute refactoring (Daniel) - nvme-fc numa fix (Keith)" * tag 'nvme-6.8-2023-12-21' of git://git.infradead.org/nvme: nvme-fc: set numa_node after nvme_init_ctrl nvme-fabrics: don't check discovery ioccsz/iorcsz nvmet: configfs: use ctrl->instance to track passthru subsystems nvme: repack struct nvme_ns_head nvme: add csi, ms and nuse to sysfs nvme: rename ns attribute group nvme: refactor ns info setup function nvme: refactor ns info helpers nvme: move ns id info to struct nvme_ns_head nvmet: remove cntlid_min and cntlid_max check in nvmet_alloc_ctrl nvmet: allow identical cntlid_min and cntlid_max settings nvme-fabrics: check ioccsz and iorcsz nvme: introduce nvme_check_ctrl_fabric_info helper
2023-12-21nvme-fc: set numa_node after nvme_init_ctrlKeith Busch1-4/+2
nvme_init_ctrl() resets numa_node to NUMA_NO_NODE, so be sure to set the desired value after that function call so it won't be overwritten. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-21nvme-fabrics: don't check discovery ioccsz/iorcszMax Gurtovoy1-2/+2
IOCCSZ and IORCSZ are reserved for discovery controllers. Avoid checking their values during identify controller phase. Fixes: 2fcd3ab39826 ("nvme-fabrics: check ioccsz and iorcsz") Reported-by: Daniel Wagner <dwagner@suse.de> Tested-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19block: simplify disk_set_zonedChristoph Hellwig1-1/+1
Only use disk_set_zoned to actually enable zoned device support. For clearing it, call disk_clear_zoned, which is renamed from disk_clear_zone_settings and now directly clears the zoned flag as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20231217165359.604246-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19block: remove support for the host aware zone modelChristoph Hellwig1-1/+1
When zones were first added the SCSI and ATA specs, two different models were supported (in addition to the drive managed one that is invisible to the host): - host managed where non-conventional zones there is strict requirement to write at the write pointer, or else an error is returned - host aware where a write point is maintained if writes always happen at it, otherwise it is left in an under-defined state and the sequential write preferred zones behave like conventional zones (probably very badly performing ones, though) Not surprisingly this lukewarm model didn't prove to be very useful and was finally removed from the ZBC and SBC specs (NVMe never implemented it). Due to to the easily disappearing write pointer host software could never rely on the write pointer to actually be useful for say recovery. Fortunately only a few HDD prototypes shipped using this model which never made it to mass production. Drop the support before it is too late. Note that any such host aware prototype HDD can still be used with Linux as we'll now treat it as a conventional HDD. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20231217165359.604246-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-19nvme-pci: fix sleeping function called from interrupt contextMaurizio Lombardi1-1/+2
the nvme_handle_cqe() interrupt handler calls nvme_complete_async_event() but the latter may call nvme_auth_stop() which is a blocking function. Sleeping functions can't be called in interrupt context BUG: sleeping function called from invalid context in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 0, name: swapper/15 Call Trace: <IRQ> __cancel_work_timer+0x31e/0x460 ? nvme_change_ctrl_state+0xcf/0x3c0 [nvme_core] ? nvme_change_ctrl_state+0xcf/0x3c0 [nvme_core] nvme_complete_async_event+0x365/0x480 [nvme_core] nvme_poll_cq+0x262/0xe50 [nvme] Fix the bug by moving nvme_auth_stop() to fw_act_work (executed by the nvme_wq workqueue) Fixes: f50fff73d620 ("nvme: implement In-Band authentication") Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19nvmet: configfs: use ctrl->instance to track passthru subsystemsEvan Burgess1-2/+2
To prevent enabling more than one passthrough subsystem per NVMe controller, passthru.c maintains an xarray indexed by cntlid values. Passthrough for a given nvmet subsystem cannot be enabled by configfs if the subsystem's passthru_ctrl->cntlid value is already accounted for in the xarray. However, according to the NVMe spec (rev 2.0c, p.145), "The Controller ID (CNTLID) value returned in the Identify Controller data structure may be used to uniquely identify a controller within an NVM subsystem," meaning that cntlid values are not guaranteed to be globally unique across multiple subsystems. Instead, the cntlid only uniquely identifies multiple controllers _within_ a subsystem. As a result, multiple unique & valid NVMe targets can be blocked from enabling passthrough at the same time if their controllers share cntlid values, a behavior allowed by the spec. Fix this by indexing the xarray with passthru_ctrl->instance values, which are allocated per controller by IDA and thus should be truly unique. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Evan Burgess <evan.burgess@seagate.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19nvme: repack struct nvme_ns_headDaniel Wagner1-4/+4
ns_id, lba_shift and ms are always accessed for every read/write I/O in nvme_setup_rw. By grouping these variables into one cacheline we can safe some cycles. 4k sequential reads: baseline patched Bandwidth: 1620 1634 IOPs 66345579 66910939 Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19nvme: add csi, ms and nuse to sysfsDaniel Wagner3-1/+96
libnvme is using the sysfs for enumarating the nvme resources. Though there are few missing attritbutes in the sysfs. For these libnvme issues commands during discovering. As the kernel already knows all these attributes and we would like to avoid libnvme to issue commands all the time, expose these missing attributes. The nuse value is updated on request because the nuse is a volatile value. Since any user can read the sysfs attribute, a very simple rate limit is added (update once every 5 seconds). A more sophisticated update strategy can be added later if there is actually a need for it. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19nvme: rename ns attribute groupDaniel Wagner4-10/+10
Drop the 'id' part of the attribute group name because we want to expose non 'id' related attributes via the ns attribute group. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19nvme: refactor ns info setup functionDaniel Wagner2-60/+62
Use nvme_ns_head instead of nvme_ns where possible. This reduces the coupling between the different data structures. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19nvme: refactor ns info helpersDaniel Wagner4-28/+34
Pass in the nvme_ns_head pointer directly. This reduces the necessity on the caller side have the nvme_ns data structure present. Thus we can refactor the caller side in the next step as well. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19nvme: move ns id info to struct nvme_ns_headDaniel Wagner5-66/+69
Move the namesapce info to struct nvme_ns_head, because it's the same for all associated namespaces. Note: with multipathing enabled the PI information is shared between all paths. If a path is using a different PI configuration it will overwrite the previous settings. This is obviously not correct and such configuration will be rejected in future. For the time being we expect a correctly configured storage. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Daniel Wagner <dwagner@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-19Revert "nvme-fc: fix race between error recovery and creating association"Keith Busch1-16/+5
The commit was identified to might sleep in invalid context and is blocking regression testing. This reverts commit ee6fdc5055e916b1dd497f11260d4901c4c1e55e. Link: https://lore.kernel.org/linux-nvme/hkhl56n665uvc6t5d6h3wtx7utkcorw4xlwi7d2t2bnonavhe6@xaan6pu43ap6/ Link: https://lists.infradead.org/pipermail/linux-nvme/2023-December/043756.html Reported-by: Daniel Wagner <dwagner@suse.de> Reported-by: Maurizio Lombardi <mlombard@redhat.com> Cc: Michael Liang <mliang@purestorage.com> Tested-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-13nvmet: remove cntlid_min and cntlid_max check in nvmet_alloc_ctrlGuixin Liu1-3/+0
The cntlid_min and cntlid_max are checked in configfs, don't check again in nvmet_alloc_ctrl(). Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-13nvmet: allow identical cntlid_min and cntlid_max settingsGuixin Liu1-2/+2
When the user wants to restrict to only creating one controller, they can set cntlid_min and cntlid_max to the same value. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-12io_uring: split out cmd api into a separate headerPavel Begunkov1-1/+1
linux/io_uring.h is slowly becoming a rubbish bin where we put anything exposed to other subsystems. For instance, the task exit hooks and io_uring cmd infra are completely orthogonal and don't need each other's definitions. Start cleaning it up by splitting out all command bits into a new header file. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7ec50bae6e21f371d3850796e716917fc141225a.1701391955.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-07nvme-pci: Add sleep quirk for Kingston drivesGeorg Gottleuber2-1/+20
Some Kingston NV1 and A2000 are wasting a lot of power on specific TUXEDO platforms in s2idle sleep if 'Simple Suspend' is used. This patch applies a new quirk 'Force No Simple Suspend' to achieve a low power sleep without 'Simple Suspend'. Signed-off-by: Werner Sembach <wse@tuxedocomputers.com> Signed-off-by: Georg Gottleuber <ggo@tuxedocomputers.com> Cc: <stable@vger.kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-06nvme-fabrics: check ioccsz and iorcszGuixin Liu1-0/+14
Make sure that ioccsz and iorcsz returned by target are correct before use it. Per 2.0a base NVMe spec: I/O Queue Command Capsule Supported Size (IOCCSZ): This field defines the maximum I/O command capsule size in 16 byte units. The minimum value that shall be indicated is 4 corresponding to 64 bytes. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-06nvme: introduce nvme_check_ctrl_fabric_info helperGuixin Liu1-18/+24
Inroduce nvme_check_ctrl_fabric_info helper to check fabric controller info returned by target. Signed-off-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-04nvme: fix deadlock between reset and scanBitao Hu2-0/+11
If controller reset occurs when allocating namespace, both nvme_reset_work and nvme_scan_work will hang, as shown below. Test Scripts: for ((t=1;t<=128;t++)) do nsid=`nvme create-ns /dev/nvme1 -s 14537724 -c 14537724 -f 0 -m 0 \ -d 0 | awk -F: '{print($NF);}'` nvme attach-ns /dev/nvme1 -n $nsid -c 0 done nvme reset /dev/nvme1 We will find that both nvme_reset_work and nvme_scan_work hung: INFO: task kworker/u249:4:17848 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u249:4 state:D stack: 0 pid:17848 ppid: 2 flags:0x00000028 Workqueue: nvme-reset-wq nvme_reset_work [nvme] Call trace: __switch_to+0xb4/0xfc __schedule+0x22c/0x670 schedule+0x4c/0xd0 blk_mq_freeze_queue_wait+0x84/0xc0 nvme_wait_freeze+0x40/0x64 [nvme_core] nvme_reset_work+0x1c0/0x5cc [nvme] process_one_work+0x1d8/0x4b0 worker_thread+0x230/0x440 kthread+0x114/0x120 INFO: task kworker/u249:3:22404 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u249:3 state:D stack: 0 pid:22404 ppid: 2 flags:0x00000028 Workqueue: nvme-wq nvme_scan_work [nvme_core] Call trace: __switch_to+0xb4/0xfc __schedule+0x22c/0x670 schedule+0x4c/0xd0 rwsem_down_write_slowpath+0x32c/0x98c down_write+0x70/0x80 nvme_alloc_ns+0x1ac/0x38c [nvme_core] nvme_validate_or_alloc_ns+0xbc/0x150 [nvme_core] nvme_scan_ns_list+0xe8/0x2e4 [nvme_core] nvme_scan_work+0x60/0x500 [nvme_core] process_one_work+0x1d8/0x4b0 worker_thread+0x260/0x440 kthread+0x114/0x120 INFO: task nvme:28428 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:nvme state:D stack: 0 pid:28428 ppid: 27119 flags:0x00000000 Call trace: __switch_to+0xb4/0xfc __schedule+0x22c/0x670 schedule+0x4c/0xd0 schedule_timeout+0x160/0x194 do_wait_for_common+0xac/0x1d0 __wait_for_common+0x78/0x100 wait_for_completion+0x24/0x30 __flush_work.isra.0+0x74/0x90 flush_work+0x14/0x20 nvme_reset_ctrl_sync+0x50/0x74 [nvme_core] nvme_dev_ioctl+0x1b0/0x250 [nvme_core] __arm64_sys_ioctl+0xa8/0xf0 el0_svc_common+0x88/0x234 do_el0_svc+0x7c/0x90 el0_svc+0x1c/0x30 el0_sync_handler+0xa8/0xb0 el0_sync+0x148/0x180 The reason for the hang is that nvme_reset_work occurs while nvme_scan_work is still running. nvme_scan_work may add new ns into ctrl->namespaces list after nvme_reset_work frozen all ns->q in ctrl->namespaces list. The newly added ns is not frozen, so nvme_wait_freeze will wait forever. Unfortunately, ctrl->namespaces_rwsem is held by nvme_reset_work, so nvme_scan_work will also wait forever. Now we are deadlocked! PROCESS1 PROCESS2 ============== ============== nvme_scan_work ... nvme_reset_work nvme_validate_or_alloc_ns nvme_dev_disable nvme_alloc_ns nvme_start_freeze down_write ... nvme_ns_add_to_ctrl_list ... up_write nvme_wait_freeze ... down_read nvme_alloc_ns blk_mq_freeze_queue_wait down_write Fix by marking the ctrl with say NVME_CTRL_FROZEN flag set in nvme_start_freeze and cleared in nvme_unfreeze. Then the scan can check it before adding the new namespace (under the namespaces_rwsem). Signed-off-by: Bitao Hu <yaoma@linux.alibaba.com> Reviewed-by: Guixin Liu <kanie@linux.alibaba.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-04nvme: prevent potential spectre v1 gadgetNitesh Shetty1-0/+3
This patch fixes the smatch warning, "nvmet_ns_ana_grpid_store() warn: potential spectre issue 'nvmet_ana_group_enabled' [w] (local cap)" Prevent the contents of kernel memory from being leaked to user space via speculative execution by using array_index_nospec. Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-04nvme: improve NVME_HOST_AUTH and NVME_TARGET_AUTH config descriptionsShin'ichiro Kawasaki2-4/+6
Currently two similar config options NVME_HOST_AUTH and NVME_TARGET_AUTH have almost same descriptions. It is confusing to choose them in menuconfig. Improve the descriptions to distinguish them. Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-04nvme-ioctl: move capable() admin check to the endKeith Busch1-10/+11
This can be an expensive call on some kernel configs. Move it to the end after checking the cheaper ways to determine if the command is allowed. Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-04nvme: ensure reset state check orderingKeith Busch5-49/+63
A different CPU may be setting the ctrl->state value, so ensure proper barriers to prevent optimizing to a stale state. Normally it isn't a problem to observe the wrong state as it is merely advisory to take a quicker path during initialization and error recovery, but seeing an old state can report unexpected ENETRESET errors when a reset request was in fact successful. Reported-by: Minh Hoang <mh2022@meta.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Hannes Reinecke <hare@suse.de>
2023-12-04nvme: introduce helper function to get ctrl stateKeith Busch1-0/+5
The controller state is typically written by another CPU, so reading it should ensure no optimizations are taken. This is a repeated pattern in the driver, so start with adding a convenience function that returns the controller state with READ_ONCE(). Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-12-01nvme: use bio_integrity_map_userKeith Busch1-168/+29
Map user metadata buffers directly. Now that the bio tracks the metadata, nvme doesn't need special metadata handling and tracking with callbacks and additional fields in the pdu. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20231130215309.2923568-3-kbusch@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-12-01nvme-fc: replace deprecated strncpy with strscpyJustin Stitt1-4/+4
strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. Let's instead use strscpy() [2] as it guarantees NUL-termination on the destination buffer. Moreover, there is no need to use: | min(FCNVME_ASSOC_HOSTNQN_LEN, NVMF_NQN_SIZE)); I imagine this was originally done to make sure the destination buffer is NUL-terminated by ensuring we copy a number of bytes less than the size of our destination, thus leaving some NUL-bytes at the end. However, with strscpy(), we no longer need to do this and we can instead opt for the more idiomatic strscpy() usage of: | strscpy(dest, src, sizeof(dest)) Also, no NUL-padding is required as lsop is zero-allocated: | lsop = kzalloc((sizeof(*lsop) + | sizeof(*assoc_rqst) + sizeof(*assoc_acc) + | ctrl->lport->ops->lsrqst_priv_sz), GFP_KERNEL); ... and assoc_rqst points to a field in lsop: | assoc_rqst = (struct fcnvme_ls_cr_assoc_rqst *)&lsop[1]; Therefore, any additional NUL-byte assignments (like the ones that strncpy() makes) are redundant. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Signed-off-by: Justin Stitt <justinstitt@google.com> Similar-to: https://lore.kernel.org/all/20231018-strncpy-drivers-nvme-host-fabrics-c-v1-1-b6677df40a35@google.com/ Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20231019-strncpy-drivers-nvme-host-fc-c-v1-1-5805c15e4b49@google.com Signed-off-by: Kees Cook <keescook@chromium.org>
2023-12-01nvme-fabrics: replace deprecated strncpy with strscpyJustin Stitt1-2/+2
strncpy() is deprecated for use on NUL-terminated destination strings [1] and as such we should prefer more robust and less ambiguous string interfaces. We expect both data->subsysnqn and data->hostnqn to be NUL-terminated based on their usage with format specifier ("%s"): fabrics.c: 322: dev_err(ctrl->device, 323: "%s, subsysnqn \"%s\"\n", 324: inv_data, data->subsysnqn); ... 349: dev_err(ctrl->device, 350: "Connect for subsystem %s is not allowed, hostnqn: %s\n", 351: data->subsysnqn, data->hostnqn); Moreover, there's no need to NUL-pad since `data` is zero-allocated already in fabrics.c: 383: data = kzalloc(sizeof(*data), GFP_KERNEL); ... therefore any further NUL-padding is rendered useless. Considering the above, a suitable replacement is `strscpy` [2] due to the fact that it guarantees NUL-termination on the destination buffer without unnecessarily NUL-padding. I opted not to switch NVMF_NQN_SIZE to sizeof(data->xyz) because the size is defined as: | /* NQN names in commands fields specified one size */ | #define NVMF_NQN_FIELD_LEN 256 ... while NVMF_NQN_SIZE is defined as: | /* However the max length of a qualified name is another size */ | #define NVMF_NQN_SIZE 223 Since 223 seems pretty magic, I'm not going to touch it. Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1] Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2] Link: https://github.com/KSPP/linux/issues/90 Cc: linux-hardening@vger.kernel.org Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20231018-strncpy-drivers-nvme-host-fabrics-c-v1-1-b6677df40a35@google.com Signed-off-by: Kees Cook <keescook@chromium.org>
2023-12-01nvme-core: check for too small lba shiftKeith Busch1-2/+3
The block layer doesn't support logical block sizes smaller than 512 bytes. The nvme spec doesn't support that small either, but the driver isn't checking to make sure the device responded with usable data. Failing to catch this will result in a kernel bug, either from a division by zero when stacking, or a zero length bio. Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-27nvme: check for valid nvme_identify_ns() before using itEwan D. Milne1-0/+9
When scanning namespaces, it is possible to get valid data from the first call to nvme_identify_ns() in nvme_alloc_ns(), but not from the second call in nvme_update_ns_info_block(). In particular, if the NSID becomes inactive between the two commands, a storage device may return a buffer filled with zero as per 4.1.5.1. In this case, we can get a kernel crash due to a divide-by-zero in blk_stack_limits() because ns->lba_shift will be set to zero. PID: 326 TASK: ffff95fec3cd8000 CPU: 29 COMMAND: "kworker/u98:10" #0 [ffffad8f8702f9e0] machine_kexec at ffffffff91c76ec7 #1 [ffffad8f8702fa38] __crash_kexec at ffffffff91dea4fa #2 [ffffad8f8702faf8] crash_kexec at ffffffff91deb788 #3 [ffffad8f8702fb00] oops_end at ffffffff91c2e4bb #4 [ffffad8f8702fb20] do_trap at ffffffff91c2a4ce #5 [ffffad8f8702fb70] do_error_trap at ffffffff91c2a595 #6 [ffffad8f8702fbb0] exc_divide_error at ffffffff928506e6 #7 [ffffad8f8702fbd0] asm_exc_divide_error at ffffffff92a00926 [exception RIP: blk_stack_limits+434] RIP: ffffffff92191872 RSP: ffffad8f8702fc80 RFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff95efa0c91800 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000001 RBP: 00000000ffffffff R8: ffff95fec7df35a8 R9: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: ffff95fed33c09a8 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #8 [ffffad8f8702fce0] nvme_update_ns_info_block at ffffffffc06d3533 [nvme_core] #9 [ffffad8f8702fd18] nvme_scan_ns at ffffffffc06d6fa7 [nvme_core] This happened when the check for valid data was moved out of nvme_identify_ns() into one of the callers. Fix this by checking in both callers. Link: https://bugzilla.kernel.org/show_bug.cgi?id=218186 Fixes: 0dd6fff2aad4 ("nvme: bring back auto-removal of deleted namespaces during sequential scan") Cc: stable@vger.kernel.org Signed-off-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-27nvme-core: fix a memory leak in nvme_ns_info_from_identify()Maurizio Lombardi1-2/+5
In case of error, free the nvme_id_ns structure that was allocated by nvme_identify_ns(). Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-27nvme: fine-tune sending of first keep-aliveMark O'Donovan1-2/+11
Keep-alive commands are sent half-way through the kato period. This normally works well but fails when the keep-alive system is started when we are more than half way through the kato. This can happen on larger setups or due to host delays. With this change we now time the initial keep-alive command from the controller initialisation time, rather than the keep-alive mechanism activation time. Signed-off-by: Mark O'Donovan <shiftee@posteo.net> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-22nvme: tcp: fix compile-time checks for TLS modeArnd Bergmann1-17/+14
When CONFIG_NVME_KEYRING is enabled as a loadable module, but the TCP host code is built-in, it fails to link: arm-linux-gnueabi-ld: drivers/nvme/host/tcp.o: in function `nvme_tcp_setup_ctrl': tcp.c:(.text+0x1940): undefined reference to `nvme_tls_psk_default' The problem is that the compile-time conditionals are inconsistent here, using a mix of #ifdef CONFIG_NVME_TCP_TLS, IS_ENABLED(CONFIG_NVME_TCP_TLS) and IS_ENABLED(CONFIG_NVME_KEYRING) checks, with CONFIG_NVME_KEYRING controlling whether the implementation is actually built. Change it to use IS_ENABLED(CONFIG_NVME_KEYRING) checks consistently, which should help readability and make it less error-prone. Combining it with the check for the ctrl->opts->tls flag lets the compiler drop all the TLS code in configurations without this feature, which also helps runtime behavior in addition to avoiding the link failure. To make it possible for the compiler to build the dead code, both the tls_handshake_timeout variable and the TLS specific members of nvme_tcp_queue need to be moved out of the #ifdef block as well, but at least the former of these gets optimized out again. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20231122224719.4042108-4-arnd@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-22nvme: target: fix Kconfig select statementsArnd Bergmann1-2/+2
When the NVME target code is built-in but its TCP frontend is a loadable module, enabling keyring support causes a link failure: x86_64-linux-ld: vmlinux.o: in function `nvmet_ports_make': configfs.c:(.text+0x100a211): undefined reference to `nvme_keyring_id' The problem is that CONFIG_NVME_TARGET_TCP_TLS is a 'bool' symbol that depends on the tristate CONFIG_NVME_TARGET_TCP, so any 'select' from it inherits the state of the tristate symbol rather than the intended CONFIG_NVME_TARGET one that contains the actual call. The same thing is true for CONFIG_KEYS, which itself is required for NVME_KEYRING. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20231122224719.4042108-3-arnd@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-22nvme: target: fix nvme_keyring_id() referencesArnd Bergmann1-1/+1
In configurations without CONFIG_NVME_TARGET_TCP_TLS, the keyring code might not be available, or using it will result in a runtime failure: x86_64-linux-ld: vmlinux.o: in function `nvmet_ports_make': configfs.c:(.text+0x100a211): undefined reference to `nvme_keyring_id' Add a check to ensure we only check the keyring if there is a chance of it being used, which avoids both the runtime and link-time problems. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/r/20231122224719.4042108-2-arnd@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2023-11-22nvme: move nvme_stop_keep_alive() back to original positionHannes Reinecke4-12/+11
Stopping keep-alive not only stops the keep-alive workqueue, but also needs to be synchronized with I/O termination as we must not send a keep-alive command after all I/O had been terminated. So to avoid any regressions move the call to stop_keep_alive() back to its original position and ensure that keep-alive is correctly stopped failing to setup the admin queue. Fixes: 4733b65d82bd ("nvme: start keep-alive after admin queue setup") Suggested-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20nvmet-tcp: always initialize tls_handshake_tmo_workHannes Reinecke1-1/+3
The TLS handshake timeout work item should always be initialized to avoid a crash when cancelling the workqueue. Fixes: 675b453e0241 ("nvmet-tcp: enable TLS handshake upcall") Suggested-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Hannes Reinecke <hare@suse.de> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20nvmet: nul-terminate the NQNs passed in the connect commandChristoph Hellwig1-0/+4
The host and subsystem NQNs are passed in the connect command payload and interpreted as nul-terminated strings. Ensure they actually are nul-terminated before using them. Fixes: a07b4970f464 "nvmet: add a generic NVMe target") Reported-by: Alon Zahavi <zahavi.alon@gmail.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20nvme: blank out authentication fabrics options if not configuredHannes Reinecke1-0/+2
If the config option NVME_HOST_AUTH is not selected we should not accept the corresponding fabrics options. This allows userspace to detect if NVMe authentication has been enabled for the kernel. Cc: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: f50fff73d620 ("nvme: implement In-Band authentication") Signed-off-by: Hannes Reinecke <hare@suse.de> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Daniel Wagner <dwagner@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20nvme: catch errors from nvme_configure_metadata()Hannes Reinecke1-6/+13
nvme_configure_metadata() is issuing I/O, so we might incur an I/O error which will cause the connection to be reset. But in that case any further probing will race with reset and cause UAF errors. So return a status from nvme_configure_metadata() and abort probing if there was an I/O error. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20nvme-tcp: only evaluate 'tls' option if TLS is selectedHannes Reinecke1-1/+1
We only need to evaluate the 'tls' connect option if TLS is enabled; otherwise we might be getting a link error. Fixes: 706add13676d ("nvme: keyring: fix conditional compilation") Reported-by: kernel test robot <yujie.liu@intel.com> Closes: https://lore.kernel.org/r/202311140426.0eHrTXBr-lkp@intel.com/ Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20nvme-auth: set explanation code for failure2 msgsMark O'Donovan1-0/+2
Some error cases were not setting an auth-failure-reason-code-explanation. This means an AUTH_Failure2 message will be sent with an explanation value of 0 which is a reserved value. Signed-off-by: Mark O'Donovan <shiftee@posteo.net> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-20nvme-auth: unlock mutex in one place onlyMark O'Donovan1-2/+1
Signed-off-by: Mark O'Donovan <shiftee@posteo.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-08nvme: keyring: fix conditional compilationHannes Reinecke2-12/+6
The keyring and auth functions can be called from both the host and the target side and are controlled by Kconfig options for each of the combinations, but the declarations are controlled by #ifdef checks on the shared Kconfig symbols. This leads to link failures in combinations where one of the frontends is built-in and the other one is a module, and the keyring code ends up in a module that is not reachable from the builtin code: ld: drivers/nvme/host/core.o: in function `nvme_core_exit': core.c:(.exit.text+0x4): undefined reference to `nvme_keyring_exit' ld: drivers/nvme/host/core.o: in function `nvme_core_init': core.c:(.init.text+0x94): undefined reference to `nvme_keyring_init ld: drivers/nvme/host/tcp.o: in function `nvme_tcp_setup_ctrl': tcp.c:(.text+0x4c18): undefined reference to `nvme_tls_psk_default' Address this by moving nvme_keyring_init()/nvme_keyring_exit() into module init/exit functions for the keyring module. Fixes: be8e82caa6859 ("nvme-tcp: enable TLS handshake upcall") Signed-off-by: Hannes Reinecke <hare@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-07nvme: common: make keyring and auth separate modulesArnd Bergmann6-13/+9
When only the keyring module is included but auth is not, modpost complains about the lack of a module license tag: ERROR: modpost: missing MODULE_LICENSE() in drivers/nvme/common/nvme-common.o Address this by making both modules buildable standalone, removing the now unnecessary CONFIG_NVME_COMMON symbol in the process. Also, now that NVME_KEYRING config symbol can be either a module or built-in, the stubs need to check for '#if IS_ENABLED' rather than a simple '#ifdef'. Fixes: 9d77eb5277849 ("nvme-keyring: register '.nvme' keyring") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06nvme: start keep-alive after admin queue setupHannes Reinecke2-3/+9
Setting up I/O queues might take quite some time on larger and/or busy setups, so KATO might expire before all I/O queues could be set up. Fix this by start keep alive from the ->init_ctrl_finish() callback, and stopping it when calling nvme_cancel_admin_tagset(). Signed-off-by: Hannes Reinecke <hare@suse.de> Tested-by: Mark O'Donovan <shiftee@posteo.net> [fixed nvme-fc compile error] Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06nvme-loop: always quiesce and cancel commands before destroying admin qHannes Reinecke1-0/+4
Once ->init_ctrl_finish() is called there may be commands outstanding, so we should quiesce the admin queue and cancel all commands prior to call nvme_loop_destroy_admin_queue(). Signed-off-by: Hannes Reinecke <hare@suse.de> Tested-by: Mark O'Donovan <shiftee@posteo.net> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06nvme-tcp: avoid open-coding nvme_tcp_teardown_admin_queue()Hannes Reinecke1-5/+1
nvme_tcp_setup_ctrl() has an open-coded version of nvme_tcp_teardown_admin_queue(). Signed-off-by: Hannes Reinecke <hare@suse.de> Tested-by: Mark O'Donovan <shiftee@posteo.net> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06nvme-auth: always set valid seq_num in dhchap replyMark O'Donovan2-3/+2
Currently a seqnum of zero is sent during uni-directional authentication. The zero value is reserved for the secure channel feature which is not yet implemented. Relevant extract from the spec: The value 0h is used to indicate that bidirectional authentication is not performed, but a challenge value C2 is carried in order to generate a pre-shared key (PSK) for subsequent establishment of a secure channel Signed-off-by: Mark O'Donovan <shiftee@posteo.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de>
2023-11-06nvme-auth: add flag for bi-directional authMark O'Donovan1-1/+4
Introduces an explicit variable for bi-directional auth. The currently used variable chap->s2 is incorrectly zeroed for uni-directional auth. That will be fixed in the next patch so this needs to change to avoid sending unexpected success2 messages Signed-off-by: Mark O'Donovan <shiftee@posteo.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de>
2023-11-06nvme-auth: auth success1 msg always includes respMark O'Donovan1-4/+1
In cases where RVALID is false, the response is still transmitted, but is cleared to zero. Relevant extract from the spec: Response R2, if valid (i.e., if the RVALID field is set to 01h), cleared to 0h otherwise Signed-off-by: Mark O'Donovan <shiftee@posteo.net> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-11-06nvme: fix error-handling for io_uring nvme-passthroughAnuj Gupta1-2/+5
Driver may return an error before submitting the command to the device. Ensure that such error is propagated up. Fixes: 456cba386e94 ("nvme: wire-up uring-cmd support for io-passthru on char-device.") Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06nvme: update firmware version after commitDaniel Wagner1-1/+14
The firmware version sysfs entry needs to be updated after a successfully firmware activation. nvme-cli stopped issuing an Identify Controller command to list the current firmware information and relies on sysfs showing the current firmware version. Reported-by: Kenji Tomonaga <tkenbo@gmail.com> Signed-off-by: Daniel Wagner <dwagner@suse.de> Tested-by: Kenji Tomonaga <tkenbo@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> [fixed off-by one afi index] Signed-off-by: Keith Busch <kbusch@kernel.org>
2023-11-06nvme-tcp: Fix a memory leakChristophe JAILLET1-1/+2
All error handling path end to the error handling path, except this one. Go to the error handling branch as well here, otherwise 'icreq' and 'icresp' will leak. Fixes: 2837966ab2a8 ("nvme-tcp: control message handling for recvmsg()") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <kbusch@kernel.org>