aboutsummaryrefslogtreecommitdiffstats
path: root/net/rds
AgeCommit message (Collapse)AuthorFilesLines
6 daysnet: change proto and proto_ops accept typeJens Axboe1-1/+5
Rather than pass in flags, error pointer, and whether this is a kernel invocation or not, add a struct proto_accept_arg struct as the argument. This then holds all of these arguments, and prepares accept for being able to pass back more information. No functional changes in this patch. Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-05-03net: rds: Remove the now superfluous sentinel elements from ctl_table arrayJoel Granados3-3/+0
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels) which will reduce the overall build time size of the kernel and run time memory bloat by ~64 bytes per sentinel (further information Link : https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/) * Remove sentinel element from ctl_table structs. Signed-off-by: Joel Granados <j.granados@samsung.com> Acked-by: Allison Henderson <allison.henderson@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-03-29net/rds: fix possible cp null dereferenceMahmoud Adam1-1/+1
cp might be null, calling cp->cp_conn would produce null dereference [Simon Horman adds:] Analysis: * cp is a parameter of __rds_rdma_map and is not reassigned. * The following call-sites pass a NULL cp argument to __rds_rdma_map() - rds_get_mr() - rds_get_mr_for_dest * Prior to the code above, the following assumes that cp may be NULL (which is indicative, but could itself be unnecessary) trans_private = rs->rs_transport->get_mr( sg, nents, rs, &mr->r_key, cp ? cp->cp_conn : NULL, args->vec.addr, args->vec.bytes, need_odp ? ODP_ZEROBASED : ODP_NOT_NEEDED); * The code modified by this patch is guarded by IS_ERR(trans_private), where trans_private is assigned as per the previous point in this analysis. The only implementation of get_mr that I could locate is rds_ib_get_mr() which can return an ERR_PTR if the conn (4th) argument is NULL. * ret is set to PTR_ERR(trans_private). rds_ib_get_mr can return ERR_PTR(-ENODEV) if the conn (4th) argument is NULL. Thus ret may be -ENODEV in which case the code in question will execute. Conclusion: * cp may be NULL at the point where this patch adds a check; this patch does seem to address a possible bug Fixes: c055fc00c07b ("net/rds: fix WARNING in rds_conn_connect_if_down") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Mahmoud Adam <mngyadam@amazon.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://lore.kernel.org/r/20240326153132.55580-1-mngyadam@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-19rds: introduce acquire/release ordering in acquire/release_in_xmit()Yewon Choi1-3/+2
acquire/release_in_xmit() work as bit lock in rds_send_xmit(), so they are expected to ensure acquire/release memory ordering semantics. However, test_and_set_bit/clear_bit() don't imply such semantics, on top of this, following smp_mb__after_atomic() does not guarantee release ordering (memory barrier actually should be placed before clear_bit()). Instead, we use clear_bit_unlock/test_and_set_bit_lock() here. Fixes: 0f4b1c7e89e6 ("rds: fix rds_send_xmit() serialization") Fixes: 1f9ecd7eacfd ("RDS: Pass rds_conn_path to rds_send_xmit()") Signed-off-by: Yewon Choi <woni9911@gmail.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://lore.kernel.org/r/ZfQUxnNTO9AJmzwc@libra05 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-03-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-5/+4
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: net/core/page_pool_user.c 0b11b1c5c320 ("netdev: let netlink core handle -EMSGSIZE errors") 429679dcf7d9 ("page_pool: fix netlink dump stop/resume") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-03-06net/rds: fix WARNING in rds_conn_connect_if_downEdward Adam Davis2-5/+4
If connection isn't established yet, get_mr() will fail, trigger connection after get_mr(). Fixes: 584a8279a44a ("RDS: RDMA: return appropriate error on rdma map failures") Reported-and-tested-by: syzbot+d4faee732755bba9838e@syzkaller.appspotmail.com Signed-off-by: Edward Adam Davis <eadavis@qq.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-02-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-2/+11
Cross-merge networking fixes after downstream PR. No conflicts. Adjacent changes: net/core/dev.c 9f30831390ed ("net: add rcu safety to rtnl_prop_list_size()") 723de3ebef03 ("net: free altname using an RCU callback") net/unix/garbage.c 11498715f266 ("af_unix: Remove io_uring code for GC.") 25236c91b5ab ("af_unix: Fix task hung while purging oob_skb in GC.") drivers/net/ethernet/renesas/ravb_main.c ed4adc07207d ("net: ravb: Count packets instead of descriptors in GbEth RX path" ) c2da9408579d ("ravb: Add Rx checksum offload support for GbEth") net/mptcp/protocol.c bdd70eb68913 ("mptcp: drop the push_pending field") 28e5c1380506 ("mptcp: annotate lockless accesses around read-mostly fields") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-02-13net:rds: Fix possible deadlock in rds_message_putAllison Henderson1-2/+11
Functions rds_still_queued and rds_clear_recv_queue lock a given socket in order to safely iterate over the incoming rds messages. However calling rds_inc_put while under this lock creates a potential deadlock. rds_inc_put may eventually call rds_message_purge, which will lock m_rs_lock. This is the incorrect locking order since m_rs_lock is meant to be locked before the socket. To fix this, we move the message item to a local list or variable that wont need rs_recv_lock protection. Then we can safely call rds_inc_put on any item stored locally after rs_recv_lock is released. Fixes: bdbe6fbc6a2f ("RDS: recv.c") Reported-by: syzbot+f9db6ff27b9bfdcfeca0@syzkaller.appspotmail.com Reported-by: syzbot+dcd73ff9291e6d34b3ab@syzkaller.appspotmail.com Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Link: https://lore.kernel.org/r/20240209022854.200292-1-allison.henderson@oracle.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-01-30net: rds: Simplify the allocation of slab caches in rds_conn_initKunwu Chan1-3/+1
Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Link: https://lore.kernel.org/r/20240124075801.471330-1-chentao@kylinos.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-01-22net/rds: Fix UBSAN: array-index-out-of-bounds in rds_cmsg_recvSharath Srinivasan1-1/+1
Syzcaller UBSAN crash occurs in rds_cmsg_recv(), which reads inc->i_rx_lat_trace[j + 1] with index 4 (3 + 1), but with array size of 4 (RDS_RX_MAX_TRACES). Here 'j' is assigned from rs->rs_rx_trace[i] and in-turn from trace.rx_trace_pos[i] in rds_recv_track_latency(), with both arrays sized 3 (RDS_MSG_RX_DGRAM_TRACE_MAX). So fix the off-by-one bounds check in rds_recv_track_latency() to prevent a potential crash in rds_cmsg_recv(). Found by syzcaller: ================================================================= UBSAN: array-index-out-of-bounds in net/rds/recv.c:585:39 index 4 is out of range for type 'u64 [4]' CPU: 1 PID: 8058 Comm: syz-executor228 Not tainted 6.6.0-gd2f51b3516da #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x136/0x150 lib/dump_stack.c:106 ubsan_epilogue lib/ubsan.c:217 [inline] __ubsan_handle_out_of_bounds+0xd5/0x130 lib/ubsan.c:348 rds_cmsg_recv+0x60d/0x700 net/rds/recv.c:585 rds_recvmsg+0x3fb/0x1610 net/rds/recv.c:716 sock_recvmsg_nosec net/socket.c:1044 [inline] sock_recvmsg+0xe2/0x160 net/socket.c:1066 __sys_recvfrom+0x1b6/0x2f0 net/socket.c:2246 __do_sys_recvfrom net/socket.c:2264 [inline] __se_sys_recvfrom net/socket.c:2260 [inline] __x64_sys_recvfrom+0xe0/0x1b0 net/socket.c:2260 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x40/0x110 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0x6b ================================================================== Fixes: 3289025aedc0 ("RDS: add receive message trace used by application") Reported-by: Chenyuan Yang <chenyuan0y@gmail.com> Closes: https://lore.kernel.org/linux-rdma/CALGdzuoVdq-wtQ4Az9iottBqC5cv9ZhcE5q8N7LfYFvkRsOVcw@mail.gmail.com/ Signed-off-by: Sharath Srinivasan <sharath.srinivasan@oracle.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-12-11ipv6: annotate data-races around np->mcast_oifEric Dumazet1-1/+1
np->mcast_oif is read locklessly in some contexts. Make all accesses to this field lockless, adding appropriate annotations. This also makes setsockopt( IPV6_MULTICAST_IF ) lockless. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net: prevent address rewrite in kernel_bind()Jordan Rife2-2/+2
Similar to the change in commit 0bdf399342c5("net: Avoid address overwrite in kernel_connect"), BPF hooks run on bind may rewrite the address passed to kernel_bind(). This change 1) Makes a copy of the bind address in kernel_bind() to insulate callers. 2) Replaces direct calls to sock->ops->bind() in net with kernel_bind() Link: https://lore.kernel.org/netdev/20230912013332.2048422-1-jrife@google.com/ Fixes: 4fbac77d2d09 ("bpf: Hooks for sys_bind") Cc: stable@vger.kernel.org Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jordan Rife <jrife@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-10-01net: replace calls to sock->ops->connect() with kernel_connect()Jordan Rife1-1/+1
commit 0bdf399342c5 ("net: Avoid address overwrite in kernel_connect") ensured that kernel_connect() will not overwrite the address parameter in cases where BPF connect hooks perform an address rewrite. This change replaces direct calls to sock->ops->connect() in net with kernel_connect() to make these call safe. Link: https://lore.kernel.org/netdev/20230912013332.2048422-1-jrife@google.com/ Fixes: d74bad4e74ee ("bpf: Hooks for sys_connect") Cc: stable@vger.kernel.org Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Jordan Rife <jrife@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-09-20net: rds: Fix possible NULL-pointer dereferenceArtem Chernyshev1-5/+7
In rds_rdma_cm_event_handler_cmn() check, if conn pointer exists before dereferencing it as rdma_set_service_type() argument Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: fd261ce6a30e ("rds: rdma: update rdma transport for tos") Signed-off-by: Artem Chernyshev <artem.chernyshev@red-soft.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-29Merge tag 'sysctl-6.6-rc1' of ↵Linus Torvalds1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "Long ago we set out to remove the kitchen sink on kernel/sysctl.c arrays and placings sysctls to their own sybsystem or file to help avoid merge conflicts. Matthew Wilcox pointed out though that if we're going to do that we might as well also *save* space while at it and try to remove the extra last sysctl entry added at the end of each array, a sentintel, instead of bloating the kernel by adding a new sentinel with each array moved. Doing that was not so trivial, and has required slowing down the moves of kernel/sysctl.c arrays and measuring the impact on size by each new move. The complex part of the effort to help reduce the size of each sysctl is being done by the patient work of el señor Don Joel Granados. A lot of this is truly painful code refactoring and testing and then trying to measure the savings of each move and removing the sentinels. Although Joel already has code which does most of this work, experience with sysctl moves in the past shows is we need to be careful due to the slew of odd build failures that are possible due to the amount of random Kconfig options sysctls use. To that end Joel's work is split by first addressing the major housekeeping needed to remove the sentinels, which is part of this merge request. The rest of the work to actually remove the sentinels will be done later in future kernel releases. The preliminary math is showing this will all help reduce the overall build time size of the kernel and run time memory consumed by the kernel by about ~64 bytes per array where we are able to remove each sentinel in the future. That also means there is no more bloating the kernel with the extra ~64 bytes per array moved as no new sentinels are created" * tag 'sysctl-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: sysctl: Use ctl_table_size as stopping criteria for list macro sysctl: SIZE_MAX->ARRAY_SIZE in register_net_sysctl vrf: Update to register_net_sysctl_sz networking: Update to register_net_sysctl_sz netfilter: Update to register_net_sysctl_sz ax.25: Update to register_net_sysctl_sz sysctl: Add size to register_net_sysctl function sysctl: Add size arg to __register_sysctl_init sysctl: Add size to register_sysctl sysctl: Add a size arg to __register_sysctl_table sysctl: Add size argument to init_header sysctl: Add ctl_table_size to ctl_table_header sysctl: Use ctl_table_header in list_for_each_table_entry sysctl: Prefer ctl_table_header in proc_sysctl
2023-08-15networking: Update to register_net_sysctl_szJoel Granados1-1/+2
Move from register_net_sysctl to register_net_sysctl_sz for all the networking related files. Do this while making sure to mirror the NULL assignments with a table_size of zero for the unprivileged users. We need to move to the new function in preparation for when we change SIZE_MAX to ARRAY_SIZE() in the register_net_sysctl macro. Failing to do so would erroneously allow ARRAY_SIZE() to be called on a pointer. We hold off the SIZE_MAX to ARRAY_SIZE change until we have migrated all the relevant net sysctl registering functions to register_net_sysctl_sz in subsequent commits. An additional size function was added to the following files in order to calculate the size of an array that is defined in another file: include/net/ipv6.h net/ipv6/icmp.c net/ipv6/route.c net/ipv6/sysctl_net_ipv6.c Signed-off-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
2023-08-13net/rds: Remove unused function declarationsYue Haibing3-5/+0
Commit 39de8281791c ("RDS: Main header file") declared but never implemented rds_trans_init() and rds_trans_exit(), remove it. Commit d37c9359056f ("RDS: Move loop-only function to loop.c") removed the implementation rds_message_inc_free() but not the declaration. Since commit 55b7ed0b582f ("RDS: Common RDMA transport code") rds_rdma_conn_connect() is never implemented and used. rds_tcp_map_seq() is never implemented and used since commit 70041088e3b9 ("RDS: Add TCP transport to RDS"). Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-06-24sock: Remove ->sendpage*() in favour of sendmsg(MSG_SPLICE_PAGES)David Howells1-1/+0
Remove ->sendpage() and ->sendpage_locked(). sendmsg() with MSG_SPLICE_PAGES should be used instead. This allows multiple pages and multipage folios to be passed through. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> # for net/can cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: linux-afs@lists.infradead.org cc: mptcp@lists.linux.dev cc: rds-devel@oss.oracle.com cc: tipc-discussion@lists.sourceforge.net cc: virtualization@lists.linux-foundation.org Link: https://lore.kernel.org/r/20230623225513.2732256-16-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-06-24rds: Use sendmsg(MSG_SPLICE_PAGES) rather than sendpageDavid Howells1-11/+12
When transmitting data, call down into TCP using a single sendmsg with MSG_SPLICE_PAGES to indicate that content should be spliced. To make this work, the data is assembled in a bio_vec array and attached to a BVEC-type iterator. Signed-off-by: David Howells <dhowells@redhat.com> cc: Santosh Shilimkar <santosh.shilimkar@oracle.com> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: rds-devel@oss.oracle.com Link: https://lore.kernel.org/r/20230623225513.2732256-6-dhowells@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-13rds: rds_rm_zerocopy_callback() correct order for list_add_tail()Pietro Borrello1-1/+1
rds_rm_zerocopy_callback() uses list_add_tail() with swapped arguments. This links the list head with the new entry, losing the references to the remaining part of the list. Fixes: 9426bbc6de99 ("rds: use list structure to track information for zerocopy completion notification") Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-02-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-3/+3
net/devlink/leftover.c / net/core/devlink.c: 565b4824c39f ("devlink: change port event netdev notifier from per-net to global") f05bd8ebeb69 ("devlink: move code to a dedicated directory") 687125b5799c ("devlink: split out core code") https://lore.kernel.org/all/20230208094657.379f2b1a@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-02-09rds: rds_rm_zerocopy_callback() use list_first_entry()Pietro Borrello1-3/+3
rds_rm_zerocopy_callback() uses list_entry() on the head of a list causing a type confusion. Use list_first_entry() to actually access the first element of the rs_zcookie_queue list. Fixes: 9426bbc6de99 ("rds: use list structure to track information for zerocopy completion notification") Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it> Link: https://lore.kernel.org/r/20230202-rds-zerocopy-v3-1-83b0df974f9a@diag.uniroma1.it Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-27net: add missing includes of linux/sched/clock.hJakub Kicinski2-0/+2
Number of files depend on linux/sched/clock.h getting included by linux/skbuff.h which soon will no longer be the case. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-23net/sock: Introduce trace_sk_data_ready()Peilin Ye2-0/+4
As suggested by Cong, introduce a tracepoint for all ->sk_data_ready() callback implementations. For example: <...> iperf-609 [002] ..... 70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable iperf-609 [002] ..... 70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable <...> Suggested-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-11-04rds: remove redundant variable total_payload_lenColin Ian King1-2/+1
Variable total_payload_len is being used to accumulate payload lengths however it is never read or used afterwards. It is redundant and can be removed. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-26net/rds: remove variable total_copiedColin Ian King1-2/+0
Variable total_copied is just being incremented and it's never used anywhere else. The variable and the increment are redundant so remove it. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Link: https://lore.kernel.org/r/20221024135046.2159523-1-colin.i.king@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-24net: add a refcount tracker for kernel socketsEric Dumazet1-0/+3
Commit ffa84b5ffb37 ("net: add netns refcount tracker to struct sock") added a tracker to sockets, but did not track kernel sockets. We still have syzbot reports hinting about netns being destroyed while some kernel TCP sockets had not been dismantled. This patch tracks kernel sockets, and adds a ref_tracker_dir_print() call to net_free() right before the netns is freed. Normally, each layer is responsible for properly releasing its kernel sockets before last call to net_free(). This debugging facility is enabled with CONFIG_NET_NS_REFCNT_TRACKER=y Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Tested-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-10-11treewide: use get_random_{u8,u16}() when possible, part 2Jason A. Donenfeld1-1/+1
Rather than truncate a 32-bit value to a 16-bit value or an 8-bit value, simply use the get_random_{u8,u16}() functions, which are faster than wasting the additional bytes from a 32-bit value. This was done by hand, identifying all of the places where one of the random integer functions was used in a non-32-bit context. Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Yury Norov <yury.norov@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Heiko Carstens <hca@linux.ibm.com> # for s390 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-10-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+1
Merge in the left-over fixes before the net-next pull-request. Conflicts: drivers/net/ethernet/mediatek/mtk_ppe.c ae3ed15da588 ("net: ethernet: mtk_eth_soc: fix state in __mtk_foe_entry_clear") 9d8cb4c096ab ("net: ethernet: mtk_eth_soc: add foe_entry_size to mtk_eth_soc") https://lore.kernel.org/all/6cb6893b-4921-a068-4c30-1109795110bb@tessares.net/ kernel/bpf/helpers.c 8addbfc7b308 ("bpf: Gate dynptr API behind CAP_BPF") 5679ff2f138f ("bpf: Move bpf_loop and bpf_for_each_map_elem under CAP_BPF") 8a67f2de9b1d ("bpf: expose bpf_strtol and bpf_strtoul to all program types") https://lore.kernel.org/all/20221003201957.13149-1-daniel@iogearbox.net/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-10-03net: rds: don't hold sock lock when cancelling work from ↵Tetsuo Handa1-1/+1
rds_tcp_reset_callbacks() syzbot is reporting lockdep warning at rds_tcp_reset_callbacks() [1], for commit ac3615e7f3cffe2a ("RDS: TCP: Reduce code duplication in rds_tcp_reset_callbacks()") added cancel_delayed_work_sync() into a section protected by lock_sock() without realizing that rds_send_xmit() might call lock_sock(). We don't need to protect cancel_delayed_work_sync() using lock_sock(), for even if rds_{send,recv}_worker() re-queued this work while __flush_work() from cancel_delayed_work_sync() was waiting for this work to complete, retried rds_{send,recv}_worker() is no-op due to the absence of RDS_CONN_UP bit. Link: https://syzkaller.appspot.com/bug?extid=78c55c7bc6f66e53dce2 [1] Reported-by: syzbot <syzbot+78c55c7bc6f66e53dce2@syzkaller.appspotmail.com> Co-developed-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: syzbot <syzbot+78c55c7bc6f66e53dce2@syzkaller.appspotmail.com> Fixes: ac3615e7f3cffe2a ("RDS: TCP: Reduce code duplication in rds_tcp_reset_callbacks()") Signed-off-by: David S. Miller <davem@davemloft.net>
2022-09-19net: rds: add missing __init/__exit annotations to module init/exit funcsXiu Jianfeng3-4/+4
Add missing __init/__exit annotations to module init/exit funcs. Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Link: https://lore.kernel.org/r/20220909091840.247946-1-xiujianfeng@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-31net/rds: Pass a pointer to virt_to_page()Linus Walleij1-1/+1
Functions that work on a pointer to virtual memory such as virt_to_pfn() and users of that function such as virt_to_page() are supposed to pass a pointer to virtual memory, ideally a (void *) or other pointer. However since many architectures implement virt_to_pfn() as a macro, this function becomes polymorphic and accepts both a (unsigned long) and a (void *). If we instead implement a proper virt_to_pfn(void *addr) function the following happens (occurred on arch/arm): net/rds/message.c:357:56: warning: passing argument 1 of 'virt_to_pfn' makes pointer from integer without a cast [-Wint-conversion] Fix this with an explicit cast. Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com> Cc: rds-devel@oss.oracle.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/20220829132001.114858-1-linus.walleij@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-12rds: add missing barrier to release_refillMikulas Patocka1-0/+1
The functions clear_bit and set_bit do not imply a memory barrier, thus it may be possible that the waitqueue_active function (which does not take any locks) is moved before clear_bit and it could miss a wakeup event. Fix this bug by adding a memory barrier after clear_bit. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2022-08-08iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()Al Viro1-2/+1
Most of the users immediately follow successful iov_iter_get_pages() with advancing by the amount it had returned. Provide inline wrappers doing that, convert trivial open-coded uses of those. BTW, iov_iter_get_pages() never returns more than it had been asked to; such checks in cifs ought to be removed someday... Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-01net/rds: Use PTR_ERR instead of IS_ERR for rdsdebug()Li Qiong1-1/+1
If 'local_odp_mr->r_trans_private' is a error code, it is better to print the error code than to print the value of IS_ERR(). Signed-off-by: Li Qiong <liqiong@nfschina.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-24Merge tag 'v5.18' into rdma.git for-nextJason Gunthorpe4-5/+25
Following patches have dependencies. Resolve the merge conflict in drivers/net/ethernet/mellanox/mlx5/core/main.c by keeping the new names for the fs functions following linux-next: https://lore.kernel.org/r/20220519113529.226bc3e2@canb.auug.org.au/ Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2022-05-05net: rds: use maybe_get_net() when acquiring refcount on TCP socketsTetsuo Handa4-6/+18
Eric Dumazet is reporting addition on 0 problem at rds_tcp_tune(), for delayed works queued in rds_wq might be invoked after a net namespace's refcount already reached 0. Since rds_tcp_exit_net() from cleanup_net() calls flush_workqueue(rds_wq), it is guaranteed that we can instead use maybe_get_net() from delayed work functions until rds_tcp_exit_net() returns. Note that I'm not convinced that all works which might access a net namespace are already queued in rds_wq by the moment rds_tcp_exit_net() calls flush_workqueue(rds_wq). If some race is there, rds_tcp_exit_net() will fail to wait for work functions, and kmem_cache_free() could be called from net_free() before maybe_get_net() is called from rds_tcp_tune(). Reported-by: Eric Dumazet <edumazet@google.com> Fixes: 3a58f13a881ed351 ("net: rds: acquire refcount on TCP sockets") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/41d09faf-bc78-1a87-dfd1-c6d1b5984b61@I-love.SAKURA.ne.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-03net: rds: acquire refcount on TCP socketsTetsuo Handa1-0/+8
syzbot is reporting use-after-free read in tcp_retransmit_timer() [1], for TCP socket used by RDS is accessing sock_net() without acquiring a refcount on net namespace. Since TCP's retransmission can happen after a process which created net namespace terminated, we need to explicitly acquire a refcount. Link: https://syzkaller.appspot.com/bug?extid=694120e1002c117747ed [1] Reported-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> Fixes: 26abe14379f8e2fa ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") Fixes: 8a68173691f03661 ("net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: syzbot <syzbot+694120e1002c117747ed@syzkaller.appspotmail.com> Link: https://lore.kernel.org/r/a5fb1fc4-2284-3359-f6a0-e4e390239d7b@I-love.SAKURA.ne.jp Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2022-04-06RDMA: Split kernel-only global device caps from uverbs device capsJason Gunthorpe1-2/+2
Split out flags from ib_device::device_cap_flags that are only used internally to the kernel into kernel_cap_flags that is not part of the uapi. This limits the device_cap_flags to being the same bitmap that will be copied to userspace. This cleanly splits out the uverbs flags from the kernel flags to avoid confusion in the flags bitmap. Add some short comments describing which each of the kernel flags is connected to. Remove unused kernel flags. Link: https://lore.kernel.org/r/0-v2-22c19e565eef+139a-kern_caps_jgg@nvidia.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-12-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-0/+1
No conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-14rds: memory leak in __rds_conn_create()Hangyu Hua1-0/+1
__rds_conn_create() did not release conn->c_path when loop_trans != 0 and trans->t_prefer_loopback != 0 and is_outgoing == 0. Fixes: aced3ce57cd3 ("RDS tcp loopback connection can hang") Signed-off-by: Hangyu Hua <hbh25y@gmail.com> Reviewed-by: Sharath Srinivasan <sharath.srinivasan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+1
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-02net/rds: correct socket tunable error in rds_tcp_tune()William Kucharski1-1/+1
Correct an error where setting /proc/sys/net/rds/tcp/rds_tcp_rcvbuf would instead modify the socket's sk_sndbuf and would leave sk_rcvbuf untouched. Fixes: c6a58ffed536 ("RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket") Signed-off-by: William Kucharski <william.kucharski@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-11-22rds: Fix a typo in a commentChristophe JAILLET1-1/+1
s/cold/could/ Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Acked-By: Devesh Sharma <devesh.s.sharma@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-18net/rds: dma_map_sg is entitled to merge entriesGerd Rausch1-2/+2
Function "dma_map_sg" is entitled to merge adjacent entries and return a value smaller than what was passed as "nents". Subsequently "ib_map_mr_sg" needs to work with this value ("sg_dma_len") rather than the original "nents" parameter ("sg_len"). This old RDS bug was exposed and reliably causes kernel panics (using RDMA operations "rds-stress -D") on x86_64 starting with: commit c588072bba6b ("iommu/vt-d: Convert intel iommu driver to the iommu ops") Simply put: Linux 5.11 and later. Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Link: https://lore.kernel.org/r/60efc69f-1f35-529d-a7ef-da0549cad143@oracle.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-30Merge tag 'net-next-5.14' of ↵Linus Torvalds2-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - BPF: - add syscall program type and libbpf support for generating instructions and bindings for in-kernel BPF loaders (BPF loaders for BPF), this is a stepping stone for signed BPF programs - infrastructure to migrate TCP child sockets from one listener to another in the same reuseport group/map to improve flexibility of service hand-off/restart - add broadcast support to XDP redirect - allow bypass of the lockless qdisc to improving performance (for pktgen: +23% with one thread, +44% with 2 threads) - add a simpler version of "DO_ONCE()" which does not require jump labels, intended for slow-path usage - virtio/vsock: introduce SOCK_SEQPACKET support - add getsocketopt to retrieve netns cookie - ip: treat lowest address of a IPv4 subnet as ordinary unicast address allowing reclaiming of precious IPv4 addresses - ipv6: use prandom_u32() for ID generation - ip: add support for more flexible field selection for hashing across multi-path routes (w/ offload to mlxsw) - icmp: add support for extended RFC 8335 PROBE (ping) - seg6: add support for SRv6 End.DT46 behavior - mptcp: - DSS checksum support (RFC 8684) to detect middlebox meddling - support Connection-time 'C' flag - time stamping support - sctp: packetization Layer Path MTU Discovery (RFC 8899) - xfrm: speed up state addition with seq set - WiFi: - hidden AP discovery on 6 GHz and other HE 6 GHz improvements - aggregation handling improvements for some drivers - minstrel improvements for no-ack frames - deferred rate control for TXQs to improve reaction times - switch from round robin to virtual time-based airtime scheduler - add trace points: - tcp checksum errors - openvswitch - action execution, upcalls - socket errors via sk_error_report Device APIs: - devlink: add rate API for hierarchical control of max egress rate of virtual devices (VFs, SFs etc.) - don't require RCU read lock to be held around BPF hooks in NAPI context - page_pool: generic buffer recycling New hardware/drivers: - mobile: - iosm: PCIe Driver for Intel M.2 Modem - support for Qualcomm MSM8998 (ipa) - WiFi: Qualcomm QCN9074 and WCN6855 PCI devices - sparx5: Microchip SparX-5 family of Enterprise Ethernet switches - Mellanox BlueField Gigabit Ethernet (control NIC of the DPU) - NXP SJA1110 Automotive Ethernet 10-port switch - Qualcomm QCA8327 switch support (qca8k) - Mikrotik 10/25G NIC (atl1c) Driver changes: - ACPI support for some MDIO, MAC and PHY devices from Marvell and NXP (our first foray into MAC/PHY description via ACPI) - HW timestamping (PTP) support: bnxt_en, ice, sja1105, hns3, tja11xx - Mellanox/Nvidia NIC (mlx5) - NIC VF offload of L2 bridging - support IRQ distribution to Sub-functions - Marvell (prestera): - add flower and match all - devlink trap - link aggregation - Netronome (nfp): connection tracking offload - Intel 1GE (igc): add AF_XDP support - Marvell DPU (octeontx2): ingress ratelimit offload - Google vNIC (gve): new ring/descriptor format support - Qualcomm mobile (rmnet & ipa): inline checksum offload support - MediaTek WiFi (mt76) - mt7915 MSI support - mt7915 Tx status reporting - mt7915 thermal sensors support - mt7921 decapsulation offload - mt7921 enable runtime pm and deep sleep - Realtek WiFi (rtw88) - beacon filter support - Tx antenna path diversity support - firmware crash information via devcoredump - Qualcomm WiFi (wcn36xx) - Wake-on-WLAN support with magic packets and GTK rekeying - Micrel PHY (ksz886x/ksz8081): add cable test support" * tag 'net-next-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2168 commits) tcp: change ICSK_CA_PRIV_SIZE definition tcp_yeah: check struct yeah size at compile time gve: DQO: Fix off by one in gve_rx_dqo() stmmac: intel: set PCI_D3hot in suspend stmmac: intel: Enable PHY WOL option in EHL net: stmmac: option to enable PHY WOL with PMT enabled net: say "local" instead of "static" addresses in ndo_dflt_fdb_{add,del} net: use netdev_info in ndo_dflt_fdb_{add,del} ptp: Set lookup cookie when creating a PTP PPS source. net: sock: add trace for socket errors net: sock: introduce sk_error_report net: dsa: replay the local bridge FDB entries pointing to the bridge dev too net: dsa: ensure during dsa_fdb_offload_notify that dev_hold and dev_put are on the same dev net: dsa: include fdb entries pointing to bridge in the host fdb list net: dsa: include bridge addresses which are local in the host fdb list net: dsa: sync static FDB entries on foreign interfaces to hardware net: dsa: install the host MDB and FDB entries in the master's RX filter net: dsa: reference count the FDB addresses at the cross-chip notifier level net: dsa: introduce a separate cross-chip notifier type for host FDBs net: dsa: reference count the MDB entries at the cross-chip notifier level ...
2021-06-28Merge tag 'fallthrough-fixes-clang-5.14-rc1' of ↵Linus Torvalds2-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux Pull fallthrough fixes from Gustavo Silva: "Fix many fall-through warnings when building with Clang 12.0.0 and '-Wimplicit-fallthrough' so that we at some point will be able to enable that warning by default" * tag 'fallthrough-fixes-clang-5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (26 commits) rxrpc: Fix fall-through warnings for Clang drm/nouveau/clk: Fix fall-through warnings for Clang drm/nouveau/therm: Fix fall-through warnings for Clang drm/nouveau: Fix fall-through warnings for Clang xfs: Fix fall-through warnings for Clang xfrm: Fix fall-through warnings for Clang tipc: Fix fall-through warnings for Clang sctp: Fix fall-through warnings for Clang rds: Fix fall-through warnings for Clang net/packet: Fix fall-through warnings for Clang net: netrom: Fix fall-through warnings for Clang ide: Fix fall-through warnings for Clang hwmon: (max6621) Fix fall-through warnings for Clang hwmon: (corsair-cpro) Fix fall-through warnings for Clang firewire: core: Fix fall-through warnings for Clang braille_console: Fix fall-through warnings for Clang ipv4: Fix fall-through warnings for Clang qlcnic: Fix fall-through warnings for Clang bnxt_en: Fix fall-through warnings for Clang netxen_nic: Fix fall-through warnings for Clang ...
2021-06-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski1-1/+1
Trivial conflicts in net/can/isotp.c and tools/testing/selftests/net/mptcp/mptcp_connect.sh scaled_ppm_to_ppb() was moved from drivers/ptp/ptp_clock.c to include/linux/ptp_clock_kernel.h in -next so re-apply the fix there. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-06-08net: rds: fix memory leak in rds_recvmsgPavel Skripkin1-1/+1
Syzbot reported memory leak in rds. The problem was in unputted refcount in case of error. int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size, int msg_flags) { ... if (!rds_next_incoming(rs, &inc)) { ... } After this "if" inc refcount incremented and if (rds_cmsg_recv(inc, msg, rs)) { ret = -EFAULT; goto out; } ... out: return ret; } in case of rds_cmsg_recv() fail the refcount won't be decremented. And it's easy to see from ftrace log, that rds_inc_addref() don't have rds_inc_put() pair in rds_recvmsg() after rds_cmsg_recv() 1) | rds_recvmsg() { 1) 3.721 us | rds_inc_addref(); 1) 3.853 us | rds_message_inc_copy_to_user(); 1) + 10.395 us | rds_cmsg_recv(); 1) + 34.260 us | } Fixes: bdbe6fbc6a2f ("RDS: recv.c") Reported-and-tested-by: syzbot+5134cdf021c4ed5aaa5f@syzkaller.appspotmail.com Signed-off-by: Pavel Skripkin <paskripkin@gmail.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-31rds: Fix spelling mistakesZheng Yongjun2-2/+2
Fix some spelling mistakes in comments: alloced ==> allocated Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Link: https://lore.kernel.org/r/20210531063617.3018637-1-zhengyongjun3@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-05-21RDS tcp loopback connection can hangRao Shoaib4-9/+27
When TCP is used as transport and a program on the system connects to RDS port 16385, connection is accepted but denied per the rules of RDS. However, RDS connections object is left in the list. Next loopback connection will select that connection object as it is at the head of list. The connection attempt will hang as the connection object is set to connect over TCP which is not allowed The issue can be reproduced easily, use rds-ping to ping a local IP address. After that use any program like ncat to connect to the same IP address and port 16385. This will hang so ctrl-c out. Now try rds-ping, it will hang. To fix the issue this patch adds checks to disallow the connection object creation and destroys the connection object. Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-05-17rds: Fix fall-through warnings for ClangGustavo A. R. Silva2-0/+3
In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple warnings by explicitly adding multiple break statements instead of letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-05-01Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds2-34/+2
Pull rdma updates from Jason Gunthorpe: "This is significantly bug fixes and general cleanups. The noteworthy new features are fairly small: - XRC support for HNS and improves RQ operations - Bug fixes and updates for hns, mlx5, bnxt_re, hfi1, i40iw, rxe, siw and qib - Quite a few general cleanups on spelling, error handling, static checker detections, etc - Increase the number of device ports supported beyond 255. High port count software switches now exist - Several bug fixes for rtrs - mlx5 Device Memory support for host controlled atomics - Report SRQ tables through to rdma-tool" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (145 commits) IB/qib: Remove redundant assignment to ret RDMA/nldev: Add copy-on-fork attribute to get sys command RDMA/bnxt_re: Fix a double free in bnxt_qplib_alloc_res RDMA/siw: Fix a use after free in siw_alloc_mr IB/hfi1: Remove redundant variable rcd RDMA/nldev: Add QP numbers to SRQ information RDMA/nldev: Return SRQ information RDMA/restrack: Add support to get resource tracking for SRQ RDMA/nldev: Return context information RDMA/core: Add CM to restrack after successful attachment to a device RDMA/cma: Skip device which doesn't support CM RDMA/rxe: Fix a bug in rxe_fill_ip_info() RDMA/mlx5: Expose private query port RDMA/mlx4: Remove an unused variable RDMA/mlx5: Fix type assignment for ICM DM IB/mlx5: Set right RoCE l3 type and roce version while deleting GID RDMA/i40iw: Fix error unwinding when i40iw_hmc_sd_one fails RDMA/cxgb4: add missing qpid increment IB/ipoib: Remove unnecessary struct declaration RDMA/bnxt_re: Get rid of custom module reference counting ...
2021-04-29Merge tag 'net-next-5.13' of ↵Linus Torvalds3-6/+1
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - bpf: - allow bpf programs calling kernel functions (initially to reuse TCP congestion control implementations) - enable task local storage for tracing programs - remove the need to store per-task state in hash maps, and allow tracing programs access to task local storage previously added for BPF_LSM - add bpf_for_each_map_elem() helper, allowing programs to walk all map elements in a more robust and easier to verify fashion - sockmap: support UDP and cross-protocol BPF_SK_SKB_VERDICT redirection - lpm: add support for batched ops in LPM trie - add BTF_KIND_FLOAT support - mostly to allow use of BTF on s390 which has floats in its headers files - improve BPF syscall documentation and extend the use of kdoc parsing scripts we already employ for bpf-helpers - libbpf, bpftool: support static linking of BPF ELF files - improve support for encapsulation of L2 packets - xdp: restructure redirect actions to avoid a runtime lookup, improving performance by 4-8% in microbenchmarks - xsk: build skb by page (aka generic zerocopy xmit) - improve performance of software AF_XDP path by 33% for devices which don't need headers in the linear skb part (e.g. virtio) - nexthop: resilient next-hop groups - improve path stability on next-hops group changes (incl. offload for mlxsw) - ipv6: segment routing: add support for IPv4 decapsulation - icmp: add support for RFC 8335 extended PROBE messages - inet: use bigger hash table for IP ID generation - tcp: deal better with delayed TX completions - make sure we don't give up on fast TCP retransmissions only because driver is slow in reporting that it completed transmitting the original - tcp: reorder tcp_congestion_ops for better cache locality - mptcp: - add sockopt support for common TCP options - add support for common TCP msg flags - include multiple address ids in RM_ADDR - add reset option support for resetting one subflow - udp: GRO L4 improvements - improve 'forward' / 'frag_list' co-existence with UDP tunnel GRO, allowing the first to take place correctly even for encapsulated UDP traffic - micro-optimize dev_gro_receive() and flow dissection, avoid retpoline overhead on VLAN and TEB GRO - use less memory for sysctls, add a new sysctl type, to allow using u8 instead of "int" and "long" and shrink networking sysctls - veth: allow GRO without XDP - this allows aggregating UDP packets before handing them off to routing, bridge, OvS, etc. - allow specifing ifindex when device is moved to another namespace - netfilter: - nft_socket: add support for cgroupsv2 - nftables: add catch-all set element - special element used to define a default action in case normal lookup missed - use net_generic infra in many modules to avoid allocating per-ns memory unnecessarily - xps: improve the xps handling to avoid potential out-of-bound accesses and use-after-free when XPS change race with other re-configuration under traffic - add a config knob to turn off per-cpu netdev refcnt to catch underflows in testing Device APIs: - add WWAN subsystem to organize the WWAN interfaces better and hopefully start driving towards more unified and vendor- independent APIs - ethtool: - add interface for reading IEEE MIB stats (incl. mlx5 and bnxt support) - allow network drivers to dump arbitrary SFP EEPROM data, current offset+length API was a poor fit for modern SFP which define EEPROM in terms of pages (incl. mlx5 support) - act_police, flow_offload: add support for packet-per-second policing (incl. offload for nfp) - psample: add additional metadata attributes like transit delay for packets sampled from switch HW (and corresponding egress and policy-based sampling in the mlxsw driver) - dsa: improve support for sandwiched LAGs with bridge and DSA - netfilter: - flowtable: use direct xmit in topologies with IP forwarding, bridging, vlans etc. - nftables: counter hardware offload support - Bluetooth: - improvements for firmware download w/ Intel devices - add support for reading AOSP vendor capabilities - add support for virtio transport driver - mac80211: - allow concurrent monitor iface and ethernet rx decap - set priority and queue mapping for injected frames - phy: add support for Clause-45 PHY Loopback - pci/iov: add sysfs MSI-X vector assignment interface to distribute MSI-X resources to VFs (incl. mlx5 support) New hardware/drivers: - dsa: mv88e6xxx: add support for Marvell mv88e6393x - 11-port Ethernet switch with 8x 1-Gigabit Ethernet and 3x 10-Gigabit interfaces. - dsa: support for legacy Broadcom tags used on BCM5325, BCM5365 and BCM63xx switches - Microchip KSZ8863 and KSZ8873; 3x 10/100Mbps Ethernet switches - ath11k: support for QCN9074 a 802.11ax device - Bluetooth: Broadcom BCM4330 and BMC4334 - phy: Marvell 88X2222 transceiver support - mdio: add BCM6368 MDIO mux bus controller - r8152: support RTL8153 and RTL8156 (USB Ethernet) chips - mana: driver for Microsoft Azure Network Adapter (MANA) - Actions Semi Owl Ethernet MAC - can: driver for ETAS ES58X CAN/USB interfaces Pure driver changes: - add XDP support to: enetc, igc, stmmac - add AF_XDP support to: stmmac - virtio: - page_to_skb() use build_skb when there's sufficient tailroom (21% improvement for 1000B UDP frames) - support XDP even without dedicated Tx queues - share the Tx queues with the stack when necessary - mlx5: - flow rules: add support for mirroring with conntrack, matching on ICMP, GTP, flex filters and more - support packet sampling with flow offloads - persist uplink representor netdev across eswitch mode changes - allow coexistence of CQE compression and HW time-stamping - add ethtool extended link error state reporting - ice, iavf: support flow filters, UDP Segmentation Offload - dpaa2-switch: - move the driver out of staging - add spanning tree (STP) support - add rx copybreak support - add tc flower hardware offload on ingress traffic - ionic: - implement Rx page reuse - support HW PTP time-stamping - octeon: support TC hardware offloads - flower matching on ingress and egress ratelimitting. - stmmac: - add RX frame steering based on VLAN priority in tc flower - support frame preemption (FPE) - intel: add cross time-stamping freq difference adjustment - ocelot: - support forwarding of MRP frames in HW - support multiple bridges - support PTP Sync one-step timestamping - dsa: mv88e6xxx, dpaa2-switch: offload bridge port flags like learning, flooding etc. - ipa: add IPA v4.5, v4.9 and v4.11 support (Qualcomm SDX55, SM8350, SC7280 SoCs) - mt7601u: enable TDLS support - mt76: - add support for 802.3 rx frames (mt7915/mt7615) - mt7915 flash pre-calibration support - mt7921/mt7663 runtime power management fixes" * tag 'net-next-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2451 commits) net: selftest: fix build issue if INET is disabled net: netrom: nr_in: Remove redundant assignment to ns net: tun: Remove redundant assignment to ret net: phy: marvell: add downshift support for M88E1240 net: dsa: ksz: Make reg_mib_cnt a u8 as it never exceeds 255 net/sched: act_ct: Remove redundant ct get and check icmp: standardize naming of RFC 8335 PROBE constants bpf, selftests: Update array map tests for per-cpu batched ops bpf: Add batched ops support for percpu array bpf: Implement formatted output helpers with bstr_printf seq_file: Add a seq_bprintf function sfc: adjust efx->xdp_tx_queue_count with the real number of initialized queues net:nfc:digital: Fix a double free in digital_tg_recv_dep_req net: fix a concurrency bug in l2tp_tunnel_register() net/smc: Remove redundant assignment to rc mpls: Remove redundant assignment to err llc2: Remove redundant assignment to rc net/tls: Remove redundant initialization of record rds: Remove redundant assignment to nr_sig dt-bindings: net: mdio-gpio: add compatible for microchip,mdio-smi0 ...
2021-04-27rds: Remove redundant assignment to nr_sigJiapeng Chong1-1/+0
Variable nr_sig is being assigned a value however the assignment is never read, so this redundant assignment can be removed. Cleans up the following clang-analyzer warning: net/rds/ib_send.c:297:2: warning: Value stored to 'nr_sig' is never read [clang-analyzer-deadcode.DeadStores]. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-24Revert "net/rds: Avoid potential use after free in rds_send_remove_from_sock"Linus Torvalds2-2/+1
This reverts commit 0c85a7e87465f2d4cbc768e245f4f45b2f299b05. The games with 'rm' are on (two separate instances) of a local variable, and make no difference. Quoting Aditya Pakki: "I was the author of the patch and it was the cause of the giant UMN revert. The patch is garbage and I was unaware of the steps involved in retracting it. I *believed* the maintainers would pull it, given it was already under Greg's list. The patch does not introduce any bugs but is pointless and is stupid. I accept my incompetence and for not requesting a revert earlier." Link: https://lwn.net/Articles/854319/ Requested-by: Aditya Pakki <pakki001@umn.edu> Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com> Cc: David S. Miller <davem@davemloft.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-04-12rds: ib: Remove two ib_modify_qp() callsHåkon Bugge2-34/+2
For some HCAs, ib_modify_qp() is an expensive operation running virtualized. For both the active and passive side, the QP returned by the CM has the state set to RTS, so no need for this excess RTS -> RTS transition. With IB Core's ability to set the RNR Retry timer, we use this interface to shave off another ib_modify_qp(). Fixes: ec16227e1414 ("RDS/IB: Infiniband transport") Link: https://lore.kernel.org/r/1617216194-12890-3-git-send-email-haakon.bugge@oracle.com Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2021-04-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-2/+4
Conflicts: MAINTAINERS - keep Chandrasekar drivers/net/ethernet/mellanox/mlx5/core/en_main.c - simple fix + trust the code re-added to param.c in -next is fine include/linux/bpf.h - trivial include/linux/ethtool.h - trivial, fix kdoc while at it include/linux/skmsg.h - move to relevant place in tcp.c, comment re-wrapped net/core/skmsg.c - add the sk = sk // sk = NULL around calls net/tipc/crypto.c - trivial Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-04-07net/rds: Avoid potential use after free in rds_send_remove_from_sockAditya Pakki2-1/+2
In case of rs failure in rds_send_remove_from_sock(), the 'rm' resource is freed and later under spinlock, causing potential use-after-free. Set the free pointer to NULL to avoid undefined behavior. Signed-off-by: Aditya Pakki <pakki001@umn.edu> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-31net/rds: Fix a use after free in rds_message_map_pagesLv Yunlong1-1/+2
In rds_message_map_pages, the rm is freed by rds_message_put(rm). But rm is still used by rm->data.op_sg in return value. My patch assigns ERR_CAST(rm->data.op_sg) to err before the rm is freed to avoid the uaf. Fixes: 7dba92037baf3 ("net/rds: Use ERR_PTR for rds_message_alloc_sgs()") Signed-off-by: Lv Yunlong <lyl2019@mail.ustc.edu.cn> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-28net: rds: Fix a typoLu Wei1-1/+1
Modify "beween" to "between" in net/rds/send.c. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Lu Wei <luwei32@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-03-10net/rds: Drop duplicate sin and sin6 assignmentsYejune Deng1-4/+0
There is no need to assign the msg->msg_name to sin or sin6, because there is DECLARE_SOCKADDR statement. Signed-off-by: Yejune Deng <yejune.deng@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-02net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGSSabyrzhan Tasbolatov1-0/+3
syzbot found WARNING in rds_rdma_extra_size [1] when RDS_CMSG_RDMA_ARGS control message is passed with user-controlled 0x40001 bytes of args->nr_local, causing order >= MAX_ORDER condition. The exact value 0x40001 can be checked with UIO_MAXIOV which is 0x400. So for kcalloc() 0x400 iovecs with sizeof(struct rds_iovec) = 0x10 is the closest limit, with 0x10 leftover. Same condition is currently done in rds_cmsg_rdma_args(). [1] WARNING: mm/page_alloc.c:5011 [..] Call Trace: alloc_pages_current+0x18c/0x2a0 mm/mempolicy.c:2267 alloc_pages include/linux/gfp.h:547 [inline] kmalloc_order+0x2e/0xb0 mm/slab_common.c:837 kmalloc_order_trace+0x14/0x120 mm/slab_common.c:853 kmalloc_array include/linux/slab.h:592 [inline] kcalloc include/linux/slab.h:621 [inline] rds_rdma_extra_size+0xb2/0x3b0 net/rds/rdma.c:568 rds_rm_size net/rds/send.c:928 [inline] Reported-by: syzbot+1bd2b07f93745fa38425@syzkaller.appspotmail.com Signed-off-by: Sabyrzhan Tasbolatov <snovitoll@gmail.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Link: https://lore.kernel.org/r/20210201203233.1324704-1-snovitoll@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-11-17rds: stop using dmapoolChristoph Hellwig5-69/+101
RDMA ULPs should only perform DMA through the ib_dma_* API instead of using the hidden dma_device directly. In addition using the dma coherent API family that dmapool is a part of can be very ineffcient on plaforms that are not DMA coherent. Switch to use slab allocations and the ib_dma_* APIs instead. Link: https://lore.kernel.org/r/20201106181941.1878556-6-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-17Merge branch 'for-rc' into rdma.gitJason Gunthorpe1-2/+3
From https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git The rc RDMA branch is needed due to dependencies on the next patches. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-11-12RDMA: Lift ibdev_to_node from rds to common codeChristoph Hellwig1-7/+0
Lift the ibdev_to_node from rds to common code and document it. Link: https://lore.kernel.org/r/20201106181941.1878556-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-28RDMA: Add rdma_connect_locked()Jason Gunthorpe1-2/+3
There are two flows for handling RDMA_CM_EVENT_ROUTE_RESOLVED, either the handler triggers a completion and another thread does rdma_connect() or the handler directly calls rdma_connect(). In all cases rdma_connect() needs to hold the handler_mutex, but when handler's are invoked this is already held by the core code. This causes ULPs using the 2nd method to deadlock. Provide a rdma_connect_locked() and have all ULPs call it from their handlers. Link: https://lore.kernel.org/r/0-v2-53c22d5c1405+33-rdma_connect_locking_jgg@nvidia.com Reported-and-tested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com> Fixes: 2a7cec538169 ("RDMA/cma: Fix locking for the RDMA_CM_CONNECT state") Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by: Jack Wang <jinpu.wang@cloud.ionos.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
2020-10-09net/rds: suppress page allocation failure error in recv buffer refillManjunath Patil1-3/+3
RDS/IB tries to refill the recv buffer in softirq context using GFP_NOWAIT flag. However alloc failure is handled by queueing a work to refill the recv buffer with GFP_KERNEL flag. This means failure to allocate with GFP_NOWAIT isn't fatal. Do not print the PAF warnings if softirq context fails to refill the recv buffer. We will see the PAF warnings when worker also fails to allocate. Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com> Reviewed-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-20RDS: drop double zeroingJulia Lawall1-1/+1
sg_init_table zeroes its first argument, so the allocation of that argument doesn't have to. the semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression x,n,flags; @@ x = - kcalloc + kmalloc_array (n,sizeof(*x),flags) ... sg_init_table(x,n) // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18net: rds: delete duplicated wordsRandy Dunlap2-2/+2
Drop repeated words in net/rds/. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com> Cc: linux-rdma@vger.kernel.org Cc: rds-devel@oss.oracle.com Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-23treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva1-1/+1
Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-1/+2
Resolved kernel/bpf/btf.c using instructions from merge commit 69138b34a7248d2396ab85c8652e20c0c39beaba Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-31rds: Prevent kernel-infoleak in rds_notify_queue_get()Peilin Ye1-1/+2
rds_notify_queue_get() is potentially copying uninitialized kernel stack memory to userspace since the compiler may leave a 4-byte hole at the end of `cmsg`. In 2016 we tried to fix this issue by doing `= { 0 };` on `cmsg`, which unfortunately does not always initialize that 4-byte hole. Fix it by using memset() instead. Cc: stable@vger.kernel.org Fixes: f037590fff30 ("rds: fix a leak of kernel memory") Fixes: bdbe6fbc6a2f ("RDS: recv.c") Suggested-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-24net: pass a sockptr_t into ->setsockoptChristoph Hellwig3-27/+23
Rework the remaining setsockopt code to pass a sockptr_t instead of a plain user pointer. This removes the last remaining set_fs(KERNEL_DS) outside of architecture specific code. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> [ieee802154] Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-19net: rds: rdma_transport.h: delete duplicated wordRandy Dunlap1-1/+1
Delete the doubled word "be" in a comment. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com> Cc: netdev@vger.kernel.org Cc: linux-rdma@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2020-07-01rds: If one path needs re-connection, check all and re-connectRao Shoaib3-1/+20
In testing with mprds enabled, Oracle Cluster nodes after reboot were not able to communicate with others nodes and so failed to rejoin the cluster. Peers with lower IP address initiated connection but the node could not respond as it choose a different path and could not initiate a connection as it had a higher IP address. With this patch, when a node sends out a packet and the selected path is down, all other paths are also checked and any down paths are re-connected. Reviewed-by: Ka-cheong Poon <ka-cheong.poon@oracle.com> Reviewed-by: David Edmondson <david.edmondson@oracle.com> Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-25rds: transport module should be auto loaded when transport is setRao Shoaib1-9/+17
This enhancement auto loads transport module when the transport is set via SO_RDS_TRANSPORT socket option. Reviewed-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com> Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-15net/rds: NULL pointer de-reference in rds_ib_add_one()Ka-Cheong Poon1-1/+7
The parent field of a struct device may be NULL. The macro ibdev_to_node() should check for that. Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-14treewide: replace '---help---' in Kconfig files with 'help'Masahiro Yamada1-3/+3
Since commit 84af7a6194e4 ("checkpatch: kconfig: prefer 'help' over '---help---'"), the number of '---help---' has been gradually decreasing, but there are still more than 2400 instances. This commit finishes the conversion. While I touched the lines, I also fixed the indentation. There are a variety of indentation styles found. a) 4 spaces + '---help---' b) 7 spaces + '---help---' c) 8 spaces + '---help---' d) 1 space + 1 tab + '---help---' e) 1 tab + '---help---' (correct indentation) f) 1 tab + 1 space + '---help---' g) 1 tab + 2 spaces + '---help---' In order to convert all of them to 1 tab + 'help', I ran the following commend: $ find . -name 'Kconfig*' | xargs sed -i 's/^[[:space:]]*---help---/\thelp/' Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-06-05Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds8-332/+38
Pull rdma updates from Jason Gunthorpe: "A more active cycle than most of the recent past, with a few large, long discussed works this time. The RNBD block driver has been posted for nearly two years now, and flowing through RDMA due to it also introducing a new ULP. The removal of FMR has been a recurring discussion theme for a long time. And the usual smattering of features and bug fixes. Summary: - Various small driver bugs fixes in rxe, mlx5, hfi1, and efa - Continuing driver cleanups in bnxt_re, hns - Big cleanup of mlx5 QP creation flows - More consistent use of src port and flow label when LAG is used and a mlx5 implementation - Additional set of cleanups for IB CM - 'RNBD' network block driver and target. This is a network block RDMA device specific to ionos's cloud environment. It brings strong multipath and resiliency capabilities. - Accelerated IPoIB for HFI1 - QP/WQ/SRQ ioctl migration for uverbs, and support for multiple async fds - Support for exchanging the new IBTA defiend ECE data during RDMA CM exchanges - Removal of the very old and insecure FMR interface from all ULPs and drivers. FRWR should be preferred for at least a decade now" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (247 commits) RDMA/cm: Spurious WARNING triggered in cm_destroy_id() RDMA/mlx5: Return ECE DC support RDMA/mlx5: Don't rely on FW to set zeros in ECE response RDMA/mlx5: Return an error if copy_to_user fails IB/hfi1: Use free_netdev() in hfi1_netdev_free() RDMA/hns: Uninitialized variable in modify_qp_init_to_rtr() RDMA/core: Move and rename trace_cm_id_create() IB/hfi1: Fix hfi1_netdev_rx_init() error handling RDMA: Remove 'max_map_per_fmr' RDMA: Remove 'max_fmr' RDMA/core: Remove FMR device ops RDMA/rdmavt: Remove FMR memory registration RDMA/mthca: Remove FMR support for memory registration RDMA/mlx4: Remove FMR support for memory registration RDMA/i40iw: Remove FMR leftovers RDMA/bnxt_re: Remove FMR leftovers RDMA/mlx5: Remove FMR leftovers RDMA/core: Remove FMR pool API RDMA/rds: Remove FMR support for memory registration RDMA/srp: Remove support for FMR memory registration ...
2020-06-02RDMA: Remove 'max_fmr'Jason Gunthorpe1-1/+1
Now that FMR support is gone, this attribute can be deleted from all places. Link: https://lore.kernel.org/r/12-v3-f58e6669d5d3+2cf-fmr_removal_jgg@mellanox.com Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Reviewed-by: Bernard Metzler <bmt@zurich.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-06-02RDMA/rds: Remove FMR support for memory registrationMax Gurtovoy8-322/+21
Use FRWR method for memory registration by default and remove the ancient and unsafe FMR method. Link: https://lore.kernel.org/r/3-v3-f58e6669d5d3+2cf-fmr_removal_jgg@mellanox.com Signed-off-by: Max Gurtovoy <maxg@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-28tcp: add tcp_sock_set_keepcntChristoph Hellwig2-15/+4
Add a helper to directly set the TCP_KEEPCNT sockopt from kernel space without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28tcp: add tcp_sock_set_keepintvlChristoph Hellwig1-3/+1
Add a helper to directly set the TCP_KEEPINTVL sockopt from kernel space without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28tcp: add tcp_sock_set_keepidleChristoph Hellwig1-4/+1
Add a helper to directly set the TCP_KEEP_IDLE sockopt from kernel space without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28tcp: add tcp_sock_set_nodelayChristoph Hellwig3-12/+2
Add a helper to directly set the TCP_NODELAY sockopt from kernel space without going through a fake uaccess. Cleanup the callers to avoid pointless wrappers now that this is a simple function call. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Sagi Grimberg <sagi@grimberg.me> Acked-by: Jason Gunthorpe <jgg@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28tcp: add tcp_sock_set_corkChristoph Hellwig1-7/+2
Add a helper to directly set the TCP_CORK sockopt from kernel space without going through a fake uaccess. Cleanup the callers to avoid pointless wrappers now that this is a simple function call. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28net: add sock_set_keepaliveChristoph Hellwig1-5/+1
Add a helper to directly set the SO_KEEPALIVE sockopt from kernel space without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28net: add sock_no_lingerChristoph Hellwig3-14/+2
Add a helper to directly set the SO_LINGER sockopt from kernel space with onoff set to true and a linger time of 0 without going through a fake uaccess. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27RDMA/cma: Provide ECE reject reasonLeon Romanovsky1-1/+3
IBTA declares "vendor option not supported" reject reason in REJ messages if passive side doesn't want to accept proposed ECE options. Due to the fact that ECE is managed by userspace, there is a need to let users to provide such rejected reason. Link: https://lore.kernel.org/r/20200526103304.196371-7-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-21Merge tag 'v5.7-rc6' into rdma.git for-nextJason Gunthorpe4-65/+51
Linux 5.7-rc6 Conflict in drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c resolved by deleting dr_cq_event, matching how netdev resolved it. Required for dependencies in the following patches. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-20rds: fix crash in rds_info_getsockopt()John Hubbard1-1/+2
The conversion to pin_user_pages() had a bug: it overlooked the case of allocation of pages failing. Fix that by restoring an equivalent check. Reported-by: syzbot+118ac0af4ac7f785a45b@syzkaller.appspotmail.com Fixes: dbfe7d74376e ("rds: convert get_user_pages() --> pin_user_pages()") Cc: David S. Miller <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Cc: linux-rdma@vger.kernel.org Cc: rds-devel@oss.oracle.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17rds: convert get_user_pages() --> pin_user_pages()John Hubbard1-4/+2
This code was using get_user_pages_fast(), in a "Case 2" scenario (DMA/RDMA), using the categorization from [1]. That means that it's time to convert the get_user_pages_fast() + put_page() calls to pin_user_pages_fast() + unpin_user_pages() calls. There is some helpful background in [2]: basically, this is a small part of fixing a long-standing disconnect between pinning pages, and file systems' use of those pages. [1] Documentation/core-api/pin_user_pages.rst [2] "Explicit pinning of user-space pages": https://lwn.net/Articles/807108/ Cc: David S. Miller <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Cc: linux-rdma@vger.kernel.org Cc: rds-devel@oss.oracle.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06RDMA: Allow ib_client's to fail when add() is calledJason Gunthorpe1-8/+13
When a client is added it isn't allowed to fail, but all the client's have various failure paths within their add routines. This creates the very fringe condition where the client was added, failed during add and didn't set the client_data. The core code will then still call other client_data centric ops like remove(), rename(), get_nl_info(), and get_net_dev_by_params() with NULL client_data - which is confusing and unexpected. If the add() callback fails, then do not call any more client ops for the device, even remove. Remove all the now redundant checks for NULL client_data in ops callbacks. Update all the add() callbacks to return error codes appropriately. EOPNOTSUPP is used for cases where the ULP does not support the ib_device - eg because it only works with IB. Link: https://lore.kernel.org/r/20200421172440.387069-1-leon@kernel.org Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-04-28Merge branch 'work.sysctl' of ↵Daniel Borkmann1-4/+2
ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler methods to take kernel pointers instead. It gets rid of the set_fs address space overrides used by BPF. As per discussion, pull in the feature branch into bpf-next as it relates to BPF sysctl progs. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/
2020-04-27sysctl: pass kernel pointers to ->proc_handlerChristoph Hellwig1-4/+2
Instead of having all the sysctl handlers deal with user pointers, which is rather hairy in terms of the BPF interaction, copy the input to and from userspace in common code. This also means that the strings are always NUL-terminated by the common code, making the API a little bit safer. As most handler just pass through the data to one of the common handlers a lot of the changes are mechnical. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-15net/rds: Use ERR_PTR for rds_message_alloc_sgs()Jason Gunthorpe4-21/+19
Returning the error code via a 'int *ret' when the function returns a pointer is very un-kernely and causes gcc 10's static analysis to choke: net/rds/message.c: In function ‘rds_message_map_pages’: net/rds/message.c:358:10: warning: ‘ret’ may be used uninitialized in this function [-Wmaybe-uninitialized] 358 | return ERR_PTR(ret); Use a typical ERR_PTR return instead. Signed-off-by: Jason Gunthorpe <jgg@mellanox.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-09net/rds: Fix MR reference counting problemKa-Cheong Poon2-21/+12
In rds_free_mr(), it calls rds_destroy_mr(mr) directly. But this defeats the purpose of reference counting and makes MR free handling impossible. It means that holding a reference does not guarantee that it is safe to access some fields. For example, In rds_cmsg_rdma_dest(), it increases the ref count, unlocks and then calls mr->r_trans->sync_mr(). But if rds_free_mr() (and rds_destroy_mr()) is called in between (there is no lock preventing this to happen), r_trans_private is set to NULL, causing a panic. Similar issue is in rds_rdma_unuse(). Reported-by: zerons <sironhide0null@gmail.com> Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-09net/rds: Replace struct rds_mr's r_refcount with struct krefKa-Cheong Poon3-23/+20
And removed rds_mr_put(). Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16net/rds: Track user mapped pages through special APILeon Romanovsky1-12/+12
Convert net/rds to use the newly introduces pin_user_pages() API, which properly sets FOLL_PIN. Setting FOLL_PIN is now required for code that requires tracking of pinned pages. Note that this effectively changes the code's behavior: it now ultimately calls set_page_dirty_lock(), instead of set_page_dirty(). This is probably more accurate. As Christoph Hellwig put it, "set_page_dirty() is only safe if we are dealing with a file backed page where we have reference on the inode it hangs off." [1] [1] https://lore.kernel.org/r/20190723153640.GB720@lst.de Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: John Hubbard <jhubbard@nvidia.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-18net/rds: Use prefetch for On-Demand-Paging MRHans Westgaard Ry1-0/+9
Try prefetching pages when using On-Demand-Paging MR using ib_advise_mr. Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2020-01-18net/rds: Handle ODP mr registration/unregistrationHans Westgaard Ry7-56/+244
On-Demand-Paging MRs are registered using ib_reg_user_mr and unregistered with ib_dereg_mr. Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2020-01-16net/rds: Detect need of On-Demand-Paging memory registrationHans Westgaard Ry1-2/+4
Add code to check if memory intended for RDMA is FS-DAX-memory. RDS will fail with error code EOPNOTSUPP if FS-DAX-memory is detected. Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2019-11-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-8/+15
Lots of overlapping changes and parallel additions, stuff like that. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16rds: ib: update WR sizes when bringing up connectionDag Moxnes1-8/+15
Currently WR sizes are updated from rds_ib_sysctl_max_send_wr and rds_ib_sysctl_max_recv_wr when a connection is shut down. As a result, a connection being down while rds_ib_sysctl_max_send_wr or rds_ib_sysctl_max_recv_wr are updated, will not update the sizes when it comes back up. Move resizing of WRs to rds_ib_setup_qp so that connections will be setup with the most current WR sizes. Signed-off-by: Dag Moxnes <dag.moxnes@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-17net/rds: Remove unnecessary null checkYueHaibing1-2/+1
Null check before dma_pool_destroy is redundant, so remove it. This is detected by coccinelle. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-06net/rds: Add missing include fileYueHaibing2-0/+2
Fix build error: net/rds/ib_cm.c: In function rds_dma_hdrs_alloc: net/rds/ib_cm.c:475:13: error: implicit declaration of function dma_pool_zalloc; did you mean mempool_alloc? [-Werror=implicit-function-declaration] hdrs[i] = dma_pool_zalloc(pool, GFP_KERNEL, &hdr_daddrs[i]); ^~~~~~~~~~~~~~~ mempool_alloc net/rds/ib.c: In function rds_ib_dev_free: net/rds/ib.c:111:3: error: implicit declaration of function dma_pool_destroy; did you mean mempool_destroy? [-Werror=implicit-function-declaration] dma_pool_destroy(rds_ibdev->rid_hdrs_pool); ^~~~~~~~~~~~~~~~ mempool_destroy Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: 9b17f5884be4 ("net/rds: Use DMA memory pool allocation for rds_header") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-3/+3
2019-10-03net/rds: Use DMA memory pool allocation for rds_headerKa-Cheong Poon5-61/+153
Currently, RDS calls ib_dma_alloc_coherent() to allocate a large piece of contiguous DMA coherent memory to store struct rds_header for sending/receiving packets. The memory allocated is then partitioned into struct rds_header. This is not necessary and can be costly at times when memory is fragmented. Instead, RDS should use the DMA memory pool interface to handle this. The DMA addresses of the pre- allocated headers are stored in an array. At send/receive ring initialization and refill time, this arrary is de-referenced to get the DMA addresses. This array is not accessed at send/receive packet processing. Suggested-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-02net/rds: Log vendor error if send/recv Work requests failSudhakar Dindukurti2-4/+5
Log vendor error if work requests fail. Vendor error provides more information that is used for debugging the issue. Signed-off-by: Sudhakar Dindukurti <sudhakar.dindukurti@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-02net/rds: Fix error handling in rds_ib_add_one()Dotan Barak1-3/+3
rds_ibdev:ipaddr_list and rds_ibdev:conn_list are initialized after allocation some resources such as protection domain. If allocation of such resources fail, then these uninitialized variables are accessed in rds_ib_dev_free() in failure path. This can potentially crash the system. The code has been updated to initialize these variables very early in the function. Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il> Signed-off-by: Sudhakar Dindukurti <sudhakar.dindukurti@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27net/rds: Check laddr_check before calling itKa-Cheong Poon1-1/+4
In rds_bind(), laddr_check is called without checking if it is NULL or not. And rs_transport should be reset if rds_add_bound() fails. Fixes: c5c1a030a7db ("net/rds: An rds_sock is added too early to the hash table") Reported-by: syzbot+fae39afd2101a17ec624@syzkaller.appspotmail.com Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-26net: Fix Kconfig indentationKrzysztof Kozlowski1-2/+2
Adjust indentation from spaces to tab (+optional two spaces) as in coding style with command like: $ sed -e 's/^ /\t/' -i */Kconfig Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Acked-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-17Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-1/+1
Pull in bug fixes from 'net' tree for the merge window. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-15net/rds: Fix 'ib_evt_handler_call' element in 'rds_ib_stat_names'Gerd Rausch1-1/+1
All entries in 'rds_ib_stat_names' are stringified versions of the corresponding "struct rds_ib_statistics" element without the "s_"-prefix. Fix entry 'ib_evt_handler_call' to do the same. Fixes: f4f943c958a2 ("RDS: IB: ack more receive completions to improve performance") Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-22/+18
Minor overlapping changes in the btusb and ixgbe drivers. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-11net/rds: An rds_sock is added too early to the hash tableKa-Cheong Poon1-22/+18
In rds_bind(), an rds_sock is added to the RDS bind hash table before rs_transport is set. This means that the socket can be found by the receive code path when rs_transport is NULL. And the receive code path de-references rs_transport for congestion update check. This can cause a panic. An rds_sock should not be added to the bind hash table before all the needed fields are set. Reported-by: syzbot+4b4f8163c2e246df3c4c@syzkaller.appspotmail.com Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-05Convert usage of IN_MULTICAST to ipv4_is_multicastDave Taht3-6/+6
IN_MULTICAST's primary intent is as a uapi macro. Elsewhere in the kernel we use ipv4_is_multicast consistently. This patch unifies linux's multicast checks to use that function rather than this macro. Signed-off-by: Dave Taht <dave.taht@gmail.com> Reviewed-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller1-1/+4
r8152 conflicts are the NAPI fixes in 'net' overlapping with some tasklet stuff in net-next Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-27net/rds: Fix info leak in rds6_inc_info_copy()Ka-Cheong Poon1-1/+4
The rds6_inc_info_copy() function has a couple struct members which are leaking stack information. The ->tos field should hold actual information and the ->flags field needs to be zeroed out. Fixes: 3eb450367d08 ("rds: add type of service(tos) infrastructure") Fixes: b7ff8b1036f0 ("rds: Extend RDS API for IPv6 support") Reported-by: 黄ID蝴蝶 <butterflyhuangxx@gmail.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller4-8/+22
Minor conflict in r8169, bug fix had two versions in net and net-next, take the net-next hunks. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-24net: rds: add service level support in rds-infoZhu Yanjun4-8/+22
>From IB specific 7.6.5 SERVICE LEVEL, Service Level (SL) is used to identify different flows within an IBA subnet. It is carried in the local route header of the packet. Before this commit, run "rds-info -I". The outputs are as below: " RDS IB Connections: LocalAddr RemoteAddr Tos SL LocalDev RemoteDev 192.2.95.3 192.2.95.1 2 0 fe80::21:28:1a:39 fe80::21:28:10:b9 192.2.95.3 192.2.95.1 1 0 fe80::21:28:1a:39 fe80::21:28:10:b9 192.2.95.3 192.2.95.1 0 0 fe80::21:28:1a:39 fe80::21:28:10:b9 " After this commit, the output is as below: " RDS IB Connections: LocalAddr RemoteAddr Tos SL LocalDev RemoteDev 192.2.95.3 192.2.95.1 2 2 fe80::21:28:1a:39 fe80::21:28:10:b9 192.2.95.3 192.2.95.1 1 1 fe80::21:28:1a:39 fe80::21:28:10:b9 192.2.95.3 192.2.95.1 0 0 fe80::21:28:1a:39 fe80::21:28:10:b9 " The commit fe3475af3bdf ("net: rds: add per rds connection cache statistics") adds cache_allocs in struct rds_info_rdma_connection as below: struct rds_info_rdma_connection { ... __u32 rdma_mr_max; __u32 rdma_mr_size; __u8 tos; __u32 cache_allocs; }; The peer struct in rds-tools of struct rds_info_rdma_connection is as below: struct rds_info_rdma_connection { ... uint32_t rdma_mr_max; uint32_t rdma_mr_size; uint8_t tos; uint8_t sl; uint32_t cache_allocs; }; The difference between userspace and kernel is the member variable sl. In the kernel struct, the member variable sl is missing. This will introduce risks. So it is necessary to use this commit to avoid this risk. Fixes: fe3475af3bdf ("net: rds: add per rds connection cache statistics") CC: Joe Jin <joe.jin@oracle.com> CC: JUNXIAO_BI <junxiao.bi@oracle.com> Suggested-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-23net/rds: Whitelist rdma_cookie and rx_tstamp for usercopyDag Moxnes3-15/+27
Add the RDMA cookie and RX timestamp to the usercopy whitelist. After the introduction of hardened usercopy whitelisting (https://lwn.net/Articles/727322/), a warning is displayed when the RDMA cookie or RX timestamp is copied to userspace: kernel: WARNING: CPU: 3 PID: 5750 at mm/usercopy.c:81 usercopy_warn+0x8e/0xa6 [...] kernel: Call Trace: kernel: __check_heap_object+0xb8/0x11b kernel: __check_object_size+0xe3/0x1bc kernel: put_cmsg+0x95/0x115 kernel: rds_recvmsg+0x43d/0x620 [rds] kernel: sock_recvmsg+0x43/0x4a kernel: ___sys_recvmsg+0xda/0x1e6 kernel: ? __handle_mm_fault+0xcae/0xf79 kernel: __sys_recvmsg+0x51/0x8a kernel: SyS_recvmsg+0x12/0x1c kernel: do_syscall_64+0x79/0x1ae When the whitelisting feature was introduced, the memory for the RDMA cookie and RX timestamp in RDS was not added to the whitelist, causing the warning above. Signed-off-by: Dag Moxnes <dag.moxnes@oracle.com> Tested-by: Jenny <jenny.x.xu@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-15rds: check for excessive looping in rds_send_xmitAndy Grover3-1/+14
Original commit from 2011 updated to include a change by Yuval Shaia <yuval.shaia@oracle.com> that adds a new statistic counter "send_stuck_rm" to capture the messages looping exessively in the send path. Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-15net/rds: Add a few missing rds_stat_names entriesGerd Rausch1-0/+2
In a previous commit, fields were added to "struct rds_statistics" but array "rds_stat_names" was not updated accordingly. Please note the inconsistent naming of the string representations that is done in the name of compatibility with the Oracle internal code-base. s_recv_bytes_added_to_socket -> "recv_bytes_added_to_sock" s_recv_bytes_removed_from_socket -> "recv_bytes_freed_fromsock" Fixes: 192a798f5299 ("RDS: add stat for socket recv memory usage") Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-15RDS: don't use GFP_ATOMIC for sk_alloc in rds_createChris Mason1-1/+1
Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Bang Nguyen <bang.nguyen@oracle.com> Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-15RDS: limit the number of times we loop in rds_send_xmitChris Mason1-1/+11
This will kick the RDS worker thread if we have been looping too long. Original commit from 2012 updated to include a change by Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com> that triggers "must_wake" if "rds_ib_recv_refill_one" fails. Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-15net/rds: Add RDS6_INFO_SOCKETS and RDS6_INFO_RECV_MESSAGES optionsKa-Cheong Poon1-3/+90
Add support of the socket options RDS6_INFO_SOCKETS and RDS6_INFO_RECV_MESSAGES which update the RDS_INFO_SOCKETS and RDS_INFO_RECV_MESSAGES options respectively. The old options work for IPv4 sockets only. Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-27net: rds: Fix possible null-pointer dereferences in ↵Jia-Ju Bai1-1/+4
rds_rdma_cm_event_handler_cmn() In rds_rdma_cm_event_handler_cmn(), there are some if statements to check whether conn is NULL, such as on lines 65, 96 and 112. But conn is not checked before being used on line 108: trans->cm_connect_complete(conn, event); and on lines 140-143: rdsdebug("DISCONNECT event - dropping connection " "%pI6c->%pI6c\n", &conn->c_laddr, &conn->c_faddr); rds_conn_drop(conn); Thus, possible null-pointer dereferences may occur. To fix these bugs, conn is checked before being used. These bugs are found by a static analysis tool STCheck written by us. Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds5-49/+109
Pull networking fixes from David Miller: 1) Fix AF_XDP cq entry leak, from Ilya Maximets. 2) Fix handling of PHY power-down on RTL8411B, from Heiner Kallweit. 3) Add some new PCI IDs to iwlwifi, from Ihab Zhaika. 4) Fix handling of neigh timers wrt. entries added by userspace, from Lorenzo Bianconi. 5) Various cases of missing of_node_put(), from Nishka Dasgupta. 6) The new NET_ACT_CT needs to depend upon NF_NAT, from Yue Haibing. 7) Various RDS layer fixes, from Gerd Rausch. 8) Fix some more fallout from TCQ_F_CAN_BYPASS generalization, from Cong Wang. 9) Fix FIB source validation checks over loopback, also from Cong Wang. 10) Use promisc for unsupported number of filters, from Justin Chen. 11) Missing sibling route unlink on failure in ipv6, from Ido Schimmel. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits) tcp: fix tcp_set_congestion_control() use from bpf hook ag71xx: fix return value check in ag71xx_probe() ag71xx: fix error return code in ag71xx_probe() usb: qmi_wwan: add D-Link DWM-222 A2 device ID bnxt_en: Fix VNIC accounting when enabling aRFS on 57500 chips. net: dsa: sja1105: Fix missing unlock on error in sk_buff() gve: replace kfree with kvfree selftests/bpf: fix test_xdp_noinline on s390 selftests/bpf: fix "valid read map access into a read-only array 1" on s390 net/mlx5: Replace kfree with kvfree MAINTAINERS: update netsec driver ipv6: Unlink sibling route in case of failure liquidio: Replace vmalloc + memset with vzalloc udp: Fix typo in net/ipv4/udp.c net: bcmgenet: use promisc for unsupported filters ipv6: rt6_check should return NULL if 'from' is NULL tipc: initialize 'validated' field of received packets selftests: add a test case for rp_filter fib: relax source validation check for loopback packets mlxsw: spectrum: Do not process learned records with a dummy FID ...
2019-07-17net/rds: Initialize ic->i_fastreg_wrs upon allocationGerd Rausch1-1/+1
Otherwise, if an IB connection is torn down before "rds_ib_setup_qp" is called, the value of "ic->i_fastreg_wrs" is still at zero (as it wasn't initialized by "rds_ib_setup_qp"). Consequently "rds_ib_conn_path_shutdown" will spin forever, waiting for it to go back to "RDS_IB_DEFAULT_FR_WR", which of course will never happen as there are no outstanding work requests. Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-17net/rds: Keep track of and wait for FRWR segments in use upon shutdownGerd Rausch3-6/+45
Since "rds_ib_free_frmr" and "rds_ib_free_frmr_list" simply put the FRMR memory segments on the "drop_list" or "free_list", and it is the job of "rds_ib_flush_mr_pool" to reap those entries by ultimately issuing a "IB_WR_LOCAL_INV" work-request, we need to trigger and then wait for all those memory segments attached to a particular connection to be fully released before we can move on to release the QP, CQ, etc. So we make "rds_ib_conn_path_shutdown" wait for one more atomic_t called "i_fastreg_inuse_count" that keeps track of how many FRWR memory segments are out there marked "FRMR_IS_INUSE" (and also wake_up rds_ib_ring_empty_wait, as they go away). Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-17net/rds: Set fr_state only to FRMR_IS_FREE if IB_WR_LOCAL_INV had been ↵Gerd Rausch1-1/+2
successful Fix a bug where fr_state first goes to FRMR_IS_STALE, because of a failure of operation IB_WR_LOCAL_INV, but then gets set back to "FRMR_IS_FREE" uncoditionally, even though the operation failed. Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-17net/rds: Fix NULL/ERR_PTR inconsistencyGerd Rausch1-2/+2
Make function "rds_ib_try_reuse_ibmr" return NULL in case memory region could not be allocated, since callers simply check if the return value is not NULL. Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-17net/rds: Wait for the FRMR_IS_FREE (or FRMR_IS_STALE) transition after ↵Gerd Rausch2-27/+40
posting IB_WR_LOCAL_INV In order to: 1) avoid a silly bouncing between "clean_list" and "drop_list" triggered by function "rds_ib_reg_frmr" as it is releases frmr regions whose state is not "FRMR_IS_FREE" right away. 2) prevent an invalid access error in a race from a pending "IB_WR_LOCAL_INV" operation with a teardown ("dma_unmap_sg", "put_page") and de-registration ("ib_dereg_mr") of the corresponding memory region. Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-17net/rds: Get rid of "wait_clean_list_grace" and add lockingGerd Rausch2-38/+19
Waiting for activity on the "clean_list" to quiesce is no substitute for proper locking. We can have multiple threads competing for "llist_del_first" via "rds_ib_reuse_mr", and a single thread competing for "llist_del_all" and "llist_del_first" via "rds_ib_flush_mr_pool". Since "llist_del_first" depends on "list->first->next" not to change in the midst of the operation, simply waiting for all current calls to "rds_ib_reuse_mr" to quiesce across all CPUs is woefully inadequate: By the time "wait_clean_list_grace" is done iterating over all CPUs to see that there is no concurrent caller to "rds_ib_reuse_mr", a new caller may have just shown up on the first CPU. Furthermore, <linux/llist.h> explicitly calls out the need for locking: * Cases where locking is needed: * If we have multiple consumers with llist_del_first used in one consumer, * and llist_del_first or llist_del_all used in other consumers, * then a lock is needed. Also, while at it, drop the unused "pool" parameter from "list_to_llist_nodes". Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-17net/rds: Give fr_state a chance to transition to FRMR_IS_FREEGerd Rausch2-1/+27
In the context of FRMR (ib_frmr.c): Memory regions make it onto the "clean_list" via "rds_ib_flush_mr_pool", after the memory region has been posted for invalidation via "rds_ib_post_inv". At that point in time, "fr_state" may still be in state "FRMR_IS_INUSE", since the only place where "fr_state" transitions to "FRMR_IS_FREE" is in "rds_ib_mr_cqe_handler", which is triggered by a tasklet. So in case we notice that "fr_state != FRMR_IS_FREE" (see below), we wait for "fr_inv_done" to trigger with a maximum of 10msec. Then we check again, and only put the memory region onto the drop_list (via "rds_ib_free_frmr") in case the situation remains unchanged. This avoids the problem of memory-regions bouncing between "clean_list" and "drop_list" before they even have a chance to be properly invalidated. Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-15Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds1-4/+4
Pull rdma updates from Jason Gunthorpe: "A smaller cycle this time. Notably we see another new driver, 'Soft iWarp', and the deletion of an ancient unused driver for nes. - Revise and simplify the signature offload RDMA MR APIs - More progress on hoisting object allocation boiler plate code out of the drivers - Driver bug fixes and revisions for hns, hfi1, efa, cxgb4, qib, i40iw - Tree wide cleanups: struct_size, put_user_page, xarray, rst doc conversion - Removal of obsolete ib_ucm chardev and nes driver - netlink based discovery of chardevs and autoloading of the modules providing them - Move more of the rdamvt/hfi1 uapi to include/uapi/rdma - New driver 'siw' for software based iWarp running on top of netdev, much like rxe's software RoCE. - mlx5 feature to report events in their raw devx format to userspace - Expose per-object counters through rdma tool - Adaptive interrupt moderation for RDMA (DIM), sharing the DIM core from netdev" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (194 commits) RMDA/siw: Require a 64 bit arch RDMA/siw: Mark expected switch fall-throughs RDMA/core: Fix -Wunused-const-variable warnings rdma/siw: Remove set but not used variable 's' rdma/siw: Add missing dependencies on LIBCRC32C and DMA_VIRT_OPS RDMA/siw: Add missing rtnl_lock around access to ifa rdma/siw: Use proper enumerated type in map_cqe_status RDMA/siw: Remove unnecessary kthread create/destroy printouts IB/rdmavt: Fix variable shadowing issue in rvt_create_cq RDMA/core: Fix race when resolving IP address RDMA/core: Make rdma_counter.h compile stand alone IB/core: Work on the caller socket net namespace in nldev_newlink() RDMA/rxe: Fill in wc byte_len with IB_WC_RECV_RDMA_WITH_IMM RDMA/mlx5: Set RDMA DIM to be enabled by default RDMA/nldev: Added configuration of RDMA dynamic interrupt moderation to netlink RDMA/core: Provide RDMA DIM support for ULPs linux/dim: Implement RDMA adaptive moderation (DIM) IB/mlx5: Report correctly tag matching rendezvous capability docs: infiniband: add it to the driver-api bookset IB/mlx5: Implement VHCA tunnel mechanism in DEVX ...
2019-07-09rds: avoid version downgrade to legitimate newer peer connectionsSantosh Shilimkar1-2/+4
Connections with legitimate tos values can get into usual connection race. It can result in consumer reject. We don't want tos value or protocol version to be demoted for such connections otherwise piers would end up different tos values which can results in no connection. Example a peer initiated connection with say tos 8 while usual connection racing can get downgraded to tos 0 which is not desirable. Patch fixes above issue introduced by commit commit d021fabf525f ("rds: rdma: add consumer reject") Reported-by: Yanjun Zhu <yanjun.zhu@oracle.com> Tested-by: Yanjun Zhu <yanjun.zhu@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2019-07-09rds: Return proper "tos" value to user-spaceGerd Rausch1-0/+1
The proper "tos" value needs to be returned to user-space (sockopt RDS_INFO_CONNECTIONS). Fixes: 3eb450367d08 ("rds: add type of service(tos) infrastructure") Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Reviewed-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2019-07-09rds: Accept peer connection reject messages due to incompatible versionGerd Rausch1-2/+3
Prior to commit d021fabf525ff ("rds: rdma: add consumer reject") function "rds_rdma_cm_event_handler_cmn" would always honor a rejected connection attempt by issuing a "rds_conn_drop". The commit mentioned above added a "break", eliminating the "fallthrough" case and made the "rds_conn_drop" rather conditional: Now it only happens if a "consumer defined" reject (i.e. "rdma_reject") carries an integer-value of "1" inside "private_data": if (!conn) break; err = (int *)rdma_consumer_reject_data(cm_id, event, &len); if (!err || (err && ((*err) == RDS_RDMA_REJ_INCOMPAT))) { pr_warn("RDS/RDMA: conn <%pI6c, %pI6c> rejected, dropping connection\n", &conn->c_laddr, &conn->c_faddr); conn->c_proposed_version = RDS_PROTOCOL_COMPAT_VERSION; rds_conn_drop(conn); } rdsdebug("Connection rejected: %s\n", rdma_reject_msg(cm_id, event->status)); break; /* FALLTHROUGH */ A number of issues are worth mentioning here: #1) Previous versions of the RDS code simply rejected a connection by calling "rdma_reject(cm_id, NULL, 0);" So the value of the payload in "private_data" will not be "1", but "0". #2) Now the code has become dependent on host byte order and sizing. If one peer is big-endian, the other is little-endian, or there's a difference in sizeof(int) (e.g. ILP64 vs LP64), the *err check does not work as intended. #3) There is no check for "len" to see if the data behind *err is even valid. Luckily, it appears that the "rdma_reject(cm_id, NULL, 0)" will always carry 148 bytes of zeroized payload. But that should probably not be relied upon here. #4) With the added "break;", we might as well drop the misleading "/* FALLTHROUGH */" comment. This commit does _not_ address issue #2, as the sender would have to agree on a byte order as well. Here is the sequence of messages in this observed error-scenario: Host-A is pre-QoS changes (excluding the commit mentioned above) Host-B is post-QoS changes (including the commit mentioned above) #1 Host-B issues a connection request via function "rds_conn_path_transition" connection state transitions to "RDS_CONN_CONNECTING" #2 Host-A rejects the incompatible connection request (from #1) It does so by calling "rdma_reject(cm_id, NULL, 0);" #3 Host-B receives an "RDMA_CM_EVENT_REJECTED" event (from #2) But since the code is changed in the way described above, it won't drop the connection here, simply because "*err == 0". #4 Host-A issues a connection request #5 Host-B receives an "RDMA_CM_EVENT_CONNECT_REQUEST" event and ends up calling "rds_ib_cm_handle_connect". But since the state is already in "RDS_CONN_CONNECTING" (as of #1) it will end up issuing a "rdma_reject" without dropping the connection: if (rds_conn_state(conn) == RDS_CONN_CONNECTING) { /* Wait and see - our connect may still be succeeding */ rds_ib_stats_inc(s_ib_connect_raced); } goto out; #6 Host-A receives an "RDMA_CM_EVENT_REJECTED" event (from #5), drops the connection and tries again (goto #4) until it gives up. Tested-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2019-07-09Revert "RDS: IB: split the mr registration and invalidation path"Gerd Rausch3-16/+8
This reverts commit 56012459310a1dbcc55c2dbf5500a9f7571402cb. RDS kept spinning inside function "rds_ib_post_reg_frmr", waiting for "i_fastreg_wrs" to become incremented: while (atomic_dec_return(&ibmr->ic->i_fastreg_wrs) <= 0) { atomic_inc(&ibmr->ic->i_fastreg_wrs); cpu_relax(); } Looking at the original commit: commit 56012459310a ("RDS: IB: split the mr registration and invalidation path") In there, the "rds_ib_mr_cqe_handler" was changed in the following way: void rds_ib_mr_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc) if (frmr->fr_inv) { frmr->fr_state = FRMR_IS_FREE; frmr->fr_inv = false; atomic_inc(&ic->i_fastreg_wrs); } else { atomic_inc(&ic->i_fastunreg_wrs); } It looks like it's got it exactly backwards: Function "rds_ib_post_reg_frmr" keeps track of the outstanding requests via "i_fastreg_wrs". Function "rds_ib_post_inv" keeps track of the outstanding requests via "i_fastunreg_wrs" (post original commit). It also sets: frmr->fr_inv = true; However the completion handler "rds_ib_mr_cqe_handler" adjusts "i_fastreg_wrs" when "fr_inv" had been true, and adjusts "i_fastunreg_wrs" otherwise. The original commit was done in the name of performance: to remove the performance bottleneck No performance benefit could be observed with a fixed-up version of the original commit measured between two Oracle X7 servers, both equipped with Mellanox Connect-X5 HCAs. The prudent course of action is to revert this commit. Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2019-07-09rds: fix reordering with composite message notificationSantosh Shilimkar4-30/+14
RDS composite message(rdma + control) user notification needs to be triggered once the full message is delivered and such a fix was added as part of commit 941f8d55f6d61 ("RDS: RDMA: Fix the composite message user notification"). But rds_send_remove_from_sock is missing data part notify check and hence at times the user don't get notification which isn't desirable. One way is to fix the rds_send_remove_from_sock to check of that case but considering the ordering complexity with completion handler and rdma + control messages are always dispatched back to back in same send context, just delaying the signaled completion on rmda work request also gets the desired behaviour. i.e Notifying application only after RDMA + control message send completes. So patch updates the earlier fix with this approach. The delay signaling completions of rdma op till the control message send completes fix was done by Venkat Venkatsubra in downstream kernel. Reviewed-and-tested-by: Zhu Yanjun <yanjun.zhu@oracle.com> Reviewed-by: Gerd Rausch <gerd.rausch@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
2019-06-28Merge tag 'v5.2-rc6' into rdma.git for-nextJason Gunthorpe4-5/+11
For dependencies in next patches. Resolve conflicts: - Use uverbs_get_cleared_udata() with new cq allocation flow - Continue to delete nes despite SPDX conflict - Resolve list appends in mlx5_command_str() - Use u16 for vport_rule stuff - Resolve list appends in struct ib_client Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-06-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller3-5/+10
Some ISDN files that got removed in net-next had some changes done in mainline, take the removals. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06net: rds: fix memory leak in rds_ib_flush_mr_poolZhu Yanjun1-4/+6
When the following tests last for several hours, the problem will occur. Server: rds-stress -r 1.1.1.16 -D 1M Client: rds-stress -r 1.1.1.14 -s 1.1.1.16 -D 1M -T 30 The following will occur. " Starting up.... tsks tx/s rx/s tx+rx K/s mbi K/s mbo K/s tx us/c rtt us cpu % 1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00 1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00 1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00 1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00 " >From vmcore, we can find that clean_list is NULL. >From the source code, rds_mr_flushd calls rds_ib_mr_pool_flush_worker. Then rds_ib_mr_pool_flush_worker calls " rds_ib_flush_mr_pool(pool, 0, NULL); " Then in function " int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool, int free_all, struct rds_ib_mr **ibmr_ret) " ibmr_ret is NULL. In the source code, " ... list_to_llist_nodes(pool, &unmap_list, &clean_nodes, &clean_tail); if (ibmr_ret) *ibmr_ret = llist_entry(clean_nodes, struct rds_ib_mr, llnode); /* more than one entry in llist nodes */ if (clean_nodes->next) llist_add_batch(clean_nodes->next, clean_tail, &pool->clean_list); ... " When ibmr_ret is NULL, llist_entry is not executed. clean_nodes->next instead of clean_nodes is added in clean_list. So clean_nodes is discarded. It can not be used again. The workqueue is executed periodically. So more and more clean_nodes are discarded. Finally the clean_list is NULL. Then this problem will occur. Fixes: 1bc144b62524 ("net, rds, Replace xlist in net/rds/xlist.h with llist") Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05net: rds: fix memory leak when unload rds_rdmaZhu Yanjun2-1/+4
When KASAN is enabled, after several rds connections are created, then "rmmod rds_rdma" is run. The following will appear. " BUG rds_ib_incoming (Not tainted): Objects remaining in rds_ib_incoming on __kmem_cache_shutdown() Call Trace: dump_stack+0x71/0xab slab_err+0xad/0xd0 __kmem_cache_shutdown+0x17d/0x370 shutdown_cache+0x17/0x130 kmem_cache_destroy+0x1df/0x210 rds_ib_recv_exit+0x11/0x20 [rds_rdma] rds_ib_exit+0x7a/0x90 [rds_rdma] __x64_sys_delete_module+0x224/0x2c0 ? __ia32_sys_delete_module+0x2c0/0x2c0 do_syscall_64+0x73/0x190 entry_SYSCALL_64_after_hwframe+0x44/0xa9 " This is rds connection memory leak. The root cause is: When "rmmod rds_rdma" is run, rds_ib_remove_one will call rds_ib_dev_shutdown to drop the rds connections. rds_ib_dev_shutdown will call rds_conn_drop to drop rds connections as below. " rds_conn_path_drop(&conn->c_path[0], false); " In the above, destroy is set to false. void rds_conn_path_drop(struct rds_conn_path *cp, bool destroy) { atomic_set(&cp->cp_state, RDS_CONN_ERROR); rcu_read_lock(); if (!destroy && rds_destroy_pending(cp->cp_conn)) { rcu_read_unlock(); return; } queue_work(rds_wq, &cp->cp_down_w); rcu_read_unlock(); } In the above function, destroy is set to false. rds_destroy_pending is called. This does not move rds connections to ib_nodev_conns. So destroy is set to true to move rds connections to ib_nodev_conns. In rds_ib_unregister_client, flush_workqueue is called to make rds_wq finsh shutdown rds connections. The function rds_ib_destroy_nodev_conns is called to shutdown rds connections finally. Then rds_ib_recv_exit is called to destroy slab. void rds_ib_recv_exit(void) { kmem_cache_destroy(rds_ib_incoming_slab); kmem_cache_destroy(rds_ib_frag_slab); } The above slab memory leak will not occur again. >From tests, 256 rds connections [root@ca-dev14 ~]# time rmmod rds_rdma real 0m16.522s user 0m0.000s sys 0m8.152s 512 rds connections [root@ca-dev14 ~]# time rmmod rds_rdma real 0m32.054s user 0m0.000s sys 0m15.568s To rmmod rds_rdma with 256 rds connections, about 16 seconds are needed. And with 512 rds connections, about 32 seconds are needed. >From ftrace, when one rds connection is destroyed, " 19) | rds_conn_destroy [rds]() { 19) 7.782 us | rds_conn_path_drop [rds](); 15) | rds_shutdown_worker [rds]() { 15) | rds_conn_shutdown [rds]() { 15) 1.651 us | rds_send_path_reset [rds](); 15) 7.195 us | } 15) + 11.434 us | } 19) 2.285 us | rds_cong_remove_conn [rds](); 19) * 24062.76 us | } " So if many rds connections will be destroyed, this function rds_ib_destroy_nodev_conns uses most of time. Suggested-by: Håkon Bugge <haakon.bugge@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05net: rds: add per rds connection cache statisticsZhu Yanjun1-0/+2
The variable cache_allocs is to indicate how many frags (KiB) are in one rds connection frag cache. The command "rds-info -Iv" will output the rds connection cache statistics as below: " RDS IB Connections: LocalAddr RemoteAddr Tos SL LocalDev RemoteDev 1.1.1.14 1.1.1.14 58 255 fe80::2:c903:a:7a31 fe80::2:c903:a:7a31 send_wr=256, recv_wr=1024, send_sge=8, rdma_mr_max=4096, rdma_mr_size=257, cache_allocs=12 " This means that there are about 12KiB frag in this rds connection frag cache. Since rds.h in rds-tools is not related with the kernel rds.h, the change in kernel rds.h does not affect rds-tools. rds-info in rds-tools 2.0.5 and 2.0.6 is tested with this commit. It works well. Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-21rds: Don't check return value from destroy CQLeon Romanovsky1-4/+4
There is no value in checking ib_destroy_cq() result and skipping to clear struct ic fields. This connection needs to be reinitialized anyway. Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-05-21treewide: Add SPDX license identifier - Makefile/KconfigThomas Gleixner1-0/+1
Add SPDX license identifiers to all Make/Kconfig files which: - Have no license information of any form These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-14mm/gup: change GUP fast to use flags rather than a write 'bool'Ira Weiny2-2/+3
To facilitate additional options to get_user_pages_fast() change the singular write parameter to be gup_flags. This patch does not change any functionality. New functionality will follow in subsequent patches. Some of the get_user_pages_fast() call sites were unchanged because they already passed FOLL_WRITE or 0 for the write parameter. NOTE: It was suggested to change the ordering of the get_user_pages_fast() arguments to ensure that callers were converted. This breaks the current GUP call site convention of having the returned pages be the final parameter. So the suggestion was rejected. Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Mike Marshall <hubcap@omnibond.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-05net: rds: fix spelling mistake "syctl" -> "sysctl"Colin Ian King1-1/+1
There is a spelling mistake in a pr_warn warning. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01rds: ib: force endiannes annotationNicholas Mc Guire1-5/+3
While the endiannes is being handled correctly as indicated by the comment above the offending line - sparse was unhappy with the missing annotation as be64_to_cpu() expects a __be64 argument. To mitigate this annotation all involved variables are changed to a consistent __le64 and the conversion to uint64_t delayed to the call to rds_cong_map_updated(). Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-24net: rds: exchange of 8K and 1M poolZhu Yanjun2-3/+11
Before the commit 490ea5967b0d ("RDS: IB: move FMR code to its own file"), when the dirty_count is greater than 9/10 of max_items of 8K pool, 1M pool is used, Vice versa. After the commit 490ea5967b0d ("RDS: IB: move FMR code to its own file"), the above is removed. When we make the following tests. Server: rds-stress -r 1.1.1.16 -D 1M Client: rds-stress -r 1.1.1.14 -s 1.1.1.16 -D 1M The following will appear. " connecting to 1.1.1.16:4000 negotiated options, tasks will start in 2 seconds Starting up..header from 1.1.1.166:4001 to id 4001 bogus .. tsks tx/s rx/s tx+rx K/s mbi K/s mbo K/s tx us/c rtt us cpu % 1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00 1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00 1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00 1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00 1 0 0 0.00 0.00 0.00 0.00 0.00 -1.00 ... " So this exchange between 8K and 1M pool is added back. Fixes: commit 490ea5967b0d ("RDS: IB: move FMR code to its own file") Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12net/rds: Check address length before reading address familyTetsuo Handa2-0/+5
syzbot is reporting uninitialized value at rds_connect() [1] and rds_bind() [2]. This is because syzbot is passing ulen == 0 whereas these functions expect that it is safe to access sockaddr->family field in order to determine minimal address length for validation. [1] https://syzkaller.appspot.com/bug?id=f4e61c010416c1e6f0fa3ffe247561b60a50ad71 [2] https://syzkaller.appspot.com/bug?id=a4bf9e41b7e055c3823fdcd83e8c58ca7270e38f Reported-by: syzbot <syzbot+0049bebbf3042dbd2e8f@syzkaller.appspotmail.com> Reported-by: syzbot <syzbot+915c9f99f3dbc4bd6cd1@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28net: rds: force to destroy connection if t_sock is NULL in rds_tcp_kill_sock().Mao Wenan1-1/+1
When it is to cleanup net namespace, rds_tcp_exit_net() will call rds_tcp_kill_sock(), if t_sock is NULL, it will not call rds_conn_destroy(), rds_conn_path_destroy() and rds_tcp_conn_free() to free connection, and the worker cp_conn_w is not stopped, afterwards the net is freed in net_drop_ns(); While cp_conn_w rds_connect_worker() will call rds_tcp_conn_path_connect() and reference 'net' which has already been freed. In rds_tcp_conn_path_connect(), rds_tcp_set_callbacks() will set t_sock = sock before sock->ops->connect, but if connect() is failed, it will call rds_tcp_restore_callbacks() and set t_sock = NULL, if connect is always failed, rds_connect_worker() will try to reconnect all the time, so rds_tcp_kill_sock() will never to cancel worker cp_conn_w and free the connections. Therefore, the condition !tc->t_sock is not needed if it is going to do cleanup_net->rds_tcp_exit_net->rds_tcp_kill_sock, because tc->t_sock is always NULL, and there is on other path to cancel cp_conn_w and free connection. So this patch is to fix this. rds_tcp_kill_sock(): ... if (net != c_net || !tc->t_sock) ... Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> ================================================================== BUG: KASAN: use-after-free in inet_create+0xbcc/0xd28 net/ipv4/af_inet.c:340 Read of size 4 at addr ffff8003496a4684 by task kworker/u8:4/3721 CPU: 3 PID: 3721 Comm: kworker/u8:4 Not tainted 5.1.0 #11 Hardware name: linux,dummy-virt (DT) Workqueue: krdsd rds_connect_worker Call trace: dump_backtrace+0x0/0x3c0 arch/arm64/kernel/time.c:53 show_stack+0x28/0x38 arch/arm64/kernel/traps.c:152 __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x120/0x188 lib/dump_stack.c:113 print_address_description+0x68/0x278 mm/kasan/report.c:253 kasan_report_error mm/kasan/report.c:351 [inline] kasan_report+0x21c/0x348 mm/kasan/report.c:409 __asan_report_load4_noabort+0x30/0x40 mm/kasan/report.c:429 inet_create+0xbcc/0xd28 net/ipv4/af_inet.c:340 __sock_create+0x4f8/0x770 net/socket.c:1276 sock_create_kern+0x50/0x68 net/socket.c:1322 rds_tcp_conn_path_connect+0x2b4/0x690 net/rds/tcp_connect.c:114 rds_connect_worker+0x108/0x1d0 net/rds/threads.c:175 process_one_work+0x6e8/0x1700 kernel/workqueue.c:2153 worker_thread+0x3b0/0xdd0 kernel/workqueue.c:2296 kthread+0x2f0/0x378 kernel/kthread.c:255 ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:1117 Allocated by task 687: save_stack mm/kasan/kasan.c:448 [inline] set_track mm/kasan/kasan.c:460 [inline] kasan_kmalloc+0xd4/0x180 mm/kasan/kasan.c:553 kasan_slab_alloc+0x14/0x20 mm/kasan/kasan.c:490 slab_post_alloc_hook mm/slab.h:444 [inline] slab_alloc_node mm/slub.c:2705 [inline] slab_alloc mm/slub.c:2713 [inline] kmem_cache_alloc+0x14c/0x388 mm/slub.c:2718 kmem_cache_zalloc include/linux/slab.h:697 [inline] net_alloc net/core/net_namespace.c:384 [inline] copy_net_ns+0xc4/0x2d0 net/core/net_namespace.c:424 create_new_namespaces+0x300/0x658 kernel/nsproxy.c:107 unshare_nsproxy_namespaces+0xa0/0x198 kernel/nsproxy.c:206 ksys_unshare+0x340/0x628 kernel/fork.c:2577 __do_sys_unshare kernel/fork.c:2645 [inline] __se_sys_unshare kernel/fork.c:2643 [inline] __arm64_sys_unshare+0x38/0x58 kernel/fork.c:2643 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall arch/arm64/kernel/syscall.c:47 [inline] el0_svc_common+0x168/0x390 arch/arm64/kernel/syscall.c:83 el0_svc_handler+0x60/0xd0 arch/arm64/kernel/syscall.c:129 el0_svc+0x8/0xc arch/arm64/kernel/entry.S:960 Freed by task 264: save_stack mm/kasan/kasan.c:448 [inline] set_track mm/kasan/kasan.c:460 [inline] __kasan_slab_free+0x114/0x220 mm/kasan/kasan.c:521 kasan_slab_free+0x10/0x18 mm/kasan/kasan.c:528 slab_free_hook mm/slub.c:1370 [inline] slab_free_freelist_hook mm/slub.c:1397 [inline] slab_free mm/slub.c:2952 [inline] kmem_cache_free+0xb8/0x3a8 mm/slub.c:2968 net_free net/core/net_namespace.c:400 [inline] net_drop_ns.part.6+0x78/0x90 net/core/net_namespace.c:407 net_drop_ns net/core/net_namespace.c:406 [inline] cleanup_net+0x53c/0x6d8 net/core/net_namespace.c:569 process_one_work+0x6e8/0x1700 kernel/workqueue.c:2153 worker_thread+0x3b0/0xdd0 kernel/workqueue.c:2296 kthread+0x2f0/0x378 kernel/kthread.c:255 ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:1117 The buggy address belongs to the object at ffff8003496a3f80 which belongs to the cache net_namespace of size 7872 The buggy address is located 1796 bytes inside of 7872-byte region [ffff8003496a3f80, ffff8003496a5e40) The buggy address belongs to the page: page:ffff7e000d25a800 count:1 mapcount:0 mapping:ffff80036ce4b000 index:0x0 compound_mapcount: 0 flags: 0xffffe0000008100(slab|head) raw: 0ffffe0000008100 dead000000000100 dead000000000200 ffff80036ce4b000 raw: 0000000000000000 0000000080040004 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8003496a4580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8003496a4600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff8003496a4680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8003496a4700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8003496a4780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Fixes: 467fa15356ac("RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-09Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds5-27/+20
Pull rdma updates from Jason Gunthorpe: "This has been a slightly more active cycle than normal with ongoing core changes and quite a lot of collected driver updates. - Various driver fixes for bnxt_re, cxgb4, hns, mlx5, pvrdma, rxe - A new data transfer mode for HFI1 giving higher performance - Significant functional and bug fix update to the mlx5 On-Demand-Paging MR feature - A chip hang reset recovery system for hns - Change mm->pinned_vm to an atomic64 - Update bnxt_re to support a new 57500 chip - A sane netlink 'rdma link add' method for creating rxe devices and fixing the various unregistration race conditions in rxe's unregister flow - Allow lookup up objects by an ID over netlink - Various reworking of the core to driver interface: - drivers should not assume umem SGLs are in PAGE_SIZE chunks - ucontext is accessed via udata not other means - start to make the core code responsible for object memory allocation - drivers should convert struct device to struct ib_device via a helper - drivers have more tools to avoid use after unregister problems" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (280 commits) net/mlx5: ODP support for XRC transport is not enabled by default in FW IB/hfi1: Close race condition on user context disable and close RDMA/umem: Revert broken 'off by one' fix RDMA/umem: minor bug fix in error handling path RDMA/hns: Use GFP_ATOMIC in hns_roce_v2_modify_qp cxgb4: kfree mhp after the debug print IB/rdmavt: Fix concurrency panics in QP post_send and modify to error IB/rdmavt: Fix loopback send with invalidate ordering IB/iser: Fix dma_nents type definition IB/mlx5: Set correct write permissions for implicit ODP MR bnxt_re: Clean cq for kernel consumers only RDMA/uverbs: Don't do double free of allocated PD RDMA: Handle ucontext allocations by IB/core RDMA/core: Fix a WARN() message bnxt_re: fix the regression due to changes in alloc_pbl IB/mlx4: Increase the timeout for CM cache IB/core: Abort page fault handler silently during owning process exit IB/mlx5: Validate correct PD before prefetch MR IB/mlx5: Protect against prefetch of invalid MR RDMA/uverbs: Store PR pointer before it is overwritten ...
2019-02-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-3/+3
An ipvlan bug fix in 'net' conflicted with the abstraction away of the IPV6 specific support in 'net-next'. Similarly, a bug fix for mlx5 in 'net' conflicted with the flow action conversion in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-04rds: rdma: update rdma transport for tosSantosh Shilimkar6-22/+29
For RDMA transports, RDS TOS is an extension of IB QoS(Annex A13) to provide clients the ability to segregate traffic flows for different type of data. RDMA CM abstract it for ULPs using rdma_set_service_type(). Internally, each traffic flow is represented by a connection with all of its independent resources like that of a normal connection, and is differentiated by service type. In other words, there can be multiple qp connections between an IP pair and each supports a unique service type. The feature has been added from RDSv4.1 onwards and supports rolling upgrades. RDMA connection metadata also carries the tos information to set up SL on end to end context. The original code was developed by Bang Nguyen in downstream kernel back in 2.6.32 kernel days and it has evolved over period of time. Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> [yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes] Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
2019-02-04rds: add transport specific tos_map hookSantosh Shilimkar4-4/+24
RDMA transport maps user tos to underline virtual lanes(VL) for IB or DSCP values. RDMA CM transport abstract thats for RDS. TCP transport makes use of default priority 0 and maps all user tos values to it. Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> [yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes] Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
2019-02-04rds: add type of service(tos) infrastructureSantosh Shilimkar10-17/+61
RDS Service type (TOS) is user-defined and needs to be configured via RDS IOCTL interface. It must be set before initiating any traffic and once set the TOS can not be changed. All out-going traffic from the socket will be associated with its TOS. Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> [yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes] Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
2019-02-04rds: rdma: add consumer rejectSantosh Shilimkar3-2/+22
For legacy protocol version incompatibility with non linux RDS, consumer reject reason being used to convey it to peer. But the choice of reject reason value as '1' was really poor. Anyway for interoperability reasons with shipping products, it needs to be supported. For any future versions, properly encoded reject reason should to be used. Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> [yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes] Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
2019-02-04rds: make v3.1 as compat versionSantosh Shilimkar4-17/+29
Mark RDSv3.1 as compat version and add v4.1 version macro's. Subsequent patches enable TOS(Type of Service) feature which is tied with v4.1 for RDMA transport. Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> [yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes] Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
2019-02-04Merge tag 'v5.0-rc5' into rdma.git for-nextJason Gunthorpe4-9/+5
Linux 5.0-rc5 Needed to merge the include/uapi changes so we have an up to date single-tree for these files. Patches already posted are also expected to need this for dependencies.
2019-02-04IB/core: Remove ib_sg_dma_address() and ib_sg_dma_len()Bart Van Assche5-27/+20
Keeping single line wrapper functions is not useful. Hence remove the ib_sg_dma_address() and ib_sg_dma_len() functions. This patch does not change any functionality. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2019-02-03socket: Add SO_TIMESTAMP[NS]_NEWDeepa Dinamani2-4/+20
Add SO_TIMESTAMP_NEW and SO_TIMESTAMPNS_NEW variants of socket timestamp options. These are the y2038 safe versions of the SO_TIMESTAMP_OLD and SO_TIMESTAMPNS_OLD for all architectures. Note that the format of scm_timestamping.ts[0] is not changed in this patch. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Cc: jejb@parisc-linux.org Cc: ralf@linux-mips.org Cc: rth@twiddle.net Cc: linux-alpha@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linux-parisc@vger.kernel.org Cc: linux-rdma@vger.kernel.org Cc: netdev@vger.kernel.org Cc: sparclinux@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03socket: Use old_timeval types for socket timestampsDeepa Dinamani1-1/+1
As part of y2038 solution, all internal uses of struct timeval are replaced by struct __kernel_old_timeval and struct compat_timeval by struct old_timeval32. Make socket timestamps use these new types. This is mainly to be able to verify that the kernel build is y2038 safe when such non y2038 safe types are not supported anymore. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Cc: isdn@linux-pingi.de Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03sockopt: Rename SO_TIMESTAMP* to SO_TIMESTAMP*_OLDDeepa Dinamani2-2/+2
SO_TIMESTAMP, SO_TIMESTAMPNS and SO_TIMESTAMPING options, the way they are currently defined, are not y2038 safe. Subsequent patches in the series add new y2038 safe versions of these options which provide 64 bit timestamps on all architectures uniformly. Hence, rename existing options with OLD tag suffixes. Also note that kernel will not use the untagged SO_TIMESTAMP* and SCM_TIMESTAMP* options internally anymore. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Cc: deller@gmx.de Cc: dhowells@redhat.com Cc: jejb@parisc-linux.org Cc: ralf@linux-mips.org Cc: rth@twiddle.net Cc: linux-afs@lists.infradead.org Cc: linux-alpha@vger.kernel.org Cc: linux-arch@vger.kernel.org Cc: linux-mips@linux-mips.org Cc: linux-parisc@vger.kernel.org Cc: linux-rdma@vger.kernel.org Cc: sparclinux@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-31rds: fix refcount bug in rds_sock_addrefEric Dumazet1-3/+3
syzbot was able to catch a bug in rds [1] The issue here is that the socket might be found in a hash table but that its refcount has already be set to 0 by another cpu. We need to use refcount_inc_not_zero() to be safe here. [1] refcount_t: increment on 0; use-after-free. WARNING: CPU: 1 PID: 23129 at lib/refcount.c:153 refcount_inc_checked lib/refcount.c:153 [inline] WARNING: CPU: 1 PID: 23129 at lib/refcount.c:153 refcount_inc_checked+0x61/0x70 lib/refcount.c:151 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 23129 Comm: syz-executor3 Not tainted 5.0.0-rc4+ #53 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1db/0x2d0 lib/dump_stack.c:113 panic+0x2cb/0x65c kernel/panic.c:214 __warn.cold+0x20/0x48 kernel/panic.c:571 report_bug+0x263/0x2b0 lib/bug.c:186 fixup_bug arch/x86/kernel/traps.c:178 [inline] fixup_bug arch/x86/kernel/traps.c:173 [inline] do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271 do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:290 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973 RIP: 0010:refcount_inc_checked lib/refcount.c:153 [inline] RIP: 0010:refcount_inc_checked+0x61/0x70 lib/refcount.c:151 Code: 1d 51 63 c8 06 31 ff 89 de e8 eb 1b f2 fd 84 db 75 dd e8 a2 1a f2 fd 48 c7 c7 60 9f 81 88 c6 05 31 63 c8 06 01 e8 af 65 bb fd <0f> 0b eb c1 90 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 54 49 RSP: 0018:ffff8880a0cbf1e8 EFLAGS: 00010282 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc90006113000 RDX: 000000000001047d RSI: ffffffff81685776 RDI: 0000000000000005 RBP: ffff8880a0cbf1f8 R08: ffff888097c9e100 R09: ffffed1015ce5021 R10: ffffed1015ce5020 R11: ffff8880ae728107 R12: ffff8880723c20c0 R13: ffff8880723c24b0 R14: dffffc0000000000 R15: ffffed1014197e64 sock_hold include/net/sock.h:647 [inline] rds_sock_addref+0x19/0x20 net/rds/af_rds.c:675 rds_find_bound+0x97c/0x1080 net/rds/bind.c:82 rds_recv_incoming+0x3be/0x1430 net/rds/recv.c:362 rds_loop_xmit+0xf3/0x2a0 net/rds/loop.c:96 rds_send_xmit+0x1355/0x2a10 net/rds/send.c:355 rds_sendmsg+0x323c/0x44e0 net/rds/send.c:1368 sock_sendmsg_nosec net/socket.c:621 [inline] sock_sendmsg+0xdd/0x130 net/socket.c:631 __sys_sendto+0x387/0x5f0 net/socket.c:1788 __do_sys_sendto net/socket.c:1800 [inline] __se_sys_sendto net/socket.c:1796 [inline] __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1796 do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x458089 Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fc266df8c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 0000000000458089 RDX: 0000000000000000 RSI: 00000000204b3fff RDI: 0000000000000005 RBP: 000000000073bf00 R08: 00000000202b4000 R09: 0000000000000010 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fc266df96d4 R13: 00000000004c56e4 R14: 00000000004d94a8 R15: 00000000ffffffff Fixes: cc4dfb7f70a3 ("rds: fix two RCU related problems") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Sowmini Varadhan <sowmini.varadhan@oracle.com> Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com> Cc: rds-devel@oss.oracle.com Cc: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-07rds: use DIV_ROUND_UP instead of ceilJacob Wen4-9/+5
Yes indeed, DIV_ROUND_UP is in kernel.h. Signed-off-by: Jacob Wen <jian.w.wen@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds1-1/+1
Pull networking fixes from David Miller: "Several fixes here. Basically split down the line between newly introduced regressions and long existing problems: 1) Double free in tipc_enable_bearer(), from Cong Wang. 2) Many fixes to nf_conncount, from Florian Westphal. 3) op->get_regs_len() can throw an error, check it, from Yunsheng Lin. 4) Need to use GFP_ATOMIC in *_add_hash_mac_address() of fsl/fman driver, from Scott Wood. 5) Inifnite loop in fib_empty_table(), from Yue Haibing. 6) Use after free in ax25_fillin_cb(), from Cong Wang. 7) Fix socket locking in nr_find_socket(), also from Cong Wang. 8) Fix WoL wakeup enable in r8169, from Heiner Kallweit. 9) On 32-bit sock->sk_stamp is not thread-safe, from Deepa Dinamani. 10) Fix ptr_ring wrap during queue swap, from Cong Wang. 11) Missing shutdown callback in hinic driver, from Xue Chaojing. 12) Need to return NULL on error from ip6_neigh_lookup(), from Stefano Brivio. 13) BPF out of bounds speculation fixes from Daniel Borkmann" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (57 commits) ipv6: Consider sk_bound_dev_if when binding a socket to an address ipv6: Fix dump of specific table with strict checking bpf: add various test cases to selftests bpf: prevent out of bounds speculation on pointer arithmetic bpf: fix check_map_access smin_value test when pointer contains offset bpf: restrict unknown scalars of mixed signed bounds for unprivileged bpf: restrict stack pointer arithmetic for unprivileged bpf: restrict map value pointer arithmetic for unprivileged bpf: enable access to ax register also from verifier rewrite bpf: move tmp variable into ax register in interpreter bpf: move {prev_,}insn_idx into verifier env isdn: fix kernel-infoleak in capi_unlocked_ioctl ipv6: route: Fix return value of ip6_neigh_lookup() on neigh_create() error net/hamradio/6pack: use mod_timer() to rearm timers net-next/hinic:add shutdown callback net: hns3: call hns3_nic_net_open() while doing HNAE3_UP_CLIENT ip: validate header length on virtual device xmit tap: call skb_probe_transport_header after setting skb->dev ptr_ring: wrap back ->producer in __ptr_ring_swap_queue() net: rds: remove unnecessary NULL check ...
2019-01-01net: rds: remove unnecessary NULL checkZhu Yanjun1-1/+1
In kfree, the NULL check is done. Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-28Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds1-2/+2
Pull rdma updates from Jason Gunthorpe: "This has been a fairly typical cycle, with the usual sorts of driver updates. Several series continue to come through which improve and modernize various parts of the core code, and we finally are starting to get the uAPI command interface cleaned up. - Various driver fixes for bnxt_re, cxgb3/4, hfi1, hns, i40iw, mlx4, mlx5, qib, rxe, usnic - Rework the entire syscall flow for uverbs to be able to run over ioctl(). Finally getting past the historic bad choice to use write() for command execution - More functional coverage with the mlx5 'devx' user API - Start of the HFI1 series for 'TID RDMA' - SRQ support in the hns driver - Support for new IBTA defined 2x lane widths - A big series to consolidate all the driver function pointers into a big struct and have drivers provide a 'static const' version of the struct instead of open coding initialization - New 'advise_mr' uAPI to control device caching/loading of page tables - Support for inline data in SRPT - Modernize how umad uses the driver core and creates cdev's and sysfs files - First steps toward removing 'uobject' from the view of the drivers" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (193 commits) RDMA/srpt: Use kmem_cache_free() instead of kfree() RDMA/mlx5: Signedness bug in UVERBS_HANDLER() IB/uverbs: Signedness bug in UVERBS_HANDLER() IB/mlx5: Allocate the per-port Q counter shared when DEVX is supported IB/umad: Start using dev_groups of class IB/umad: Use class_groups and let core create class file IB/umad: Refactor code to use cdev_device_add() IB/umad: Avoid destroying device while it is accessed IB/umad: Simplify and avoid dynamic allocation of class IB/mlx5: Fix wrong error unwind IB/mlx4: Remove set but not used variable 'pd' RDMA/iwcm: Don't copy past the end of dev_name() string IB/mlx5: Fix long EEH recover time with NVMe offloads IB/mlx5: Simplify netdev unbinding IB/core: Move query port to ioctl RDMA/nldev: Expose port_cap_flags2 IB/core: uverbs copy to struct or zero helper IB/rxe: Reuse code which sets port state IB/rxe: Make counters thread safe IB/mlx5: Use the correct commands for UMEM and UCTX allocation ...
2018-12-19rds: Fix warning.David S. Miller1-1/+3
>> net/rds/send.c:1109:42: warning: Using plain integer as NULL pointer Fixes: ea010070d0a7 ("net/rds: fix warn in rds_message_alloc_sgs") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19net/rds: remove user triggered WARN_ON in rds_sendmsgshamir rabinovitch4-19/+29
per comment from Leon in rdma mailing list https://lkml.org/lkml/2018/10/31/312 : Please don't forget to remove user triggered WARN_ON. https://lwn.net/Articles/769365/ "Greg Kroah-Hartman raised the problem of core kernel API code that will use WARN_ON_ONCE() to complain about bad usage; that will not generate the desired result if WARN_ON_ONCE() is configured to crash the machine. He was told that the code should just call pr_warn() instead, and that the called function should return an error in such situations. It was generally agreed that any WARN_ON() or WARN_ON_ONCE() calls that can be triggered from user space need to be fixed." in addition harden rds_sendmsg to detect and overcome issues with invalid sg count and fail the sendmsg. Suggested-by: Leon Romanovsky <leon@kernel.org> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: shamir rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19net/rds: fix warn in rds_message_alloc_sgsshamir rabinovitch3-42/+91
redundant copy_from_user in rds_sendmsg system call expose rds to issue where rds_rdma_extra_size walk the rds iovec and and calculate the number pf pages (sgs) it need to add to the tail of rds message and later rds_cmsg_rdma_args copy the rds iovec again and re calculate the same number and get different result causing WARN_ON in rds_message_alloc_sgs. fix this by doing the copy_from_user only once per rds_sendmsg system call. When issue occur the below dump is seen: WARNING: CPU: 0 PID: 19789 at net/rds/message.c:316 rds_message_alloc_sgs+0x10c/0x160 net/rds/message.c:316 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 19789 Comm: syz-executor827 Not tainted 4.19.0-next-20181030+ #101 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x244/0x39d lib/dump_stack.c:113 panic+0x2ad/0x55c kernel/panic.c:188 __warn.cold.8+0x20/0x45 kernel/panic.c:540 report_bug+0x254/0x2d0 lib/bug.c:186 fixup_bug arch/x86/kernel/traps.c:178 [inline] do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271 do_invalid_op+0x36/0x40 arch/x86/kernel/traps.c:290 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:969 RIP: 0010:rds_message_alloc_sgs+0x10c/0x160 net/rds/message.c:316 Code: c0 74 04 3c 03 7e 6c 44 01 ab 78 01 00 00 e8 2b 9e 35 fa 4c 89 e0 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 14 9e 35 fa <0f> 0b 31 ff 44 89 ee e8 18 9f 35 fa 45 85 ed 75 1b e8 fe 9d 35 fa RSP: 0018:ffff8801c51b7460 EFLAGS: 00010293 RAX: ffff8801bc412080 RBX: ffff8801d7bf4040 RCX: ffffffff8749c9e6 RDX: 0000000000000000 RSI: ffffffff8749ca5c RDI: 0000000000000004 RBP: ffff8801c51b7490 R08: ffff8801bc412080 R09: ffffed003b5c5b67 R10: ffffed003b5c5b67 R11: ffff8801dae2db3b R12: 0000000000000000 R13: 000000000007165c R14: 000000000007165c R15: 0000000000000005 rds_cmsg_rdma_args+0x82d/0x1510 net/rds/rdma.c:623 rds_cmsg_send net/rds/send.c:971 [inline] rds_sendmsg+0x19a2/0x3180 net/rds/send.c:1273 sock_sendmsg_nosec net/socket.c:622 [inline] sock_sendmsg+0xd5/0x120 net/socket.c:632 ___sys_sendmsg+0x7fd/0x930 net/socket.c:2117 __sys_sendmsg+0x11d/0x280 net/socket.c:2155 __do_sys_sendmsg net/socket.c:2164 [inline] __se_sys_sendmsg net/socket.c:2162 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2162 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x44a859 Code: e8 dc e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 6b cb fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f1d4710ada8 EFLAGS: 00000297 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00000000006dcc28 RCX: 000000000044a859 RDX: 0000000000000000 RSI: 0000000020001600 RDI: 0000000000000003 RBP: 00000000006dcc20 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000297 R12: 00000000006dcc2c R13: 646e732f7665642f R14: 00007f1d4710b9c0 R15: 00000000006dcd2c Kernel Offset: disabled Rebooting in 86400 seconds.. Reported-by: syzbot+26de17458aeda9d305d8@syzkaller.appspotmail.com Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: shamir rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-12RDMA: Start use ib_device_opsKamal Heib1-2/+2
Make all the required change to start use the ib_device_ops structure. Signed-off-by: Kamal Heib <kamalheib1@gmail.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-10-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-3/+10
Conflicts were easy to resolve using immediate context mostly, except the cls_u32.c one where I simply too the entire HEAD chunk. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-10rds: RDS (tcp) hangs on sendto() to unresponding addressKa-Cheong Poon1-3/+10
In rds_send_mprds_hash(), if the calculated hash value is non-zero and the MPRDS connections are not yet up, it will wait. But it should not wait if the send is non-blocking. In this case, it should just use the base c_path for sending the message. Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-25Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+1
Version bump conflict in batman-adv, take what's in net-next. iavf conflict, adjustment of netdev_ops in net-next conflicting with poll controller method removal in net. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-23rds: Fix build regression.David S. Miller1-1/+1
Use DECLARE_* not DEFINE_* Fixes: 8360ed6745df ("RDS: IB: Use DEFINE_PER_CPU_SHARED_ALIGNED for rds_ib_stats") Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21RDS: IB: Use DEFINE_PER_CPU_SHARED_ALIGNED for rds_ib_statsNathan Chancellor1-1/+1
Clang warns when two declarations' section attributes don't match. net/rds/ib_stats.c:40:1: warning: section does not match previous declaration [-Wsection] DEFINE_PER_CPU_SHARED_ALIGNED(struct rds_ib_statistics, rds_ib_stats); ^ ./include/linux/percpu-defs.h:142:2: note: expanded from macro 'DEFINE_PER_CPU_SHARED_ALIGNED' DEFINE_PER_CPU_SECTION(type, name, PER_CPU_SHARED_ALIGNED_SECTION) \ ^ ./include/linux/percpu-defs.h:93:9: note: expanded from macro 'DEFINE_PER_CPU_SECTION' extern __PCPU_ATTRS(sec) __typeof__(type) name; \ ^ ./include/linux/percpu-defs.h:49:26: note: expanded from macro '__PCPU_ATTRS' __percpu __attribute__((section(PER_CPU_BASE_SECTION sec))) \ ^ net/rds/ib.h:446:1: note: previous attribute is here DECLARE_PER_CPU(struct rds_ib_statistics, rds_ib_stats); ^ ./include/linux/percpu-defs.h:111:2: note: expanded from macro 'DECLARE_PER_CPU' DECLARE_PER_CPU_SECTION(type, name, "") ^ ./include/linux/percpu-defs.h:87:9: note: expanded from macro 'DECLARE_PER_CPU_SECTION' extern __PCPU_ATTRS(sec) __typeof__(type) name ^ ./include/linux/percpu-defs.h:49:26: note: expanded from macro '__PCPU_ATTRS' __percpu __attribute__((section(PER_CPU_BASE_SECTION sec))) \ ^ 1 warning generated. The initial definition was added in commit ec16227e1414 ("RDS/IB: Infiniband transport") and the cache aligned definition was added in commit e6babe4cc4ce ("RDS/IB: Stats and sysctls") right after. The definition probably should have been updated in net/rds/ib.h, which is what this patch does. Link: https://github.com/ClangBuiltLinux/linux/issues/114 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17net: rds: use memset to optimize the recvZhu Yanjun1-4/+1
The function rds_inc_init is in recv process. To use memset can optimize the function rds_inc_init. The test result: Before: 1) + 24.950 us | rds_inc_init [rds](); After: 1) + 10.990 us | rds_inc_init [rds](); Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller1-1/+4
2018-09-12rds: fix two RCU related problemsCong Wang1-1/+4
When a rds sock is bound, it is inserted into the bind_hash_table which is protected by RCU. But when releasing rds sock, after it is removed from this hash table, it is freed immediately without respecting RCU grace period. This could cause some use-after-free as reported by syzbot. Mark the rds sock with SOCK_RCU_FREE before inserting it into the bind_hash_table, so that it would be always freed after a RCU grace period. The other problem is in rds_find_bound(), the rds sock could be freed in between rhashtable_lookup_fast() and rds_sock_addref(), so we need to extend RCU read lock protection in rds_find_bound() to close this race condition. Reported-and-tested-by: syzbot+8967084bcac563795dc6@syzkaller.appspotmail.com Reported-by: syzbot+93a5839deb355537440f@syzkaller.appspotmail.com Cc: Sowmini Varadhan <sowmini.varadhan@oracle.com> Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com> Cc: rds-devel@oss.oracle.com Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oarcle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2-8/+3
2018-09-01rds: store socket timestamps as ktime_tArnd Bergmann2-9/+7
rds is the last in-kernel user of the old do_gettimeofday() function. Convert it over to ktime_get_real() to make it work more like the generic socket timestamps, and to let us kill off do_gettimeofday(). A follow-up patch will have to change the user space interface to deal better with 32-bit tasks, which may use an incompatible layout for 'struct timespec'. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-31net/rds: RDS is not Radio Data SystemPavel Machek1-1/+1
Getting prompt "The RDS Protocol" (RDS) is not too helpful, and it is easily confused with Radio Data System (which we may want to support in kernel, too). Signed-off-by: Pavel Machek <pavel@ucw.cz> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-27net/rds: Use rdma_read_gids to get connection SGID/DGID in IPv6Zhu Yanjun1-7/+2
In IPv4, the newly introduced rdma_read_gids is used to read the SGID/DGID for the connection which returns GID correctly for RoCE transport as well. In IPv6, rdma_read_gids is also used. The following are why rdma_read_gids is introduced. rdma_addr_get_dgid() for RoCE for client side connections returns MAC address, instead of DGID. rdma_addr_get_sgid() for RoCE doesn't return correct SGID for IPv6 and when more than one IP address is assigned to the netdevice. So the transport agnostic rdma_read_gids() API is provided by rdma_cm module. Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21rds: tcp: remove duplicated include from tcp.cYue Haibing1-1/+0
Remove duplicated include. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-16Merge branch 'linus/master' into rdma.git for-nextJason Gunthorpe30-441/+1600
rdma.git merge resolution for the 4.19 merge window Conflicts: drivers/infiniband/core/rdma_core.c - Use the rdma code and revise with the new spelling for atomic_fetch_add_unless drivers/nvme/host/rdma.c - Replace max_sge with max_send_sge in new blk code drivers/nvme/target/rdma.c - Use the blk code and revise to use NULL for ib_post_recv when appropriate - Replace max_sge with max_recv_sge in new blk code net/rds/ib_send.c - Use the net code and revise to use NULL for ib_post_recv when appropriate Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-16Merge tag 'v4.18' into rdma.git for-nextJason Gunthorpe9-21/+107
Resolve merge conflicts from the -rc cycle against the rdma.git tree: Conflicts: drivers/infiniband/core/uverbs_cmd.c - New ifs added to ib_uverbs_ex_create_flow in -rc and for-next - Merge removal of file->ucontext in for-next with new code in -rc drivers/infiniband/core/uverbs_main.c - for-next removed code from ib_uverbs_write() that was modified in for-rc Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-14rds: fix building with IPV6=mArnd Bergmann1-0/+1
When CONFIG_RDS_TCP is built-in and CONFIG_IPV6 is a loadable module, we get a link error agains the modular ipv6_chk_addr() function: net/rds/tcp.o: In function `rds_tcp_laddr_check': tcp.c:(.text+0x3b2): undefined reference to `ipv6_chk_addr' This adds back a dependency that forces RDS_TCP to also be a loadable module when IPV6 is one. Fixes: e65d4d96334e ("rds: Remove IPv6 dependency") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07RDS: IB: fix 'passing zero to ERR_PTR()' warningYueHaibing1-0/+1
Fix a static code checker warning: net/rds/ib_frmr.c:82 rds_ib_alloc_frmr() warn: passing zero to 'ERR_PTR' The error path for ib_alloc_mr failure should set err to PTR_ERR. Fixes: 1659185fb4d0 ("RDS: IB: Support Fastreg MR (FRMR) memory registration mode") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-02Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller6-20/+39
The BTF conflicts were simple overlapping changes. The virtio_net conflict was an overlap of a fix of statistics counter, happening alongisde a move over to a bonafide statistics structure rather than counting value on the stack. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01rds: remove redundant variable 'rds_ibdev'YueHaibing1-3/+0
Variable 'rds_ibdev' is being assigned but never used, so can be removed. fix this clang warning: net/rds/ib_send.c:762:24: warning: variable ‘rds_ibdev’ set but not used [-Wunused-but-set-variable] Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01rds: Remove IPv6 dependencyKa-Cheong Poon12-26/+140
This patch removes the IPv6 dependency from RDS. Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01rds: rds_ib_recv_alloc_cache() should call alloc_percpu_gfp() insteadKa-Cheong Poon3-7/+7
Currently, rds_ib_conn_alloc() calls rds_ib_recv_alloc_caches() without passing along the gfp_t flag. But rds_ib_recv_alloc_caches() and rds_ib_recv_alloc_cache() should take a gfp_t parameter so that rds_ib_recv_alloc_cache() can call alloc_percpu_gfp() using the correct flag instead of calling alloc_percpu(). Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-30RDMA, core and ULPs: Declare ib_post_send() and ib_post_recv() arguments constBart Van Assche1-3/+3
Since neither ib_post_send() nor ib_post_recv() modify the data structure their second argument points at, declare that argument const. This change makes it necessary to declare the 'bad_wr' argument const too and also to modify all ULPs that call ib_post_send(), ib_post_recv() or ib_post_srq_recv(). This patch does not change any functionality but makes it possible for the compiler to verify whether the ib_post_(send|recv|srq_recv) really do not modify the posted work request. To make this possible, only one cast had to be introduce that casts away constness, namely in rpcrdma_post_recvs(). The only way I can think of to avoid that cast is to introduce an additional loop in that function or to change the data type of bad_wr from struct ib_recv_wr ** into int (an index that refers to an element in the work request list). However, both approaches would require even more extensive changes than this patch. Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-07-27net/rds/Kconfig: Correct the RDS dependsAnders Roxell1-1/+1
Remove prefix 'CONFIG_' from CONFIG_IPV6 Fixes: ba7d7e2677c0 ("net/rds/Kconfig: RDS should depend on IPV6") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>