.. SPDX-License-Identifier: GPL-2.0

RDS Architecture
================

  * Addressing

    RDS uses IPv4 addresses and 16-bit port numbers to identify
    the end point of a connection. All socket operations that involve
    passing addresses between kernel and user space generally
    use a struct sockaddr_in.

    The fact that IPv4 addresses are used does not mean the underlying
    transport has to be IP-based. In fact, RDS over IB uses a reliable
    IB connection; the IP address is used exclusively to locate the
    remote node's GID (by ARPing for the given IP).

    The port space is entirely independent of UDP, TCP or any other
    protocol.

  * Socket interface

    RDS sockets work *mostly* as you would expect from a BSD socket.
    The next section will cover the details. At any rate, all I/O is
    performed through the standard BSD socket API. Some additions,
    like zerocopy support, are implemented through control messages,
    while other extensions use the getsockopt/setsockopt calls.

    Sockets must be bound before you can send or receive data. This is
    needed because binding also selects a transport and attaches it to
    the socket. Once bound, the transport assignment does not change.
    RDS will tolerate IPs moving around (e.g. in an active-active HA
    scenario), but only as long as the address doesn't move to a
    different transport.

  * sysctls

    RDS supports a number of sysctls in /proc/sys/net/rds


Socket Interface
================

  AF_RDS, PF_RDS, SOL_RDS
      AF_RDS and PF_RDS are the domain type to be used with socket(2)
      to create RDS sockets. SOL_RDS is the socket level to be used
      with setsockopt(2) and getsockopt(2) for RDS specific socket
      options.

  fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
      This creates a new, unbound RDS socket.

  setsockopt(SOL_SOCKET): send and receive buffer size
      RDS honors the send and receive buffer size socket options.
      You are not allowed to queue more than SO_SNDSIZE bytes to a
      socket. A message is queued when sendmsg is called, and it
      leaves the queue when the remote system acknowledges its
      arrival.

      The SO_RCVSIZE option controls the maximum receive queue length.
      This is a soft limit rather than a hard limit - RDS will continue
      to accept and queue incoming messages, even if that takes the
      queue length over the limit. However, it will also mark the port
      as "congested" and send a congestion update to the source node.
      The source node is supposed to throttle any processes sending to
      this congested port.

  bind(fd, &sockaddr_in, ...)
      This binds the socket to a local IP address and port, and a
      transport, if one has not already been selected via the
      SO_RDS_TRANSPORT socket option.

  sendmsg(fd, ...)
      Sends a message to the indicated recipient. The kernel will
      transparently establish the underlying reliable connection
      if it isn't up yet.

      An attempt to send a message that exceeds SO_SNDSIZE will
      return with -EMSGSIZE.

      An attempt to send a message that would take the total number
      of queued bytes over the SO_SNDSIZE threshold will return
      EAGAIN.

      An attempt to send a message to a destination that is marked
      as "congested" will return ENOBUFS.

  recvmsg(fd, ...)
      Receives a message that was queued to this socket. The socket's
      recv queue accounting is adjusted, and if the queue length drops
      below SO_SNDSIZE, the port is marked uncongested, and a
      congestion update is sent to all peers.

      Applications can ask the RDS kernel module to receive
      notifications via control messages (for instance, there is a
      notification when a congestion update arrives, or when an RDMA
      operation completes). These notifications are received through
      the msg.msg_control buffer of struct msghdr. The format of the
      messages is described in manpages.
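
  Putting the calls above together, the following is a minimal
  user-space sketch that creates an RDS socket, binds it, and
  exchanges one datagram. The addresses and ports are placeholders,
  the AF_RDS/PF_RDS fallback defines are only needed where the libc
  headers do not provide them, and error handling is abbreviated::

      /* Minimal sketch: create, bind, send and receive one RDS datagram. */
      #include <stdio.h>
      #include <unistd.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <arpa/inet.h>

      #ifndef AF_RDS
      #define AF_RDS 21               /* fallback for older libc headers */
      #endif
      #ifndef PF_RDS
      #define PF_RDS AF_RDS
      #endif

      int main(void)
      {
              int fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
              if (fd < 0) {
                      perror("socket");
                      return 1;
              }

              /* Binding selects the transport; send/recv fail until bound. */
              struct sockaddr_in laddr = {
                      .sin_family = AF_INET,
                      .sin_port = htons(4000),
              };
              inet_pton(AF_INET, "192.168.0.1", &laddr.sin_addr);
              if (bind(fd, (struct sockaddr *)&laddr, sizeof(laddr)) < 0) {
                      perror("bind");
                      return 1;
              }

              /* Send one datagram to a peer RDS endpoint. */
              struct sockaddr_in raddr = {
                      .sin_family = AF_INET,
                      .sin_port = htons(4001),
              };
              inet_pton(AF_INET, "192.168.0.2", &raddr.sin_addr);

              char payload[] = "hello";
              struct iovec iov = { .iov_base = payload, .iov_len = sizeof(payload) };
              struct msghdr msg = {
                      .msg_name = &raddr,
                      .msg_namelen = sizeof(raddr),
                      .msg_iov = &iov,
                      .msg_iovlen = 1,
              };
              if (sendmsg(fd, &msg, 0) < 0)
                      perror("sendmsg");

              /* Receive a datagram queued to our bound port. */
              char buf[1024];
              struct iovec riov = { .iov_base = buf, .iov_len = sizeof(buf) };
              struct msghdr rmsg = { .msg_iov = &riov, .msg_iovlen = 1 };
              ssize_t n = recvmsg(fd, &rmsg, 0);
              if (n >= 0)
                      printf("received %zd bytes\n", n);

              close(fd);
              return 0;
      }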

  poll(fd)
      RDS supports the poll interface to allow the application
      to implement async I/O.

      POLLIN handling is pretty straightforward. When there's an
      incoming message queued to the socket, or a pending
      notification, we signal POLLIN.

      POLLOUT is a little harder. Since you can essentially send to
      any destination, RDS will always signal POLLOUT as long as
      there's room on the send queue (i.e. the number of bytes queued
      is less than the sendbuf size).

      However, the kernel will refuse to accept messages to a
      destination marked congested - in this case you will loop
      forever if you rely on poll to tell you what to do. This isn't
      a trivial problem, but applications can deal with it by using
      congestion notifications, and by checking for ENOBUFS errors
      returned by sendmsg.
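
  The interplay between poll, EAGAIN and ENOBUFS is easiest to see in
  a send loop. The sketch below is illustrative only (the socket,
  destination and message are assumed to be set up elsewhere); it
  waits for POLLOUT when the local send queue is full, and for POLLIN
  when the destination is congested, so that congestion updates can be
  received before retrying::

      /* Illustrative send loop handling a full send queue (EAGAIN)
       * and a congested destination (ENOBUFS). */
      #include <errno.h>
      #include <poll.h>
      #include <sys/socket.h>

      static int rds_send_retry(int rds_fd, struct msghdr *msg)
      {
              for (;;) {
                      if (sendmsg(rds_fd, msg, 0) >= 0)
                              return 0;               /* message queued */

                      struct pollfd pfd = { .fd = rds_fd };

                      if (errno == EAGAIN) {
                              /* Local send queue full: POLLOUT is meaningful. */
                              pfd.events = POLLOUT;
                      } else if (errno == ENOBUFS) {
                              /* Destination congested: POLLOUT stays set, so
                               * wait for POLLIN instead and drain pending
                               * messages / congestion notifications. */
                              pfd.events = POLLIN;
                      } else {
                              return -1;              /* real error */
                      }

                      if (poll(&pfd, 1, -1) < 0)
                              return -1;
                      /* A real application would recvmsg() pending data and
                       * notifications here when POLLIN fires, then retry. */
              }
      }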

  setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
      This allows the application to discard all messages queued to a
      specific destination on this particular socket.

      This allows the application to cancel outstanding messages if it
      detects a timeout. For instance, if it tried to send a message,
      and the remote host is unreachable, RDS will keep trying forever.
      The application may decide it's not worth it, and cancel the
      operation. In this case, it would use RDS_CANCEL_SENT_TO to nuke
      any pending messages.

  ``setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..), getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)``
      Set or read an integer defining the underlying encapsulating
      transport to be used for RDS packets on the socket. When setting
      the option, the integer argument may be one of RDS_TRANS_TCP or
      RDS_TRANS_IB. When retrieving the value, RDS_TRANS_NONE will be
      returned on an unbound socket. This socket option may only be
      set exactly once on the socket, prior to binding it via the
      bind(2) system call. Attempts to set SO_RDS_TRANSPORT on a
      socket for which the transport has been previously attached
      explicitly (by SO_RDS_TRANSPORT) or implicitly (via bind(2))
      will return an error of EOPNOTSUPP. An attempt to set
      SO_RDS_TRANSPORT to RDS_TRANS_NONE will always return EINVAL.
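
  As a concrete illustration of the last two options, the following
  sketch pins a not-yet-bound socket to the TCP transport and cancels
  anything still queued to one destination. The constants come from
  the uapi rds header; error handling is abbreviated::

      /* Sketch: select the TCP transport before bind(2), and cancel
       * messages queued to one destination.  SOL_RDS, SO_RDS_TRANSPORT,
       * RDS_TRANS_* and RDS_CANCEL_SENT_TO come from <linux/rds.h>. */
      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <linux/rds.h>

      static int pin_to_tcp_transport(int fd)
      {
              int trans = RDS_TRANS_TCP;

              /* Must happen before bind(2); a second attempt, or an attempt
               * on an already-bound socket, fails with EOPNOTSUPP. */
              if (setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT,
                             &trans, sizeof(trans)) < 0)
                      return -1;

              /* Read it back; an unbound socket with no transport selected
               * would report RDS_TRANS_NONE instead. */
              socklen_t len = sizeof(trans);
              if (getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, &trans, &len) < 0)
                      return -1;

              return trans;
      }

      static int cancel_sent_to(int fd, const struct sockaddr_in *dest)
      {
              /* Drop everything still queued to this destination. */
              return setsockopt(fd, SOL_RDS, RDS_CANCEL_SENT_TO,
                                dest, sizeof(*dest));
      }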

RDMA for RDS
============

  see rds-rdma(7) manpage (available in rds-tools)


Congestion Notifications
========================

  see rds(7) manpage


RDS Protocol
============

  Message header

    The message header is a 'struct rds_header' (see rds.h):

    Fields:

      h_sequence:
          per-packet sequence number
      h_ack:
          piggybacked acknowledgment of last packet received
      h_len:
          length of data, not including header
      h_sport:
          source port
      h_dport:
          destination port
      h_flags:
          Can be:

          =============  ==================================
          CONG_BITMAP    this is a congestion update bitmap
          ACK_REQUIRED   receiver must ack this packet
          RETRANSMITTED  packet has previously been sent
          =============  ==================================

      h_credit:
          indicate to the other end of the connection that it has more
          credits available (i.e. there is more send room)
      h_padding[4]:
          unused, for future use
      h_csum:
          header checksum
      h_exthdr:
          optional data can be passed here. This is currently used for
          passing RDMA-related information.
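
    For orientation, the wire header implied by the field list above
    looks roughly as follows. The field widths are inferred from the
    descriptions here and are illustrative; the definition in rds.h is
    authoritative::

        /* Rough sketch of the RDS header described above; widths and the
         * extension-header size are illustrative, rds.h is authoritative. */
        #include <stdint.h>

        struct rds_header_sketch {
                uint64_t h_sequence;    /* per-packet sequence number */
                uint64_t h_ack;         /* piggybacked ack of last packet received */
                uint32_t h_len;         /* length of data, not including header */
                uint16_t h_sport;       /* source port */
                uint16_t h_dport;       /* destination port */
                uint8_t  h_flags;       /* CONG_BITMAP / ACK_REQUIRED / RETRANSMITTED */
                uint8_t  h_credit;      /* additional send credits granted to the peer */
                uint8_t  h_padding[4];  /* unused, for future use */
                uint16_t h_csum;        /* header checksum */
                uint8_t  h_exthdr[16];  /* optional data, e.g. RDMA-related info */
        };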

  ACK and retransmit handling

    One might think that with reliable IB connections you wouldn't
    need to ack messages that have been received. The problem is that
    IB hardware generates an ack message before it has DMAed the
    message into memory. This creates a potential message loss if the
    HCA is disabled for any reason between when it sends the ack and
    before the message is DMAed and processed. This is only a
    potential issue if another HCA is available for fail-over.

    Sending an ack immediately would allow the sender to free the sent
    message from their send queue quickly, but could cause excessive
    traffic to be used for acks. RDS piggybacks acks on sent data
    packets. Ack-only packets are reduced by only allowing one to be
    in flight at a time, and by the sender only asking for acks when
    its send buffers start to fill up. All retransmissions are also
    acked.

  Flow Control

    RDS's IB transport uses a credit-based mechanism to verify that
    there is space in the peer's receive buffers for more data. This
    eliminates the need for hardware retries on the connection.

  Congestion

    Messages waiting in the receive queue on the receiving socket are
    accounted against the socket's SO_RCVBUF option value. Only the
    payload bytes in the message are accounted for. If the number of
    bytes queued equals or exceeds rcvbuf then the socket is
    congested. All sends attempted to this socket's address should
    then block or return -EWOULDBLOCK.

    Applications are expected to be reasonably tuned such that this
    situation very rarely occurs. An application encountering this
    "back-pressure" is considered a bug.

    This is implemented by having each node maintain bitmaps which
    indicate which ports on bound addresses are congested. As the
    bitmap changes, it is sent through all the connections which
    terminate in the local address of the bitmap which changed.

    The bitmaps are allocated as connections are brought up. This
    avoids allocation in the interrupt handling path which queues
    messages on sockets. The dense bitmaps let transports send the
    entire bitmap on any bitmap change reasonably efficiently. This is
    much easier to implement than some finer-grained communication of
    per-port congestion. The sender does a very inexpensive bit test
    to check whether the port it's about to send to is congested or
    not.
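
    Conceptually, the congestion map is just one bit per destination
    port. The sketch below illustrates the idea only; it is not the
    kernel's actual data structure::

        /* Simplified illustration of the congestion map described above:
         * one bit per 16-bit port, set while that port is congested. */
        #include <stdbool.h>
        #include <stdint.h>

        #define RDS_PORT_BITMAP_WORDS (65536 / 64)

        struct cong_map_sketch {
                uint64_t bits[RDS_PORT_BITMAP_WORDS];   /* one bit per port */
        };

        /* Receiver sets the bit when a socket's queue exceeds rcvbuf and
         * then sends the updated bitmap to its peers. */
        static void mark_port_congested(struct cong_map_sketch *map,
                                        uint16_t dport)
        {
                map->bits[dport / 64] |= 1ULL << (dport % 64);
        }

        /* Sender's cheap test before queueing a message to dport. */
        static bool port_is_congested(const struct cong_map_sketch *map,
                                      uint16_t dport)
        {
                return map->bits[dport / 64] & (1ULL << (dport % 64));
        }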

RDS Transport Layer
===================

  As mentioned above, RDS is not IB-specific. Its code is divided
  into a general RDS layer and a transport layer.

  The general layer handles the socket API, congestion handling,
  loopback, stats, usermem pinning, and the connection state machine.

  The transport layer handles the details of the transport. The IB
  transport, for example, handles all the queue pairs, work requests,
  CM event handlers, and other InfiniBand details.
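
  The split can be pictured as a table of hooks that the general layer
  calls into for each transport. The names below are made up for this
  sketch; the real struct rds_transport in the RDS sources defines its
  own set of operations::

      /* Illustrative only: the general/transport split as a table of
       * function pointers.  Field names are invented for the sketch. */
      struct rds_transport_sketch {
              char name[16];                            /* e.g. "tcp", "ib" */
              int  (*conn_connect)(void *conn);         /* bring a connection up */
              void (*conn_shutdown)(void *conn);        /* tear it down */
              int  (*xmit)(void *conn, void *message);  /* transmit one rds_message */
              int  (*recv)(void *conn);                 /* service the receive side */
      };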

RDS Kernel Structures
=====================

  struct rds_message
      aka possibly "rds_outgoing", the generic RDS layer copies data
      to be sent and sets header fields as needed, based on the socket
      API. This is then queued for the individual connection and sent
      by the connection's transport.

  struct rds_incoming
      a generic struct referring to incoming data that can be handed
      from the transport to the general code and queued by the general
      code while the socket is awoken. It is then passed back to the
      transport code to handle the actual copy-to-user.

  struct rds_socket
      per-socket information

  struct rds_connection
      per-connection information

  struct rds_transport
      pointers to transport-specific functions

  struct rds_statistics
      non-transport-specific statistics

  struct rds_cong_map
      wraps the raw congestion bitmap, contains rbnode, waitq, etc.

Connection management
=====================

  Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
  ERROR states.

  The first time an attempt is made by an RDS socket to send data to
  a node, a connection is allocated and connected. That connection is
  then maintained forever -- if there are transport errors, the
  connection will be dropped and re-established.

  Dropping a connection while packets are queued will cause queued or
  partially-sent datagrams to be retransmitted when the connection is
  re-established.

The send path
=============

  rds_sendmsg()
    - struct rds_message built from incoming data
    - CMSGs parsed (e.g. RDMA ops)
    - transport connection alloced and connected if not already
    - rds_message placed on send queue
    - send worker awoken

  rds_send_worker()
    - calls rds_send_xmit() until queue is empty

  rds_send_xmit()
    - transmits congestion map if one is pending
    - may set ACK_REQUIRED
    - calls transport to send either non-RDMA or RDMA message
      (RDMA ops never retransmitted)

  rds_ib_xmit()
    - allocs work requests from send ring
    - adds any new send credits available to peer (h_credits)
    - maps the rds_message's sg list
    - piggybacks ack
    - populates work requests
    - post send to connection's queue pair

The recv path
=============

  rds_ib_recv_cq_comp_handler()
    - looks at write completions
    - unmaps recv buffer from device
    - no errors, call rds_ib_process_recv()
    - refill recv ring

  rds_ib_process_recv()
    - validate header checksum
    - copy header to rds_ib_incoming struct if start of a new datagram
    - add to ibinc's fraglist
    - if completed datagram:

      - update cong map if datagram was cong update
      - call rds_recv_incoming() otherwise
      - note if ack is required

  rds_recv_incoming()
    - drop duplicate packets
    - respond to pings
    - find the sock associated with this datagram
    - add to sock queue
    - wake up sock
    - do some congestion calculations

  rds_recvmsg
    - copy data into user iovec
    - handle CMSGs
    - return to application
Multipathed RDS is implemented by splitting the struct rds_connection into a common (to all paths) part, and a per-path struct rds_conn_path. All I/O workqs and reconnect threads are driven from the rds_conn_path. Transports such as TCP that are multipath capable may then set up a TCP socket per rds_conn_path, and this is managed by the transport via the transport privatee cp_transport_data pointer. Transports announce themselves as multipath capable by setting the t_mp_capable bit during registration with the rds core module. When the transport is multipath-capable, rds_sendmsg() hashes outgoing traffic across multiple paths. The outgoing hash is computed based on the local address and port that the PF_RDS socket is bound to. Additionally, even if the transport is MP capable, we may be peering with some node that does not support mprds, or supports a different number of paths. As a result, the peering nodes need to agree on the number of paths to be used for the connection. This is done by sending out a control packet exchange before the first data packet. The control packet exchange must have completed prior to outgoing hash completion in rds_sendmsg() when the transport is mutlipath capable. The control packet is an RDS ping packet (i.e., packet to rds dest port 0) with the ping packet having a rds extension header option of type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the number of paths supported by the sender. The "probe" ping packet will get sent from some reserved port, RDS_FLAG_PROBE_PORT (in ) The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately be able to compute the min(sender_paths, rcvr_paths). The pong sent in response to a probe-ping should contain the rcvr's npaths when the rcvr is mprds-capable. If the rcvr is not mprds-capable, the exthdr in the ping will be ignored. In this case the pong will not have any exthdrs, so the sender of the probe-ping can default to single-path mprds. h](h)}(hX2Mprds is multipathed-RDS, primarily intended for RDS-over-TCP (though the concept can be extended to other transports). The classical implementation of RDS-over-TCP is implemented by demultiplexing multiple PF_RDS sockets between any 2 endpoints (where endpoint == [IP address, port]) over a single TCP socket between the 2 IP addresses involved. This has the limitation that it ends up funneling multiple RDS flows over a single TCP flow, thus it is (a) upper-bounded to the single-flow bandwidth, (b) suffers from head-of-line blocking for all the RDS sockets.h]hX2Mprds is multipathed-RDS, primarily intended for RDS-over-TCP (though the concept can be extended to other transports). The classical implementation of RDS-over-TCP is implemented by demultiplexing multiple PF_RDS sockets between any 2 endpoints (where endpoint == [IP address, port]) over a single TCP socket between the 2 IP addresses involved. This has the limitation that it ends up funneling multiple RDS flows over a single TCP flow, thus it is (a) upper-bounded to the single-flow bandwidth, (b) suffers from head-of-line blocking for all the RDS sockets.}(hjhhhNhNubah}(h]h ]h"]h$]h&]uh1hhhhMhjubh)}(hXBetter throughput (for a fixed small packet size, MTU) can be achieved by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp connection. 
  Transports announce themselves as multipath capable by setting the
  t_mp_capable bit during registration with the rds core module. When the
  transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
  across multiple paths. The outgoing hash is computed based on the local
  address and port that the PF_RDS socket is bound to.
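  A hedged sketch of that per-socket path selection (the mixing function
  below is a stand-in, not the kernel's actual hash)::

      /* Illustrative only: derive a stable path index from the bound local
       * address and port, so all traffic of one PF_RDS socket stays on the
       * same path.
       */
      #include <stdint.h>

      static unsigned int rds_path_for_socket(uint32_t laddr, uint16_t lport,
                                              unsigned int npaths)
      {
              uint32_t h = laddr ^ (((uint32_t)lport << 16) | lport);

              h ^= h >> 16;           /* cheap integer mix */
              h *= 0x45d9f3bU;
              h ^= h >> 16;
              return npaths ? h % npaths : 0;
      }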
  Additionally, even if the transport is MP capable, we may be peering with
  some node that does not support mprds, or supports a different number of
  paths. As a result, the peering nodes need to agree on the number of paths
  to be used for the connection. This is done by sending out a control packet
  exchange before the first data packet. The control packet exchange must
  have completed prior to outgoing hash completion in rds_sendmsg() when the
  transport is multipath capable.

  The control packet is an RDS ping packet (i.e., a packet to rds dest port
  0) with the ping packet having an RDS extension header option of type
  RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the number of paths
  supported by the sender. The "probe" ping packet will get sent from some
  reserved port, RDS_FLAG_PROBE_PORT (in ). The receiver of a ping from
  RDS_FLAG_PROBE_PORT will thus immediately be able to compute the
  min(sender_paths, rcvr_paths). The pong sent in response to a probe-ping
  should contain the rcvr's npaths when the rcvr is mprds-capable.

  If the rcvr is not mprds-capable, the exthdr in the ping will be ignored.
  In this case the pong will not have any exthdrs, so the sender of the
  probe-ping can default to single-path mprds.
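  The agreement each side reaches is just a min() over the advertised path
  counts; a hedged sketch (the helper name and arguments are illustrative)::

      /* Illustrative only: combine the local path count with what the peer
       * advertised in RDS_EXTHDR_NPATHS.  A peer that sent no exthdr is
       * treated as single-path.
       */
      static int rds_negotiated_npaths(int local_npaths, int peer_exthdr_npaths)
      {
              if (peer_exthdr_npaths <= 0)    /* no RDS_EXTHDR_NPATHS seen */
                      return 1;
              return local_npaths < peer_exthdr_npaths ?
                     local_npaths : peer_exthdr_npaths;
      }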