Merge tag 'mlx5-updates-2020-04-30' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2020-04-30
1) Add release all pages support, From Eran.
to release all FW pages at once on driver unload, when supported by FW.
2) From Maxim and Tariq, Trivial Data path cleanup and code improvements
in preparation for their next features, TLS offload and TX performance
improvements
3) Multiple cleanups.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index e43f2e1..398c348 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -638,7 +638,7 @@
See Documentation/admin-guide/serial-console.rst for more
information. See
- Documentation/networking/netconsole.txt for an
+ Documentation/networking/netconsole.rst for an
alternative.
uart[8250],io,<addr>[,options]
diff --git a/Documentation/admin-guide/serial-console.rst b/Documentation/admin-guide/serial-console.rst
index a8d1e36..58b3283 100644
--- a/Documentation/admin-guide/serial-console.rst
+++ b/Documentation/admin-guide/serial-console.rst
@@ -54,7 +54,7 @@
``/dev/console`` is now character device 5,1.
(You can also use a network device as a console. See
-``Documentation/networking/netconsole.txt`` for information on that.)
+``Documentation/networking/netconsole.rst`` for information on that.)
Here's an example that will use ``/dev/ttyS1`` (COM2) as the console.
Replace the sample values as needed.
diff --git a/Documentation/devicetree/bindings/net/qcom,ipq4019-mdio.yaml b/Documentation/devicetree/bindings/net/qcom,ipq4019-mdio.yaml
new file mode 100644
index 0000000..13555a8
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/qcom,ipq4019-mdio.yaml
@@ -0,0 +1,61 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/net/qcom,ipq4019-mdio.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Qualcomm IPQ40xx MDIO Controller Device Tree Bindings
+
+maintainers:
+ - Robert Marko <robert.marko@sartura.hr>
+
+allOf:
+ - $ref: "mdio.yaml#"
+
+properties:
+ compatible:
+ const: qcom,ipq4019-mdio
+
+ "#address-cells":
+ const: 1
+
+ "#size-cells":
+ const: 0
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - "#address-cells"
+ - "#size-cells"
+
+examples:
+ - |
+ mdio@90000 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ compatible = "qcom,ipq4019-mdio";
+ reg = <0x90000 0x64>;
+
+ ethphy0: ethernet-phy@0 {
+ reg = <0>;
+ };
+
+ ethphy1: ethernet-phy@1 {
+ reg = <1>;
+ };
+
+ ethphy2: ethernet-phy@2 {
+ reg = <2>;
+ };
+
+ ethphy3: ethernet-phy@3 {
+ reg = <3>;
+ };
+
+ ethphy4: ethernet-phy@4 {
+ reg = <4>;
+ };
+ };
diff --git a/Documentation/filesystems/afs.rst b/Documentation/filesystems/afs.rst
index c4ec39a..cada946 100644
--- a/Documentation/filesystems/afs.rst
+++ b/Documentation/filesystems/afs.rst
@@ -70,7 +70,7 @@
The first module is the AF_RXRPC network protocol driver. This provides the
RxRPC remote operation protocol and may also be accessed from userspace. See:
- Documentation/networking/rxrpc.txt
+ Documentation/networking/rxrpc.rst
The second module is the kerberos RxRPC security driver, and the third module
is the actual filesystem driver for the AFS filesystem.
diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
index dd49f95..24168b0 100644
--- a/Documentation/networking/bonding.rst
+++ b/Documentation/networking/bonding.rst
@@ -1639,7 +1639,7 @@
using the traffic control utilities inherent in linux.
By default the bonding driver is multiqueue aware and 16 queues are created
-when the driver initializes (see Documentation/networking/multiqueue.txt
+when the driver initializes (see Documentation/networking/multiqueue.rst
for details). If more or less queues are desired the module parameter
tx_queues can be used to change this value. There is no sysfs parameter
available as the allocation is done at module init time.
diff --git a/Documentation/networking/can.rst b/Documentation/networking/can.rst
index 2fd0b51..ff05cbd 100644
--- a/Documentation/networking/can.rst
+++ b/Documentation/networking/can.rst
@@ -1058,7 +1058,7 @@
- TX: Put the CAN frame from the socket buffer to the CAN controller.
- RX: Put the CAN frame from the CAN controller to the socket buffer.
-See e.g. at Documentation/networking/netdevices.txt . The differences
+See e.g. at Documentation/networking/netdevices.rst . The differences
for writing CAN network device driver are described below:
diff --git a/Documentation/networking/checksum-offloads.rst b/Documentation/networking/checksum-offloads.rst
index 905c8a8..69b23cf 100644
--- a/Documentation/networking/checksum-offloads.rst
+++ b/Documentation/networking/checksum-offloads.rst
@@ -59,7 +59,7 @@
for more details.
A driver declares its offload capabilities in netdev->hw_features; see
-Documentation/networking/netdev-features.txt for more. Note that a device
+Documentation/networking/netdev-features.rst for more. Note that a device
which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start and
csum_offset given in the SKB; if it tries to deduce these itself in hardware
(as some NICs do) the driver should check that the values in the SKB match
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index e1ff08b..b423b2d 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -74,6 +74,43 @@
ipvlan
ipvs-sysctl
kcm
+ l2tp
+ lapb-module
+ ltpc
+ mac80211-injection
+ mpls-sysctl
+ multiqueue
+ netconsole
+ netdev-features
+ netdevices
+ netfilter-sysctl
+ netif-msg
+ nf_conntrack-sysctl
+ nf_flowtable
+ openvswitch
+ operstates
+ packet_mmap
+ phonet
+ pktgen
+ plip
+ ppp_generic
+ proc_net_tcp
+ radiotap-headers
+ ray_cs
+ rds
+ regulatory
+ rxrpc
+ sctp
+ secid
+ seg6-sysctl
+ skfp
+ strparser
+ switchdev
+ tc-actions-env-rules
+ tcp-thin
+ team
+ timestamping
+ tproxy
.. only:: subproject and html
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 38f811d..3266aee 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -886,7 +886,7 @@
initiated. This improves retransmission latency for
non-aggressive thin streams, often found to be time-dependent.
For more information on thin streams, see
- Documentation/networking/tcp-thin.txt
+ Documentation/networking/tcp-thin.rst
Default: 0
diff --git a/Documentation/networking/l2tp.txt b/Documentation/networking/l2tp.rst
similarity index 78%
rename from Documentation/networking/l2tp.txt
rename to Documentation/networking/l2tp.rst
index 9bc271c..a48238a 100644
--- a/Documentation/networking/l2tp.txt
+++ b/Documentation/networking/l2tp.rst
@@ -1,3 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====
+L2TP
+====
+
This document describes how to use the kernel's L2TP drivers to
provide L2TP functionality. L2TP is a protocol that tunnels one or
more sessions over an IP tunnel. It is commonly used for VPNs
@@ -121,14 +127,16 @@
setsockopt and ioctl on the PPPoX socket. The following socket
options are supported:-
-DEBUG - bitmask of debug message categories. See below.
-SENDSEQ - 0 => don't send packets with sequence numbers
- 1 => send packets with sequence numbers
-RECVSEQ - 0 => receive packet sequence numbers are optional
- 1 => drop receive packets without sequence numbers
-LNSMODE - 0 => act as LAC.
- 1 => act as LNS.
-REORDERTO - reorder timeout (in millisecs). If 0, don't try to reorder.
+========= ===========================================================
+DEBUG bitmask of debug message categories. See below.
+SENDSEQ - 0 => don't send packets with sequence numbers
+ - 1 => send packets with sequence numbers
+RECVSEQ - 0 => receive packet sequence numbers are optional
+ - 1 => drop receive packets without sequence numbers
+LNSMODE - 0 => act as LAC.
+ - 1 => act as LNS.
+REORDERTO reorder timeout (in millisecs). If 0, don't try to reorder.
+========= ===========================================================
Only the DEBUG option is supported by the special tunnel management
PPPoX socket.
@@ -177,20 +185,22 @@
The following debug mask bits are available:
+================ ==============================
L2TP_MSG_DEBUG verbose debug (if compiled in)
L2TP_MSG_CONTROL userspace - kernel interface
L2TP_MSG_SEQ sequence numbers handling
L2TP_MSG_DATA data packets
+================ ==============================
If enabled, files under a l2tp debugfs directory can be used to dump
kernel state about L2TP tunnels and sessions. To access it, the
-debugfs filesystem must first be mounted.
+debugfs filesystem must first be mounted::
-# mount -t debugfs debugfs /debug
+ # mount -t debugfs debugfs /debug
-Files under the l2tp directory can then be accessed.
+Files under the l2tp directory can then be accessed::
-# cat /debug/l2tp/tunnels
+ # cat /debug/l2tp/tunnels
The debugfs files should not be used by applications to obtain L2TP
state information because the file format is subject to change. It is
@@ -211,14 +221,14 @@
To create an L2TPv3 ethernet pseudowire between local host 192.168.1.1
and peer 192.168.1.2, using IP addresses 10.5.1.1 and 10.5.1.2 for the
-tunnel endpoints:-
+tunnel endpoints::
-# ip l2tp add tunnel tunnel_id 1 peer_tunnel_id 1 udp_sport 5000 \
- udp_dport 5000 encap udp local 192.168.1.1 remote 192.168.1.2
-# ip l2tp add session tunnel_id 1 session_id 1 peer_session_id 1
-# ip -s -d show dev l2tpeth0
-# ip addr add 10.5.1.2/32 peer 10.5.1.1/32 dev l2tpeth0
-# ip li set dev l2tpeth0 up
+ # ip l2tp add tunnel tunnel_id 1 peer_tunnel_id 1 udp_sport 5000 \
+ udp_dport 5000 encap udp local 192.168.1.1 remote 192.168.1.2
+ # ip l2tp add session tunnel_id 1 session_id 1 peer_session_id 1
+ # ip -s -d show dev l2tpeth0
+ # ip addr add 10.5.1.2/32 peer 10.5.1.1/32 dev l2tpeth0
+ # ip li set dev l2tpeth0 up
Choose IP addresses to be the address of a local IP interface and that
of the remote system. The IP addresses of the l2tpeth0 interface can be
@@ -228,75 +238,78 @@
addresses reversed. The tunnel and session IDs can be any non-zero
32-bit number, but the values must be reversed at the peer.
+======================== ===================
Host 1 Host2
+======================== ===================
udp_sport=5000 udp_sport=5001
udp_dport=5001 udp_dport=5000
tunnel_id=42 tunnel_id=45
peer_tunnel_id=45 peer_tunnel_id=42
session_id=128 session_id=5196755
peer_session_id=5196755 peer_session_id=128
+======================== ===================
When done at both ends of the tunnel, it should be possible to send
-data over the network. e.g.
+data over the network. e.g.::
-# ping 10.5.1.1
+ # ping 10.5.1.1
Sample Userspace Code
=====================
-1. Create tunnel management PPPoX socket
+1. Create tunnel management PPPoX socket::
- kernel_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP);
- if (kernel_fd >= 0) {
- struct sockaddr_pppol2tp sax;
- struct sockaddr_in const *peer_addr;
+ kernel_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP);
+ if (kernel_fd >= 0) {
+ struct sockaddr_pppol2tp sax;
+ struct sockaddr_in const *peer_addr;
- peer_addr = l2tp_tunnel_get_peer_addr(tunnel);
- memset(&sax, 0, sizeof(sax));
- sax.sa_family = AF_PPPOX;
- sax.sa_protocol = PX_PROTO_OL2TP;
- sax.pppol2tp.fd = udp_fd; /* fd of tunnel UDP socket */
- sax.pppol2tp.addr.sin_addr.s_addr = peer_addr->sin_addr.s_addr;
- sax.pppol2tp.addr.sin_port = peer_addr->sin_port;
- sax.pppol2tp.addr.sin_family = AF_INET;
- sax.pppol2tp.s_tunnel = tunnel_id;
- sax.pppol2tp.s_session = 0; /* special case: mgmt socket */
- sax.pppol2tp.d_tunnel = 0;
- sax.pppol2tp.d_session = 0; /* special case: mgmt socket */
+ peer_addr = l2tp_tunnel_get_peer_addr(tunnel);
+ memset(&sax, 0, sizeof(sax));
+ sax.sa_family = AF_PPPOX;
+ sax.sa_protocol = PX_PROTO_OL2TP;
+ sax.pppol2tp.fd = udp_fd; /* fd of tunnel UDP socket */
+ sax.pppol2tp.addr.sin_addr.s_addr = peer_addr->sin_addr.s_addr;
+ sax.pppol2tp.addr.sin_port = peer_addr->sin_port;
+ sax.pppol2tp.addr.sin_family = AF_INET;
+ sax.pppol2tp.s_tunnel = tunnel_id;
+ sax.pppol2tp.s_session = 0; /* special case: mgmt socket */
+ sax.pppol2tp.d_tunnel = 0;
+ sax.pppol2tp.d_session = 0; /* special case: mgmt socket */
- if(connect(kernel_fd, (struct sockaddr *)&sax, sizeof(sax) ) < 0 ) {
- perror("connect failed");
- result = -errno;
- goto err;
- }
- }
+ if(connect(kernel_fd, (struct sockaddr *)&sax, sizeof(sax) ) < 0 ) {
+ perror("connect failed");
+ result = -errno;
+ goto err;
+ }
+ }
-2. Create session PPPoX data socket
+2. Create session PPPoX data socket::
- struct sockaddr_pppol2tp sax;
- int fd;
+ struct sockaddr_pppol2tp sax;
+ int fd;
- /* Note, the target socket must be bound already, else it will not be ready */
- sax.sa_family = AF_PPPOX;
- sax.sa_protocol = PX_PROTO_OL2TP;
- sax.pppol2tp.fd = tunnel_fd;
- sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr;
- sax.pppol2tp.addr.sin_port = addr->sin_port;
- sax.pppol2tp.addr.sin_family = AF_INET;
- sax.pppol2tp.s_tunnel = tunnel_id;
- sax.pppol2tp.s_session = session_id;
- sax.pppol2tp.d_tunnel = peer_tunnel_id;
- sax.pppol2tp.d_session = peer_session_id;
+ /* Note, the target socket must be bound already, else it will not be ready */
+ sax.sa_family = AF_PPPOX;
+ sax.sa_protocol = PX_PROTO_OL2TP;
+ sax.pppol2tp.fd = tunnel_fd;
+ sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr;
+ sax.pppol2tp.addr.sin_port = addr->sin_port;
+ sax.pppol2tp.addr.sin_family = AF_INET;
+ sax.pppol2tp.s_tunnel = tunnel_id;
+ sax.pppol2tp.s_session = session_id;
+ sax.pppol2tp.d_tunnel = peer_tunnel_id;
+ sax.pppol2tp.d_session = peer_session_id;
- /* session_fd is the fd of the session's PPPoL2TP socket.
- * tunnel_fd is the fd of the tunnel UDP socket.
- */
- fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax));
- if (fd < 0 ) {
- return -errno;
- }
- return 0;
+ /* session_fd is the fd of the session's PPPoL2TP socket.
+ * tunnel_fd is the fd of the tunnel UDP socket.
+ */
+ fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax));
+ if (fd < 0 ) {
+ return -errno;
+ }
+ return 0;
Internal Implementation
=======================
diff --git a/Documentation/networking/lapb-module.txt b/Documentation/networking/lapb-module.rst
similarity index 74%
rename from Documentation/networking/lapb-module.txt
rename to Documentation/networking/lapb-module.rst
index d4fc8f22..ff586bc 100644
--- a/Documentation/networking/lapb-module.txt
+++ b/Documentation/networking/lapb-module.rst
@@ -1,8 +1,14 @@
- The Linux LAPB Module Interface 1.3
+.. SPDX-License-Identifier: GPL-2.0
- Jonathan Naylor 29.12.96
+===============================
+The Linux LAPB Module Interface
+===============================
-Changed (Henner Eisen, 2000-10-29): int return value for data_indication()
+Version 1.3
+
+Jonathan Naylor 29.12.96
+
+Changed (Henner Eisen, 2000-10-29): int return value for data_indication()
The LAPB module will be a separately compiled module for use by any parts of
the Linux operating system that require a LAPB service. This document
@@ -32,16 +38,16 @@
This structure is used only once, in the call to lapb_register (see below).
It contains information about the device driver that requires the services
-of the LAPB module.
+of the LAPB module::
-struct lapb_register_struct {
- void (*connect_confirmation)(int token, int reason);
- void (*connect_indication)(int token, int reason);
- void (*disconnect_confirmation)(int token, int reason);
- void (*disconnect_indication)(int token, int reason);
- int (*data_indication)(int token, struct sk_buff *skb);
- void (*data_transmit)(int token, struct sk_buff *skb);
-};
+ struct lapb_register_struct {
+ void (*connect_confirmation)(int token, int reason);
+ void (*connect_indication)(int token, int reason);
+ void (*disconnect_confirmation)(int token, int reason);
+ void (*disconnect_indication)(int token, int reason);
+ int (*data_indication)(int token, struct sk_buff *skb);
+ void (*data_transmit)(int token, struct sk_buff *skb);
+ };
Each member of this structure corresponds to a function in the device driver
that is called when a particular event in the LAPB module occurs. These will
@@ -54,19 +60,19 @@
This structure is used with the lapb_getparms and lapb_setparms functions
(see below). They are used to allow the device driver to get and set the
-operational parameters of the LAPB implementation for a given connection.
+operational parameters of the LAPB implementation for a given connection::
-struct lapb_parms_struct {
- unsigned int t1;
- unsigned int t1timer;
- unsigned int t2;
- unsigned int t2timer;
- unsigned int n2;
- unsigned int n2count;
- unsigned int window;
- unsigned int state;
- unsigned int mode;
-};
+ struct lapb_parms_struct {
+ unsigned int t1;
+ unsigned int t1timer;
+ unsigned int t2;
+ unsigned int t2timer;
+ unsigned int n2;
+ unsigned int n2count;
+ unsigned int window;
+ unsigned int state;
+ unsigned int mode;
+ };
T1 and T2 are protocol timing parameters and are given in units of 100ms. N2
is the maximum number of tries on the link before it is declared a failure.
@@ -78,11 +84,14 @@
The mode variable is a bit field used for setting (at present) three values.
The bit fields have the following meanings:
+====== =================================================
Bit Meaning
+====== =================================================
0 LAPB operation (0=LAPB_STANDARD 1=LAPB_EXTENDED).
1 [SM]LP operation (0=LAPB_SLP 1=LAPB=MLP).
2 DTE/DCE operation (0=LAPB_DTE 1=LAPB_DCE)
3-31 Reserved, must be 0.
+====== =================================================
Extended LAPB operation indicates the use of extended sequence numbers and
consequently larger window sizes, the default is standard LAPB operation.
@@ -99,8 +108,9 @@
The LAPB module provides a number of function entry points.
+::
-int lapb_register(void *token, struct lapb_register_struct);
+ int lapb_register(void *token, struct lapb_register_struct);
This must be called before the LAPB module may be used. If the call is
successful then LAPB_OK is returned. The token must be a unique identifier
@@ -111,33 +121,42 @@
lapb_register must be made. The format of the lapb_register_struct is given
above. The return values are:
+============= =============================
LAPB_OK LAPB registered successfully.
LAPB_BADTOKEN Token is already registered.
LAPB_NOMEM Out of memory
+============= =============================
+::
-int lapb_unregister(void *token);
+ int lapb_unregister(void *token);
This releases all the resources associated with a LAPB link. Any current
LAPB link will be abandoned without further messages being passed. After
this call, the value of token is no longer valid for any calls to the LAPB
function. The valid return values are:
+============= ===============================
LAPB_OK LAPB unregistered successfully.
LAPB_BADTOKEN Invalid/unknown LAPB token.
+============= ===============================
+::
-int lapb_getparms(void *token, struct lapb_parms_struct *parms);
+ int lapb_getparms(void *token, struct lapb_parms_struct *parms);
This allows the device driver to get the values of the current LAPB
variables, the lapb_parms_struct is described above. The valid return values
are:
+============= =============================
LAPB_OK LAPB getparms was successful.
LAPB_BADTOKEN Invalid/unknown LAPB token.
+============= =============================
+::
-int lapb_setparms(void *token, struct lapb_parms_struct *parms);
+ int lapb_setparms(void *token, struct lapb_parms_struct *parms);
This allows the device driver to set the values of the current LAPB
variables, the lapb_parms_struct is described above. The values of t1timer,
@@ -145,42 +164,54 @@
connected will be ignored. An error implies that none of the values have
been changed. The valid return values are:
+============= =================================================
LAPB_OK LAPB getparms was successful.
LAPB_BADTOKEN Invalid/unknown LAPB token.
LAPB_INVALUE One of the values was out of its allowable range.
+============= =================================================
+::
-int lapb_connect_request(void *token);
+ int lapb_connect_request(void *token);
Initiate a connect using the current parameter settings. The valid return
values are:
+============== =================================
LAPB_OK LAPB is starting to connect.
LAPB_BADTOKEN Invalid/unknown LAPB token.
LAPB_CONNECTED LAPB module is already connected.
+============== =================================
+::
-int lapb_disconnect_request(void *token);
+ int lapb_disconnect_request(void *token);
Initiate a disconnect. The valid return values are:
+================= ===============================
LAPB_OK LAPB is starting to disconnect.
LAPB_BADTOKEN Invalid/unknown LAPB token.
LAPB_NOTCONNECTED LAPB module is not connected.
+================= ===============================
+::
-int lapb_data_request(void *token, struct sk_buff *skb);
+ int lapb_data_request(void *token, struct sk_buff *skb);
Queue data with the LAPB module for transmitting over the link. If the call
is successful then the skbuff is owned by the LAPB module and may not be
used by the device driver again. The valid return values are:
+================= =============================
LAPB_OK LAPB has accepted the data.
LAPB_BADTOKEN Invalid/unknown LAPB token.
LAPB_NOTCONNECTED LAPB module is not connected.
+================= =============================
+::
-int lapb_data_received(void *token, struct sk_buff *skb);
+ int lapb_data_received(void *token, struct sk_buff *skb);
Queue data with the LAPB module which has been received from the device. It
is expected that the data passed to the LAPB module has skb->data pointing
@@ -188,9 +219,10 @@
is owned by the LAPB module and may not be used by the device driver again.
The valid return values are:
+============= ===========================
LAPB_OK LAPB has accepted the data.
LAPB_BADTOKEN Invalid/unknown LAPB token.
-
+============= ===========================
Callbacks
---------
@@ -200,49 +232,58 @@
module with lapb_register (see above) in the structure lapb_register_struct
(see above).
+::
-void (*connect_confirmation)(void *token, int reason);
+ void (*connect_confirmation)(void *token, int reason);
This is called by the LAPB module when a connection is established after
being requested by a call to lapb_connect_request (see above). The reason is
always LAPB_OK.
+::
-void (*connect_indication)(void *token, int reason);
+ void (*connect_indication)(void *token, int reason);
This is called by the LAPB module when the link is established by the remote
system. The value of reason is always LAPB_OK.
+::
-void (*disconnect_confirmation)(void *token, int reason);
+ void (*disconnect_confirmation)(void *token, int reason);
This is called by the LAPB module when an event occurs after the device
driver has called lapb_disconnect_request (see above). The reason indicates
what has happened. In all cases the LAPB link can be regarded as being
terminated. The values for reason are:
+================= ====================================================
LAPB_OK The LAPB link was terminated normally.
LAPB_NOTCONNECTED The remote system was not connected.
LAPB_TIMEDOUT No response was received in N2 tries from the remote
system.
+================= ====================================================
+::
-void (*disconnect_indication)(void *token, int reason);
+ void (*disconnect_indication)(void *token, int reason);
This is called by the LAPB module when the link is terminated by the remote
system or another event has occurred to terminate the link. This may be
returned in response to a lapb_connect_request (see above) if the remote
system refused the request. The values for reason are:
+================= ====================================================
LAPB_OK The LAPB link was terminated normally by the remote
system.
LAPB_REFUSED The remote system refused the connect request.
LAPB_NOTCONNECTED The remote system was not connected.
LAPB_TIMEDOUT No response was received in N2 tries from the remote
system.
+================= ====================================================
+::
-int (*data_indication)(void *token, struct sk_buff *skb);
+ int (*data_indication)(void *token, struct sk_buff *skb);
This is called by the LAPB module when data has been received from the
remote system that should be passed onto the next layer in the protocol
@@ -254,8 +295,9 @@
file include/linux/netdevice.h) if and only if the frame was dropped
before it could be delivered to the upper layer.
+::
-void (*data_transmit)(void *token, struct sk_buff *skb);
+ void (*data_transmit)(void *token, struct sk_buff *skb);
This is called by the LAPB module when data is to be transmitted to the
remote system by the device driver. The skbuff becomes the property of the
diff --git a/Documentation/networking/ltpc.txt b/Documentation/networking/ltpc.rst
similarity index 86%
rename from Documentation/networking/ltpc.txt
rename to Documentation/networking/ltpc.rst
index a005a73..0ad197f 100644
--- a/Documentation/networking/ltpc.txt
+++ b/Documentation/networking/ltpc.rst
@@ -1,3 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========
+LTPC Driver
+===========
+
This is the ALPHA version of the ltpc driver.
In order to use it, you will need at least version 1.3.3 of the
@@ -15,7 +21,7 @@
change the settings on your card)
When the driver is compiled into the kernel, you can add a line such
-as the following to your /etc/lilo.conf:
+as the following to your /etc/lilo.conf::
append="ltpc=0x240,9,1"
@@ -25,13 +31,13 @@
If you load the driver as a module, you can pass the parameters "io=",
"irq=", and "dma=" on the command line with insmod or modprobe, or add
-them as options in a configuration file in /etc/modprobe.d/ directory:
+them as options in a configuration file in /etc/modprobe.d/ directory::
alias lt0 ltpc # autoload the module when the interface is configured
options ltpc io=0x240 irq=9 dma=1
Before starting up the netatalk demons (perhaps in rc.local), you
-need to add a line such as:
+need to add a line such as::
/sbin/ifconfig lt0 127.0.0.42
@@ -42,7 +48,7 @@
attached to a network that includes AppleTalk routers or not. If,
like me, you are simply connecting to your home Macintoshes and
printers, you need to set up netatalk to "seed". The way I do this
-is to have the lines
+is to have the lines::
dummy -seed -phase 2 -net 2000 -addr 2000.26 -zone "1033"
lt0 -seed -phase 1 -net 1033 -addr 1033.27 -zone "1033"
@@ -57,13 +63,13 @@
If you are attached to an extended AppleTalk network, with routers on
it, then you don't need to fool around with this -- the appropriate
-line in atalkd.conf is
+line in atalkd.conf is::
lt0 -phase 1
---------------------------------------
-Card Configuration:
+Card Configuration
+==================
The interrupts and so forth are configured via the dipswitch on the
board. Set the switches so as not to conflict with other hardware.
@@ -73,26 +79,32 @@
original documentation refers to IRQ2. Since you'll be running
this on an AT (or later) class machine, that really means IRQ9.
+ === ===========================================================
SW1 IRQ 4
SW2 IRQ 3
SW3 IRQ 9 (2 in original card documentation only applies to XT)
+ === ===========================================================
DMA -- choose DMA 1 or 3, and set both corresponding switches.
+ === =====
SW4 DMA 3
SW5 DMA 1
SW6 DMA 3
SW7 DMA 1
+ === =====
I/O address -- choose one.
+ === =========
SW8 220 / 240
+ === =========
---------------------------------------
-IP:
+IP
+==
Yes, it is possible to do IP over LocalTalk. However, you can't just
treat the LocalTalk device like an ordinary Ethernet device, even if
@@ -102,9 +114,9 @@
See Documentation/networking/ipddp.rst for more information about the
kernel driver and userspace tools needed.
---------------------------------------
-BUGS:
+Bugs
+====
IRQ autoprobing often doesn't work on a cold boot. To get around
this, either compile the driver as a module, or pass the parameters
@@ -120,12 +132,13 @@
machine, but this is unsupported, so if you really want to do this,
you'll probably have to hack the initialization code a bit.
-______________________________________
-THANKS:
- Thanks to Alan Cox for helpful discussions early on in this
+Thanks
+======
+
+Thanks to Alan Cox for helpful discussions early on in this
work, and to Denis Hainsworth for doing the bleeding-edge testing.
--- Bradford Johnson <bradford@math.umn.edu>
+Bradford Johnson <bradford@math.umn.edu>
--- Updated 11/09/1998 by David Huggins-Daines <dhd@debian.org>
+Updated 11/09/1998 by David Huggins-Daines <dhd@debian.org>
diff --git a/Documentation/networking/mac80211-injection.txt b/Documentation/networking/mac80211-injection.rst
similarity index 67%
rename from Documentation/networking/mac80211-injection.txt
rename to Documentation/networking/mac80211-injection.rst
index d58d78d..be65f88 100644
--- a/Documentation/networking/mac80211-injection.txt
+++ b/Documentation/networking/mac80211-injection.rst
@@ -1,16 +1,19 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
How to use packet injection with mac80211
=========================================
mac80211 now allows arbitrary packets to be injected down any Monitor Mode
interface from userland. The packet you inject needs to be composed in the
-following format:
+following format::
[ radiotap header ]
[ ieee80211 header ]
[ payload ]
The radiotap format is discussed in
-./Documentation/networking/radiotap-headers.txt.
+./Documentation/networking/radiotap-headers.rst.
Despite many radiotap parameters being currently defined, most only make sense
to appear on received packets. The following information is parsed from the
@@ -18,15 +21,19 @@
* IEEE80211_RADIOTAP_FLAGS
- IEEE80211_RADIOTAP_F_FCS: FCS will be removed and recalculated
- IEEE80211_RADIOTAP_F_WEP: frame will be encrypted if key available
- IEEE80211_RADIOTAP_F_FRAG: frame will be fragmented if longer than the
+ ========================= ===========================================
+ IEEE80211_RADIOTAP_F_FCS FCS will be removed and recalculated
+ IEEE80211_RADIOTAP_F_WEP frame will be encrypted if key available
+ IEEE80211_RADIOTAP_F_FRAG frame will be fragmented if longer than the
current fragmentation threshold.
+ ========================= ===========================================
* IEEE80211_RADIOTAP_TX_FLAGS
- IEEE80211_RADIOTAP_F_TX_NOACK: frame should be sent without waiting for
+ ============================= ========================================
+ IEEE80211_RADIOTAP_F_TX_NOACK frame should be sent without waiting for
an ACK even if it is a unicast frame
+ ============================= ========================================
* IEEE80211_RADIOTAP_RATE
@@ -37,8 +44,10 @@
HT rate for the transmission (only for devices without own rate control).
Also some flags are parsed
- IEEE80211_RADIOTAP_MCS_SGI: use short guard interval
- IEEE80211_RADIOTAP_MCS_BW_40: send in HT40 mode
+ ============================ ========================
+ IEEE80211_RADIOTAP_MCS_SGI use short guard interval
+ IEEE80211_RADIOTAP_MCS_BW_40 send in HT40 mode
+ ============================ ========================
* IEEE80211_RADIOTAP_DATA_RETRIES
@@ -51,17 +60,17 @@
without own rate control). Also other fields are parsed
flags field
- IEEE80211_RADIOTAP_VHT_FLAG_SGI: use short guard interval
+ IEEE80211_RADIOTAP_VHT_FLAG_SGI: use short guard interval
bandwidth field
- 1: send using 40MHz channel width
- 4: send using 80MHz channel width
- 11: send using 160MHz channel width
+ * 1: send using 40MHz channel width
+ * 4: send using 80MHz channel width
+ * 11: send using 160MHz channel width
The injection code can also skip all other currently defined radiotap fields
facilitating replay of captured radiotap headers directly.
-Here is an example valid radiotap header defining some parameters
+Here is an example valid radiotap header defining some parameters::
0x00, 0x00, // <-- radiotap version
0x0b, 0x00, // <- radiotap header length
@@ -71,7 +80,7 @@
0x01 //<-- antenna
The ieee80211 header follows immediately afterwards, looking for example like
-this:
+this::
0x08, 0x01, 0x00, 0x00,
0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
@@ -84,10 +93,10 @@
After composing the packet contents, it is sent by send()-ing it to a logical
mac80211 interface that is in Monitor mode. Libpcap can also be used,
(which is easier than doing the work to bind the socket to the right
-interface), along the following lines:
+interface), along the following lines:::
ppcap = pcap_open_live(szInterfaceName, 800, 1, 20, szErrbuf);
-...
+ ...
r = pcap_inject(ppcap, u8aSendBuffer, nLength);
You can also find a link to a complete inject application here:
diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.rst
similarity index 82%
rename from Documentation/networking/mpls-sysctl.txt
rename to Documentation/networking/mpls-sysctl.rst
index 025cc9b..0a2ac88 100644
--- a/Documentation/networking/mpls-sysctl.txt
+++ b/Documentation/networking/mpls-sysctl.rst
@@ -1,4 +1,11 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+MPLS Sysfs variables
+====================
+
/proc/sys/net/mpls/* Variables:
+===============================
platform_labels - INTEGER
Number of entries in the platform label table. It is not
@@ -17,6 +24,7 @@
no longer fit in the table.
Possible values: 0 - 1048575
+
Default: 0
ip_ttl_propagate - BOOL
@@ -27,8 +35,8 @@
If disabled, the MPLS transport network will appear as a
single hop to transit traffic.
- 0 - disabled / RFC 3443 [Short] Pipe Model
- 1 - enabled / RFC 3443 Uniform Model (default)
+ * 0 - disabled / RFC 3443 [Short] Pipe Model
+ * 1 - enabled / RFC 3443 Uniform Model (default)
default_ttl - INTEGER
Default TTL value to use for MPLS packets where it cannot be
@@ -36,6 +44,7 @@
or ip_ttl_propagate has been disabled.
Possible values: 1 - 255
+
Default: 255
conf/<interface>/input - BOOL
@@ -44,5 +53,5 @@
If disabled, packets will be discarded without further
processing.
- 0 - disabled (default)
- not 0 - enabled
+ * 0 - disabled (default)
+ * not 0 - enabled
diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.rst
similarity index 74%
rename from Documentation/networking/multiqueue.txt
rename to Documentation/networking/multiqueue.rst
index 4caa0e3..0a57616 100644
--- a/Documentation/networking/multiqueue.txt
+++ b/Documentation/networking/multiqueue.rst
@@ -1,17 +1,17 @@
+.. SPDX-License-Identifier: GPL-2.0
- HOWTO for multiqueue network device support
- ===========================================
+===========================================
+HOWTO for multiqueue network device support
+===========================================
Section 1: Base driver requirements for implementing multiqueue support
+=======================================================================
Intro: Kernel support for multiqueue devices
---------------------------------------------------------
Kernel support for multiqueue devices is always present.
-Section 1: Base driver requirements for implementing multiqueue support
------------------------------------------------------------------------
-
Base drivers are required to use the new alloc_etherdev_mq() or
alloc_netdev_mq() functions to allocate the subqueues for the device. The
underlying kernel API will take care of the allocation and deallocation of
@@ -26,8 +26,7 @@
Section 2: Qdisc support for multiqueue devices
-
------------------------------------------------
+===============================================
Currently two qdiscs are optimized for multiqueue devices. The first is the
default pfifo_fast qdisc. This qdisc supports one qdisc per hardware queue.
@@ -46,22 +45,22 @@
Section 3: Brief howto using MULTIQ for multiqueue devices
----------------------------------------------------------------
+==========================================================
The userspace command 'tc,' part of the iproute2 package, is used to configure
qdiscs. To add the MULTIQ qdisc to your network device, assuming the device
-is called eth0, run the following command:
+is called eth0, run the following command::
-# tc qdisc add dev eth0 root handle 1: multiq
+ # tc qdisc add dev eth0 root handle 1: multiq
The qdisc will allocate the number of bands to equal the number of queues that
the device reports, and bring the qdisc online. Assuming eth0 has 4 Tx
-queues, the band mapping would look like:
+queues, the band mapping would look like::
-band 0 => queue 0
-band 1 => queue 1
-band 2 => queue 2
-band 3 => queue 3
+ band 0 => queue 0
+ band 1 => queue 1
+ band 2 => queue 2
+ band 3 => queue 3
Traffic will begin flowing through each queue based on either the simple_tx_hash
function or based on netdev->select_queue() if you have it defined.
@@ -69,11 +68,11 @@
The behavior of tc filters remains the same. However a new tc action,
skbedit, has been added. Assuming you wanted to route all traffic to a
specific host, for example 192.168.0.3, through a specific queue you could use
-this action and establish a filter such as:
+this action and establish a filter such as::
-tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
- match ip dst 192.168.0.3 \
- action skbedit queue_mapping 3
+ tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
+ match ip dst 192.168.0.3 \
+ action skbedit queue_mapping 3
-Author: Alexander Duyck <alexander.h.duyck@intel.com>
-Original Author: Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
+:Author: Alexander Duyck <alexander.h.duyck@intel.com>
+:Original Author: Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
diff --git a/Documentation/networking/netconsole.txt b/Documentation/networking/netconsole.rst
similarity index 65%
rename from Documentation/networking/netconsole.txt
rename to Documentation/networking/netconsole.rst
index 296ea00..1f5c4a0 100644
--- a/Documentation/networking/netconsole.txt
+++ b/Documentation/networking/netconsole.rst
@@ -1,7 +1,16 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+Netconsole
+==========
+
started by Ingo Molnar <mingo@redhat.com>, 2001.09.17
+
2.6 port and netpoll api by Matt Mackall <mpm@selenic.com>, Sep 9 2003
+
IPv6 support by Cong Wang <xiyou.wangcong@gmail.com>, Jan 1 2013
+
Extended console support by Tejun Heo <tj@kernel.org>, May 1 2015
Please send bug reports to Matt Mackall <mpm@selenic.com>
@@ -23,34 +32,34 @@
==================================
It takes a string configuration parameter "netconsole" in the
-following format:
+following format::
netconsole=[+][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]
where
- + if present, enable extended console support
- src-port source for UDP packets (defaults to 6665)
- src-ip source IP to use (interface address)
- dev network interface (eth0)
- tgt-port port for logging agent (6666)
- tgt-ip IP address for logging agent
- tgt-macaddr ethernet MAC address for logging agent (broadcast)
+ + if present, enable extended console support
+ src-port source for UDP packets (defaults to 6665)
+ src-ip source IP to use (interface address)
+ dev network interface (eth0)
+ tgt-port port for logging agent (6666)
+ tgt-ip IP address for logging agent
+ tgt-macaddr ethernet MAC address for logging agent (broadcast)
-Examples:
+Examples::
linux netconsole=4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc
- or
+or::
insmod netconsole netconsole=@/,@10.0.0.2/
- or using IPv6
+or using IPv6::
insmod netconsole netconsole=@/,@fd00:1:2:3::1/
It also supports logging to multiple remote agents by specifying
parameters for the multiple agents separated by semicolons and the
-complete string enclosed in "quotes", thusly:
+complete string enclosed in "quotes", thusly::
modprobe netconsole netconsole="@/,@10.0.0.2/;@/eth1,6892@10.0.0.3/"
@@ -67,14 +76,19 @@
On distributions using a BSD-based netcat version (e.g. Fedora,
openSUSE and Ubuntu) the listening port must be specified without
- the -p switch:
+ the -p switch::
- 'nc -u -l -p <port>' / 'nc -u -l <port>' or
- 'netcat -u -l -p <port>' / 'netcat -u -l <port>'
+ nc -u -l -p <port>' / 'nc -u -l <port>
+
+ or::
+
+ netcat -u -l -p <port>' / 'netcat -u -l <port>
3) socat
- 'socat udp-recv:<port> -'
+::
+
+ socat udp-recv:<port> -
Dynamic reconfiguration:
========================
@@ -92,7 +106,7 @@
Some examples follow (where configfs is mounted at the /sys/kernel/config
mountpoint).
-To add a remote logging target (target names can be arbitrary):
+To add a remote logging target (target names can be arbitrary)::
cd /sys/kernel/config/netconsole/
mkdir target1
@@ -102,12 +116,13 @@
"1" to the "enabled" attribute (usually after setting parameters accordingly)
as described below.
-To remove a target:
+To remove a target::
rmdir /sys/kernel/config/netconsole/othertarget/
The interface exposes these parameters of a netconsole target to userspace:
+ ============== ================================= ============
enabled Is this target currently enabled? (read-write)
extended Extended mode enabled (read-write)
dev_name Local network interface name (read-write)
@@ -117,12 +132,13 @@
remote_ip Remote agent's IP address (read-write)
local_mac Local interface's MAC address (read-only)
remote_mac Remote agent's MAC address (read-write)
+ ============== ================================= ============
The "enabled" attribute is also used to control whether the parameters of
a target can be updated or not -- you can modify the parameters of only
disabled targets (i.e. if "enabled" is 0).
-To update a target's parameters:
+To update a target's parameters::
cat enabled # check if enabled is 1
echo 0 > enabled # disable the target (if required)
@@ -140,12 +156,12 @@
If '+' is prefixed to the configuration line or "extended" config file
is set to 1, extended console support is enabled. An example boot
-param follows.
+param follows::
linux netconsole=+4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc
Log messages are transmitted with extended metadata header in the
-following format which is the same as /dev/kmsg.
+following format which is the same as /dev/kmsg::
<level>,<sequnum>,<timestamp>,<contflag>;<message text>
@@ -155,12 +171,12 @@
If a message doesn't fit in certain number of bytes (currently 1000),
the message is split into multiple fragments by netconsole. These
-fragments are transmitted with "ncfrag" header field added.
+fragments are transmitted with "ncfrag" header field added::
ncfrag=<byte-offset>/<total-bytes>
For example, assuming a lot smaller chunk size, a message "the first
-chunk, the 2nd chunk." may be split as follows.
+chunk, the 2nd chunk." may be split as follows::
6,416,1758426,-,ncfrag=0/31;the first chunk,
6,416,1758426,-,ncfrag=16/31; the 2nd chunk.
@@ -168,39 +184,52 @@
Miscellaneous notes:
====================
-WARNING: the default target ethernet setting uses the broadcast
-ethernet address to send packets, which can cause increased load on
-other systems on the same ethernet segment.
+.. Warning::
-TIP: some LAN switches may be configured to suppress ethernet broadcasts
-so it is advised to explicitly specify the remote agents' MAC addresses
-from the config parameters passed to netconsole.
+ the default target ethernet setting uses the broadcast
+ ethernet address to send packets, which can cause increased load on
+ other systems on the same ethernet segment.
-TIP: to find out the MAC address of, say, 10.0.0.2, you may try using:
+.. Tip::
- ping -c 1 10.0.0.2 ; /sbin/arp -n | grep 10.0.0.2
+ some LAN switches may be configured to suppress ethernet broadcasts
+ so it is advised to explicitly specify the remote agents' MAC addresses
+ from the config parameters passed to netconsole.
-TIP: in case the remote logging agent is on a separate LAN subnet than
-the sender, it is suggested to try specifying the MAC address of the
-default gateway (you may use /sbin/route -n to find it out) as the
-remote MAC address instead.
+.. Tip::
-NOTE: the network device (eth1 in the above case) can run any kind
-of other network traffic, netconsole is not intrusive. Netconsole
-might cause slight delays in other traffic if the volume of kernel
-messages is high, but should have no other impact.
+ to find out the MAC address of, say, 10.0.0.2, you may try using::
-NOTE: if you find that the remote logging agent is not receiving or
-printing all messages from the sender, it is likely that you have set
-the "console_loglevel" parameter (on the sender) to only send high
-priority messages to the console. You can change this at runtime using:
+ ping -c 1 10.0.0.2 ; /sbin/arp -n | grep 10.0.0.2
- dmesg -n 8
+.. Tip::
-or by specifying "debug" on the kernel command line at boot, to send
-all kernel messages to the console. A specific value for this parameter
-can also be set using the "loglevel" kernel boot option. See the
-dmesg(8) man page and Documentation/admin-guide/kernel-parameters.rst for details.
+ in case the remote logging agent is on a separate LAN subnet than
+ the sender, it is suggested to try specifying the MAC address of the
+ default gateway (you may use /sbin/route -n to find it out) as the
+ remote MAC address instead.
+
+.. note::
+
+ the network device (eth1 in the above case) can run any kind
+ of other network traffic, netconsole is not intrusive. Netconsole
+ might cause slight delays in other traffic if the volume of kernel
+ messages is high, but should have no other impact.
+
+.. note::
+
+ if you find that the remote logging agent is not receiving or
+ printing all messages from the sender, it is likely that you have set
+ the "console_loglevel" parameter (on the sender) to only send high
+ priority messages to the console. You can change this at runtime using::
+
+ dmesg -n 8
+
+ or by specifying "debug" on the kernel command line at boot, to send
+ all kernel messages to the console. A specific value for this parameter
+ can also be set using the "loglevel" kernel boot option. See the
+ dmesg(8) man page and Documentation/admin-guide/kernel-parameters.rst
+ for details.
Netconsole was designed to be as instantaneous as possible, to
enable the logging of even the most critical kernel bugs. It works
diff --git a/Documentation/networking/netdev-features.txt b/Documentation/networking/netdev-features.rst
similarity index 95%
rename from Documentation/networking/netdev-features.txt
rename to Documentation/networking/netdev-features.rst
index 58dd1c1..a2d7d71 100644
--- a/Documentation/networking/netdev-features.txt
+++ b/Documentation/networking/netdev-features.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================================
Netdev features mess and how to get out from it alive
=====================================================
@@ -6,8 +9,8 @@
- Part I: Feature sets
-======================
+Part I: Feature sets
+====================
Long gone are the days when a network card would just take and give packets
verbatim. Today's devices add multiple features and bugs (read: offloads)
@@ -39,8 +42,8 @@
- Part II: Controlling enabled features
-=======================================
+Part II: Controlling enabled features
+=====================================
When current feature set (netdev->features) is to be changed, new set
is calculated and filtered by calling ndo_fix_features callback
@@ -65,8 +68,8 @@
- Part III: Implementation hints
-================================
+Part III: Implementation hints
+==============================
* ndo_fix_features:
@@ -94,8 +97,8 @@
- Part IV: Features
-===================
+Part IV: Features
+=================
For current list of features, see include/linux/netdev_features.h.
This section describes semantics of some of them.
diff --git a/Documentation/networking/netdevices.txt b/Documentation/networking/netdevices.rst
similarity index 89%
rename from Documentation/networking/netdevices.txt
rename to Documentation/networking/netdevices.rst
index 7fec206..5a85fcc 100644
--- a/Documentation/networking/netdevices.txt
+++ b/Documentation/networking/netdevices.rst
@@ -1,5 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+=====================================
Network Devices, the Kernel, and You!
+=====================================
Introduction
@@ -75,11 +78,12 @@
Don't use it for new drivers.
Context: Process with BHs disabled or BH (timer),
- will be called with interrupts disabled by netconsole.
+ will be called with interrupts disabled by netconsole.
- Return codes:
- o NETDEV_TX_OK everything ok.
- o NETDEV_TX_BUSY Cannot transmit packet, try later
+ Return codes:
+
+ * NETDEV_TX_OK everything ok.
+ * NETDEV_TX_BUSY Cannot transmit packet, try later
Usually a bug, means queue start/stop flow control is broken in
the driver. Note: the driver must NOT put the skb in its DMA ring.
@@ -95,10 +99,13 @@
struct napi_struct synchronization rules
========================================
napi->poll:
- Synchronization: NAPI_STATE_SCHED bit in napi->state. Device
+ Synchronization:
+ NAPI_STATE_SCHED bit in napi->state. Device
driver's ndo_stop method will invoke napi_disable() on
all NAPI instances which will do a sleeping poll on the
NAPI_STATE_SCHED napi->state bit, waiting for all pending
NAPI activity to cease.
- Context: softirq
- will be called with interrupts disabled by netconsole.
+
+ Context:
+ softirq
+ will be called with interrupts disabled by netconsole.
diff --git a/Documentation/networking/netfilter-sysctl.txt b/Documentation/networking/netfilter-sysctl.rst
similarity index 62%
rename from Documentation/networking/netfilter-sysctl.txt
rename to Documentation/networking/netfilter-sysctl.rst
index 55791e5..beb6d7b 100644
--- a/Documentation/networking/netfilter-sysctl.txt
+++ b/Documentation/networking/netfilter-sysctl.rst
@@ -1,8 +1,15 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+Netfilter Sysfs variables
+=========================
+
/proc/sys/net/netfilter/* Variables:
+====================================
nf_log_all_netns - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
+ - 0 - disabled (default)
+ - not 0 - enabled
By default, only init_net namespace can log packets into kernel log
with LOG target; this aims to prevent containers from flooding host
diff --git a/Documentation/networking/netif-msg.rst b/Documentation/networking/netif-msg.rst
new file mode 100644
index 0000000..b20d265
--- /dev/null
+++ b/Documentation/networking/netif-msg.rst
@@ -0,0 +1,95 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+NETIF Msg Level
+===============
+
+The design of the network interface message level setting.
+
+History
+-------
+
+ The design of the debugging message interface was guided and
+ constrained by backwards compatibility previous practice. It is useful
+ to understand the history and evolution in order to understand current
+ practice and relate it to older driver source code.
+
+ From the beginning of Linux, each network device driver has had a local
+ integer variable that controls the debug message level. The message
+ level ranged from 0 to 7, and monotonically increased in verbosity.
+
+ The message level was not precisely defined past level 3, but were
+ always implemented within +-1 of the specified level. Drivers tended
+ to shed the more verbose level messages as they matured.
+
+ - 0 Minimal messages, only essential information on fatal errors.
+ - 1 Standard messages, initialization status. No run-time messages
+ - 2 Special media selection messages, generally timer-driver.
+ - 3 Interface starts and stops, including normal status messages
+ - 4 Tx and Rx frame error messages, and abnormal driver operation
+ - 5 Tx packet queue information, interrupt events.
+ - 6 Status on each completed Tx packet and received Rx packets
+ - 7 Initial contents of Tx and Rx packets
+
+ Initially this message level variable was uniquely named in each driver
+ e.g. "lance_debug", so that a kernel symbolic debugger could locate and
+ modify the setting. When kernel modules became common, the variables
+ were consistently renamed to "debug" and allowed to be set as a module
+ parameter.
+
+ This approach worked well. However there is always a demand for
+ additional features. Over the years the following emerged as
+ reasonable and easily implemented enhancements
+
+ - Using an ioctl() call to modify the level.
+ - Per-interface rather than per-driver message level setting.
+ - More selective control over the type of messages emitted.
+
+ The netif_msg recommendation adds these features with only a minor
+ complexity and code size increase.
+
+ The recommendation is the following points
+
+ - Retaining the per-driver integer variable "debug" as a module
+ parameter with a default level of '1'.
+
+ - Adding a per-interface private variable named "msg_enable". The
+ variable is a bit map rather than a level, and is initialized as::
+
+ 1 << debug
+
+ Or more precisely::
+
+ debug < 0 ? 0 : 1 << min(sizeof(int)-1, debug)
+
+ Messages should changes from::
+
+ if (debug > 1)
+ printk(MSG_DEBUG "%s: ...
+
+ to::
+
+ if (np->msg_enable & NETIF_MSG_LINK)
+ printk(MSG_DEBUG "%s: ...
+
+
+The set of message levels is named
+
+
+ ========= =================== ============
+ Old level Name Bit position
+ ========= =================== ============
+ 0 NETIF_MSG_DRV 0x0001
+ 1 NETIF_MSG_PROBE 0x0002
+ 2 NETIF_MSG_LINK 0x0004
+ 2 NETIF_MSG_TIMER 0x0004
+ 3 NETIF_MSG_IFDOWN 0x0008
+ 3 NETIF_MSG_IFUP 0x0008
+ 4 NETIF_MSG_RX_ERR 0x0010
+ 4 NETIF_MSG_TX_ERR 0x0010
+ 5 NETIF_MSG_TX_QUEUED 0x0020
+ 5 NETIF_MSG_INTR 0x0020
+ 6 NETIF_MSG_TX_DONE 0x0040
+ 6 NETIF_MSG_RX_STATUS 0x0040
+ 7 NETIF_MSG_PKTDATA 0x0080
+ ========= =================== ============
diff --git a/Documentation/networking/netif-msg.txt b/Documentation/networking/netif-msg.txt
deleted file mode 100644
index c967ddb..0000000
--- a/Documentation/networking/netif-msg.txt
+++ /dev/null
@@ -1,79 +0,0 @@
-
-________________
-NETIF Msg Level
-
-The design of the network interface message level setting.
-
-History
-
- The design of the debugging message interface was guided and
- constrained by backwards compatibility previous practice. It is useful
- to understand the history and evolution in order to understand current
- practice and relate it to older driver source code.
-
- From the beginning of Linux, each network device driver has had a local
- integer variable that controls the debug message level. The message
- level ranged from 0 to 7, and monotonically increased in verbosity.
-
- The message level was not precisely defined past level 3, but were
- always implemented within +-1 of the specified level. Drivers tended
- to shed the more verbose level messages as they matured.
- 0 Minimal messages, only essential information on fatal errors.
- 1 Standard messages, initialization status. No run-time messages
- 2 Special media selection messages, generally timer-driver.
- 3 Interface starts and stops, including normal status messages
- 4 Tx and Rx frame error messages, and abnormal driver operation
- 5 Tx packet queue information, interrupt events.
- 6 Status on each completed Tx packet and received Rx packets
- 7 Initial contents of Tx and Rx packets
-
- Initially this message level variable was uniquely named in each driver
- e.g. "lance_debug", so that a kernel symbolic debugger could locate and
- modify the setting. When kernel modules became common, the variables
- were consistently renamed to "debug" and allowed to be set as a module
- parameter.
-
- This approach worked well. However there is always a demand for
- additional features. Over the years the following emerged as
- reasonable and easily implemented enhancements
- Using an ioctl() call to modify the level.
- Per-interface rather than per-driver message level setting.
- More selective control over the type of messages emitted.
-
- The netif_msg recommendation adds these features with only a minor
- complexity and code size increase.
-
- The recommendation is the following points
- Retaining the per-driver integer variable "debug" as a module
- parameter with a default level of '1'.
-
- Adding a per-interface private variable named "msg_enable". The
- variable is a bit map rather than a level, and is initialized as
- 1 << debug
- Or more precisely
- debug < 0 ? 0 : 1 << min(sizeof(int)-1, debug)
-
- Messages should changes from
- if (debug > 1)
- printk(MSG_DEBUG "%s: ...
- to
- if (np->msg_enable & NETIF_MSG_LINK)
- printk(MSG_DEBUG "%s: ...
-
-
-The set of message levels is named
- Old level Name Bit position
- 0 NETIF_MSG_DRV 0x0001
- 1 NETIF_MSG_PROBE 0x0002
- 2 NETIF_MSG_LINK 0x0004
- 2 NETIF_MSG_TIMER 0x0004
- 3 NETIF_MSG_IFDOWN 0x0008
- 3 NETIF_MSG_IFUP 0x0008
- 4 NETIF_MSG_RX_ERR 0x0010
- 4 NETIF_MSG_TX_ERR 0x0010
- 5 NETIF_MSG_TX_QUEUED 0x0020
- 5 NETIF_MSG_INTR 0x0020
- 6 NETIF_MSG_TX_DONE 0x0040
- 6 NETIF_MSG_RX_STATUS 0x0040
- 7 NETIF_MSG_PKTDATA 0x0080
-
diff --git a/Documentation/networking/nf_conntrack-sysctl.txt b/Documentation/networking/nf_conntrack-sysctl.rst
similarity index 84%
rename from Documentation/networking/nf_conntrack-sysctl.txt
rename to Documentation/networking/nf_conntrack-sysctl.rst
index f75c2ce..11a9b76 100644
--- a/Documentation/networking/nf_conntrack-sysctl.txt
+++ b/Documentation/networking/nf_conntrack-sysctl.rst
@@ -1,8 +1,15 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Netfilter Conntrack Sysfs variables
+===================================
+
/proc/sys/net/netfilter/nf_conntrack_* Variables:
+=================================================
nf_conntrack_acct - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
+ - 0 - disabled (default)
+ - not 0 - enabled
Enable connection tracking flow accounting. 64-bit byte and packet
counters per flow are added.
@@ -16,8 +23,8 @@
This sysctl is only writeable in the initial net namespace.
nf_conntrack_checksum - BOOLEAN
- 0 - disabled
- not 0 - enabled (default)
+ - 0 - disabled
+ - not 0 - enabled (default)
Verify checksum of incoming packets. Packets with bad checksums are
in INVALID state. If this is enabled, such packets will not be
@@ -27,8 +34,8 @@
Number of currently allocated flow entries.
nf_conntrack_events - BOOLEAN
- 0 - disabled
- not 0 - enabled (default)
+ - 0 - disabled
+ - not 0 - enabled (default)
If this option is enabled, the connection tracking code will
provide userspace with connection tracking events via ctnetlink.
@@ -62,8 +69,8 @@
protocols.
nf_conntrack_helper - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
+ - 0 - disabled (default)
+ - not 0 - enabled
Enable automatic conntrack helper assignment.
If disabled it is required to set up iptables rules to assign
@@ -81,14 +88,14 @@
Default for ICMP6 timeout.
nf_conntrack_log_invalid - INTEGER
- 0 - disable (default)
- 1 - log ICMP packets
- 6 - log TCP packets
- 17 - log UDP packets
- 33 - log DCCP packets
- 41 - log ICMPv6 packets
- 136 - log UDPLITE packets
- 255 - log packets of any protocol
+ - 0 - disable (default)
+ - 1 - log ICMP packets
+ - 6 - log TCP packets
+ - 17 - log UDP packets
+ - 33 - log DCCP packets
+ - 41 - log ICMPv6 packets
+ - 136 - log UDPLITE packets
+ - 255 - log packets of any protocol
Log invalid packets of a type specified by value.
@@ -97,15 +104,15 @@
nf_conntrack_buckets value * 4.
nf_conntrack_tcp_be_liberal - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
+ - 0 - disabled (default)
+ - not 0 - enabled
Be conservative in what you do, be liberal in what you accept from others.
If it's non-zero, we mark only out of window RST segments as INVALID.
nf_conntrack_tcp_loose - BOOLEAN
- 0 - disabled
- not 0 - enabled (default)
+ - 0 - disabled
+ - not 0 - enabled (default)
If it is set to zero, we disable picking up already established
connections.
@@ -148,8 +155,8 @@
default 300
nf_conntrack_timestamp - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
+ - 0 - disabled (default)
+ - not 0 - enabled
Enable connection tracking flow timestamping.
diff --git a/Documentation/networking/nf_flowtable.txt b/Documentation/networking/nf_flowtable.rst
similarity index 76%
rename from Documentation/networking/nf_flowtable.txt
rename to Documentation/networking/nf_flowtable.rst
index 0bf32d1..b6e1fa1 100644
--- a/Documentation/networking/nf_flowtable.txt
+++ b/Documentation/networking/nf_flowtable.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================
Netfilter's flowtable infrastructure
====================================
@@ -31,15 +34,17 @@
This is represented in Fig.1, which describes the classic forwarding path
including the Netfilter hooks and the flowtable fastpath bypass.
- userspace process
- ^ |
- | |
- _____|____ ____\/___
- / \ / \
- | input | | output |
- \__________/ \_________/
- ^ |
- | |
+::
+
+ userspace process
+ ^ |
+ | |
+ _____|____ ____\/___
+ / \ / \
+ | input | | output |
+ \__________/ \_________/
+ ^ |
+ | |
_________ __________ --------- _____\/_____
/ \ / \ |Routing | / \
--> ingress ---> prerouting ---> |decision| | postrouting |--> neigh_xmit
@@ -59,7 +64,7 @@
\ / |
|__yes_________________fastpath bypass ____________________________|
- Fig.1 Netfilter hooks and flowtable interactions
+ Fig.1 Netfilter hooks and flowtable interactions
The flowtable entry also stores the NAT configuration, so all packets are
mangled according to the NAT policy that matches the initial packets that went
@@ -72,18 +77,18 @@
---------------------
Enabling the flowtable bypass is relatively easy, you only need to create a
-flowtable and add one rule to your forward chain.
+flowtable and add one rule to your forward chain::
- table inet x {
+ table inet x {
flowtable f {
hook ingress priority 0; devices = { eth0, eth1 };
}
- chain y {
- type filter hook forward priority 0; policy accept;
- ip protocol tcp flow offload @f
- counter packets 0 bytes 0
- }
- }
+ chain y {
+ type filter hook forward priority 0; policy accept;
+ ip protocol tcp flow offload @f
+ counter packets 0 bytes 0
+ }
+ }
This example adds the flowtable 'f' to the ingress hook of the eth0 and eth1
netdevices. You can create as many flowtables as you want in case you need to
@@ -101,12 +106,12 @@
More reading
------------
-This documentation is based on the LWN.net articles [1][2]. Rafal Milecki also
-made a very complete and comprehensive summary called "A state of network
+This documentation is based on the LWN.net articles [1]_\ [2]_. Rafal Milecki
+also made a very complete and comprehensive summary called "A state of network
acceleration" that describes how things were before this infrastructure was
-mailined [3] and it also makes a rough summary of this work [4].
+mailined [3]_ and it also makes a rough summary of this work [4]_.
-[1] https://lwn.net/Articles/738214/
-[2] https://lwn.net/Articles/742164/
-[3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html
-[4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html
+.. [1] https://lwn.net/Articles/738214/
+.. [2] https://lwn.net/Articles/742164/
+.. [3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html
+.. [4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html
diff --git a/Documentation/networking/openvswitch.txt b/Documentation/networking/openvswitch.rst
similarity index 95%
rename from Documentation/networking/openvswitch.txt
rename to Documentation/networking/openvswitch.rst
index b3b9ac6..1a8353d 100644
--- a/Documentation/networking/openvswitch.txt
+++ b/Documentation/networking/openvswitch.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================================
Open vSwitch datapath developer documentation
=============================================
@@ -80,13 +83,13 @@
flow key attributes. For informal explanatory purposes here, we write
them as comma-separated strings, with parentheses indicating arguments
and nesting. For example, the following could represent a flow key
-corresponding to a TCP packet that arrived on vport 1:
+corresponding to a TCP packet that arrived on vport 1::
in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=17, tos=0,
frag=no), tcp(src=49163, dst=80)
-Often we ellipsize arguments not important to the discussion, e.g.:
+Often we ellipsize arguments not important to the discussion, e.g.::
in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)
@@ -151,20 +154,20 @@
compatibility for applications that follow the rules listed under
"Flow key compatibility" above.
-The basic rule is obvious:
+The basic rule is obvious::
- ------------------------------------------------------------------
+ ==================================================================
New network protocol support must only supplement existing flow
key attributes. It must not change the meaning of already defined
flow key attributes.
- ------------------------------------------------------------------
+ ==================================================================
This rule does have less-obvious consequences so it is worth working
through a few examples. Suppose, for example, that the kernel module
did not already implement VLAN parsing. Instead, it just interpreted
the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the
packet. The flow key for any packet with an 802.1Q header would look
-essentially like this, ignoring metadata:
+essentially like this, ignoring metadata::
eth(...), eth_type(0x8100)
@@ -172,7 +175,7 @@
key attribute to contain the VLAN tag, then continue to decode the
encapsulated headers beyond the VLAN tag using the existing field
definitions. With this change, a TCP packet in VLAN 10 would have a
-flow key much like this:
+flow key much like this::
eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)
@@ -187,7 +190,7 @@
The solution is to use a set of nested attributes. This is, for
example, why 802.1Q support uses nested attributes. A TCP packet in
-VLAN 10 is actually expressed as:
+VLAN 10 is actually expressed as::
eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
ip(proto=6, ...), tcp(...)))
@@ -215,14 +218,14 @@
indicates protocol 6 for TCP, but which is truncated just after the IP
header, so that the TCP header is missing. The flow key for this
packet would include a tcp attribute with all-zero src and dst, like
-this:
+this::
eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)
As another example, consider a packet with an Ethernet type of 0x8100,
indicating that a VLAN TCI should follow, but which is truncated just
after the Ethernet type. The flow key for this packet would include
-an all-zero-bits vlan and an empty encap attribute, like this:
+an all-zero-bits vlan and an empty encap attribute, like this::
eth(...), eth_type(0x8100), vlan(0), encap()
diff --git a/Documentation/networking/operstates.txt b/Documentation/networking/operstates.rst
similarity index 87%
rename from Documentation/networking/operstates.txt
rename to Documentation/networking/operstates.rst
index b203d13..9c918f7 100644
--- a/Documentation/networking/operstates.txt
+++ b/Documentation/networking/operstates.rst
@@ -1,5 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+Operational States
+==================
+
1. Introduction
+===============
Linux distinguishes between administrative and operational state of an
interface. Administrative state is the result of "ip link set dev
@@ -20,6 +27,7 @@
2. Querying from userspace
+==========================
Both admin and operational state can be queried via the netlink
operation RTM_GETLINK. It is also possible to subscribe to RTNLGRP_LINK
@@ -30,16 +38,20 @@
ifinfomsg::if_flags & IFF_UP:
Interface is admin up
+
ifinfomsg::if_flags & IFF_RUNNING:
Interface is in RFC2863 operational state UP or UNKNOWN. This is for
backward compatibility, routing daemons, dhcp clients can use this
flag to determine whether they should use the interface.
+
ifinfomsg::if_flags & IFF_LOWER_UP:
Driver has signaled netif_carrier_on()
+
ifinfomsg::if_flags & IFF_DORMANT:
Driver has signaled netif_dormant_on()
TLV IFLA_OPERSTATE
+------------------
contains RFC2863 state of the interface in numeric representation:
@@ -47,26 +59,33 @@
Interface is in unknown state, neither driver nor userspace has set
operational state. Interface must be considered for user data as
setting operational state has not been implemented in every driver.
+
IF_OPER_NOTPRESENT (1):
Unused in current kernel (notpresent interfaces normally disappear),
just a numerical placeholder.
+
IF_OPER_DOWN (2):
Interface is unable to transfer data on L1, f.e. ethernet is not
plugged or interface is ADMIN down.
+
IF_OPER_LOWERLAYERDOWN (3):
Interfaces stacked on an interface that is IF_OPER_DOWN show this
state (f.e. VLAN).
+
IF_OPER_TESTING (4):
Unused in current kernel.
+
IF_OPER_DORMANT (5):
Interface is L1 up, but waiting for an external event, f.e. for a
protocol to establish. (802.1X)
+
IF_OPER_UP (6):
Interface is operational up and can be used.
This TLV can also be queried via sysfs.
TLV IFLA_LINKMODE
+-----------------
contains link policy. This is needed for userspace interaction
described below.
@@ -75,6 +94,7 @@
3. Kernel driver API
+====================
Kernel drivers have access to two flags that map to IFF_LOWER_UP and
IFF_DORMANT. These flags can be set from everywhere, even from
@@ -126,6 +146,7 @@
4. Setting from userspace
+=========================
Applications have to use the netlink interface to influence the
RFC2863 operational state of an interface. Setting IFLA_LINKMODE to 1
@@ -139,18 +160,18 @@
So basically a 802.1X supplicant interacts with the kernel like this:
--subscribe to RTNLGRP_LINK
--set IFLA_LINKMODE to 1 via RTM_SETLINK
--query RTM_GETLINK once to get initial state
--if initial flags are not (IFF_LOWER_UP && !IFF_DORMANT), wait until
- netlink multicast signals this state
--do 802.1X, eventually abort if flags go down again
--send RTM_SETLINK to set operstate to IF_OPER_UP if authentication
- succeeds, IF_OPER_DORMANT otherwise
--see how operstate and IFF_RUNNING is echoed via netlink multicast
--set interface back to IF_OPER_DORMANT if 802.1X reauthentication
- fails
--restart if kernel changes IFF_LOWER_UP or IFF_DORMANT flag
+- subscribe to RTNLGRP_LINK
+- set IFLA_LINKMODE to 1 via RTM_SETLINK
+- query RTM_GETLINK once to get initial state
+- if initial flags are not (IFF_LOWER_UP && !IFF_DORMANT), wait until
+ netlink multicast signals this state
+- do 802.1X, eventually abort if flags go down again
+- send RTM_SETLINK to set operstate to IF_OPER_UP if authentication
+ succeeds, IF_OPER_DORMANT otherwise
+- see how operstate and IFF_RUNNING is echoed via netlink multicast
+- set interface back to IF_OPER_DORMANT if 802.1X reauthentication
+ fails
+- restart if kernel changes IFF_LOWER_UP or IFF_DORMANT flag
if supplicant goes down, bring back IFLA_LINKMODE to 0 and
IFLA_OPERSTATE to a sane value.
diff --git a/Documentation/networking/packet_mmap.rst b/Documentation/networking/packet_mmap.rst
new file mode 100644
index 0000000..6c009ce
--- /dev/null
+++ b/Documentation/networking/packet_mmap.rst
@@ -0,0 +1,1084 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========
+Packet MMAP
+===========
+
+Abstract
+========
+
+This file documents the mmap() facility available with the PACKET
+socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for
+
+i) capture network traffic with utilities like tcpdump,
+ii) transmit network traffic, or any other that needs raw
+ access to network interface.
+
+Howto can be found at:
+
+ https://sites.google.com/site/packetmmap/
+
+Please send your comments to
+ - Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
+ - Johann Baudy
+
+Why use PACKET_MMAP
+===================
+
+In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very
+inefficient. It uses very limited buffers and requires one system call to
+capture each packet, it requires two if you want to get packet's timestamp
+(like libpcap always does).
+
+In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
+configurable circular buffer mapped in user space that can be used to either
+send or receive packets. This way reading packets just needs to wait for them,
+most of the time there is no need to issue a single system call. Concerning
+transmission, multiple packets can be sent through one system call to get the
+highest bandwidth. By using a shared buffer between the kernel and the user
+also has the benefit of minimizing packet copies.
+
+It's fine to use PACKET_MMAP to improve the performance of the capture and
+transmission process, but it isn't everything. At least, if you are capturing
+at high speeds (this is relative to the cpu speed), you should check if the
+device driver of your network interface card supports some sort of interrupt
+load mitigation or (even better) if it supports NAPI, also make sure it is
+enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
+supported by devices of your network. CPU IRQ pinning of your network interface
+card can also be an advantage.
+
+How to use mmap() to improve capture process
+============================================
+
+From the user standpoint, you should use the higher level libpcap library, which
+is a de facto standard, portable across nearly all operating systems
+including Win32.
+
+Packet MMAP support was integrated into libpcap around the time of version 1.3.0;
+TPACKET_V3 support was added in version 1.5.0
+
+How to use mmap() directly to improve capture process
+=====================================================
+
+From the system calls stand point, the use of PACKET_MMAP involves
+the following process::
+
+
+ [setup] socket() -------> creation of the capture socket
+ setsockopt() ---> allocation of the circular buffer (ring)
+ option: PACKET_RX_RING
+ mmap() ---------> mapping of the allocated buffer to the
+ user process
+
+ [capture] poll() ---------> to wait for incoming packets
+
+ [shutdown] close() --------> destruction of the capture socket and
+ deallocation of all associated
+ resources.
+
+
+socket creation and destruction is straight forward, and is done
+the same way with or without PACKET_MMAP::
+
+ int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));
+
+where mode is SOCK_RAW for the raw interface were link level
+information can be captured or SOCK_DGRAM for the cooked
+interface where link level information capture is not
+supported and a link level pseudo-header is provided
+by the kernel.
+
+The destruction of the socket and all associated resources
+is done by a simple call to close(fd).
+
+Similarly as without PACKET_MMAP, it is possible to use one socket
+for capture and transmission. This can be done by mapping the
+allocated RX and TX buffer ring with a single mmap() call.
+See "Mapping and use of the circular buffer (ring)".
+
+Next I will describe PACKET_MMAP settings and its constraints,
+also the mapping of the circular buffer in the user process and
+the use of this buffer.
+
+How to use mmap() directly to improve transmission process
+==========================================================
+Transmission process is similar to capture as shown below::
+
+ [setup] socket() -------> creation of the transmission socket
+ setsockopt() ---> allocation of the circular buffer (ring)
+ option: PACKET_TX_RING
+ bind() ---------> bind transmission socket with a network interface
+ mmap() ---------> mapping of the allocated buffer to the
+ user process
+
+ [transmission] poll() ---------> wait for free packets (optional)
+ send() ---------> send all packets that are set as ready in
+ the ring
+ The flag MSG_DONTWAIT can be used to return
+ before end of transfer.
+
+ [shutdown] close() --------> destruction of the transmission socket and
+ deallocation of all associated resources.
+
+Socket creation and destruction is also straight forward, and is done
+the same way as in capturing described in the previous paragraph::
+
+ int fd = socket(PF_PACKET, mode, 0);
+
+The protocol can optionally be 0 in case we only want to transmit
+via this socket, which avoids an expensive call to packet_rcv().
+In this case, you also need to bind(2) the TX_RING with sll_protocol = 0
+set. Otherwise, htons(ETH_P_ALL) or any other protocol, for example.
+
+Binding the socket to your network interface is mandatory (with zero copy) to
+know the header size of frames used in the circular buffer.
+
+As capture, each frame contains two parts::
+
+ --------------------
+ | struct tpacket_hdr | Header. It contains the status of
+ | | of this frame
+ |--------------------|
+ | data buffer |
+ . . Data that will be sent over the network interface.
+ . .
+ --------------------
+
+ bind() associates the socket to your network interface thanks to
+ sll_ifindex parameter of struct sockaddr_ll.
+
+ Initialization example::
+
+ struct sockaddr_ll my_addr;
+ struct ifreq s_ifr;
+ ...
+
+ strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
+
+ /* get interface index of eth0 */
+ ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
+
+ /* fill sockaddr_ll struct to prepare binding */
+ my_addr.sll_family = AF_PACKET;
+ my_addr.sll_protocol = htons(ETH_P_ALL);
+ my_addr.sll_ifindex = s_ifr.ifr_ifindex;
+
+ /* bind socket to eth0 */
+ bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
+
+ A complete tutorial is available at: https://sites.google.com/site/packetmmap/
+
+By default, the user should put data at::
+
+ frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)
+
+So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
+the beginning of the user data will be at::
+
+ frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
+
+If you wish to put user data at a custom offset from the beginning of
+the frame (for payload alignment with SOCK_RAW mode for instance) you
+can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order
+to make this work it must be enabled previously with setsockopt()
+and the PACKET_TX_HAS_OFF option.
+
+PACKET_MMAP settings
+====================
+
+To setup PACKET_MMAP from user level code is done with a call like
+
+ - Capture process::
+
+ setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
+
+ - Transmission process::
+
+ setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))
+
+The most significant argument in the previous call is the req parameter,
+this parameter must to have the following structure::
+
+ struct tpacket_req
+ {
+ unsigned int tp_block_size; /* Minimal size of contiguous block */
+ unsigned int tp_block_nr; /* Number of blocks */
+ unsigned int tp_frame_size; /* Size of frame */
+ unsigned int tp_frame_nr; /* Total number of frames */
+ };
+
+This structure is defined in /usr/include/linux/if_packet.h and establishes a
+circular buffer (ring) of unswappable memory.
+Being mapped in the capture process allows reading the captured frames and
+related meta-information like timestamps without requiring a system call.
+
+Frames are grouped in blocks. Each block is a physically contiguous
+region of memory and holds tp_block_size/tp_frame_size frames. The total number
+of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because::
+
+ frames_per_block = tp_block_size/tp_frame_size
+
+indeed, packet_set_ring checks that the following condition is true::
+
+ frames_per_block * tp_block_nr == tp_frame_nr
+
+Lets see an example, with the following values::
+
+ tp_block_size= 4096
+ tp_frame_size= 2048
+ tp_block_nr = 4
+ tp_frame_nr = 8
+
+we will get the following buffer structure::
+
+ block #1 block #2
+ +---------+---------+ +---------+---------+
+ | frame 1 | frame 2 | | frame 3 | frame 4 |
+ +---------+---------+ +---------+---------+
+
+ block #3 block #4
+ +---------+---------+ +---------+---------+
+ | frame 5 | frame 6 | | frame 7 | frame 8 |
+ +---------+---------+ +---------+---------+
+
+A frame can be of any size with the only condition it can fit in a block. A block
+can only hold an integer number of frames, or in other words, a frame cannot
+be spawned across two blocks, so there are some details you have to take into
+account when choosing the frame_size. See "Mapping and use of the circular
+buffer (ring)".
+
+PACKET_MMAP setting constraints
+===============================
+
+In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
+the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
+16384 in a 64 bit architecture. For information on these kernel versions
+see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt
+
+Block size limit
+----------------
+
+As stated earlier, each block is a contiguous physical region of memory. These
+memory regions are allocated with calls to the __get_free_pages() function. As
+the name indicates, this function allocates pages of memory, and the second
+argument is "order" or a power of two number of pages, that is
+(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
+order=2 ==> 16384 bytes, etc. The maximum size of a
+region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
+precisely the limit can be calculated as::
+
+ PAGE_SIZE << MAX_ORDER
+
+ In a i386 architecture PAGE_SIZE is 4096 bytes
+ In a 2.4/i386 kernel MAX_ORDER is 10
+ In a 2.6/i386 kernel MAX_ORDER is 11
+
+So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
+respectively, with an i386 architecture.
+
+User space programs can include /usr/include/sys/user.h and
+/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_ORDER declarations.
+
+The pagesize can also be determined dynamically with the getpagesize (2)
+system call.
+
+Block number limit
+------------------
+
+To understand the constraints of PACKET_MMAP, we have to see the structure
+used to hold the pointers to each block.
+
+Currently, this structure is a dynamically allocated vector with kmalloc
+called pg_vec, its size limits the number of blocks that can be allocated::
+
+ +---+---+---+---+
+ | x | x | x | x |
+ +---+---+---+---+
+ | | | |
+ | | | v
+ | | v block #4
+ | v block #3
+ v block #2
+ block #1
+
+kmalloc allocates any number of bytes of physically contiguous memory from
+a pool of pre-determined sizes. This pool of memory is maintained by the slab
+allocator which is at the end the responsible for doing the allocation and
+hence which imposes the maximum memory that kmalloc can allocate.
+
+In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The
+predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
+entries of /proc/slabinfo
+
+In a 32 bit architecture, pointers are 4 bytes long, so the total number of
+pointers to blocks is::
+
+ 131072/4 = 32768 blocks
+
+PACKET_MMAP buffer size calculator
+==================================
+
+Definitions:
+
+============== ================================================================
+<size-max> is the maximum size of allocable with kmalloc
+ (see /proc/slabinfo)
+<pointer size> depends on the architecture -- ``sizeof(void *)``
+<page size> depends on the architecture -- PAGE_SIZE or getpagesize (2)
+<max-order> is the value defined with MAX_ORDER
+<frame size> it's an upper bound of frame's capture size (more on this later)
+============== ================================================================
+
+from these definitions we will derive::
+
+ <block number> = <size-max>/<pointer size>
+ <block size> = <pagesize> << <max-order>
+
+so, the max buffer size is::
+
+ <block number> * <block size>
+
+and, the number of frames be::
+
+ <block number> * <block size> / <frame size>
+
+Suppose the following parameters, which apply for 2.6 kernel and an
+i386 architecture::
+
+ <size-max> = 131072 bytes
+ <pointer size> = 4 bytes
+ <pagesize> = 4096 bytes
+ <max-order> = 11
+
+and a value for <frame size> of 2048 bytes. These parameters will yield::
+
+ <block number> = 131072/4 = 32768 blocks
+ <block size> = 4096 << 11 = 8 MiB.
+
+and hence the buffer will have a 262144 MiB size. So it can hold
+262144 MiB / 2048 bytes = 134217728 frames
+
+Actually, this buffer size is not possible with an i386 architecture.
+Remember that the memory is allocated in kernel space, in the case of
+an i386 kernel's memory size is limited to 1GiB.
+
+All memory allocations are not freed until the socket is closed. The memory
+allocations are done with GFP_KERNEL priority, this basically means that
+the allocation can wait and swap other process' memory in order to allocate
+the necessary memory, so normally limits can be reached.
+
+Other constraints
+-----------------
+
+If you check the source code you will see that what I draw here as a frame
+is not only the link level frame. At the beginning of each frame there is a
+header called struct tpacket_hdr used in PACKET_MMAP to hold link level's frame
+meta information like timestamp. So what we draw here a frame it's really
+the following (from include/linux/if_packet.h)::
+
+ /*
+ Frame structure:
+
+ - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
+ - struct tpacket_hdr
+ - pad to TPACKET_ALIGNMENT=16
+ - struct sockaddr_ll
+ - Gap, chosen so that packet data (Start+tp_net) aligns to
+ TPACKET_ALIGNMENT=16
+ - Start+tp_mac: [ Optional MAC header ]
+ - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
+ - Pad to align to TPACKET_ALIGNMENT=16
+ */
+
+The following are conditions that are checked in packet_set_ring
+
+ - tp_block_size must be a multiple of PAGE_SIZE (1)
+ - tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
+ - tp_frame_size must be a multiple of TPACKET_ALIGNMENT
+ - tp_frame_nr must be exactly frames_per_block*tp_block_nr
+
+Note that tp_block_size should be chosen to be a power of two or there will
+be a waste of memory.
+
+Mapping and use of the circular buffer (ring)
+---------------------------------------------
+
+The mapping of the buffer in the user process is done with the conventional
+mmap function. Even the circular buffer is compound of several physically
+discontiguous blocks of memory, they are contiguous to the user space, hence
+just one call to mmap is needed::
+
+ mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+
+If tp_frame_size is a divisor of tp_block_size frames will be
+contiguously spaced by tp_frame_size bytes. If not, each
+tp_block_size/tp_frame_size frames there will be a gap between
+the frames. This is because a frame cannot be spawn across two
+blocks.
+
+To use one socket for capture and transmission, the mapping of both the
+RX and TX buffer ring has to be done with one call to mmap::
+
+ ...
+ setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo));
+ setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar));
+ ...
+ rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+ tx_ring = rx_ring + size;
+
+RX must be the first as the kernel maps the TX ring memory right
+after the RX one.
+
+At the beginning of each frame there is an status field (see
+struct tpacket_hdr). If this field is 0 means that the frame is ready
+to be used for the kernel, If not, there is a frame the user can read
+and the following flags apply:
+
+Capture process
+^^^^^^^^^^^^^^^
+
+ from include/linux/if_packet.h
+
+ #define TP_STATUS_COPY (1 << 1)
+ #define TP_STATUS_LOSING (1 << 2)
+ #define TP_STATUS_CSUMNOTREADY (1 << 3)
+ #define TP_STATUS_CSUM_VALID (1 << 7)
+
+====================== =======================================================
+TP_STATUS_COPY This flag indicates that the frame (and associated
+ meta information) has been truncated because it's
+ larger than tp_frame_size. This packet can be
+ read entirely with recvfrom().
+
+ In order to make this work it must to be
+ enabled previously with setsockopt() and
+ the PACKET_COPY_THRESH option.
+
+ The number of frames that can be buffered to
+ be read with recvfrom is limited like a normal socket.
+ See the SO_RCVBUF option in the socket (7) man page.
+
+TP_STATUS_LOSING indicates there were packet drops from last time
+ statistics where checked with getsockopt() and
+ the PACKET_STATISTICS option.
+
+TP_STATUS_CSUMNOTREADY currently it's used for outgoing IP packets which
+ its checksum will be done in hardware. So while
+ reading the packet we should not try to check the
+ checksum.
+
+TP_STATUS_CSUM_VALID This flag indicates that at least the transport
+ header checksum of the packet has been already
+ validated on the kernel side. If the flag is not set
+ then we are free to check the checksum by ourselves
+ provided that TP_STATUS_CSUMNOTREADY is also not set.
+====================== =======================================================
+
+for convenience there are also the following defines::
+
+ #define TP_STATUS_KERNEL 0
+ #define TP_STATUS_USER 1
+
+The kernel initializes all frames to TP_STATUS_KERNEL, when the kernel
+receives a packet it puts in the buffer and updates the status with
+at least the TP_STATUS_USER flag. Then the user can read the packet,
+once the packet is read the user must zero the status field, so the kernel
+can use again that frame buffer.
+
+The user can use poll (any other variant should apply too) to check if new
+packets are in the ring::
+
+ struct pollfd pfd;
+
+ pfd.fd = fd;
+ pfd.revents = 0;
+ pfd.events = POLLIN|POLLRDNORM|POLLERR;
+
+ if (status == TP_STATUS_KERNEL)
+ retval = poll(&pfd, 1, timeout);
+
+It doesn't incur in a race condition to first check the status value and
+then poll for frames.
+
+Transmission process
+^^^^^^^^^^^^^^^^^^^^
+
+Those defines are also used for transmission::
+
+ #define TP_STATUS_AVAILABLE 0 // Frame is available
+ #define TP_STATUS_SEND_REQUEST 1 // Frame will be sent on next send()
+ #define TP_STATUS_SENDING 2 // Frame is currently in transmission
+ #define TP_STATUS_WRONG_FORMAT 4 // Frame format is not correct
+
+First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
+packet, the user fills a data buffer of an available frame, sets tp_len to
+current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST.
+This can be done on multiple frames. Once the user is ready to transmit, it
+calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are
+forwarded to the network device. The kernel updates each status of sent
+frames with TP_STATUS_SENDING until the end of transfer.
+
+At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.
+
+::
+
+ header->tp_len = in_i_size;
+ header->tp_status = TP_STATUS_SEND_REQUEST;
+ retval = send(this->socket, NULL, 0, 0);
+
+The user can also use poll() to check if a buffer is available:
+
+(status == TP_STATUS_SENDING)
+
+::
+
+ struct pollfd pfd;
+ pfd.fd = fd;
+ pfd.revents = 0;
+ pfd.events = POLLOUT;
+ retval = poll(&pfd, 1, timeout);
+
+What TPACKET versions are available and when to use them?
+=========================================================
+
+::
+
+ int val = tpacket_version;
+ setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
+ getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
+
+where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
+
+TPACKET_V1:
+ - Default if not otherwise specified by setsockopt(2)
+ - RX_RING, TX_RING available
+
+TPACKET_V1 --> TPACKET_V2:
+ - Made 64 bit clean due to unsigned long usage in TPACKET_V1
+ structures, thus this also works on 64 bit kernel with 32 bit
+ userspace and the like
+ - Timestamp resolution in nanoseconds instead of microseconds
+ - RX_RING, TX_RING available
+ - VLAN metadata information available for packets
+ (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID),
+ in the tpacket2_hdr structure:
+
+ - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates
+ that the tp_vlan_tci field has valid VLAN TCI value
+ - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field
+ indicates that the tp_vlan_tpid field has valid VLAN TPID value
+
+ - How to switch to TPACKET_V2:
+
+ 1. Replace struct tpacket_hdr by struct tpacket2_hdr
+ 2. Query header len and save
+ 3. Set protocol version to 2, set up ring as usual
+ 4. For getting the sockaddr_ll,
+ use ``(void *)hdr + TPACKET_ALIGN(hdrlen)`` instead of
+ ``(void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))``
+
+TPACKET_V2 --> TPACKET_V3:
+ - Flexible buffer implementation for RX_RING:
+ 1. Blocks can be configured with non-static frame-size
+ 2. Read/poll is at a block-level (as opposed to packet-level)
+ 3. Added poll timeout to avoid indefinite user-space wait
+ on idle links
+ 4. Added user-configurable knobs:
+
+ 4.1 block::timeout
+ 4.2 tpkt_hdr::sk_rxhash
+
+ - RX Hash data available in user space
+ - TX_RING semantics are conceptually similar to TPACKET_V2;
+ use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN
+ instead of TPACKET2_HDRLEN. In the current implementation,
+ the tp_next_offset field in the tpacket3_hdr MUST be set to
+ zero, indicating that the ring does not hold variable sized frames.
+ Packets with non-zero values of tp_next_offset will be dropped.
+
+AF_PACKET fanout mode
+=====================
+
+In the AF_PACKET fanout mode, packet reception can be load balanced among
+processes. This also works in combination with mmap(2) on packet sockets.
+
+Currently implemented fanout policies are:
+
+ - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash
+ - PACKET_FANOUT_LB: schedule to socket by round-robin
+ - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on
+ - PACKET_FANOUT_RND: schedule to socket by random selection
+ - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another
+ - PACKET_FANOUT_QM: schedule to socket by skbs recorded queue_mapping
+
+Minimal example code by David S. Miller (try things like "./test eth0 hash",
+"./test eth0 lb", etc.)::
+
+ #include <stddef.h>
+ #include <stdlib.h>
+ #include <stdio.h>
+ #include <string.h>
+
+ #include <sys/types.h>
+ #include <sys/wait.h>
+ #include <sys/socket.h>
+ #include <sys/ioctl.h>
+
+ #include <unistd.h>
+
+ #include <linux/if_ether.h>
+ #include <linux/if_packet.h>
+
+ #include <net/if.h>
+
+ static const char *device_name;
+ static int fanout_type;
+ static int fanout_id;
+
+ #ifndef PACKET_FANOUT
+ # define PACKET_FANOUT 18
+ # define PACKET_FANOUT_HASH 0
+ # define PACKET_FANOUT_LB 1
+ #endif
+
+ static int setup_socket(void)
+ {
+ int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
+ struct sockaddr_ll ll;
+ struct ifreq ifr;
+ int fanout_arg;
+
+ if (fd < 0) {
+ perror("socket");
+ return EXIT_FAILURE;
+ }
+
+ memset(&ifr, 0, sizeof(ifr));
+ strcpy(ifr.ifr_name, device_name);
+ err = ioctl(fd, SIOCGIFINDEX, &ifr);
+ if (err < 0) {
+ perror("SIOCGIFINDEX");
+ return EXIT_FAILURE;
+ }
+
+ memset(&ll, 0, sizeof(ll));
+ ll.sll_family = AF_PACKET;
+ ll.sll_ifindex = ifr.ifr_ifindex;
+ err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
+ if (err < 0) {
+ perror("bind");
+ return EXIT_FAILURE;
+ }
+
+ fanout_arg = (fanout_id | (fanout_type << 16));
+ err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
+ &fanout_arg, sizeof(fanout_arg));
+ if (err) {
+ perror("setsockopt");
+ return EXIT_FAILURE;
+ }
+
+ return fd;
+ }
+
+ static void fanout_thread(void)
+ {
+ int fd = setup_socket();
+ int limit = 10000;
+
+ if (fd < 0)
+ exit(fd);
+
+ while (limit-- > 0) {
+ char buf[1600];
+ int err;
+
+ err = read(fd, buf, sizeof(buf));
+ if (err < 0) {
+ perror("read");
+ exit(EXIT_FAILURE);
+ }
+ if ((limit % 10) == 0)
+ fprintf(stdout, "(%d) \n", getpid());
+ }
+
+ fprintf(stdout, "%d: Received 10000 packets\n", getpid());
+
+ close(fd);
+ exit(0);
+ }
+
+ int main(int argc, char **argp)
+ {
+ int fd, err;
+ int i;
+
+ if (argc != 3) {
+ fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
+ return EXIT_FAILURE;
+ }
+
+ if (!strcmp(argp[2], "hash"))
+ fanout_type = PACKET_FANOUT_HASH;
+ else if (!strcmp(argp[2], "lb"))
+ fanout_type = PACKET_FANOUT_LB;
+ else {
+ fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
+ exit(EXIT_FAILURE);
+ }
+
+ device_name = argp[1];
+ fanout_id = getpid() & 0xffff;
+
+ for (i = 0; i < 4; i++) {
+ pid_t pid = fork();
+
+ switch (pid) {
+ case 0:
+ fanout_thread();
+
+ case -1:
+ perror("fork");
+ exit(EXIT_FAILURE);
+ }
+ }
+
+ for (i = 0; i < 4; i++) {
+ int status;
+
+ wait(&status);
+ }
+
+ return 0;
+ }
+
+AF_PACKET TPACKET_V3 example
+============================
+
+AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
+sizes by doing it's own memory management. It is based on blocks where polling
+works on a per block basis instead of per ring as in TPACKET_V2 and predecessor.
+
+It is said that TPACKET_V3 brings the following benefits:
+
+ * ~15% - 20% reduction in CPU-usage
+ * ~20% increase in packet capture rate
+ * ~2x increase in packet density
+ * Port aggregation analysis
+ * Non static frame size to capture entire packet payload
+
+So it seems to be a good candidate to be used with packet fanout.
+
+Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
+it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.)::
+
+ /* Written from scratch, but kernel-to-user space API usage
+ * dissected from lolpcap:
+ * Copyright 2011, Chetan Loke <loke.chetan@gmail.com>
+ * License: GPL, version 2.0
+ */
+
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <stdint.h>
+ #include <string.h>
+ #include <assert.h>
+ #include <net/if.h>
+ #include <arpa/inet.h>
+ #include <netdb.h>
+ #include <poll.h>
+ #include <unistd.h>
+ #include <signal.h>
+ #include <inttypes.h>
+ #include <sys/socket.h>
+ #include <sys/mman.h>
+ #include <linux/if_packet.h>
+ #include <linux/if_ether.h>
+ #include <linux/ip.h>
+
+ #ifndef likely
+ # define likely(x) __builtin_expect(!!(x), 1)
+ #endif
+ #ifndef unlikely
+ # define unlikely(x) __builtin_expect(!!(x), 0)
+ #endif
+
+ struct block_desc {
+ uint32_t version;
+ uint32_t offset_to_priv;
+ struct tpacket_hdr_v1 h1;
+ };
+
+ struct ring {
+ struct iovec *rd;
+ uint8_t *map;
+ struct tpacket_req3 req;
+ };
+
+ static unsigned long packets_total = 0, bytes_total = 0;
+ static sig_atomic_t sigint = 0;
+
+ static void sighandler(int num)
+ {
+ sigint = 1;
+ }
+
+ static int setup_socket(struct ring *ring, char *netdev)
+ {
+ int err, i, fd, v = TPACKET_V3;
+ struct sockaddr_ll ll;
+ unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
+ unsigned int blocknum = 64;
+
+ fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+ if (fd < 0) {
+ perror("socket");
+ exit(1);
+ }
+
+ err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
+ if (err < 0) {
+ perror("setsockopt");
+ exit(1);
+ }
+
+ memset(&ring->req, 0, sizeof(ring->req));
+ ring->req.tp_block_size = blocksiz;
+ ring->req.tp_frame_size = framesiz;
+ ring->req.tp_block_nr = blocknum;
+ ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
+ ring->req.tp_retire_blk_tov = 60;
+ ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;
+
+ err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
+ sizeof(ring->req));
+ if (err < 0) {
+ perror("setsockopt");
+ exit(1);
+ }
+
+ ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
+ PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
+ if (ring->map == MAP_FAILED) {
+ perror("mmap");
+ exit(1);
+ }
+
+ ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
+ assert(ring->rd);
+ for (i = 0; i < ring->req.tp_block_nr; ++i) {
+ ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
+ ring->rd[i].iov_len = ring->req.tp_block_size;
+ }
+
+ memset(&ll, 0, sizeof(ll));
+ ll.sll_family = PF_PACKET;
+ ll.sll_protocol = htons(ETH_P_ALL);
+ ll.sll_ifindex = if_nametoindex(netdev);
+ ll.sll_hatype = 0;
+ ll.sll_pkttype = 0;
+ ll.sll_halen = 0;
+
+ err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
+ if (err < 0) {
+ perror("bind");
+ exit(1);
+ }
+
+ return fd;
+ }
+
+ static void display(struct tpacket3_hdr *ppd)
+ {
+ struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
+ struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN);
+
+ if (eth->h_proto == htons(ETH_P_IP)) {
+ struct sockaddr_in ss, sd;
+ char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST];
+
+ memset(&ss, 0, sizeof(ss));
+ ss.sin_family = PF_INET;
+ ss.sin_addr.s_addr = ip->saddr;
+ getnameinfo((struct sockaddr *) &ss, sizeof(ss),
+ sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST);
+
+ memset(&sd, 0, sizeof(sd));
+ sd.sin_family = PF_INET;
+ sd.sin_addr.s_addr = ip->daddr;
+ getnameinfo((struct sockaddr *) &sd, sizeof(sd),
+ dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST);
+
+ printf("%s -> %s, ", sbuff, dbuff);
+ }
+
+ printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash);
+ }
+
+ static void walk_block(struct block_desc *pbd, const int block_num)
+ {
+ int num_pkts = pbd->h1.num_pkts, i;
+ unsigned long bytes = 0;
+ struct tpacket3_hdr *ppd;
+
+ ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
+ pbd->h1.offset_to_first_pkt);
+ for (i = 0; i < num_pkts; ++i) {
+ bytes += ppd->tp_snaplen;
+ display(ppd);
+
+ ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
+ ppd->tp_next_offset);
+ }
+
+ packets_total += num_pkts;
+ bytes_total += bytes;
+ }
+
+ static void flush_block(struct block_desc *pbd)
+ {
+ pbd->h1.block_status = TP_STATUS_KERNEL;
+ }
+
+ static void teardown_socket(struct ring *ring, int fd)
+ {
+ munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr);
+ free(ring->rd);
+ close(fd);
+ }
+
+ int main(int argc, char **argp)
+ {
+ int fd, err;
+ socklen_t len;
+ struct ring ring;
+ struct pollfd pfd;
+ unsigned int block_num = 0, blocks = 64;
+ struct block_desc *pbd;
+ struct tpacket_stats_v3 stats;
+
+ if (argc != 2) {
+ fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]);
+ return EXIT_FAILURE;
+ }
+
+ signal(SIGINT, sighandler);
+
+ memset(&ring, 0, sizeof(ring));
+ fd = setup_socket(&ring, argp[argc - 1]);
+ assert(fd > 0);
+
+ memset(&pfd, 0, sizeof(pfd));
+ pfd.fd = fd;
+ pfd.events = POLLIN | POLLERR;
+ pfd.revents = 0;
+
+ while (likely(!sigint)) {
+ pbd = (struct block_desc *) ring.rd[block_num].iov_base;
+
+ if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
+ poll(&pfd, 1, -1);
+ continue;
+ }
+
+ walk_block(pbd, block_num);
+ flush_block(pbd);
+ block_num = (block_num + 1) % blocks;
+ }
+
+ len = sizeof(stats);
+ err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len);
+ if (err < 0) {
+ perror("getsockopt");
+ exit(1);
+ }
+
+ fflush(stdout);
+ printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n",
+ stats.tp_packets, bytes_total, stats.tp_drops,
+ stats.tp_freeze_q_cnt);
+
+ teardown_socket(&ring, fd);
+ return 0;
+ }
+
+PACKET_QDISC_BYPASS
+===================
+
+If there is a requirement to load the network with many packets in a similar
+fashion as pktgen does, you might set the following option after socket
+creation::
+
+ int one = 1;
+ setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));
+
+This has the side-effect, that packets sent through PF_PACKET will bypass the
+kernel's qdisc layer and are forcedly pushed to the driver directly. Meaning,
+packet are not buffered, tc disciplines are ignored, increased loss can occur
+and such packets are also not visible to other PF_PACKET sockets anymore. So,
+you have been warned; generally, this can be useful for stress testing various
+components of a system.
+
+On default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled
+on PF_PACKET sockets.
+
+PACKET_TIMESTAMP
+================
+
+The PACKET_TIMESTAMP setting determines the source of the timestamp in
+the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your
+NIC is capable of timestamping packets in hardware, you can request those
+hardware timestamps to be used. Note: you may need to enable the generation
+of hardware timestamps with SIOCSHWTSTAMP (see related information from
+Documentation/networking/timestamping.rst).
+
+PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING::
+
+ int req = SOF_TIMESTAMPING_RAW_HARDWARE;
+ setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req))
+
+For the mmap(2)ed ring buffers, such timestamps are stored in the
+``tpacket{,2,3}_hdr`` structure's tp_sec and ``tp_{n,u}sec`` members.
+To determine what kind of timestamp has been reported, the tp_status field
+is binary or'ed with the following possible bits ...
+
+::
+
+ TP_STATUS_TS_RAW_HARDWARE
+ TP_STATUS_TS_SOFTWARE
+
+... that are equivalent to its ``SOF_TIMESTAMPING_*`` counterparts. For the
+RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a
+software fallback was invoked *within* PF_PACKET's processing code (less
+precise).
+
+Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
+ii) call sendto() e.g. in blocking mode, iii) wait for status of relevant
+frames to be updated resp. the frame handed over to the application, iv) walk
+through the frames to pick up the individual hw/sw timestamps.
+
+Only (!) if transmit timestamping is enabled, then these bits are combined
+with binary | with TP_STATUS_AVAILABLE, so you must check for that in your
+application (e.g. !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))
+in a first step to see if the frame belongs to the application, and then
+one can extract the type of timestamp in a second step from tp_status)!
+
+If you don't care about them, thus having it disabled, checking for
+TP_STATUS_AVAILABLE resp. TP_STATUS_WRONG_FORMAT is sufficient. If in the
+TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec
+members do not contain a valid value. For TX_RINGs, by default no timestamp
+is generated!
+
+See include/linux/net_tstamp.h and Documentation/networking/timestamping.rst
+for more information on hardware timestamps.
+
+Miscellaneous bits
+==================
+
+- Packet sockets work well together with Linux socket filters, thus you also
+ might want to have a look at Documentation/networking/filter.rst
+
+THANKS
+======
+
+ Jesse Brandeburg, for fixing my grammathical/spelling errors
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
deleted file mode 100644
index 4946145..0000000
--- a/Documentation/networking/packet_mmap.txt
+++ /dev/null
@@ -1,1061 +0,0 @@
---------------------------------------------------------------------------------
-+ ABSTRACT
---------------------------------------------------------------------------------
-
-This file documents the mmap() facility available with the PACKET
-socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for
-i) capture network traffic with utilities like tcpdump, ii) transmit network
-traffic, or any other that needs raw access to network interface.
-
-Howto can be found at:
- https://sites.google.com/site/packetmmap/
-
-Please send your comments to
- Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
- Johann Baudy
-
--------------------------------------------------------------------------------
-+ Why use PACKET_MMAP
---------------------------------------------------------------------------------
-
-In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very
-inefficient. It uses very limited buffers and requires one system call to
-capture each packet, it requires two if you want to get packet's timestamp
-(like libpcap always does).
-
-In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
-configurable circular buffer mapped in user space that can be used to either
-send or receive packets. This way reading packets just needs to wait for them,
-most of the time there is no need to issue a single system call. Concerning
-transmission, multiple packets can be sent through one system call to get the
-highest bandwidth. By using a shared buffer between the kernel and the user
-also has the benefit of minimizing packet copies.
-
-It's fine to use PACKET_MMAP to improve the performance of the capture and
-transmission process, but it isn't everything. At least, if you are capturing
-at high speeds (this is relative to the cpu speed), you should check if the
-device driver of your network interface card supports some sort of interrupt
-load mitigation or (even better) if it supports NAPI, also make sure it is
-enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
-supported by devices of your network. CPU IRQ pinning of your network interface
-card can also be an advantage.
-
---------------------------------------------------------------------------------
-+ How to use mmap() to improve capture process
---------------------------------------------------------------------------------
-
-From the user standpoint, you should use the higher level libpcap library, which
-is a de facto standard, portable across nearly all operating systems
-including Win32.
-
-Packet MMAP support was integrated into libpcap around the time of version 1.3.0;
-TPACKET_V3 support was added in version 1.5.0
-
---------------------------------------------------------------------------------
-+ How to use mmap() directly to improve capture process
---------------------------------------------------------------------------------
-
-From the system calls stand point, the use of PACKET_MMAP involves
-the following process:
-
-
-[setup] socket() -------> creation of the capture socket
- setsockopt() ---> allocation of the circular buffer (ring)
- option: PACKET_RX_RING
- mmap() ---------> mapping of the allocated buffer to the
- user process
-
-[capture] poll() ---------> to wait for incoming packets
-
-[shutdown] close() --------> destruction of the capture socket and
- deallocation of all associated
- resources.
-
-
-socket creation and destruction is straight forward, and is done
-the same way with or without PACKET_MMAP:
-
- int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));
-
-where mode is SOCK_RAW for the raw interface were link level
-information can be captured or SOCK_DGRAM for the cooked
-interface where link level information capture is not
-supported and a link level pseudo-header is provided
-by the kernel.
-
-The destruction of the socket and all associated resources
-is done by a simple call to close(fd).
-
-Similarly as without PACKET_MMAP, it is possible to use one socket
-for capture and transmission. This can be done by mapping the
-allocated RX and TX buffer ring with a single mmap() call.
-See "Mapping and use of the circular buffer (ring)".
-
-Next I will describe PACKET_MMAP settings and its constraints,
-also the mapping of the circular buffer in the user process and
-the use of this buffer.
-
---------------------------------------------------------------------------------
-+ How to use mmap() directly to improve transmission process
---------------------------------------------------------------------------------
-Transmission process is similar to capture as shown below.
-
-[setup] socket() -------> creation of the transmission socket
- setsockopt() ---> allocation of the circular buffer (ring)
- option: PACKET_TX_RING
- bind() ---------> bind transmission socket with a network interface
- mmap() ---------> mapping of the allocated buffer to the
- user process
-
-[transmission] poll() ---------> wait for free packets (optional)
- send() ---------> send all packets that are set as ready in
- the ring
- The flag MSG_DONTWAIT can be used to return
- before end of transfer.
-
-[shutdown] close() --------> destruction of the transmission socket and
- deallocation of all associated resources.
-
-Socket creation and destruction is also straight forward, and is done
-the same way as in capturing described in the previous paragraph:
-
- int fd = socket(PF_PACKET, mode, 0);
-
-The protocol can optionally be 0 in case we only want to transmit
-via this socket, which avoids an expensive call to packet_rcv().
-In this case, you also need to bind(2) the TX_RING with sll_protocol = 0
-set. Otherwise, htons(ETH_P_ALL) or any other protocol, for example.
-
-Binding the socket to your network interface is mandatory (with zero copy) to
-know the header size of frames used in the circular buffer.
-
-As capture, each frame contains two parts:
-
- --------------------
-| struct tpacket_hdr | Header. It contains the status of
-| | of this frame
-|--------------------|
-| data buffer |
-. . Data that will be sent over the network interface.
-. .
- --------------------
-
- bind() associates the socket to your network interface thanks to
- sll_ifindex parameter of struct sockaddr_ll.
-
- Initialization example:
-
- struct sockaddr_ll my_addr;
- struct ifreq s_ifr;
- ...
-
- strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
-
- /* get interface index of eth0 */
- ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
-
- /* fill sockaddr_ll struct to prepare binding */
- my_addr.sll_family = AF_PACKET;
- my_addr.sll_protocol = htons(ETH_P_ALL);
- my_addr.sll_ifindex = s_ifr.ifr_ifindex;
-
- /* bind socket to eth0 */
- bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
-
- A complete tutorial is available at: https://sites.google.com/site/packetmmap/
-
-By default, the user should put data at :
- frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)
-
-So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
-the beginning of the user data will be at :
- frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
-
-If you wish to put user data at a custom offset from the beginning of
-the frame (for payload alignment with SOCK_RAW mode for instance) you
-can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order
-to make this work it must be enabled previously with setsockopt()
-and the PACKET_TX_HAS_OFF option.
-
---------------------------------------------------------------------------------
-+ PACKET_MMAP settings
---------------------------------------------------------------------------------
-
-To setup PACKET_MMAP from user level code is done with a call like
-
- - Capture process
- setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
- - Transmission process
- setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))
-
-The most significant argument in the previous call is the req parameter,
-this parameter must to have the following structure:
-
- struct tpacket_req
- {
- unsigned int tp_block_size; /* Minimal size of contiguous block */
- unsigned int tp_block_nr; /* Number of blocks */
- unsigned int tp_frame_size; /* Size of frame */
- unsigned int tp_frame_nr; /* Total number of frames */
- };
-
-This structure is defined in /usr/include/linux/if_packet.h and establishes a
-circular buffer (ring) of unswappable memory.
-Being mapped in the capture process allows reading the captured frames and
-related meta-information like timestamps without requiring a system call.
-
-Frames are grouped in blocks. Each block is a physically contiguous
-region of memory and holds tp_block_size/tp_frame_size frames. The total number
-of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because
-
- frames_per_block = tp_block_size/tp_frame_size
-
-indeed, packet_set_ring checks that the following condition is true
-
- frames_per_block * tp_block_nr == tp_frame_nr
-
-Lets see an example, with the following values:
-
- tp_block_size= 4096
- tp_frame_size= 2048
- tp_block_nr = 4
- tp_frame_nr = 8
-
-we will get the following buffer structure:
-
- block #1 block #2
-+---------+---------+ +---------+---------+
-| frame 1 | frame 2 | | frame 3 | frame 4 |
-+---------+---------+ +---------+---------+
-
- block #3 block #4
-+---------+---------+ +---------+---------+
-| frame 5 | frame 6 | | frame 7 | frame 8 |
-+---------+---------+ +---------+---------+
-
-A frame can be of any size with the only condition it can fit in a block. A block
-can only hold an integer number of frames, or in other words, a frame cannot
-be spawned across two blocks, so there are some details you have to take into
-account when choosing the frame_size. See "Mapping and use of the circular
-buffer (ring)".
-
---------------------------------------------------------------------------------
-+ PACKET_MMAP setting constraints
---------------------------------------------------------------------------------
-
-In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
-the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
-16384 in a 64 bit architecture. For information on these kernel versions
-see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt
-
- Block size limit
-------------------
-
-As stated earlier, each block is a contiguous physical region of memory. These
-memory regions are allocated with calls to the __get_free_pages() function. As
-the name indicates, this function allocates pages of memory, and the second
-argument is "order" or a power of two number of pages, that is
-(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
-order=2 ==> 16384 bytes, etc. The maximum size of a
-region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
-precisely the limit can be calculated as:
-
- PAGE_SIZE << MAX_ORDER
-
- In a i386 architecture PAGE_SIZE is 4096 bytes
- In a 2.4/i386 kernel MAX_ORDER is 10
- In a 2.6/i386 kernel MAX_ORDER is 11
-
-So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
-respectively, with an i386 architecture.
-
-User space programs can include /usr/include/sys/user.h and
-/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_ORDER declarations.
-
-The pagesize can also be determined dynamically with the getpagesize (2)
-system call.
-
- Block number limit
---------------------
-
-To understand the constraints of PACKET_MMAP, we have to see the structure
-used to hold the pointers to each block.
-
-Currently, this structure is a dynamically allocated vector with kmalloc
-called pg_vec, its size limits the number of blocks that can be allocated.
-
- +---+---+---+---+
- | x | x | x | x |
- +---+---+---+---+
- | | | |
- | | | v
- | | v block #4
- | v block #3
- v block #2
- block #1
-
-kmalloc allocates any number of bytes of physically contiguous memory from
-a pool of pre-determined sizes. This pool of memory is maintained by the slab
-allocator which is at the end the responsible for doing the allocation and
-hence which imposes the maximum memory that kmalloc can allocate.
-
-In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The
-predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
-entries of /proc/slabinfo
-
-In a 32 bit architecture, pointers are 4 bytes long, so the total number of
-pointers to blocks is
-
- 131072/4 = 32768 blocks
-
- PACKET_MMAP buffer size calculator
-------------------------------------
-
-Definitions:
-
-<size-max> : is the maximum size of allocable with kmalloc (see /proc/slabinfo)
-<pointer size>: depends on the architecture -- sizeof(void *)
-<page size> : depends on the architecture -- PAGE_SIZE or getpagesize (2)
-<max-order> : is the value defined with MAX_ORDER
-<frame size> : it's an upper bound of frame's capture size (more on this later)
-
-from these definitions we will derive
-
- <block number> = <size-max>/<pointer size>
- <block size> = <pagesize> << <max-order>
-
-so, the max buffer size is
-
- <block number> * <block size>
-
-and, the number of frames be
-
- <block number> * <block size> / <frame size>
-
-Suppose the following parameters, which apply for 2.6 kernel and an
-i386 architecture:
-
- <size-max> = 131072 bytes
- <pointer size> = 4 bytes
- <pagesize> = 4096 bytes
- <max-order> = 11
-
-and a value for <frame size> of 2048 bytes. These parameters will yield
-
- <block number> = 131072/4 = 32768 blocks
- <block size> = 4096 << 11 = 8 MiB.
-
-and hence the buffer will have a 262144 MiB size. So it can hold
-262144 MiB / 2048 bytes = 134217728 frames
-
-Actually, this buffer size is not possible with an i386 architecture.
-Remember that the memory is allocated in kernel space, in the case of
-an i386 kernel's memory size is limited to 1GiB.
-
-All memory allocations are not freed until the socket is closed. The memory
-allocations are done with GFP_KERNEL priority, this basically means that
-the allocation can wait and swap other process' memory in order to allocate
-the necessary memory, so normally limits can be reached.
-
- Other constraints
--------------------
-
-If you check the source code you will see that what I draw here as a frame
-is not only the link level frame. At the beginning of each frame there is a
-header called struct tpacket_hdr used in PACKET_MMAP to hold link level's frame
-meta information like timestamp. So what we draw here a frame it's really
-the following (from include/linux/if_packet.h):
-
-/*
- Frame structure:
-
- - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
- - struct tpacket_hdr
- - pad to TPACKET_ALIGNMENT=16
- - struct sockaddr_ll
- - Gap, chosen so that packet data (Start+tp_net) aligns to
- TPACKET_ALIGNMENT=16
- - Start+tp_mac: [ Optional MAC header ]
- - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
- - Pad to align to TPACKET_ALIGNMENT=16
- */
-
- The following are conditions that are checked in packet_set_ring
-
- tp_block_size must be a multiple of PAGE_SIZE (1)
- tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
- tp_frame_size must be a multiple of TPACKET_ALIGNMENT
- tp_frame_nr must be exactly frames_per_block*tp_block_nr
-
-Note that tp_block_size should be chosen to be a power of two or there will
-be a waste of memory.
-
---------------------------------------------------------------------------------
-+ Mapping and use of the circular buffer (ring)
---------------------------------------------------------------------------------
-
-The mapping of the buffer in the user process is done with the conventional
-mmap function. Even the circular buffer is compound of several physically
-discontiguous blocks of memory, they are contiguous to the user space, hence
-just one call to mmap is needed:
-
- mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
-
-If tp_frame_size is a divisor of tp_block_size frames will be
-contiguously spaced by tp_frame_size bytes. If not, each
-tp_block_size/tp_frame_size frames there will be a gap between
-the frames. This is because a frame cannot be spawn across two
-blocks.
-
-To use one socket for capture and transmission, the mapping of both the
-RX and TX buffer ring has to be done with one call to mmap:
-
- ...
- setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo));
- setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar));
- ...
- rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
- tx_ring = rx_ring + size;
-
-RX must be the first as the kernel maps the TX ring memory right
-after the RX one.
-
-At the beginning of each frame there is an status field (see
-struct tpacket_hdr). If this field is 0 means that the frame is ready
-to be used for the kernel, If not, there is a frame the user can read
-and the following flags apply:
-
-+++ Capture process:
- from include/linux/if_packet.h
-
- #define TP_STATUS_COPY (1 << 1)
- #define TP_STATUS_LOSING (1 << 2)
- #define TP_STATUS_CSUMNOTREADY (1 << 3)
- #define TP_STATUS_CSUM_VALID (1 << 7)
-
-TP_STATUS_COPY : This flag indicates that the frame (and associated
- meta information) has been truncated because it's
- larger than tp_frame_size. This packet can be
- read entirely with recvfrom().
-
- In order to make this work it must to be
- enabled previously with setsockopt() and
- the PACKET_COPY_THRESH option.
-
- The number of frames that can be buffered to
- be read with recvfrom is limited like a normal socket.
- See the SO_RCVBUF option in the socket (7) man page.
-
-TP_STATUS_LOSING : indicates there were packet drops from last time
- statistics where checked with getsockopt() and
- the PACKET_STATISTICS option.
-
-TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets which
- its checksum will be done in hardware. So while
- reading the packet we should not try to check the
- checksum.
-
-TP_STATUS_CSUM_VALID : This flag indicates that at least the transport
- header checksum of the packet has been already
- validated on the kernel side. If the flag is not set
- then we are free to check the checksum by ourselves
- provided that TP_STATUS_CSUMNOTREADY is also not set.
-
-for convenience there are also the following defines:
-
- #define TP_STATUS_KERNEL 0
- #define TP_STATUS_USER 1
-
-The kernel initializes all frames to TP_STATUS_KERNEL, when the kernel
-receives a packet it puts in the buffer and updates the status with
-at least the TP_STATUS_USER flag. Then the user can read the packet,
-once the packet is read the user must zero the status field, so the kernel
-can use again that frame buffer.
-
-The user can use poll (any other variant should apply too) to check if new
-packets are in the ring:
-
- struct pollfd pfd;
-
- pfd.fd = fd;
- pfd.revents = 0;
- pfd.events = POLLIN|POLLRDNORM|POLLERR;
-
- if (status == TP_STATUS_KERNEL)
- retval = poll(&pfd, 1, timeout);
-
-It doesn't incur in a race condition to first check the status value and
-then poll for frames.
-
-++ Transmission process
-Those defines are also used for transmission:
-
- #define TP_STATUS_AVAILABLE 0 // Frame is available
- #define TP_STATUS_SEND_REQUEST 1 // Frame will be sent on next send()
- #define TP_STATUS_SENDING 2 // Frame is currently in transmission
- #define TP_STATUS_WRONG_FORMAT 4 // Frame format is not correct
-
-First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
-packet, the user fills a data buffer of an available frame, sets tp_len to
-current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST.
-This can be done on multiple frames. Once the user is ready to transmit, it
-calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are
-forwarded to the network device. The kernel updates each status of sent
-frames with TP_STATUS_SENDING until the end of transfer.
-At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.
-
- header->tp_len = in_i_size;
- header->tp_status = TP_STATUS_SEND_REQUEST;
- retval = send(this->socket, NULL, 0, 0);
-
-The user can also use poll() to check if a buffer is available:
-(status == TP_STATUS_SENDING)
-
- struct pollfd pfd;
- pfd.fd = fd;
- pfd.revents = 0;
- pfd.events = POLLOUT;
- retval = poll(&pfd, 1, timeout);
-
--------------------------------------------------------------------------------
-+ What TPACKET versions are available and when to use them?
--------------------------------------------------------------------------------
-
- int val = tpacket_version;
- setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
- getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
-
-where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
-
-TPACKET_V1:
- - Default if not otherwise specified by setsockopt(2)
- - RX_RING, TX_RING available
-
-TPACKET_V1 --> TPACKET_V2:
- - Made 64 bit clean due to unsigned long usage in TPACKET_V1
- structures, thus this also works on 64 bit kernel with 32 bit
- userspace and the like
- - Timestamp resolution in nanoseconds instead of microseconds
- - RX_RING, TX_RING available
- - VLAN metadata information available for packets
- (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID),
- in the tpacket2_hdr structure:
- - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates
- that the tp_vlan_tci field has valid VLAN TCI value
- - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field
- indicates that the tp_vlan_tpid field has valid VLAN TPID value
- - How to switch to TPACKET_V2:
- 1. Replace struct tpacket_hdr by struct tpacket2_hdr
- 2. Query header len and save
- 3. Set protocol version to 2, set up ring as usual
- 4. For getting the sockaddr_ll,
- use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of
- (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
-
-TPACKET_V2 --> TPACKET_V3:
- - Flexible buffer implementation for RX_RING:
- 1. Blocks can be configured with non-static frame-size
- 2. Read/poll is at a block-level (as opposed to packet-level)
- 3. Added poll timeout to avoid indefinite user-space wait
- on idle links
- 4. Added user-configurable knobs:
- 4.1 block::timeout
- 4.2 tpkt_hdr::sk_rxhash
- - RX Hash data available in user space
- - TX_RING semantics are conceptually similar to TPACKET_V2;
- use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN
- instead of TPACKET2_HDRLEN. In the current implementation,
- the tp_next_offset field in the tpacket3_hdr MUST be set to
- zero, indicating that the ring does not hold variable sized frames.
- Packets with non-zero values of tp_next_offset will be dropped.
-
--------------------------------------------------------------------------------
-+ AF_PACKET fanout mode
--------------------------------------------------------------------------------
-
-In the AF_PACKET fanout mode, packet reception can be load balanced among
-processes. This also works in combination with mmap(2) on packet sockets.
-
-Currently implemented fanout policies are:
-
- - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash
- - PACKET_FANOUT_LB: schedule to socket by round-robin
- - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on
- - PACKET_FANOUT_RND: schedule to socket by random selection
- - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another
- - PACKET_FANOUT_QM: schedule to socket by skbs recorded queue_mapping
-
-Minimal example code by David S. Miller (try things like "./test eth0 hash",
-"./test eth0 lb", etc.):
-
-#include <stddef.h>
-#include <stdlib.h>
-#include <stdio.h>
-#include <string.h>
-
-#include <sys/types.h>
-#include <sys/wait.h>
-#include <sys/socket.h>
-#include <sys/ioctl.h>
-
-#include <unistd.h>
-
-#include <linux/if_ether.h>
-#include <linux/if_packet.h>
-
-#include <net/if.h>
-
-static const char *device_name;
-static int fanout_type;
-static int fanout_id;
-
-#ifndef PACKET_FANOUT
-# define PACKET_FANOUT 18
-# define PACKET_FANOUT_HASH 0
-# define PACKET_FANOUT_LB 1
-#endif
-
-static int setup_socket(void)
-{
- int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
- struct sockaddr_ll ll;
- struct ifreq ifr;
- int fanout_arg;
-
- if (fd < 0) {
- perror("socket");
- return EXIT_FAILURE;
- }
-
- memset(&ifr, 0, sizeof(ifr));
- strcpy(ifr.ifr_name, device_name);
- err = ioctl(fd, SIOCGIFINDEX, &ifr);
- if (err < 0) {
- perror("SIOCGIFINDEX");
- return EXIT_FAILURE;
- }
-
- memset(&ll, 0, sizeof(ll));
- ll.sll_family = AF_PACKET;
- ll.sll_ifindex = ifr.ifr_ifindex;
- err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
- if (err < 0) {
- perror("bind");
- return EXIT_FAILURE;
- }
-
- fanout_arg = (fanout_id | (fanout_type << 16));
- err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
- &fanout_arg, sizeof(fanout_arg));
- if (err) {
- perror("setsockopt");
- return EXIT_FAILURE;
- }
-
- return fd;
-}
-
-static void fanout_thread(void)
-{
- int fd = setup_socket();
- int limit = 10000;
-
- if (fd < 0)
- exit(fd);
-
- while (limit-- > 0) {
- char buf[1600];
- int err;
-
- err = read(fd, buf, sizeof(buf));
- if (err < 0) {
- perror("read");
- exit(EXIT_FAILURE);
- }
- if ((limit % 10) == 0)
- fprintf(stdout, "(%d) \n", getpid());
- }
-
- fprintf(stdout, "%d: Received 10000 packets\n", getpid());
-
- close(fd);
- exit(0);
-}
-
-int main(int argc, char **argp)
-{
- int fd, err;
- int i;
-
- if (argc != 3) {
- fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
- return EXIT_FAILURE;
- }
-
- if (!strcmp(argp[2], "hash"))
- fanout_type = PACKET_FANOUT_HASH;
- else if (!strcmp(argp[2], "lb"))
- fanout_type = PACKET_FANOUT_LB;
- else {
- fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
- exit(EXIT_FAILURE);
- }
-
- device_name = argp[1];
- fanout_id = getpid() & 0xffff;
-
- for (i = 0; i < 4; i++) {
- pid_t pid = fork();
-
- switch (pid) {
- case 0:
- fanout_thread();
-
- case -1:
- perror("fork");
- exit(EXIT_FAILURE);
- }
- }
-
- for (i = 0; i < 4; i++) {
- int status;
-
- wait(&status);
- }
-
- return 0;
-}
-
--------------------------------------------------------------------------------
-+ AF_PACKET TPACKET_V3 example
--------------------------------------------------------------------------------
-
-AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
-sizes by doing it's own memory management. It is based on blocks where polling
-works on a per block basis instead of per ring as in TPACKET_V2 and predecessor.
-
-It is said that TPACKET_V3 brings the following benefits:
- *) ~15 - 20% reduction in CPU-usage
- *) ~20% increase in packet capture rate
- *) ~2x increase in packet density
- *) Port aggregation analysis
- *) Non static frame size to capture entire packet payload
-
-So it seems to be a good candidate to be used with packet fanout.
-
-Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
-it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.):
-
-/* Written from scratch, but kernel-to-user space API usage
- * dissected from lolpcap:
- * Copyright 2011, Chetan Loke <loke.chetan@gmail.com>
- * License: GPL, version 2.0
- */
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <stdint.h>
-#include <string.h>
-#include <assert.h>
-#include <net/if.h>
-#include <arpa/inet.h>
-#include <netdb.h>
-#include <poll.h>
-#include <unistd.h>
-#include <signal.h>
-#include <inttypes.h>
-#include <sys/socket.h>
-#include <sys/mman.h>
-#include <linux/if_packet.h>
-#include <linux/if_ether.h>
-#include <linux/ip.h>
-
-#ifndef likely
-# define likely(x) __builtin_expect(!!(x), 1)
-#endif
-#ifndef unlikely
-# define unlikely(x) __builtin_expect(!!(x), 0)
-#endif
-
-struct block_desc {
- uint32_t version;
- uint32_t offset_to_priv;
- struct tpacket_hdr_v1 h1;
-};
-
-struct ring {
- struct iovec *rd;
- uint8_t *map;
- struct tpacket_req3 req;
-};
-
-static unsigned long packets_total = 0, bytes_total = 0;
-static sig_atomic_t sigint = 0;
-
-static void sighandler(int num)
-{
- sigint = 1;
-}
-
-static int setup_socket(struct ring *ring, char *netdev)
-{
- int err, i, fd, v = TPACKET_V3;
- struct sockaddr_ll ll;
- unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
- unsigned int blocknum = 64;
-
- fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
- if (fd < 0) {
- perror("socket");
- exit(1);
- }
-
- err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
- if (err < 0) {
- perror("setsockopt");
- exit(1);
- }
-
- memset(&ring->req, 0, sizeof(ring->req));
- ring->req.tp_block_size = blocksiz;
- ring->req.tp_frame_size = framesiz;
- ring->req.tp_block_nr = blocknum;
- ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
- ring->req.tp_retire_blk_tov = 60;
- ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;
-
- err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
- sizeof(ring->req));
- if (err < 0) {
- perror("setsockopt");
- exit(1);
- }
-
- ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
- PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
- if (ring->map == MAP_FAILED) {
- perror("mmap");
- exit(1);
- }
-
- ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
- assert(ring->rd);
- for (i = 0; i < ring->req.tp_block_nr; ++i) {
- ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
- ring->rd[i].iov_len = ring->req.tp_block_size;
- }
-
- memset(&ll, 0, sizeof(ll));
- ll.sll_family = PF_PACKET;
- ll.sll_protocol = htons(ETH_P_ALL);
- ll.sll_ifindex = if_nametoindex(netdev);
- ll.sll_hatype = 0;
- ll.sll_pkttype = 0;
- ll.sll_halen = 0;
-
- err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
- if (err < 0) {
- perror("bind");
- exit(1);
- }
-
- return fd;
-}
-
-static void display(struct tpacket3_hdr *ppd)
-{
- struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
- struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN);
-
- if (eth->h_proto == htons(ETH_P_IP)) {
- struct sockaddr_in ss, sd;
- char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST];
-
- memset(&ss, 0, sizeof(ss));
- ss.sin_family = PF_INET;
- ss.sin_addr.s_addr = ip->saddr;
- getnameinfo((struct sockaddr *) &ss, sizeof(ss),
- sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST);
-
- memset(&sd, 0, sizeof(sd));
- sd.sin_family = PF_INET;
- sd.sin_addr.s_addr = ip->daddr;
- getnameinfo((struct sockaddr *) &sd, sizeof(sd),
- dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST);
-
- printf("%s -> %s, ", sbuff, dbuff);
- }
-
- printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash);
-}
-
-static void walk_block(struct block_desc *pbd, const int block_num)
-{
- int num_pkts = pbd->h1.num_pkts, i;
- unsigned long bytes = 0;
- struct tpacket3_hdr *ppd;
-
- ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
- pbd->h1.offset_to_first_pkt);
- for (i = 0; i < num_pkts; ++i) {
- bytes += ppd->tp_snaplen;
- display(ppd);
-
- ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
- ppd->tp_next_offset);
- }
-
- packets_total += num_pkts;
- bytes_total += bytes;
-}
-
-static void flush_block(struct block_desc *pbd)
-{
- pbd->h1.block_status = TP_STATUS_KERNEL;
-}
-
-static void teardown_socket(struct ring *ring, int fd)
-{
- munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr);
- free(ring->rd);
- close(fd);
-}
-
-int main(int argc, char **argp)
-{
- int fd, err;
- socklen_t len;
- struct ring ring;
- struct pollfd pfd;
- unsigned int block_num = 0, blocks = 64;
- struct block_desc *pbd;
- struct tpacket_stats_v3 stats;
-
- if (argc != 2) {
- fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]);
- return EXIT_FAILURE;
- }
-
- signal(SIGINT, sighandler);
-
- memset(&ring, 0, sizeof(ring));
- fd = setup_socket(&ring, argp[argc - 1]);
- assert(fd > 0);
-
- memset(&pfd, 0, sizeof(pfd));
- pfd.fd = fd;
- pfd.events = POLLIN | POLLERR;
- pfd.revents = 0;
-
- while (likely(!sigint)) {
- pbd = (struct block_desc *) ring.rd[block_num].iov_base;
-
- if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
- poll(&pfd, 1, -1);
- continue;
- }
-
- walk_block(pbd, block_num);
- flush_block(pbd);
- block_num = (block_num + 1) % blocks;
- }
-
- len = sizeof(stats);
- err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len);
- if (err < 0) {
- perror("getsockopt");
- exit(1);
- }
-
- fflush(stdout);
- printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n",
- stats.tp_packets, bytes_total, stats.tp_drops,
- stats.tp_freeze_q_cnt);
-
- teardown_socket(&ring, fd);
- return 0;
-}
-
--------------------------------------------------------------------------------
-+ PACKET_QDISC_BYPASS
--------------------------------------------------------------------------------
-
-If there is a requirement to load the network with many packets in a similar
-fashion as pktgen does, you might set the following option after socket
-creation:
-
- int one = 1;
- setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));
-
-This has the side-effect, that packets sent through PF_PACKET will bypass the
-kernel's qdisc layer and are forcedly pushed to the driver directly. Meaning,
-packet are not buffered, tc disciplines are ignored, increased loss can occur
-and such packets are also not visible to other PF_PACKET sockets anymore. So,
-you have been warned; generally, this can be useful for stress testing various
-components of a system.
-
-On default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled
-on PF_PACKET sockets.
-
--------------------------------------------------------------------------------
-+ PACKET_TIMESTAMP
--------------------------------------------------------------------------------
-
-The PACKET_TIMESTAMP setting determines the source of the timestamp in
-the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your
-NIC is capable of timestamping packets in hardware, you can request those
-hardware timestamps to be used. Note: you may need to enable the generation
-of hardware timestamps with SIOCSHWTSTAMP (see related information from
-Documentation/networking/timestamping.txt).
-
-PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING:
-
- int req = SOF_TIMESTAMPING_RAW_HARDWARE;
- setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req))
-
-For the mmap(2)ed ring buffers, such timestamps are stored in the
-tpacket{,2,3}_hdr structure's tp_sec and tp_{n,u}sec members. To determine
-what kind of timestamp has been reported, the tp_status field is binary |'ed
-with the following possible bits ...
-
- TP_STATUS_TS_RAW_HARDWARE
- TP_STATUS_TS_SOFTWARE
-
-... that are equivalent to its SOF_TIMESTAMPING_* counterparts. For the
-RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a
-software fallback was invoked *within* PF_PACKET's processing code (less
-precise).
-
-Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
-ii) call sendto() e.g. in blocking mode, iii) wait for status of relevant
-frames to be updated resp. the frame handed over to the application, iv) walk
-through the frames to pick up the individual hw/sw timestamps.
-
-Only (!) if transmit timestamping is enabled, then these bits are combined
-with binary | with TP_STATUS_AVAILABLE, so you must check for that in your
-application (e.g. !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))
-in a first step to see if the frame belongs to the application, and then
-one can extract the type of timestamp in a second step from tp_status)!
-
-If you don't care about them, thus having it disabled, checking for
-TP_STATUS_AVAILABLE resp. TP_STATUS_WRONG_FORMAT is sufficient. If in the
-TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec
-members do not contain a valid value. For TX_RINGs, by default no timestamp
-is generated!
-
-See include/linux/net_tstamp.h and Documentation/networking/timestamping.txt
-for more information on hardware timestamps.
-
--------------------------------------------------------------------------------
-+ Miscellaneous bits
--------------------------------------------------------------------------------
-
-- Packet sockets work well together with Linux socket filters, thus you also
- might want to have a look at Documentation/networking/filter.rst
-
---------------------------------------------------------------------------------
-+ THANKS
---------------------------------------------------------------------------------
-
- Jesse Brandeburg, for fixing my grammathical/spelling errors
-
diff --git a/Documentation/networking/phonet.txt b/Documentation/networking/phonet.rst
similarity index 82%
rename from Documentation/networking/phonet.txt
rename to Documentation/networking/phonet.rst
index 8100358..8668dcb 100644
--- a/Documentation/networking/phonet.txt
+++ b/Documentation/networking/phonet.rst
@@ -1,3 +1,7 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+============================
Linux Phonet protocol family
============================
@@ -11,6 +15,7 @@
Phonet packets can be exchanged through various hardware connections
depending on the device, such as:
+
- USB with the CDC Phonet interface,
- infrared,
- Bluetooth,
@@ -21,7 +26,7 @@
Packets format
--------------
-Phonet packets have a common header as follows:
+Phonet packets have a common header as follows::
struct phonethdr {
uint8_t pn_media; /* Media type (link-layer identifier) */
@@ -72,7 +77,7 @@
Network layer
-------------
-The Phonet socket address family maps the Phonet packet header:
+The Phonet socket address family maps the Phonet packet header::
struct sockaddr_pn {
sa_family_t spn_family; /* AF_PHONET */
@@ -94,6 +99,8 @@
2^10 object IDs available, and can send and receive packets with any
other peer.
+::
+
struct sockaddr_pn addr = { .spn_family = AF_PHONET, };
ssize_t len;
socklen_t addrlen = sizeof(addr);
@@ -105,7 +112,7 @@
sendto(fd, msg, msglen, 0, (struct sockaddr *)&addr, sizeof(addr));
len = recvfrom(fd, buf, sizeof(buf), 0,
- (struct sockaddr *)&addr, &addrlen);
+ (struct sockaddr *)&addr, &addrlen);
This protocol follows the SOCK_DGRAM connection-less semantics.
However, connect() and getpeername() are not supported, as they did
@@ -116,7 +123,7 @@
---------------------
A Phonet datagram socket can be subscribed to any number of 8-bits
-Phonet resources, as follow:
+Phonet resources, as follow::
uint32_t res = 0xXX;
ioctl(fd, SIOCPNADDRESOURCE, &res);
@@ -137,6 +144,8 @@
ID. Each listening socket can handle up to 255 simultaneous
connections, one per accept()'d socket.
+::
+
int lfd, cfd;
lfd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE);
@@ -161,7 +170,7 @@
As of Linux kernel version 2.6.39, it is also possible to connect
two endpoints directly, using connect() on the active side. This is
intended to support the newer Nokia Wireless Modem API, as found in
-e.g. the Nokia Slim Modem in the ST-Ericsson U8500 platform:
+e.g. the Nokia Slim Modem in the ST-Ericsson U8500 platform::
struct sockaddr_spn spn;
int fd;
@@ -177,38 +186,45 @@
close(fd);
-WARNING:
-When polling a connected pipe socket for writability, there is an
-intrinsic race condition whereby writability might be lost between the
-polling and the writing system calls. In this case, the socket will
-block until write becomes possible again, unless non-blocking mode
-is enabled.
+.. Warning:
+
+ When polling a connected pipe socket for writability, there is an
+ intrinsic race condition whereby writability might be lost between the
+ polling and the writing system calls. In this case, the socket will
+ block until write becomes possible again, unless non-blocking mode
+ is enabled.
The pipe protocol provides two socket options at the SOL_PNPIPE level:
PNPIPE_ENCAP accepts one integer value (int) of:
- PNPIPE_ENCAP_NONE: The socket operates normally (default).
+ PNPIPE_ENCAP_NONE:
+ The socket operates normally (default).
- PNPIPE_ENCAP_IP: The socket is used as a backend for a virtual IP
+ PNPIPE_ENCAP_IP:
+ The socket is used as a backend for a virtual IP
interface. This requires CAP_NET_ADMIN capability. GPRS data
support on Nokia modems can use this. Note that the socket cannot
be reliably poll()'d or read() from while in this mode.
- PNPIPE_IFINDEX is a read-only integer value. It contains the
- interface index of the network interface created by PNPIPE_ENCAP,
- or zero if encapsulation is off.
+ PNPIPE_IFINDEX
+ is a read-only integer value. It contains the
+ interface index of the network interface created by PNPIPE_ENCAP,
+ or zero if encapsulation is off.
- PNPIPE_HANDLE is a read-only integer value. It contains the underlying
- identifier ("pipe handle") of the pipe. This is only defined for
- socket descriptors that are already connected or being connected.
+ PNPIPE_HANDLE
+ is a read-only integer value. It contains the underlying
+ identifier ("pipe handle") of the pipe. This is only defined for
+ socket descriptors that are already connected or being connected.
Authors
-------
Linux Phonet was initially written by Sakari Ailus.
+
Other contributors include Mikä Liljeberg, Andras Domokos,
Carlos Chinea and Rémi Denis-Courmont.
-Copyright (C) 2008 Nokia Corporation.
+
+Copyright |copy| 2008 Nokia Corporation.
diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.rst
similarity index 62%
rename from Documentation/networking/pktgen.txt
rename to Documentation/networking/pktgen.rst
index d2fd78f..7afa1c9 100644
--- a/Documentation/networking/pktgen.txt
+++ b/Documentation/networking/pktgen.rst
@@ -1,7 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
-
- HOWTO for the linux packet generator
- ------------------------------------
+====================================
+HOWTO for the linux packet generator
+====================================
Enable CONFIG_NET_PKTGEN to compile and build pktgen either in-kernel
or as a module. A module is preferred; modprobe pktgen if needed. Once
@@ -9,17 +10,18 @@
Monitoring and controlling is done via /proc. It is easiest to select a
suitable sample script and configure that.
-On a dual CPU:
+On a dual CPU::
-ps aux | grep pkt
-root 129 0.3 0.0 0 0 ? SW 2003 523:20 [kpktgend_0]
-root 130 0.3 0.0 0 0 ? SW 2003 509:50 [kpktgend_1]
+ ps aux | grep pkt
+ root 129 0.3 0.0 0 0 ? SW 2003 523:20 [kpktgend_0]
+ root 130 0.3 0.0 0 0 ? SW 2003 509:50 [kpktgend_1]
-For monitoring and control pktgen creates:
+For monitoring and control pktgen creates::
+
/proc/net/pktgen/pgctrl
/proc/net/pktgen/kpktgend_X
- /proc/net/pktgen/ethX
+ /proc/net/pktgen/ethX
Tuning NIC for max performance
@@ -28,7 +30,8 @@
The default NIC settings are (likely) not tuned for pktgen's artificial
overload type of benchmarking, as this could hurt the normal use-case.
-Specifically increasing the TX ring buffer in the NIC:
+Specifically increasing the TX ring buffer in the NIC::
+
# ethtool -G ethX tx 1024
A larger TX ring can improve pktgen's performance, while it can hurt
@@ -46,7 +49,8 @@
and the cleanup interval is affected by the ethtool --coalesce setting
of parameter "rx-usecs".
-For ixgbe use e.g. "30" resulting in approx 33K interrupts/sec (1/30*10^6):
+For ixgbe use e.g. "30" resulting in approx 33K interrupts/sec (1/30*10^6)::
+
# ethtool -C ethX rx-usecs 30
@@ -55,7 +59,7 @@
Pktgen creates a thread for each CPU with affinity to that CPU.
Which is controlled through procfile /proc/net/pktgen/kpktgend_X.
-Example: /proc/net/pktgen/kpktgend_0
+Example: /proc/net/pktgen/kpktgend_0::
Running:
Stopped: eth4@0
@@ -64,6 +68,7 @@
Most important are the devices assigned to the thread.
The two basic thread commands are:
+
* add_device DEVICE@NAME -- adds a single device
* rem_device_all -- remove all associated devices
@@ -73,7 +78,7 @@
To support adding the same device to multiple threads, which is useful
with multi queue NICs, the device naming scheme is extended with "@":
- device@something
+device@something
The part after "@" can be anything, but it is custom to use the thread
number.
@@ -83,30 +88,30 @@
The Params section holds configured information. The Current section
holds running statistics. The Result is printed after a run or after
-interruption. Example:
+interruption. Example::
-/proc/net/pktgen/eth4@0
+ /proc/net/pktgen/eth4@0
- Params: count 100000 min_pkt_size: 60 max_pkt_size: 60
- frags: 0 delay: 0 clone_skb: 64 ifname: eth4@0
- flows: 0 flowlen: 0
- queue_map_min: 0 queue_map_max: 0
- dst_min: 192.168.81.2 dst_max:
- src_min: src_max:
- src_mac: 90:e2:ba:0a:56:b4 dst_mac: 00:1b:21:3c:9d:f8
- udp_src_min: 9 udp_src_max: 109 udp_dst_min: 9 udp_dst_max: 9
- src_mac_count: 0 dst_mac_count: 0
- Flags: UDPSRC_RND NO_TIMESTAMP QUEUE_MAP_CPU
- Current:
- pkts-sofar: 100000 errors: 0
- started: 623913381008us stopped: 623913396439us idle: 25us
- seq_num: 100001 cur_dst_mac_offset: 0 cur_src_mac_offset: 0
- cur_saddr: 192.168.8.3 cur_daddr: 192.168.81.2
- cur_udp_dst: 9 cur_udp_src: 42
- cur_queue_map: 0
- flows: 0
- Result: OK: 15430(c15405+d25) usec, 100000 (60byte,0frags)
- 6480562pps 3110Mb/sec (3110669760bps) errors: 0
+ Params: count 100000 min_pkt_size: 60 max_pkt_size: 60
+ frags: 0 delay: 0 clone_skb: 64 ifname: eth4@0
+ flows: 0 flowlen: 0
+ queue_map_min: 0 queue_map_max: 0
+ dst_min: 192.168.81.2 dst_max:
+ src_min: src_max:
+ src_mac: 90:e2:ba:0a:56:b4 dst_mac: 00:1b:21:3c:9d:f8
+ udp_src_min: 9 udp_src_max: 109 udp_dst_min: 9 udp_dst_max: 9
+ src_mac_count: 0 dst_mac_count: 0
+ Flags: UDPSRC_RND NO_TIMESTAMP QUEUE_MAP_CPU
+ Current:
+ pkts-sofar: 100000 errors: 0
+ started: 623913381008us stopped: 623913396439us idle: 25us
+ seq_num: 100001 cur_dst_mac_offset: 0 cur_src_mac_offset: 0
+ cur_saddr: 192.168.8.3 cur_daddr: 192.168.81.2
+ cur_udp_dst: 9 cur_udp_src: 42
+ cur_queue_map: 0
+ flows: 0
+ Result: OK: 15430(c15405+d25) usec, 100000 (60byte,0frags)
+ 6480562pps 3110Mb/sec (3110669760bps) errors: 0
Configuring devices
@@ -114,11 +119,12 @@
This is done via the /proc interface, and most easily done via pgset
as defined in the sample scripts.
You need to specify PGDEV environment variable to use functions from sample
-scripts, i.e.:
-export PGDEV=/proc/net/pktgen/eth4@0
-source samples/pktgen/functions.sh
+scripts, i.e.::
-Examples:
+ export PGDEV=/proc/net/pktgen/eth4@0
+ source samples/pktgen/functions.sh
+
+Examples::
pg_ctrl start starts injection.
pg_ctrl stop aborts injection. Also, ^C aborts generator.
@@ -126,17 +132,17 @@
pgset "clone_skb 1" sets the number of copies of the same packet
pgset "clone_skb 0" use single SKB for all transmits
pgset "burst 8" uses xmit_more API to queue 8 copies of the same
- packet and update HW tx queue tail pointer once.
- "burst 1" is the default
+ packet and update HW tx queue tail pointer once.
+ "burst 1" is the default
pgset "pkt_size 9014" sets packet size to 9014
pgset "frags 5" packet will consist of 5 fragments
pgset "count 200000" sets number of packets to send, set to zero
- for continuous sends until explicitly stopped.
+ for continuous sends until explicitly stopped.
pgset "delay 5000" adds delay to hard_start_xmit(). nanoseconds
pgset "dst 10.0.0.1" sets IP destination address
- (BEWARE! This generator is very aggressive!)
+ (BEWARE! This generator is very aggressive!)
pgset "dst_min 10.0.0.1" Same as dst
pgset "dst_max 10.0.0.254" Set the maximum destination IP.
@@ -149,46 +155,46 @@
pgset "queue_map_min 0" Sets the min value of tx queue interval
pgset "queue_map_max 7" Sets the max value of tx queue interval, for multiqueue devices
- To select queue 1 of a given device,
- use queue_map_min=1 and queue_map_max=1
+ To select queue 1 of a given device,
+ use queue_map_min=1 and queue_map_max=1
pgset "src_mac_count 1" Sets the number of MACs we'll range through.
- The 'minimum' MAC is what you set with srcmac.
+ The 'minimum' MAC is what you set with srcmac.
pgset "dst_mac_count 1" Sets the number of MACs we'll range through.
- The 'minimum' MAC is what you set with dstmac.
+ The 'minimum' MAC is what you set with dstmac.
pgset "flag [name]" Set a flag to determine behaviour. Current flags
- are: IPSRC_RND # IP source is random (between min/max)
- IPDST_RND # IP destination is random
- UDPSRC_RND, UDPDST_RND,
- MACSRC_RND, MACDST_RND
- TXSIZE_RND, IPV6,
- MPLS_RND, VID_RND, SVID_RND
- FLOW_SEQ,
- QUEUE_MAP_RND # queue map random
- QUEUE_MAP_CPU # queue map mirrors smp_processor_id()
- UDPCSUM,
- IPSEC # IPsec encapsulation (needs CONFIG_XFRM)
- NODE_ALLOC # node specific memory allocation
- NO_TIMESTAMP # disable timestamping
+ are: IPSRC_RND # IP source is random (between min/max)
+ IPDST_RND # IP destination is random
+ UDPSRC_RND, UDPDST_RND,
+ MACSRC_RND, MACDST_RND
+ TXSIZE_RND, IPV6,
+ MPLS_RND, VID_RND, SVID_RND
+ FLOW_SEQ,
+ QUEUE_MAP_RND # queue map random
+ QUEUE_MAP_CPU # queue map mirrors smp_processor_id()
+ UDPCSUM,
+ IPSEC # IPsec encapsulation (needs CONFIG_XFRM)
+ NODE_ALLOC # node specific memory allocation
+ NO_TIMESTAMP # disable timestamping
pgset 'flag ![name]' Clear a flag to determine behaviour.
- Note that you might need to use single quote in
- interactive mode, so that your shell wouldn't expand
- the specified flag as a history command.
+ Note that you might need to use single quote in
+ interactive mode, so that your shell wouldn't expand
+ the specified flag as a history command.
pgset "spi [SPI_VALUE]" Set specific SA used to transform packet.
pgset "udp_src_min 9" set UDP source port min, If < udp_src_max, then
- cycle through the port range.
+ cycle through the port range.
pgset "udp_src_max 9" set UDP source port max.
pgset "udp_dst_min 9" set UDP destination port min, If < udp_dst_max, then
- cycle through the port range.
+ cycle through the port range.
pgset "udp_dst_max 9" set UDP destination port max.
pgset "mpls 0001000a,0002000a,0000000a" set MPLS labels (in this example
- outer label=16,middle label=32,
+ outer label=16,middle label=32,
inner label=0 (IPv4 NULL)) Note that
there must be no spaces between the
arguments. Leading zeros are required.
@@ -232,10 +238,14 @@
samples/pktgen directory. The helper parameters.sh file support easy
and consistent parameter parsing across the sample scripts.
-Usage example and help:
+Usage example and help::
+
./pktgen_sample01_simple.sh -i eth4 -m 00:1B:21:3C:9D:F8 -d 192.168.8.2
-Usage: ./pktgen_sample01_simple.sh [-vx] -i ethX
+Usage:::
+
+ ./pktgen_sample01_simple.sh [-vx] -i ethX
+
-i : ($DEV) output interface/device (required)
-s : ($PKT_SIZE) packet size
-d : ($DEST_IP) destination IP
@@ -250,13 +260,13 @@
interface/device parameter "-i" sets variable $DEV. Copy the
pktgen_sampleXX scripts and modify them to fit your own needs.
-The old scripts:
+The old scripts::
-pktgen.conf-1-2 # 1 CPU 2 dev
-pktgen.conf-1-1-rdos # 1 CPU 1 dev w. route DoS
-pktgen.conf-1-1-ip6 # 1 CPU 1 dev ipv6
-pktgen.conf-1-1-ip6-rdos # 1 CPU 1 dev ipv6 w. route DoS
-pktgen.conf-1-1-flows # 1 CPU 1 dev multiple flows.
+ pktgen.conf-1-2 # 1 CPU 2 dev
+ pktgen.conf-1-1-rdos # 1 CPU 1 dev w. route DoS
+ pktgen.conf-1-1-ip6 # 1 CPU 1 dev ipv6
+ pktgen.conf-1-1-ip6-rdos # 1 CPU 1 dev ipv6 w. route DoS
+ pktgen.conf-1-1-flows # 1 CPU 1 dev multiple flows.
Interrupt affinity
@@ -271,10 +281,10 @@
Enable IPsec
============
Default IPsec transformation with ESP encapsulation plus transport mode
-can be enabled by simply setting:
+can be enabled by simply setting::
-pgset "flag IPSEC"
-pgset "flows 1"
+ pgset "flag IPSEC"
+ pgset "flows 1"
To avoid breaking existing testbed scripts for using AH type and tunnel mode,
you can use "pgset spi SPI_VALUE" to specify which transformation mode
@@ -284,115 +294,117 @@
Current commands and configuration options
==========================================
-** Pgcontrol commands:
+**Pgcontrol commands**::
-start
-stop
-reset
+ start
+ stop
+ reset
-** Thread commands:
+**Thread commands**::
-add_device
-rem_device_all
+ add_device
+ rem_device_all
-** Device commands:
+**Device commands**::
-count
-clone_skb
-burst
-debug
+ count
+ clone_skb
+ burst
+ debug
-frags
-delay
+ frags
+ delay
-src_mac_count
-dst_mac_count
+ src_mac_count
+ dst_mac_count
-pkt_size
-min_pkt_size
-max_pkt_size
+ pkt_size
+ min_pkt_size
+ max_pkt_size
-queue_map_min
-queue_map_max
-skb_priority
+ queue_map_min
+ queue_map_max
+ skb_priority
-tos (ipv4)
-traffic_class (ipv6)
+ tos (ipv4)
+ traffic_class (ipv6)
-mpls
+ mpls
-udp_src_min
-udp_src_max
+ udp_src_min
+ udp_src_max
-udp_dst_min
-udp_dst_max
+ udp_dst_min
+ udp_dst_max
-node
+ node
-flag
- IPSRC_RND
- IPDST_RND
- UDPSRC_RND
- UDPDST_RND
- MACSRC_RND
- MACDST_RND
- TXSIZE_RND
- IPV6
- MPLS_RND
- VID_RND
- SVID_RND
- FLOW_SEQ
- QUEUE_MAP_RND
- QUEUE_MAP_CPU
- UDPCSUM
- IPSEC
- NODE_ALLOC
- NO_TIMESTAMP
+ flag
+ IPSRC_RND
+ IPDST_RND
+ UDPSRC_RND
+ UDPDST_RND
+ MACSRC_RND
+ MACDST_RND
+ TXSIZE_RND
+ IPV6
+ MPLS_RND
+ VID_RND
+ SVID_RND
+ FLOW_SEQ
+ QUEUE_MAP_RND
+ QUEUE_MAP_CPU
+ UDPCSUM
+ IPSEC
+ NODE_ALLOC
+ NO_TIMESTAMP
-spi (ipsec)
+ spi (ipsec)
-dst_min
-dst_max
+ dst_min
+ dst_max
-src_min
-src_max
+ src_min
+ src_max
-dst_mac
-src_mac
+ dst_mac
+ src_mac
-clear_counters
+ clear_counters
-src6
-dst6
-dst6_max
-dst6_min
+ src6
+ dst6
+ dst6_max
+ dst6_min
-flows
-flowlen
+ flows
+ flowlen
-rate
-ratep
+ rate
+ ratep
-xmit_mode <start_xmit|netif_receive>
+ xmit_mode <start_xmit|netif_receive>
-vlan_cfi
-vlan_id
-vlan_p
+ vlan_cfi
+ vlan_id
+ vlan_p
-svlan_cfi
-svlan_id
-svlan_p
+ svlan_cfi
+ svlan_id
+ svlan_p
References:
-ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/
-ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/
+
+- ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/
+- tp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/
Paper from Linux-Kongress in Erlangen 2004.
-ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/pktgen_paper.pdf
+- ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/pktgen_paper.pdf
Thanks to:
+
Grant Grundler for testing on IA-64 and parisc, Harald Welte, Lennert Buytenhek
Stephen Hemminger, Andi Kleen, Dave Miller and many others.
diff --git a/Documentation/networking/PLIP.txt b/Documentation/networking/plip.rst
similarity index 92%
rename from Documentation/networking/PLIP.txt
rename to Documentation/networking/plip.rst
index ad7e3f7..0eda745 100644
--- a/Documentation/networking/PLIP.txt
+++ b/Documentation/networking/plip.rst
@@ -1,4 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================================
PLIP: The Parallel Line Internet Protocol Device
+================================================
Donald Becker (becker@super.org)
I.D.A. Supercomputing Research Center, Bowie MD 20715
@@ -83,7 +87,7 @@
data transfer (the maximal time the PLIP driver would allow the other side
before announcing a timeout, when trying to handshake a transfer of some
data) is, by default, 500usec. As IRQ delivery is more or less immediate,
-this timeout is quite sufficient.
+this timeout is quite sufficient.
When in IRQ-less mode, the PLIP driver polls the parallel port HZ times
per second (where HZ is typically 100 on most platforms, and 1024 on an
@@ -115,7 +119,7 @@
data bit outputs connected to status bit inputs.
The second data transfer method relies on both machines having
-bi-directional parallel ports, rather than output-only ``printer''
+bi-directional parallel ports, rather than output-only ``printer``
ports. This allows byte-wide transfers and avoids reconstructing
nibbles into bytes, leading to much faster transfers.
@@ -132,7 +136,7 @@
A cable that implements this protocol is available commercially as a
"Null Printer" or "Turbo Laplink" cable. It can be constructed with
-two DB-25 male connectors symmetrically connected as follows:
+two DB-25 male connectors symmetrically connected as follows::
STROBE output 1*
D0->ERROR 2 - 15 15 - 2
@@ -146,7 +150,8 @@
SLCTIN 17 - 17
extra grounds are 18*,19*,20*,21*,22*,23*,24*
GROUND 25 - 25
-* Do not connect these pins on either end
+
+ * Do not connect these pins on either end
If the cable you are using has a metallic shield it should be
connected to the metallic DB-25 shell at one end only.
@@ -155,14 +160,14 @@
========================
The second data transfer method relies on both machines having
-bi-directional parallel ports, rather than output-only ``printer''
+bi-directional parallel ports, rather than output-only ``printer``
ports. This allows byte-wide transfers, and avoids reconstructing
nibbles into bytes. This cable should not be used on unidirectional
-``printer'' (as opposed to ``parallel'') ports or when the machine
+``printer`` (as opposed to ``parallel``) ports or when the machine
isn't configured for PLIP, as it will result in output driver
conflicts and the (unlikely) possibility of damage.
-The cable for this transfer mode should be constructed as follows:
+The cable for this transfer mode should be constructed as follows::
STROBE->BUSY 1 - 11
D0->D0 2 - 2
@@ -179,7 +184,8 @@
GND->ERROR 18 - 15
extra grounds are 19*,20*,21*,22*,23*,24*
GROUND 25 - 25
-* Do not connect these pins on either end
+
+ * Do not connect these pins on either end
Once again, if the cable you are using has a metallic shield it should
be connected to the metallic DB-25 shell at one end only.
@@ -188,7 +194,7 @@
=============================
The PLIP driver is compatible with the "Crynwr" parallel port transfer
-standard in Mode 0. That standard specifies the following protocol:
+standard in Mode 0. That standard specifies the following protocol::
send header nibble '0x8'
count-low octet
@@ -196,20 +202,21 @@
... data octets
checksum octet
-Each octet is sent as
+Each octet is sent as::
+
<wait for rx. '0x1?'> <send 0x10+(octet&0x0F)>
<wait for rx. '0x0?'> <send 0x00+((octet>>4)&0x0F)>
To start a transfer the transmitting machine outputs a nibble 0x08.
That raises the ACK line, triggering an interrupt in the receiving
machine. The receiving machine disables interrupts and raises its own ACK
-line.
+line.
-Restated:
+Restated::
-(OUT is bit 0-4, OUT.j is bit j from OUT. IN likewise)
-Send_Byte:
- OUT := low nibble, OUT.4 := 1
- WAIT FOR IN.4 = 1
- OUT := high nibble, OUT.4 := 0
- WAIT FOR IN.4 = 0
+ (OUT is bit 0-4, OUT.j is bit j from OUT. IN likewise)
+ Send_Byte:
+ OUT := low nibble, OUT.4 := 1
+ WAIT FOR IN.4 = 1
+ OUT := high nibble, OUT.4 := 0
+ WAIT FOR IN.4 = 0
diff --git a/Documentation/networking/ppp_generic.txt b/Documentation/networking/ppp_generic.rst
similarity index 91%
rename from Documentation/networking/ppp_generic.txt
rename to Documentation/networking/ppp_generic.rst
index fd563af..e605043 100644
--- a/Documentation/networking/ppp_generic.txt
+++ b/Documentation/networking/ppp_generic.rst
@@ -1,8 +1,12 @@
- PPP Generic Driver and Channel Interface
- ----------------------------------------
+.. SPDX-License-Identifier: GPL-2.0
- Paul Mackerras
+========================================
+PPP Generic Driver and Channel Interface
+========================================
+
+ Paul Mackerras
paulus@samba.org
+
7 Feb 2002
The generic PPP driver in linux-2.4 provides an implementation of the
@@ -19,7 +23,7 @@
* simple packet filtering
For sending and receiving PPP frames, the generic PPP driver calls on
-the services of PPP `channels'. A PPP channel encapsulates a
+the services of PPP ``channels``. A PPP channel encapsulates a
mechanism for transporting PPP frames from one machine to another. A
PPP channel implementation can be arbitrarily complex internally but
has a very simple interface with the generic PPP code: it merely has
@@ -102,7 +106,7 @@
async tty, this can involve setting the tty speed and modes, issuing
modem commands, and then going through some sort of dialog with the
remote system to invoke PPP service there. We refer to this process
-as `discovery'. Then the user-level process tells the medium to
+as ``discovery``. Then the user-level process tells the medium to
become a PPP channel and register itself with the generic PPP layer.
The channel then has to report the channel number assigned to it back
to the user-level process. From that point, the PPP negotiation code
@@ -111,8 +115,8 @@
At the interface to the PPP generic layer, PPP frames are stored in
skbuff structures and start with the two-byte PPP protocol number.
-The frame does *not* include the 0xff `address' byte or the 0x03
-`control' byte that are optionally used in async PPP. Nor is there
+The frame does *not* include the 0xff ``address`` byte or the 0x03
+``control`` byte that are optionally used in async PPP. Nor is there
any escaping of control characters, nor are there any FCS or framing
characters included. That is all the responsibility of the channel
code, if it is needed for the particular medium. That is, the skbuffs
@@ -121,16 +125,16 @@
must be in the same format.
The channel must provide an instance of a ppp_channel struct to
-represent the channel. The channel is free to use the `private' field
-however it wishes. The channel should initialize the `mtu' and
-`hdrlen' fields before calling ppp_register_channel() and not change
-them until after ppp_unregister_channel() returns. The `mtu' field
+represent the channel. The channel is free to use the ``private`` field
+however it wishes. The channel should initialize the ``mtu`` and
+``hdrlen`` fields before calling ppp_register_channel() and not change
+them until after ppp_unregister_channel() returns. The ``mtu`` field
represents the maximum size of the data part of the PPP frames, that
is, it does not include the 2-byte protocol number.
If the channel needs some headroom in the skbuffs presented to it for
transmission (i.e., some space free in the skbuff data area before the
-start of the PPP frame), it should set the `hdrlen' field of the
+start of the PPP frame), it should set the ``hdrlen`` field of the
ppp_channel struct to the amount of headroom required. The generic
PPP layer will attempt to provide that much headroom but the channel
should still check if there is sufficient headroom and copy the skbuff
@@ -322,6 +326,8 @@
interface. The argument should be a pointer to an int containing
the new flags value. The bits in the flags value that can be set
are:
+
+ ================ ========================================
SC_COMP_TCP enable transmit TCP header compression
SC_NO_TCP_CCID disable connection-id compression for
TCP header compression
@@ -335,6 +341,7 @@
SC_MP_SHORTSEQ expect short multilink sequence
numbers on received multilink fragments
SC_MP_XSHORTSEQ transmit short multilink sequence nos.
+ ================ ========================================
The values of these flags are defined in <linux/ppp-ioctl.h>. Note
that the values of the SC_MULTILINK, SC_MP_SHORTSEQ and
@@ -345,17 +352,20 @@
interface unit. The argument should point to an int where the ioctl
will store the flags value. As well as the values listed above for
PPPIOCSFLAGS, the following bits may be set in the returned value:
+
+ ================ =========================================
SC_COMP_RUN CCP compressor is running
SC_DECOMP_RUN CCP decompressor is running
SC_DC_ERROR CCP decompressor detected non-fatal error
SC_DC_FERROR CCP decompressor detected fatal error
+ ================ =========================================
* PPPIOCSCOMPRESS sets the parameters for packet compression or
decompression. The argument should point to a ppp_option_data
structure (defined in <linux/ppp-ioctl.h>), which contains a
pointer/length pair which should describe a block of memory
containing a CCP option specifying a compression method and its
- parameters. The ppp_option_data struct also contains a `transmit'
+ parameters. The ppp_option_data struct also contains a ``transmit``
field. If this is 0, the ioctl will affect the receive path,
otherwise the transmit path.
@@ -377,7 +387,7 @@
ppp_idle structure (defined in <linux/ppp_defs.h>). If the
CONFIG_PPP_FILTER option is enabled, the set of packets which reset
the transmit and receive idle timers is restricted to those which
- pass the `active' packet filter.
+ pass the ``active`` packet filter.
Two versions of this command exist, to deal with user space
expecting times as either 32-bit or 64-bit time_t seconds.
@@ -391,31 +401,33 @@
* PPPIOCSNPMODE sets the network-protocol mode for a given network
protocol. The argument should point to an npioctl struct (defined
- in <linux/ppp-ioctl.h>). The `protocol' field gives the PPP protocol
- number for the protocol to be affected, and the `mode' field
+ in <linux/ppp-ioctl.h>). The ``protocol`` field gives the PPP protocol
+ number for the protocol to be affected, and the ``mode`` field
specifies what to do with packets for that protocol:
+ ============= ==============================================
NPMODE_PASS normal operation, transmit and receive packets
NPMODE_DROP silently drop packets for this protocol
NPMODE_ERROR drop packets and return an error on transmit
NPMODE_QUEUE queue up packets for transmit, drop received
packets
+ ============= ==============================================
At present NPMODE_ERROR and NPMODE_QUEUE have the same effect as
NPMODE_DROP.
* PPPIOCGNPMODE returns the network-protocol mode for a given
protocol. The argument should point to an npioctl struct with the
- `protocol' field set to the PPP protocol number for the protocol of
- interest. On return the `mode' field will be set to the network-
+ ``protocol`` field set to the PPP protocol number for the protocol of
+ interest. On return the ``mode`` field will be set to the network-
protocol mode for that protocol.
-* PPPIOCSPASS and PPPIOCSACTIVE set the `pass' and `active' packet
+* PPPIOCSPASS and PPPIOCSACTIVE set the ``pass`` and ``active`` packet
filters. These ioctls are only available if the CONFIG_PPP_FILTER
option is selected. The argument should point to a sock_fprog
structure (defined in <linux/filter.h>) containing the compiled BPF
instructions for the filter. Packets are dropped if they fail the
- `pass' filter; otherwise, if they fail the `active' filter they are
+ ``pass`` filter; otherwise, if they fail the ``active`` filter they are
passed but they do not reset the transmit or receive idle timer.
* PPPIOCSMRRU enables or disables multilink processing for received
diff --git a/Documentation/networking/proc_net_tcp.txt b/Documentation/networking/proc_net_tcp.rst
similarity index 82%
rename from Documentation/networking/proc_net_tcp.txt
rename to Documentation/networking/proc_net_tcp.rst
index 4a79209..7d9dfe3 100644
--- a/Documentation/networking/proc_net_tcp.txt
+++ b/Documentation/networking/proc_net_tcp.rst
@@ -1,15 +1,21 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================================
+The proc/net/tcp and proc/net/tcp6 variables
+============================================
+
This document describes the interfaces /proc/net/tcp and /proc/net/tcp6.
Note that these interfaces are deprecated in favor of tcp_diag.
-These /proc interfaces provide information about currently active TCP
+These /proc interfaces provide information about currently active TCP
connections, and are implemented by tcp4_seq_show() in net/ipv4/tcp_ipv4.c
and tcp6_seq_show() in net/ipv6/tcp_ipv6.c, respectively.
It will first list all listening TCP sockets, and next list all established
-TCP connections. A typical entry of /proc/net/tcp would look like this (split
-up into 3 parts because of the length of the line):
+TCP connections. A typical entry of /proc/net/tcp would look like this (split
+up into 3 parts because of the length of the line)::
- 46: 010310AC:9C4C 030310AC:1770 01
+ 46: 010310AC:9C4C 030310AC:1770 01
| | | | | |--> connection state
| | | | |------> remote TCP port number
| | | |-------------> remote IPv4 address
@@ -17,7 +23,7 @@
| |---------------------------> local IPv4 address
|----------------------------------> number of entry
- 00000150:00000000 01:00000019 00000000
+ 00000150:00000000 01:00000019 00000000
| | | | |--> number of unrecovered RTO timeouts
| | | |----------> number of jiffies until timer expires
| | |----------------> timer_active (see below)
@@ -25,7 +31,7 @@
|-------------------------------> transmit-queue
1000 0 54165785 4 cd1e6040 25 4 27 3 -1
- | | | | | | | | | |--> slow start size threshold,
+ | | | | | | | | | |--> slow start size threshold,
| | | | | | | | | or -1 if the threshold
| | | | | | | | | is >= 0xFFFF
| | | | | | | | |----> sending congestion window
@@ -40,9 +46,12 @@
|---------------------------------------------> uid
timer_active:
+
+ == ================================================================
0 no timer is pending
1 retransmit-timer is pending
2 another timer (e.g. delayed ack or keepalive) is pending
- 3 this is a socket in TIME_WAIT state. Not all fields will contain
+ 3 this is a socket in TIME_WAIT state. Not all fields will contain
data (or even exist)
4 zero window probe timer is pending
+ == ================================================================
diff --git a/Documentation/networking/radiotap-headers.txt b/Documentation/networking/radiotap-headers.rst
similarity index 69%
rename from Documentation/networking/radiotap-headers.txt
rename to Documentation/networking/radiotap-headers.rst
index 953331c..1a1bd1e 100644
--- a/Documentation/networking/radiotap-headers.txt
+++ b/Documentation/networking/radiotap-headers.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
How to use radiotap headers
===========================
@@ -5,9 +8,9 @@
------------------------------------
Radiotap headers are variable-length and extensible, you can get most of the
-information you need to know on them from:
+information you need to know on them from::
-./include/net/ieee80211_radiotap.h
+ ./include/net/ieee80211_radiotap.h
This document gives an overview and warns on some corner cases.
@@ -21,6 +24,8 @@
the header for argument index 0 (IEEE80211_RADIOTAP_TSFT) is present in the
argument area.
+::
+
< 8-byte ieee80211_radiotap_header >
[ <possible argument bitmap extensions ... > ]
[ <argument> ... ]
@@ -76,6 +81,8 @@
Example valid radiotap header
-----------------------------
+::
+
0x00, 0x00, // <-- radiotap version + pad byte
0x0b, 0x00, // <- radiotap header length
0x04, 0x0c, 0x00, 0x00, // <-- bitmap
@@ -89,64 +96,64 @@
If you are having to parse a radiotap struct, you can radically simplify the
job by using the radiotap parser that lives in net/wireless/radiotap.c and has
-its prototypes available in include/net/cfg80211.h. You use it like this:
+its prototypes available in include/net/cfg80211.h. You use it like this::
-#include <net/cfg80211.h>
+ #include <net/cfg80211.h>
-/* buf points to the start of the radiotap header part */
+ /* buf points to the start of the radiotap header part */
-int MyFunction(u8 * buf, int buflen)
-{
- int pkt_rate_100kHz = 0, antenna = 0, pwr = 0;
- struct ieee80211_radiotap_iterator iterator;
- int ret = ieee80211_radiotap_iterator_init(&iterator, buf, buflen);
+ int MyFunction(u8 * buf, int buflen)
+ {
+ int pkt_rate_100kHz = 0, antenna = 0, pwr = 0;
+ struct ieee80211_radiotap_iterator iterator;
+ int ret = ieee80211_radiotap_iterator_init(&iterator, buf, buflen);
- while (!ret) {
+ while (!ret) {
- ret = ieee80211_radiotap_iterator_next(&iterator);
+ ret = ieee80211_radiotap_iterator_next(&iterator);
- if (ret)
- continue;
+ if (ret)
+ continue;
- /* see if this argument is something we can use */
+ /* see if this argument is something we can use */
- switch (iterator.this_arg_index) {
- /*
- * You must take care when dereferencing iterator.this_arg
- * for multibyte types... the pointer is not aligned. Use
- * get_unaligned((type *)iterator.this_arg) to dereference
- * iterator.this_arg for type "type" safely on all arches.
- */
- case IEEE80211_RADIOTAP_RATE:
- /* radiotap "rate" u8 is in
- * 500kbps units, eg, 0x02=1Mbps
- */
- pkt_rate_100kHz = (*iterator.this_arg) * 5;
- break;
+ switch (iterator.this_arg_index) {
+ /*
+ * You must take care when dereferencing iterator.this_arg
+ * for multibyte types... the pointer is not aligned. Use
+ * get_unaligned((type *)iterator.this_arg) to dereference
+ * iterator.this_arg for type "type" safely on all arches.
+ */
+ case IEEE80211_RADIOTAP_RATE:
+ /* radiotap "rate" u8 is in
+ * 500kbps units, eg, 0x02=1Mbps
+ */
+ pkt_rate_100kHz = (*iterator.this_arg) * 5;
+ break;
- case IEEE80211_RADIOTAP_ANTENNA:
- /* radiotap uses 0 for 1st ant */
- antenna = *iterator.this_arg);
- break;
+ case IEEE80211_RADIOTAP_ANTENNA:
+ /* radiotap uses 0 for 1st ant */
+ antenna = *iterator.this_arg);
+ break;
- case IEEE80211_RADIOTAP_DBM_TX_POWER:
- pwr = *iterator.this_arg;
- break;
+ case IEEE80211_RADIOTAP_DBM_TX_POWER:
+ pwr = *iterator.this_arg;
+ break;
- default:
- break;
- }
- } /* while more rt headers */
+ default:
+ break;
+ }
+ } /* while more rt headers */
- if (ret != -ENOENT)
- return TXRX_DROP;
+ if (ret != -ENOENT)
+ return TXRX_DROP;
- /* discard the radiotap header part */
- buf += iterator.max_length;
- buflen -= iterator.max_length;
+ /* discard the radiotap header part */
+ buf += iterator.max_length;
+ buflen -= iterator.max_length;
- ...
+ ...
-}
+ }
Andy Green <andy@warmcat.com>
diff --git a/Documentation/networking/ray_cs.txt b/Documentation/networking/ray_cs.rst
similarity index 64%
rename from Documentation/networking/ray_cs.txt
rename to Documentation/networking/ray_cs.rst
index c0c1230..9a46d1a 100644
--- a/Documentation/networking/ray_cs.txt
+++ b/Documentation/networking/ray_cs.rst
@@ -1,6 +1,14 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. include:: <isonum.txt>
+
+=========================
+Raylink wireless LAN card
+=========================
+
September 21, 1999
-Copyright (c) 1998 Corey Thomas (corey@world.std.com)
+Copyright |copy| 1998 Corey Thomas (corey@world.std.com)
This file is the documentation for the Raylink Wireless LAN card driver for
Linux. The Raylink wireless LAN card is a PCMCIA card which provides IEEE
@@ -13,7 +21,7 @@
As of kernel 2.3.18, the ray_cs driver is part of the Linux kernel
source. My web page for the development of ray_cs is at
-http://web.ralinktech.com/ralink/Home/Support/Linux.html
+http://web.ralinktech.com/ralink/Home/Support/Linux.html
and I can be emailed at corey@world.std.com
The kernel driver is based on ray_cs-1.62.tgz
@@ -29,6 +37,7 @@
will find them all.
Information on card services is available at:
+
http://pcmcia-cs.sourceforge.net/
@@ -39,72 +48,78 @@
Currently, ray_cs is not part of David Hinds card services package,
so the following magic is required.
-At the end of the /etc/pcmcia/config.opts file, add the line:
-source ./ray_cs.opts
+At the end of the /etc/pcmcia/config.opts file, add the line:
+source ./ray_cs.opts
This will make card services read the ray_cs.opts file
when starting. Create the file /etc/pcmcia/ray_cs.opts containing the
-following:
+following::
-#### start of /etc/pcmcia/ray_cs.opts ###################
-# Configuration options for Raylink Wireless LAN PCMCIA card
-device "ray_cs"
- class "network" module "misc/ray_cs"
+ #### start of /etc/pcmcia/ray_cs.opts ###################
+ # Configuration options for Raylink Wireless LAN PCMCIA card
+ device "ray_cs"
+ class "network" module "misc/ray_cs"
-card "RayLink PC Card WLAN Adapter"
- manfid 0x01a6, 0x0000
- bind "ray_cs"
+ card "RayLink PC Card WLAN Adapter"
+ manfid 0x01a6, 0x0000
+ bind "ray_cs"
-module "misc/ray_cs" opts ""
-#### end of /etc/pcmcia/ray_cs.opts #####################
+ module "misc/ray_cs" opts ""
+ #### end of /etc/pcmcia/ray_cs.opts #####################
To join an existing network with
-different parameters, contact the network administrator for the
+different parameters, contact the network administrator for the
configuration information, and edit /etc/pcmcia/ray_cs.opts.
Add the parameters below between the empty quotes.
Parameters for ray_cs driver which may be specified in ray_cs.opts:
-bc integer 0 = normal mode (802.11 timing)
- 1 = slow down inter frame timing to allow
- operation with older breezecom access
- points.
+=============== =============== =============================================
+bc integer 0 = normal mode (802.11 timing),
+ 1 = slow down inter frame timing to allow
+ operation with older breezecom access
+ points.
-beacon_period integer beacon period in Kilo-microseconds
- legal values = must be integer multiple
- of hop dwell
- default = 256
+beacon_period integer beacon period in Kilo-microseconds,
-country integer 1 = USA (default)
- 2 = Europe
- 3 = Japan
- 4 = Korea
- 5 = Spain
- 6 = France
- 7 = Israel
- 8 = Australia
+ legal values = must be integer multiple
+ of hop dwell
+
+ default = 256
+
+country integer 1 = USA (default),
+ 2 = Europe,
+ 3 = Japan,
+ 4 = Korea,
+ 5 = Spain,
+ 6 = France,
+ 7 = Israel,
+ 8 = Australia
essid string ESS ID - network name to join
+
string with maximum length of 32 chars
default value = "ADHOC_ESSID"
-hop_dwell integer hop dwell time in Kilo-microseconds
+hop_dwell integer hop dwell time in Kilo-microseconds
+
legal values = 16,32,64,128(default),256
irq_mask integer linux standard 16 bit value 1bit/IRQ
+
lsb is IRQ 0, bit 1 is IRQ 1 etc.
Used to restrict choice of IRQ's to use.
- Recommended method for controlling
- interrupts is in /etc/pcmcia/config.opts
+ Recommended method for controlling
+ interrupts is in /etc/pcmcia/config.opts
-net_type integer 0 (default) = adhoc network,
+net_type integer 0 (default) = adhoc network,
1 = infrastructure
phy_addr string string containing new MAC address in
hex, must start with x eg
x00008f123456
-psm integer 0 = continuously active
+psm integer 0 = continuously active,
1 = power save mode (not useful yet)
pc_debug integer (0-5) larger values for more verbose
@@ -114,14 +129,14 @@
ray_mem_speed integer defaults to 500
-sniffer integer 0 = not sniffer (default)
- 1 = sniffer which can be used to record all
- network traffic using tcpdump or similar,
- but no normal network use is allowed.
+sniffer integer 0 = not sniffer (default),
+ 1 = sniffer which can be used to record all
+ network traffic using tcpdump or similar,
+ but no normal network use is allowed.
-translate integer 0 = no translation (encapsulate frames)
+translate integer 0 = no translation (encapsulate frames),
1 = translation (RFC1042/802.1)
-
+=============== =============== =============================================
More on sniffer mode:
@@ -136,7 +151,7 @@
Known Problems and missing features
- Does not work with non x86
+ Does not work with non x86
Does not work with SMP
diff --git a/Documentation/networking/rds.rst b/Documentation/networking/rds.rst
new file mode 100644
index 0000000..44936c2
--- /dev/null
+++ b/Documentation/networking/rds.rst
@@ -0,0 +1,448 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==
+RDS
+===
+
+Overview
+========
+
+This readme tries to provide some background on the hows and whys of RDS,
+and will hopefully help you find your way around the code.
+
+In addition, please see this email about RDS origins:
+http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
+
+RDS Architecture
+================
+
+RDS provides reliable, ordered datagram delivery by using a single
+reliable connection between any two nodes in the cluster. This allows
+applications to use a single socket to talk to any other process in the
+cluster - so in a cluster with N processes you need N sockets, in contrast
+to N*N if you use a connection-oriented socket transport like TCP.
+
+RDS is not Infiniband-specific; it was designed to support different
+transports. The current implementation used to support RDS over TCP as well
+as IB.
+
+The high-level semantics of RDS from the application's point of view are
+
+ * Addressing
+
+ RDS uses IPv4 addresses and 16bit port numbers to identify
+ the end point of a connection. All socket operations that involve
+ passing addresses between kernel and user space generally
+ use a struct sockaddr_in.
+
+ The fact that IPv4 addresses are used does not mean the underlying
+ transport has to be IP-based. In fact, RDS over IB uses a
+ reliable IB connection; the IP address is used exclusively to
+ locate the remote node's GID (by ARPing for the given IP).
+
+ The port space is entirely independent of UDP, TCP or any other
+ protocol.
+
+ * Socket interface
+
+ RDS sockets work *mostly* as you would expect from a BSD
+ socket. The next section will cover the details. At any rate,
+ all I/O is performed through the standard BSD socket API.
+ Some additions like zerocopy support are implemented through
+ control messages, while other extensions use the getsockopt/
+ setsockopt calls.
+
+ Sockets must be bound before you can send or receive data.
+ This is needed because binding also selects a transport and
+ attaches it to the socket. Once bound, the transport assignment
+ does not change. RDS will tolerate IPs moving around (eg in
+ a active-active HA scenario), but only as long as the address
+ doesn't move to a different transport.
+
+ * sysctls
+
+ RDS supports a number of sysctls in /proc/sys/net/rds
+
+
+Socket Interface
+================
+
+ AF_RDS, PF_RDS, SOL_RDS
+ AF_RDS and PF_RDS are the domain type to be used with socket(2)
+ to create RDS sockets. SOL_RDS is the socket-level to be used
+ with setsockopt(2) and getsockopt(2) for RDS specific socket
+ options.
+
+ fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
+ This creates a new, unbound RDS socket.
+
+ setsockopt(SOL_SOCKET): send and receive buffer size
+ RDS honors the send and receive buffer size socket options.
+ You are not allowed to queue more than SO_SNDSIZE bytes to
+ a socket. A message is queued when sendmsg is called, and
+ it leaves the queue when the remote system acknowledges
+ its arrival.
+
+ The SO_RCVSIZE option controls the maximum receive queue length.
+ This is a soft limit rather than a hard limit - RDS will
+ continue to accept and queue incoming messages, even if that
+ takes the queue length over the limit. However, it will also
+ mark the port as "congested" and send a congestion update to
+ the source node. The source node is supposed to throttle any
+ processes sending to this congested port.
+
+ bind(fd, &sockaddr_in, ...)
+ This binds the socket to a local IP address and port, and a
+ transport, if one has not already been selected via the
+ SO_RDS_TRANSPORT socket option
+
+ sendmsg(fd, ...)
+ Sends a message to the indicated recipient. The kernel will
+ transparently establish the underlying reliable connection
+ if it isn't up yet.
+
+ An attempt to send a message that exceeds SO_SNDSIZE will
+ return with -EMSGSIZE
+
+ An attempt to send a message that would take the total number
+ of queued bytes over the SO_SNDSIZE threshold will return
+ EAGAIN.
+
+ An attempt to send a message to a destination that is marked
+ as "congested" will return ENOBUFS.
+
+ recvmsg(fd, ...)
+ Receives a message that was queued to this socket. The sockets
+ recv queue accounting is adjusted, and if the queue length
+ drops below SO_SNDSIZE, the port is marked uncongested, and
+ a congestion update is sent to all peers.
+
+ Applications can ask the RDS kernel module to receive
+ notifications via control messages (for instance, there is a
+ notification when a congestion update arrived, or when a RDMA
+ operation completes). These notifications are received through
+ the msg.msg_control buffer of struct msghdr. The format of the
+ messages is described in manpages.
+
+ poll(fd)
+ RDS supports the poll interface to allow the application
+ to implement async I/O.
+
+ POLLIN handling is pretty straightforward. When there's an
+ incoming message queued to the socket, or a pending notification,
+ we signal POLLIN.
+
+ POLLOUT is a little harder. Since you can essentially send
+ to any destination, RDS will always signal POLLOUT as long as
+ there's room on the send queue (ie the number of bytes queued
+ is less than the sendbuf size).
+
+ However, the kernel will refuse to accept messages to
+ a destination marked congested - in this case you will loop
+ forever if you rely on poll to tell you what to do.
+ This isn't a trivial problem, but applications can deal with
+ this - by using congestion notifications, and by checking for
+ ENOBUFS errors returned by sendmsg.
+
+ setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
+ This allows the application to discard all messages queued to a
+ specific destination on this particular socket.
+
+ This allows the application to cancel outstanding messages if
+ it detects a timeout. For instance, if it tried to send a message,
+ and the remote host is unreachable, RDS will keep trying forever.
+ The application may decide it's not worth it, and cancel the
+ operation. In this case, it would use RDS_CANCEL_SENT_TO to
+ nuke any pending messages.
+
+ ``setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..), getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)``
+ Set or read an integer defining the underlying
+ encapsulating transport to be used for RDS packets on the
+ socket. When setting the option, integer argument may be
+ one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
+ value, RDS_TRANS_NONE will be returned on an unbound socket.
+ This socket option may only be set exactly once on the socket,
+ prior to binding it via the bind(2) system call. Attempts to
+ set SO_RDS_TRANSPORT on a socket for which the transport has
+ been previously attached explicitly (by SO_RDS_TRANSPORT) or
+ implicitly (via bind(2)) will return an error of EOPNOTSUPP.
+ An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
+ always return EINVAL.
+
+RDMA for RDS
+============
+
+ see rds-rdma(7) manpage (available in rds-tools)
+
+
+Congestion Notifications
+========================
+
+ see rds(7) manpage
+
+
+RDS Protocol
+============
+
+ Message header
+
+ The message header is a 'struct rds_header' (see rds.h):
+
+ Fields:
+
+ h_sequence:
+ per-packet sequence number
+ h_ack:
+ piggybacked acknowledgment of last packet received
+ h_len:
+ length of data, not including header
+ h_sport:
+ source port
+ h_dport:
+ destination port
+ h_flags:
+ Can be:
+
+ ============= ==================================
+ CONG_BITMAP this is a congestion update bitmap
+ ACK_REQUIRED receiver must ack this packet
+ RETRANSMITTED packet has previously been sent
+ ============= ==================================
+
+ h_credit:
+ indicate to other end of connection that
+ it has more credits available (i.e. there is
+ more send room)
+ h_padding[4]:
+ unused, for future use
+ h_csum:
+ header checksum
+ h_exthdr:
+ optional data can be passed here. This is currently used for
+ passing RDMA-related information.
+
+ ACK and retransmit handling
+
+ One might think that with reliable IB connections you wouldn't need
+ to ack messages that have been received. The problem is that IB
+ hardware generates an ack message before it has DMAed the message
+ into memory. This creates a potential message loss if the HCA is
+ disabled for any reason between when it sends the ack and before
+ the message is DMAed and processed. This is only a potential issue
+ if another HCA is available for fail-over.
+
+ Sending an ack immediately would allow the sender to free the sent
+ message from their send queue quickly, but could cause excessive
+ traffic to be used for acks. RDS piggybacks acks on sent data
+ packets. Ack-only packets are reduced by only allowing one to be
+ in flight at a time, and by the sender only asking for acks when
+ its send buffers start to fill up. All retransmissions are also
+ acked.
+
+ Flow Control
+
+ RDS's IB transport uses a credit-based mechanism to verify that
+ there is space in the peer's receive buffers for more data. This
+ eliminates the need for hardware retries on the connection.
+
+ Congestion
+
+ Messages waiting in the receive queue on the receiving socket
+ are accounted against the sockets SO_RCVBUF option value. Only
+ the payload bytes in the message are accounted for. If the
+ number of bytes queued equals or exceeds rcvbuf then the socket
+ is congested. All sends attempted to this socket's address
+ should return block or return -EWOULDBLOCK.
+
+ Applications are expected to be reasonably tuned such that this
+ situation very rarely occurs. An application encountering this
+ "back-pressure" is considered a bug.
+
+ This is implemented by having each node maintain bitmaps which
+ indicate which ports on bound addresses are congested. As the
+ bitmap changes it is sent through all the connections which
+ terminate in the local address of the bitmap which changed.
+
+ The bitmaps are allocated as connections are brought up. This
+ avoids allocation in the interrupt handling path which queues
+ sages on sockets. The dense bitmaps let transports send the
+ entire bitmap on any bitmap change reasonably efficiently. This
+ is much easier to implement than some finer-grained
+ communication of per-port congestion. The sender does a very
+ inexpensive bit test to test if the port it's about to send to
+ is congested or not.
+
+
+RDS Transport Layer
+===================
+
+ As mentioned above, RDS is not IB-specific. Its code is divided
+ into a general RDS layer and a transport layer.
+
+ The general layer handles the socket API, congestion handling,
+ loopback, stats, usermem pinning, and the connection state machine.
+
+ The transport layer handles the details of the transport. The IB
+ transport, for example, handles all the queue pairs, work requests,
+ CM event handlers, and other Infiniband details.
+
+
+RDS Kernel Structures
+=====================
+
+ struct rds_message
+ aka possibly "rds_outgoing", the generic RDS layer copies data to
+ be sent and sets header fields as needed, based on the socket API.
+ This is then queued for the individual connection and sent by the
+ connection's transport.
+
+ struct rds_incoming
+ a generic struct referring to incoming data that can be handed from
+ the transport to the general code and queued by the general code
+ while the socket is awoken. It is then passed back to the transport
+ code to handle the actual copy-to-user.
+
+ struct rds_socket
+ per-socket information
+
+ struct rds_connection
+ per-connection information
+
+ struct rds_transport
+ pointers to transport-specific functions
+
+ struct rds_statistics
+ non-transport-specific statistics
+
+ struct rds_cong_map
+ wraps the raw congestion bitmap, contains rbnode, waitq, etc.
+
+Connection management
+=====================
+
+ Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
+ ERROR states.
+
+ The first time an attempt is made by an RDS socket to send data to
+ a node, a connection is allocated and connected. That connection is
+ then maintained forever -- if there are transport errors, the
+ connection will be dropped and re-established.
+
+ Dropping a connection while packets are queued will cause queued or
+ partially-sent datagrams to be retransmitted when the connection is
+ re-established.
+
+
+The send path
+=============
+
+ rds_sendmsg()
+ - struct rds_message built from incoming data
+ - CMSGs parsed (e.g. RDMA ops)
+ - transport connection alloced and connected if not already
+ - rds_message placed on send queue
+ - send worker awoken
+
+ rds_send_worker()
+ - calls rds_send_xmit() until queue is empty
+
+ rds_send_xmit()
+ - transmits congestion map if one is pending
+ - may set ACK_REQUIRED
+ - calls transport to send either non-RDMA or RDMA message
+ (RDMA ops never retransmitted)
+
+ rds_ib_xmit()
+ - allocs work requests from send ring
+ - adds any new send credits available to peer (h_credits)
+ - maps the rds_message's sg list
+ - piggybacks ack
+ - populates work requests
+ - post send to connection's queue pair
+
+The recv path
+=============
+
+ rds_ib_recv_cq_comp_handler()
+ - looks at write completions
+ - unmaps recv buffer from device
+ - no errors, call rds_ib_process_recv()
+ - refill recv ring
+
+ rds_ib_process_recv()
+ - validate header checksum
+ - copy header to rds_ib_incoming struct if start of a new datagram
+ - add to ibinc's fraglist
+ - if competed datagram:
+ - update cong map if datagram was cong update
+ - call rds_recv_incoming() otherwise
+ - note if ack is required
+
+ rds_recv_incoming()
+ - drop duplicate packets
+ - respond to pings
+ - find the sock associated with this datagram
+ - add to sock queue
+ - wake up sock
+ - do some congestion calculations
+ rds_recvmsg
+ - copy data into user iovec
+ - handle CMSGs
+ - return to application
+
+Multipath RDS (mprds)
+=====================
+ Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
+ (though the concept can be extended to other transports). The classical
+ implementation of RDS-over-TCP is implemented by demultiplexing multiple
+ PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
+ port]) over a single TCP socket between the 2 IP addresses involved. This
+ has the limitation that it ends up funneling multiple RDS flows over a
+ single TCP flow, thus it is
+ (a) upper-bounded to the single-flow bandwidth,
+ (b) suffers from head-of-line blocking for all the RDS sockets.
+
+ Better throughput (for a fixed small packet size, MTU) can be achieved
+ by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
+ RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp
+ connection. RDS sockets will be attached to a path based on some hash
+ (e.g., of local address and RDS port number) and packets for that RDS
+ socket will be sent over the attached path using TCP to segment/reassemble
+ RDS datagrams on that path.
+
+ Multipathed RDS is implemented by splitting the struct rds_connection into
+ a common (to all paths) part, and a per-path struct rds_conn_path. All
+ I/O workqs and reconnect threads are driven from the rds_conn_path.
+ Transports such as TCP that are multipath capable may then set up a
+ TCP socket per rds_conn_path, and this is managed by the transport via
+ the transport privatee cp_transport_data pointer.
+
+ Transports announce themselves as multipath capable by setting the
+ t_mp_capable bit during registration with the rds core module. When the
+ transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
+ across multiple paths. The outgoing hash is computed based on the
+ local address and port that the PF_RDS socket is bound to.
+
+ Additionally, even if the transport is MP capable, we may be
+ peering with some node that does not support mprds, or supports
+ a different number of paths. As a result, the peering nodes need
+ to agree on the number of paths to be used for the connection.
+ This is done by sending out a control packet exchange before the
+ first data packet. The control packet exchange must have completed
+ prior to outgoing hash completion in rds_sendmsg() when the transport
+ is mutlipath capable.
+
+ The control packet is an RDS ping packet (i.e., packet to rds dest
+ port 0) with the ping packet having a rds extension header option of
+ type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the
+ number of paths supported by the sender. The "probe" ping packet will
+ get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>)
+ The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
+ be able to compute the min(sender_paths, rcvr_paths). The pong
+ sent in response to a probe-ping should contain the rcvr's npaths
+ when the rcvr is mprds-capable.
+
+ If the rcvr is not mprds-capable, the exthdr in the ping will be
+ ignored. In this case the pong will not have any exthdrs, so the sender
+ of the probe-ping can default to single-path mprds.
+
diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
deleted file mode 100644
index eec6169..0000000
--- a/Documentation/networking/rds.txt
+++ /dev/null
@@ -1,423 +0,0 @@
-
-Overview
-========
-
-This readme tries to provide some background on the hows and whys of RDS,
-and will hopefully help you find your way around the code.
-
-In addition, please see this email about RDS origins:
-http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
-
-RDS Architecture
-================
-
-RDS provides reliable, ordered datagram delivery by using a single
-reliable connection between any two nodes in the cluster. This allows
-applications to use a single socket to talk to any other process in the
-cluster - so in a cluster with N processes you need N sockets, in contrast
-to N*N if you use a connection-oriented socket transport like TCP.
-
-RDS is not Infiniband-specific; it was designed to support different
-transports. The current implementation used to support RDS over TCP as well
-as IB.
-
-The high-level semantics of RDS from the application's point of view are
-
- * Addressing
- RDS uses IPv4 addresses and 16bit port numbers to identify
- the end point of a connection. All socket operations that involve
- passing addresses between kernel and user space generally
- use a struct sockaddr_in.
-
- The fact that IPv4 addresses are used does not mean the underlying
- transport has to be IP-based. In fact, RDS over IB uses a
- reliable IB connection; the IP address is used exclusively to
- locate the remote node's GID (by ARPing for the given IP).
-
- The port space is entirely independent of UDP, TCP or any other
- protocol.
-
- * Socket interface
- RDS sockets work *mostly* as you would expect from a BSD
- socket. The next section will cover the details. At any rate,
- all I/O is performed through the standard BSD socket API.
- Some additions like zerocopy support are implemented through
- control messages, while other extensions use the getsockopt/
- setsockopt calls.
-
- Sockets must be bound before you can send or receive data.
- This is needed because binding also selects a transport and
- attaches it to the socket. Once bound, the transport assignment
- does not change. RDS will tolerate IPs moving around (eg in
- a active-active HA scenario), but only as long as the address
- doesn't move to a different transport.
-
- * sysctls
- RDS supports a number of sysctls in /proc/sys/net/rds
-
-
-Socket Interface
-================
-
- AF_RDS, PF_RDS, SOL_RDS
- AF_RDS and PF_RDS are the domain type to be used with socket(2)
- to create RDS sockets. SOL_RDS is the socket-level to be used
- with setsockopt(2) and getsockopt(2) for RDS specific socket
- options.
-
- fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
- This creates a new, unbound RDS socket.
-
- setsockopt(SOL_SOCKET): send and receive buffer size
- RDS honors the send and receive buffer size socket options.
- You are not allowed to queue more than SO_SNDSIZE bytes to
- a socket. A message is queued when sendmsg is called, and
- it leaves the queue when the remote system acknowledges
- its arrival.
-
- The SO_RCVSIZE option controls the maximum receive queue length.
- This is a soft limit rather than a hard limit - RDS will
- continue to accept and queue incoming messages, even if that
- takes the queue length over the limit. However, it will also
- mark the port as "congested" and send a congestion update to
- the source node. The source node is supposed to throttle any
- processes sending to this congested port.
-
- bind(fd, &sockaddr_in, ...)
- This binds the socket to a local IP address and port, and a
- transport, if one has not already been selected via the
- SO_RDS_TRANSPORT socket option
-
- sendmsg(fd, ...)
- Sends a message to the indicated recipient. The kernel will
- transparently establish the underlying reliable connection
- if it isn't up yet.
-
- An attempt to send a message that exceeds SO_SNDSIZE will
- return with -EMSGSIZE
-
- An attempt to send a message that would take the total number
- of queued bytes over the SO_SNDSIZE threshold will return
- EAGAIN.
-
- An attempt to send a message to a destination that is marked
- as "congested" will return ENOBUFS.
-
- recvmsg(fd, ...)
- Receives a message that was queued to this socket. The sockets
- recv queue accounting is adjusted, and if the queue length
- drops below SO_SNDSIZE, the port is marked uncongested, and
- a congestion update is sent to all peers.
-
- Applications can ask the RDS kernel module to receive
- notifications via control messages (for instance, there is a
- notification when a congestion update arrived, or when a RDMA
- operation completes). These notifications are received through
- the msg.msg_control buffer of struct msghdr. The format of the
- messages is described in manpages.
-
- poll(fd)
- RDS supports the poll interface to allow the application
- to implement async I/O.
-
- POLLIN handling is pretty straightforward. When there's an
- incoming message queued to the socket, or a pending notification,
- we signal POLLIN.
-
- POLLOUT is a little harder. Since you can essentially send
- to any destination, RDS will always signal POLLOUT as long as
- there's room on the send queue (ie the number of bytes queued
- is less than the sendbuf size).
-
- However, the kernel will refuse to accept messages to
- a destination marked congested - in this case you will loop
- forever if you rely on poll to tell you what to do.
- This isn't a trivial problem, but applications can deal with
- this - by using congestion notifications, and by checking for
- ENOBUFS errors returned by sendmsg.
-
- setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
- This allows the application to discard all messages queued to a
- specific destination on this particular socket.
-
- This allows the application to cancel outstanding messages if
- it detects a timeout. For instance, if it tried to send a message,
- and the remote host is unreachable, RDS will keep trying forever.
- The application may decide it's not worth it, and cancel the
- operation. In this case, it would use RDS_CANCEL_SENT_TO to
- nuke any pending messages.
-
- setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
- getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..)
- Set or read an integer defining the underlying
- encapsulating transport to be used for RDS packets on the
- socket. When setting the option, integer argument may be
- one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the
- value, RDS_TRANS_NONE will be returned on an unbound socket.
- This socket option may only be set exactly once on the socket,
- prior to binding it via the bind(2) system call. Attempts to
- set SO_RDS_TRANSPORT on a socket for which the transport has
- been previously attached explicitly (by SO_RDS_TRANSPORT) or
- implicitly (via bind(2)) will return an error of EOPNOTSUPP.
- An attempt to set SO_RDS_TRANSPORT to RDS_TRANS_NONE will
- always return EINVAL.
-
-RDMA for RDS
-============
-
- see rds-rdma(7) manpage (available in rds-tools)
-
-
-Congestion Notifications
-========================
-
- see rds(7) manpage
-
-
-RDS Protocol
-============
-
- Message header
-
- The message header is a 'struct rds_header' (see rds.h):
- Fields:
- h_sequence:
- per-packet sequence number
- h_ack:
- piggybacked acknowledgment of last packet received
- h_len:
- length of data, not including header
- h_sport:
- source port
- h_dport:
- destination port
- h_flags:
- CONG_BITMAP - this is a congestion update bitmap
- ACK_REQUIRED - receiver must ack this packet
- RETRANSMITTED - packet has previously been sent
- h_credit:
- indicate to other end of connection that
- it has more credits available (i.e. there is
- more send room)
- h_padding[4]:
- unused, for future use
- h_csum:
- header checksum
- h_exthdr:
- optional data can be passed here. This is currently used for
- passing RDMA-related information.
-
- ACK and retransmit handling
-
- One might think that with reliable IB connections you wouldn't need
- to ack messages that have been received. The problem is that IB
- hardware generates an ack message before it has DMAed the message
- into memory. This creates a potential message loss if the HCA is
- disabled for any reason between when it sends the ack and before
- the message is DMAed and processed. This is only a potential issue
- if another HCA is available for fail-over.
-
- Sending an ack immediately would allow the sender to free the sent
- message from their send queue quickly, but could cause excessive
- traffic to be used for acks. RDS piggybacks acks on sent data
- packets. Ack-only packets are reduced by only allowing one to be
- in flight at a time, and by the sender only asking for acks when
- its send buffers start to fill up. All retransmissions are also
- acked.
-
- Flow Control
-
- RDS's IB transport uses a credit-based mechanism to verify that
- there is space in the peer's receive buffers for more data. This
- eliminates the need for hardware retries on the connection.
-
- Congestion
-
- Messages waiting in the receive queue on the receiving socket
- are accounted against the sockets SO_RCVBUF option value. Only
- the payload bytes in the message are accounted for. If the
- number of bytes queued equals or exceeds rcvbuf then the socket
- is congested. All sends attempted to this socket's address
- should return block or return -EWOULDBLOCK.
-
- Applications are expected to be reasonably tuned such that this
- situation very rarely occurs. An application encountering this
- "back-pressure" is considered a bug.
-
- This is implemented by having each node maintain bitmaps which
- indicate which ports on bound addresses are congested. As the
- bitmap changes it is sent through all the connections which
- terminate in the local address of the bitmap which changed.
-
- The bitmaps are allocated as connections are brought up. This
- avoids allocation in the interrupt handling path which queues
- sages on sockets. The dense bitmaps let transports send the
- entire bitmap on any bitmap change reasonably efficiently. This
- is much easier to implement than some finer-grained
- communication of per-port congestion. The sender does a very
- inexpensive bit test to test if the port it's about to send to
- is congested or not.
-
-
-RDS Transport Layer
-==================
-
- As mentioned above, RDS is not IB-specific. Its code is divided
- into a general RDS layer and a transport layer.
-
- The general layer handles the socket API, congestion handling,
- loopback, stats, usermem pinning, and the connection state machine.
-
- The transport layer handles the details of the transport. The IB
- transport, for example, handles all the queue pairs, work requests,
- CM event handlers, and other Infiniband details.
-
-
-RDS Kernel Structures
-=====================
-
- struct rds_message
- aka possibly "rds_outgoing", the generic RDS layer copies data to
- be sent and sets header fields as needed, based on the socket API.
- This is then queued for the individual connection and sent by the
- connection's transport.
- struct rds_incoming
- a generic struct referring to incoming data that can be handed from
- the transport to the general code and queued by the general code
- while the socket is awoken. It is then passed back to the transport
- code to handle the actual copy-to-user.
- struct rds_socket
- per-socket information
- struct rds_connection
- per-connection information
- struct rds_transport
- pointers to transport-specific functions
- struct rds_statistics
- non-transport-specific statistics
- struct rds_cong_map
- wraps the raw congestion bitmap, contains rbnode, waitq, etc.
-
-Connection management
-=====================
-
- Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
- ERROR states.
-
- The first time an attempt is made by an RDS socket to send data to
- a node, a connection is allocated and connected. That connection is
- then maintained forever -- if there are transport errors, the
- connection will be dropped and re-established.
-
- Dropping a connection while packets are queued will cause queued or
- partially-sent datagrams to be retransmitted when the connection is
- re-established.
-
-
-The send path
-=============
-
- rds_sendmsg()
- struct rds_message built from incoming data
- CMSGs parsed (e.g. RDMA ops)
- transport connection alloced and connected if not already
- rds_message placed on send queue
- send worker awoken
- rds_send_worker()
- calls rds_send_xmit() until queue is empty
- rds_send_xmit()
- transmits congestion map if one is pending
- may set ACK_REQUIRED
- calls transport to send either non-RDMA or RDMA message
- (RDMA ops never retransmitted)
- rds_ib_xmit()
- allocs work requests from send ring
- adds any new send credits available to peer (h_credits)
- maps the rds_message's sg list
- piggybacks ack
- populates work requests
- post send to connection's queue pair
-
-The recv path
-=============
-
- rds_ib_recv_cq_comp_handler()
- looks at write completions
- unmaps recv buffer from device
- no errors, call rds_ib_process_recv()
- refill recv ring
- rds_ib_process_recv()
- validate header checksum
- copy header to rds_ib_incoming struct if start of a new datagram
- add to ibinc's fraglist
- if competed datagram:
- update cong map if datagram was cong update
- call rds_recv_incoming() otherwise
- note if ack is required
- rds_recv_incoming()
- drop duplicate packets
- respond to pings
- find the sock associated with this datagram
- add to sock queue
- wake up sock
- do some congestion calculations
- rds_recvmsg
- copy data into user iovec
- handle CMSGs
- return to application
-
-Multipath RDS (mprds)
-=====================
- Mprds is multipathed-RDS, primarily intended for RDS-over-TCP
- (though the concept can be extended to other transports). The classical
- implementation of RDS-over-TCP is implemented by demultiplexing multiple
- PF_RDS sockets between any 2 endpoints (where endpoint == [IP address,
- port]) over a single TCP socket between the 2 IP addresses involved. This
- has the limitation that it ends up funneling multiple RDS flows over a
- single TCP flow, thus it is
- (a) upper-bounded to the single-flow bandwidth,
- (b) suffers from head-of-line blocking for all the RDS sockets.
-
- Better throughput (for a fixed small packet size, MTU) can be achieved
- by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed
- RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp
- connection. RDS sockets will be attached to a path based on some hash
- (e.g., of local address and RDS port number) and packets for that RDS
- socket will be sent over the attached path using TCP to segment/reassemble
- RDS datagrams on that path.
-
- Multipathed RDS is implemented by splitting the struct rds_connection into
- a common (to all paths) part, and a per-path struct rds_conn_path. All
- I/O workqs and reconnect threads are driven from the rds_conn_path.
- Transports such as TCP that are multipath capable may then set up a
- TCP socket per rds_conn_path, and this is managed by the transport via
- the transport privatee cp_transport_data pointer.
-
- Transports announce themselves as multipath capable by setting the
- t_mp_capable bit during registration with the rds core module. When the
- transport is multipath-capable, rds_sendmsg() hashes outgoing traffic
- across multiple paths. The outgoing hash is computed based on the
- local address and port that the PF_RDS socket is bound to.
-
- Additionally, even if the transport is MP capable, we may be
- peering with some node that does not support mprds, or supports
- a different number of paths. As a result, the peering nodes need
- to agree on the number of paths to be used for the connection.
- This is done by sending out a control packet exchange before the
- first data packet. The control packet exchange must have completed
- prior to outgoing hash completion in rds_sendmsg() when the transport
- is mutlipath capable.
-
- The control packet is an RDS ping packet (i.e., packet to rds dest
- port 0) with the ping packet having a rds extension header option of
- type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the
- number of paths supported by the sender. The "probe" ping packet will
- get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>)
- The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately
- be able to compute the min(sender_paths, rcvr_paths). The pong
- sent in response to a probe-ping should contain the rcvr's npaths
- when the rcvr is mprds-capable.
-
- If the rcvr is not mprds-capable, the exthdr in the ping will be
- ignored. In this case the pong will not have any exthdrs, so the sender
- of the probe-ping can default to single-path mprds.
-
diff --git a/Documentation/networking/regulatory.txt b/Documentation/networking/regulatory.rst
similarity index 93%
rename from Documentation/networking/regulatory.txt
rename to Documentation/networking/regulatory.rst
index 381e5b2..8701b91 100644
--- a/Documentation/networking/regulatory.txt
+++ b/Documentation/networking/regulatory.rst
@@ -1,5 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================================
Linux wireless regulatory documentation
----------------------------------------
+=======================================
This document gives a brief review over how the Linux wireless
regulatory infrastructure works.
@@ -57,7 +60,7 @@
http://wireless.kernel.org/en/users/Documentation/iw
-An example:
+An example::
# set regulatory domain to "Costa Rica"
iw reg set CR
@@ -104,9 +107,9 @@
This example comes from the zd1211rw device driver. You can start
by having a mapping of your device's EEPROM country/regulatory
-domain value to a specific alpha2 as follows:
+domain value to a specific alpha2 as follows::
-static struct zd_reg_alpha2_map reg_alpha2_map[] = {
+ static struct zd_reg_alpha2_map reg_alpha2_map[] = {
{ ZD_REGDOMAIN_FCC, "US" },
{ ZD_REGDOMAIN_IC, "CA" },
{ ZD_REGDOMAIN_ETSI, "DE" }, /* Generic ETSI, use most restrictive */
@@ -116,10 +119,10 @@
{ ZD_REGDOMAIN_FRANCE, "FR" },
Then you can define a routine to map your read EEPROM value to an alpha2,
-as follows:
+as follows::
-static int zd_reg2alpha2(u8 regdomain, char *alpha2)
-{
+ static int zd_reg2alpha2(u8 regdomain, char *alpha2)
+ {
unsigned int i;
struct zd_reg_alpha2_map *reg_map;
for (i = 0; i < ARRAY_SIZE(reg_alpha2_map); i++) {
@@ -131,12 +134,14 @@
}
}
return 1;
-}
+ }
Lastly, you can then hint to the core of your discovered alpha2, if a match
was found. You need to do this after you have registered your wiphy. You
are expected to do this during initialization.
+::
+
r = zd_reg2alpha2(mac->regdomain, alpha2);
if (!r)
regulatory_hint(hw->wiphy, alpha2);
@@ -156,9 +161,9 @@
Bellow is a simple example, with a regulatory domain cached using the stack.
Your implementation may vary (read EEPROM cache instead, for example).
-Example cache of some regulatory domain
+Example cache of some regulatory domain::
-struct ieee80211_regdomain mydriver_jp_regdom = {
+ struct ieee80211_regdomain mydriver_jp_regdom = {
.n_reg_rules = 3,
.alpha2 = "JP",
//.alpha2 = "99", /* If I have no alpha2 to map it to */
@@ -173,9 +178,9 @@
NL80211_RRF_NO_IR|
NL80211_RRF_DFS),
}
-};
+ };
-Then in some part of your code after your wiphy has been registered:
+Then in some part of your code after your wiphy has been registered::
struct ieee80211_regdomain *rd;
int size_of_regd;
diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.rst
similarity index 85%
rename from Documentation/networking/rxrpc.txt
rename to Documentation/networking/rxrpc.rst
index 180e07d..5ad3511 100644
--- a/Documentation/networking/rxrpc.txt
+++ b/Documentation/networking/rxrpc.rst
@@ -1,6 +1,8 @@
- ======================
- RxRPC NETWORK PROTOCOL
- ======================
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+RxRPC Network Protocol
+======================
The RxRPC protocol driver provides a reliable two-phase transport on top of UDP
that can be used to perform RxRPC remote operations. This is done over sockets
@@ -9,36 +11,35 @@
Contents of this document:
- (*) Overview.
+ (#) Overview.
- (*) RxRPC protocol summary.
+ (#) RxRPC protocol summary.
- (*) AF_RXRPC driver model.
+ (#) AF_RXRPC driver model.
- (*) Control messages.
+ (#) Control messages.
- (*) Socket options.
+ (#) Socket options.
- (*) Security.
+ (#) Security.
- (*) Example client usage.
+ (#) Example client usage.
- (*) Example server usage.
+ (#) Example server usage.
- (*) AF_RXRPC kernel interface.
+ (#) AF_RXRPC kernel interface.
- (*) Configurable parameters.
+ (#) Configurable parameters.
-========
-OVERVIEW
+Overview
========
RxRPC is a two-layer protocol. There is a session layer which provides
reliable virtual connections using UDP over IPv4 (or IPv6) as the transport
layer, but implements a real network protocol; and there's the presentation
layer which renders structured data to binary blobs and back again using XDR
-(as does SunRPC):
+(as does SunRPC)::
+-------------+
| Application |
@@ -85,31 +86,30 @@
that has both kernel (filesystem) and userspace (utility) components.
-======================
-RXRPC PROTOCOL SUMMARY
+RxRPC Protocol Summary
======================
An overview of the RxRPC protocol:
- (*) RxRPC sits on top of another networking protocol (UDP is the only option
+ (#) RxRPC sits on top of another networking protocol (UDP is the only option
currently), and uses this to provide network transport. UDP ports, for
example, provide transport endpoints.
- (*) RxRPC supports multiple virtual "connections" from any given transport
+ (#) RxRPC supports multiple virtual "connections" from any given transport
endpoint, thus allowing the endpoints to be shared, even to the same
remote endpoint.
- (*) Each connection goes to a particular "service". A connection may not go
+ (#) Each connection goes to a particular "service". A connection may not go
to multiple services. A service may be considered the RxRPC equivalent of
a port number. AF_RXRPC permits multiple services to share an endpoint.
- (*) Client-originating packets are marked, thus a transport endpoint can be
+ (#) Client-originating packets are marked, thus a transport endpoint can be
shared between client and server connections (connections have a
direction).
- (*) Up to a billion connections may be supported concurrently between one
+ (#) Up to a billion connections may be supported concurrently between one
local transport endpoint and one service on one remote endpoint. An RxRPC
- connection is described by seven numbers:
+ connection is described by seven numbers::
Local address }
Local port } Transport (UDP) address
@@ -119,22 +119,22 @@
Connection ID
Service ID
- (*) Each RxRPC operation is a "call". A connection may make up to four
+ (#) Each RxRPC operation is a "call". A connection may make up to four
billion calls, but only up to four calls may be in progress on a
connection at any one time.
- (*) Calls are two-phase and asymmetric: the client sends its request data,
+ (#) Calls are two-phase and asymmetric: the client sends its request data,
which the service receives; then the service transmits the reply data
which the client receives.
- (*) The data blobs are of indefinite size, the end of a phase is marked with a
+ (#) The data blobs are of indefinite size, the end of a phase is marked with a
flag in the packet. The number of packets of data making up one blob may
not exceed 4 billion, however, as this would cause the sequence number to
wrap.
- (*) The first four bytes of the request data are the service operation ID.
+ (#) The first four bytes of the request data are the service operation ID.
- (*) Security is negotiated on a per-connection basis. The connection is
+ (#) Security is negotiated on a per-connection basis. The connection is
initiated by the first data packet on it arriving. If security is
requested, the server then issues a "challenge" and then the client
replies with a "response". If the response is successful, the security is
@@ -143,146 +143,145 @@
connection lapse before the client, the security will be renegotiated if
the client uses the connection again.
- (*) Calls use ACK packets to handle reliability. Data packets are also
+ (#) Calls use ACK packets to handle reliability. Data packets are also
explicitly sequenced per call.
- (*) There are two types of positive acknowledgment: hard-ACKs and soft-ACKs.
+ (#) There are two types of positive acknowledgment: hard-ACKs and soft-ACKs.
A hard-ACK indicates to the far side that all the data received to a point
has been received and processed; a soft-ACK indicates that the data has
been received but may yet be discarded and re-requested. The sender may
not discard any transmittable packets until they've been hard-ACK'd.
- (*) Reception of a reply data packet implicitly hard-ACK's all the data
+ (#) Reception of a reply data packet implicitly hard-ACK's all the data
packets that make up the request.
- (*) An call is complete when the request has been sent, the reply has been
+ (#) An call is complete when the request has been sent, the reply has been
received and the final hard-ACK on the last packet of the reply has
reached the server.
- (*) An call may be aborted by either end at any time up to its completion.
+ (#) An call may be aborted by either end at any time up to its completion.
-=====================
-AF_RXRPC DRIVER MODEL
+AF_RXRPC Driver Model
=====================
About the AF_RXRPC driver:
- (*) The AF_RXRPC protocol transparently uses internal sockets of the transport
+ (#) The AF_RXRPC protocol transparently uses internal sockets of the transport
protocol to represent transport endpoints.
- (*) AF_RXRPC sockets map onto RxRPC connection bundles. Actual RxRPC
+ (#) AF_RXRPC sockets map onto RxRPC connection bundles. Actual RxRPC
connections are handled transparently. One client socket may be used to
make multiple simultaneous calls to the same service. One server socket
may handle calls from many clients.
- (*) Additional parallel client connections will be initiated to support extra
+ (#) Additional parallel client connections will be initiated to support extra
concurrent calls, up to a tunable limit.
- (*) Each connection is retained for a certain amount of time [tunable] after
+ (#) Each connection is retained for a certain amount of time [tunable] after
the last call currently using it has completed in case a new call is made
that could reuse it.
- (*) Each internal UDP socket is retained [tunable] for a certain amount of
+ (#) Each internal UDP socket is retained [tunable] for a certain amount of
time [tunable] after the last connection using it discarded, in case a new
connection is made that could use it.
- (*) A client-side connection is only shared between calls if they have have
+ (#) A client-side connection is only shared between calls if they have have
the same key struct describing their security (and assuming the calls
would otherwise share the connection). Non-secured calls would also be
able to share connections with each other.
- (*) A server-side connection is shared if the client says it is.
+ (#) A server-side connection is shared if the client says it is.
- (*) ACK'ing is handled by the protocol driver automatically, including ping
+ (#) ACK'ing is handled by the protocol driver automatically, including ping
replying.
- (*) SO_KEEPALIVE automatically pings the other side to keep the connection
+ (#) SO_KEEPALIVE automatically pings the other side to keep the connection
alive [TODO].
- (*) If an ICMP error is received, all calls affected by that error will be
+ (#) If an ICMP error is received, all calls affected by that error will be
aborted with an appropriate network error passed through recvmsg().
Interaction with the user of the RxRPC socket:
- (*) A socket is made into a server socket by binding an address with a
+ (#) A socket is made into a server socket by binding an address with a
non-zero service ID.
- (*) In the client, sending a request is achieved with one or more sendmsgs,
+ (#) In the client, sending a request is achieved with one or more sendmsgs,
followed by the reply being received with one or more recvmsgs.
- (*) The first sendmsg for a request to be sent from a client contains a tag to
+ (#) The first sendmsg for a request to be sent from a client contains a tag to
be used in all other sendmsgs or recvmsgs associated with that call. The
tag is carried in the control data.
- (*) connect() is used to supply a default destination address for a client
+ (#) connect() is used to supply a default destination address for a client
socket. This may be overridden by supplying an alternate address to the
first sendmsg() of a call (struct msghdr::msg_name).
- (*) If connect() is called on an unbound client, a random local port will
+ (#) If connect() is called on an unbound client, a random local port will
bound before the operation takes place.
- (*) A server socket may also be used to make client calls. To do this, the
+ (#) A server socket may also be used to make client calls. To do this, the
first sendmsg() of the call must specify the target address. The server's
transport endpoint is used to send the packets.
- (*) Once the application has received the last message associated with a call,
+ (#) Once the application has received the last message associated with a call,
the tag is guaranteed not to be seen again, and so it can be used to pin
client resources. A new call can then be initiated with the same tag
without fear of interference.
- (*) In the server, a request is received with one or more recvmsgs, then the
+ (#) In the server, a request is received with one or more recvmsgs, then the
the reply is transmitted with one or more sendmsgs, and then the final ACK
is received with a last recvmsg.
- (*) When sending data for a call, sendmsg is given MSG_MORE if there's more
+ (#) When sending data for a call, sendmsg is given MSG_MORE if there's more
data to come on that call.
- (*) When receiving data for a call, recvmsg flags MSG_MORE if there's more
+ (#) When receiving data for a call, recvmsg flags MSG_MORE if there's more
data to come for that call.
- (*) When receiving data or messages for a call, MSG_EOR is flagged by recvmsg
+ (#) When receiving data or messages for a call, MSG_EOR is flagged by recvmsg
to indicate the terminal message for that call.
- (*) A call may be aborted by adding an abort control message to the control
+ (#) A call may be aborted by adding an abort control message to the control
data. Issuing an abort terminates the kernel's use of that call's tag.
Any messages waiting in the receive queue for that call will be discarded.
- (*) Aborts, busy notifications and challenge packets are delivered by recvmsg,
+ (#) Aborts, busy notifications and challenge packets are delivered by recvmsg,
and control data messages will be set to indicate the context. Receiving
an abort or a busy message terminates the kernel's use of that call's tag.
- (*) The control data part of the msghdr struct is used for a number of things:
+ (#) The control data part of the msghdr struct is used for a number of things:
- (*) The tag of the intended or affected call.
+ (#) The tag of the intended or affected call.
- (*) Sending or receiving errors, aborts and busy notifications.
+ (#) Sending or receiving errors, aborts and busy notifications.
- (*) Notifications of incoming calls.
+ (#) Notifications of incoming calls.
- (*) Sending debug requests and receiving debug replies [TODO].
+ (#) Sending debug requests and receiving debug replies [TODO].
- (*) When the kernel has received and set up an incoming call, it sends a
+ (#) When the kernel has received and set up an incoming call, it sends a
message to server application to let it know there's a new call awaiting
its acceptance [recvmsg reports a special control message]. The server
application then uses sendmsg to assign a tag to the new call. Once that
is done, the first part of the request data will be delivered by recvmsg.
- (*) The server application has to provide the server socket with a keyring of
+ (#) The server application has to provide the server socket with a keyring of
secret keys corresponding to the security types it permits. When a secure
connection is being set up, the kernel looks up the appropriate secret key
in the keyring and then sends a challenge packet to the client and
receives a response packet. The kernel then checks the authorisation of
the packet and either aborts the connection or sets up the security.
- (*) The name of the key a client will use to secure its communications is
+ (#) The name of the key a client will use to secure its communications is
nominated by a socket option.
Notes on sendmsg:
- (*) MSG_WAITALL can be set to tell sendmsg to ignore signals if the peer is
+ (#) MSG_WAITALL can be set to tell sendmsg to ignore signals if the peer is
making progress at accepting packets within a reasonable time such that we
manage to queue up all the data for transmission. This requires the
client to accept at least one packet per 2*RTT time period.
@@ -294,7 +293,7 @@
Notes on recvmsg:
- (*) If there's a sequence of data messages belonging to a particular call on
+ (#) If there's a sequence of data messages belonging to a particular call on
the receive queue, then recvmsg will keep working through them until:
(a) it meets the end of that call's received data,
@@ -320,13 +319,13 @@
flagged.
-================
-CONTROL MESSAGES
+Control Messages
================
AF_RXRPC makes use of control messages in sendmsg() and recvmsg() to multiplex
calls, to invoke certain actions and to report certain conditions. These are:
+ ======================= === =========== ===============================
MESSAGE ID SRT DATA MEANING
======================= === =========== ===============================
RXRPC_USER_CALL_ID sr- User ID App's call specifier
@@ -340,10 +339,11 @@
RXRPC_EXCLUSIVE_CALL s-- n/a Make an exclusive client call
RXRPC_UPGRADE_SERVICE s-- n/a Client call can be upgraded
RXRPC_TX_LENGTH s-- data len Total length of Tx data
+ ======================= === =========== ===============================
(SRT = usable in Sendmsg / delivered by Recvmsg / Terminal message)
- (*) RXRPC_USER_CALL_ID
+ (#) RXRPC_USER_CALL_ID
This is used to indicate the application's call ID. It's an unsigned long
that the app specifies in the client by attaching it to the first data
@@ -351,7 +351,7 @@
message. recvmsg() passes it in conjunction with all messages except
those of the RXRPC_NEW_CALL message.
- (*) RXRPC_ABORT
+ (#) RXRPC_ABORT
This is can be used by an application to abort a call by passing it to
sendmsg, or it can be delivered by recvmsg to indicate a remote abort was
@@ -359,13 +359,13 @@
specify the call affected. If an abort is being sent, then error EBADSLT
will be returned if there is no call with that user ID.
- (*) RXRPC_ACK
+ (#) RXRPC_ACK
This is delivered to a server application to indicate that the final ACK
of a call was received from the client. It will be associated with an
RXRPC_USER_CALL_ID to indicate the call that's now complete.
- (*) RXRPC_NET_ERROR
+ (#) RXRPC_NET_ERROR
This is delivered to an application to indicate that an ICMP error message
was encountered in the process of trying to talk to the peer. An
@@ -373,13 +373,13 @@
indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call
affected.
- (*) RXRPC_BUSY
+ (#) RXRPC_BUSY
This is delivered to a client application to indicate that a call was
rejected by the server due to the server being busy. It will be
associated with an RXRPC_USER_CALL_ID to indicate the rejected call.
- (*) RXRPC_LOCAL_ERROR
+ (#) RXRPC_LOCAL_ERROR
This is delivered to an application to indicate that a local error was
encountered and that a call has been aborted because of it. An
@@ -387,13 +387,13 @@
indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call
affected.
- (*) RXRPC_NEW_CALL
+ (#) RXRPC_NEW_CALL
This is delivered to indicate to a server application that a new call has
arrived and is awaiting acceptance. No user ID is associated with this,
as a user ID must subsequently be assigned by doing an RXRPC_ACCEPT.
- (*) RXRPC_ACCEPT
+ (#) RXRPC_ACCEPT
This is used by a server application to attempt to accept a call and
assign it a user ID. It should be associated with an RXRPC_USER_CALL_ID
@@ -402,12 +402,12 @@
return error ENODATA. If the user ID is already in use by another call,
then error EBADSLT will be returned.
- (*) RXRPC_EXCLUSIVE_CALL
+ (#) RXRPC_EXCLUSIVE_CALL
This is used to indicate that a client call should be made on a one-off
connection. The connection is discarded once the call has terminated.
- (*) RXRPC_UPGRADE_SERVICE
+ (#) RXRPC_UPGRADE_SERVICE
This is used to make a client call to probe if the specified service ID
may be upgraded by the server. The caller must check msg_name returned to
@@ -419,7 +419,7 @@
future communication to that server and RXRPC_UPGRADE_SERVICE should no
longer be set.
- (*) RXRPC_TX_LENGTH
+ (#) RXRPC_TX_LENGTH
This is used to inform the kernel of the total amount of data that is
going to be transmitted by a call (whether in a client request or a
@@ -443,7 +443,7 @@
AF_RXRPC sockets support a few socket options at the SOL_RXRPC level:
- (*) RXRPC_SECURITY_KEY
+ (#) RXRPC_SECURITY_KEY
This is used to specify the description of the key to be used. The key is
extracted from the calling process's keyrings with request_key() and
@@ -452,17 +452,17 @@
The optval pointer points to the description string, and optlen indicates
how long the string is, without the NUL terminator.
- (*) RXRPC_SECURITY_KEYRING
+ (#) RXRPC_SECURITY_KEYRING
Similar to above but specifies a keyring of server secret keys to use (key
type "keyring"). See the "Security" section.
- (*) RXRPC_EXCLUSIVE_CONNECTION
+ (#) RXRPC_EXCLUSIVE_CONNECTION
This is used to request that new connections should be used for each call
made subsequently on this socket. optval should be NULL and optlen 0.
- (*) RXRPC_MIN_SECURITY_LEVEL
+ (#) RXRPC_MIN_SECURITY_LEVEL
This is used to specify the minimum security level required for calls on
this socket. optval must point to an int containing one of the following
@@ -482,14 +482,14 @@
Encrypted checksum plus entire packet padded and encrypted, including
actual packet length.
- (*) RXRPC_UPGRADEABLE_SERVICE
+ (#) RXRPC_UPGRADEABLE_SERVICE
This is used to indicate that a service socket with two bindings may
upgrade one bound service to the other if requested by the client. optval
must point to an array of two unsigned short ints. The first is the
service ID to upgrade from and the second the service ID to upgrade to.
- (*) RXRPC_SUPPORTED_CMSG
+ (#) RXRPC_SUPPORTED_CMSG
This is a read-only option that writes an int into the buffer indicating
the highest control message type supported.
@@ -509,7 +509,7 @@
http://people.redhat.com/~dhowells/rxrpc/klog.c
The payload provided to add_key() on the client should be of the following
-form:
+form::
struct rxrpc_key_sec2_v1 {
uint16_t security_index; /* 2 */
@@ -546,14 +546,14 @@
A client would issue an operation by:
- (1) An RxRPC socket is set up by:
+ (1) An RxRPC socket is set up by::
client = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);
Where the third parameter indicates the protocol family of the transport
socket used - usually IPv4 but it can also be IPv6 [TODO].
- (2) A local address can optionally be bound:
+ (2) A local address can optionally be bound::
struct sockaddr_rxrpc srx = {
.srx_family = AF_RXRPC,
@@ -570,20 +570,20 @@
several unrelated RxRPC sockets. Security is handled on a basis of
per-RxRPC virtual connection.
- (3) The security is set:
+ (3) The security is set::
const char *key = "AFS:cambridge.redhat.com";
setsockopt(client, SOL_RXRPC, RXRPC_SECURITY_KEY, key, strlen(key));
This issues a request_key() to get the key representing the security
- context. The minimum security level can be set:
+ context. The minimum security level can be set::
unsigned int sec = RXRPC_SECURITY_ENCRYPTED;
setsockopt(client, SOL_RXRPC, RXRPC_MIN_SECURITY_LEVEL,
&sec, sizeof(sec));
(4) The server to be contacted can then be specified (alternatively this can
- be done through sendmsg):
+ be done through sendmsg)::
struct sockaddr_rxrpc srx = {
.srx_family = AF_RXRPC,
@@ -598,7 +598,9 @@
(5) The request data should then be posted to the server socket using a series
of sendmsg() calls, each with the following control message attached:
- RXRPC_USER_CALL_ID - specifies the user ID for this call
+ ================== ===================================
+ RXRPC_USER_CALL_ID specifies the user ID for this call
+ ================== ===================================
MSG_MORE should be set in msghdr::msg_flags on all but the last part of
the request. Multiple requests may be made simultaneously.
@@ -635,13 +637,12 @@
probe is concluded).
-====================
-EXAMPLE SERVER USAGE
+Example Server Usage
====================
A server would be set up to accept operations in the following manner:
- (1) An RxRPC socket is created by:
+ (1) An RxRPC socket is created by::
server = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);
@@ -649,7 +650,7 @@
socket used - usually IPv4.
(2) Security is set up if desired by giving the socket a keyring with server
- secret keys in it:
+ secret keys in it::
keyring = add_key("keyring", "AFSkeys", NULL, 0,
KEY_SPEC_PROCESS_KEYRING);
@@ -663,7 +664,7 @@
The keyring can be manipulated after it has been given to the socket. This
permits the server to add more keys, replace keys, etc. while it is live.
- (3) A local address must then be bound:
+ (3) A local address must then be bound::
struct sockaddr_rxrpc srx = {
.srx_family = AF_RXRPC,
@@ -680,7 +681,7 @@
should be called twice.
(4) If service upgrading is required, first two service IDs must have been
- bound and then the following option must be set:
+ bound and then the following option must be set::
unsigned short service_ids[2] = { from_ID, to_ID };
setsockopt(server, SOL_RXRPC, RXRPC_UPGRADEABLE_SERVICE,
@@ -690,14 +691,14 @@
to_ID if they request it. This will be reflected in msg_name obtained
through recvmsg() when the request data is delivered to userspace.
- (5) The server is then set to listen out for incoming calls:
+ (5) The server is then set to listen out for incoming calls::
listen(server, 100);
(6) The kernel notifies the server of pending incoming connections by sending
it a message for each. This is received with recvmsg() on the server
socket. It has no data, and has a single dataless control message
- attached:
+ attached::
RXRPC_NEW_CALL
@@ -709,8 +710,10 @@
(7) The server then accepts the new call by issuing a sendmsg() with two
pieces of control data and no actual data:
- RXRPC_ACCEPT - indicate connection acceptance
- RXRPC_USER_CALL_ID - specify user ID for this call
+ ================== ==============================
+ RXRPC_ACCEPT indicate connection acceptance
+ RXRPC_USER_CALL_ID specify user ID for this call
+ ================== ==============================
(8) The first request data packet will then be posted to the server socket for
recvmsg() to pick up. At that point, the RxRPC address for the call can
@@ -722,12 +725,17 @@
All data will be delivered with the following control message attached:
- RXRPC_USER_CALL_ID - specifies the user ID for this call
+
+ ================== ===================================
+ RXRPC_USER_CALL_ID specifies the user ID for this call
+ ================== ===================================
(9) The reply data should then be posted to the server socket using a series
of sendmsg() calls, each with the following control messages attached:
- RXRPC_USER_CALL_ID - specifies the user ID for this call
+ ================== ===================================
+ RXRPC_USER_CALL_ID specifies the user ID for this call
+ ================== ===================================
MSG_MORE should be set in msghdr::msg_flags on all but the last message
for a particular call.
@@ -736,8 +744,10 @@
when it is received. It will take the form of a dataless message with two
control messages attached:
- RXRPC_USER_CALL_ID - specifies the user ID for this call
- RXRPC_ACK - indicates final ACK (no data)
+ ================== ===================================
+ RXRPC_USER_CALL_ID specifies the user ID for this call
+ RXRPC_ACK indicates final ACK (no data)
+ ================== ===================================
MSG_EOR will be flagged to indicate that this is the final message for
this call.
@@ -746,8 +756,10 @@
aborted by calling sendmsg() with a dataless message with the following
control messages attached:
- RXRPC_USER_CALL_ID - specifies the user ID for this call
- RXRPC_ABORT - indicates abort code (4 byte data)
+ ================== ===================================
+ RXRPC_USER_CALL_ID specifies the user ID for this call
+ RXRPC_ABORT indicates abort code (4 byte data)
+ ================== ===================================
Any packets waiting in the socket's receive queue will be discarded if
this is issued.
@@ -757,8 +769,7 @@
determine the call affected.
-=========================
-AF_RXRPC KERNEL INTERFACE
+AF_RXRPC Kernel Interface
=========================
The AF_RXRPC module also provides an interface for use by in-kernel utilities
@@ -786,7 +797,7 @@
The kernel interface functions are as follows:
- (*) Begin a new client call.
+ (#) Begin a new client call::
struct rxrpc_call *
rxrpc_kernel_begin_call(struct socket *sock,
@@ -837,7 +848,7 @@
returned. The caller now holds a reference on this and it must be
properly ended.
- (*) End a client call.
+ (#) End a client call::
void rxrpc_kernel_end_call(struct socket *sock,
struct rxrpc_call *call);
@@ -846,7 +857,7 @@
from AF_RXRPC's knowledge and will not be seen again in association with
the specified call.
- (*) Send data through a call.
+ (#) Send data through a call::
typedef void (*rxrpc_notify_end_tx_t)(struct sock *sk,
unsigned long user_call_ID,
@@ -872,7 +883,7 @@
called with the call-state spinlock held to prevent any reply or final ACK
from being delivered first.
- (*) Receive data from a call.
+ (#) Receive data from a call::
int rxrpc_kernel_recv_data(struct socket *sock,
struct rxrpc_call *call,
@@ -902,12 +913,14 @@
more data was available, EMSGSIZE is returned.
If a remote ABORT is detected, the abort code received will be stored in
- *_abort and ECONNABORTED will be returned.
+ ``*_abort`` and ECONNABORTED will be returned.
The service ID that the call ended up with is returned into *_service.
This can be used to see if a call got a service upgrade.
- (*) Abort a call.
+ (#) Abort a call??
+
+ ::
void rxrpc_kernel_abort_call(struct socket *sock,
struct rxrpc_call *call,
@@ -916,7 +929,7 @@
This is used to abort a call if it's still in an abortable state. The
abort code specified will be placed in the ABORT message sent.
- (*) Intercept received RxRPC messages.
+ (#) Intercept received RxRPC messages::
typedef void (*rxrpc_interceptor_t)(struct sock *sk,
unsigned long user_call_ID,
@@ -937,7 +950,8 @@
The skb->mark field indicates the type of message:
- MARK MEANING
+ =============================== =======================================
+ Mark Meaning
=============================== =======================================
RXRPC_SKB_MARK_DATA Data message
RXRPC_SKB_MARK_FINAL_ACK Final ACK received for an incoming call
@@ -946,6 +960,7 @@
RXRPC_SKB_MARK_NET_ERROR Network error detected
RXRPC_SKB_MARK_LOCAL_ERROR Local error encountered
RXRPC_SKB_MARK_NEW_CALL New incoming call awaiting acceptance
+ =============================== =======================================
The remote abort message can be probed with rxrpc_kernel_get_abort_code().
The two error messages can be probed with rxrpc_kernel_get_error_number().
@@ -961,7 +976,7 @@
is possible to get extra refs on all types of message for later freeing,
but this may pin the state of a call until the message is finally freed.
- (*) Accept an incoming call.
+ (#) Accept an incoming call::
struct rxrpc_call *
rxrpc_kernel_accept_call(struct socket *sock,
@@ -975,7 +990,7 @@
returned. The caller now holds a reference on this and it must be
properly ended.
- (*) Reject an incoming call.
+ (#) Reject an incoming call::
int rxrpc_kernel_reject_call(struct socket *sock);
@@ -984,21 +999,21 @@
Other errors may be returned if the call had been aborted (-ECONNABORTED)
or had timed out (-ETIME).
- (*) Allocate a null key for doing anonymous security.
+ (#) Allocate a null key for doing anonymous security::
struct key *rxrpc_get_null_key(const char *keyname);
This is used to allocate a null RxRPC key that can be used to indicate
anonymous security for a particular domain.
- (*) Get the peer address of a call.
+ (#) Get the peer address of a call::
void rxrpc_kernel_get_peer(struct socket *sock, struct rxrpc_call *call,
struct sockaddr_rxrpc *_srx);
This is used to find the remote peer address of a call.
- (*) Set the total transmit data size on a call.
+ (#) Set the total transmit data size on a call::
void rxrpc_kernel_set_tx_length(struct socket *sock,
struct rxrpc_call *call,
@@ -1009,14 +1024,14 @@
size should be set when the call is begun. tx_total_len may not be less
than zero.
- (*) Get call RTT.
+ (#) Get call RTT::
u64 rxrpc_kernel_get_rtt(struct socket *sock, struct rxrpc_call *call);
Get the RTT time to the peer in use by a call. The value returned is in
nanoseconds.
- (*) Check call still alive.
+ (#) Check call still alive::
bool rxrpc_kernel_check_life(struct socket *sock,
struct rxrpc_call *call,
@@ -1024,7 +1039,7 @@
void rxrpc_kernel_probe_life(struct socket *sock,
struct rxrpc_call *call);
- The first function passes back in *_life a number that is updated when
+ The first function passes back in ``*_life`` a number that is updated when
ACKs are received from the peer (notably including PING RESPONSE ACKs
which we can elicit by sending PING ACKs to see if the call still exists
on the server). The caller should compare the numbers of two calls to see
@@ -1040,7 +1055,7 @@
first function to change. Note that this must be called in TASK_RUNNING
state.
- (*) Get reply timestamp.
+ (#) Get reply timestamp::
bool rxrpc_kernel_get_reply_time(struct socket *sock,
struct rxrpc_call *call,
@@ -1048,10 +1063,10 @@
This allows the timestamp on the first DATA packet of the reply of a
client call to be queried, provided that it is still in the Rx ring. If
- successful, the timestamp will be stored into *_ts and true will be
+ successful, the timestamp will be stored into ``*_ts`` and true will be
returned; false will be returned otherwise.
- (*) Get remote client epoch.
+ (#) Get remote client epoch::
u32 rxrpc_kernel_get_epoch(struct socket *sock,
struct rxrpc_call *call)
@@ -1065,7 +1080,7 @@
This value can be used to determine if the remote client has been
restarted as it shouldn't change otherwise.
- (*) Set the maxmimum lifespan on a call.
+ (#) Set the maxmimum lifespan on a call::
void rxrpc_kernel_set_max_life(struct socket *sock,
struct rxrpc_call *call,
@@ -1076,14 +1091,13 @@
aborted and -ETIME or -ETIMEDOUT will be returned.
-=======================
-CONFIGURABLE PARAMETERS
+Configurable Parameters
=======================
The RxRPC protocol driver has a number of configurable parameters that can be
adjusted through sysctls in /proc/net/rxrpc/:
- (*) req_ack_delay
+ (#) req_ack_delay
The amount of time in milliseconds after receiving a packet with the
request-ack flag set before we honour the flag and actually send the
@@ -1093,60 +1107,60 @@
reception window is full (to a maximum of 255 packets), so delaying the
ACK permits several packets to be ACK'd in one go.
- (*) soft_ack_delay
+ (#) soft_ack_delay
The amount of time in milliseconds after receiving a new packet before we
generate a soft-ACK to tell the sender that it doesn't need to resend.
- (*) idle_ack_delay
+ (#) idle_ack_delay
The amount of time in milliseconds after all the packets currently in the
received queue have been consumed before we generate a hard-ACK to tell
the sender it can free its buffers, assuming no other reason occurs that
we would send an ACK.
- (*) resend_timeout
+ (#) resend_timeout
The amount of time in milliseconds after transmitting a packet before we
transmit it again, assuming no ACK is received from the receiver telling
us they got it.
- (*) max_call_lifetime
+ (#) max_call_lifetime
The maximum amount of time in seconds that a call may be in progress
before we preemptively kill it.
- (*) dead_call_expiry
+ (#) dead_call_expiry
The amount of time in seconds before we remove a dead call from the call
list. Dead calls are kept around for a little while for the purpose of
repeating ACK and ABORT packets.
- (*) connection_expiry
+ (#) connection_expiry
The amount of time in seconds after a connection was last used before we
remove it from the connection list. While a connection is in existence,
it serves as a placeholder for negotiated security; when it is deleted,
the security must be renegotiated.
- (*) transport_expiry
+ (#) transport_expiry
The amount of time in seconds after a transport was last used before we
remove it from the transport list. While a transport is in existence, it
serves to anchor the peer data and keeps the connection ID counter.
- (*) rxrpc_rx_window_size
+ (#) rxrpc_rx_window_size
The size of the receive window in packets. This is the maximum number of
unconsumed received packets we're willing to hold in memory for any
particular call.
- (*) rxrpc_rx_mtu
+ (#) rxrpc_rx_mtu
The maximum packet MTU size that we're willing to receive in bytes. This
indicates to the peer whether we're willing to accept jumbo packets.
- (*) rxrpc_rx_jumbo_max
+ (#) rxrpc_rx_jumbo_max
The maximum number of packets that we're willing to accept in a jumbo
packet. Non-terminal packets in a jumbo packet must contain a four byte
diff --git a/Documentation/networking/sctp.txt b/Documentation/networking/sctp.rst
similarity index 64%
rename from Documentation/networking/sctp.txt
rename to Documentation/networking/sctp.rst
index 97b810c..9f4d9c8 100644
--- a/Documentation/networking/sctp.txt
+++ b/Documentation/networking/sctp.rst
@@ -1,35 +1,42 @@
-Linux Kernel SCTP
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Linux Kernel SCTP
+=================
This is the current BETA release of the Linux Kernel SCTP reference
-implementation.
+implementation.
SCTP (Stream Control Transmission Protocol) is a IP based, message oriented,
reliable transport protocol, with congestion control, support for
transparent multi-homing, and multiple ordered streams of messages.
RFC2960 defines the core protocol. The IETF SIGTRAN working group originally
-developed the SCTP protocol and later handed the protocol over to the
-Transport Area (TSVWG) working group for the continued evolvement of SCTP as a
-general purpose transport.
+developed the SCTP protocol and later handed the protocol over to the
+Transport Area (TSVWG) working group for the continued evolvement of SCTP as a
+general purpose transport.
-See the IETF website (http://www.ietf.org) for further documents on SCTP.
-See http://www.ietf.org/rfc/rfc2960.txt
+See the IETF website (http://www.ietf.org) for further documents on SCTP.
+See http://www.ietf.org/rfc/rfc2960.txt
The initial project goal is to create an Linux kernel reference implementation
-of SCTP that is RFC 2960 compliant and provides an programming interface
-referred to as the UDP-style API of the Sockets Extensions for SCTP, as
-proposed in IETF Internet-Drafts.
+of SCTP that is RFC 2960 compliant and provides an programming interface
+referred to as the UDP-style API of the Sockets Extensions for SCTP, as
+proposed in IETF Internet-Drafts.
-Caveats:
+Caveats
+=======
--lksctp can be built as statically or as a module. However, be aware that
-module removal of lksctp is not yet a safe activity.
+- lksctp can be built as statically or as a module. However, be aware that
+ module removal of lksctp is not yet a safe activity.
--There is tentative support for IPv6, but most work has gone towards
-implementation and testing lksctp on IPv4.
+- There is tentative support for IPv6, but most work has gone towards
+ implementation and testing lksctp on IPv4.
For more information, please visit the lksctp project website:
+
http://www.sf.net/projects/lksctp
Or contact the lksctp developers through the mailing list:
+
<linux-sctp@vger.kernel.org>
diff --git a/Documentation/networking/secid.txt b/Documentation/networking/secid.rst
similarity index 87%
rename from Documentation/networking/secid.txt
rename to Documentation/networking/secid.rst
index 95ea067..b45141a 100644
--- a/Documentation/networking/secid.txt
+++ b/Documentation/networking/secid.rst
@@ -1,3 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+LSM/SeLinux secid
+=================
+
flowi structure:
The secid member in the flow structure is used in LSMs (e.g. SELinux) to indicate
diff --git a/Documentation/networking/seg6-sysctl.rst b/Documentation/networking/seg6-sysctl.rst
new file mode 100644
index 0000000..ec73e14
--- /dev/null
+++ b/Documentation/networking/seg6-sysctl.rst
@@ -0,0 +1,26 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+Seg6 Sysfs variables
+====================
+
+
+/proc/sys/net/conf/<iface>/seg6_* variables:
+============================================
+
+seg6_enabled - BOOL
+ Accept or drop SR-enabled IPv6 packets on this interface.
+
+ Relevant packets are those with SRH present and DA = local.
+
+ * 0 - disabled (default)
+ * not 0 - enabled
+
+seg6_require_hmac - INTEGER
+ Define HMAC policy for ingress SR-enabled packets on this interface.
+
+ * -1 - Ignore HMAC field
+ * 0 - Accept SR packets without HMAC, validate SR packets with HMAC
+ * 1 - Drop SR packets without HMAC, validate SR packets with HMAC
+
+ Default is 0.
diff --git a/Documentation/networking/seg6-sysctl.txt b/Documentation/networking/seg6-sysctl.txt
deleted file mode 100644
index bdbde23..0000000
--- a/Documentation/networking/seg6-sysctl.txt
+++ /dev/null
@@ -1,18 +0,0 @@
-/proc/sys/net/conf/<iface>/seg6_* variables:
-
-seg6_enabled - BOOL
- Accept or drop SR-enabled IPv6 packets on this interface.
-
- Relevant packets are those with SRH present and DA = local.
-
- 0 - disabled (default)
- not 0 - enabled
-
-seg6_require_hmac - INTEGER
- Define HMAC policy for ingress SR-enabled packets on this interface.
-
- -1 - Ignore HMAC field
- 0 - Accept SR packets without HMAC, validate SR packets with HMAC
- 1 - Drop SR packets without HMAC, validate SR packets with HMAC
-
- Default is 0.
diff --git a/Documentation/networking/skfp.txt b/Documentation/networking/skfp.rst
similarity index 68%
rename from Documentation/networking/skfp.txt
rename to Documentation/networking/skfp.rst
index 203ec66..58f5481 100644
--- a/Documentation/networking/skfp.txt
+++ b/Documentation/networking/skfp.rst
@@ -1,35 +1,41 @@
-(C)Copyright 1998-2000 SysKonnect,
-===========================================================================
+.. SPDX-License-Identifier: GPL-2.0
+
+.. include:: <isonum.txt>
+
+========================
+SysKonnect driver - SKFP
+========================
+
+|copy| Copyright 1998-2000 SysKonnect,
skfp.txt created 11-May-2000
Readme File for skfp.o v2.06
-This file contains
-(1) OVERVIEW
-(2) SUPPORTED ADAPTERS
-(3) GENERAL INFORMATION
-(4) INSTALLATION
-(5) INCLUSION OF THE ADAPTER IN SYSTEM START
-(6) TROUBLESHOOTING
-(7) FUNCTION OF THE ADAPTER LEDS
-(8) HISTORY
+.. This file contains
-===========================================================================
+ (1) OVERVIEW
+ (2) SUPPORTED ADAPTERS
+ (3) GENERAL INFORMATION
+ (4) INSTALLATION
+ (5) INCLUSION OF THE ADAPTER IN SYSTEM START
+ (6) TROUBLESHOOTING
+ (7) FUNCTION OF THE ADAPTER LEDS
+ (8) HISTORY
-
-(1) OVERVIEW
-============
+1. Overview
+===========
This README explains how to use the driver 'skfp' for Linux with your
network adapter.
Chapter 2: Contains a list of all network adapters that are supported by
- this driver.
+this driver.
-Chapter 3: Gives some general information.
+Chapter 3:
+ Gives some general information.
Chapter 4: Describes common problems and solutions.
@@ -37,14 +43,13 @@
Chapter 6: History of development.
-***
-
-(2) SUPPORTED ADAPTERS
-======================
+2. Supported adapters
+=====================
The network driver 'skfp' supports the following network adapters:
SysKonnect adapters:
+
- SK-5521 (SK-NET FDDI-UP)
- SK-5522 (SK-NET FDDI-UP DAS)
- SK-5541 (SK-NET FDDI-FP)
@@ -55,157 +60,187 @@
- SK-5841 (SK-NET FDDI-FP64)
- SK-5843 (SK-NET FDDI-LP64)
- SK-5844 (SK-NET FDDI-LP64 DAS)
+
Compaq adapters (not tested):
+
- Netelligent 100 FDDI DAS Fibre SC
- Netelligent 100 FDDI SAS Fibre SC
- Netelligent 100 FDDI DAS UTP
- Netelligent 100 FDDI SAS UTP
- Netelligent 100 FDDI SAS Fibre MIC
-***
-(3) GENERAL INFORMATION
-=======================
+3. General Information
+======================
From v2.01 on, the driver is integrated in the linux kernel sources.
Therefore, the installation is the same as for any other adapter
supported by the kernel.
+
Refer to the manual of your distribution about the installation
of network adapters.
+
Makes my life much easier :-)
-***
-
-(4) TROUBLESHOOTING
-===================
+4. Troubleshooting
+==================
If you run into problems during installation, check those items:
-Problem: The FDDI adapter cannot be found by the driver.
-Reason: Look in /proc/pci for the following entry:
- 'FDDI network controller: SysKonnect SK-FDDI-PCI ...'
+Problem:
+ The FDDI adapter cannot be found by the driver.
+
+Reason:
+ Look in /proc/pci for the following entry:
+
+ 'FDDI network controller: SysKonnect SK-FDDI-PCI ...'
+
If this entry exists, then the FDDI adapter has been
found by the system and should be able to be used.
+
If this entry does not exist or if the file '/proc/pci'
is not there, then you may have a hardware problem or PCI
support may not be enabled in your kernel.
+
The adapter can be checked using the diagnostic program
which is available from the SysKonnect web site:
+
www.syskonnect.de
+
Some COMPAQ machines have a problem with PCI under
Linux. This is described in the 'PCI howto' document
(included in some distributions or available from the
www, e.g. at 'www.linux.org') and no workaround is available.
-Problem: You want to use your computer as a router between
- multiple IP subnetworks (using multiple adapters), but
+Problem:
+ You want to use your computer as a router between
+ multiple IP subnetworks (using multiple adapters), but
you cannot reach computers in other subnetworks.
-Reason: Either the router's kernel is not configured for IP
+
+Reason:
+ Either the router's kernel is not configured for IP
forwarding or there is a problem with the routing table
and gateway configuration in at least one of the
computers.
If your problem is not listed here, please contact our
-technical support for help.
-You can send email to:
- linux@syskonnect.de
+technical support for help.
+
+You can send email to: linux@syskonnect.de
+
When contacting our technical support,
please ensure that the following information is available:
+
- System Manufacturer and Model
- Boards in your system
- Distribution
- Kernel version
-***
+
+5. Function of the Adapter LEDs
+===============================
+
+ The functionality of the LED's on the FDDI network adapters was
+ changed in SMT version v2.82. With this new SMT version, the yellow
+ LED works as a ring operational indicator. An active yellow LED
+ indicates that the ring is down. The green LED on the adapter now
+ works as a link indicator where an active GREEN LED indicates that
+ the respective port has a physical connection.
+
+ With versions of SMT prior to v2.82 a ring up was indicated if the
+ yellow LED was off while the green LED(s) showed the connection
+ status of the adapter. During a ring down the green LED was off and
+ the yellow LED was on.
+
+ All implementations indicate that a driver is not loaded if
+ all LEDs are off.
-(5) FUNCTION OF THE ADAPTER LEDS
-================================
-
- The functionality of the LED's on the FDDI network adapters was
- changed in SMT version v2.82. With this new SMT version, the yellow
- LED works as a ring operational indicator. An active yellow LED
- indicates that the ring is down. The green LED on the adapter now
- works as a link indicator where an active GREEN LED indicates that
- the respective port has a physical connection.
-
- With versions of SMT prior to v2.82 a ring up was indicated if the
- yellow LED was off while the green LED(s) showed the connection
- status of the adapter. During a ring down the green LED was off and
- the yellow LED was on.
-
- All implementations indicate that a driver is not loaded if
- all LEDs are off.
-
-***
-
-
-(6) HISTORY
-===========
+6. History
+==========
v2.06 (20000511) (In-Kernel version)
New features:
+
- 64 bit support
- new pci dma interface
- in kernel 2.3.99
v2.05 (20000217) (In-Kernel version)
New features:
+
- Changes for 2.3.45 kernel
v2.04 (20000207) (Standalone version)
New features:
+
- Added rx/tx byte counter
v2.03 (20000111) (Standalone version)
Problems fixed:
+
- Fixed printk statements from v2.02
v2.02 (991215) (Standalone version)
Problems fixed:
+
- Removed unnecessary output
- Fixed path for "printver.sh" in makefile
v2.01 (991122) (In-Kernel version)
New features:
+
- Integration in Linux kernel sources
- Support for memory mapped I/O.
v2.00 (991112)
New features:
+
- Full source released under GPL
v1.05 (991023)
Problems fixed:
+
- Compilation with kernel version 2.2.13 failed
v1.04 (990427)
Changes:
+
- New SMT module included, changing LED functionality
+
Problems fixed:
+
- Synchronization on SMP machines was buggy
v1.03 (990325)
Problems fixed:
+
- Interrupt routing on SMP machines could be incorrect
v1.02 (990310)
New features:
+
- Support for kernel versions 2.2.x added
- Kernel patch instead of private duplicate of kernel functions
v1.01 (980812)
Problems fixed:
+
Connection hangup with telnet
Slow telnet connection
v1.00 beta 01 (980507)
New features:
+
None.
+
Problems fixed:
+
None.
+
Known limitations:
- - tar archive instead of standard package format (rpm).
+
+ - tar archive instead of standard package format (rpm).
- FDDI statistic is empty.
- not tested with 2.1.xx kernels
- integration in kernel not tested
@@ -216,5 +251,3 @@
- does not work on some COMPAQ machines. See the PCI howto
document for details about this problem.
- data corruption with kernel versions below 2.0.33.
-
-*** End of information file ***
diff --git a/Documentation/networking/strparser.txt b/Documentation/networking/strparser.rst
similarity index 80%
rename from Documentation/networking/strparser.txt
rename to Documentation/networking/strparser.rst
index a7d354d..6cab1f7 100644
--- a/Documentation/networking/strparser.txt
+++ b/Documentation/networking/strparser.rst
@@ -1,4 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
Stream Parser (strparser)
+=========================
Introduction
============
@@ -34,8 +38,10 @@
Functions
=========
-strp_init(struct strparser *strp, struct sock *sk,
- const struct strp_callbacks *cb)
+ ::
+
+ strp_init(struct strparser *strp, struct sock *sk,
+ const struct strp_callbacks *cb)
Called to initialize a stream parser. strp is a struct of type
strparser that is allocated by the upper layer. sk is the TCP
@@ -43,31 +49,41 @@
callback mode; in general mode this is set to NULL. Callbacks
are called by the stream parser (the callbacks are listed below).
-void strp_pause(struct strparser *strp)
+ ::
+
+ void strp_pause(struct strparser *strp)
Temporarily pause a stream parser. Message parsing is suspended
and no new messages are delivered to the upper layer.
-void strp_unpause(struct strparser *strp)
+ ::
+
+ void strp_unpause(struct strparser *strp)
Unpause a paused stream parser.
-void strp_stop(struct strparser *strp);
+ ::
+
+ void strp_stop(struct strparser *strp);
strp_stop is called to completely stop stream parser operations.
This is called internally when the stream parser encounters an
error, and it is called from the upper layer to stop parsing
operations.
-void strp_done(struct strparser *strp);
+ ::
+
+ void strp_done(struct strparser *strp);
strp_done is called to release any resources held by the stream
parser instance. This must be called after the stream processor
has been stopped.
-int strp_process(struct strparser *strp, struct sk_buff *orig_skb,
- unsigned int orig_offset, size_t orig_len,
- size_t max_msg_size, long timeo)
+ ::
+
+ int strp_process(struct strparser *strp, struct sk_buff *orig_skb,
+ unsigned int orig_offset, size_t orig_len,
+ size_t max_msg_size, long timeo)
strp_process is called in general mode for a stream parser to
parse an sk_buff. The number of bytes processed or a negative
@@ -75,7 +91,9 @@
consume the sk_buff. max_msg_size is maximum size the stream
parser will parse. timeo is timeout for completing a message.
-void strp_data_ready(struct strparser *strp);
+ ::
+
+ void strp_data_ready(struct strparser *strp);
The upper layer calls strp_tcp_data_ready when data is ready on
the lower socket for strparser to process. This should be called
@@ -83,7 +101,9 @@
maximum messages size is the limit of the receive socket
buffer and message timeout is the receive timeout for the socket.
-void strp_check_rcv(struct strparser *strp);
+ ::
+
+ void strp_check_rcv(struct strparser *strp);
strp_check_rcv is called to check for new messages on the socket.
This is normally called at initialization of a stream parser
@@ -94,7 +114,9 @@
There are six callbacks:
-int (*parse_msg)(struct strparser *strp, struct sk_buff *skb);
+ ::
+
+ int (*parse_msg)(struct strparser *strp, struct sk_buff *skb);
parse_msg is called to determine the length of the next message
in the stream. The upper layer must implement this function. It
@@ -107,14 +129,16 @@
The return values of this function are:
- >0 : indicates length of successfully parsed message
- 0 : indicates more data must be received to parse the message
- -ESTRPIPE : current message should not be processed by the
- kernel, return control of the socket to userspace which
- can proceed to read the messages itself
- other < 0 : Error in parsing, give control back to userspace
- assuming that synchronization is lost and the stream
- is unrecoverable (application expected to close TCP socket)
+ ========= ===========================================================
+ >0 indicates length of successfully parsed message
+ 0 indicates more data must be received to parse the message
+ -ESTRPIPE current message should not be processed by the
+ kernel, return control of the socket to userspace which
+ can proceed to read the messages itself
+ other < 0 Error in parsing, give control back to userspace
+ assuming that synchronization is lost and the stream
+ is unrecoverable (application expected to close TCP socket)
+ ========= ===========================================================
In the case that an error is returned (return value is less than
zero) and the parser is in receive callback mode, then it will set
@@ -123,7 +147,9 @@
the current message, then the error set on the attached socket is
ENODATA since the stream is unrecoverable in that case.
-void (*lock)(struct strparser *strp)
+ ::
+
+ void (*lock)(struct strparser *strp)
The lock callback is called to lock the strp structure when
the strparser is performing an asynchronous operation (such as
@@ -131,14 +157,18 @@
function is to lock_sock for the associated socket. In general
mode the callback must be set appropriately.
-void (*unlock)(struct strparser *strp)
+ ::
+
+ void (*unlock)(struct strparser *strp)
The unlock callback is called to release the lock obtained
by the lock callback. In receive callback mode the default
function is release_sock for the associated socket. In general
mode the callback must be set appropriately.
-void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb);
+ ::
+
+ void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb);
rcv_msg is called when a full message has been received and
is queued. The callee must consume the sk_buff; it can
@@ -152,7 +182,9 @@
the length of the message. skb->len - offset may be greater
then full_len since strparser does not trim the skb.
-int (*read_sock_done)(struct strparser *strp, int err);
+ ::
+
+ int (*read_sock_done)(struct strparser *strp, int err);
read_sock_done is called when the stream parser is done reading
the TCP socket in receive callback mode. The stream parser may
@@ -160,7 +192,9 @@
to occur when exiting the loop. If the callback is not set (NULL
in strp_init) a default function is used.
-void (*abort_parser)(struct strparser *strp, int err);
+ ::
+
+ void (*abort_parser)(struct strparser *strp, int err);
This function is called when stream parser encounters an error
in parsing. The default function stops the stream parser and
@@ -204,4 +238,3 @@
======
Tom Herbert (tom@quantonium.net)
-
diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.rst
similarity index 83%
rename from Documentation/networking/switchdev.txt
rename to Documentation/networking/switchdev.rst
index 86174ce..ddc3f35 100644
--- a/Documentation/networking/switchdev.txt
+++ b/Documentation/networking/switchdev.rst
@@ -1,7 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. include:: <isonum.txt>
+
+===============================================
Ethernet switch device driver model (switchdev)
===============================================
-Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
-Copyright (c) 2014-2015 Scott Feldman <sfeldma@gmail.com>
+
+Copyright |copy| 2014 Jiri Pirko <jiri@resnulli.us>
+
+Copyright |copy| 2014-2015 Scott Feldman <sfeldma@gmail.com>
The Ethernet switch device driver model (switchdev) is an in-kernel driver
@@ -12,53 +18,57 @@
an example setup using a data-center-class switch ASIC chip. Other setups
with SR-IOV or soft switches, such as OVS, are possible.
+::
- User-space tools
+
+ User-space tools
user space |
+-------------------------------------------------------------------+
kernel | Netlink
- |
- +--------------+-------------------------------+
- | Network stack |
- | (Linux) |
- | |
- +----------------------------------------------+
+ |
+ +--------------+-------------------------------+
+ | Network stack |
+ | (Linux) |
+ | |
+ +----------------------------------------------+
- sw1p2 sw1p4 sw1p6
- sw1p1 + sw1p3 + sw1p5 + eth1
- + | + | + | +
- | | | | | | |
- +--+----+----+----+----+----+---+ +-----+-----+
- | Switch driver | | mgmt |
- | (this document) | | driver |
- | | | |
- +--------------+----------------+ +-----------+
- |
+ sw1p2 sw1p4 sw1p6
+ sw1p1 + sw1p3 + sw1p5 + eth1
+ + | + | + | +
+ | | | | | | |
+ +--+----+----+----+----+----+---+ +-----+-----+
+ | Switch driver | | mgmt |
+ | (this document) | | driver |
+ | | | |
+ +--------------+----------------+ +-----------+
+ |
kernel | HW bus (eg PCI)
+-------------------------------------------------------------------+
hardware |
- +--------------+----------------+
- | Switch device (sw1) |
- | +----+ +--------+
- | | v offloaded data path | mgmt port
- | | | |
- +--|----|----+----+----+----+---+
- | | | | | |
- + + + + + +
- p1 p2 p3 p4 p5 p6
+ +--------------+----------------+
+ | Switch device (sw1) |
+ | +----+ +--------+
+ | | v offloaded data path | mgmt port
+ | | | |
+ +--|----|----+----+----+----+---+
+ | | | | | |
+ + + + + + +
+ p1 p2 p3 p4 p5 p6
- front-panel ports
+ front-panel ports
- Fig 1.
+ Fig 1.
Include Files
-------------
-#include <linux/netdevice.h>
-#include <net/switchdev.h>
+::
+
+ #include <linux/netdevice.h>
+ #include <net/switchdev.h>
Configuration
@@ -114,10 +124,10 @@
useful for dynamically-named ports where the device names its ports based on
external configuration. For example, if a physical 40G port is split logically
into 4 10G ports, resulting in 4 port netdevs, the device can give a unique
-name for each port using port PHYS name. The udev rule would be:
+name for each port using port PHYS name. The udev rule would be::
-SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
- ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}"
+ SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
+ ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}"
Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y
is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0
@@ -173,7 +183,7 @@
The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump
to support static FDB entries installed to the device. Static bridge FDB
-entries are installed, for example, using iproute2 bridge cmd:
+entries are installed, for example, using iproute2 bridge cmd::
bridge fdb add ADDR dev DEV [vlan VID] [self]
@@ -185,7 +195,7 @@
example, due to full capacity in hardware tables) ?
Note: by default, the bridge does not filter on VLAN and only bridges untagged
-traffic. To enable VLAN support, turn on VLAN filtering:
+traffic. To enable VLAN support, turn on VLAN filtering::
echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering
@@ -194,7 +204,7 @@
The switch device will learn/forget source MAC address/VLAN on ingress packets
and notify the switch driver of the mac/vlan/port tuples. The switch driver,
-in turn, will notify the bridge driver using the switchdev notifier call:
+in turn, will notify the bridge driver using the switchdev notifier call::
err = call_switchdev_notifiers(val, dev, info, extack);
@@ -202,7 +212,7 @@
forgetting, and info points to a struct switchdev_notifier_fdb_info. On
SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the
bridge's FDB and mark the entry as NTF_EXT_LEARNED. The iproute2 bridge
-command will label these entries "offload":
+command will label these entries "offload"::
$ bridge fdb
52:54:00:12:35:01 dev sw1p1 master br0 permanent
@@ -219,11 +229,11 @@
01:00:5e:00:00:01 dev br0 self permanent
33:33:ff:12:35:01 dev br0 self permanent
-Learning on the port should be disabled on the bridge using the bridge command:
+Learning on the port should be disabled on the bridge using the bridge command::
bridge link set dev DEV learning off
-Learning on the device port should be enabled, as well as learning_sync:
+Learning on the device port should be enabled, as well as learning_sync::
bridge link set dev DEV learning on self
bridge link set dev DEV learning_sync on self
@@ -314,12 +324,16 @@
To program the device, the driver has to register a FIB notifier handler
using register_fib_notifier. The following events are available:
-FIB_EVENT_ENTRY_ADD: used for both adding a new FIB entry to the device,
- or modifying an existing entry on the device.
-FIB_EVENT_ENTRY_DEL: used for removing a FIB entry
-FIB_EVENT_RULE_ADD, FIB_EVENT_RULE_DEL: used to propagate FIB rule changes
-FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass:
+=================== ===================================================
+FIB_EVENT_ENTRY_ADD used for both adding a new FIB entry to the device,
+ or modifying an existing entry on the device.
+FIB_EVENT_ENTRY_DEL used for removing a FIB entry
+FIB_EVENT_RULE_ADD,
+FIB_EVENT_RULE_DEL used to propagate FIB rule changes
+=================== ===================================================
+
+FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass::
struct fib_entry_notifier_info {
struct fib_notifier_info info; /* must be first */
@@ -332,12 +346,12 @@
u32 nlflags;
};
-to add/modify/delete IPv4 dst/dest_len prefix on table tb_id. The *fi
-structure holds details on the route and route's nexthops. *dev is one of the
-port netdevs mentioned in the route's next hop list.
+to add/modify/delete IPv4 dst/dest_len prefix on table tb_id. The ``*fi``
+structure holds details on the route and route's nexthops. ``*dev`` is one
+of the port netdevs mentioned in the route's next hop list.
Routes offloaded to the device are labeled with "offload" in the ip route
-listing:
+listing::
$ ip route show
default via 192.168.0.2 dev eth0
diff --git a/Documentation/networking/tc-actions-env-rules.rst b/Documentation/networking/tc-actions-env-rules.rst
new file mode 100644
index 0000000..86884b8
--- /dev/null
+++ b/Documentation/networking/tc-actions-env-rules.rst
@@ -0,0 +1,29 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================================
+TC Actions - Environmental Rules
+================================
+
+
+The "environmental" rules for authors of any new tc actions are:
+
+1) If you stealeth or borroweth any packet thou shalt be branching
+ from the righteous path and thou shalt cloneth.
+
+ For example if your action queues a packet to be processed later,
+ or intentionally branches by redirecting a packet, then you need to
+ clone the packet.
+
+2) If you munge any packet thou shalt call pskb_expand_head in the case
+ someone else is referencing the skb. After that you "own" the skb.
+
+3) Dropping packets you don't own is a no-no. You simply return
+ TC_ACT_SHOT to the caller and they will drop it.
+
+The "environmental" rules for callers of actions (qdiscs etc) are:
+
+#) Thou art responsible for freeing anything returned as being
+ TC_ACT_SHOT/STOLEN/QUEUED. If none of TC_ACT_SHOT/STOLEN/QUEUED is
+ returned, then all is great and you don't need to do anything.
+
+Post on netdev if something is unclear.
diff --git a/Documentation/networking/tc-actions-env-rules.txt b/Documentation/networking/tc-actions-env-rules.txt
deleted file mode 100644
index f378146..0000000
--- a/Documentation/networking/tc-actions-env-rules.txt
+++ /dev/null
@@ -1,24 +0,0 @@
-
-The "environmental" rules for authors of any new tc actions are:
-
-1) If you stealeth or borroweth any packet thou shalt be branching
-from the righteous path and thou shalt cloneth.
-
-For example if your action queues a packet to be processed later,
-or intentionally branches by redirecting a packet, then you need to
-clone the packet.
-
-2) If you munge any packet thou shalt call pskb_expand_head in the case
-someone else is referencing the skb. After that you "own" the skb.
-
-3) Dropping packets you don't own is a no-no. You simply return
-TC_ACT_SHOT to the caller and they will drop it.
-
-The "environmental" rules for callers of actions (qdiscs etc) are:
-
-*) Thou art responsible for freeing anything returned as being
-TC_ACT_SHOT/STOLEN/QUEUED. If none of TC_ACT_SHOT/STOLEN/QUEUED is
-returned, then all is great and you don't need to do anything.
-
-Post on netdev if something is unclear.
-
diff --git a/Documentation/networking/tcp-thin.txt b/Documentation/networking/tcp-thin.rst
similarity index 97%
rename from Documentation/networking/tcp-thin.txt
rename to Documentation/networking/tcp-thin.rst
index 151e229..b06765c 100644
--- a/Documentation/networking/tcp-thin.txt
+++ b/Documentation/networking/tcp-thin.rst
@@ -1,5 +1,9 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
Thin-streams and TCP
====================
+
A wide range of Internet-based services that use reliable transport
protocols display what we call thin-stream properties. This means
that the application sends data with such a low rate that the
@@ -42,6 +46,7 @@
==========
More information on the modifications, as well as a wide range of
experimental data can be found here:
+
"Improving latency for interactive, thin-stream applications over
reliable transport"
http://simula.no/research/nd/publications/Simula.nd.477/simula_pdf_file
diff --git a/Documentation/networking/team.txt b/Documentation/networking/team.rst
similarity index 67%
rename from Documentation/networking/team.txt
rename to Documentation/networking/team.rst
index 5a01368..0a7f3a0 100644
--- a/Documentation/networking/team.txt
+++ b/Documentation/networking/team.rst
@@ -1,2 +1,8 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====
+Team
+====
+
Team devices are driven from userspace via libteam library which is here:
https://github.com/jpirko/libteam
diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.rst
similarity index 89%
rename from Documentation/networking/timestamping.txt
rename to Documentation/networking/timestamping.rst
index 8dd6333..1adead6 100644
--- a/Documentation/networking/timestamping.txt
+++ b/Documentation/networking/timestamping.rst
@@ -1,9 +1,16 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============
+Timestamping
+============
+
1. Control Interfaces
+=====================
The interfaces for receiving network packages timestamps are:
-* SO_TIMESTAMP
+SO_TIMESTAMP
Generates a timestamp for each incoming packet in (not necessarily
monotonic) system time. Reports the timestamp via recvmsg() in a
control message in usec resolution.
@@ -13,7 +20,7 @@
SO_TIMESTAMP_OLD and in struct __kernel_sock_timeval for
SO_TIMESTAMP_NEW options respectively.
-* SO_TIMESTAMPNS
+SO_TIMESTAMPNS
Same timestamping mechanism as SO_TIMESTAMP, but reports the
timestamp as struct timespec in nsec resolution.
SO_TIMESTAMPNS is defined as SO_TIMESTAMPNS_NEW or SO_TIMESTAMPNS_OLD
@@ -22,17 +29,18 @@
and in struct __kernel_timespec for SO_TIMESTAMPNS_NEW options
respectively.
-* IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
+IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
Only for multicast:approximate transmit timestamp obtained by
reading the looped packet receive timestamp.
-* SO_TIMESTAMPING
+SO_TIMESTAMPING
Generates timestamps on reception, transmission or both. Supports
multiple timestamp sources, including hardware. Supports generating
timestamps for stream sockets.
-1.1 SO_TIMESTAMP (also SO_TIMESTAMP_OLD and SO_TIMESTAMP_NEW):
+1.1 SO_TIMESTAMP (also SO_TIMESTAMP_OLD and SO_TIMESTAMP_NEW)
+-------------------------------------------------------------
This socket option enables timestamping of datagrams on the reception
path. Because the destination socket, if any, is not known early in
@@ -59,10 +67,11 @@
SO_TIMESTAMPNS_OLD returns incorrect timestamps after the year 2038
on 32 bit machines.
-1.3 SO_TIMESTAMPING (also SO_TIMESTAMPING_OLD and SO_TIMESTAMPING_NEW):
+1.3 SO_TIMESTAMPING (also SO_TIMESTAMPING_OLD and SO_TIMESTAMPING_NEW)
+----------------------------------------------------------------------
Supports multiple types of timestamp requests. As a result, this
-socket option takes a bitmap of flags, not a boolean. In
+socket option takes a bitmap of flags, not a boolean. In::
err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
@@ -76,6 +85,7 @@
1.3.1 Timestamp Generation
+^^^^^^^^^^^^^^^^^^^^^^^^^^
Some bits are requests to the stack to try to generate timestamps. Any
combination of them is valid. Changes to these bits apply to newly
@@ -106,7 +116,6 @@
require driver support and may not be available for all devices.
This flag can be enabled via both socket options and control messages.
-
SOF_TIMESTAMPING_TX_SCHED:
Request tx timestamps prior to entering the packet scheduler. Kernel
transmit latency is, if long, often dominated by queuing delay. The
@@ -132,6 +141,7 @@
1.3.2 Timestamp Reporting
+^^^^^^^^^^^^^^^^^^^^^^^^^
The other three bits control which timestamps will be reported in a
generated control message. Changes to the bits take immediate
@@ -151,11 +161,11 @@
1.3.3 Timestamp Options
+^^^^^^^^^^^^^^^^^^^^^^^
The interface supports the options
SOF_TIMESTAMPING_OPT_ID:
-
Generate a unique identifier along with each packet. A process can
have multiple concurrent timestamping requests outstanding. Packets
can be reordered in the transmit path, for instance in the packet
@@ -183,7 +193,6 @@
SOF_TIMESTAMPING_OPT_CMSG:
-
Support recv() cmsg for all timestamped packets. Control messages
are already supported unconditionally on all packets with receive
timestamps and on IPv6 packets with transmit timestamp. This option
@@ -193,7 +202,6 @@
SOF_TIMESTAMPING_OPT_TSONLY:
-
Applies to transmit timestamps only. Makes the kernel return the
timestamp as a cmsg alongside an empty packet, as opposed to
alongside the original packet. This reduces the amount of memory
@@ -202,7 +210,6 @@
This option disables SOF_TIMESTAMPING_OPT_CMSG.
SOF_TIMESTAMPING_OPT_STATS:
-
Optional stats that are obtained along with the transmit timestamps.
It must be used together with SOF_TIMESTAMPING_OPT_TSONLY. When the
transmit timestamp is available, the stats are available in a
@@ -213,7 +220,6 @@
data was limited by peer's receiver window.
SOF_TIMESTAMPING_OPT_PKTINFO:
-
Enable the SCM_TIMESTAMPING_PKTINFO control message for incoming
packets with hardware timestamps. The message contains struct
scm_ts_pktinfo, which supplies the index of the real interface which
@@ -223,7 +229,6 @@
other fields, but they are reserved and undefined.
SOF_TIMESTAMPING_OPT_TX_SWHW:
-
Request both hardware and software timestamps for outgoing packets
when SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE
are enabled at the same time. If both timestamps are generated,
@@ -242,12 +247,13 @@
1.3.4. Enabling timestamps via control messages
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In addition to socket options, timestamp generation can be requested
per write via cmsg, only for SOF_TIMESTAMPING_TX_* (see Section 1.3.1).
Using this feature, applications can sample timestamps per sendmsg()
without paying the overhead of enabling and disabling timestamps via
-setsockopt:
+setsockopt::
struct msghdr *msg;
...
@@ -264,7 +270,7 @@
the SOF_TIMESTAMPING_TX_* flags set via setsockopt.
Moreover, applications must still enable timestamp reporting via
-setsockopt to receive timestamps:
+setsockopt to receive timestamps::
__u32 val = SOF_TIMESTAMPING_SOFTWARE |
SOF_TIMESTAMPING_OPT_ID /* or any other flag */;
@@ -272,6 +278,7 @@
1.4 Bytestream Timestamps
+-------------------------
The SO_TIMESTAMPING interface supports timestamping of bytes in a
bytestream. Each request is interpreted as a request for when the
@@ -331,6 +338,7 @@
2 Data Interfaces
+==================
Timestamps are read using the ancillary data feature of recvmsg().
See `man 3 cmsg` for details of this interface. The socket manual
@@ -339,20 +347,21 @@
2.1 SCM_TIMESTAMPING records
+----------------------------
These timestamps are returned in a control message with cmsg_level
SOL_SOCKET, cmsg_type SCM_TIMESTAMPING, and payload of type
-For SO_TIMESTAMPING_OLD:
+For SO_TIMESTAMPING_OLD::
-struct scm_timestamping {
- struct timespec ts[3];
-};
+ struct scm_timestamping {
+ struct timespec ts[3];
+ };
-For SO_TIMESTAMPING_NEW:
+For SO_TIMESTAMPING_NEW::
-struct scm_timestamping64 {
- struct __kernel_timespec ts[3];
+ struct scm_timestamping64 {
+ struct __kernel_timespec ts[3];
Always use SO_TIMESTAMPING_NEW timestamp to always get timestamp in
struct scm_timestamping64 format.
@@ -377,6 +386,7 @@
on hardware transmit timestamps.
2.1.1 Transmit timestamps with MSG_ERRQUEUE
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For transmit timestamps the outgoing packet is looped back to the
socket's error queue with the send timestamp(s) attached. A process
@@ -393,6 +403,7 @@
2.1.1.2 Timestamp types
+~~~~~~~~~~~~~~~~~~~~~~~
The semantics of the three struct timespec are defined by field
ee_info in the extended error structure. It contains a value of
@@ -408,6 +419,7 @@
2.1.1.3 Fragmentation
+~~~~~~~~~~~~~~~~~~~~~
Fragmentation of outgoing datagrams is rare, but is possible, e.g., by
explicitly disabling PMTU discovery. If an outgoing packet is fragmented,
@@ -416,6 +428,7 @@
2.1.1.4 Packet Payload
+~~~~~~~~~~~~~~~~~~~~~~
The calling application is often not interested in receiving the whole
packet payload that it passed to the stack originally: the socket
@@ -427,6 +440,7 @@
2.1.1.5 Blocking Read
+~~~~~~~~~~~~~~~~~~~~~
Reading from the error queue is always a non-blocking operation. To
block waiting on a timestamp, use poll or select. poll() will return
@@ -436,6 +450,7 @@
2.1.2 Receive timestamps
+^^^^^^^^^^^^^^^^^^^^^^^^
On reception, there is no reason to read from the socket error queue.
The SCM_TIMESTAMPING ancillary data is sent along with the packet data
@@ -447,16 +462,17 @@
3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP
+=======================================================================
Hardware time stamping must also be initialized for each device driver
that is expected to do hardware time stamping. The parameter is defined in
-include/uapi/linux/net_tstamp.h as:
+include/uapi/linux/net_tstamp.h as::
-struct hwtstamp_config {
- int flags; /* no flags defined right now, must be zero */
- int tx_type; /* HWTSTAMP_TX_* */
- int rx_filter; /* HWTSTAMP_FILTER_* */
-};
+ struct hwtstamp_config {
+ int flags; /* no flags defined right now, must be zero */
+ int tx_type; /* HWTSTAMP_TX_* */
+ int rx_filter; /* HWTSTAMP_FILTER_* */
+ };
Desired behavior is passed into the kernel and to a specific device by
calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose
@@ -487,44 +503,47 @@
structure to ioctl(SIOCGHWTSTAMP) in the same way. However, this has
not been implemented in all drivers.
-/* possible values for hwtstamp_config->tx_type */
-enum {
- /*
- * no outgoing packet will need hardware time stamping;
- * should a packet arrive which asks for it, no hardware
- * time stamping will be done
- */
- HWTSTAMP_TX_OFF,
+::
- /*
- * enables hardware time stamping for outgoing packets;
- * the sender of the packet decides which are to be
- * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE
- * before sending the packet
- */
- HWTSTAMP_TX_ON,
-};
+ /* possible values for hwtstamp_config->tx_type */
+ enum {
+ /*
+ * no outgoing packet will need hardware time stamping;
+ * should a packet arrive which asks for it, no hardware
+ * time stamping will be done
+ */
+ HWTSTAMP_TX_OFF,
-/* possible values for hwtstamp_config->rx_filter */
-enum {
- /* time stamp no incoming packet at all */
- HWTSTAMP_FILTER_NONE,
+ /*
+ * enables hardware time stamping for outgoing packets;
+ * the sender of the packet decides which are to be
+ * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE
+ * before sending the packet
+ */
+ HWTSTAMP_TX_ON,
+ };
- /* time stamp any incoming packet */
- HWTSTAMP_FILTER_ALL,
+ /* possible values for hwtstamp_config->rx_filter */
+ enum {
+ /* time stamp no incoming packet at all */
+ HWTSTAMP_FILTER_NONE,
- /* return value: time stamp all packets requested plus some others */
- HWTSTAMP_FILTER_SOME,
+ /* time stamp any incoming packet */
+ HWTSTAMP_FILTER_ALL,
- /* PTP v1, UDP, any kind of event packet */
- HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
+ /* return value: time stamp all packets requested plus some others */
+ HWTSTAMP_FILTER_SOME,
- /* for the complete list of values, please check
- * the include file include/uapi/linux/net_tstamp.h
- */
-};
+ /* PTP v1, UDP, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
+
+ /* for the complete list of values, please check
+ * the include file include/uapi/linux/net_tstamp.h
+ */
+ };
3.1 Hardware Timestamping Implementation: Device Drivers
+--------------------------------------------------------
A driver which supports hardware time stamping must support the
SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with
@@ -533,22 +552,23 @@
Time stamps for received packets must be stored in the skb. To get a pointer
to the shared time stamp structure of the skb call skb_hwtstamps(). Then
-set the time stamps in the structure:
+set the time stamps in the structure::
-struct skb_shared_hwtstamps {
- /* hardware time stamp transformed into duration
- * since arbitrary point in time
- */
- ktime_t hwtstamp;
-};
+ struct skb_shared_hwtstamps {
+ /* hardware time stamp transformed into duration
+ * since arbitrary point in time
+ */
+ ktime_t hwtstamp;
+ };
Time stamps for outgoing packets are to be generated as follows:
+
- In hard_start_xmit(), check if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)
is set no-zero. If yes, then the driver is expected to do hardware time
stamping.
- If this is possible for the skb and requested, then declare
that the driver is doing the time stamping by setting the flag
- SKBTX_IN_PROGRESS in skb_shinfo(skb)->tx_flags , e.g. with
+ SKBTX_IN_PROGRESS in skb_shinfo(skb)->tx_flags , e.g. with::
skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
diff --git a/Documentation/networking/tproxy.txt b/Documentation/networking/tproxy.rst
similarity index 70%
rename from Documentation/networking/tproxy.txt
rename to Documentation/networking/tproxy.rst
index b9a1888..00dc3a1 100644
--- a/Documentation/networking/tproxy.txt
+++ b/Documentation/networking/tproxy.rst
@@ -1,3 +1,6 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
Transparent proxy support
=========================
@@ -11,39 +14,39 @@
================================
The idea is that you identify packets with destination address matching a local
-socket on your box, set the packet mark to a certain value:
+socket on your box, set the packet mark to a certain value::
-# iptables -t mangle -N DIVERT
-# iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT
-# iptables -t mangle -A DIVERT -j MARK --set-mark 1
-# iptables -t mangle -A DIVERT -j ACCEPT
+ # iptables -t mangle -N DIVERT
+ # iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT
+ # iptables -t mangle -A DIVERT -j MARK --set-mark 1
+ # iptables -t mangle -A DIVERT -j ACCEPT
-Alternatively you can do this in nft with the following commands:
+Alternatively you can do this in nft with the following commands::
-# nft add table filter
-# nft add chain filter divert "{ type filter hook prerouting priority -150; }"
-# nft add rule filter divert meta l4proto tcp socket transparent 1 meta mark set 1 accept
+ # nft add table filter
+ # nft add chain filter divert "{ type filter hook prerouting priority -150; }"
+ # nft add rule filter divert meta l4proto tcp socket transparent 1 meta mark set 1 accept
And then match on that value using policy routing to have those packets
-delivered locally:
+delivered locally::
-# ip rule add fwmark 1 lookup 100
-# ip route add local 0.0.0.0/0 dev lo table 100
+ # ip rule add fwmark 1 lookup 100
+ # ip route add local 0.0.0.0/0 dev lo table 100
Because of certain restrictions in the IPv4 routing output code you'll have to
modify your application to allow it to send datagrams _from_ non-local IP
addresses. All you have to do is enable the (SOL_IP, IP_TRANSPARENT) socket
-option before calling bind:
+option before calling bind::
-fd = socket(AF_INET, SOCK_STREAM, 0);
-/* - 8< -*/
-int value = 1;
-setsockopt(fd, SOL_IP, IP_TRANSPARENT, &value, sizeof(value));
-/* - 8< -*/
-name.sin_family = AF_INET;
-name.sin_port = htons(0xCAFE);
-name.sin_addr.s_addr = htonl(0xDEADBEEF);
-bind(fd, &name, sizeof(name));
+ fd = socket(AF_INET, SOCK_STREAM, 0);
+ /* - 8< -*/
+ int value = 1;
+ setsockopt(fd, SOL_IP, IP_TRANSPARENT, &value, sizeof(value));
+ /* - 8< -*/
+ name.sin_family = AF_INET;
+ name.sin_port = htons(0xCAFE);
+ name.sin_addr.s_addr = htonl(0xDEADBEEF);
+ bind(fd, &name, sizeof(name));
A trivial patch for netcat is available here:
http://people.netfilter.org/hidden/tproxy/netcat-ip_transparent-support.patch
@@ -61,10 +64,10 @@
getting the original destination address is racy.)
The 'TPROXY' target provides similar functionality without relying on NAT. Simply
-add rules like this to the iptables ruleset above:
+add rules like this to the iptables ruleset above::
-# iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY \
- --tproxy-mark 0x1/0x1 --on-port 50080
+ # iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY \
+ --tproxy-mark 0x1/0x1 --on-port 50080
Or the following rule to nft:
@@ -82,10 +85,12 @@
====================================
To use tproxy you'll need to have the following modules compiled for iptables:
+
- NETFILTER_XT_MATCH_SOCKET
- NETFILTER_XT_TARGET_TPROXY
Or the floowing modules for nf_tables:
+
- NFT_SOCKET
- NFT_TPROXY
diff --git a/MAINTAINERS b/MAINTAINERS
index 3a5f52a..0ac9cec 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -193,7 +193,7 @@
T: git git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git
T: git git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git
F: Documentation/driver-api/80211/cfg80211.rst
-F: Documentation/networking/regulatory.txt
+F: Documentation/networking/regulatory.rst
F: include/linux/ieee80211.h
F: include/net/cfg80211.h
F: include/net/ieee80211_radiotap.h
@@ -9515,7 +9515,7 @@
LAPB module
L: linux-x25@vger.kernel.org
S: Orphan
-F: Documentation/networking/lapb-module.txt
+F: Documentation/networking/lapb-module.rst
F: include/*/lapb.h
F: net/lapb/
@@ -10079,7 +10079,7 @@
W: https://wireless.wiki.kernel.org/
T: git git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git
T: git git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git
-F: Documentation/networking/mac80211-injection.txt
+F: Documentation/networking/mac80211-injection.rst
F: Documentation/networking/mac80211_hwsim/mac80211_hwsim.rst
F: drivers/net/wireless/mac80211_hwsim.[ch]
F: include/net/mac80211.h
@@ -13262,7 +13262,7 @@
PHONET PROTOCOL
M: Remi Denis-Courmont <courmisch@gmail.com>
S: Supported
-F: Documentation/networking/phonet.txt
+F: Documentation/networking/phonet.rst
F: include/linux/phonet.h
F: include/net/phonet/
F: include/uapi/linux/phonet.h
@@ -14219,7 +14219,7 @@
L: rds-devel@oss.oracle.com (moderated for non-subscribers)
S: Supported
W: https://oss.oracle.com/projects/rds/
-F: Documentation/networking/rds.txt
+F: Documentation/networking/rds.rst
F: net/rds/
RDT - RESOURCE ALLOCATION
@@ -14593,7 +14593,7 @@
L: linux-afs@lists.infradead.org
S: Supported
W: https://www.infradead.org/~dhowells/kafs/
-F: Documentation/networking/rxrpc.txt
+F: Documentation/networking/rxrpc.rst
F: include/keys/rxrpc-type.h
F: include/net/af_rxrpc.h
F: include/trace/events/rxrpc.h
@@ -14999,7 +14999,7 @@
L: linux-sctp@vger.kernel.org
S: Maintained
W: http://lksctp.sourceforge.net
-F: Documentation/networking/sctp.txt
+F: Documentation/networking/sctp.rst
F: include/linux/sctp.h
F: include/net/sctp/
F: include/uapi/linux/sctp.h
diff --git a/arch/arm/boot/dts/qcom-ipq4019.dtsi b/arch/arm/boot/dts/qcom-ipq4019.dtsi
index bfa9ce4..b9839f8 100644
--- a/arch/arm/boot/dts/qcom-ipq4019.dtsi
+++ b/arch/arm/boot/dts/qcom-ipq4019.dtsi
@@ -576,5 +576,33 @@ wifi1: wifi@a800000 {
"legacy";
status = "disabled";
};
+
+ mdio: mdio@90000 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ compatible = "qcom,ipq4019-mdio";
+ reg = <0x90000 0x64>;
+ status = "disabled";
+
+ ethphy0: ethernet-phy@0 {
+ reg = <0>;
+ };
+
+ ethphy1: ethernet-phy@1 {
+ reg = <1>;
+ };
+
+ ethphy2: ethernet-phy@2 {
+ reg = <2>;
+ };
+
+ ethphy3: ethernet-phy@3 {
+ reg = <3>;
+ };
+
+ ethphy4: ethernet-phy@4 {
+ reg = <4>;
+ };
+ };
};
};
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index c822f4a..ad64be9 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -302,7 +302,7 @@
tristate "Network console logging support"
---help---
If you want to log kernel messages over the network, enable this.
- See <file:Documentation/networking/netconsole.txt> for details.
+ See <file:Documentation/networking/netconsole.rst> for details.
config NETCONSOLE_DYNAMIC
bool "Dynamic reconfiguration of logging targets"
@@ -312,7 +312,7 @@
This option enables the ability to dynamically reconfigure target
parameters (interface, IP addresses, port numbers, MAC addresses)
at runtime through a userspace interface exported using configfs.
- See <file:Documentation/networking/netconsole.txt> for details.
+ See <file:Documentation/networking/netconsole.rst> for details.
config NETPOLL
def_bool NETCONSOLE
diff --git a/drivers/net/appletalk/Kconfig b/drivers/net/appletalk/Kconfig
index ccde647..10589a8 100644
--- a/drivers/net/appletalk/Kconfig
+++ b/drivers/net/appletalk/Kconfig
@@ -48,7 +48,7 @@
If you are in doubt, this card is the one with the 65C02 chip on it.
You also need version 1.3.3 or later of the netatalk package.
This driver is experimental, which means that it may not work.
- See the file <file:Documentation/networking/ltpc.txt>.
+ See the file <file:Documentation/networking/ltpc.rst>.
config COPS
tristate "COPS LocalTalk PC support"
diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
index 351d028..ad614d7 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
@@ -2,7 +2,7 @@
/*
* Broadcom GENET (Gigabit Ethernet) controller driver
*
- * Copyright (c) 2014-2019 Broadcom
+ * Copyright (c) 2014-2020 Broadcom
*/
#define pr_fmt(fmt) "bcmgenet: " fmt
@@ -65,6 +65,9 @@
#define GENET_RDMA_REG_OFF (priv->hw_params->rdma_offset + \
TOTAL_DESC * DMA_DESC_SIZE)
+/* Forward declarations */
+static void bcmgenet_set_rx_mode(struct net_device *dev);
+
static inline void bcmgenet_writel(u32 value, void __iomem *offset)
{
/* MIPS chips strapped for BE will automagically configure the
@@ -456,6 +459,384 @@ static inline void bcmgenet_rdma_ring_writel(struct bcmgenet_priv *priv,
genet_dma_ring_regs[r]);
}
+static bool bcmgenet_hfb_is_filter_enabled(struct bcmgenet_priv *priv,
+ u32 f_index)
+{
+ u32 offset;
+ u32 reg;
+
+ offset = HFB_FLT_ENABLE_V3PLUS + (f_index < 32) * sizeof(u32);
+ reg = bcmgenet_hfb_reg_readl(priv, offset);
+ return !!(reg & (1 << (f_index % 32)));
+}
+
+static void bcmgenet_hfb_enable_filter(struct bcmgenet_priv *priv, u32 f_index)
+{
+ u32 offset;
+ u32 reg;
+
+ offset = HFB_FLT_ENABLE_V3PLUS + (f_index < 32) * sizeof(u32);
+ reg = bcmgenet_hfb_reg_readl(priv, offset);
+ reg |= (1 << (f_index % 32));
+ bcmgenet_hfb_reg_writel(priv, reg, offset);
+ reg = bcmgenet_hfb_reg_readl(priv, HFB_CTRL);
+ reg |= RBUF_HFB_EN;
+ bcmgenet_hfb_reg_writel(priv, reg, HFB_CTRL);
+}
+
+static void bcmgenet_hfb_disable_filter(struct bcmgenet_priv *priv, u32 f_index)
+{
+ u32 offset, reg, reg1;
+
+ offset = HFB_FLT_ENABLE_V3PLUS;
+ reg = bcmgenet_hfb_reg_readl(priv, offset);
+ reg1 = bcmgenet_hfb_reg_readl(priv, offset + sizeof(u32));
+ if (f_index < 32) {
+ reg1 &= ~(1 << (f_index % 32));
+ bcmgenet_hfb_reg_writel(priv, reg1, offset + sizeof(u32));
+ } else {
+ reg &= ~(1 << (f_index % 32));
+ bcmgenet_hfb_reg_writel(priv, reg, offset);
+ }
+ if (!reg && !reg1) {
+ reg = bcmgenet_hfb_reg_readl(priv, HFB_CTRL);
+ reg &= ~RBUF_HFB_EN;
+ bcmgenet_hfb_reg_writel(priv, reg, HFB_CTRL);
+ }
+}
+
+static void bcmgenet_hfb_set_filter_rx_queue_mapping(struct bcmgenet_priv *priv,
+ u32 f_index, u32 rx_queue)
+{
+ u32 offset;
+ u32 reg;
+
+ offset = f_index / 8;
+ reg = bcmgenet_rdma_readl(priv, DMA_INDEX2RING_0 + offset);
+ reg &= ~(0xF << (4 * (f_index % 8)));
+ reg |= ((rx_queue & 0xF) << (4 * (f_index % 8)));
+ bcmgenet_rdma_writel(priv, reg, DMA_INDEX2RING_0 + offset);
+}
+
+static void bcmgenet_hfb_set_filter_length(struct bcmgenet_priv *priv,
+ u32 f_index, u32 f_length)
+{
+ u32 offset;
+ u32 reg;
+
+ offset = HFB_FLT_LEN_V3PLUS +
+ ((priv->hw_params->hfb_filter_cnt - 1 - f_index) / 4) *
+ sizeof(u32);
+ reg = bcmgenet_hfb_reg_readl(priv, offset);
+ reg &= ~(0xFF << (8 * (f_index % 4)));
+ reg |= ((f_length & 0xFF) << (8 * (f_index % 4)));
+ bcmgenet_hfb_reg_writel(priv, reg, offset);
+}
+
+static int bcmgenet_hfb_find_unused_filter(struct bcmgenet_priv *priv)
+{
+ u32 f_index;
+
+ /* First MAX_NUM_OF_FS_RULES are reserved for Rx NFC filters */
+ for (f_index = MAX_NUM_OF_FS_RULES;
+ f_index < priv->hw_params->hfb_filter_cnt; f_index++)
+ if (!bcmgenet_hfb_is_filter_enabled(priv, f_index))
+ return f_index;
+
+ return -ENOMEM;
+}
+
+static int bcmgenet_hfb_validate_mask(void *mask, size_t size)
+{
+ while (size) {
+ switch (*(unsigned char *)mask++) {
+ case 0x00:
+ case 0x0f:
+ case 0xf0:
+ case 0xff:
+ size--;
+ continue;
+ default:
+ return -EINVAL;
+ }
+ }
+
+ return 0;
+}
+
+#define VALIDATE_MASK(x) \
+ bcmgenet_hfb_validate_mask(&(x), sizeof(x))
+
+static int bcmgenet_hfb_insert_data(u32 *f, int offset,
+ void *val, void *mask, size_t size)
+{
+ int index;
+ u32 tmp;
+
+ index = offset / 2;
+ tmp = f[index];
+
+ while (size--) {
+ if (offset++ & 1) {
+ tmp &= ~0x300FF;
+ tmp |= (*(unsigned char *)val++);
+ switch ((*(unsigned char *)mask++)) {
+ case 0xFF:
+ tmp |= 0x30000;
+ break;
+ case 0xF0:
+ tmp |= 0x20000;
+ break;
+ case 0x0F:
+ tmp |= 0x10000;
+ break;
+ }
+ f[index++] = tmp;
+ if (size)
+ tmp = f[index];
+ } else {
+ tmp &= ~0xCFF00;
+ tmp |= (*(unsigned char *)val++) << 8;
+ switch ((*(unsigned char *)mask++)) {
+ case 0xFF:
+ tmp |= 0xC0000;
+ break;
+ case 0xF0:
+ tmp |= 0x80000;
+ break;
+ case 0x0F:
+ tmp |= 0x40000;
+ break;
+ }
+ if (!size)
+ f[index] = tmp;
+ }
+ }
+
+ return 0;
+}
+
+static void bcmgenet_hfb_set_filter(struct bcmgenet_priv *priv, u32 *f_data,
+ u32 f_length, u32 rx_queue, int f_index)
+{
+ u32 base = f_index * priv->hw_params->hfb_filter_size;
+ int i;
+
+ for (i = 0; i < f_length; i++)
+ bcmgenet_hfb_writel(priv, f_data[i], (base + i) * sizeof(u32));
+
+ bcmgenet_hfb_set_filter_length(priv, f_index, 2 * f_length);
+ bcmgenet_hfb_set_filter_rx_queue_mapping(priv, f_index, rx_queue);
+}
+
+static int bcmgenet_hfb_create_rxnfc_filter(struct bcmgenet_priv *priv,
+ struct bcmgenet_rxnfc_rule *rule)
+{
+ struct ethtool_rx_flow_spec *fs = &rule->fs;
+ int err = 0, offset = 0, f_length = 0;
+ u16 val_16, mask_16;
+ u8 val_8, mask_8;
+ size_t size;
+ u32 *f_data;
+
+ f_data = kcalloc(priv->hw_params->hfb_filter_size, sizeof(u32),
+ GFP_KERNEL);
+ if (!f_data)
+ return -ENOMEM;
+
+ if (fs->flow_type & FLOW_MAC_EXT) {
+ bcmgenet_hfb_insert_data(f_data, 0,
+ &fs->h_ext.h_dest, &fs->m_ext.h_dest,
+ sizeof(fs->h_ext.h_dest));
+ }
+
+ if (fs->flow_type & FLOW_EXT) {
+ if (fs->m_ext.vlan_etype ||
+ fs->m_ext.vlan_tci) {
+ bcmgenet_hfb_insert_data(f_data, 12,
+ &fs->h_ext.vlan_etype,
+ &fs->m_ext.vlan_etype,
+ sizeof(fs->h_ext.vlan_etype));
+ bcmgenet_hfb_insert_data(f_data, 14,
+ &fs->h_ext.vlan_tci,
+ &fs->m_ext.vlan_tci,
+ sizeof(fs->h_ext.vlan_tci));
+ offset += VLAN_HLEN;
+ f_length += DIV_ROUND_UP(VLAN_HLEN, 2);
+ }
+ }
+
+ switch (fs->flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
+ case ETHER_FLOW:
+ f_length += DIV_ROUND_UP(ETH_HLEN, 2);
+ bcmgenet_hfb_insert_data(f_data, 0,
+ &fs->h_u.ether_spec.h_dest,
+ &fs->m_u.ether_spec.h_dest,
+ sizeof(fs->h_u.ether_spec.h_dest));
+ bcmgenet_hfb_insert_data(f_data, ETH_ALEN,
+ &fs->h_u.ether_spec.h_source,
+ &fs->m_u.ether_spec.h_source,
+ sizeof(fs->h_u.ether_spec.h_source));
+ bcmgenet_hfb_insert_data(f_data, (2 * ETH_ALEN) + offset,
+ &fs->h_u.ether_spec.h_proto,
+ &fs->m_u.ether_spec.h_proto,
+ sizeof(fs->h_u.ether_spec.h_proto));
+ break;
+ case IP_USER_FLOW:
+ f_length += DIV_ROUND_UP(ETH_HLEN + 20, 2);
+ /* Specify IP Ether Type */
+ val_16 = htons(ETH_P_IP);
+ mask_16 = 0xFFFF;
+ bcmgenet_hfb_insert_data(f_data, (2 * ETH_ALEN) + offset,
+ &val_16, &mask_16, sizeof(val_16));
+ bcmgenet_hfb_insert_data(f_data, 15 + offset,
+ &fs->h_u.usr_ip4_spec.tos,
+ &fs->m_u.usr_ip4_spec.tos,
+ sizeof(fs->h_u.usr_ip4_spec.tos));
+ bcmgenet_hfb_insert_data(f_data, 23 + offset,
+ &fs->h_u.usr_ip4_spec.proto,
+ &fs->m_u.usr_ip4_spec.proto,
+ sizeof(fs->h_u.usr_ip4_spec.proto));
+ bcmgenet_hfb_insert_data(f_data, 26 + offset,
+ &fs->h_u.usr_ip4_spec.ip4src,
+ &fs->m_u.usr_ip4_spec.ip4src,
+ sizeof(fs->h_u.usr_ip4_spec.ip4src));
+ bcmgenet_hfb_insert_data(f_data, 30 + offset,
+ &fs->h_u.usr_ip4_spec.ip4dst,
+ &fs->m_u.usr_ip4_spec.ip4dst,
+ sizeof(fs->h_u.usr_ip4_spec.ip4dst));
+ if (!fs->m_u.usr_ip4_spec.l4_4_bytes)
+ break;
+
+ /* Only supports 20 byte IPv4 header */
+ val_8 = 0x45;
+ mask_8 = 0xFF;
+ bcmgenet_hfb_insert_data(f_data, ETH_HLEN + offset,
+ &val_8, &mask_8,
+ sizeof(val_8));
+ size = sizeof(fs->h_u.usr_ip4_spec.l4_4_bytes);
+ bcmgenet_hfb_insert_data(f_data,
+ ETH_HLEN + 20 + offset,
+ &fs->h_u.usr_ip4_spec.l4_4_bytes,
+ &fs->m_u.usr_ip4_spec.l4_4_bytes,
+ size);
+ f_length += DIV_ROUND_UP(size, 2);
+ break;
+ }
+
+ if (!fs->ring_cookie || fs->ring_cookie == RX_CLS_FLOW_WAKE) {
+ /* Ring 0 flows can be handled by the default Descriptor Ring
+ * We'll map them to ring 0, but don't enable the filter
+ */
+ bcmgenet_hfb_set_filter(priv, f_data, f_length, 0,
+ fs->location);
+ rule->state = BCMGENET_RXNFC_STATE_DISABLED;
+ } else {
+ /* Other Rx rings are direct mapped here */
+ bcmgenet_hfb_set_filter(priv, f_data, f_length,
+ fs->ring_cookie, fs->location);
+ bcmgenet_hfb_enable_filter(priv, fs->location);
+ rule->state = BCMGENET_RXNFC_STATE_ENABLED;
+ }
+
+ kfree(f_data);
+
+ return err;
+}
+
+/* bcmgenet_hfb_add_filter
+ *
+ * Add new filter to Hardware Filter Block to match and direct Rx traffic to
+ * desired Rx queue.
+ *
+ * f_data is an array of unsigned 32-bit integers where each 32-bit integer
+ * provides filter data for 2 bytes (4 nibbles) of Rx frame:
+ *
+ * bits 31:20 - unused
+ * bit 19 - nibble 0 match enable
+ * bit 18 - nibble 1 match enable
+ * bit 17 - nibble 2 match enable
+ * bit 16 - nibble 3 match enable
+ * bits 15:12 - nibble 0 data
+ * bits 11:8 - nibble 1 data
+ * bits 7:4 - nibble 2 data
+ * bits 3:0 - nibble 3 data
+ *
+ * Example:
+ * In order to match:
+ * - Ethernet frame type = 0x0800 (IP)
+ * - IP version field = 4
+ * - IP protocol field = 0x11 (UDP)
+ *
+ * The following filter is needed:
+ * u32 hfb_filter_ipv4_udp[] = {
+ * Rx frame offset 0x00: 0x00000000, 0x00000000, 0x00000000, 0x00000000,
+ * Rx frame offset 0x08: 0x00000000, 0x00000000, 0x000F0800, 0x00084000,
+ * Rx frame offset 0x10: 0x00000000, 0x00000000, 0x00000000, 0x00030011,
+ * };
+ *
+ * To add the filter to HFB and direct the traffic to Rx queue 0, call:
+ * bcmgenet_hfb_add_filter(priv, hfb_filter_ipv4_udp,
+ * ARRAY_SIZE(hfb_filter_ipv4_udp), 0);
+ */
+int bcmgenet_hfb_add_filter(struct bcmgenet_priv *priv, u32 *f_data,
+ u32 f_length, u32 rx_queue)
+{
+ int f_index;
+
+ f_index = bcmgenet_hfb_find_unused_filter(priv);
+ if (f_index < 0)
+ return -ENOMEM;
+
+ if (f_length > priv->hw_params->hfb_filter_size)
+ return -EINVAL;
+
+ bcmgenet_hfb_set_filter(priv, f_data, f_length, rx_queue, f_index);
+ bcmgenet_hfb_enable_filter(priv, f_index);
+
+ return 0;
+}
+
+/* bcmgenet_hfb_clear
+ *
+ * Clear Hardware Filter Block and disable all filtering.
+ */
+static void bcmgenet_hfb_clear(struct bcmgenet_priv *priv)
+{
+ u32 i;
+
+ bcmgenet_hfb_reg_writel(priv, 0x0, HFB_CTRL);
+ bcmgenet_hfb_reg_writel(priv, 0x0, HFB_FLT_ENABLE_V3PLUS);
+ bcmgenet_hfb_reg_writel(priv, 0x0, HFB_FLT_ENABLE_V3PLUS + 4);
+
+ for (i = DMA_INDEX2RING_0; i <= DMA_INDEX2RING_7; i++)
+ bcmgenet_rdma_writel(priv, 0x0, i);
+
+ for (i = 0; i < (priv->hw_params->hfb_filter_cnt / 4); i++)
+ bcmgenet_hfb_reg_writel(priv, 0x0,
+ HFB_FLT_LEN_V3PLUS + i * sizeof(u32));
+
+ for (i = 0; i < priv->hw_params->hfb_filter_cnt *
+ priv->hw_params->hfb_filter_size; i++)
+ bcmgenet_hfb_writel(priv, 0x0, i * sizeof(u32));
+}
+
+static void bcmgenet_hfb_init(struct bcmgenet_priv *priv)
+{
+ int i;
+
+ if (GENET_IS_V1(priv) || GENET_IS_V2(priv))
+ return;
+
+ INIT_LIST_HEAD(&priv->rxnfc_list);
+ for (i = 0; i < MAX_NUM_OF_FS_RULES; i++) {
+ INIT_LIST_HEAD(&priv->rxnfc_rules[i].list);
+ priv->rxnfc_rules[i].state = BCMGENET_RXNFC_STATE_UNUSED;
+ }
+
+ bcmgenet_hfb_clear(priv);
+}
+
static int bcmgenet_begin(struct net_device *dev)
{
struct bcmgenet_priv *priv = netdev_priv(dev);
@@ -1040,6 +1421,229 @@ static int bcmgenet_set_eee(struct net_device *dev, struct ethtool_eee *e)
return phy_ethtool_set_eee(dev->phydev, e);
}
+static int bcmgenet_validate_flow(struct net_device *dev,
+ struct ethtool_rxnfc *cmd)
+{
+ struct ethtool_usrip4_spec *l4_mask;
+ struct ethhdr *eth_mask;
+
+ if (cmd->fs.location >= MAX_NUM_OF_FS_RULES) {
+ netdev_err(dev, "rxnfc: Invalid location (%d)\n",
+ cmd->fs.location);
+ return -EINVAL;
+ }
+
+ switch (cmd->fs.flow_type & ~(FLOW_EXT | FLOW_MAC_EXT)) {
+ case IP_USER_FLOW:
+ l4_mask = &cmd->fs.m_u.usr_ip4_spec;
+ /* don't allow mask which isn't valid */
+ if (VALIDATE_MASK(l4_mask->ip4src) ||
+ VALIDATE_MASK(l4_mask->ip4dst) ||
+ VALIDATE_MASK(l4_mask->l4_4_bytes) ||
+ VALIDATE_MASK(l4_mask->proto) ||
+ VALIDATE_MASK(l4_mask->ip_ver) ||
+ VALIDATE_MASK(l4_mask->tos)) {
+ netdev_err(dev, "rxnfc: Unsupported mask\n");
+ return -EINVAL;
+ }
+ break;
+ case ETHER_FLOW:
+ eth_mask = &cmd->fs.m_u.ether_spec;
+ /* don't allow mask which isn't valid */
+ if (VALIDATE_MASK(eth_mask->h_source) ||
+ VALIDATE_MASK(eth_mask->h_source) ||
+ VALIDATE_MASK(eth_mask->h_proto)) {
+ netdev_err(dev, "rxnfc: Unsupported mask\n");
+ return -EINVAL;
+ }
+ break;
+ default:
+ netdev_err(dev, "rxnfc: Unsupported flow type (0x%x)\n",
+ cmd->fs.flow_type);
+ return -EINVAL;
+ }
+
+ if ((cmd->fs.flow_type & FLOW_EXT)) {
+ /* don't allow mask which isn't valid */
+ if (VALIDATE_MASK(cmd->fs.m_ext.vlan_etype) ||
+ VALIDATE_MASK(cmd->fs.m_ext.vlan_tci)) {
+ netdev_err(dev, "rxnfc: Unsupported mask\n");
+ return -EINVAL;
+ }
+ if (cmd->fs.m_ext.data[0] || cmd->fs.m_ext.data[1]) {
+ netdev_err(dev, "rxnfc: user-def not supported\n");
+ return -EINVAL;
+ }
+ }
+
+ if ((cmd->fs.flow_type & FLOW_MAC_EXT)) {
+ /* don't allow mask which isn't valid */
+ if (VALIDATE_MASK(cmd->fs.m_ext.h_dest)) {
+ netdev_err(dev, "rxnfc: Unsupported mask\n");
+ return -EINVAL;
+ }
+ }
+
+ return 0;
+}
+
+static int bcmgenet_insert_flow(struct net_device *dev,
+ struct ethtool_rxnfc *cmd)
+{
+ struct bcmgenet_priv *priv = netdev_priv(dev);
+ struct bcmgenet_rxnfc_rule *loc_rule;
+ int err;
+
+ if (priv->hw_params->hfb_filter_size < 128) {
+ netdev_err(dev, "rxnfc: Not supported by this device\n");
+ return -EINVAL;
+ }
+
+ if (cmd->fs.ring_cookie > priv->hw_params->rx_queues &&
+ cmd->fs.ring_cookie != RX_CLS_FLOW_WAKE) {
+ netdev_err(dev, "rxnfc: Unsupported action (%llu)\n",
+ cmd->fs.ring_cookie);
+ return -EINVAL;
+ }
+
+ err = bcmgenet_validate_flow(dev, cmd);
+ if (err)
+ return err;
+
+ loc_rule = &priv->rxnfc_rules[cmd->fs.location];
+ if (loc_rule->state == BCMGENET_RXNFC_STATE_ENABLED)
+ bcmgenet_hfb_disable_filter(priv, cmd->fs.location);
+ if (loc_rule->state != BCMGENET_RXNFC_STATE_UNUSED)
+ list_del(&loc_rule->list);
+ loc_rule->state = BCMGENET_RXNFC_STATE_UNUSED;
+ memcpy(&loc_rule->fs, &cmd->fs,
+ sizeof(struct ethtool_rx_flow_spec));
+
+ err = bcmgenet_hfb_create_rxnfc_filter(priv, loc_rule);
+ if (err) {
+ netdev_err(dev, "rxnfc: Could not install rule (%d)\n",
+ err);
+ return err;
+ }
+
+ list_add_tail(&loc_rule->list, &priv->rxnfc_list);
+
+ return 0;
+}
+
+static int bcmgenet_delete_flow(struct net_device *dev,
+ struct ethtool_rxnfc *cmd)
+{
+ struct bcmgenet_priv *priv = netdev_priv(dev);
+ struct bcmgenet_rxnfc_rule *rule;
+ int err = 0;
+
+ if (cmd->fs.location >= MAX_NUM_OF_FS_RULES)
+ return -EINVAL;
+
+ rule = &priv->rxnfc_rules[cmd->fs.location];
+ if (rule->state == BCMGENET_RXNFC_STATE_UNUSED) {
+ err = -ENOENT;
+ goto out;
+ }
+
+ if (rule->state == BCMGENET_RXNFC_STATE_ENABLED)
+ bcmgenet_hfb_disable_filter(priv, cmd->fs.location);
+ if (rule->state != BCMGENET_RXNFC_STATE_UNUSED)
+ list_del(&rule->list);
+ rule->state = BCMGENET_RXNFC_STATE_UNUSED;
+ memset(&rule->fs, 0, sizeof(struct ethtool_rx_flow_spec));
+
+out:
+ return err;
+}
+
+static int bcmgenet_set_rxnfc(struct net_device *dev, struct ethtool_rxnfc *cmd)
+{
+ struct bcmgenet_priv *priv = netdev_priv(dev);
+ int err = 0;
+
+ switch (cmd->cmd) {
+ case ETHTOOL_SRXCLSRLINS:
+ err = bcmgenet_insert_flow(dev, cmd);
+ break;
+ case ETHTOOL_SRXCLSRLDEL:
+ err = bcmgenet_delete_flow(dev, cmd);
+ break;
+ default:
+ netdev_warn(priv->dev, "Unsupported ethtool command. (%d)\n",
+ cmd->cmd);
+ return -EINVAL;
+ }
+
+ return err;
+}
+
+static int bcmgenet_get_flow(struct net_device *dev, struct ethtool_rxnfc *cmd,
+ int loc)
+{
+ struct bcmgenet_priv *priv = netdev_priv(dev);
+ struct bcmgenet_rxnfc_rule *rule;
+ int err = 0;
+
+ if (loc < 0 || loc >= MAX_NUM_OF_FS_RULES)
+ return -EINVAL;
+
+ rule = &priv->rxnfc_rules[loc];
+ if (rule->state == BCMGENET_RXNFC_STATE_UNUSED)
+ err = -ENOENT;
+ else
+ memcpy(&cmd->fs, &rule->fs,
+ sizeof(struct ethtool_rx_flow_spec));
+
+ return err;
+}
+
+static int bcmgenet_get_num_flows(struct bcmgenet_priv *priv)
+{
+ struct list_head *pos;
+ int res = 0;
+
+ list_for_each(pos, &priv->rxnfc_list)
+ res++;
+
+ return res;
+}
+
+static int bcmgenet_get_rxnfc(struct net_device *dev, struct ethtool_rxnfc *cmd,
+ u32 *rule_locs)
+{
+ struct bcmgenet_priv *priv = netdev_priv(dev);
+ struct bcmgenet_rxnfc_rule *rule;
+ int err = 0;
+ int i = 0;
+
+ switch (cmd->cmd) {
+ case ETHTOOL_GRXRINGS:
+ cmd->data = priv->hw_params->rx_queues ?: 1;
+ break;
+ case ETHTOOL_GRXCLSRLCNT:
+ cmd->rule_cnt = bcmgenet_get_num_flows(priv);
+ cmd->data = MAX_NUM_OF_FS_RULES;
+ break;
+ case ETHTOOL_GRXCLSRULE:
+ err = bcmgenet_get_flow(dev, cmd, cmd->fs.location);
+ break;
+ case ETHTOOL_GRXCLSRLALL:
+ list_for_each_entry(rule, &priv->rxnfc_list, list)
+ if (i < cmd->rule_cnt)
+ rule_locs[i++] = rule->fs.location;
+ cmd->rule_cnt = i;
+ cmd->data = MAX_NUM_OF_FS_RULES;
+ break;
+ default:
+ err = -EOPNOTSUPP;
+ break;
+ }
+
+ return err;
+}
+
/* standard ethtool support functions. */
static const struct ethtool_ops bcmgenet_ethtool_ops = {
.supported_coalesce_params = ETHTOOL_COALESCE_RX_USECS |
@@ -1064,6 +1668,8 @@ static const struct ethtool_ops bcmgenet_ethtool_ops = {
.get_link_ksettings = bcmgenet_get_link_ksettings,
.set_link_ksettings = bcmgenet_set_link_ksettings,
.get_ts_info = ethtool_op_get_ts_info,
+ .get_rxnfc = bcmgenet_get_rxnfc,
+ .set_rxnfc = bcmgenet_set_rxnfc,
};
/* Power down the unimac, based on mode. */
@@ -2756,43 +3362,12 @@ static void bcmgenet_enable_dma(struct bcmgenet_priv *priv, u32 dma_ctrl)
bcmgenet_tdma_writel(priv, reg, DMA_CTRL);
}
-/* bcmgenet_hfb_clear
- *
- * Clear Hardware Filter Block and disable all filtering.
- */
-static void bcmgenet_hfb_clear(struct bcmgenet_priv *priv)
-{
- u32 i;
-
- bcmgenet_hfb_reg_writel(priv, 0x0, HFB_CTRL);
- bcmgenet_hfb_reg_writel(priv, 0x0, HFB_FLT_ENABLE_V3PLUS);
- bcmgenet_hfb_reg_writel(priv, 0x0, HFB_FLT_ENABLE_V3PLUS + 4);
-
- for (i = DMA_INDEX2RING_0; i <= DMA_INDEX2RING_7; i++)
- bcmgenet_rdma_writel(priv, 0x0, i);
-
- for (i = 0; i < (priv->hw_params->hfb_filter_cnt / 4); i++)
- bcmgenet_hfb_reg_writel(priv, 0x0,
- HFB_FLT_LEN_V3PLUS + i * sizeof(u32));
-
- for (i = 0; i < priv->hw_params->hfb_filter_cnt *
- priv->hw_params->hfb_filter_size; i++)
- bcmgenet_hfb_writel(priv, 0x0, i * sizeof(u32));
-}
-
-static void bcmgenet_hfb_init(struct bcmgenet_priv *priv)
-{
- if (GENET_IS_V1(priv) || GENET_IS_V2(priv))
- return;
-
- bcmgenet_hfb_clear(priv);
-}
-
static void bcmgenet_netif_start(struct net_device *dev)
{
struct bcmgenet_priv *priv = netdev_priv(dev);
/* Start the network engine */
+ bcmgenet_set_rx_mode(dev);
bcmgenet_enable_rx_napi(priv);
umac_enable_set(priv, CMD_TX_EN | CMD_RX_EN, true);
@@ -3604,8 +4179,8 @@ static int bcmgenet_resume(struct device *d)
struct net_device *dev = dev_get_drvdata(d);
struct bcmgenet_priv *priv = netdev_priv(dev);
unsigned long dma_ctrl;
+ u32 offset, reg;
int ret;
- u32 reg;
if (!netif_running(dev))
return 0;
@@ -3615,6 +4190,10 @@ static int bcmgenet_resume(struct device *d)
if (ret)
return ret;
+ /* From WOL-enabled suspend, switch to regular clock */
+ if (device_may_wakeup(d) && priv->wolopts)
+ bcmgenet_power_up(priv, GENET_POWER_WOL_MAGIC);
+
/* If this is an internal GPHY, power it back on now, before UniMAC is
* brought out of reset as absolutely no UniMAC activity is allowed
*/
@@ -3625,10 +4204,6 @@ static int bcmgenet_resume(struct device *d)
init_umac(priv);
- /* From WOL-enabled suspend, switch to regular clock */
- if (priv->wolopts)
- clk_disable_unprepare(priv->clk_wol);
-
phy_init_hw(dev->phydev);
/* Speed settings must be restored */
@@ -3640,15 +4215,17 @@ static int bcmgenet_resume(struct device *d)
bcmgenet_set_hw_addr(priv, dev->dev_addr);
+ offset = HFB_FLT_ENABLE_V3PLUS;
+ bcmgenet_hfb_reg_writel(priv, priv->hfb_en[1], offset);
+ bcmgenet_hfb_reg_writel(priv, priv->hfb_en[2], offset + sizeof(u32));
+ bcmgenet_hfb_reg_writel(priv, priv->hfb_en[0], HFB_CTRL);
+
if (priv->internal_phy) {
reg = bcmgenet_ext_readl(priv, EXT_EXT_PWR_MGMT);
reg |= EXT_ENERGY_DET_MASK;
bcmgenet_ext_writel(priv, reg, EXT_EXT_PWR_MGMT);
}
- if (priv->wolopts)
- bcmgenet_power_up(priv, GENET_POWER_WOL_MAGIC);
-
/* Disable RX/TX DMA and flush TX queues */
dma_ctrl = bcmgenet_dma_disable(priv);
@@ -3686,6 +4263,7 @@ static int bcmgenet_suspend(struct device *d)
struct net_device *dev = dev_get_drvdata(d);
struct bcmgenet_priv *priv = netdev_priv(dev);
int ret = 0;
+ u32 offset;
if (!netif_running(dev))
return 0;
@@ -3697,13 +4275,18 @@ static int bcmgenet_suspend(struct device *d)
if (!device_may_wakeup(d))
phy_suspend(dev->phydev);
+ /* Preserve filter state and disable filtering */
+ priv->hfb_en[0] = bcmgenet_hfb_reg_readl(priv, HFB_CTRL);
+ offset = HFB_FLT_ENABLE_V3PLUS;
+ priv->hfb_en[1] = bcmgenet_hfb_reg_readl(priv, offset);
+ priv->hfb_en[2] = bcmgenet_hfb_reg_readl(priv, offset + sizeof(u32));
+ bcmgenet_hfb_reg_writel(priv, 0, HFB_CTRL);
+
/* Prepare the device for Wake-on-LAN and switch to the slow clock */
- if (device_may_wakeup(d) && priv->wolopts) {
+ if (device_may_wakeup(d) && priv->wolopts)
ret = bcmgenet_power_down(priv, GENET_POWER_WOL_MAGIC);
- clk_prepare_enable(priv->clk_wol);
- } else if (priv->internal_phy) {
+ else if (priv->internal_phy)
ret = bcmgenet_power_down(priv, GENET_POWER_PASSIVE);
- }
/* Turn off the clocks */
clk_disable_unprepare(priv->clk);
diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.h b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
index daf8fb2..031d91f 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet.h
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.h
@@ -1,6 +1,6 @@
/* SPDX-License-Identifier: GPL-2.0-only */
/*
- * Copyright (c) 2014-2017 Broadcom
+ * Copyright (c) 2014-2020 Broadcom
*/
#ifndef __BCMGENET_H__
@@ -14,6 +14,7 @@
#include <linux/if_vlan.h>
#include <linux/phy.h>
#include <linux/dim.h>
+#include <linux/ethtool.h>
/* total number of Buffer Descriptors, same for Rx/Tx */
#define TOTAL_DESC 256
@@ -31,6 +32,7 @@
#define DMA_MAX_BURST_LENGTH 0x10
/* misc. configuration */
+#define MAX_NUM_OF_FS_RULES 16
#define CLEAR_ALL_HFB 0xFF
#define DMA_FC_THRESH_HI (TOTAL_DESC >> 4)
#define DMA_FC_THRESH_LO 5
@@ -608,6 +610,18 @@ struct bcmgenet_rx_ring {
struct bcmgenet_priv *priv;
};
+enum bcmgenet_rxnfc_state {
+ BCMGENET_RXNFC_STATE_UNUSED = 0,
+ BCMGENET_RXNFC_STATE_DISABLED,
+ BCMGENET_RXNFC_STATE_ENABLED
+};
+
+struct bcmgenet_rxnfc_rule {
+ struct list_head list;
+ struct ethtool_rx_flow_spec fs;
+ enum bcmgenet_rxnfc_state state;
+};
+
/* device context */
struct bcmgenet_priv {
void __iomem *base;
@@ -626,6 +640,8 @@ struct bcmgenet_priv {
struct enet_cb *rx_cbs;
unsigned int num_rx_bds;
unsigned int rx_buf_len;
+ struct bcmgenet_rxnfc_rule rxnfc_rules[MAX_NUM_OF_FS_RULES];
+ struct list_head rxnfc_list;
struct bcmgenet_rx_ring rx_rings[DESC_INDEX + 1];
@@ -676,6 +692,9 @@ struct bcmgenet_priv {
/* WOL */
struct clk *clk_wol;
u32 wolopts;
+ u8 sopass[SOPASS_MAX];
+ bool wol_active;
+ u32 hfb_en[3];
struct bcmgenet_mib_counters mib;
diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet_wol.c b/drivers/net/ethernet/broadcom/genet/bcmgenet_wol.c
index c9a4369..4b9d65f 100644
--- a/drivers/net/ethernet/broadcom/genet/bcmgenet_wol.c
+++ b/drivers/net/ethernet/broadcom/genet/bcmgenet_wol.c
@@ -2,7 +2,7 @@
/*
* Broadcom GENET (Gigabit Ethernet) Wake-on-LAN support
*
- * Copyright (c) 2014-2017 Broadcom
+ * Copyright (c) 2014-2020 Broadcom
*/
#define pr_fmt(fmt) "bcmgenet_wol: " fmt
@@ -41,18 +41,13 @@
void bcmgenet_get_wol(struct net_device *dev, struct ethtool_wolinfo *wol)
{
struct bcmgenet_priv *priv = netdev_priv(dev);
- u32 reg;
- wol->supported = WAKE_MAGIC | WAKE_MAGICSECURE;
+ wol->supported = WAKE_MAGIC | WAKE_MAGICSECURE | WAKE_FILTER;
wol->wolopts = priv->wolopts;
memset(wol->sopass, 0, sizeof(wol->sopass));
- if (wol->wolopts & WAKE_MAGICSECURE) {
- reg = bcmgenet_umac_readl(priv, UMAC_MPD_PW_MS);
- put_unaligned_be16(reg, &wol->sopass[0]);
- reg = bcmgenet_umac_readl(priv, UMAC_MPD_PW_LS);
- put_unaligned_be32(reg, &wol->sopass[2]);
- }
+ if (wol->wolopts & WAKE_MAGICSECURE)
+ memcpy(wol->sopass, priv->sopass, sizeof(priv->sopass));
}
/* ethtool function - set WOL (Wake on LAN) settings.
@@ -62,25 +57,15 @@ int bcmgenet_set_wol(struct net_device *dev, struct ethtool_wolinfo *wol)
{
struct bcmgenet_priv *priv = netdev_priv(dev);
struct device *kdev = &priv->pdev->dev;
- u32 reg;
if (!device_can_wakeup(kdev))
return -ENOTSUPP;
- if (wol->wolopts & ~(WAKE_MAGIC | WAKE_MAGICSECURE))
+ if (wol->wolopts & ~(WAKE_MAGIC | WAKE_MAGICSECURE | WAKE_FILTER))
return -EINVAL;
- reg = bcmgenet_umac_readl(priv, UMAC_MPD_CTRL);
- if (wol->wolopts & WAKE_MAGICSECURE) {
- bcmgenet_umac_writel(priv, get_unaligned_be16(&wol->sopass[0]),
- UMAC_MPD_PW_MS);
- bcmgenet_umac_writel(priv, get_unaligned_be32(&wol->sopass[2]),
- UMAC_MPD_PW_LS);
- reg |= MPD_PW_EN;
- } else {
- reg &= ~MPD_PW_EN;
- }
- bcmgenet_umac_writel(priv, reg, UMAC_MPD_CTRL);
+ if (wol->wolopts & WAKE_MAGICSECURE)
+ memcpy(priv->sopass, wol->sopass, sizeof(priv->sopass));
/* Flag the device and relevant IRQ as wakeup capable */
if (wol->wolopts) {
@@ -120,12 +105,21 @@ static int bcmgenet_poll_wol_status(struct bcmgenet_priv *priv)
return retries;
}
+static void bcmgenet_set_mpd_password(struct bcmgenet_priv *priv)
+{
+ bcmgenet_umac_writel(priv, get_unaligned_be16(&priv->sopass[0]),
+ UMAC_MPD_PW_MS);
+ bcmgenet_umac_writel(priv, get_unaligned_be32(&priv->sopass[2]),
+ UMAC_MPD_PW_LS);
+}
+
int bcmgenet_wol_power_down_cfg(struct bcmgenet_priv *priv,
enum bcmgenet_power_mode mode)
{
struct net_device *dev = priv->dev;
+ struct bcmgenet_rxnfc_rule *rule;
+ u32 reg, hfb_ctrl_reg, hfb_enable = 0;
int retries = 0;
- u32 reg;
if (mode != GENET_POWER_WOL_MAGIC) {
netif_err(priv, wol, dev, "unsupported mode: %d\n", mode);
@@ -142,22 +136,48 @@ int bcmgenet_wol_power_down_cfg(struct bcmgenet_priv *priv,
bcmgenet_umac_writel(priv, reg, UMAC_CMD);
mdelay(10);
- reg = bcmgenet_umac_readl(priv, UMAC_MPD_CTRL);
- reg |= MPD_EN;
- bcmgenet_umac_writel(priv, reg, UMAC_MPD_CTRL);
+ if (priv->wolopts & (WAKE_MAGIC | WAKE_MAGICSECURE)) {
+ reg = bcmgenet_umac_readl(priv, UMAC_MPD_CTRL);
+ reg |= MPD_EN;
+ if (priv->wolopts & WAKE_MAGICSECURE) {
+ bcmgenet_set_mpd_password(priv);
+ reg |= MPD_PW_EN;
+ }
+ bcmgenet_umac_writel(priv, reg, UMAC_MPD_CTRL);
+ }
+
+ hfb_ctrl_reg = bcmgenet_hfb_reg_readl(priv, HFB_CTRL);
+ if (priv->wolopts & WAKE_FILTER) {
+ list_for_each_entry(rule, &priv->rxnfc_list, list)
+ if (rule->fs.ring_cookie == RX_CLS_FLOW_WAKE)
+ hfb_enable |= (1 << rule->fs.location);
+ reg = (hfb_ctrl_reg & ~RBUF_HFB_EN) | RBUF_ACPI_EN;
+ bcmgenet_hfb_reg_writel(priv, reg, HFB_CTRL);
+ }
/* Do not leave UniMAC in MPD mode only */
retries = bcmgenet_poll_wol_status(priv);
if (retries < 0) {
reg = bcmgenet_umac_readl(priv, UMAC_MPD_CTRL);
- reg &= ~MPD_EN;
+ reg &= ~(MPD_EN | MPD_PW_EN);
bcmgenet_umac_writel(priv, reg, UMAC_MPD_CTRL);
+ bcmgenet_hfb_reg_writel(priv, hfb_ctrl_reg, HFB_CTRL);
return retries;
}
netif_dbg(priv, wol, dev, "MPD WOL-ready status set after %d msec\n",
retries);
+ clk_prepare_enable(priv->clk_wol);
+ priv->wol_active = 1;
+
+ if (hfb_enable) {
+ bcmgenet_hfb_reg_writel(priv, hfb_enable,
+ HFB_FLT_ENABLE_V3PLUS + 4);
+ hfb_ctrl_reg = RBUF_HFB_EN | RBUF_ACPI_EN;
+ bcmgenet_hfb_reg_writel(priv, hfb_ctrl_reg, HFB_CTRL);
+ }
+
/* Enable CRC forward */
reg = bcmgenet_umac_readl(priv, UMAC_CMD);
priv->crc_fwd_en = 1;
@@ -186,12 +206,22 @@ void bcmgenet_wol_power_up_cfg(struct bcmgenet_priv *priv,
return;
}
+ if (!priv->wol_active)
+ return; /* failed to suspend so skip the rest */
+
+ priv->wol_active = 0;
+ clk_disable_unprepare(priv->clk_wol);
+
+ /* Disable Magic Packet Detection */
reg = bcmgenet_umac_readl(priv, UMAC_MPD_CTRL);
- if (!(reg & MPD_EN))
- return; /* already powered up so skip the rest */
- reg &= ~MPD_EN;
+ reg &= ~(MPD_EN | MPD_PW_EN);
bcmgenet_umac_writel(priv, reg, UMAC_MPD_CTRL);
+ /* Disable WAKE_FILTER Detection */
+ reg = bcmgenet_hfb_reg_readl(priv, HFB_CTRL);
+ reg &= ~(RBUF_HFB_EN | RBUF_ACPI_EN);
+ bcmgenet_hfb_reg_writel(priv, reg, HFB_CTRL);
+
/* Disable CRC Forward */
reg = bcmgenet_umac_readl(priv, UMAC_CMD);
reg &= ~CMD_CRC_FWD;
diff --git a/drivers/net/ethernet/faraday/ftmac100.c b/drivers/net/ethernet/faraday/ftmac100.c
index 32cf54f..473b337 100644
--- a/drivers/net/ethernet/faraday/ftmac100.c
+++ b/drivers/net/ethernet/faraday/ftmac100.c
@@ -1057,9 +1057,6 @@ static int ftmac100_probe(struct platform_device *pdev)
struct ftmac100 *priv;
int err;
- if (!pdev)
- return -ENODEV;
-
res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
if (!res)
return -ENXIO;
diff --git a/drivers/net/ethernet/huawei/hinic/hinic_sriov.c b/drivers/net/ethernet/huawei/hinic/hinic_sriov.c
index b24788e..fd4aaf4 100644
--- a/drivers/net/ethernet/huawei/hinic/hinic_sriov.c
+++ b/drivers/net/ethernet/huawei/hinic/hinic_sriov.c
@@ -23,8 +23,8 @@ MODULE_PARM_DESC(set_vf_link_state, "Set vf link state, 0 represents link auto,
#define HINIC_VLAN_PRIORITY_SHIFT 13
#define HINIC_ADD_VLAN_IN_MAC 0x8000
-int hinic_set_mac(struct hinic_hwdev *hwdev, const u8 *mac_addr, u16 vlan_id,
- u16 func_id)
+static int hinic_set_mac(struct hinic_hwdev *hwdev, const u8 *mac_addr,
+ u16 vlan_id, u16 func_id)
{
struct hinic_port_mac_cmd mac_info = {0};
u16 out_size = sizeof(mac_info);
@@ -84,7 +84,7 @@ void hinic_notify_all_vfs_link_changed(struct hinic_hwdev *hwdev,
}
}
-u16 hinic_vf_info_vlanprio(struct hinic_hwdev *hwdev, int vf_id)
+static u16 hinic_vf_info_vlanprio(struct hinic_hwdev *hwdev, int vf_id)
{
struct hinic_func_to_io *nic_io = &hwdev->func_to_io;
u16 pf_vlan, vlanprio;
@@ -97,8 +97,8 @@ u16 hinic_vf_info_vlanprio(struct hinic_hwdev *hwdev, int vf_id)
return vlanprio;
}
-int hinic_set_vf_vlan(struct hinic_hwdev *hwdev, bool add, u16 vid,
- u8 qos, int vf_id)
+static int hinic_set_vf_vlan(struct hinic_hwdev *hwdev, bool add, u16 vid,
+ u8 qos, int vf_id)
{
struct hinic_vf_vlan_config vf_vlan = {0};
u16 out_size = sizeof(vf_vlan);
@@ -163,9 +163,9 @@ static int hinic_init_vf_config(struct hinic_hwdev *hwdev, u16 vf_id)
return 0;
}
-int hinic_register_vf_msg_handler(void *hwdev, u16 vf_id,
- void *buf_in, u16 in_size,
- void *buf_out, u16 *out_size)
+static int hinic_register_vf_msg_handler(void *hwdev, u16 vf_id,
+ void *buf_in, u16 in_size,
+ void *buf_out, u16 *out_size)
{
struct hinic_register_vf *register_info = buf_out;
struct hinic_hwdev *hw_dev = hwdev;
@@ -192,9 +192,9 @@ int hinic_register_vf_msg_handler(void *hwdev, u16 vf_id,
return 0;
}
-int hinic_unregister_vf_msg_handler(void *hwdev, u16 vf_id,
- void *buf_in, u16 in_size,
- void *buf_out, u16 *out_size)
+static int hinic_unregister_vf_msg_handler(void *hwdev, u16 vf_id,
+ void *buf_in, u16 in_size,
+ void *buf_out, u16 *out_size)
{
struct hinic_hwdev *hw_dev = hwdev;
struct hinic_func_to_io *nic_io;
@@ -209,9 +209,9 @@ int hinic_unregister_vf_msg_handler(void *hwdev, u16 vf_id,
return 0;
}
-int hinic_change_vf_mtu_msg_handler(void *hwdev, u16 vf_id,
- void *buf_in, u16 in_size,
- void *buf_out, u16 *out_size)
+static int hinic_change_vf_mtu_msg_handler(void *hwdev, u16 vf_id,
+ void *buf_in, u16 in_size,
+ void *buf_out, u16 *out_size)
{
struct hinic_hwdev *hw_dev = hwdev;
int err;
@@ -227,9 +227,9 @@ int hinic_change_vf_mtu_msg_handler(void *hwdev, u16 vf_id,
return 0;
}
-int hinic_get_vf_mac_msg_handler(void *hwdev, u16 vf_id,
- void *buf_in, u16 in_size,
- void *buf_out, u16 *out_size)
+static int hinic_get_vf_mac_msg_handler(void *hwdev, u16 vf_id,
+ void *buf_in, u16 in_size,
+ void *buf_out, u16 *out_size)
{
struct hinic_port_mac_cmd *mac_info = buf_out;
struct hinic_hwdev *dev = hwdev;
@@ -246,9 +246,9 @@ int hinic_get_vf_mac_msg_handler(void *hwdev, u16 vf_id,
return 0;
}
-int hinic_set_vf_mac_msg_handler(void *hwdev, u16 vf_id,
- void *buf_in, u16 in_size,
- void *buf_out, u16 *out_size)
+static int hinic_set_vf_mac_msg_handler(void *hwdev, u16 vf_id,
+ void *buf_in, u16 in_size,
+ void *buf_out, u16 *out_size)
{
struct hinic_port_mac_cmd *mac_out = buf_out;
struct hinic_port_mac_cmd *mac_in = buf_in;
@@ -280,9 +280,9 @@ int hinic_set_vf_mac_msg_handler(void *hwdev, u16 vf_id,
return err;
}
-int hinic_del_vf_mac_msg_handler(void *hwdev, u16 vf_id,
- void *buf_in, u16 in_size,
- void *buf_out, u16 *out_size)
+static int hinic_del_vf_mac_msg_handler(void *hwdev, u16 vf_id,
+ void *buf_in, u16 in_size,
+ void *buf_out, u16 *out_size)
{
struct hinic_port_mac_cmd *mac_out = buf_out;
struct hinic_port_mac_cmd *mac_in = buf_in;
@@ -312,9 +312,9 @@ int hinic_del_vf_mac_msg_handler(void *hwdev, u16 vf_id,
return err;
}
-int hinic_get_vf_link_status_msg_handler(void *hwdev, u16 vf_id,
- void *buf_in, u16 in_size,
- void *buf_out, u16 *out_size)
+static int hinic_get_vf_link_status_msg_handler(void *hwdev, u16 vf_id,
+ void *buf_in, u16 in_size,
+ void *buf_out, u16 *out_size)
{
struct hinic_port_link_cmd *get_link = buf_out;
struct hinic_hwdev *hw_dev = hwdev;
@@ -339,7 +339,7 @@ int hinic_get_vf_link_status_msg_handler(void *hwdev, u16 vf_id,
return 0;
}
-struct vf_cmd_msg_handle nic_vf_cmd_msg_handler[] = {
+static struct vf_cmd_msg_handle nic_vf_cmd_msg_handler[] = {
{HINIC_PORT_CMD_VF_REGISTER, hinic_register_vf_msg_handler},
{HINIC_PORT_CMD_VF_UNREGISTER, hinic_unregister_vf_msg_handler},
{HINIC_PORT_CMD_CHANGE_MTU, hinic_change_vf_mtu_msg_handler},
@@ -351,6 +351,7 @@ struct vf_cmd_msg_handle nic_vf_cmd_msg_handler[] = {
#define CHECK_IPSU_15BIT 0X8000
+static
struct hinic_sriov_info *hinic_get_sriov_info_by_pcidev(struct pci_dev *pdev)
{
struct net_device *netdev = pci_get_drvdata(pdev);
@@ -372,8 +373,8 @@ static int hinic_check_mac_info(u8 status, u16 vlan_id)
#define HINIC_VLAN_ID_MASK 0x7FFF
-int hinic_update_mac(struct hinic_hwdev *hwdev, u8 *old_mac, u8 *new_mac,
- u16 vlan_id, u16 func_id)
+static int hinic_update_mac(struct hinic_hwdev *hwdev, u8 *old_mac,
+ u8 *new_mac, u16 vlan_id, u16 func_id)
{
struct hinic_port_mac_update mac_info = {0};
u16 out_size = sizeof(mac_info);
@@ -416,8 +417,8 @@ int hinic_update_mac(struct hinic_hwdev *hwdev, u8 *old_mac, u8 *new_mac,
return 0;
}
-void hinic_get_vf_config(struct hinic_hwdev *hwdev, u16 vf_id,
- struct ifla_vf_info *ivi)
+static void hinic_get_vf_config(struct hinic_hwdev *hwdev, u16 vf_id,
+ struct ifla_vf_info *ivi)
{
struct vf_data_storage *vfinfo;
@@ -455,7 +456,8 @@ int hinic_ndo_get_vf_config(struct net_device *netdev,
return 0;
}
-int hinic_set_vf_mac(struct hinic_hwdev *hwdev, int vf, unsigned char *mac_addr)
+static int hinic_set_vf_mac(struct hinic_hwdev *hwdev, int vf,
+ unsigned char *mac_addr)
{
struct hinic_func_to_io *nic_io = &hwdev->func_to_io;
struct vf_data_storage *vf_info;
@@ -504,7 +506,8 @@ int hinic_ndo_set_vf_mac(struct net_device *netdev, int vf, u8 *mac)
return 0;
}
-int hinic_add_vf_vlan(struct hinic_hwdev *hwdev, int vf_id, u16 vlan, u8 qos)
+static int hinic_add_vf_vlan(struct hinic_hwdev *hwdev, int vf_id,
+ u16 vlan, u8 qos)
{
struct hinic_func_to_io *nic_io = &hwdev->func_to_io;
int err;
@@ -521,7 +524,7 @@ int hinic_add_vf_vlan(struct hinic_hwdev *hwdev, int vf_id, u16 vlan, u8 qos)
return 0;
}
-int hinic_kill_vf_vlan(struct hinic_hwdev *hwdev, int vf_id)
+static int hinic_kill_vf_vlan(struct hinic_hwdev *hwdev, int vf_id)
{
struct hinic_func_to_io *nic_io = &hwdev->func_to_io;
int err;
@@ -543,8 +546,8 @@ int hinic_kill_vf_vlan(struct hinic_hwdev *hwdev, int vf_id)
return 0;
}
-int hinic_update_mac_vlan(struct hinic_dev *nic_dev, u16 old_vlan, u16 new_vlan,
- int vf_id)
+static int hinic_update_mac_vlan(struct hinic_dev *nic_dev, u16 old_vlan,
+ u16 new_vlan, int vf_id)
{
struct vf_data_storage *vf_info;
u16 vlan_id;
@@ -651,7 +654,8 @@ int hinic_ndo_set_vf_vlan(struct net_device *netdev, int vf, u16 vlan, u8 qos,
return set_hw_vf_vlan(nic_dev, cur_vlanprio, vf, vlan, qos);
}
-int hinic_set_vf_trust(struct hinic_hwdev *hwdev, u16 vf_id, bool trust)
+static int hinic_set_vf_trust(struct hinic_hwdev *hwdev, u16 vf_id,
+ bool trust)
{
struct vf_data_storage *vf_infos;
struct hinic_func_to_io *nic_io;
@@ -697,24 +701,22 @@ int hinic_ndo_set_vf_trust(struct net_device *netdev, int vf, bool setting)
}
/* pf receive message from vf */
-int nic_pf_mbox_handler(void *hwdev, u16 vf_id, u8 cmd, void *buf_in,
- u16 in_size, void *buf_out, u16 *out_size)
+static int nic_pf_mbox_handler(void *hwdev, u16 vf_id, u8 cmd, void *buf_in,
+ u16 in_size, void *buf_out, u16 *out_size)
{
struct vf_cmd_msg_handle *vf_msg_handle;
struct hinic_hwdev *dev = hwdev;
struct hinic_func_to_io *nic_io;
struct hinic_pfhwdev *pfhwdev;
- u32 i, cmd_number;
int err = 0;
+ u32 i;
if (!hwdev)
return -EFAULT;
- cmd_number = sizeof(nic_vf_cmd_msg_handler) /
- sizeof(struct vf_cmd_msg_handle);
pfhwdev = container_of(dev, struct hinic_pfhwdev, hwdev);
nic_io = &dev->func_to_io;
- for (i = 0; i < cmd_number; i++) {
+ for (i = 0; i < ARRAY_SIZE(nic_vf_cmd_msg_handler); i++) {
vf_msg_handle = &nic_vf_cmd_msg_handler[i];
if (cmd == vf_msg_handle->cmd &&
vf_msg_handle->cmd_msg_handler) {
@@ -725,7 +727,7 @@ int nic_pf_mbox_handler(void *hwdev, u16 vf_id, u8 cmd, void *buf_in,
break;
}
}
- if (i == cmd_number)
+ if (i == ARRAY_SIZE(nic_vf_cmd_msg_handler))
err = hinic_msg_to_mgmt(&pfhwdev->pf_to_mgmt, HINIC_MOD_L2NIC,
cmd, buf_in, in_size, buf_out,
out_size, HINIC_MGMT_MSG_SYNC);
@@ -786,7 +788,7 @@ static int hinic_init_vf_infos(struct hinic_func_to_io *nic_io, u16 vf_id)
return 0;
}
-void hinic_clear_vf_infos(struct hinic_dev *nic_dev, u16 vf_id)
+static void hinic_clear_vf_infos(struct hinic_dev *nic_dev, u16 vf_id)
{
struct vf_data_storage *vf_infos;
u16 func_id;
@@ -807,8 +809,8 @@ void hinic_clear_vf_infos(struct hinic_dev *nic_dev, u16 vf_id)
hinic_init_vf_infos(&nic_dev->hwdev->func_to_io, HW_VF_ID_TO_OS(vf_id));
}
-int hinic_deinit_vf_hw(struct hinic_sriov_info *sriov_info, u16 start_vf_id,
- u16 end_vf_id)
+static int hinic_deinit_vf_hw(struct hinic_sriov_info *sriov_info,
+ u16 start_vf_id, u16 end_vf_id)
{
struct hinic_dev *nic_dev;
u16 func_idx, idx;
@@ -908,7 +910,8 @@ void hinic_vf_func_free(struct hinic_hwdev *hwdev)
}
}
-int hinic_init_vf_hw(struct hinic_hwdev *hwdev, u16 start_vf_id, u16 end_vf_id)
+static int hinic_init_vf_hw(struct hinic_hwdev *hwdev, u16 start_vf_id,
+ u16 end_vf_id)
{
u16 i, func_idx;
int err;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_actions.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_actions.c
index e47d1d2..73d5601 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_actions.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_actions.c
@@ -136,28 +136,35 @@ mlxsw_sp_act_mirror_add(void *priv, u8 local_in_port,
const struct net_device *out_dev,
bool ingress, int *p_span_id)
{
- struct mlxsw_sp_port *in_port;
+ struct mlxsw_sp_port *mlxsw_sp_port;
struct mlxsw_sp *mlxsw_sp = priv;
- enum mlxsw_sp_span_type type;
+ int err;
- type = ingress ? MLXSW_SP_SPAN_INGRESS : MLXSW_SP_SPAN_EGRESS;
- in_port = mlxsw_sp->ports[local_in_port];
+ err = mlxsw_sp_span_agent_get(mlxsw_sp, out_dev, p_span_id);
+ if (err)
+ return err;
- return mlxsw_sp_span_mirror_add(in_port, out_dev, type,
- false, p_span_id);
+ mlxsw_sp_port = mlxsw_sp->ports[local_in_port];
+ err = mlxsw_sp_span_analyzed_port_get(mlxsw_sp_port, ingress);
+ if (err)
+ goto err_analyzed_port_get;
+
+ return 0;
+
+err_analyzed_port_get:
+ mlxsw_sp_span_agent_put(mlxsw_sp, *p_span_id);
+ return err;
}
static void
mlxsw_sp_act_mirror_del(void *priv, u8 local_in_port, int span_id, bool ingress)
{
+ struct mlxsw_sp_port *mlxsw_sp_port;
struct mlxsw_sp *mlxsw_sp = priv;
- struct mlxsw_sp_port *in_port;
- enum mlxsw_sp_span_type type;
- type = ingress ? MLXSW_SP_SPAN_INGRESS : MLXSW_SP_SPAN_EGRESS;
- in_port = mlxsw_sp->ports[local_in_port];
-
- mlxsw_sp_span_mirror_del(in_port, span_id, type, false);
+ mlxsw_sp_port = mlxsw_sp->ports[local_in_port];
+ mlxsw_sp_span_analyzed_port_put(mlxsw_sp_port, ingress);
+ mlxsw_sp_span_agent_put(mlxsw_sp, span_id);
}
const struct mlxsw_afa_ops mlxsw_sp1_act_afa_ops = {
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_matchall.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_matchall.c
index 889da63..da1c05f 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_matchall.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_matchall.c
@@ -48,31 +48,57 @@ static int
mlxsw_sp_mall_port_mirror_add(struct mlxsw_sp_port *mlxsw_sp_port,
struct mlxsw_sp_mall_entry *mall_entry)
{
- enum mlxsw_sp_span_type span_type;
+ struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+ struct mlxsw_sp_span_trigger_parms parms;
+ enum mlxsw_sp_span_trigger trigger;
+ int err;
if (!mall_entry->mirror.to_dev) {
netdev_err(mlxsw_sp_port->dev, "Could not find requested device\n");
return -EINVAL;
}
- span_type = mall_entry->ingress ? MLXSW_SP_SPAN_INGRESS :
- MLXSW_SP_SPAN_EGRESS;
- return mlxsw_sp_span_mirror_add(mlxsw_sp_port,
- mall_entry->mirror.to_dev,
- span_type, true,
- &mall_entry->mirror.span_id);
+ err = mlxsw_sp_span_agent_get(mlxsw_sp, mall_entry->mirror.to_dev,
+ &mall_entry->mirror.span_id);
+ if (err)
+ return err;
+
+ err = mlxsw_sp_span_analyzed_port_get(mlxsw_sp_port,
+ mall_entry->ingress);
+ if (err)
+ goto err_analyzed_port_get;
+
+ trigger = mall_entry->ingress ? MLXSW_SP_SPAN_TRIGGER_INGRESS :
+ MLXSW_SP_SPAN_TRIGGER_EGRESS;
+ parms.span_id = mall_entry->mirror.span_id;
+ err = mlxsw_sp_span_agent_bind(mlxsw_sp, trigger, mlxsw_sp_port,
+ &parms);
+ if (err)
+ goto err_agent_bind;
+
+ return 0;
+
+err_agent_bind:
+ mlxsw_sp_span_analyzed_port_put(mlxsw_sp_port, mall_entry->ingress);
+err_analyzed_port_get:
+ mlxsw_sp_span_agent_put(mlxsw_sp, mall_entry->mirror.span_id);
+ return err;
}
static void
mlxsw_sp_mall_port_mirror_del(struct mlxsw_sp_port *mlxsw_sp_port,
struct mlxsw_sp_mall_entry *mall_entry)
{
- enum mlxsw_sp_span_type span_type;
+ struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+ struct mlxsw_sp_span_trigger_parms parms;
+ enum mlxsw_sp_span_trigger trigger;
- span_type = mall_entry->ingress ? MLXSW_SP_SPAN_INGRESS :
- MLXSW_SP_SPAN_EGRESS;
- mlxsw_sp_span_mirror_del(mlxsw_sp_port, mall_entry->mirror.span_id,
- span_type, true);
+ trigger = mall_entry->ingress ? MLXSW_SP_SPAN_TRIGGER_INGRESS :
+ MLXSW_SP_SPAN_TRIGGER_EGRESS;
+ parms.span_id = mall_entry->mirror.span_id;
+ mlxsw_sp_span_agent_unbind(mlxsw_sp, trigger, mlxsw_sp_port, &parms);
+ mlxsw_sp_span_analyzed_port_put(mlxsw_sp_port, mall_entry->ingress);
+ mlxsw_sp_span_agent_put(mlxsw_sp, mall_entry->mirror.span_id);
}
static int mlxsw_sp_mall_port_sample_set(struct mlxsw_sp_port *mlxsw_sp_port,
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_span.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_span.c
index ae3c8a1..304eb8c 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_span.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_span.c
@@ -3,6 +3,7 @@
#include <linux/if_bridge.h>
#include <linux/list.h>
+#include <linux/mutex.h>
#include <linux/refcount.h>
#include <linux/rtnetlink.h>
#include <linux/workqueue.h>
@@ -20,11 +21,29 @@
struct mlxsw_sp_span {
struct work_struct work;
struct mlxsw_sp *mlxsw_sp;
+ struct list_head analyzed_ports_list;
+ struct mutex analyzed_ports_lock; /* Protects analyzed_ports_list */
+ struct list_head trigger_entries_list;
atomic_t active_entries_count;
int entries_count;
struct mlxsw_sp_span_entry entries[];
};
+struct mlxsw_sp_span_analyzed_port {
+ struct list_head list; /* Member of analyzed_ports_list */
+ refcount_t ref_count;
+ u8 local_port;
+ bool ingress;
+};
+
+struct mlxsw_sp_span_trigger_entry {
+ struct list_head list; /* Member of trigger_entries_list */
+ refcount_t ref_count;
+ u8 local_port;
+ enum mlxsw_sp_span_trigger trigger;
+ struct mlxsw_sp_span_trigger_parms parms;
+};
+
static void mlxsw_sp_span_respin_work(struct work_struct *work);
static u64 mlxsw_sp_span_occ_get(void *priv)
@@ -49,15 +68,14 @@ int mlxsw_sp_span_init(struct mlxsw_sp *mlxsw_sp)
return -ENOMEM;
span->entries_count = entries_count;
atomic_set(&span->active_entries_count, 0);
+ mutex_init(&span->analyzed_ports_lock);
+ INIT_LIST_HEAD(&span->analyzed_ports_list);
+ INIT_LIST_HEAD(&span->trigger_entries_list);
span->mlxsw_sp = mlxsw_sp;
mlxsw_sp->span = span;
- for (i = 0; i < mlxsw_sp->span->entries_count; i++) {
- struct mlxsw_sp_span_entry *curr = &mlxsw_sp->span->entries[i];
-
- INIT_LIST_HEAD(&curr->bound_ports_list);
- curr->id = i;
- }
+ for (i = 0; i < mlxsw_sp->span->entries_count; i++)
+ mlxsw_sp->span->entries[i].id = i;
devlink_resource_occ_get_register(devlink, MLXSW_SP_RESOURCE_SPAN,
mlxsw_sp_span_occ_get, mlxsw_sp);
@@ -69,16 +87,13 @@ int mlxsw_sp_span_init(struct mlxsw_sp *mlxsw_sp)
void mlxsw_sp_span_fini(struct mlxsw_sp *mlxsw_sp)
{
struct devlink *devlink = priv_to_devlink(mlxsw_sp->core);
- int i;
cancel_work_sync(&mlxsw_sp->span->work);
devlink_resource_occ_get_unregister(devlink, MLXSW_SP_RESOURCE_SPAN);
- for (i = 0; i < mlxsw_sp->span->entries_count; i++) {
- struct mlxsw_sp_span_entry *curr = &mlxsw_sp->span->entries[i];
-
- WARN_ON_ONCE(!list_empty(&curr->bound_ports_list));
- }
+ WARN_ON_ONCE(!list_empty(&mlxsw_sp->span->trigger_entries_list));
+ WARN_ON_ONCE(!list_empty(&mlxsw_sp->span->analyzed_ports_list));
+ mutex_destroy(&mlxsw_sp->span->analyzed_ports_lock);
kfree(mlxsw_sp->span);
}
@@ -751,26 +766,8 @@ static int mlxsw_sp_span_entry_put(struct mlxsw_sp *mlxsw_sp,
return 0;
}
-static bool mlxsw_sp_span_is_egress_mirror(struct mlxsw_sp_port *port)
-{
- struct mlxsw_sp *mlxsw_sp = port->mlxsw_sp;
- struct mlxsw_sp_span_inspected_port *p;
- int i;
-
- for (i = 0; i < mlxsw_sp->span->entries_count; i++) {
- struct mlxsw_sp_span_entry *curr = &mlxsw_sp->span->entries[i];
-
- list_for_each_entry(p, &curr->bound_ports_list, list)
- if (p->local_port == port->local_port &&
- p->type == MLXSW_SP_SPAN_EGRESS)
- return true;
- }
-
- return false;
-}
-
static int
-mlxsw_sp_span_port_buffsize_update(struct mlxsw_sp_port *mlxsw_sp_port, u16 mtu)
+mlxsw_sp_span_port_buffer_update(struct mlxsw_sp_port *mlxsw_sp_port, u16 mtu)
{
struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
char sbib_pl[MLXSW_REG_SBIB_LEN];
@@ -789,20 +786,54 @@ mlxsw_sp_span_port_buffsize_update(struct mlxsw_sp_port *mlxsw_sp_port, u16 mtu)
return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(sbib), sbib_pl);
}
+static void mlxsw_sp_span_port_buffer_disable(struct mlxsw_sp *mlxsw_sp,
+ u8 local_port)
+{
+ char sbib_pl[MLXSW_REG_SBIB_LEN];
+
+ mlxsw_reg_sbib_pack(sbib_pl, local_port, 0);
+ mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(sbib), sbib_pl);
+}
+
+static struct mlxsw_sp_span_analyzed_port *
+mlxsw_sp_span_analyzed_port_find(struct mlxsw_sp_span *span, u8 local_port,
+ bool ingress)
+{
+ struct mlxsw_sp_span_analyzed_port *analyzed_port;
+
+ list_for_each_entry(analyzed_port, &span->analyzed_ports_list, list) {
+ if (analyzed_port->local_port == local_port &&
+ analyzed_port->ingress == ingress)
+ return analyzed_port;
+ }
+
+ return NULL;
+}
+
int mlxsw_sp_span_port_mtu_update(struct mlxsw_sp_port *port, u16 mtu)
{
+ struct mlxsw_sp *mlxsw_sp = port->mlxsw_sp;
+ int err = 0;
+
/* If port is egress mirrored, the shared buffer size should be
* updated according to the mtu value
*/
- if (mlxsw_sp_span_is_egress_mirror(port))
- return mlxsw_sp_span_port_buffsize_update(port, mtu);
- return 0;
+ mutex_lock(&mlxsw_sp->span->analyzed_ports_lock);
+
+ if (mlxsw_sp_span_analyzed_port_find(mlxsw_sp->span, port->local_port,
+ false))
+ err = mlxsw_sp_span_port_buffer_update(port, mtu);
+
+ mutex_unlock(&mlxsw_sp->span->analyzed_ports_lock);
+
+ return err;
}
void mlxsw_sp_span_speed_update_work(struct work_struct *work)
{
struct delayed_work *dwork = to_delayed_work(work);
struct mlxsw_sp_port *mlxsw_sp_port;
+ struct mlxsw_sp *mlxsw_sp;
mlxsw_sp_port = container_of(dwork, struct mlxsw_sp_port,
span.speed_update_dw);
@@ -810,134 +841,15 @@ void mlxsw_sp_span_speed_update_work(struct work_struct *work)
/* If port is egress mirrored, the shared buffer size should be
* updated according to the speed value.
*/
- if (mlxsw_sp_span_is_egress_mirror(mlxsw_sp_port))
- mlxsw_sp_span_port_buffsize_update(mlxsw_sp_port,
- mlxsw_sp_port->dev->mtu);
-}
+ mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+ mutex_lock(&mlxsw_sp->span->analyzed_ports_lock);
-static struct mlxsw_sp_span_inspected_port *
-mlxsw_sp_span_entry_bound_port_find(struct mlxsw_sp_span_entry *span_entry,
- enum mlxsw_sp_span_type type,
- struct mlxsw_sp_port *port,
- bool bind)
-{
- struct mlxsw_sp_span_inspected_port *p;
+ if (mlxsw_sp_span_analyzed_port_find(mlxsw_sp->span,
+ mlxsw_sp_port->local_port, false))
+ mlxsw_sp_span_port_buffer_update(mlxsw_sp_port,
+ mlxsw_sp_port->dev->mtu);
- list_for_each_entry(p, &span_entry->bound_ports_list, list)
- if (type == p->type &&
- port->local_port == p->local_port &&
- bind == p->bound)
- return p;
- return NULL;
-}
-
-static int
-mlxsw_sp_span_inspected_port_bind(struct mlxsw_sp_port *port,
- struct mlxsw_sp_span_entry *span_entry,
- enum mlxsw_sp_span_type type,
- bool bind)
-{
- struct mlxsw_sp *mlxsw_sp = port->mlxsw_sp;
- char mpar_pl[MLXSW_REG_MPAR_LEN];
- int pa_id = span_entry->id;
-
- /* bind the port to the SPAN entry */
- mlxsw_reg_mpar_pack(mpar_pl, port->local_port,
- (enum mlxsw_reg_mpar_i_e)type, bind, pa_id);
- return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(mpar), mpar_pl);
-}
-
-static int
-mlxsw_sp_span_inspected_port_add(struct mlxsw_sp_port *port,
- struct mlxsw_sp_span_entry *span_entry,
- enum mlxsw_sp_span_type type,
- bool bind)
-{
- struct mlxsw_sp_span_inspected_port *inspected_port;
- struct mlxsw_sp *mlxsw_sp = port->mlxsw_sp;
- char sbib_pl[MLXSW_REG_SBIB_LEN];
- int i;
- int err;
-
- /* A given (source port, direction) can only be bound to one analyzer,
- * so if a binding is requested, check for conflicts.
- */
- if (bind)
- for (i = 0; i < mlxsw_sp->span->entries_count; i++) {
- struct mlxsw_sp_span_entry *curr =
- &mlxsw_sp->span->entries[i];
-
- if (mlxsw_sp_span_entry_bound_port_find(curr, type,
- port, bind))
- return -EEXIST;
- }
-
- /* if it is an egress SPAN, bind a shared buffer to it */
- if (type == MLXSW_SP_SPAN_EGRESS) {
- err = mlxsw_sp_span_port_buffsize_update(port, port->dev->mtu);
- if (err)
- return err;
- }
-
- if (bind) {
- err = mlxsw_sp_span_inspected_port_bind(port, span_entry, type,
- true);
- if (err)
- goto err_port_bind;
- }
-
- inspected_port = kzalloc(sizeof(*inspected_port), GFP_KERNEL);
- if (!inspected_port) {
- err = -ENOMEM;
- goto err_inspected_port_alloc;
- }
- inspected_port->local_port = port->local_port;
- inspected_port->type = type;
- inspected_port->bound = bind;
- list_add_tail(&inspected_port->list, &span_entry->bound_ports_list);
-
- return 0;
-
-err_inspected_port_alloc:
- if (bind)
- mlxsw_sp_span_inspected_port_bind(port, span_entry, type,
- false);
-err_port_bind:
- if (type == MLXSW_SP_SPAN_EGRESS) {
- mlxsw_reg_sbib_pack(sbib_pl, port->local_port, 0);
- mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(sbib), sbib_pl);
- }
- return err;
-}
-
-static void
-mlxsw_sp_span_inspected_port_del(struct mlxsw_sp_port *port,
- struct mlxsw_sp_span_entry *span_entry,
- enum mlxsw_sp_span_type type,
- bool bind)
-{
- struct mlxsw_sp_span_inspected_port *inspected_port;
- struct mlxsw_sp *mlxsw_sp = port->mlxsw_sp;
- char sbib_pl[MLXSW_REG_SBIB_LEN];
-
- inspected_port = mlxsw_sp_span_entry_bound_port_find(span_entry, type,
- port, bind);
- if (!inspected_port)
- return;
-
- if (bind)
- mlxsw_sp_span_inspected_port_bind(port, span_entry, type,
- false);
- /* remove the SBIB buffer if it was egress SPAN */
- if (type == MLXSW_SP_SPAN_EGRESS) {
- mlxsw_reg_sbib_pack(sbib_pl, port->local_port, 0);
- mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(sbib), sbib_pl);
- }
-
- mlxsw_sp_span_entry_put(mlxsw_sp, span_entry);
-
- list_del(&inspected_port->list);
- kfree(inspected_port);
+ mutex_unlock(&mlxsw_sp->span->analyzed_ports_lock);
}
static const struct mlxsw_sp_span_entry_ops *
@@ -953,57 +865,6 @@ mlxsw_sp_span_entry_ops(struct mlxsw_sp *mlxsw_sp,
return NULL;
}
-int mlxsw_sp_span_mirror_add(struct mlxsw_sp_port *from,
- const struct net_device *to_dev,
- enum mlxsw_sp_span_type type, bool bind,
- int *p_span_id)
-{
- struct mlxsw_sp *mlxsw_sp = from->mlxsw_sp;
- const struct mlxsw_sp_span_entry_ops *ops;
- struct mlxsw_sp_span_parms sparms = {NULL};
- struct mlxsw_sp_span_entry *span_entry;
- int err;
-
- ops = mlxsw_sp_span_entry_ops(mlxsw_sp, to_dev);
- if (!ops) {
- netdev_err(to_dev, "Cannot mirror to %s", to_dev->name);
- return -EOPNOTSUPP;
- }
-
- err = ops->parms_set(to_dev, &sparms);
- if (err)
- return err;
-
- span_entry = mlxsw_sp_span_entry_get(mlxsw_sp, to_dev, ops, sparms);
- if (!span_entry)
- return -ENOBUFS;
-
- err = mlxsw_sp_span_inspected_port_add(from, span_entry, type, bind);
- if (err)
- goto err_port_bind;
-
- *p_span_id = span_entry->id;
- return 0;
-
-err_port_bind:
- mlxsw_sp_span_entry_put(mlxsw_sp, span_entry);
- return err;
-}
-
-void mlxsw_sp_span_mirror_del(struct mlxsw_sp_port *from, int span_id,
- enum mlxsw_sp_span_type type, bool bind)
-{
- struct mlxsw_sp_span_entry *span_entry;
-
- span_entry = mlxsw_sp_span_entry_find_by_id(from->mlxsw_sp, span_id);
- if (!span_entry) {
- netdev_err(from->dev, "no span entry found\n");
- return;
- }
-
- mlxsw_sp_span_inspected_port_del(from, span_entry, type, bind);
-}
-
static void mlxsw_sp_span_respin_work(struct work_struct *work)
{
struct mlxsw_sp_span *span;
@@ -1039,3 +900,309 @@ void mlxsw_sp_span_respin(struct mlxsw_sp *mlxsw_sp)
return;
mlxsw_core_schedule_work(&mlxsw_sp->span->work);
}
+
+int mlxsw_sp_span_agent_get(struct mlxsw_sp *mlxsw_sp,
+ const struct net_device *to_dev, int *p_span_id)
+{
+ const struct mlxsw_sp_span_entry_ops *ops;
+ struct mlxsw_sp_span_entry *span_entry;
+ struct mlxsw_sp_span_parms sparms;
+ int err;
+
+ ASSERT_RTNL();
+
+ ops = mlxsw_sp_span_entry_ops(mlxsw_sp, to_dev);
+ if (!ops) {
+ dev_err(mlxsw_sp->bus_info->dev, "Cannot mirror to requested destination\n");
+ return -EOPNOTSUPP;
+ }
+
+ memset(&sparms, 0, sizeof(sparms));
+ err = ops->parms_set(to_dev, &sparms);
+ if (err)
+ return err;
+
+ span_entry = mlxsw_sp_span_entry_get(mlxsw_sp, to_dev, ops, sparms);
+ if (!span_entry)
+ return -ENOBUFS;
+
+ *p_span_id = span_entry->id;
+
+ return 0;
+}
+
+void mlxsw_sp_span_agent_put(struct mlxsw_sp *mlxsw_sp, int span_id)
+{
+ struct mlxsw_sp_span_entry *span_entry;
+
+ ASSERT_RTNL();
+
+ span_entry = mlxsw_sp_span_entry_find_by_id(mlxsw_sp, span_id);
+ if (WARN_ON_ONCE(!span_entry))
+ return;
+
+ mlxsw_sp_span_entry_put(mlxsw_sp, span_entry);
+}
+
+static struct mlxsw_sp_span_analyzed_port *
+mlxsw_sp_span_analyzed_port_create(struct mlxsw_sp_span *span,
+ struct mlxsw_sp_port *mlxsw_sp_port,
+ bool ingress)
+{
+ struct mlxsw_sp_span_analyzed_port *analyzed_port;
+ int err;
+
+ analyzed_port = kzalloc(sizeof(*analyzed_port), GFP_KERNEL);
+ if (!analyzed_port)
+ return ERR_PTR(-ENOMEM);
+
+ refcount_set(&analyzed_port->ref_count, 1);
+ analyzed_port->local_port = mlxsw_sp_port->local_port;
+ analyzed_port->ingress = ingress;
+ list_add_tail(&analyzed_port->list, &span->analyzed_ports_list);
+
+ /* An egress mirror buffer should be allocated on the egress port which
+ * does the mirroring.
+ */
+ if (!ingress) {
+ u16 mtu = mlxsw_sp_port->dev->mtu;
+
+ err = mlxsw_sp_span_port_buffer_update(mlxsw_sp_port, mtu);
+ if (err)
+ goto err_buffer_update;
+ }
+
+ return analyzed_port;
+
+err_buffer_update:
+ list_del(&analyzed_port->list);
+ kfree(analyzed_port);
+ return ERR_PTR(err);
+}
+
+static void
+mlxsw_sp_span_analyzed_port_destroy(struct mlxsw_sp_span *span,
+ struct mlxsw_sp_span_analyzed_port *
+ analyzed_port)
+{
+ struct mlxsw_sp *mlxsw_sp = span->mlxsw_sp;
+
+ /* Remove egress mirror buffer now that port is no longer analyzed
+ * at egress.
+ */
+ if (!analyzed_port->ingress)
+ mlxsw_sp_span_port_buffer_disable(mlxsw_sp,
+ analyzed_port->local_port);
+
+ list_del(&analyzed_port->list);
+ kfree(analyzed_port);
+}
+
+int mlxsw_sp_span_analyzed_port_get(struct mlxsw_sp_port *mlxsw_sp_port,
+ bool ingress)
+{
+ struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+ struct mlxsw_sp_span_analyzed_port *analyzed_port;
+ u8 local_port = mlxsw_sp_port->local_port;
+ int err = 0;
+
+ mutex_lock(&mlxsw_sp->span->analyzed_ports_lock);
+
+ analyzed_port = mlxsw_sp_span_analyzed_port_find(mlxsw_sp->span,
+ local_port, ingress);
+ if (analyzed_port) {
+ refcount_inc(&analyzed_port->ref_count);
+ goto out_unlock;
+ }
+
+ analyzed_port = mlxsw_sp_span_analyzed_port_create(mlxsw_sp->span,
+ mlxsw_sp_port,
+ ingress);
+ if (IS_ERR(analyzed_port))
+ err = PTR_ERR(analyzed_port);
+
+out_unlock:
+ mutex_unlock(&mlxsw_sp->span->analyzed_ports_lock);
+ return err;
+}
+
+void mlxsw_sp_span_analyzed_port_put(struct mlxsw_sp_port *mlxsw_sp_port,
+ bool ingress)
+{
+ struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
+ struct mlxsw_sp_span_analyzed_port *analyzed_port;
+ u8 local_port = mlxsw_sp_port->local_port;
+
+ mutex_lock(&mlxsw_sp->span->analyzed_ports_lock);
+
+ analyzed_port = mlxsw_sp_span_analyzed_port_find(mlxsw_sp->span,
+ local_port, ingress);
+ if (WARN_ON_ONCE(!analyzed_port))
+ goto out_unlock;
+
+ if (!refcount_dec_and_test(&analyzed_port->ref_count))
+ goto out_unlock;
+
+ mlxsw_sp_span_analyzed_port_destroy(mlxsw_sp->span, analyzed_port);
+
+out_unlock:
+ mutex_unlock(&mlxsw_sp->span->analyzed_ports_lock);
+}
+
+static int
+__mlxsw_sp_span_trigger_entry_bind(struct mlxsw_sp_span *span,
+ struct mlxsw_sp_span_trigger_entry *
+ trigger_entry, bool enable)
+{
+ char mpar_pl[MLXSW_REG_MPAR_LEN];
+ enum mlxsw_reg_mpar_i_e i_e;
+
+ switch (trigger_entry->trigger) {
+ case MLXSW_SP_SPAN_TRIGGER_INGRESS:
+ i_e = MLXSW_REG_MPAR_TYPE_INGRESS;
+ break;
+ case MLXSW_SP_SPAN_TRIGGER_EGRESS:
+ i_e = MLXSW_REG_MPAR_TYPE_EGRESS;
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ return -EINVAL;
+ }
+
+ mlxsw_reg_mpar_pack(mpar_pl, trigger_entry->local_port, i_e, enable,
+ trigger_entry->parms.span_id);
+ return mlxsw_reg_write(span->mlxsw_sp->core, MLXSW_REG(mpar), mpar_pl);
+}
+
+static int
+mlxsw_sp_span_trigger_entry_bind(struct mlxsw_sp_span *span,
+ struct mlxsw_sp_span_trigger_entry *
+ trigger_entry)
+{
+ return __mlxsw_sp_span_trigger_entry_bind(span, trigger_entry, true);
+}
+
+static void
+mlxsw_sp_span_trigger_entry_unbind(struct mlxsw_sp_span *span,
+ struct mlxsw_sp_span_trigger_entry *
+ trigger_entry)
+{
+ __mlxsw_sp_span_trigger_entry_bind(span, trigger_entry, false);
+}
+
+static struct mlxsw_sp_span_trigger_entry *
+mlxsw_sp_span_trigger_entry_create(struct mlxsw_sp_span *span,
+ enum mlxsw_sp_span_trigger trigger,
+ struct mlxsw_sp_port *mlxsw_sp_port,
+ const struct mlxsw_sp_span_trigger_parms
+ *parms)
+{
+ struct mlxsw_sp_span_trigger_entry *trigger_entry;
+ int err;
+
+ trigger_entry = kzalloc(sizeof(*trigger_entry), GFP_KERNEL);
+ if (!trigger_entry)
+ return ERR_PTR(-ENOMEM);
+
+ refcount_set(&trigger_entry->ref_count, 1);
+ trigger_entry->local_port = mlxsw_sp_port->local_port;
+ trigger_entry->trigger = trigger;
+ memcpy(&trigger_entry->parms, parms, sizeof(trigger_entry->parms));
+ list_add_tail(&trigger_entry->list, &span->trigger_entries_list);
+
+ err = mlxsw_sp_span_trigger_entry_bind(span, trigger_entry);
+ if (err)
+ goto err_trigger_entry_bind;
+
+ return trigger_entry;
+
+err_trigger_entry_bind:
+ list_del(&trigger_entry->list);
+ kfree(trigger_entry);
+ return ERR_PTR(err);
+}
+
+static void
+mlxsw_sp_span_trigger_entry_destroy(struct mlxsw_sp_span *span,
+ struct mlxsw_sp_span_trigger_entry *
+ trigger_entry)
+{
+ mlxsw_sp_span_trigger_entry_unbind(span, trigger_entry);
+ list_del(&trigger_entry->list);
+ kfree(trigger_entry);
+}
+
+static struct mlxsw_sp_span_trigger_entry *
+mlxsw_sp_span_trigger_entry_find(struct mlxsw_sp_span *span,
+ enum mlxsw_sp_span_trigger trigger,
+ struct mlxsw_sp_port *mlxsw_sp_port)
+{
+ struct mlxsw_sp_span_trigger_entry *trigger_entry;
+
+ list_for_each_entry(trigger_entry, &span->trigger_entries_list, list) {
+ if (trigger_entry->trigger == trigger &&
+ trigger_entry->local_port == mlxsw_sp_port->local_port)
+ return trigger_entry;
+ }
+
+ return NULL;
+}
+
+int mlxsw_sp_span_agent_bind(struct mlxsw_sp *mlxsw_sp,
+ enum mlxsw_sp_span_trigger trigger,
+ struct mlxsw_sp_port *mlxsw_sp_port,
+ const struct mlxsw_sp_span_trigger_parms *parms)
+{
+ struct mlxsw_sp_span_trigger_entry *trigger_entry;
+ int err = 0;
+
+ ASSERT_RTNL();
+
+ if (!mlxsw_sp_span_entry_find_by_id(mlxsw_sp, parms->span_id))
+ return -EINVAL;
+
+ trigger_entry = mlxsw_sp_span_trigger_entry_find(mlxsw_sp->span,
+ trigger,
+ mlxsw_sp_port);
+ if (trigger_entry) {
+ if (trigger_entry->parms.span_id != parms->span_id)
+ return -EINVAL;
+ refcount_inc(&trigger_entry->ref_count);
+ goto out;
+ }
+
+ trigger_entry = mlxsw_sp_span_trigger_entry_create(mlxsw_sp->span,
+ trigger,
+ mlxsw_sp_port,
+ parms);
+ if (IS_ERR(trigger_entry))
+ err = PTR_ERR(trigger_entry);
+
+out:
+ return err;
+}
+
+void mlxsw_sp_span_agent_unbind(struct mlxsw_sp *mlxsw_sp,
+ enum mlxsw_sp_span_trigger trigger,
+ struct mlxsw_sp_port *mlxsw_sp_port,
+ const struct mlxsw_sp_span_trigger_parms *parms)
+{
+ struct mlxsw_sp_span_trigger_entry *trigger_entry;
+
+ ASSERT_RTNL();
+
+ if (WARN_ON_ONCE(!mlxsw_sp_span_entry_find_by_id(mlxsw_sp,
+ parms->span_id)))
+ return;
+
+ trigger_entry = mlxsw_sp_span_trigger_entry_find(mlxsw_sp->span,
+ trigger,
+ mlxsw_sp_port);
+ if (WARN_ON_ONCE(!trigger_entry))
+ return;
+
+ if (!refcount_dec_and_test(&trigger_entry->ref_count))
+ return;
+
+ mlxsw_sp_span_trigger_entry_destroy(mlxsw_sp->span, trigger_entry);
+}
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_span.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum_span.h
index d23abdf..9f6dd2d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_span.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_span.h
@@ -13,20 +13,6 @@
struct mlxsw_sp;
struct mlxsw_sp_port;
-enum mlxsw_sp_span_type {
- MLXSW_SP_SPAN_EGRESS,
- MLXSW_SP_SPAN_INGRESS
-};
-
-struct mlxsw_sp_span_inspected_port {
- struct list_head list;
- enum mlxsw_sp_span_type type;
- u8 local_port;
-
- /* Whether this is a directly bound mirror (port-to-port) or an ACL. */
- bool bound;
-};
-
struct mlxsw_sp_span_parms {
struct mlxsw_sp_port *dest_port; /* NULL for unoffloaded SPAN. */
unsigned int ttl;
@@ -37,13 +23,21 @@ struct mlxsw_sp_span_parms {
u16 vid;
};
+enum mlxsw_sp_span_trigger {
+ MLXSW_SP_SPAN_TRIGGER_INGRESS,
+ MLXSW_SP_SPAN_TRIGGER_EGRESS,
+};
+
+struct mlxsw_sp_span_trigger_parms {
+ int span_id;
+};
+
struct mlxsw_sp_span_entry_ops;
struct mlxsw_sp_span_entry {
const struct net_device *to_dev;
const struct mlxsw_sp_span_entry_ops *ops;
struct mlxsw_sp_span_parms parms;
- struct list_head bound_ports_list;
refcount_t ref_count;
int id;
};
@@ -61,12 +55,6 @@ int mlxsw_sp_span_init(struct mlxsw_sp *mlxsw_sp);
void mlxsw_sp_span_fini(struct mlxsw_sp *mlxsw_sp);
void mlxsw_sp_span_respin(struct mlxsw_sp *mlxsw_sp);
-int mlxsw_sp_span_mirror_add(struct mlxsw_sp_port *from,
- const struct net_device *to_dev,
- enum mlxsw_sp_span_type type,
- bool bind, int *p_span_id);
-void mlxsw_sp_span_mirror_del(struct mlxsw_sp_port *from, int span_id,
- enum mlxsw_sp_span_type type, bool bind);
struct mlxsw_sp_span_entry *
mlxsw_sp_span_entry_find_by_port(struct mlxsw_sp *mlxsw_sp,
const struct net_device *to_dev);
@@ -77,4 +65,21 @@ void mlxsw_sp_span_entry_invalidate(struct mlxsw_sp *mlxsw_sp,
int mlxsw_sp_span_port_mtu_update(struct mlxsw_sp_port *port, u16 mtu);
void mlxsw_sp_span_speed_update_work(struct work_struct *work);
+int mlxsw_sp_span_agent_get(struct mlxsw_sp *mlxsw_sp,
+ const struct net_device *to_dev, int *p_span_id);
+void mlxsw_sp_span_agent_put(struct mlxsw_sp *mlxsw_sp, int span_id);
+int mlxsw_sp_span_analyzed_port_get(struct mlxsw_sp_port *mlxsw_sp_port,
+ bool ingress);
+void mlxsw_sp_span_analyzed_port_put(struct mlxsw_sp_port *mlxsw_sp_port,
+ bool ingress);
+int mlxsw_sp_span_agent_bind(struct mlxsw_sp *mlxsw_sp,
+ enum mlxsw_sp_span_trigger trigger,
+ struct mlxsw_sp_port *mlxsw_sp_port,
+ const struct mlxsw_sp_span_trigger_parms *parms);
+void
+mlxsw_sp_span_agent_unbind(struct mlxsw_sp *mlxsw_sp,
+ enum mlxsw_sp_span_trigger trigger,
+ struct mlxsw_sp_port *mlxsw_sp_port,
+ const struct mlxsw_sp_span_trigger_parms *parms);
+
#endif
diff --git a/drivers/net/ethernet/stmicro/stmmac/Makefile b/drivers/net/ethernet/stmicro/stmmac/Makefile
index 5a6f265..f9d024d 100644
--- a/drivers/net/ethernet/stmicro/stmmac/Makefile
+++ b/drivers/net/ethernet/stmicro/stmmac/Makefile
@@ -30,6 +30,6 @@
stmmac-platform-objs:= stmmac_platform.o
dwmac-altr-socfpga-objs := altr_tse_pcs.o dwmac-socfpga.o
-obj-$(CONFIG_DWMAC_INTEL) += dwmac-intel.o
-obj-$(CONFIG_STMMAC_PCI) += stmmac-pci.o
+obj-$(CONFIG_STMMAC_PCI) += stmmac-pci.o
+obj-$(CONFIG_DWMAC_INTEL) += dwmac-intel.o
stmmac-pci-objs:= stmmac_pci.o
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c
index 2e4aaed..2ac9dfb 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-intel.c
@@ -83,13 +83,9 @@ static int intel_serdes_powerup(struct net_device *ndev, void *priv_data)
serdes_phy_addr = intel_priv->mdio_adhoc_addr;
/* assert clk_req */
- data = mdiobus_read(priv->mii, serdes_phy_addr,
- SERDES_GCR0);
-
+ data = mdiobus_read(priv->mii, serdes_phy_addr, SERDES_GCR0);
data |= SERDES_PLL_CLK;
-
- mdiobus_write(priv->mii, serdes_phy_addr,
- SERDES_GCR0, data);
+ mdiobus_write(priv->mii, serdes_phy_addr, SERDES_GCR0, data);
/* check for clk_ack assertion */
data = serdes_status_poll(priv, serdes_phy_addr,
@@ -103,13 +99,9 @@ static int intel_serdes_powerup(struct net_device *ndev, void *priv_data)
}
/* assert lane reset */
- data = mdiobus_read(priv->mii, serdes_phy_addr,
- SERDES_GCR0);
-
+ data = mdiobus_read(priv->mii, serdes_phy_addr, SERDES_GCR0);
data |= SERDES_RST;
-
- mdiobus_write(priv->mii, serdes_phy_addr,
- SERDES_GCR0, data);
+ mdiobus_write(priv->mii, serdes_phy_addr, SERDES_GCR0, data);
/* check for assert lane reset reflection */
data = serdes_status_poll(priv, serdes_phy_addr,
@@ -123,14 +115,12 @@ static int intel_serdes_powerup(struct net_device *ndev, void *priv_data)
}
/* move power state to P0 */
- data = mdiobus_read(priv->mii, serdes_phy_addr,
- SERDES_GCR0);
+ data = mdiobus_read(priv->mii, serdes_phy_addr, SERDES_GCR0);
data &= ~SERDES_PWR_ST_MASK;
data |= SERDES_PWR_ST_P0 << SERDES_PWR_ST_SHIFT;
- mdiobus_write(priv->mii, serdes_phy_addr,
- SERDES_GCR0, data);
+ mdiobus_write(priv->mii, serdes_phy_addr, SERDES_GCR0, data);
/* Check for P0 state */
data = serdes_status_poll(priv, serdes_phy_addr,
@@ -159,14 +149,12 @@ static void intel_serdes_powerdown(struct net_device *ndev, void *intel_data)
serdes_phy_addr = intel_priv->mdio_adhoc_addr;
/* move power state to P3 */
- data = mdiobus_read(priv->mii, serdes_phy_addr,
- SERDES_GCR0);
+ data = mdiobus_read(priv->mii, serdes_phy_addr, SERDES_GCR0);
data &= ~SERDES_PWR_ST_MASK;
data |= SERDES_PWR_ST_P3 << SERDES_PWR_ST_SHIFT;
- mdiobus_write(priv->mii, serdes_phy_addr,
- SERDES_GCR0, data);
+ mdiobus_write(priv->mii, serdes_phy_addr, SERDES_GCR0, data);
/* Check for P3 state */
data = serdes_status_poll(priv, serdes_phy_addr,
@@ -180,13 +168,9 @@ static void intel_serdes_powerdown(struct net_device *ndev, void *intel_data)
}
/* de-assert clk_req */
- data = mdiobus_read(priv->mii, serdes_phy_addr,
- SERDES_GCR0);
-
+ data = mdiobus_read(priv->mii, serdes_phy_addr, SERDES_GCR0);
data &= ~SERDES_PLL_CLK;
-
- mdiobus_write(priv->mii, serdes_phy_addr,
- SERDES_GCR0, data);
+ mdiobus_write(priv->mii, serdes_phy_addr, SERDES_GCR0, data);
/* check for clk_ack de-assert */
data = serdes_status_poll(priv, serdes_phy_addr,
@@ -200,13 +184,9 @@ static void intel_serdes_powerdown(struct net_device *ndev, void *intel_data)
}
/* de-assert lane reset */
- data = mdiobus_read(priv->mii, serdes_phy_addr,
- SERDES_GCR0);
-
+ data = mdiobus_read(priv->mii, serdes_phy_addr, SERDES_GCR0);
data &= ~SERDES_RST;
-
- mdiobus_write(priv->mii, serdes_phy_addr,
- SERDES_GCR0, data);
+ mdiobus_write(priv->mii, serdes_phy_addr, SERDES_GCR0, data);
/* check for de-assert lane reset reflection */
data = serdes_status_poll(priv, serdes_phy_addr,
@@ -252,6 +232,7 @@ static void common_default_data(struct plat_stmmacenet_data *plat)
static int intel_mgbe_common_data(struct pci_dev *pdev,
struct plat_stmmacenet_data *plat)
{
+ int ret;
int i;
plat->clk_csr = 5;
@@ -324,7 +305,12 @@ static int intel_mgbe_common_data(struct pci_dev *pdev,
dev_warn(&pdev->dev, "Fail to register stmmac-clk\n");
plat->stmmac_clk = NULL;
}
- clk_prepare_enable(plat->stmmac_clk);
+
+ ret = clk_prepare_enable(plat->stmmac_clk);
+ if (ret) {
+ clk_unregister_fixed_rate(plat->stmmac_clk);
+ return ret;
+ }
/* Set default value for multicast hash bins */
plat->multicast_filter_bins = HASH_TABLE_SIZE;
@@ -341,16 +327,11 @@ static int intel_mgbe_common_data(struct pci_dev *pdev,
static int ehl_common_data(struct pci_dev *pdev,
struct plat_stmmacenet_data *plat)
{
- int ret;
-
plat->rx_queues_to_use = 8;
plat->tx_queues_to_use = 8;
plat->clk_ptp_rate = 200000000;
- ret = intel_mgbe_common_data(pdev, plat);
- if (ret)
- return ret;
- return 0;
+ return intel_mgbe_common_data(pdev, plat);
}
static int ehl_sgmii_data(struct pci_dev *pdev,
@@ -366,7 +347,7 @@ static int ehl_sgmii_data(struct pci_dev *pdev,
return ehl_common_data(pdev, plat);
}
-static struct stmmac_pci_info ehl_sgmii1g_pci_info = {
+static struct stmmac_pci_info ehl_sgmii1g_info = {
.setup = ehl_sgmii_data,
};
@@ -380,7 +361,7 @@ static int ehl_rgmii_data(struct pci_dev *pdev,
return ehl_common_data(pdev, plat);
}
-static struct stmmac_pci_info ehl_rgmii1g_pci_info = {
+static struct stmmac_pci_info ehl_rgmii1g_info = {
.setup = ehl_rgmii_data,
};
@@ -399,7 +380,7 @@ static int ehl_pse0_rgmii1g_data(struct pci_dev *pdev,
return ehl_pse0_common_data(pdev, plat);
}
-static struct stmmac_pci_info ehl_pse0_rgmii1g_pci_info = {
+static struct stmmac_pci_info ehl_pse0_rgmii1g_info = {
.setup = ehl_pse0_rgmii1g_data,
};
@@ -412,7 +393,7 @@ static int ehl_pse0_sgmii1g_data(struct pci_dev *pdev,
return ehl_pse0_common_data(pdev, plat);
}
-static struct stmmac_pci_info ehl_pse0_sgmii1g_pci_info = {
+static struct stmmac_pci_info ehl_pse0_sgmii1g_info = {
.setup = ehl_pse0_sgmii1g_data,
};
@@ -431,7 +412,7 @@ static int ehl_pse1_rgmii1g_data(struct pci_dev *pdev,
return ehl_pse1_common_data(pdev, plat);
}
-static struct stmmac_pci_info ehl_pse1_rgmii1g_pci_info = {
+static struct stmmac_pci_info ehl_pse1_rgmii1g_info = {
.setup = ehl_pse1_rgmii1g_data,
};
@@ -444,23 +425,18 @@ static int ehl_pse1_sgmii1g_data(struct pci_dev *pdev,
return ehl_pse1_common_data(pdev, plat);
}
-static struct stmmac_pci_info ehl_pse1_sgmii1g_pci_info = {
+static struct stmmac_pci_info ehl_pse1_sgmii1g_info = {
.setup = ehl_pse1_sgmii1g_data,
};
static int tgl_common_data(struct pci_dev *pdev,
struct plat_stmmacenet_data *plat)
{
- int ret;
-
plat->rx_queues_to_use = 6;
plat->tx_queues_to_use = 4;
plat->clk_ptp_rate = 200000000;
- ret = intel_mgbe_common_data(pdev, plat);
- if (ret)
- return ret;
- return 0;
+ return intel_mgbe_common_data(pdev, plat);
}
static int tgl_sgmii_data(struct pci_dev *pdev,
@@ -474,7 +450,7 @@ static int tgl_sgmii_data(struct pci_dev *pdev,
return tgl_common_data(pdev, plat);
}
-static struct stmmac_pci_info tgl_sgmii1g_pci_info = {
+static struct stmmac_pci_info tgl_sgmii1g_info = {
.setup = tgl_sgmii_data,
};
@@ -577,7 +553,7 @@ static int quark_default_data(struct pci_dev *pdev,
return 0;
}
-static const struct stmmac_pci_info quark_pci_info = {
+static const struct stmmac_pci_info quark_info = {
.setup = quark_default_data,
};
@@ -600,11 +576,9 @@ static int intel_eth_pci_probe(struct pci_dev *pdev,
struct intel_priv_data *intel_priv;
struct plat_stmmacenet_data *plat;
struct stmmac_resources res;
- int i;
int ret;
- intel_priv = devm_kzalloc(&pdev->dev, sizeof(*intel_priv),
- GFP_KERNEL);
+ intel_priv = devm_kzalloc(&pdev->dev, sizeof(*intel_priv), GFP_KERNEL);
if (!intel_priv)
return -ENOMEM;
@@ -631,15 +605,9 @@ static int intel_eth_pci_probe(struct pci_dev *pdev,
return ret;
}
- /* Get the base address of device */
- for (i = 0; i < PCI_STD_NUM_BARS; i++) {
- if (pci_resource_len(pdev, i) == 0)
- continue;
- ret = pcim_iomap_regions(pdev, BIT(i), pci_name(pdev));
- if (ret)
- return ret;
- break;
- }
+ ret = pcim_iomap_regions(pdev, BIT(0), pci_name(pdev));
+ if (ret)
+ return ret;
pci_set_master(pdev);
@@ -650,14 +618,23 @@ static int intel_eth_pci_probe(struct pci_dev *pdev,
if (ret)
return ret;
- pci_enable_msi(pdev);
+ ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
+ if (ret < 0)
+ return ret;
memset(&res, 0, sizeof(res));
- res.addr = pcim_iomap_table(pdev)[i];
- res.wol_irq = pdev->irq;
- res.irq = pdev->irq;
+ res.addr = pcim_iomap_table(pdev)[0];
+ res.wol_irq = pci_irq_vector(pdev, 0);
+ res.irq = pci_irq_vector(pdev, 0);
- return stmmac_dvr_probe(&pdev->dev, plat, &res);
+ ret = stmmac_dvr_probe(&pdev->dev, plat, &res);
+ if (ret) {
+ pci_free_irq_vectors(pdev);
+ clk_disable_unprepare(plat->stmmac_clk);
+ clk_unregister_fixed_rate(plat->stmmac_clk);
+ }
+
+ return ret;
}
/**
@@ -671,19 +648,15 @@ static void intel_eth_pci_remove(struct pci_dev *pdev)
{
struct net_device *ndev = dev_get_drvdata(&pdev->dev);
struct stmmac_priv *priv = netdev_priv(ndev);
- int i;
stmmac_dvr_remove(&pdev->dev);
- if (priv->plat->stmmac_clk)
- clk_unregister_fixed_rate(priv->plat->stmmac_clk);
+ pci_free_irq_vectors(pdev);
- for (i = 0; i < PCI_STD_NUM_BARS; i++) {
- if (pci_resource_len(pdev, i) == 0)
- continue;
- pcim_iounmap_regions(pdev, BIT(i));
- break;
- }
+ clk_disable_unprepare(priv->plat->stmmac_clk);
+ clk_unregister_fixed_rate(priv->plat->stmmac_clk);
+
+ pcim_iounmap_regions(pdev, BIT(0));
pci_disable_device(pdev);
}
@@ -742,26 +715,19 @@ static SIMPLE_DEV_PM_OPS(intel_eth_pm_ops, intel_eth_pci_suspend,
#define PCI_DEVICE_ID_INTEL_TGL_SGMII1G_ID 0xa0ac
static const struct pci_device_id intel_eth_pci_id_table[] = {
- { PCI_DEVICE_DATA(INTEL, QUARK_ID, &quark_pci_info) },
- { PCI_DEVICE_DATA(INTEL, EHL_RGMII1G_ID, &ehl_rgmii1g_pci_info) },
- { PCI_DEVICE_DATA(INTEL, EHL_SGMII1G_ID, &ehl_sgmii1g_pci_info) },
- { PCI_DEVICE_DATA(INTEL, EHL_SGMII2G5_ID, &ehl_sgmii1g_pci_info) },
- { PCI_DEVICE_DATA(INTEL, EHL_PSE0_RGMII1G_ID,
- &ehl_pse0_rgmii1g_pci_info) },
- { PCI_DEVICE_DATA(INTEL, EHL_PSE0_SGMII1G_ID,
- &ehl_pse0_sgmii1g_pci_info) },
- { PCI_DEVICE_DATA(INTEL, EHL_PSE0_SGMII2G5_ID,
- &ehl_pse0_sgmii1g_pci_info) },
- { PCI_DEVICE_DATA(INTEL, EHL_PSE1_RGMII1G_ID,
- &ehl_pse1_rgmii1g_pci_info) },
- { PCI_DEVICE_DATA(INTEL, EHL_PSE1_SGMII1G_ID,
- &ehl_pse1_sgmii1g_pci_info) },
- { PCI_DEVICE_DATA(INTEL, EHL_PSE1_SGMII2G5_ID,
- &ehl_pse1_sgmii1g_pci_info) },
- { PCI_DEVICE_DATA(INTEL, TGL_SGMII1G_ID, &tgl_sgmii1g_pci_info) },
+ { PCI_DEVICE_DATA(INTEL, QUARK_ID, &quark_info) },
+ { PCI_DEVICE_DATA(INTEL, EHL_RGMII1G_ID, &ehl_rgmii1g_info) },
+ { PCI_DEVICE_DATA(INTEL, EHL_SGMII1G_ID, &ehl_sgmii1g_info) },
+ { PCI_DEVICE_DATA(INTEL, EHL_SGMII2G5_ID, &ehl_sgmii1g_info) },
+ { PCI_DEVICE_DATA(INTEL, EHL_PSE0_RGMII1G_ID, &ehl_pse0_rgmii1g_info) },
+ { PCI_DEVICE_DATA(INTEL, EHL_PSE0_SGMII1G_ID, &ehl_pse0_sgmii1g_info) },
+ { PCI_DEVICE_DATA(INTEL, EHL_PSE0_SGMII2G5_ID, &ehl_pse0_sgmii1g_info) },
+ { PCI_DEVICE_DATA(INTEL, EHL_PSE1_RGMII1G_ID, &ehl_pse1_rgmii1g_info) },
+ { PCI_DEVICE_DATA(INTEL, EHL_PSE1_SGMII1G_ID, &ehl_pse1_sgmii1g_info) },
+ { PCI_DEVICE_DATA(INTEL, EHL_PSE1_SGMII2G5_ID, &ehl_pse1_sgmii1g_info) },
+ { PCI_DEVICE_DATA(INTEL, TGL_SGMII1G_ID, &tgl_sgmii1g_info) },
{}
};
-
MODULE_DEVICE_TABLE(pci, intel_eth_pci_id_table);
static struct pci_driver intel_eth_pci_driver = {
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 565da64..ff22f27 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -4991,7 +4991,7 @@ int stmmac_dvr_probe(struct device *device,
priv->plat->bsp_priv);
if (ret < 0)
- return ret;
+ goto error_serdes_powerup;
}
#ifdef CONFIG_DEBUG_FS
@@ -5000,6 +5000,8 @@ int stmmac_dvr_probe(struct device *device,
return ret;
+error_serdes_powerup:
+ unregister_netdev(ndev);
error_netdev_register:
phylink_destroy(priv->phylink);
error_phy_setup:
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
index 3fb21f7..272cb47 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_pci.c
@@ -217,15 +217,10 @@ static int stmmac_pci_probe(struct pci_dev *pdev,
*/
static void stmmac_pci_remove(struct pci_dev *pdev)
{
- struct net_device *ndev = dev_get_drvdata(&pdev->dev);
- struct stmmac_priv *priv = netdev_priv(ndev);
int i;
stmmac_dvr_remove(&pdev->dev);
- if (priv->plat->stmmac_clk)
- clk_unregister_fixed_rate(priv->plat->stmmac_clk);
-
for (i = 0; i < PCI_STD_NUM_BARS; i++) {
if (pci_resource_len(pdev, i) == 0)
continue;
diff --git a/drivers/net/ethernet/toshiba/ps3_gelic_net.c b/drivers/net/ethernet/toshiba/ps3_gelic_net.c
index 070dd6f..310e683 100644
--- a/drivers/net/ethernet/toshiba/ps3_gelic_net.c
+++ b/drivers/net/ethernet/toshiba/ps3_gelic_net.c
@@ -1150,7 +1150,7 @@ static irqreturn_t gelic_card_interrupt(int irq, void *ptr)
* gelic_net_poll_controller - artificial interrupt for netconsole etc.
* @netdev: interface device structure
*
- * see Documentation/networking/netconsole.txt
+ * see Documentation/networking/netconsole.rst
*/
void gelic_net_poll_controller(struct net_device *netdev)
{
diff --git a/drivers/net/ethernet/toshiba/spider_net.c b/drivers/net/ethernet/toshiba/spider_net.c
index 6576271..3902b3a 100644
--- a/drivers/net/ethernet/toshiba/spider_net.c
+++ b/drivers/net/ethernet/toshiba/spider_net.c
@@ -1615,7 +1615,7 @@ spider_net_interrupt(int irq, void *ptr)
* spider_net_poll_controller - artificial interrupt for netconsole etc.
* @netdev: interface device structure
*
- * see Documentation/networking/netconsole.txt
+ * see Documentation/networking/netconsole.rst
*/
static void
spider_net_poll_controller(struct net_device *netdev)
diff --git a/drivers/net/fddi/Kconfig b/drivers/net/fddi/Kconfig
index 3b412a5..da4f58e 100644
--- a/drivers/net/fddi/Kconfig
+++ b/drivers/net/fddi/Kconfig
@@ -77,7 +77,7 @@
- Netelligent 100 FDDI SAS UTP
- Netelligent 100 FDDI SAS Fibre MIC
- Read <file:Documentation/networking/skfp.txt> for information about
+ Read <file:Documentation/networking/skfp.rst> for information about
the driver.
Questions concerning this driver can be addressed to:
diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index bacfee4..2a32f26 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -157,6 +157,13 @@
This is library mode.
+config MDIO_IPQ4019
+ tristate "Qualcomm IPQ4019 MDIO interface support"
+ depends on HAS_IOMEM && OF_MDIO
+ help
+ This driver supports the MDIO interface found in Qualcomm
+ IPQ40xx series Soc-s.
+
config MDIO_IPQ8064
tristate "Qualcomm IPQ8064 MDIO interface support"
depends on HAS_IOMEM && OF_MDIO
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index cd345b7..dc9e53b 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -37,6 +37,7 @@
obj-$(CONFIG_MDIO_GPIO) += mdio-gpio.o
obj-$(CONFIG_MDIO_HISI_FEMAC) += mdio-hisi-femac.o
obj-$(CONFIG_MDIO_I2C) += mdio-i2c.o
+obj-$(CONFIG_MDIO_IPQ4019) += mdio-ipq4019.o
obj-$(CONFIG_MDIO_IPQ8064) += mdio-ipq8064.o
obj-$(CONFIG_MDIO_MOXART) += mdio-moxart.o
obj-$(CONFIG_MDIO_MSCC_MIIM) += mdio-mscc-miim.o
diff --git a/drivers/net/phy/mdio-ipq4019.c b/drivers/net/phy/mdio-ipq4019.c
new file mode 100644
index 0000000..1ce81ff
--- /dev/null
+++ b/drivers/net/phy/mdio-ipq4019.c
@@ -0,0 +1,160 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+/* Copyright (c) 2015, The Linux Foundation. All rights reserved. */
+/* Copyright (c) 2020 Sartura Ltd. */
+
+#include <linux/delay.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/io.h>
+#include <linux/iopoll.h>
+#include <linux/of_address.h>
+#include <linux/of_mdio.h>
+#include <linux/phy.h>
+#include <linux/platform_device.h>
+
+#define MDIO_ADDR_REG 0x44
+#define MDIO_DATA_WRITE_REG 0x48
+#define MDIO_DATA_READ_REG 0x4c
+#define MDIO_CMD_REG 0x50
+#define MDIO_CMD_ACCESS_BUSY BIT(16)
+#define MDIO_CMD_ACCESS_START BIT(8)
+#define MDIO_CMD_ACCESS_CODE_READ 0
+#define MDIO_CMD_ACCESS_CODE_WRITE 1
+
+#define ipq4019_MDIO_TIMEOUT 10000
+#define ipq4019_MDIO_SLEEP 10
+
+struct ipq4019_mdio_data {
+ void __iomem *membase;
+};
+
+static int ipq4019_mdio_wait_busy(struct mii_bus *bus)
+{
+ struct ipq4019_mdio_data *priv = bus->priv;
+ unsigned int busy;
+
+ return readl_poll_timeout(priv->membase + MDIO_CMD_REG, busy,
+ (busy & MDIO_CMD_ACCESS_BUSY) == 0,
+ ipq4019_MDIO_SLEEP, ipq4019_MDIO_TIMEOUT);
+}
+
+static int ipq4019_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
+{
+ struct ipq4019_mdio_data *priv = bus->priv;
+ unsigned int cmd;
+
+ /* Reject clause 45 */
+ if (regnum & MII_ADDR_C45)
+ return -EOPNOTSUPP;
+
+ if (ipq4019_mdio_wait_busy(bus))
+ return -ETIMEDOUT;
+
+ /* issue the phy address and reg */
+ writel((mii_id << 8) | regnum, priv->membase + MDIO_ADDR_REG);
+
+ cmd = MDIO_CMD_ACCESS_START | MDIO_CMD_ACCESS_CODE_READ;
+
+ /* issue read command */
+ writel(cmd, priv->membase + MDIO_CMD_REG);
+
+ /* Wait read complete */
+ if (ipq4019_mdio_wait_busy(bus))
+ return -ETIMEDOUT;
+
+ /* Read and return data */
+ return readl(priv->membase + MDIO_DATA_READ_REG);
+}
+
+static int ipq4019_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
+ u16 value)
+{
+ struct ipq4019_mdio_data *priv = bus->priv;
+ unsigned int cmd;
+
+ /* Reject clause 45 */
+ if (regnum & MII_ADDR_C45)
+ return -EOPNOTSUPP;
+
+ if (ipq4019_mdio_wait_busy(bus))
+ return -ETIMEDOUT;
+
+ /* issue the phy address and reg */
+ writel((mii_id << 8) | regnum, priv->membase + MDIO_ADDR_REG);
+
+ /* issue write data */
+ writel(value, priv->membase + MDIO_DATA_WRITE_REG);
+
+ cmd = MDIO_CMD_ACCESS_START | MDIO_CMD_ACCESS_CODE_WRITE;
+ /* issue write command */
+ writel(cmd, priv->membase + MDIO_CMD_REG);
+
+ /* Wait write complete */
+ if (ipq4019_mdio_wait_busy(bus))
+ return -ETIMEDOUT;
+
+ return 0;
+}
+
+static int ipq4019_mdio_probe(struct platform_device *pdev)
+{
+ struct ipq4019_mdio_data *priv;
+ struct mii_bus *bus;
+ int ret;
+
+ bus = devm_mdiobus_alloc_size(&pdev->dev, sizeof(*priv));
+ if (!bus)
+ return -ENOMEM;
+
+ priv = bus->priv;
+
+ priv->membase = devm_platform_ioremap_resource(pdev, 0);
+ if (IS_ERR(priv->membase))
+ return PTR_ERR(priv->membase);
+
+ bus->name = "ipq4019_mdio";
+ bus->read = ipq4019_mdio_read;
+ bus->write = ipq4019_mdio_write;
+ bus->parent = &pdev->dev;
+ snprintf(bus->id, MII_BUS_ID_SIZE, "%s%d", pdev->name, pdev->id);
+
+ ret = of_mdiobus_register(bus, pdev->dev.of_node);
+ if (ret) {
+ dev_err(&pdev->dev, "Cannot register MDIO bus!\n");
+ return ret;
+ }
+
+ platform_set_drvdata(pdev, bus);
+
+ return 0;
+}
+
+static int ipq4019_mdio_remove(struct platform_device *pdev)
+{
+ struct mii_bus *bus = platform_get_drvdata(pdev);
+
+ mdiobus_unregister(bus);
+
+ return 0;
+}
+
+static const struct of_device_id ipq4019_mdio_dt_ids[] = {
+ { .compatible = "qcom,ipq4019-mdio" },
+ { }
+};
+MODULE_DEVICE_TABLE(of, ipq4019_mdio_dt_ids);
+
+static struct platform_driver ipq4019_mdio_driver = {
+ .probe = ipq4019_mdio_probe,
+ .remove = ipq4019_mdio_remove,
+ .driver = {
+ .name = "ipq4019-mdio",
+ .of_match_table = ipq4019_mdio_dt_ids,
+ },
+};
+
+module_platform_driver(ipq4019_mdio_driver);
+
+MODULE_DESCRIPTION("ipq4019 MDIO interface driver");
+MODULE_AUTHOR("Qualcomm Atheros");
+MODULE_LICENSE("Dual BSD/GPL");
diff --git a/drivers/net/plip/Kconfig b/drivers/net/plip/Kconfig
index b41035b..e03556d 100644
--- a/drivers/net/plip/Kconfig
+++ b/drivers/net/plip/Kconfig
@@ -21,7 +21,7 @@
bits at a time (mode 0) or with special PLIP cables, to be used on
bidirectional parallel ports only, which can transmit 8 bits at a
time (mode 1); you can find the wiring of these cables in
- <file:Documentation/networking/PLIP.txt>. The cables can be up to
+ <file:Documentation/networking/plip.rst>. The cables can be up to
15m long. Mode 0 works also if one of the machines runs DOS/Windows
and has some PLIP software installed, e.g. the Crynwr PLIP packet
driver (<http://oak.oakland.edu/simtel.net/msdos/pktdrvr-pre.html>)
diff --git a/drivers/net/rionet.c b/drivers/net/rionet.c
index 8eeb38c..2056d6a 100644
--- a/drivers/net/rionet.c
+++ b/drivers/net/rionet.c
@@ -166,7 +166,8 @@ static int rionet_queue_tx_msg(struct sk_buff *skb, struct net_device *ndev,
return 0;
}
-static int rionet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
+static netdev_tx_t rionet_start_xmit(struct sk_buff *skb,
+ struct net_device *ndev)
{
int i;
struct rionet_private *rnet = netdev_priv(ndev);
diff --git a/drivers/net/wireless/Kconfig b/drivers/net/wireless/Kconfig
index 1c98d78..15b0ad1 100644
--- a/drivers/net/wireless/Kconfig
+++ b/drivers/net/wireless/Kconfig
@@ -57,7 +57,7 @@
---help---
Say Y here if you intend to attach an Aviator/Raytheon PCMCIA
(PC-card) wireless Ethernet networking card to your computer.
- Please read the file <file:Documentation/networking/ray_cs.txt> for
+ Please read the file <file:Documentation/networking/ray_cs.rst> for
details.
To compile this driver as a module, choose M here: the module will be
diff --git a/drivers/staging/fsl-dpaa2/ethsw/README b/drivers/staging/fsl-dpaa2/ethsw/README
index f6fc07f..b48dcbf 100644
--- a/drivers/staging/fsl-dpaa2/ethsw/README
+++ b/drivers/staging/fsl-dpaa2/ethsw/README
@@ -79,7 +79,7 @@
For a more detailed description of the Ethernet switch device driver model
see:
- Documentation/networking/switchdev.txt
+ Documentation/networking/switchdev.rst
Creating an Ethernet Switch
===========================
diff --git a/include/linux/inet_diag.h b/include/linux/inet_diag.h
index ce9ed1c..0ef2d80 100644
--- a/include/linux/inet_diag.h
+++ b/include/linux/inet_diag.h
@@ -71,7 +71,11 @@ static inline size_t inet_diag_msg_attrs_size(void)
+ nla_total_size(1) /* INET_DIAG_SKV6ONLY */
#endif
+ nla_total_size(4) /* INET_DIAG_MARK */
- + nla_total_size(4); /* INET_DIAG_CLASS_ID */
+ + nla_total_size(4) /* INET_DIAG_CLASS_ID */
+#ifdef CONFIG_SOCK_CGROUP_DATA
+ + nla_total_size_64bit(sizeof(u64)) /* INET_DIAG_CGROUP_ID */
+#endif
+ ;
}
int inet_diag_msg_attrs_fill(struct sock *sk, struct sk_buff *skb,
struct inet_diag_msg *r, int ext,
diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 9d53c5a..2cc3cf8 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -89,7 +89,7 @@ enum {
* Add your fresh new feature above and remember to update
* netdev_features_strings[] in net/core/ethtool.c and maybe
* some feature mask #defines below. Please also describe it
- * in Documentation/networking/netdev-features.txt.
+ * in Documentation/networking/netdev-features.rst.
*/
/**/NETDEV_FEATURE_COUNT
diff --git a/include/net/cfg80211.h b/include/net/cfg80211.h
index 70e48f6..46ac804 100644
--- a/include/net/cfg80211.h
+++ b/include/net/cfg80211.h
@@ -5211,7 +5211,7 @@ u32 ieee80211_mandatory_rates(struct ieee80211_supported_band *sband,
* Radiotap parsing functions -- for controlled injection support
*
* Implemented in net/wireless/radiotap.c
- * Documentation in Documentation/networking/radiotap-headers.txt
+ * Documentation in Documentation/networking/radiotap-headers.rst
*/
struct radiotap_align_size {
diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index 0cca196..ca5cb3e 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
@@ -36,7 +36,7 @@ struct sock_extended_err {
*
* The timestamping interfaces SO_TIMESTAMPING, MSG_TSTAMP_*
* communicate network timestamps by passing this struct in a cmsg with
- * recvmsg(). See Documentation/networking/timestamping.txt for details.
+ * recvmsg(). See Documentation/networking/timestamping.rst for details.
* User space sees a timespec definition that matches either
* __kernel_timespec or __kernel_old_timespec, in the kernel we
* require two structure definitions to provide both.
diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h
index 57cc429..e6f183e 100644
--- a/include/uapi/linux/inet_diag.h
+++ b/include/uapi/linux/inet_diag.h
@@ -96,6 +96,7 @@ enum {
INET_DIAG_BC_MARK_COND,
INET_DIAG_BC_S_EQ,
INET_DIAG_BC_D_EQ,
+ INET_DIAG_BC_CGROUP_COND, /* u64 cgroup v2 ID */
};
struct inet_diag_hostcond {
@@ -157,6 +158,7 @@ enum {
INET_DIAG_MD5SIG,
INET_DIAG_ULP_INFO,
INET_DIAG_SK_BPF_STORAGES,
+ INET_DIAG_CGROUP_ID,
__INET_DIAG_MAX,
};
diff --git a/net/Kconfig b/net/Kconfig
index 8b1f858..c5ba2d1 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -344,7 +344,7 @@
what was just said, you don't need it: say N.
Documentation on how to use the packet generator can be found
- at <file:Documentation/networking/pktgen.txt>.
+ at <file:Documentation/networking/pktgen.rst>.
To compile this code as a module, choose M here: the
module will be called pktgen.
diff --git a/net/caif/chnl_net.c b/net/caif/chnl_net.c
index a566289..79b6a04 100644
--- a/net/caif/chnl_net.c
+++ b/net/caif/chnl_net.c
@@ -211,7 +211,8 @@ static void chnl_flowctrl_cb(struct cflayer *layr, enum caif_ctrlcmd flow,
}
}
-static int chnl_net_start_xmit(struct sk_buff *skb, struct net_device *dev)
+static netdev_tx_t chnl_net_start_xmit(struct sk_buff *skb,
+ struct net_device *dev)
{
struct chnl_net *priv;
struct cfpkt *pkt = NULL;
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 08e2811..b53b6d3 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -56,7 +56,7 @@
* Integrated to 2.5.x 021029 --Lucio Maciel (luciomaciel@zipmail.com.br)
*
* 021124 Finished major redesign and rewrite for new functionality.
- * See Documentation/networking/pktgen.txt for how to use this.
+ * See Documentation/networking/pktgen.rst for how to use this.
*
* The new operation:
* For each CPU one thread/process is created at start. This process checks
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 5d50aad..0034092 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -43,6 +43,9 @@ struct inet_diag_entry {
u16 userlocks;
u32 ifindex;
u32 mark;
+#ifdef CONFIG_SOCK_CGROUP_DATA
+ u64 cgroup_id;
+#endif
};
static DEFINE_MUTEX(inet_diag_table_mutex);
@@ -162,6 +165,13 @@ int inet_diag_msg_attrs_fill(struct sock *sk, struct sk_buff *skb,
goto errout;
}
+#ifdef CONFIG_SOCK_CGROUP_DATA
+ if (nla_put_u64_64bit(skb, INET_DIAG_CGROUP_ID,
+ cgroup_id(sock_cgroup_ptr(&sk->sk_cgrp_data)),
+ INET_DIAG_PAD))
+ goto errout;
+#endif
+
r->idiag_uid = from_kuid_munged(user_ns, sock_i_uid(sk));
r->idiag_inode = sock_i_ino(sk);
@@ -675,6 +685,16 @@ static int inet_diag_bc_run(const struct nlattr *_bc,
yes = 0;
break;
}
+#ifdef CONFIG_SOCK_CGROUP_DATA
+ case INET_DIAG_BC_CGROUP_COND: {
+ u64 cgroup_id;
+
+ cgroup_id = get_unaligned((const u64 *)(op + 1));
+ if (cgroup_id != entry->cgroup_id)
+ yes = 0;
+ break;
+ }
+#endif
}
if (yes) {
@@ -725,6 +745,9 @@ int inet_diag_bc_sk(const struct nlattr *bc, struct sock *sk)
entry.mark = inet_rsk(inet_reqsk(sk))->ir_mark;
else
entry.mark = 0;
+#ifdef CONFIG_SOCK_CGROUP_DATA
+ entry.cgroup_id = cgroup_id(sock_cgroup_ptr(&sk->sk_cgrp_data));
+#endif
return inet_diag_bc_run(bc, &entry);
}
@@ -814,6 +837,15 @@ static bool valid_markcond(const struct inet_diag_bc_op *op, int len,
return len >= *min_len;
}
+#ifdef CONFIG_SOCK_CGROUP_DATA
+static bool valid_cgroupcond(const struct inet_diag_bc_op *op, int len,
+ int *min_len)
+{
+ *min_len += sizeof(u64);
+ return len >= *min_len;
+}
+#endif
+
static int inet_diag_bc_audit(const struct nlattr *attr,
const struct sk_buff *skb)
{
@@ -856,6 +888,12 @@ static int inet_diag_bc_audit(const struct nlattr *attr,
if (!valid_markcond(bc, len, &min_len))
return -EINVAL;
break;
+#ifdef CONFIG_SOCK_CGROUP_DATA
+ case INET_DIAG_BC_CGROUP_COND:
+ if (!valid_cgroupcond(bc, len, &min_len))
+ return -EINVAL;
+ break;
+#endif
case INET_DIAG_BC_AUTO:
case INET_DIAG_BC_JMP:
case INET_DIAG_BC_NOP:
diff --git a/net/lapb/Kconfig b/net/lapb/Kconfig
index 6acfc99..5b50e8d 100644
--- a/net/lapb/Kconfig
+++ b/net/lapb/Kconfig
@@ -15,7 +15,7 @@
currently supports LAPB only over Ethernet connections. If you want
to use LAPB connections over Ethernet, say Y here and to "LAPB over
Ethernet driver" below. Read
- <file:Documentation/networking/lapb-module.txt> for technical
+ <file:Documentation/networking/lapb-module.rst> for technical
details.
To compile this driver as a module, choose M here: the
diff --git a/net/mac80211/tx.c b/net/mac80211/tx.c
index 82846ac..9849c14 100644
--- a/net/mac80211/tx.c
+++ b/net/mac80211/tx.c
@@ -2144,7 +2144,7 @@ static bool ieee80211_parse_tx_radiotap(struct ieee80211_local *local,
/*
* Please update the file
- * Documentation/networking/mac80211-injection.txt
+ * Documentation/networking/mac80211-injection.rst
* when parsing new fields here.
*/
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 468fea1..3a3915d 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -1043,7 +1043,7 @@
on Netfilter connection tracking and NAT, unlike REDIRECT.
For it to work you will have to configure certain iptables rules
and use policy routing. For more information on how to set it up
- see Documentation/networking/tproxy.txt.
+ see Documentation/networking/tproxy.rst.
To compile it as a module, choose M here. If unsure, say N.
diff --git a/net/rxrpc/Kconfig b/net/rxrpc/Kconfig
index 57ebb29..d706bb4 100644
--- a/net/rxrpc/Kconfig
+++ b/net/rxrpc/Kconfig
@@ -18,7 +18,7 @@
This module at the moment only supports client operations and is
currently incomplete.
- See Documentation/networking/rxrpc.txt.
+ See Documentation/networking/rxrpc.rst.
config AF_RXRPC_IPV6
bool "IPv6 support for RxRPC"
@@ -41,7 +41,7 @@
help
Say Y here to make runtime controllable debugging messages appear.
- See Documentation/networking/rxrpc.txt.
+ See Documentation/networking/rxrpc.rst.
config RXKAD
@@ -56,4 +56,4 @@
Provide kerberos 4 and AFS kaserver security handling for AF_RXRPC
through the use of the key retention service.
- See Documentation/networking/rxrpc.txt.
+ See Documentation/networking/rxrpc.rst.
diff --git a/net/rxrpc/sysctl.c b/net/rxrpc/sysctl.c
index 2bbb381..174e903 100644
--- a/net/rxrpc/sysctl.c
+++ b/net/rxrpc/sysctl.c
@@ -21,7 +21,7 @@ static const unsigned long max_jiffies = MAX_JIFFY_OFFSET;
/*
* RxRPC operating parameters.
*
- * See Documentation/networking/rxrpc.txt and the variable definitions for more
+ * See Documentation/networking/rxrpc.rst and the variable definitions for more
* information on the individual parameters.
*/
static struct ctl_table rxrpc_sysctl_table[] = {
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index e859e3f..bd9662d 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -382,22 +382,24 @@ static int smcr_lgr_reg_rmbs(struct smc_link_group *lgr,
static int smcr_clnt_conf_first_link(struct smc_sock *smc)
{
struct smc_link *link = smc->conn.lnk;
- int rest;
+ struct smc_llc_qentry *qentry;
int rc;
+ link->lgr->type = SMC_LGR_SINGLE;
+
/* receive CONFIRM LINK request from server over RoCE fabric */
- rest = wait_for_completion_interruptible_timeout(
- &link->llc_confirm,
- SMC_LLC_WAIT_FIRST_TIME);
- if (rest <= 0) {
+ qentry = smc_llc_wait(link->lgr, NULL, SMC_LLC_WAIT_TIME,
+ SMC_LLC_CONFIRM_LINK);
+ if (!qentry) {
struct smc_clc_msg_decline dclc;
rc = smc_clc_wait_msg(smc, &dclc, sizeof(dclc),
SMC_CLC_DECLINE, CLC_WAIT_TIME_SHORT);
return rc == -EAGAIN ? SMC_CLC_DECL_TIMEOUT_CL : rc;
}
-
- if (link->llc_confirm_rc)
+ rc = smc_llc_eval_conf_link(qentry, SMC_LLC_REQ);
+ smc_llc_flow_qentry_del(&link->lgr->llc_flow_lcl);
+ if (rc)
return SMC_CLC_DECL_RMBE_EC;
rc = smc_ib_modify_qp_rts(link);
@@ -409,31 +411,30 @@ static int smcr_clnt_conf_first_link(struct smc_sock *smc)
if (smcr_link_reg_rmb(link, smc->conn.rmb_desc, false))
return SMC_CLC_DECL_ERR_REGRMB;
+ /* confirm_rkey is implicit on 1st contact */
+ smc->conn.rmb_desc->is_conf_rkey = true;
+
/* send CONFIRM LINK response over RoCE fabric */
rc = smc_llc_send_confirm_link(link, SMC_LLC_RESP);
if (rc < 0)
return SMC_CLC_DECL_TIMEOUT_CL;
- /* receive ADD LINK request from server over RoCE fabric */
- rest = wait_for_completion_interruptible_timeout(&link->llc_add,
- SMC_LLC_WAIT_TIME);
- if (rest <= 0) {
+ smc_llc_link_active(link);
+
+ /* optional 2nd link, receive ADD LINK request from server */
+ qentry = smc_llc_wait(link->lgr, NULL, SMC_LLC_WAIT_TIME,
+ SMC_LLC_ADD_LINK);
+ if (!qentry) {
struct smc_clc_msg_decline dclc;
rc = smc_clc_wait_msg(smc, &dclc, sizeof(dclc),
SMC_CLC_DECLINE, CLC_WAIT_TIME_SHORT);
- return rc == -EAGAIN ? SMC_CLC_DECL_TIMEOUT_AL : rc;
+ if (rc == -EAGAIN)
+ rc = 0; /* no DECLINE received, go with one link */
+ return rc;
}
-
- /* send add link reject message, only one link supported for now */
- rc = smc_llc_send_add_link(link,
- link->smcibdev->mac[link->ibport - 1],
- link->gid, SMC_LLC_RESP);
- if (rc < 0)
- return SMC_CLC_DECL_TIMEOUT_AL;
-
- smc_llc_link_active(link);
-
+ smc_llc_flow_qentry_clr(&link->lgr->llc_flow_lcl);
+ /* tbd: call smc_llc_cli_add_link(link, qentry); */
return 0;
}
@@ -613,8 +614,8 @@ static int smc_connect_rdma(struct smc_sock *smc,
struct smc_clc_msg_accept_confirm *aclc,
struct smc_init_info *ini)
{
+ int i, reason_code = 0;
struct smc_link *link;
- int reason_code = 0;
ini->is_smcd = false;
ini->ib_lcl = &aclc->lcl;
@@ -627,10 +628,28 @@ static int smc_connect_rdma(struct smc_sock *smc,
mutex_unlock(&smc_client_lgr_pending);
return reason_code;
}
- link = smc->conn.lnk;
smc_conn_save_peer_info(smc, aclc);
+ if (ini->cln_first_contact == SMC_FIRST_CONTACT) {
+ link = smc->conn.lnk;
+ } else {
+ /* set link that was assigned by server */
+ link = NULL;
+ for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
+ struct smc_link *l = &smc->conn.lgr->lnk[i];
+
+ if (l->peer_qpn == ntoh24(aclc->qpn)) {
+ link = l;
+ break;
+ }
+ }
+ if (!link)
+ return smc_connect_abort(smc, SMC_CLC_DECL_NOSRVLINK,
+ ini->cln_first_contact);
+ smc->conn.lnk = link;
+ }
+
/* create send buffer and rmb */
if (smc_buf_create(smc, false))
return smc_connect_abort(smc, SMC_CLC_DECL_MEM,
@@ -666,7 +685,9 @@ static int smc_connect_rdma(struct smc_sock *smc,
if (ini->cln_first_contact == SMC_FIRST_CONTACT) {
/* QP confirmation over RoCE fabric */
+ smc_llc_flow_initiate(link->lgr, SMC_LLC_FLOW_ADD_LINK);
reason_code = smcr_clnt_conf_first_link(smc);
+ smc_llc_flow_stop(link->lgr, &link->lgr->llc_flow_lcl);
if (reason_code)
return smc_connect_abort(smc, reason_code,
ini->cln_first_contact);
@@ -1019,9 +1040,11 @@ void smc_close_non_accepted(struct sock *sk)
static int smcr_serv_conf_first_link(struct smc_sock *smc)
{
struct smc_link *link = smc->conn.lnk;
- int rest;
+ struct smc_llc_qentry *qentry;
int rc;
+ link->lgr->type = SMC_LGR_SINGLE;
+
if (smcr_link_reg_rmb(link, smc->conn.rmb_desc, false))
return SMC_CLC_DECL_ERR_REGRMB;
@@ -1031,40 +1054,27 @@ static int smcr_serv_conf_first_link(struct smc_sock *smc)
return SMC_CLC_DECL_TIMEOUT_CL;
/* receive CONFIRM LINK response from client over the RoCE fabric */
- rest = wait_for_completion_interruptible_timeout(
- &link->llc_confirm_resp,
- SMC_LLC_WAIT_FIRST_TIME);
- if (rest <= 0) {
+ qentry = smc_llc_wait(link->lgr, link, SMC_LLC_WAIT_TIME,
+ SMC_LLC_CONFIRM_LINK);
+ if (!qentry) {
struct smc_clc_msg_decline dclc;
rc = smc_clc_wait_msg(smc, &dclc, sizeof(dclc),
SMC_CLC_DECLINE, CLC_WAIT_TIME_SHORT);
return rc == -EAGAIN ? SMC_CLC_DECL_TIMEOUT_CL : rc;
}
-
- if (link->llc_confirm_resp_rc)
+ rc = smc_llc_eval_conf_link(qentry, SMC_LLC_RESP);
+ smc_llc_flow_qentry_del(&link->lgr->llc_flow_lcl);
+ if (rc)
return SMC_CLC_DECL_RMBE_EC;
- /* send ADD LINK request to client over the RoCE fabric */
- rc = smc_llc_send_add_link(link,
- link->smcibdev->mac[link->ibport - 1],
- link->gid, SMC_LLC_REQ);
- if (rc < 0)
- return SMC_CLC_DECL_TIMEOUT_AL;
-
- /* receive ADD LINK response from client over the RoCE fabric */
- rest = wait_for_completion_interruptible_timeout(&link->llc_add_resp,
- SMC_LLC_WAIT_TIME);
- if (rest <= 0) {
- struct smc_clc_msg_decline dclc;
-
- rc = smc_clc_wait_msg(smc, &dclc, sizeof(dclc),
- SMC_CLC_DECLINE, CLC_WAIT_TIME_SHORT);
- return rc == -EAGAIN ? SMC_CLC_DECL_TIMEOUT_AL : rc;
- }
+ /* confirm_rkey is implicit on 1st contact */
+ smc->conn.rmb_desc->is_conf_rkey = true;
smc_llc_link_active(link);
+ /* initial contact - try to establish second link */
+ /* tbd: call smc_llc_srv_add_link(link); */
return 0;
}
@@ -1240,7 +1250,9 @@ static int smc_listen_rdma_finish(struct smc_sock *new_smc,
goto decline;
}
/* QP confirmation over RoCE fabric */
+ smc_llc_flow_initiate(link->lgr, SMC_LLC_FLOW_ADD_LINK);
reason_code = smcr_serv_conf_first_link(new_smc);
+ smc_llc_flow_stop(link->lgr, &link->lgr->llc_flow_lcl);
if (reason_code)
goto decline;
}
diff --git a/net/smc/smc_clc.h b/net/smc/smc_clc.h
index 4f2e150..4658767 100644
--- a/net/smc/smc_clc.h
+++ b/net/smc/smc_clc.h
@@ -45,6 +45,7 @@
#define SMC_CLC_DECL_GETVLANERR 0x03080000 /* err to get vlan id of ip device*/
#define SMC_CLC_DECL_ISMVLANERR 0x03090000 /* err to reg vlan id on ism dev */
#define SMC_CLC_DECL_NOACTLINK 0x030a0000 /* no active smc-r link in lgr */
+#define SMC_CLC_DECL_NOSRVLINK 0x030b0000 /* SMC-R link from srv not found */
#define SMC_CLC_DECL_SYNCERR 0x04000000 /* synchronization error */
#define SMC_CLC_DECL_PEERDECL 0x05000000 /* peer declined during handshake */
#define SMC_CLC_DECL_INTERR 0x09990000 /* internal error */
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index db49f8cd..3539cee 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -200,7 +200,6 @@ static int smcr_link_send_delete(struct smc_link *lnk, bool orderly)
{
if (lnk->state == SMC_LNK_ACTIVE &&
!smc_llc_send_delete_link(lnk, SMC_LLC_REQ, orderly)) {
- smc_llc_link_deleting(lnk);
return 0;
}
return -ENOTCONN;
@@ -263,6 +262,7 @@ static void smc_lgr_free_work(struct work_struct *work)
if (smc_link_usable(lnk))
lnk->state = SMC_LNK_INACTIVE;
}
+ wake_up_interruptible_all(&lgr->llc_waiter);
}
smc_lgr_free(lgr);
}
@@ -445,13 +445,11 @@ static int smc_lgr_create(struct smc_sock *smc, struct smc_init_info *ini)
}
static void smcr_buf_unuse(struct smc_buf_desc *rmb_desc,
- struct smc_link *lnk)
+ struct smc_link_group *lgr)
{
- struct smc_link_group *lgr = lnk->lgr;
-
if (rmb_desc->is_conf_rkey && !list_empty(&lgr->list)) {
/* unregister rmb with peer */
- smc_llc_do_delete_rkey(lnk, rmb_desc);
+ smc_llc_do_delete_rkey(lgr, rmb_desc);
rmb_desc->is_conf_rkey = false;
}
if (rmb_desc->is_reg_err) {
@@ -474,7 +472,7 @@ static void smc_buf_unuse(struct smc_connection *conn,
if (conn->rmb_desc && lgr->is_smcd)
conn->rmb_desc->used = 0;
else if (conn->rmb_desc)
- smcr_buf_unuse(conn->rmb_desc, conn->lnk);
+ smcr_buf_unuse(conn->rmb_desc, lgr);
}
/* remove a finished connection from its link group */
@@ -696,6 +694,7 @@ static void smc_lgr_cleanup(struct smc_link_group *lgr)
if (smc_link_usable(lnk))
lnk->state = SMC_LNK_INACTIVE;
}
+ wake_up_interruptible_all(&lgr->llc_waiter);
}
}
@@ -767,8 +766,7 @@ void smc_port_terminate(struct smc_ib_device *smcibdev, u8 ibport)
continue;
/* tbd - terminate only when no more links are active */
for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
- if (!smc_link_usable(&lgr->lnk[i]) ||
- lgr->lnk[i].state == SMC_LNK_DELETING)
+ if (!smc_link_usable(&lgr->lnk[i]))
continue;
if (lgr->lnk[i].smcibdev == smcibdev &&
lgr->lnk[i].ibport == ibport) {
@@ -1167,7 +1165,6 @@ static int smcr_buf_map_usable_links(struct smc_link_group *lgr,
if (!smc_link_usable(lnk))
continue;
if (smcr_buf_map_link(buf_desc, is_rmb, lnk)) {
- smcr_buf_unuse(buf_desc, lnk);
rc = -ENOMEM;
goto out;
}
@@ -1273,6 +1270,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
if (!is_smcd) {
if (smcr_buf_map_usable_links(lgr, buf_desc, is_rmb)) {
+ smcr_buf_unuse(buf_desc, lgr);
return -ENOMEM;
}
}
@@ -1368,6 +1366,53 @@ static inline int smc_rmb_reserve_rtoken_idx(struct smc_link_group *lgr)
return -ENOSPC;
}
+static int smc_rtoken_find_by_link(struct smc_link_group *lgr, int lnk_idx,
+ u32 rkey)
+{
+ int i;
+
+ for (i = 0; i < SMC_RMBS_PER_LGR_MAX; i++) {
+ if (test_bit(i, lgr->rtokens_used_mask) &&
+ lgr->rtokens[i][lnk_idx].rkey == rkey)
+ return i;
+ }
+ return -ENOENT;
+}
+
+/* set rtoken for a new link to an existing rmb */
+void smc_rtoken_set(struct smc_link_group *lgr, int link_idx, int link_idx_new,
+ __be32 nw_rkey_known, __be64 nw_vaddr, __be32 nw_rkey)
+{
+ int rtok_idx;
+
+ rtok_idx = smc_rtoken_find_by_link(lgr, link_idx, ntohl(nw_rkey_known));
+ if (rtok_idx == -ENOENT)
+ return;
+ lgr->rtokens[rtok_idx][link_idx_new].rkey = ntohl(nw_rkey);
+ lgr->rtokens[rtok_idx][link_idx_new].dma_addr = be64_to_cpu(nw_vaddr);
+}
+
+/* set rtoken for a new link whose link_id is given */
+void smc_rtoken_set2(struct smc_link_group *lgr, int rtok_idx, int link_id,
+ __be64 nw_vaddr, __be32 nw_rkey)
+{
+ u64 dma_addr = be64_to_cpu(nw_vaddr);
+ u32 rkey = ntohl(nw_rkey);
+ bool found = false;
+ int link_idx;
+
+ for (link_idx = 0; link_idx < SMC_LINKS_PER_LGR_MAX; link_idx++) {
+ if (lgr->lnk[link_idx].link_id == link_id) {
+ found = true;
+ break;
+ }
+ }
+ if (!found)
+ return;
+ lgr->rtokens[rtok_idx][link_idx].rkey = rkey;
+ lgr->rtokens[rtok_idx][link_idx].dma_addr = dma_addr;
+}
+
/* add a new rtoken from peer */
int smc_rtoken_add(struct smc_link *lnk, __be64 nw_vaddr, __be32 nw_rkey)
{
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index b578151..f12474c 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -36,7 +36,6 @@ enum smc_link_state { /* possible states of a link */
SMC_LNK_INACTIVE, /* link is inactive */
SMC_LNK_ACTIVATING, /* link is being activated */
SMC_LNK_ACTIVE, /* link is active */
- SMC_LNK_DELETING, /* link is being deleted */
};
#define SMC_WR_BUF_SIZE 48 /* size of work request buffer */
@@ -120,20 +119,9 @@ struct smc_link {
struct smc_link_group *lgr; /* parent link group */
enum smc_link_state state; /* state of link */
- struct completion llc_confirm; /* wait for rx of conf link */
- struct completion llc_confirm_resp; /* wait 4 rx of cnf lnk rsp */
- int llc_confirm_rc; /* rc from confirm link msg */
- int llc_confirm_resp_rc; /* rc from conf_resp msg */
- struct completion llc_add; /* wait for rx of add link */
- struct completion llc_add_resp; /* wait for rx of add link rsp*/
struct delayed_work llc_testlink_wrk; /* testlink worker */
struct completion llc_testlink_resp; /* wait for rx of testlink */
int llc_testlink_time; /* testlink interval */
- struct completion llc_confirm_rkey_resp; /* w4 rx of cnf rkey */
- int llc_confirm_rkey_resp_rc; /* rc from cnf rkey */
- struct completion llc_delete_rkey_resp; /* w4 rx of del rkey */
- int llc_delete_rkey_resp_rc; /* rc from del rkey */
- struct mutex llc_delete_rkey_mutex; /* serialize usage */
};
/* For now we just allow one parallel link per link group. The SMC protocol
@@ -197,6 +185,28 @@ struct smc_rtoken { /* address/key of remote RMB */
struct smcd_dev;
+enum smc_lgr_type { /* redundancy state of lgr */
+ SMC_LGR_NONE, /* no active links, lgr to be deleted */
+ SMC_LGR_SINGLE, /* 1 active RNIC on each peer */
+ SMC_LGR_SYMMETRIC, /* 2 active RNICs on each peer */
+ SMC_LGR_ASYMMETRIC_PEER, /* local has 2, peer 1 active RNICs */
+ SMC_LGR_ASYMMETRIC_LOCAL, /* local has 1, peer 2 active RNICs */
+};
+
+enum smc_llc_flowtype {
+ SMC_LLC_FLOW_NONE = 0,
+ SMC_LLC_FLOW_ADD_LINK = 2,
+ SMC_LLC_FLOW_DEL_LINK = 4,
+ SMC_LLC_FLOW_RKEY = 6,
+};
+
+struct smc_llc_qentry;
+
+struct smc_llc_flow {
+ enum smc_llc_flowtype type;
+ struct smc_llc_qentry *qentry;
+};
+
struct smc_link_group {
struct list_head list;
struct rb_root conns_all; /* connection tree */
@@ -232,12 +242,24 @@ struct smc_link_group {
DECLARE_BITMAP(rtokens_used_mask, SMC_RMBS_PER_LGR_MAX);
/* used rtoken elements */
u8 next_link_id;
+ enum smc_lgr_type type;
+ /* redundancy state */
struct list_head llc_event_q;
/* queue for llc events */
spinlock_t llc_event_q_lock;
/* protects llc_event_q */
struct work_struct llc_event_work;
/* llc event worker */
+ wait_queue_head_t llc_waiter;
+ /* w4 next llc event */
+ struct smc_llc_flow llc_flow_lcl;
+ /* llc local control field */
+ struct smc_llc_flow llc_flow_rmt;
+ /* llc remote control field */
+ struct smc_llc_qentry *delayed_event;
+ /* arrived when flow active */
+ spinlock_t llc_flow_lock;
+ /* protects llc flow */
int llc_testlink_time;
/* link keep alive time */
};
@@ -329,6 +351,10 @@ int smc_rmb_rtoken_handling(struct smc_connection *conn, struct smc_link *link,
struct smc_clc_msg_accept_confirm *clc);
int smc_rtoken_add(struct smc_link *lnk, __be64 nw_vaddr, __be32 nw_rkey);
int smc_rtoken_delete(struct smc_link *lnk, __be32 nw_rkey);
+void smc_rtoken_set(struct smc_link_group *lgr, int link_idx, int link_idx_new,
+ __be32 nw_rkey_known, __be64 nw_vaddr, __be32 nw_rkey);
+void smc_rtoken_set2(struct smc_link_group *lgr, int rtok_idx, int link_id,
+ __be64 nw_vaddr, __be32 nw_rkey);
void smc_sndbuf_sync_sg_for_cpu(struct smc_connection *conn);
void smc_sndbuf_sync_sg_for_device(struct smc_connection *conn);
void smc_rmb_sync_sg_for_cpu(struct smc_connection *conn);
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index e715dd6..327cf30 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -98,13 +98,8 @@ struct smc_llc_msg_confirm_rkey { /* type 0x06 */
u8 reserved;
};
-struct smc_llc_msg_confirm_rkey_cont { /* type 0x08 */
- struct smc_llc_hdr hd;
- u8 num_rkeys;
- struct smc_rmb_rtoken rtoken[SMC_LLC_RKEYS_PER_MSG];
-};
-
#define SMC_LLC_DEL_RKEY_MAX 8
+#define SMC_LLC_FLAG_RKEY_RETRY 0x10
#define SMC_LLC_FLAG_RKEY_NEG 0x20
struct smc_llc_msg_delete_rkey { /* type 0x09 */
@@ -122,7 +117,6 @@ union smc_llc_msg {
struct smc_llc_msg_del_link delete_link;
struct smc_llc_msg_confirm_rkey confirm_rkey;
- struct smc_llc_msg_confirm_rkey_cont confirm_rkey_cont;
struct smc_llc_msg_delete_rkey delete_rkey;
struct smc_llc_msg_test_link test_link;
@@ -140,6 +134,154 @@ struct smc_llc_qentry {
union smc_llc_msg msg;
};
+struct smc_llc_qentry *smc_llc_flow_qentry_clr(struct smc_llc_flow *flow)
+{
+ struct smc_llc_qentry *qentry = flow->qentry;
+
+ flow->qentry = NULL;
+ return qentry;
+}
+
+void smc_llc_flow_qentry_del(struct smc_llc_flow *flow)
+{
+ struct smc_llc_qentry *qentry;
+
+ if (flow->qentry) {
+ qentry = flow->qentry;
+ flow->qentry = NULL;
+ kfree(qentry);
+ }
+}
+
+static inline void smc_llc_flow_qentry_set(struct smc_llc_flow *flow,
+ struct smc_llc_qentry *qentry)
+{
+ flow->qentry = qentry;
+}
+
+/* try to start a new llc flow, initiated by an incoming llc msg */
+static bool smc_llc_flow_start(struct smc_llc_flow *flow,
+ struct smc_llc_qentry *qentry)
+{
+ struct smc_link_group *lgr = qentry->link->lgr;
+
+ spin_lock_bh(&lgr->llc_flow_lock);
+ if (flow->type) {
+ /* a flow is already active */
+ if ((qentry->msg.raw.hdr.common.type == SMC_LLC_ADD_LINK ||
+ qentry->msg.raw.hdr.common.type == SMC_LLC_DELETE_LINK) &&
+ !lgr->delayed_event) {
+ lgr->delayed_event = qentry;
+ } else {
+ /* forget this llc request */
+ kfree(qentry);
+ }
+ spin_unlock_bh(&lgr->llc_flow_lock);
+ return false;
+ }
+ switch (qentry->msg.raw.hdr.common.type) {
+ case SMC_LLC_ADD_LINK:
+ flow->type = SMC_LLC_FLOW_ADD_LINK;
+ break;
+ case SMC_LLC_DELETE_LINK:
+ flow->type = SMC_LLC_FLOW_DEL_LINK;
+ break;
+ case SMC_LLC_CONFIRM_RKEY:
+ case SMC_LLC_DELETE_RKEY:
+ flow->type = SMC_LLC_FLOW_RKEY;
+ break;
+ default:
+ flow->type = SMC_LLC_FLOW_NONE;
+ }
+ if (qentry == lgr->delayed_event)
+ lgr->delayed_event = NULL;
+ spin_unlock_bh(&lgr->llc_flow_lock);
+ smc_llc_flow_qentry_set(flow, qentry);
+ return true;
+}
+
+/* start a new local llc flow, wait till current flow finished */
+int smc_llc_flow_initiate(struct smc_link_group *lgr,
+ enum smc_llc_flowtype type)
+{
+ enum smc_llc_flowtype allowed_remote = SMC_LLC_FLOW_NONE;
+ int rc;
+
+ /* all flows except confirm_rkey and delete_rkey are exclusive,
+ * confirm/delete rkey flows can run concurrently (local and remote)
+ */
+ if (type == SMC_LLC_FLOW_RKEY)
+ allowed_remote = SMC_LLC_FLOW_RKEY;
+again:
+ if (list_empty(&lgr->list))
+ return -ENODEV;
+ spin_lock_bh(&lgr->llc_flow_lock);
+ if (lgr->llc_flow_lcl.type == SMC_LLC_FLOW_NONE &&
+ (lgr->llc_flow_rmt.type == SMC_LLC_FLOW_NONE ||
+ lgr->llc_flow_rmt.type == allowed_remote)) {
+ lgr->llc_flow_lcl.type = type;
+ spin_unlock_bh(&lgr->llc_flow_lock);
+ return 0;
+ }
+ spin_unlock_bh(&lgr->llc_flow_lock);
+ rc = wait_event_interruptible_timeout(lgr->llc_waiter,
+ (lgr->llc_flow_lcl.type == SMC_LLC_FLOW_NONE &&
+ (lgr->llc_flow_rmt.type == SMC_LLC_FLOW_NONE ||
+ lgr->llc_flow_rmt.type == allowed_remote)),
+ SMC_LLC_WAIT_TIME);
+ if (!rc)
+ return -ETIMEDOUT;
+ goto again;
+}
+
+/* finish the current llc flow */
+void smc_llc_flow_stop(struct smc_link_group *lgr, struct smc_llc_flow *flow)
+{
+ spin_lock_bh(&lgr->llc_flow_lock);
+ memset(flow, 0, sizeof(*flow));
+ flow->type = SMC_LLC_FLOW_NONE;
+ spin_unlock_bh(&lgr->llc_flow_lock);
+ if (!list_empty(&lgr->list) && lgr->delayed_event &&
+ flow == &lgr->llc_flow_lcl)
+ schedule_work(&lgr->llc_event_work);
+ else
+ wake_up_interruptible(&lgr->llc_waiter);
+}
+
+/* lnk is optional and used for early wakeup when link goes down, useful in
+ * cases where we wait for a response on the link after we sent a request
+ */
+struct smc_llc_qentry *smc_llc_wait(struct smc_link_group *lgr,
+ struct smc_link *lnk,
+ int time_out, u8 exp_msg)
+{
+ struct smc_llc_flow *flow = &lgr->llc_flow_lcl;
+
+ wait_event_interruptible_timeout(lgr->llc_waiter,
+ (flow->qentry ||
+ (lnk && !smc_link_usable(lnk)) ||
+ list_empty(&lgr->list)),
+ time_out);
+ if (!flow->qentry ||
+ (lnk && !smc_link_usable(lnk)) || list_empty(&lgr->list)) {
+ smc_llc_flow_qentry_del(flow);
+ goto out;
+ }
+ if (exp_msg && flow->qentry->msg.raw.hdr.common.type != exp_msg) {
+ if (exp_msg == SMC_LLC_ADD_LINK &&
+ flow->qentry->msg.raw.hdr.common.type ==
+ SMC_LLC_DELETE_LINK) {
+ /* flow_start will delay the unexpected msg */
+ smc_llc_flow_start(&lgr->llc_flow_lcl,
+ smc_llc_flow_qentry_clr(flow));
+ return NULL;
+ }
+ smc_llc_flow_qentry_del(flow);
+ }
+out:
+ return flow->qentry;
+}
+
/********************************** send *************************************/
struct smc_llc_tx_pend {
@@ -221,27 +363,44 @@ int smc_llc_send_confirm_link(struct smc_link *link,
}
/* send LLC confirm rkey request */
-static int smc_llc_send_confirm_rkey(struct smc_link *link,
+static int smc_llc_send_confirm_rkey(struct smc_link *send_link,
struct smc_buf_desc *rmb_desc)
{
struct smc_llc_msg_confirm_rkey *rkeyllc;
struct smc_wr_tx_pend_priv *pend;
struct smc_wr_buf *wr_buf;
- int rc;
+ struct smc_link *link;
+ int i, rc, rtok_ix;
- rc = smc_llc_add_pending_send(link, &wr_buf, &pend);
+ rc = smc_llc_add_pending_send(send_link, &wr_buf, &pend);
if (rc)
return rc;
rkeyllc = (struct smc_llc_msg_confirm_rkey *)wr_buf;
memset(rkeyllc, 0, sizeof(*rkeyllc));
rkeyllc->hd.common.type = SMC_LLC_CONFIRM_RKEY;
rkeyllc->hd.length = sizeof(struct smc_llc_msg_confirm_rkey);
+
+ rtok_ix = 1;
+ for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
+ link = &send_link->lgr->lnk[i];
+ if (link->state == SMC_LNK_ACTIVE && link != send_link) {
+ rkeyllc->rtoken[rtok_ix].link_id = link->link_id;
+ rkeyllc->rtoken[rtok_ix].rmb_key =
+ htonl(rmb_desc->mr_rx[link->link_idx]->rkey);
+ rkeyllc->rtoken[rtok_ix].rmb_vaddr = cpu_to_be64(
+ (u64)sg_dma_address(
+ rmb_desc->sgt[link->link_idx].sgl));
+ rtok_ix++;
+ }
+ }
+ /* rkey of send_link is in rtoken[0] */
+ rkeyllc->rtoken[0].num_rkeys = rtok_ix - 1;
rkeyllc->rtoken[0].rmb_key =
- htonl(rmb_desc->mr_rx[link->link_idx]->rkey);
+ htonl(rmb_desc->mr_rx[send_link->link_idx]->rkey);
rkeyllc->rtoken[0].rmb_vaddr = cpu_to_be64(
- (u64)sg_dma_address(rmb_desc->sgt[link->link_idx].sgl));
+ (u64)sg_dma_address(rmb_desc->sgt[send_link->link_idx].sgl));
/* send llc message */
- rc = smc_wr_tx_send(link, pend);
+ rc = smc_wr_tx_send(send_link, pend);
return rc;
}
@@ -380,54 +539,12 @@ static int smc_llc_send_message(struct smc_link *link, void *llcbuf)
/********************************* receive ***********************************/
-static void smc_llc_rx_confirm_link(struct smc_link *link,
- struct smc_llc_msg_confirm_link *llc)
-{
- struct smc_link_group *lgr = smc_get_lgr(link);
- int conf_rc = 0;
-
- /* RMBE eyecatchers are not supported */
- if (!(llc->hd.flags & SMC_LLC_FLAG_NO_RMBE_EYEC))
- conf_rc = ENOTSUPP;
-
- if (lgr->role == SMC_CLNT &&
- link->state == SMC_LNK_ACTIVATING) {
- link->llc_confirm_rc = conf_rc;
- link->link_id = llc->link_num;
- complete(&link->llc_confirm);
- }
-}
-
-static void smc_llc_rx_add_link(struct smc_link *link,
- struct smc_llc_msg_add_link *llc)
-{
- struct smc_link_group *lgr = smc_get_lgr(link);
-
- if (link->state == SMC_LNK_ACTIVATING) {
- complete(&link->llc_add);
- return;
- }
-
- if (lgr->role == SMC_SERV) {
- smc_llc_prep_add_link(llc, link,
- link->smcibdev->mac[link->ibport - 1],
- link->gid, SMC_LLC_REQ);
-
- } else {
- smc_llc_prep_add_link(llc, link,
- link->smcibdev->mac[link->ibport - 1],
- link->gid, SMC_LLC_RESP);
- }
- smc_llc_send_message(link, llc);
-}
-
static void smc_llc_rx_delete_link(struct smc_link *link,
struct smc_llc_msg_del_link *llc)
{
struct smc_link_group *lgr = smc_get_lgr(link);
smc_lgr_forget(lgr);
- smc_llc_link_deleting(link);
if (lgr->role == SMC_SERV) {
/* client asks to delete this link, send request */
smc_llc_prep_delete_link(llc, link, SMC_LLC_REQ, true);
@@ -439,57 +556,68 @@ static void smc_llc_rx_delete_link(struct smc_link *link,
smc_lgr_terminate_sched(lgr);
}
-static void smc_llc_rx_test_link(struct smc_link *link,
- struct smc_llc_msg_test_link *llc)
+/* process a confirm_rkey request from peer, remote flow */
+static void smc_llc_rmt_conf_rkey(struct smc_link_group *lgr)
{
+ struct smc_llc_msg_confirm_rkey *llc;
+ struct smc_llc_qentry *qentry;
+ struct smc_link *link;
+ int num_entries;
+ int rk_idx;
+ int i;
+
+ qentry = lgr->llc_flow_rmt.qentry;
+ llc = &qentry->msg.confirm_rkey;
+ link = qentry->link;
+
+ num_entries = llc->rtoken[0].num_rkeys;
+ /* first rkey entry is for receiving link */
+ rk_idx = smc_rtoken_add(link,
+ llc->rtoken[0].rmb_vaddr,
+ llc->rtoken[0].rmb_key);
+ if (rk_idx < 0)
+ goto out_err;
+
+ for (i = 1; i <= min_t(u8, num_entries, SMC_LLC_RKEYS_PER_MSG - 1); i++)
+ smc_rtoken_set2(lgr, rk_idx, llc->rtoken[i].link_id,
+ llc->rtoken[i].rmb_vaddr,
+ llc->rtoken[i].rmb_key);
+ /* max links is 3 so there is no need to support conf_rkey_cont msgs */
+ goto out;
+out_err:
+ llc->hd.flags |= SMC_LLC_FLAG_RKEY_NEG;
+ llc->hd.flags |= SMC_LLC_FLAG_RKEY_RETRY;
+out:
llc->hd.flags |= SMC_LLC_FLAG_RESP;
- smc_llc_send_message(link, llc);
+ smc_llc_send_message(link, &qentry->msg);
+ smc_llc_flow_qentry_del(&lgr->llc_flow_rmt);
}
-static void smc_llc_rx_confirm_rkey(struct smc_link *link,
- struct smc_llc_msg_confirm_rkey *llc)
+/* process a delete_rkey request from peer, remote flow */
+static void smc_llc_rmt_delete_rkey(struct smc_link_group *lgr)
{
- int rc;
-
- rc = smc_rtoken_add(link,
- llc->rtoken[0].rmb_vaddr,
- llc->rtoken[0].rmb_key);
-
- /* ignore rtokens for other links, we have only one link */
-
- llc->hd.flags |= SMC_LLC_FLAG_RESP;
- if (rc < 0)
- llc->hd.flags |= SMC_LLC_FLAG_RKEY_NEG;
- smc_llc_send_message(link, llc);
-}
-
-static void smc_llc_rx_confirm_rkey_cont(struct smc_link *link,
- struct smc_llc_msg_confirm_rkey_cont *llc)
-{
- /* ignore rtokens for other links, we have only one link */
- llc->hd.flags |= SMC_LLC_FLAG_RESP;
- smc_llc_send_message(link, llc);
-}
-
-static void smc_llc_rx_delete_rkey(struct smc_link *link,
- struct smc_llc_msg_delete_rkey *llc)
-{
+ struct smc_llc_msg_delete_rkey *llc;
+ struct smc_llc_qentry *qentry;
+ struct smc_link *link;
u8 err_mask = 0;
int i, max;
+ qentry = lgr->llc_flow_rmt.qentry;
+ llc = &qentry->msg.delete_rkey;
+ link = qentry->link;
+
max = min_t(u8, llc->num_rkeys, SMC_LLC_DEL_RKEY_MAX);
for (i = 0; i < max; i++) {
if (smc_rtoken_delete(link, llc->rkey[i]))
err_mask |= 1 << (SMC_LLC_DEL_RKEY_MAX - 1 - i);
}
-
if (err_mask) {
llc->hd.flags |= SMC_LLC_FLAG_RKEY_NEG;
llc->err_mask = err_mask;
}
-
llc->hd.flags |= SMC_LLC_FLAG_RESP;
- smc_llc_send_message(link, llc);
+ smc_llc_send_message(link, &qentry->msg);
+ smc_llc_flow_qentry_del(&lgr->llc_flow_rmt);
}
/* flush the llc event queue */
@@ -509,32 +637,66 @@ static void smc_llc_event_handler(struct smc_llc_qentry *qentry)
{
union smc_llc_msg *llc = &qentry->msg;
struct smc_link *link = qentry->link;
+ struct smc_link_group *lgr = link->lgr;
if (!smc_link_usable(link))
goto out;
switch (llc->raw.hdr.common.type) {
case SMC_LLC_TEST_LINK:
- smc_llc_rx_test_link(link, &llc->test_link);
- break;
- case SMC_LLC_CONFIRM_LINK:
- smc_llc_rx_confirm_link(link, &llc->confirm_link);
+ llc->test_link.hd.flags |= SMC_LLC_FLAG_RESP;
+ smc_llc_send_message(link, llc);
break;
case SMC_LLC_ADD_LINK:
- smc_llc_rx_add_link(link, &llc->add_link);
+ if (list_empty(&lgr->list))
+ goto out; /* lgr is terminating */
+ if (lgr->role == SMC_CLNT) {
+ if (lgr->llc_flow_lcl.type == SMC_LLC_FLOW_ADD_LINK) {
+ /* a flow is waiting for this message */
+ smc_llc_flow_qentry_set(&lgr->llc_flow_lcl,
+ qentry);
+ wake_up_interruptible(&lgr->llc_waiter);
+ } else if (smc_llc_flow_start(&lgr->llc_flow_lcl,
+ qentry)) {
+ /* tbd: schedule_work(&lgr->llc_add_link_work); */
+ }
+ } else if (smc_llc_flow_start(&lgr->llc_flow_lcl, qentry)) {
+ /* as smc server, handle client suggestion */
+ /* tbd: schedule_work(&lgr->llc_add_link_work); */
+ }
+ return;
+ case SMC_LLC_CONFIRM_LINK:
+ if (lgr->llc_flow_lcl.type != SMC_LLC_FLOW_NONE) {
+ /* a flow is waiting for this message */
+ smc_llc_flow_qentry_set(&lgr->llc_flow_lcl, qentry);
+ wake_up_interruptible(&lgr->llc_waiter);
+ return;
+ }
break;
case SMC_LLC_DELETE_LINK:
smc_llc_rx_delete_link(link, &llc->delete_link);
break;
case SMC_LLC_CONFIRM_RKEY:
- smc_llc_rx_confirm_rkey(link, &llc->confirm_rkey);
- break;
+ /* new request from remote, assign to remote flow */
+ if (smc_llc_flow_start(&lgr->llc_flow_rmt, qentry)) {
+ /* process here, does not wait for more llc msgs */
+ smc_llc_rmt_conf_rkey(lgr);
+ smc_llc_flow_stop(lgr, &lgr->llc_flow_rmt);
+ }
+ return;
case SMC_LLC_CONFIRM_RKEY_CONT:
- smc_llc_rx_confirm_rkey_cont(link, &llc->confirm_rkey_cont);
+ /* not used because max links is 3, and 3 rkeys fit into
+ * one CONFIRM_RKEY message
+ */
break;
case SMC_LLC_DELETE_RKEY:
- smc_llc_rx_delete_rkey(link, &llc->delete_rkey);
- break;
+ /* new request from remote, assign to remote flow */
+ if (smc_llc_flow_start(&lgr->llc_flow_rmt, qentry)) {
+ /* process here, does not wait for more llc msgs */
+ smc_llc_rmt_delete_rkey(lgr);
+ smc_llc_flow_stop(lgr, &lgr->llc_flow_rmt);
+ }
+ return;
}
out:
kfree(qentry);
@@ -547,6 +709,16 @@ static void smc_llc_event_work(struct work_struct *work)
llc_event_work);
struct smc_llc_qentry *qentry;
+ if (!lgr->llc_flow_lcl.type && lgr->delayed_event) {
+ if (smc_link_usable(lgr->delayed_event->link)) {
+ smc_llc_event_handler(lgr->delayed_event);
+ } else {
+ qentry = lgr->delayed_event;
+ lgr->delayed_event = NULL;
+ kfree(qentry);
+ }
+ }
+
again:
spin_lock_bh(&lgr->llc_event_q_lock);
if (!list_empty(&lgr->llc_event_q)) {
@@ -561,80 +733,75 @@ static void smc_llc_event_work(struct work_struct *work)
}
/* process llc responses in tasklet context */
-static void smc_llc_rx_response(struct smc_link *link, union smc_llc_msg *llc)
+static void smc_llc_rx_response(struct smc_link *link,
+ struct smc_llc_qentry *qentry)
{
- int rc = 0;
+ u8 llc_type = qentry->msg.raw.hdr.common.type;
- switch (llc->raw.hdr.common.type) {
+ switch (llc_type) {
case SMC_LLC_TEST_LINK:
if (link->state == SMC_LNK_ACTIVE)
complete(&link->llc_testlink_resp);
break;
- case SMC_LLC_CONFIRM_LINK:
- if (!(llc->raw.hdr.flags & SMC_LLC_FLAG_NO_RMBE_EYEC))
- rc = ENOTSUPP;
- if (link->lgr->role == SMC_SERV &&
- link->state == SMC_LNK_ACTIVATING) {
- link->llc_confirm_resp_rc = rc;
- complete(&link->llc_confirm_resp);
- }
- break;
case SMC_LLC_ADD_LINK:
- if (link->state == SMC_LNK_ACTIVATING)
- complete(&link->llc_add_resp);
- break;
+ case SMC_LLC_CONFIRM_LINK:
+ case SMC_LLC_CONFIRM_RKEY:
+ case SMC_LLC_DELETE_RKEY:
+ /* assign responses to the local flow, we requested them */
+ smc_llc_flow_qentry_set(&link->lgr->llc_flow_lcl, qentry);
+ wake_up_interruptible(&link->lgr->llc_waiter);
+ return;
case SMC_LLC_DELETE_LINK:
if (link->lgr->role == SMC_SERV)
smc_lgr_schedule_free_work_fast(link->lgr);
break;
- case SMC_LLC_CONFIRM_RKEY:
- link->llc_confirm_rkey_resp_rc = llc->raw.hdr.flags &
- SMC_LLC_FLAG_RKEY_NEG;
- complete(&link->llc_confirm_rkey_resp);
- break;
case SMC_LLC_CONFIRM_RKEY_CONT:
- /* unused as long as we don't send this type of msg */
- break;
- case SMC_LLC_DELETE_RKEY:
- link->llc_delete_rkey_resp_rc = llc->raw.hdr.flags &
- SMC_LLC_FLAG_RKEY_NEG;
- complete(&link->llc_delete_rkey_resp);
+ /* not used because max links is 3 */
break;
}
+ kfree(qentry);
}
-/* copy received msg and add it to the event queue */
-static void smc_llc_rx_handler(struct ib_wc *wc, void *buf)
+static void smc_llc_enqueue(struct smc_link *link, union smc_llc_msg *llc)
{
- struct smc_link *link = (struct smc_link *)wc->qp->qp_context;
struct smc_link_group *lgr = link->lgr;
struct smc_llc_qentry *qentry;
- union smc_llc_msg *llc = buf;
unsigned long flags;
- if (wc->byte_len < sizeof(*llc))
- return; /* short message */
- if (llc->raw.hdr.length != sizeof(*llc))
- return; /* invalid message */
-
- /* process responses immediately */
- if (llc->raw.hdr.flags & SMC_LLC_FLAG_RESP) {
- smc_llc_rx_response(link, llc);
- return;
- }
-
qentry = kmalloc(sizeof(*qentry), GFP_ATOMIC);
if (!qentry)
return;
qentry->link = link;
INIT_LIST_HEAD(&qentry->list);
memcpy(&qentry->msg, llc, sizeof(union smc_llc_msg));
+
+ /* process responses immediately */
+ if (llc->raw.hdr.flags & SMC_LLC_FLAG_RESP) {
+ smc_llc_rx_response(link, qentry);
+ return;
+ }
+
+ /* add requests to event queue */
spin_lock_irqsave(&lgr->llc_event_q_lock, flags);
list_add_tail(&qentry->list, &lgr->llc_event_q);
spin_unlock_irqrestore(&lgr->llc_event_q_lock, flags);
schedule_work(&link->lgr->llc_event_work);
}
+/* copy received msg and add it to the event queue */
+static void smc_llc_rx_handler(struct ib_wc *wc, void *buf)
+{
+ struct smc_link *link = (struct smc_link *)wc->qp->qp_context;
+ union smc_llc_msg *llc = buf;
+
+ if (wc->byte_len < sizeof(*llc))
+ return; /* short message */
+ if (llc->raw.hdr.length != sizeof(*llc))
+ return; /* invalid message */
+
+ smc_llc_enqueue(link, llc);
+}
+
/***************************** worker, utils *********************************/
static void smc_llc_testlink_work(struct work_struct *work)
@@ -676,6 +843,8 @@ void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc)
INIT_WORK(&lgr->llc_event_work, smc_llc_event_work);
INIT_LIST_HEAD(&lgr->llc_event_q);
spin_lock_init(&lgr->llc_event_q_lock);
+ spin_lock_init(&lgr->llc_flow_lock);
+ init_waitqueue_head(&lgr->llc_waiter);
lgr->llc_testlink_time = net->ipv4.sysctl_tcp_keepalive_time;
}
@@ -683,18 +852,16 @@ void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc)
void smc_llc_lgr_clear(struct smc_link_group *lgr)
{
smc_llc_event_flush(lgr);
+ wake_up_interruptible_all(&lgr->llc_waiter);
cancel_work_sync(&lgr->llc_event_work);
+ if (lgr->delayed_event) {
+ kfree(lgr->delayed_event);
+ lgr->delayed_event = NULL;
+ }
}
int smc_llc_link_init(struct smc_link *link)
{
- init_completion(&link->llc_confirm);
- init_completion(&link->llc_confirm_resp);
- init_completion(&link->llc_add);
- init_completion(&link->llc_add_resp);
- init_completion(&link->llc_confirm_rkey_resp);
- init_completion(&link->llc_delete_rkey_resp);
- mutex_init(&link->llc_delete_rkey_mutex);
init_completion(&link->llc_testlink_resp);
INIT_DELAYED_WORK(&link->llc_testlink_wrk, smc_llc_testlink_work);
return 0;
@@ -710,12 +877,6 @@ void smc_llc_link_active(struct smc_link *link)
}
}
-void smc_llc_link_deleting(struct smc_link *link)
-{
- link->state = SMC_LNK_DELETING;
- smc_wr_wakeup_tx_wait(link);
-}
-
/* called in worker context */
void smc_llc_link_clear(struct smc_link *link)
{
@@ -725,50 +886,74 @@ void smc_llc_link_clear(struct smc_link *link)
smc_wr_wakeup_tx_wait(link);
}
-/* register a new rtoken at the remote peer */
-int smc_llc_do_confirm_rkey(struct smc_link *link,
+/* register a new rtoken at the remote peer (for all links) */
+int smc_llc_do_confirm_rkey(struct smc_link *send_link,
struct smc_buf_desc *rmb_desc)
{
- int rc;
+ struct smc_link_group *lgr = send_link->lgr;
+ struct smc_llc_qentry *qentry = NULL;
+ int rc = 0;
- /* protected by mutex smc_create_lgr_pending */
- reinit_completion(&link->llc_confirm_rkey_resp);
- rc = smc_llc_send_confirm_rkey(link, rmb_desc);
+ rc = smc_llc_flow_initiate(lgr, SMC_LLC_FLOW_RKEY);
if (rc)
return rc;
+ rc = smc_llc_send_confirm_rkey(send_link, rmb_desc);
+ if (rc)
+ goto out;
/* receive CONFIRM RKEY response from server over RoCE fabric */
- rc = wait_for_completion_interruptible_timeout(
- &link->llc_confirm_rkey_resp, SMC_LLC_WAIT_TIME);
- if (rc <= 0 || link->llc_confirm_rkey_resp_rc)
- return -EFAULT;
- return 0;
+ qentry = smc_llc_wait(lgr, send_link, SMC_LLC_WAIT_TIME,
+ SMC_LLC_CONFIRM_RKEY);
+ if (!qentry || (qentry->msg.raw.hdr.flags & SMC_LLC_FLAG_RKEY_NEG))
+ rc = -EFAULT;
+out:
+ if (qentry)
+ smc_llc_flow_qentry_del(&lgr->llc_flow_lcl);
+ smc_llc_flow_stop(lgr, &lgr->llc_flow_lcl);
+ return rc;
}
/* unregister an rtoken at the remote peer */
-int smc_llc_do_delete_rkey(struct smc_link *link,
+int smc_llc_do_delete_rkey(struct smc_link_group *lgr,
struct smc_buf_desc *rmb_desc)
{
+ struct smc_llc_qentry *qentry = NULL;
+ struct smc_link *send_link;
int rc = 0;
- mutex_lock(&link->llc_delete_rkey_mutex);
- if (link->state != SMC_LNK_ACTIVE)
- goto out;
- reinit_completion(&link->llc_delete_rkey_resp);
- rc = smc_llc_send_delete_rkey(link, rmb_desc);
+ send_link = smc_llc_usable_link(lgr);
+ if (!send_link)
+ return -ENOLINK;
+
+ rc = smc_llc_flow_initiate(lgr, SMC_LLC_FLOW_RKEY);
+ if (rc)
+ return rc;
+ /* protected by llc_flow control */
+ rc = smc_llc_send_delete_rkey(send_link, rmb_desc);
if (rc)
goto out;
/* receive DELETE RKEY response from server over RoCE fabric */
- rc = wait_for_completion_interruptible_timeout(
- &link->llc_delete_rkey_resp, SMC_LLC_WAIT_TIME);
- if (rc <= 0 || link->llc_delete_rkey_resp_rc)
+ qentry = smc_llc_wait(lgr, send_link, SMC_LLC_WAIT_TIME,
+ SMC_LLC_DELETE_RKEY);
+ if (!qentry || (qentry->msg.raw.hdr.flags & SMC_LLC_FLAG_RKEY_NEG))
rc = -EFAULT;
- else
- rc = 0;
out:
- mutex_unlock(&link->llc_delete_rkey_mutex);
+ if (qentry)
+ smc_llc_flow_qentry_del(&lgr->llc_flow_lcl);
+ smc_llc_flow_stop(lgr, &lgr->llc_flow_lcl);
return rc;
}
+/* evaluate confirm link request or response */
+int smc_llc_eval_conf_link(struct smc_llc_qentry *qentry,
+ enum smc_llc_reqresp type)
+{
+ if (type == SMC_LLC_REQ) /* SMC server assigns link_id */
+ qentry->link->link_id = qentry->msg.confirm_link.link_num;
+ if (!(qentry->msg.raw.hdr.flags & SMC_LLC_FLAG_NO_RMBE_EYEC))
+ return -ENOTSUPP;
+ return 0;
+}
+
/***************************** init, exit, misc ******************************/
static struct smc_wr_rx_handler smc_llc_rx_handlers[] = {
diff --git a/net/smc/smc_llc.h b/net/smc/smc_llc.h
index 66063f2..48029a5 100644
--- a/net/smc/smc_llc.h
+++ b/net/smc/smc_llc.h
@@ -57,12 +57,21 @@ void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc);
void smc_llc_lgr_clear(struct smc_link_group *lgr);
int smc_llc_link_init(struct smc_link *link);
void smc_llc_link_active(struct smc_link *link);
-void smc_llc_link_deleting(struct smc_link *link);
void smc_llc_link_clear(struct smc_link *link);
-int smc_llc_do_confirm_rkey(struct smc_link *link,
+int smc_llc_do_confirm_rkey(struct smc_link *send_link,
struct smc_buf_desc *rmb_desc);
-int smc_llc_do_delete_rkey(struct smc_link *link,
+int smc_llc_do_delete_rkey(struct smc_link_group *lgr,
struct smc_buf_desc *rmb_desc);
+int smc_llc_flow_initiate(struct smc_link_group *lgr,
+ enum smc_llc_flowtype type);
+void smc_llc_flow_stop(struct smc_link_group *lgr, struct smc_llc_flow *flow);
+int smc_llc_eval_conf_link(struct smc_llc_qentry *qentry,
+ enum smc_llc_reqresp type);
+struct smc_llc_qentry *smc_llc_wait(struct smc_link_group *lgr,
+ struct smc_link *lnk,
+ int time_out, u8 exp_msg);
+struct smc_llc_qentry *smc_llc_flow_qentry_clr(struct smc_llc_flow *flow);
+void smc_llc_flow_qentry_del(struct smc_llc_flow *flow);
int smc_llc_init(void) __init;
#endif /* SMC_LLC_H */
diff --git a/net/wireless/radiotap.c b/net/wireless/radiotap.c
index 6582d15..d5e2823 100644
--- a/net/wireless/radiotap.c
+++ b/net/wireless/radiotap.c
@@ -90,7 +90,7 @@ static const struct ieee80211_radiotap_namespace radiotap_ns = {
* iterator.this_arg for type "type" safely on all arches.
*
* Example code:
- * See Documentation/networking/radiotap-headers.txt
+ * See Documentation/networking/radiotap-headers.rst
*/
int ieee80211_radiotap_iterator_init(
diff --git a/samples/pktgen/README.rst b/samples/pktgen/README.rst
index 3f6483e..f9c53ca 100644
--- a/samples/pktgen/README.rst
+++ b/samples/pktgen/README.rst
@@ -3,7 +3,7 @@
This directory contains some pktgen sample and benchmark scripts, that
can easily be copied and adjusted for your own use-case.
-General doc is located in kernel: Documentation/networking/pktgen.txt
+General doc is located in kernel: Documentation/networking/pktgen.rst
Helper include files
====================