.. SPDX-License-Identifier: GPL-2.0
| 2 | |
| 3 | ==================================== |
| 4 | Virtual Routing and Forwarding (VRF) |
| 5 | ==================================== |
| 6 | |
| 7 | The VRF Device |
| 8 | ============== |
| 9 | |
The VRF device combined with ip rules provides the ability to create virtual
routing and forwarding domains (aka VRFs, VRF-lite to be specific) in the
Linux network stack. One use case is the multi-tenancy problem where each
tenant has its own unique routing tables and, at the very least, needs a
different default gateway.
| 15 | |
Processes can be "VRF aware" by binding a socket to the VRF device. Packets
through the socket then use the routing table associated with the VRF
device. An important feature of the VRF device implementation is that it
impacts only Layer 3 and above, so L2 tools (e.g., LLDP) are not affected
(i.e., they do not need to be run in each VRF). The design also allows
the use of higher priority ip rules (Policy Based Routing, PBR) to take
precedence over the VRF device rules, directing specific traffic as desired.
| 23 | |
In addition, VRF devices allow VRFs to be nested within namespaces. For
example, network namespaces provide separation of network interfaces at the
device layer, VLANs on the interfaces within a namespace provide L2 separation,
and VRF devices then provide L3 separation.
| 28 | |
| 29 | Design |
| 30 | ------ |
A VRF device is created with an associated route table. Network interfaces
are then enslaved to a VRF device::

     +-----------------------------+
     |           vrf-blue          |  ===> route table 10
     +-----------------------------+
        |        |            |
     +------+ +------+     +-------------+
     | eth1 | | eth2 | ... |    bond1    |
     +------+ +------+     +-------------+
                              |       |
                          +------+ +------+
                          | eth8 | | eth9 |
                          +------+ +------+
| 45 | |
Packets received on an enslaved device are switched to the VRF device
in the IPv4 and IPv6 processing stacks, giving the impression that packets
flow through the VRF device. Similarly, on egress, routing rules are used to
send packets to the VRF device driver before they are sent out the actual
interface. This allows tcpdump on a VRF device to capture all packets into
and out of the VRF as a whole\ [1]_. Similarly, netfilter\ [2]_ and tc rules
can be applied using the VRF device to specify rules that apply to the VRF
domain as a whole.
| 54 | |
| 55 | .. [1] Packets in the forwarded state do not flow through the device, so those |
| 56 | packets are not seen by tcpdump. Will revisit this limitation in a |
| 57 | future release. |
| 58 | |
| 59 | .. [2] Iptables on ingress supports PREROUTING with skb->dev set to the real |
| 60 | ingress device and both INPUT and PREROUTING rules with skb->dev set to |
| 61 | the VRF device. For egress POSTROUTING and OUTPUT rules can be written |
| 62 | using either the VRF device or real egress device. |
| 63 | |
| 64 | Setup |
| 65 | ----- |
1. VRF device is created with an association to a FIB table.
   e.g.::

     ip link add vrf-blue type vrf table 10
     ip link set dev vrf-blue up
| 71 | |
2. An l3mdev FIB rule directs lookups to the table associated with the device.
   A single l3mdev rule is sufficient for all VRFs. The VRF device adds the
   l3mdev rule for IPv4 and IPv6, with a default preference of 1000, when the
   first device is created. Users may delete the rule if desired and re-add
   it with a different priority or install per-VRF rules.

   Prior to the v4.8 kernel, iif and oif rules are needed for each VRF device::

       ip ru add oif vrf-blue table 10
       ip ru add iif vrf-blue table 10
| 82 | |
| 83 | 3. Set the default route for the table (and hence default route for the VRF):: |
| 84 | |
| 85 | ip route add table 10 unreachable default metric 4278198272 |
| 86 | |
| 87 | This high metric value ensures that the default unreachable route can |
| 88 | be overridden by a routing protocol suite. FRRouting interprets |
| 89 | kernel metrics as a combined admin distance (upper byte) and priority |
| 90 | (lower 3 bytes). Thus the above metric translates to [255/8192]. |
| 91 | |
| 92 | 4. Enslave L3 interfaces to a VRF device:: |
| 93 | |
| 94 | ip link set dev eth1 master vrf-blue |
| 95 | |
| 96 | Local and connected routes for enslaved devices are automatically moved to |
| 97 | the table associated with VRF device. Any additional routes depending on |
| 98 | the enslaved device are dropped and will need to be reinserted to the VRF |
| 99 | FIB table following the enslavement. |
| 100 | |
| 101 | The IPv6 sysctl option keep_addr_on_down can be enabled to keep IPv6 global |
| 102 | addresses as VRF enslavement changes:: |
| 103 | |
| 104 | sysctl -w net.ipv6.conf.all.keep_addr_on_down=1 |
| 105 | |
| 106 | 5. Additional VRF routes are added to associated table:: |
| 107 | |
| 108 | ip route add table 10 ... |
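
The metric arithmetic described in step 3 can be checked with a short
calculation. The following is a minimal sketch, assuming the packing stated
above (admin distance in the upper byte, priority in the lower three bytes);
the function names are hypothetical and are not part of FRRouting or the
kernel.

```python
def encode_metric(distance: int, priority: int) -> int:
    """Pack an admin distance (upper byte) and a priority (lower 3 bytes)."""
    if not 0 <= distance <= 0xFF:
        raise ValueError("distance must fit in one byte")
    if not 0 <= priority <= 0xFFFFFF:
        raise ValueError("priority must fit in three bytes")
    return (distance << 24) | priority


def decode_metric(metric: int) -> tuple[int, int]:
    """Split a 32-bit kernel metric into (distance, priority)."""
    return metric >> 24, metric & 0xFFFFFF


print(decode_metric(4278198272))  # (255, 8192)
print(encode_metric(255, 8192))   # 4278198272
```

In hexadecimal the metric is 0xFF002000, which makes the two fields easy to
read off: 0xFF is the distance and 0x002000 is the priority.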
| 109 | |
| 110 | |
| 111 | Applications |
| 112 | ------------ |
| 113 | Applications that are to work within a VRF need to bind their socket to the |
| 114 | VRF device:: |
| 115 | |
| 116 | setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1); |
| 117 | |
| 118 | or to specify the output device using cmsg and IP_PKTINFO. |
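
As an illustration of the socket binding above, here is a minimal Python
sketch of the same ``SO_BINDTODEVICE`` call. The helper names, the port and
the device name in the usage note are assumptions made for the example; the
call only succeeds on a system where the VRF device exists and the caller
has sufficient privilege.

```python
import socket

IFNAMSIZ = 16  # Linux interface-name buffer size, including the trailing NUL


def bindtodevice_optval(dev: str) -> bytes:
    """Build the NUL-terminated option value, like passing strlen(dev)+1 in C."""
    raw = dev.encode("ascii")
    if not raw or len(raw) >= IFNAMSIZ:
        raise ValueError(f"invalid device name: {dev!r}")
    return raw + b"\0"


def open_vrf_listener(dev: str, port: int) -> socket.socket:
    """Return a TCP listener bound to the VRF device `dev`.

    Raises OSError if the device does not exist or the caller lacks the
    required privilege.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Bind the socket to the VRF device before binding the address, so
        # the port binding is scoped to that VRF.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE,
                        bindtodevice_optval(dev))
        sock.bind(("0.0.0.0", port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock
```

A caller would use it as ``listener = open_vrf_listener("vrf-blue", 8080)``,
where both the device name and the port are placeholders.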
| 119 | |
By default the scope of the port bindings for unbound sockets is
limited to the default VRF. That is, an unbound socket will not be matched by
packets arriving on interfaces enslaved to an l3mdev, and processes may bind to
the same port if they bind to an l3mdev.
| 124 | |
TCP & UDP services running in the default VRF context (i.e., not bound
to any VRF device) can work across all VRF domains by enabling the
tcp_l3mdev_accept and udp_l3mdev_accept sysctl options::
| 128 | |
| 129 | sysctl -w net.ipv4.tcp_l3mdev_accept=1 |
| 130 | sysctl -w net.ipv4.udp_l3mdev_accept=1 |
| 131 | |
These options are disabled by default so that a socket in a VRF is only
selected for packets in that VRF. There is a similar option for RAW
sockets, which is enabled by default for reasons of backwards compatibility.
It allows the output device to be specified with cmsg and IP_PKTINFO while
using a socket not bound to the corresponding VRF. This allows, e.g., older
ping implementations to be run by specifying the device, without executing them
in the VRF. This option can be disabled so that packets received in a VRF
context are only handled by a raw socket bound to the VRF, and packets in the
default VRF are only handled by a socket not bound to any VRF::
| 141 | |
| 142 | sysctl -w net.ipv4.raw_l3mdev_accept=0 |
| 143 | |
| 144 | netfilter rules on the VRF device can be used to limit access to services |
| 145 | running in the default VRF context as well. |
| 146 | |
Using VRF-aware applications (applications which simultaneously create sockets
outside and inside VRFs) in conjunction with ``net.ipv4.tcp_l3mdev_accept=1``
is possible but may lead to problems in some situations. With that sysctl
value, it is unspecified which listening socket will be selected to handle
connections for VRF traffic; i.e., either a socket bound to the VRF or an unbound
socket may be used to accept new connections from a VRF. This somewhat
unexpected behavior can lead to problems if sockets are configured with extra
options (e.g., TCP MD5 keys) with the expectation that VRF traffic will
exclusively be handled by sockets bound to VRFs, as would be the case with
``net.ipv4.tcp_l3mdev_accept=0``. Finally and as a reminder, regardless of
which listening socket is selected, established sockets will be created in the
VRF based on the ingress interface, as documented earlier.
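
Before deploying a VRF-aware application it can help to confirm how the three
``l3mdev_accept`` sysctls described above are currently set. The following is
a minimal sketch that reads them from procfs; the function name and the
``proc_sys`` parameter are assumptions made for the example, and a value of
``None`` simply means the running kernel does not expose that option.

```python
from pathlib import Path

L3MDEV_SYSCTLS = ("tcp_l3mdev_accept", "udp_l3mdev_accept", "raw_l3mdev_accept")


def read_l3mdev_accept(proc_sys="/proc/sys"):
    """Return {sysctl name: int value, or None if the file is missing}."""
    values = {}
    for name in L3MDEV_SYSCTLS:
        path = Path(proc_sys) / "net" / "ipv4" / name
        try:
            values[name] = int(path.read_text().strip())
        except FileNotFoundError:
            # Option not present on this kernel (or procfs not mounted here).
            values[name] = None
    return values
```

Calling ``read_l3mdev_accept()`` returns a dictionary keyed by option name. A
``tcp_l3mdev_accept`` value of 1 indicates the configuration in which the
listening-socket selection caveat above applies.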
| 159 | |
--------------------------------------------------------------------------------
| 161 | |
| 162 | Using iproute2 for VRFs |
| 163 | ======================= |
| 164 | iproute2 supports the vrf keyword as of v4.7. For backwards compatibility this |
| 165 | section lists both commands where appropriate -- with the vrf keyword and the |
| 166 | older form without it. |
| 167 | |
| 168 | 1. Create a VRF |
| 169 | |
| 170 | To instantiate a VRF device and associate it with a table:: |
| 171 | |
| 172 | $ ip link add dev NAME type vrf table ID |
| 173 | |
| 174 | As of v4.8 the kernel supports the l3mdev FIB rule where a single rule |
| 175 | covers all VRFs. The l3mdev rule is created for IPv4 and IPv6 on first |
| 176 | device create. |
| 177 | |
| 178 | 2. List VRFs |
| 179 | |
| 180 | To list VRFs that have been created:: |
| 181 | |
| 182 | $ ip [-d] link show type vrf |
| 183 | NOTE: The -d option is needed to show the table id |
| 184 | |
| 185 | For example:: |
| 186 | |
| 187 | $ ip -d link show type vrf |
| 188 | 11: mgmt: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 |
| 189 | link/ether 72:b3:ba:91:e2:24 brd ff:ff:ff:ff:ff:ff promiscuity 0 |
| 190 | vrf table 1 addrgenmode eui64 |
| 191 | 12: red: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 |
| 192 | link/ether b6:6f:6e:f6:da:73 brd ff:ff:ff:ff:ff:ff promiscuity 0 |
| 193 | vrf table 10 addrgenmode eui64 |
| 194 | 13: blue: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 |
| 195 | link/ether 36:62:e8:7d:bb:8c brd ff:ff:ff:ff:ff:ff promiscuity 0 |
| 196 | vrf table 66 addrgenmode eui64 |
| 197 | 14: green: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 |
| 198 | link/ether e6:28:b8:63:70:bb brd ff:ff:ff:ff:ff:ff promiscuity 0 |
| 199 | vrf table 81 addrgenmode eui64 |
| 200 | |
| 201 | |
| 202 | Or in brief output:: |
| 203 | |
| 204 | $ ip -br link show type vrf |
| 205 | mgmt UP 72:b3:ba:91:e2:24 <NOARP,MASTER,UP,LOWER_UP> |
| 206 | red UP b6:6f:6e:f6:da:73 <NOARP,MASTER,UP,LOWER_UP> |
| 207 | blue UP 36:62:e8:7d:bb:8c <NOARP,MASTER,UP,LOWER_UP> |
| 208 | green UP e6:28:b8:63:70:bb <NOARP,MASTER,UP,LOWER_UP> |
| 209 | |
| 210 | |
| 211 | 3. Assign a Network Interface to a VRF |
| 212 | |
| 213 | Network interfaces are assigned to a VRF by enslaving the netdevice to a |
| 214 | VRF device:: |
| 215 | |
| 216 | $ ip link set dev NAME master NAME |
| 217 | |
| 218 | On enslavement connected and local routes are automatically moved to the |
| 219 | table associated with the VRF device. |
| 220 | |
| 221 | For example:: |
| 222 | |
| 223 | $ ip link set dev eth0 master mgmt |
| 224 | |
| 225 | |
| 226 | 4. Show Devices Assigned to a VRF |
| 227 | |
   To show devices that have been assigned to a specific VRF, use the vrf
   keyword or the older master option of the ip command::
| 230 | |
| 231 | $ ip link show vrf NAME |
| 232 | $ ip link show master NAME |
| 233 | |
| 234 | For example:: |
| 235 | |
| 236 | $ ip link show vrf red |
| 237 | 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000 |
| 238 | link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff |
| 239 | 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000 |
| 240 | link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff |
| 241 | 7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master red state DOWN mode DEFAULT group default qlen 1000 |
| 242 | link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff |
| 243 | |
| 244 | |
| 245 | Or using the brief output:: |
| 246 | |
| 247 | $ ip -br link show vrf red |
| 248 | eth1 UP 02:00:00:00:02:02 <BROADCAST,MULTICAST,UP,LOWER_UP> |
| 249 | eth2 UP 02:00:00:00:02:03 <BROADCAST,MULTICAST,UP,LOWER_UP> |
| 250 | eth5 DOWN 02:00:00:00:02:06 <BROADCAST,MULTICAST> |
| 251 | |
| 252 | |
| 253 | 5. Show Neighbor Entries for a VRF |
| 254 | |
   To list neighbor entries associated with devices enslaved to a VRF device,
   use the vrf keyword or the older master option of the ip command::
| 257 | |
| 258 | $ ip [-6] neigh show vrf NAME |
| 259 | $ ip [-6] neigh show master NAME |
| 260 | |
| 261 | For example:: |
| 262 | |
| 263 | $ ip neigh show vrf red |
| 264 | 10.2.1.254 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE |
| 265 | 10.2.2.254 dev eth2 lladdr 5e:54:01:6a:ee:80 REACHABLE |
| 266 | |
| 267 | $ ip -6 neigh show vrf red |
| 268 | 2002:1::64 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE |
| 269 | |
| 270 | |
| 271 | 6. Show Addresses for a VRF |
| 272 | |
   To show addresses for interfaces associated with a VRF, use the vrf
   keyword or the older master option of the ip command::
| 275 | |
| 276 | $ ip addr show vrf NAME |
| 277 | $ ip addr show master NAME |
| 278 | |
| 279 | For example:: |
| 280 | |
| 281 | $ ip addr show vrf red |
| 282 | 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 |
| 283 | link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff |
| 284 | inet 10.2.1.2/24 brd 10.2.1.255 scope global eth1 |
| 285 | valid_lft forever preferred_lft forever |
| 286 | inet6 2002:1::2/120 scope global |
| 287 | valid_lft forever preferred_lft forever |
| 288 | inet6 fe80::ff:fe00:202/64 scope link |
| 289 | valid_lft forever preferred_lft forever |
| 290 | 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 |
| 291 | link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff |
| 292 | inet 10.2.2.2/24 brd 10.2.2.255 scope global eth2 |
| 293 | valid_lft forever preferred_lft forever |
| 294 | inet6 2002:2::2/120 scope global |
| 295 | valid_lft forever preferred_lft forever |
| 296 | inet6 fe80::ff:fe00:203/64 scope link |
| 297 | valid_lft forever preferred_lft forever |
| 298 | 7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master red state DOWN group default qlen 1000 |
| 299 | link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff |
| 300 | |
| 301 | Or in brief format:: |
| 302 | |
| 303 | $ ip -br addr show vrf red |
| 304 | eth1 UP 10.2.1.2/24 2002:1::2/120 fe80::ff:fe00:202/64 |
| 305 | eth2 UP 10.2.2.2/24 2002:2::2/120 fe80::ff:fe00:203/64 |
| 306 | eth5 DOWN |
| 307 | |
| 308 | |
| 309 | 7. Show Routes for a VRF |
| 310 | |
| 311 | To show routes for a VRF use the ip command to display the table associated |
| 312 | with the VRF device:: |
| 313 | |
| 314 | $ ip [-6] route show vrf NAME |
| 315 | $ ip [-6] route show table ID |
| 316 | |
| 317 | For example:: |
| 318 | |
| 319 | $ ip route show vrf red |
| 320 | unreachable default metric 4278198272 |
| 321 | broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.2 |
| 322 | 10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.2 |
| 323 | local 10.2.1.2 dev eth1 proto kernel scope host src 10.2.1.2 |
| 324 | broadcast 10.2.1.255 dev eth1 proto kernel scope link src 10.2.1.2 |
| 325 | broadcast 10.2.2.0 dev eth2 proto kernel scope link src 10.2.2.2 |
| 326 | 10.2.2.0/24 dev eth2 proto kernel scope link src 10.2.2.2 |
| 327 | local 10.2.2.2 dev eth2 proto kernel scope host src 10.2.2.2 |
| 328 | broadcast 10.2.2.255 dev eth2 proto kernel scope link src 10.2.2.2 |
| 329 | |
| 330 | $ ip -6 route show vrf red |
| 331 | local 2002:1:: dev lo proto none metric 0 pref medium |
| 332 | local 2002:1::2 dev lo proto none metric 0 pref medium |
| 333 | 2002:1::/120 dev eth1 proto kernel metric 256 pref medium |
| 334 | local 2002:2:: dev lo proto none metric 0 pref medium |
| 335 | local 2002:2::2 dev lo proto none metric 0 pref medium |
| 336 | 2002:2::/120 dev eth2 proto kernel metric 256 pref medium |
| 337 | local fe80:: dev lo proto none metric 0 pref medium |
| 338 | local fe80:: dev lo proto none metric 0 pref medium |
| 339 | local fe80::ff:fe00:202 dev lo proto none metric 0 pref medium |
| 340 | local fe80::ff:fe00:203 dev lo proto none metric 0 pref medium |
| 341 | fe80::/64 dev eth1 proto kernel metric 256 pref medium |
| 342 | fe80::/64 dev eth2 proto kernel metric 256 pref medium |
| 343 | ff00::/8 dev red metric 256 pref medium |
| 344 | ff00::/8 dev eth1 metric 256 pref medium |
| 345 | ff00::/8 dev eth2 metric 256 pref medium |
| 346 | unreachable default dev lo metric 4278198272 error -101 pref medium |
| 347 | |
| 348 | 8. Route Lookup for a VRF |
| 349 | |
| 350 | A test route lookup can be done for a VRF:: |
| 351 | |
| 352 | $ ip [-6] route get vrf NAME ADDRESS |
| 353 | $ ip [-6] route get oif NAME ADDRESS |
| 354 | |
| 355 | For example:: |
| 356 | |
| 357 | $ ip route get 10.2.1.40 vrf red |
| 358 | 10.2.1.40 dev eth1 table red src 10.2.1.2 |
| 359 | cache |
| 360 | |
| 361 | $ ip -6 route get 2002:1::32 vrf red |
| 362 | 2002:1::32 from :: dev eth1 table red proto kernel src 2002:1::2 metric 256 pref medium |
| 363 | |
| 364 | |
| 365 | 9. Removing Network Interface from a VRF |
| 366 | |
| 367 | Network interfaces are removed from a VRF by breaking the enslavement to |
| 368 | the VRF device:: |
| 369 | |
| 370 | $ ip link set dev NAME nomaster |
| 371 | |
| 372 | Connected routes are moved back to the default table and local entries are |
| 373 | moved to the local table. |
| 374 | |
| 375 | For example:: |
| 376 | |
| 377 | $ ip link set dev eth0 nomaster |
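
Several of the commands above take a table ID, which is only visible in the
``ip -d link show type vrf`` output from step 2. The following is a minimal
sketch that extracts a name-to-table mapping from that text output; the
function name is hypothetical, and the parser assumes the layout shown in the
step 2 example. Text output is not a stable interface, so the JSON output of
recent iproute2 versions is likely a more robust source where available.

```python
import re

# Header line of a link entry, e.g. "12: red: <NOARP,...> mtu 1500 ..."
_LINK_RE = re.compile(r"^\d+:\s+([^:@\s]+)[@:]")
# Detail line carrying the table id, e.g. "vrf table 10 addrgenmode eui64"
_TABLE_RE = re.compile(r"\bvrf table (\d+)\b")


def vrf_tables(output: str) -> dict:
    """Map VRF device name -> table id from `ip -d link show type vrf` text."""
    tables = {}
    current = None
    for line in output.splitlines():
        line = line.strip()
        header = _LINK_RE.match(line)
        if header:
            current = header.group(1)
            continue
        table = _TABLE_RE.search(line)
        if table and current is not None:
            tables[current] = int(table.group(1))
    return tables


SAMPLE = """\
11: mgmt: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 72:b3:ba:91:e2:24 brd ff:ff:ff:ff:ff:ff promiscuity 0
    vrf table 1 addrgenmode eui64
12: red: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether b6:6f:6e:f6:da:73 brd ff:ff:ff:ff:ff:ff promiscuity 0
    vrf table 10 addrgenmode eui64
"""

print(vrf_tables(SAMPLE))  # {'mgmt': 1, 'red': 10}
```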
| 378 | |
| 379 | -------------------------------------------------------------------------------- |
| 380 | |
| 381 | Commands used in this example:: |
| 382 | |
| 383 | cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF |
| 384 | 1 mgmt |
| 385 | 10 red |
| 386 | 66 blue |
| 387 | 81 green |
| 388 | EOF |
| 389 | |
| 390 | function vrf_create |
| 391 | { |
| 392 | VRF=$1 |
| 393 | TBID=$2 |
| 394 | |
| 395 | # create VRF device |
| 396 | ip link add ${VRF} type vrf table ${TBID} |
| 397 | |
| 398 | if [ "${VRF}" != "mgmt" ]; then |
| 399 | ip route add table ${TBID} unreachable default metric 4278198272 |
| 400 | fi |
| 401 | ip link set dev ${VRF} up |
| 402 | } |
| 403 | |
| 404 | vrf_create mgmt 1 |
| 405 | ip link set dev eth0 master mgmt |
| 406 | |
| 407 | vrf_create red 10 |
| 408 | ip link set dev eth1 master red |
| 409 | ip link set dev eth2 master red |
| 410 | ip link set dev eth5 master red |
| 411 | |
| 412 | vrf_create blue 66 |
| 413 | ip link set dev eth3 master blue |
| 414 | |
| 415 | vrf_create green 81 |
| 416 | ip link set dev eth4 master green |
| 417 | |
| 418 | |
| 419 | Interface addresses from /etc/network/interfaces: |
| 420 | auto eth0 |
| 421 | iface eth0 inet static |
| 422 | address 10.0.0.2 |
| 423 | netmask 255.255.255.0 |
| 424 | gateway 10.0.0.254 |
| 425 | |
| 426 | iface eth0 inet6 static |
| 427 | address 2000:1::2 |
| 428 | netmask 120 |
| 429 | |
| 430 | auto eth1 |
| 431 | iface eth1 inet static |
| 432 | address 10.2.1.2 |
| 433 | netmask 255.255.255.0 |
| 434 | |
| 435 | iface eth1 inet6 static |
| 436 | address 2002:1::2 |
| 437 | netmask 120 |
| 438 | |
| 439 | auto eth2 |
| 440 | iface eth2 inet static |
| 441 | address 10.2.2.2 |
| 442 | netmask 255.255.255.0 |
| 443 | |
| 444 | iface eth2 inet6 static |
| 445 | address 2002:2::2 |
| 446 | netmask 120 |
| 447 | |
| 448 | auto eth3 |
| 449 | iface eth3 inet static |
| 450 | address 10.2.3.2 |
| 451 | netmask 255.255.255.0 |
| 452 | |
| 453 | iface eth3 inet6 static |
| 454 | address 2002:3::2 |
| 455 | netmask 120 |
| 456 | |
| 457 | auto eth4 |
| 458 | iface eth4 inet static |
| 459 | address 10.2.4.2 |
| 460 | netmask 255.255.255.0 |
| 461 | |
| 462 | iface eth4 inet6 static |
| 463 | address 2002:4::2 |
| 464 | netmask 120 |
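
The ordering used by the example script (create the device, add the
unreachable default, bring the device up, then enslave interfaces) can also be
written down as data and reviewed before anything is run. The following is a
minimal dry-run sketch, not a replacement for the script above: it only prints
the commands, and the function name and its parameters are assumptions made
for the example.

```python
def vrf_commands(name, table, members, default_unreachable=True):
    """Return the ip commands, in order, that create one VRF and enslave members."""
    cmds = [f"ip link add {name} type vrf table {table}"]
    if default_unreachable:
        cmds.append(
            f"ip route add table {table} unreachable default metric 4278198272"
        )
    cmds.append(f"ip link set dev {name} up")
    cmds.extend(f"ip link set dev {dev} master {name}" for dev in members)
    return cmds


# Topology from the example: (VRF name, table id, enslaved interfaces)
TOPOLOGY = [
    ("mgmt", 1, ["eth0"]),
    ("red", 10, ["eth1", "eth2", "eth5"]),
    ("blue", 66, ["eth3"]),
    ("green", 81, ["eth4"]),
]

for name, table, members in TOPOLOGY:
    # The example script skips the unreachable default route for mgmt.
    skip_default = name == "mgmt"
    for cmd in vrf_commands(name, table, members,
                            default_unreachable=not skip_default):
        print(cmd)
```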