VRRP for router redundancy with keepalived

Saturday, 6 April 2024

This is a three-part article about a project to set up a home network with redundant network connections. Here are the parts:

  1. PPP over Ethernet: MTU, MSS and Packet Loss
  2. IPv6 setup for a router on Linux
  3. VRRP for router redundancy with keepalived

Part 3

The final step in my router configuration was to set up VRRP (the Virtual Router Redundancy Protocol), for which I used keepalived. keepalived has many other functions, such as ensuring that online services stay available, so many of the examples in /usr/share/doc/keepalived/samples are not relevant to VRRP, and the setup needed some trial and error.

VRRP is great. Before I read about it, I'd expected I would need to manage redundancy at the routing level, for example by having a separate LAN (or VLAN) just for routers, and a script running to replace the default route in the event of an outage. Even with that in place, there would still be a single point of failure. VRRP is a better solution because it can respond to power failures and disconnections of the routers themselves as well as failures at the Internet connection level.

In VRRP, the active router periodically multicasts advertisement packets to confirm that it is alive and working. If the other routers stop receiving these packets, one of them is chosen to take over the active role. VRRP scales to support large numbers of physical routers which act together as a single virtual router. The virtual router has a single static IP address. With the keepalived configuration shown here, the MAC address behind that IP address changes depending on which physical router is active, and the newly active router sends gratuitous ARP packets so that other hosts on the LAN learn the new MAC address.
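To make the advertisement mechanism more concrete, here is a minimal Python sketch that decodes a VRRP version 2 advertisement using the packet layout from RFC 3768. It is not taken from keepalived: the function name parse_vrrp_v2_advert and the hand-built sample packet are invented for illustration, and the sketch assumes version 2, ignores authentication data and does not verify the checksum.

```python
import socket
import struct

VRRP_MULTICAST_GROUP = "224.0.0.18"  # IPv4 multicast group used by VRRP
VRRP_IP_PROTOCOL = 112               # IP protocol number assigned to VRRP


def parse_vrrp_v2_advert(payload: bytes) -> dict:
    """Decode the fixed header and address list of a VRRPv2 advertisement."""
    if len(payload) < 8:
        raise ValueError("payload too short for a VRRP header")
    ver_type, vrid, priority, count, auth_type, advert_int, checksum = struct.unpack(
        "!BBBBBBH", payload[:8]
    )
    if len(payload) < 8 + 4 * count:
        raise ValueError("payload too short for the advertised address count")
    # Each virtual IPv4 address occupies 4 bytes after the 8-byte header.
    addresses = [
        socket.inet_ntoa(payload[8 + 4 * i : 12 + 4 * i]) for i in range(count)
    ]
    return {
        "version": ver_type >> 4,    # high 4 bits
        "type": ver_type & 0x0F,     # low 4 bits; 1 means ADVERTISEMENT
        "vrid": vrid,                # matches virtual_router_id
        "priority": priority,
        "auth_type": auth_type,
        "advert_int": advert_int,    # seconds, matches advert_int
        "checksum": checksum,        # not verified in this sketch
        "addresses": addresses,
    }


# Hand-built sample using the main router's settings (checksum left as 0).
example = struct.pack("!BBBBBBH", 0x21, 51, 100, 1, 0, 1, 0)
example += socket.inet_aton("192.168.1.1") + bytes(8)  # 8 bytes of auth data
print(parse_vrrp_v2_advert(example))
```

The fields map directly onto the keepalived settings used below: virtual_router_id, priority, advert_int and the virtual_ipaddress list.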

To use VRRP on Linux, keepalived should be installed on every router. It needs only one configuration file, /etc/keepalived/keepalived.conf. I also added the line "net.ipv4.ip_nonlocal_bind=1" to sysctl.conf, which allows processes to bind to the virtual IP address even when it is not currently assigned to this machine.

A backup router can have a very simple configuration. Here's the one used by the "mobile" router at my house, which is a Raspberry Pi 2 with an old Android phone tethered by USB (an incredibly cost-effective way to get a reasonably fast connection).

    global_defs {
        enable_script_security
        nftables
    }

    vrrp_instance VI_1 {
        interface eth0
        state BACKUP
        virtual_router_id 51
        priority 50

        garp_master_delay 2
        garp_master_repeat 1
        advert_int 1
        virtual_ipaddress {
            192.168.1.1
        }
    }

The actual IP address of this router is not 192.168.1.1. No physical machine has that address permanently: it is the virtual IP address shared by all routers and held by whichever one is currently active.

The priority (50) is fixed. This router will become active if it has the highest priority of all routers that are online. That happens if the main router loses power, or if the main router's priority falls below 50 because of a connection problem.
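For the power failure case, the time the backup waits before taking over can be estimated from the VRRP timers. The sketch below uses the Master_Down_Interval formula from RFC 3768 (VRRP version 2), on the assumption that keepalived is running version 2 and follows the RFC timers; the function name is invented for illustration, and the real takeover time may differ slightly.

```python
def master_down_interval(priority: int, advert_int: float = 1.0) -> float:
    """Seconds of silence from the active router before a backup takes over.

    RFC 3768: Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time,
    where Skew_Time = (256 - Priority) / 256 seconds.
    """
    skew_time = (256 - priority) / 256
    return 3 * advert_int + skew_time


# Backup router from the configuration above: priority 50, advert_int 1.
print(f"{master_down_interval(50, 1):.2f} seconds")
```

With priority 50 and advert_int 1, this works out at a little under four seconds. The skew term means that, among several backups, the one with the highest priority times out first.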

The configuration for the main router is similar, but more complex because it includes connection tests:

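    # Connection test: send one small ping and report success or failure.
    # ping options: -s 8 (8-byte payload), -w 1 (give up after 1 second),
    # -c 1 (one packet), -n (no reverse DNS lookups).
    vrrp_script chk_ip_1 {
        script "/usr/bin/ping -s 8 -w 1 -c 1 -n 10.11.12.13"
        interval 2      # run the test every 2 seconds
        timeout 2       # treat the test as failed if it runs longer than 2 seconds
        weight -20      # subtract 20 from the priority while the test is failing
        fall 2          # 2 consecutive failures mark the test as failed
        rise 20         # 20 consecutive successes mark it as passing again
    }
    # There are also chk_ip_2, chk_ip_3, chk_ip_4...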

    vrrp_instance VI_1 {
        interface eth0
        state MASTER
        virtual_router_id 51
        priority 100

        garp_master_delay 2
        garp_master_repeat 1
        advert_int 1
        virtual_ipaddress {
            192.168.1.1
        }
        track_script {
            chk_ip_1
            chk_ip_2
            chk_ip_3
            chk_ip_4
        }
    }

Note that the same virtual_ipaddress and virtual_router_id are configured here as on the backup router. These settings must match for the routers to act as one virtual router.

The priority of the main router is normally 100, but this is conditional on the connection tests. Each failing test lowers the priority by 20. If three or more of the four tests fail, the priority drops to 40 or lower, which is below the backup router's priority of 50, and the backup router automatically takes over.
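The arithmetic can be checked with a few lines of Python. This is only a model of the configuration above, assuming all four tracked scripts use the same weight of -20; the names are invented for illustration.

```python
BASE_PRIORITY = 100    # "priority" of the main router
BACKUP_PRIORITY = 50   # "priority" of the backup router
WEIGHT = -20           # "weight" of each vrrp_script (assumed identical)
NUM_CHECKS = 4         # chk_ip_1 .. chk_ip_4


def effective_priority(failed_checks: int) -> int:
    """Priority of the main router while `failed_checks` tests are failing."""
    return BASE_PRIORITY + WEIGHT * failed_checks


for failed in range(NUM_CHECKS + 1):
    prio = effective_priority(failed)
    role = "main router active" if prio > BACKUP_PRIORITY else "backup takes over"
    print(f"{failed} failed: priority {prio} -> {role}")
```

Using several tests with a modest weight means that one unreachable ping target is not enough to trigger a failover.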

My mistake with VRRP was assuming that the connection test was automatic, perhaps based on whether the default route was working. In fact it must be configured explicitly, so I chose a small set of IP addresses that can be pinged to check that everything is working.

The configuration does have a slightly awkward property in that connections drop twice: once when the main router goes offline, and again when it comes back. This is not noticeable with applications that automatically reconnect (web browsers work ok, video streaming is fine), but it is a hassle with online games, and video calls drop. To mitigate rapid switching, I set "rise 20", requiring at least 20 successful pings in a row (40 seconds at a 2-second interval) before a test is considered to be passing again.
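The effect of "fall 2" and "rise 20" can be sketched as a small state machine. This is a simplified model rather than keepalived's actual implementation, assuming that both counters count consecutive results; the function name and the sample sequence of ping results are invented for illustration.

```python
def track(results, fall=2, rise=20, healthy=True):
    """Yield the test state after each result, using fall/rise counters."""
    streak = 0  # consecutive results that disagree with the current state
    for ok in results:
        if ok == healthy:
            streak = 0
        else:
            streak += 1
            if streak >= (fall if healthy else rise):
                healthy = not healthy
                streak = 0
        yield healthy


INTERVAL = 2  # seconds between tests, as in the vrrp_script block

# Two failed pings, a 10-ping recovery, one more failure, then a stable link.
results = [False] * 2 + [True] * 10 + [False] + [True] * 25
states = list(track(results))

went_down = states.index(False)
came_back = states.index(True, went_down)
print(f"marked failed after test {went_down + 1}")
print(f"marked passing after test {came_back + 1}, "
      f"{(came_back - went_down) * INTERVAL} seconds later")
```

In this model the brief 10-ping recovery never flips the state back, because the single failure that follows resets the success count.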

I cannot think of any way to completely avoid TCP/IP connection loss when the fibre connection drops, aside from routing all traffic through a VPN connection, as I know that OpenVPN (for example) will hide changes in the route between the client and the VPN service. But I do not particularly want the additional cost and latency of a VPN, if I can avoid it, and in any case this would simply make the VPN service a single point of failure.

As far as I know, mobile connections in the UK never use IPv6. Both Android and iOS have IPv6 support and it works via WiFi, but the mobile networks themselves are stuck with IPv4 and carrier-grade NAT. Perhaps one day this will change, but for now, I have an excuse to avoid the possible complexity of setting up IPv6 for VRRP.