I read an article some time ago discussing the possibility of using BGP to implement a distributed load balancer. I had always wanted to explore this idea and finally got the chance to do so. Typically, when you use HAProxy for load balancing, you deploy two HAProxy instances with different IPs that share a VIP, which is held by only one instance at a time. Keepalived or Pacemaker is the usual tool for implementing the VRRP/VIP handover. This is an active-passive HA solution, because only one HAProxy can hold the VIP at any given moment.
HAProxy #1 IP  = 10.1.1.101
HAProxy #2 IP  = 10.1.1.102
HAProxy VIP    = 10.1.1.100
With this configuration, when you hit 10.1.1.100, the request is actually served only by the active instance, which is elected via VRRP by Keepalived/Pacemaker. When you need to scale, you are limited to a single active load balancer, or you have to add extra virtual IP addresses, which becomes a pain to configure for access/DNS/failover/etc.
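For reference, the classic active-passive setup would use a Keepalived configuration along these lines (the interface name and priority values here are assumptions; the backup node would use state BACKUP and a lower priority):

# /etc/keepalived/keepalived.conf on HAProxy #1 (the initial master)
vrrp_instance VI_1 {
    state MASTER
    interface eth0          # interface carrying the 10.1.1.0/24 network (assumption)
    virtual_router_id 51
    priority 150            # backup node would use e.g. 100
    advert_int 1
    virtual_ipaddress {
        10.1.1.100/24       # the shared VIP
    }
}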
With ECMP (equal-cost multi-path) and BGP, it is possible to create an active-active load balancer with a single distributed virtual IP address. In this example I'll be using a pair of Cisco Nexus 9K switches working in ACI mode together with 3 physical servers running Quagga BGP. The physical network diagram is as follows:
Every server will have the same IP address (the VIP) and will be BGP peering with the switches. The switches will load-balance connections across the servers because the advertised routes have equal cost. Modern switches are smart enough to hash per flow, so all packets of a given TCP connection are forwarded via the same path and packet ordering is preserved; that should not be a problem.
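As an aside, if you were to build the routing side on a Linux box instead of dedicated switches, the same flow-stickiness is controlled by the kernel's multipath hash policy (available since kernel 4.12):

# 0 = hash on L3 only, 1 = hash on the L4 5-tuple, so every packet of a
# given TCP connection keeps taking the same next hop
sysctl -w net.ipv4.fib_multipath_hash_policy=1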
On every server, install nginx as a test workload and Quagga for the BGP functions. We'll also need to enable IP forwarding.
yum -y install nginx quagga
touch /etc/quagga/bgpd.conf
chown quagga:quagga /etc/quagga/bgpd.conf
systemctl enable zebra
systemctl enable bgpd
systemctl start zebra
systemctl start bgpd
sysctl -w net.ipv4.conf.all.forwarding=1
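Note that sysctl -w does not survive a reboot; to make forwarding permanent, drop it into a sysctl.d file (the file name here is an assumption, any sysctl.d file will do):

echo 'net.ipv4.conf.all.forwarding = 1' > /etc/sysctl.d/90-forwarding.conf
sysctl -p /etc/sysctl.d/90-forwarding.conf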
Quagga is a software routing suite for Linux (and other UNIX-like systems). It consists of several daemons, each responsible for a specific function: the zebra daemon handles interface configuration, kernel route updates, and other general functions, while the bgpd daemon implements the BGP routing functionality.
Edit /etc/nginx/nginx.conf and change

worker_processes auto;

to

worker_processes 1;

so that we can see the difference with 1, 2, or 3 peers active when we run the tests. Start nginx after that:
systemctl enable nginx
systemctl start nginx
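To make it easy to see which backend answered a given request during testing, you can also give each server a distinct index page (a hypothetical convenience, not required for the setup):

# on server 1; repeat with server2/server3 on the others
echo "server1" > /usr/share/nginx/html/index.html
curl http://localhost/   # should print "server1"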
On every server, let’s configure the VIP:
ip tuntap add dev tap-lb1 mode tap
ip a a dev tap-lb1 10.1.198.4/29
ip l s dev tap-lb1 mtu 9000
ip l s dev tap-lb1 up
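A tap device is not the only option for holding the VIP; a dummy interface should work just as well, for example:

# alternative to the tap device above
modprobe dummy
ip link add dev dummy-lb1 type dummy
ip a a dev dummy-lb1 10.1.198.4/29
ip l s dev dummy-lb1 up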
And also configure the interfaces for BGP peering:
# Server 1
ip link add link bond0 name bond0.999 type vlan id 999
ip link set bond0.999 up
ip a a dev bond0.999 10.1.199.35/29

# Server 2
ip link add link bond0 name bond0.998 type vlan id 998
ip link set bond0.998 up
ip a a dev bond0.998 10.1.199.43/29

# Server 3
ip link add link bond0 name bond0.997 type vlan id 997
ip link set bond0.997 up
ip a a dev bond0.997 10.1.199.51/29
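Before moving on to BGP, it's worth sanity-checking L3 connectivity to the switch side of each /29 (the addresses used in the peering config below):

# from server 1: both switch-side addresses of the /29 should answer
ping -c 2 10.1.199.33
ping -c 2 10.1.199.34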
Having to use a different VLAN for each BGP link is a limitation of the N9K/ACI setup in my case; it could be different with other vendors. Run vtysh on each server to configure BGP. It has an intuitive, Cisco-like CLI.
! Server 1
conf t
router bgp 65268
 bgp router-id 10.1.199.35
 network 10.1.198.0/29
 neighbor 10.1.199.33 remote-as 65268
 neighbor 10.1.199.34 remote-as 65268

! Server 2
conf t
router bgp 65268
 bgp router-id 10.1.199.43
 network 10.1.198.0/29
 neighbor 10.1.199.41 remote-as 65268
 neighbor 10.1.199.42 remote-as 65268

! Server 3
conf t
router bgp 65268
 bgp router-id 10.1.199.51
 network 10.1.198.0/29
 neighbor 10.1.199.49 remote-as 65268
 neighbor 10.1.199.50 remote-as 65268
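Once the sessions look good, it's worth persisting the running configuration from vtysh, so that bgpd comes back with the same settings after a restart (write memory saves each daemon's running config under /etc/quagga/):

end
write memory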
After the sessions come up, verify peering and routing on each server:

# vtysh

Hello, this is Quagga (version 0.99.22.4).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

server1# show ip bgp summary
BGP router identifier 10.1.199.35, local AS number 65268
RIB entries 3, using 336 bytes of memory
Peers 2, using 9120 bytes of memory

Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.1.199.33     4 65268    1526    1494        0    0    0 01:35:33        1
10.1.199.34     4 65268    1528    1496        0    0    0 01:34:57        1

Total number of neighbors 2
server1# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF,
       I - IS-IS, B - BGP, A - Babel,
       > - selected route, * - FIB route

B>* 10.1.50.0/23 [200/0] via 10.1.199.33, bond0.999, 01:36:13
C>* 10.1.198.0/29 is directly connected, tap-lb1
C>* 10.1.199.32/29 is directly connected, bond0.999
C>* 127.0.0.0/8 is directly connected, lo
The switch-side BGP configuration is left as an exercise for the reader. In my case, there is an external network behind the switches which will be used to access the VIP. Once configured, you should be able to see all the peers and routes:
leaf1# show ip bgp summary vrf all
BGP summary information for VRF *****:******, address family IPv4 Unicast
BGP router identifier **********, local AS number 65268
BGP table version is 218, IPv4 Unicast config peers 5, capable peers 5
16 network entries and 26 paths using 2752 bytes of memory
BGP attribute entries [12/1728], BGP AS path entries [0/0]
BGP community entries [0/0], BGP clusterlist entries [6/40]

Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.1.199.35     4 65268     103     134      218    0    0 01:38:04 1
10.1.199.43     4 65268      56      67      218    0    0 00:51:48 1
10.1.199.51     4 65268      51      51      218    0    0 00:46:16 1
********        4 65268   21336   21344      218    0    0 2w0d     1
********        4 65268   21336   21341      218    0    0 2w0d     1

leaf1# show ip route vrf all
IP Route Table for VRF "*****:******"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

[ .. skipped some output ... ]
10.1.50.0/23, ubest/mbest: 1/0, attached, direct, pervasive
    *via 10.0.8.65%overlay-1, [1/0], 1d00h, static
10.1.51.1/32, ubest/mbest: 1/0, attached, pervasive
    *via 10.1.51.1, Vlan23, [1/0], 2w0d, local, local
10.1.198.0/29, ubest/mbest: 3/0
    *via 10.1.199.35, [200/0], 01:14:58, bgp-65268, internal, tag 65268
    *via 10.1.199.43, [200/0], 00:48:18, bgp-65268, internal, tag 65268
    *via 10.1.199.51, [200/0], 00:48:18, bgp-65268, internal, tag 65268
10.1.199.32/29, ubest/mbest: 1/0, attached, direct
    *via 10.1.199.33, Vlan43, [1/0], 01:40:23, direct
10.1.199.33/32, ubest/mbest: 1/0, attached
    *via 10.1.199.33, Vlan43, [1/0], 01:40:23, local, local
10.1.199.40/29, ubest/mbest: 1/0, attached, direct
    *via 10.1.199.41, Vlan42, [1/0], 00:55:26, direct
10.1.199.41/32, ubest/mbest: 1/0, attached
    *via 10.1.199.41, Vlan42, [1/0], 00:55:26, local, local
10.1.199.48/29, ubest/mbest: 1/0, attached, direct
    *via 10.1.199.49, Vlan45, [1/0], 00:50:51, direct
10.1.199.49/32, ubest/mbest: 1/0, attached
    *via 10.1.199.49, Vlan45, [1/0], 00:50:51, local, local
We'll be generating load from the 10.1.50.0/23 subnet using ApacheBench (ab), first with all three peers active:
# ab -n 50000 -c 20 http://10.1.198.4/
...
Requests per second:    47397.41 [#/sec] (mean)
...
Let's remove one of the peers and run the test again:
# ab -n 50000 -c 20 http://10.1.198.4/
...
Requests per second:    38479.71 [#/sec] (mean)
...
And finally, let's leave just one peer:
# ab -n 50000 -c 20 http://10.1.198.4/
...
Requests per second:    23171.85 [#/sec] (mean)
...
Nice! Of course, you'll need extra orchestration to detect whether the application is down on a particular server and tear down its BGP session, because BGP on its own will only detect a dead peer if the whole host goes down. But you get the idea: you can use a single IP address and let L3 routing and the hardware take care of distributed load balancing.
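As a rough sketch of that orchestration (the check URL, timeout, and running it from cron or a systemd timer are all assumptions), a small health-check script could stop bgpd when nginx stops answering, which withdraws this server's routes from the switches, and start it again on recovery:

#!/bin/sh
# Hypothetical health check: drop out of the ECMP pool while nginx is
# unhealthy, rejoin once it recovers.
if curl -sf -o /dev/null --max-time 2 http://127.0.0.1/; then
    systemctl is-active --quiet bgpd || systemctl start bgpd
else
    systemctl is-active --quiet bgpd && systemctl stop bgpd
fi
exit 0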