linux · bgp

Load Balancing Without Load Balancers

I read an article some time ago discussing the possibility of using BGP to implement a distributed load balancer. I always wanted to explore this idea and finally had the chance to do so. Typically, when you use HAProxy for load balancing, you deploy two HAProxy instances with different IPs and share a VIP that is held by one instance at a time. Keepalived or Pacemaker are the usual tools to implement the VRRP / VIP handover. This is an active-passive HA solution, because only one HAProxy instance can hold the VIP at any given moment.

For example:

HAProxy #1 IP=10.1.1.101
HAProxy #2 IP=10.1.1.102
HAProxy    VIP=10.1.1.100

With this configuration, when you hit 10.1.1.100, it will actually be served by the active instance only, which is normally elected via VRRP/Keepalived/Pacemaker. When you need to scale, you’re limited by the single active load balancer, or you have to add extra virtual IP addresses, which can be a pain when configuring access/DNS/failover/etc.
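For reference, the traditional active-passive setup above boils down to a keepalived VRRP instance roughly like the minimal sketch below (the interface name and virtual_router_id are arbitrary assumptions, not taken from a real deployment):

vrrp_instance VI_1 {
    state MASTER            # the second node runs state BACKUP with a lower priority
    interface eth0          # interface that carries the VIP
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        10.1.1.100/24
    }
}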

With ECMP (equal-cost multi-path) routing and BGP it is possible to create an active-active load balancer with a single distributed virtual IP address. In this example I’ll be using a pair of Cisco Nexus 9K switches working in ACI mode together with 3 physical servers running Quagga BGP. The physical network diagram is as follows:

Every server will have the same IP address (the VIP) and will peer with the switches over BGP. The switches will load-balance connections across the servers because all of the routes have the same cost. Modern switches are smart enough to hash per flow, so packets belonging to the same TCP connection are forwarded via the same path and packet ordering is preserved; that should not be a problem.

Hack Time

On every server, install nginx as a test workload and Quagga for the BGP functionality. We also need to enable IP forwarding.

yum -y install nginx quagga
touch /etc/quagga/bgpd.conf
chown quagga:quagga /etc/quagga/bgpd.conf
systemctl enable zebra
systemctl enable bgpd
systemctl start zebra
systemctl start bgpd
sysctl -w net.ipv4.conf.all.forwarding=1
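Note that sysctl -w changes the setting only until the next reboot. To make forwarding persistent, you could drop it into a sysctl config file, for example (the file name is my own choice, assuming a systemd-era distribution that reads /etc/sysctl.d):

echo 'net.ipv4.conf.all.forwarding = 1' > /etc/sysctl.d/90-ip-forwarding.conf
sysctl --system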

Quagga is a software routing suite for Linux (and other operating systems). It is split into several daemons, each with its own job: the zebra process handles interface configuration, talks to the kernel routing table and provides other general functions, while the bgpd process implements the BGP routing protocol.

Edit /etc/nginx/nginx.conf and change worker_processes auto; to worker_processes 1; so that the difference with 1, 2 or 3 peers active is visible when we run the tests. Start nginx after that:

systemctl enable nginx
systemctl start nginx
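If you prefer to script that change, and want to be able to tell which backend answered during the tests, something like the following works (a convenience sketch assuming the stock CentOS nginx config and its default document root; not part of the original setup):

sed -i 's/worker_processes auto;/worker_processes 1;/' /etc/nginx/nginx.conf
hostname > /usr/share/nginx/html/index.html
systemctl restart nginx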

On every server, let’s configure the VIP. It is the same address on every host, and that does not cause conflicts: the 10.1.198.0/29 subnet is never attached to the switches directly, it is only reachable via the BGP routes pointing at each server’s peering address.

ip tuntap add dev tap-lb1 mode tap
ip a a dev tap-lb1 10.1.198.4/29
ip l s dev tap-lb1 mtu 9000
ip l s dev tap-lb1 up

And also configure the interfaces for BGP peering:

#Server 1
ip link add link bond0 name bond0.999 type vlan id 999
ip link set bond0.999 up
ip a a dev bond0.999 10.1.199.35/29

#Server 2
ip link add link bond0 name bond0.998 type vlan id 998
ip link set bond0.998 up
ip a a dev bond0.998 10.1.199.43/29

#Server 3
ip link add link bond0 name bond0.997 type vlan id 997
ip link set bond0.997 up
ip a a dev bond0.997 10.1.199.51/29

Having to use a different VLAN for each BGP link is an N9K/ACI limitation in my case; it could be different with other vendors. Run vtysh to configure BGP; it has an intuitive, Cisco-like CLI.

Server #1

conf t
router bgp 65268
 bgp router-id 10.1.199.35
 network 10.1.198.0/29
 neighbor 10.1.199.33 remote-as 65268
 neighbor 10.1.199.34 remote-as 65268

Server #2

conf t
router bgp 65268
 bgp router-id 10.1.199.43
 network 10.1.198.0/29
 neighbor 10.1.199.41 remote-as 65268
 neighbor 10.1.199.42 remote-as 65268

Server #3

conf t
router bgp 65268
 bgp router-id 10.1.199.51
 network 10.1.198.0/29
 neighbor 10.1.199.49 remote-as 65268
 neighbor 10.1.199.50 remote-as 65268
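On each server, leave configuration mode and save the running configuration so it survives a daemon restart; depending on your vtysh settings this ends up either in the per-daemon files such as /etc/quagga/bgpd.conf or in an integrated Quagga.conf:

end
write memory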

Verify:

# vtysh

Hello, this is Quagga (version 0.99.22.4).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

server1# 
server1# show ip bgp summary
BGP router identifier 10.1.199.35, local AS number 65268
RIB entries 3, using 336 bytes of memory
Peers 2, using 9120 bytes of memory

Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.1.199.33     4 65268    1526    1494        0    0    0 01:35:33        1
10.1.199.34     4 65268    1528    1496        0    0    0 01:34:57        1

Total number of neighbors 2
server1# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, A - Babel,
       > - selected route, * - FIB route

B>* 10.1.50.0/23 [200/0] via 10.1.199.33, bond0.999, 01:36:13
C>* 10.1.198.0/29 is directly connected, tap-lb1
C>* 10.1.199.32/29 is directly connected, bond0.999
C>* 127.0.0.0/8 is directly connected, lo
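To double-check what a server is announcing to the fabric, you can also run the following from vtysh (output omitted here):

show ip bgp neighbors 10.1.199.33 advertised-routes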

The switch-side BGP configuration is left as an exercise for the reader. In my case, I had an external network behind the switches that is used to access the VIP. Once everything is configured, you should be able to see all the peers and the routes:
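For reference, on a standalone (non-ACI) Nexus the relevant pieces would look roughly like the sketch below; in ACI mode the equivalent is expressed as an L3Out through the APIC rather than on the switch CLI, so treat this purely as an illustration. The important knob is maximum-paths ibgp, which lets the switch install all three iBGP-learned paths for ECMP:

feature bgp
router bgp 65268
  address-family ipv4 unicast
    maximum-paths ibgp 3
  neighbor 10.1.199.35
    remote-as 65268
    address-family ipv4 unicast

and likewise for the 10.1.199.43 and 10.1.199.51 neighbors.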

leaf1# show ip bgp summary vrf all
BGP summary information for VRF *****:******, address family IPv4 Unicast
BGP router identifier **********, local AS number 65268
BGP table version is 218, IPv4 Unicast config peers 5, capable peers 5
16 network entries and 26 paths using 2752 bytes of memory
BGP attribute entries [12/1728], BGP AS path entries [0/0]
BGP community entries [0/0], BGP clusterlist entries [6/40]

Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.1.199.35     4 65268     103     134      218    0    0 01:38:04 1         
10.1.199.43     4 65268      56      67      218    0    0 00:51:48 1         
10.1.199.51     4 65268      51      51      218    0    0 00:46:16 1         
********          4 65268   21336   21344      218    0    0     2w0d 1
********          4 65268   21336   21341      218    0    0     2w0d 1

leaf1# show ip route vrf all
IP Route Table for VRF "*****:******"
'*' denotes best ucast next-hop
'**' denotes best mcast next-hop
'[x/y]' denotes [preference/metric]
'%<string>' in via output denotes VRF <string>

[ .. skipped some output ... ]
10.1.50.0/23, ubest/mbest: 1/0, attached, direct, pervasive
    *via 10.0.8.65%overlay-1, [1/0], 1d00h, static
10.1.51.1/32, ubest/mbest: 1/0, attached, pervasive
    *via 10.1.51.1, Vlan23, [1/0], 2w0d, local, local
10.1.198.0/29, ubest/mbest: 3/0
    *via 10.1.199.35, [200/0], 01:14:58, bgp-65268, internal, tag 65268
    *via 10.1.199.43, [200/0], 00:48:18, bgp-65268, internal, tag 65268
    *via 10.1.199.51, [200/0], 00:48:18, bgp-65268, internal, tag 65268
10.1.199.32/29, ubest/mbest: 1/0, attached, direct
    *via 10.1.199.33, Vlan43, [1/0], 01:40:23, direct
10.1.199.33/32, ubest/mbest: 1/0, attached
    *via 10.1.199.33, Vlan43, [1/0], 01:40:23, local, local
10.1.199.40/29, ubest/mbest: 1/0, attached, direct
    *via 10.1.199.41, Vlan42, [1/0], 00:55:26, direct
10.1.199.41/32, ubest/mbest: 1/0, attached
    *via 10.1.199.41, Vlan42, [1/0], 00:55:26, local, local
10.1.199.48/29, ubest/mbest: 1/0, attached, direct
    *via 10.1.199.49, Vlan45, [1/0], 00:50:51, direct
10.1.199.49/32, ubest/mbest: 1/0, attached
    *via 10.1.199.49, Vlan45, [1/0], 00:50:51, local, local

Testing time!

We’ll be generating load from the 10.1.50.0/23 subnet using ab (ApacheBench).

# ab -n 50000 -c 20  http://10.1.198.4/
...
Requests per second:    47397.41 [#/sec] (mean)
...

Let’s remove one of the peers and run the test again:

# ab -n 50000 -c 20  http://10.1.198.4/
...
Requests per second:    38479.71 [#/sec] (mean)
...

Now let’s leave just one peer:

# ab -n 50000 -c 20  http://10.1.198.4/
...
Requests per second:    23171.85 [#/sec] (mean)
...

Nice! Of course, you’ll need extra orchestration to detect that the application is down on a particular server and take its BGP session down (or withdraw the route), because BGP on its own will only detect a dead peer when the whole host goes down. But you get the idea: you can use a single IP address and let L3 routing and the switch hardware take care of distributing the load.
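A minimal sketch of such a health check, assuming you are happy to simply stop bgpd whenever nginx stops answering locally (the script path and check URL are my own choices; run it from cron or a systemd timer):

#!/bin/bash
# /usr/local/sbin/vip-healthcheck.sh (hypothetical): pull this host out of the
# ECMP group by stopping bgpd if the local nginx stops answering, and bring
# the BGP session back once nginx recovers.
if curl -fsS --max-time 2 -o /dev/null http://127.0.0.1/; then
    systemctl is-active --quiet bgpd || systemctl start bgpd
else
    systemctl stop bgpd
fi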
