What we needed
Most of our infrastructure runs on Docker, on bare-metal clusters. At some point we needed to let our apps talk to each other over an overlay network instead of the usual Docker port binding. We could have used service discovery and port binding, but that would have been more complicated than we needed: we would have had to bind several ports for a single app on a single host to run several replicas per host, and change our load-balancing configuration on every topology change. Besides, setting up an overlay network is a nice preamble to a migration to a Kubernetes cluster.
Why did we choose BGP?
Our software choice comes from a simple observation: the Internet relies on a surprisingly little-known protocol, BGP. The very principle of this protocol is to provide dynamic routing based on a few simple rules. Most software-defined networking tools rely on either OSPF or BGP. We chose BGP for its simplicity and proven robustness, and many open-source projects have made the same design choice.
Why did we choose Quagga?
Quagga is quite reliable and is often found in networking stacks alongside proprietary hardware from vendors such as Cisco or Juniper. It is still actively developed and has received many updates over time, even though Zebra (the routing daemon behind Quagga) may look a bit dated from an outside perspective.
A bit of tech
Since Kubernetes requires overlay networking, its documentation suggests a variety of tools to achieve it. Many of those tools are either black boxes or impractical for our use case. We figured this network abstraction would also benefit our "not yet Kubernetes compliant" applications, so we did some testing and got satisfying results.
A cluster, a lot of networks
The very principle of an overlay network is to give each node in the cluster a network that is reachable from every other node. From a logical point of view, our stack looked like this:
As you can see, each Docker container needs an individual port binding to work properly. Those ports then have to be mapped in our load balancers to receive traffic. With the overlay network, we created a basic abstraction over this bare-metal entanglement:
Now each container has its own routable IP address and can use the same port even on the same host.
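For this to work, each host's Docker bridge must live in its own subnet, which is what the `bip` (bridge IP) setting in Docker's daemon.json controls. A minimal sketch, assuming host N owns 10.0.N.0/24 (the host index and local file path are illustrative; on a real host the file is /etc/docker/daemon.json and the Docker daemon must be restarted afterwards):

```shell
#!/bin/sh
# Illustrative only: derive this host's Docker bridge IP from a host index,
# so host N owns 10.0.N.0/24 and its bridge takes 10.0.N.1.
host_index=2
bip="10.0.${host_index}.1/24"

# Written locally for the sketch; on a real host this is /etc/docker/daemon.json,
# followed by a Docker daemon restart.
cat > daemon.json <<EOF
{
  "bip": "${bip}"
}
EOF
echo "$bip"
```

With one such subnet per host, every container IP is unique cluster-wide, and BGP only has to announce one /24 per node.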
This topology is propagated by bgpd, Quagga's BGP daemon. A sample configuration snippet could look like this:
```
! Ansible managed
log file /var/log/quagga/bgpd.log
!debug bgp events
!debug bgp filters
!debug bgp fsm
!debug bgp keepalives
!debug bgp updates
router bgp 65500
 bgp router-id 172.16.200.1
! # This is the IP address of the observed host
 timers bgp 30 90
 redistribute static
! # we want to send away our static routes
 network 10.0.0.0 mask 255.255.255.0
! # This is the docker0 network, so we need to append
! # the "bip": "10.0.0.1/24" flag to docker's daemon.json
!
! # Following: a description of our neighbors in the same AS
 neighbor 172.16.200.10 remote-as 65500
 neighbor 172.16.200.10 route-map foo out
 neighbor 172.16.200.10 route-map foo in
 neighbor 172.16.200.10 activate
 neighbor 172.16.200.200 remote-as 65500
 neighbor 172.16.200.200 route-map bar out
 neighbor 172.16.200.200 route-map bar in
 neighbor 172.16.200.200 activate
!
! # We set the same preference to each router
route-map foo permit 10
 set local-preference 222
!
route-map bar permit 10
 set local-preference 222
```
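Once bgpd is running on each host with a configuration like the one above, the sessions can be checked from Quagga's vtysh shell (the peer addresses match the snippet above; an established neighbor shows a prefix count rather than a state name in the State/PfxRcd column):

```
vtysh -c "show ip bgp summary"   # one line per neighbor with its session state
vtysh -c "show ip bgp"           # the 10.0.x.0/24 prefixes learned from each peer
vtysh -c "show ip route bgp"     # the kernel routes Zebra installed from BGP
```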
The resulting routing table is quite straightforward:
```
$ ip r | grep -i zebra
10.0.2.0/24 via 172.16.200.10 dev eth1 proto zebra
10.0.3.0/24 via 172.16.200.200 dev eth1 proto zebra
```
We now have a fully operational overlay network. At this point, you may think that a classic SDN tool like Calico would be easier to manage. At a certain scale that may be true, but we also have to take into account the main constraint of our environment: we are not on a public cloud, so we have to manage some things by hand. Fortunately for us, Ansible appeared in our world a long time ago to make our lives easier. Today, rolling out a topology change in our overlay stack is a matter of seconds, with no service interruption whatsoever.
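The "! Ansible managed" header in the bgpd configuration above hints at how those changes are rolled out: the neighbor list can be generated from the inventory with a template. A minimal Jinja2 sketch, assuming a hypothetical `overlay` inventory group, a `bgp_as` variable, and eth1 as the peering interface (all names are illustrative, not our actual playbook):

```
! Ansible managed
router bgp {{ bgp_as }}
 bgp router-id {{ ansible_eth1.ipv4.address }}
 timers bgp 30 90
 redistribute static
{% for host in groups['overlay'] if host != inventory_hostname %}
 neighbor {{ hostvars[host].ansible_eth1.ipv4.address }} remote-as {{ bgp_as }}
 neighbor {{ hostvars[host].ansible_eth1.ipv4.address }} activate
{% endfor %}
```

Adding or removing a node then comes down to re-running the playbook; BGP reconverges on its own.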
Since Quagga is not self-sufficient, we added a home-brewed service discovery tool to ensure that all of our live apps can communicate with each other and receive traffic from our load balancers. This networking feature subsequently allowed us to enable automated gossiping between our apps and to do a lot of other fun stuff.
For Kubernetes : Load balancers, ingresses and stuff
Since there is a lot to read on the Internet about those topics, I think it's better to point out the "good ones" rather than poorly paraphrase them:
- Julia Evans' article about networking in kubernetes
- Kubernetes networking documentation
- Understanding kubernetes networking pods
- The amazing talk "life of a packet" (slides are here)
- Last but not least Haproxy's article
Our load balancers are aware of every app in every container and can send packets to each application based on our ACLs. Ingresses will be covered in another article yet to be written.
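With overlay routing in place, a load-balancer backend can point straight at container IPs, on the same port regardless of which host a replica runs on. A hypothetical HAProxy fragment (backend name, addresses and port are made up for illustration):

```
backend app_foo
    # replicas on two different hosts, same application port
    server replica1 10.0.2.5:8080 check
    server replica2 10.0.3.7:8080 check
```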
Paving the way to multihomed infrastructure
Since we have a reproducible network model, why not apply it to an L2/L3 interconnection? You can check why and how here.
Even though the setup is quite simple once it is running, getting it started from scratch was quite a ride. We want to acknowledge here the help Paul Jakma provided on some steps of debugging our BGP setup.