This is the first post in a series covering how you can distribute your OpenShift cluster across multiple datacenter domains to increase the availability and performance of your control plane.
Background
Originally, the on-premises OpenShift IPI architecture was designed to deploy an internal (so-called OpenShift Managed) load balancer based on HAProxy and Keepalived. However, when you want to distribute your cluster across multiple failure domains, your control plane has to be deployable on multiple L2 networks, which are usually isolated per rack and routed with protocols like BGP.
Stretched vs L3 networks
A single stretched L2 network brings challenges:
- Network latency is not predictable
- Traffic bottlenecks
- L2 domain failures
- Network management complexity
Smaller (L3 routed) networks, however, have these benefits:
- Optimized East-West traffic
- Low and predictable latency
- Easier to extend and manage
- Failure domain isolated to a network
- Non-blocking network fabric
OpenShift Managed Load-Balancer
For on-prem platforms (vSphere, bare metal, OpenStack, oVirt and Nutanix), the control plane load balancer is based on HAProxy and Keepalived. This means the control plane VIPs (for the API and Ingress services) are managed in active/passive mode: the Keepalived master (elected via VRRPv2) hosts the VIPs, so all API and Ingress traffic always goes through one node before being load-balanced across the control plane. This bottleneck has been an issue at large scale.
Also, Keepalived doesn’t deal with L3 routing, so if the VIPs aren’t within the same subnet as the L2 networks, the network fabric can’t know where the VIPs actually are.
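To illustrate the active/passive model, here is a minimal Keepalived VRRP sketch of what such a setup looks like. It is not the actual template shipped by OpenShift; the instance name, interface, router ID and priority are placeholder values:
# Simplified keepalived.conf: one node is elected VRRP master and holds the API VIP
vrrp_instance openshift_api {
    state BACKUP
    interface ens3
    virtual_router_id 51
    priority 100
    virtual_ipaddress {
        192.168.100.240/32
    }
}
Whichever node wins the VRRP election owns the VIP; the others only take over on failure, which is why all traffic funnels through a single node at any given time.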
User Managed Load-Balancer
When we initially looked at the limitations of the OpenShift Managed Load-Balancer, we thought we would just add BGP to the OpenShift control plane, so the VIPs could be routed across the datacenter. You can have a look at this demo that shows how it would work. After the initial proposal, which brought up a lot of good ideas, it was decided that for now we would rather externalize the load balancer and let customers manage it, rather than implement something new in OpenShift (I'll come back to this in the wrap-up).
Indeed, a lot of our customers already have (enterprise-grade) load balancers that they use for their workloads. Some of them want to re-use these appliances and manage the OpenShift control plane traffic with them.
We realized that some of them want BGP and some don't; some want to keep stretched L2 networks and some don't. Implementing BGP within the OpenShift control plane would have forced a lot of decisions on our side, so we decided that for now we would allow the use of an external (user-managed) load balancer, as is already possible for the workloads themselves (e.g. with MetalLB).
More details on the design can be found in this OpenShift enhancement.
Deploy your own Load-Balancer
I want to share how someone can deploy a load balancer that will be used by the OpenShift control plane. For that, I’ve decided to create an Ansible role named ansible-role-routed-lb.
The role deploys an advanced load balancer capable of managing routed VIPs with FRR (using BGP) and load-balancing traffic with HAProxy.
The role will do the following:
- Deploy FRR and peer with your BGP neighbor(s), if BGP neighbors are provided in the config. When the VIPs are created on the node, they are routed in your infrastructure.
- Deploy HAProxy to load-balance and monitor your services. If VIPs are provided in the config, a VIP is created once the minimum number of backends is found healthy for a given service (and therefore routed via BGP if FRR is deployed), and removed when no backend is healthy for that service (and therefore withdrawn from BGP if FRR is deployed).
So if you're hosting multiple load balancers, your OpenShift control plane traffic will be:
- routed thanks to BGP, if FRR is deployed
- load-balanced and made highly available at the VIP level thanks to BGP, if FRR is deployed (a simplified FRR configuration is sketched after this list)
- load-balanced between healthy backends thanks to HAProxy
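For context, a simplified frr.conf produced by such a setup could look like the sketch below. The exact template is owned by the role; the "remote-as external" peering and the use of connected-route redistribution are assumptions made here for illustration, while the ASN, neighbor IP and password come from the configuration used later in this post:
! Simplified FRR configuration: advertise locally configured VIPs to the BGP neighbor
router bgp 64998
 neighbor 192.168.11.1 remote-as external
 neighbor 192.168.11.1 password f00barZ
 address-family ipv4 unicast
  ! VIPs are added as local addresses only while enough backends are healthy,
  ! so redistributing connected routes announces (or withdraws) them automatically
  redistribute connected
 exit-address-family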
Let’s deploy it!
In this blog post, we won’t cover the Failure Domains yet, and will deploy OpenShift within a single Leaf. Therefore, we’ll deploy only one load balancer.
Create your Ansible inventory.yaml file:
---
all:
  hosts:
    lb:
      ansible_host: 192.168.11.2
      ansible_user: cloud-user
      ansible_become: true
192.168.11.2 is the IP address of the load balancer.
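Before going further, you can check that Ansible can reach the host, for example with:
ansible -i inventory.yaml lb -m ping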
Create the Ansible playbook.yaml file:
---
- hosts: lb
  vars:
    config: lb
  tasks:
    - name: Deploy the LBs
      include_role:
        name: emilienm.routed_lb
Write the LB configs in the Ansible vars.yaml file:
---
configs:
  lb:
    bgp_asn: 64998
    bgp_neighbors:
      - ip: 192.168.11.1
        password: f00barZ
    services:
      - name: api
        vips:
          - 192.168.100.240
        min_backends: 1
        healthcheck: "httpchk GET /readyz HTTP/1.0"
        balance: roundrobin
        frontend_port: 6443
        haproxy_monitor_port: 8081
        backend_opts: "check check-ssl inter 1s fall 2 rise 3 verify none"
        backend_port: 6443
        backend_hosts: &lb_hosts
          - name: rack1-10
            ip: 192.168.11.10
          - name: rack1-11
            ip: 192.168.11.11
          - name: rack1-12
            ip: 192.168.11.12
          - name: rack1-13
            ip: 192.168.11.13
          - name: rack1-14
            ip: 192.168.11.14
          - name: rack1-15
            ip: 192.168.11.15
          - name: rack1-16
            ip: 192.168.11.16
          - name: rack1-17
            ip: 192.168.11.17
          - name: rack1-18
            ip: 192.168.11.18
          - name: rack1-19
            ip: 192.168.11.19
          - name: rack1-20
            ip: 192.168.11.20
      - name: ingress_http
        vips:
          - 192.168.100.250
        min_backends: 1
        healthcheck: "httpchk GET /healthz/ready HTTP/1.0"
        frontend_port: 80
        haproxy_monitor_port: 8082
        balance: roundrobin
        backend_opts: "check check-ssl port 1936 inter 1s fall 2 rise 3 verify none"
        backend_port: 80
        backend_hosts: *lb_hosts
      - name: ingress_https
        vips:
          - 192.168.100.250
        min_backends: 1
        healthcheck: "httpchk GET /healthz/ready HTTP/1.0"
        frontend_port: 443
        haproxy_monitor_port: 8083
        balance: roundrobin
        backend_opts: "check check-ssl port 1936 inter 1s fall 2 rise 3 verify none"
        backend_port: 443
        backend_hosts: *lb_hosts
      - name: mcs
        vips:
          - 192.168.100.240
        min_backends: 1
        frontend_port: 22623
        haproxy_monitor_port: 8084
        balance: roundrobin
        backend_opts: "check check-ssl inter 5s fall 2 rise 3 verify none"
        backend_port: 22623
        backend_hosts: *lb_hosts
In this case, we deploy OpenShift on OpenStack, which doesn't support static IPs. Therefore, we have to put all the available IPs from the machine subnet into the HAProxy backends.
Install the role and the dependencies:
ansible-galaxy install emilienm.routed_lb,1.0.0
ansible-galaxy collection install ansible.posix ansible.utils
Deploy the LBs:
ansible-playbook -i inventory.yaml -e "@vars.yaml" playbook.yaml
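Once the playbook has run, you can sanity-check the result on the load balancer. The commands below assume FRR provides vtysh and that the role exposes an HAProxy monitor frontend on the haproxy_monitor_port defined for each service:
# Check that the BGP session with 192.168.11.1 is established
sudo vtysh -c "show bgp summary"
# Check that the API VIP is present on the host once enough backends are healthy
ip address show | grep 192.168.100.240
# Query the HAProxy monitor for the api service (port 8081 in the config above)
curl -s -o /dev/null -w "%{http_code}\n" http://192.168.11.2:8081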
Deploy OpenShift
This feature will be available in the 4.13 release as TechPreview.
Here is how you can simply enable it via the install-config.yaml file:
apiVersion: v1
baseDomain: mydomain.test
compute:
- name: worker
  platform:
    openstack:
      type: m1.xlarge
  replicas: 3
controlPlane:
  name: master
  platform:
    openstack:
      type: m1.xlarge
  replicas: 3
metadata:
  name: mycluster
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 192.168.11.0/24
  - cidr: 192.168.100.0/24
platform:
  openstack:
    cloud: mycloud
    machinesSubnet: 8586bf1a-cc3c-4d40-bdf6-c243decc603a
    apiVIPs:
    - 192.168.100.240
    ingressVIPs:
    - 192.168.100.250
    loadBalancer:
      type: UserManaged
featureSet: TechPreviewNoUpgrade
You can also watch this demo which shows the outcome.
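After the installation, you can confirm the cluster is using the user-managed load balancer by inspecting the Infrastructure object. The exact field path shown here is an assumption based on the enhancement and may differ between releases:
oc get infrastructure cluster -o jsonpath='{.status.platformStatus.openstack.loadBalancer.type}'
This should report UserManaged instead of the default OpenShift managed load balancer.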
Known limitations
- Deploying OpenShift with static IPs for the machines is only supported on the bare metal platform for now, but supporting it on vSphere and OpenStack is on the roadmap.
- Changing the IP address of any OpenShift control plane VIP (API or Ingress) is currently not supported: once the external LB and the OpenShift cluster are deployed, the VIPs can't be changed. This is on our roadmap.
- Migrating an OpenShift cluster from the OpenShift managed LB to an external LB is currently not supported. It's on our roadmap as well.
Keep in mind that the feature will be TechPreview at first; once it has reached some maturity, we'll promote it to GA.
Wrap-up
Having the VIPs highly available and routed across multiple domains is only a first step toward distributing the OpenShift control plane. In a future post, we'll discuss how Failure Domains can be configured when deploying OpenShift on OpenStack. Note that this is already doable on bare metal and vSphere.
With this effort, our customers can now decide which load balancer to use, and if they already have expertise with their appliance, they can use it for the OpenShift control plane.
The way we implemented it is flexible and will allow us to support new load balancers in OpenShift in the future. The proof of concept done a few months ago with BGP in the control plane could be revived if there is growing interest.
I hope you liked this article and stay tuned for the next ones!