High Availability with Leader Election¶
This guide explains how to deploy and operate the HAProxy Template Ingress Controller in high availability (HA) mode with multiple replicas.
Overview¶
The controller supports running multiple replicas for high availability using leader election based on Kubernetes Leases. Only the elected leader performs write operations (deploying configurations to HAProxy), while all replicas continue watching resources, rendering templates, and validating configurations to maintain "hot standby" status.
Benefits of HA deployment:
- Zero-downtime during controller upgrades (rolling updates)
- Automatic failover if leader pod crashes (~15-20 seconds)
- All replicas ready to take over immediately (hot standby)
- Balanced leader distribution across nodes
How it works:
- All replicas watch Kubernetes resources and render HAProxy configurations
- Leader election determines which replica can deploy configs to HAProxy
- When leader fails, followers automatically elect a new leader
- Leadership transitions are logged and tracked via Prometheus metrics
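The handover logic above can be illustrated with a toy model of lease-based election (plain Python, not the controller's actual implementation; names and timings are illustrative):

```python
class Lease:
    """In-memory stand-in for a Kubernetes Lease object."""
    def __init__(self):
        self.holder = None      # holderIdentity
        self.renew_time = 0.0   # renewTime, in seconds

def try_acquire(lease, identity, now, lease_duration):
    """Acquire the lease if it is unheld or expired; renew it if this
    identity already holds it. Returns True when `identity` is the
    leader after the call."""
    expired = now - lease.renew_time > lease_duration
    if lease.holder in (None, identity) or expired:
        lease.holder = identity
        lease.renew_time = now
        return True
    return False

# pod-a leads, then stops renewing; pod-b may only take over
# after lease_duration has elapsed.
lease = Lease()
assert try_acquire(lease, "pod-a", now=0, lease_duration=60)       # becomes leader
assert not try_acquire(lease, "pod-b", now=30, lease_duration=60)  # lease still fresh
assert try_acquire(lease, "pod-b", now=61, lease_duration=60)      # expired: failover
```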
Configuration¶
Enable Leader Election¶
Leader election is enabled by default when deploying with 2+ replicas via Helm:
```yaml
# values.yaml (defaults)
replicaCount: 2  # Run 2 replicas for HA

controller:
  config:
    controller:
      leader_election:
        enabled: true
        lease_name: haptic-leader
        lease_duration: 60s  # Failover happens within this time
        renew_deadline: 15s  # Leader tries to renew for this long
        retry_period: 5s     # Interval between renewal attempts
```
Disable Leader Election¶
For development or single-replica deployments:
```yaml
# values.yaml
replicaCount: 1

controller:
  config:
    controller:
      leader_election:
        enabled: false  # Disabled in single-replica mode
```
Timing Parameters¶
The timing parameters control failover speed and tolerance:
| Parameter | Default | Purpose | Recommendations |
|---|---|---|---|
| `lease_duration` | 60s | Max time followers wait before taking over | Increase for flaky networks (120s) |
| `renew_deadline` | 15s | How long the leader retries before giving up | Should be < `lease_duration` (1/4 ratio) |
| `retry_period` | 5s | Interval between leader renewal attempts | Should be < `renew_deadline` (1/3 ratio) |
Failover time calculation:
Worst-case failover = lease_duration + renew_deadline
Default failover = 60s + 15s = 75s (but typically 15-20s)
Clock skew tolerance:
Skew tolerance = lease_duration - renew_deadline
Default = 60s - 15s = 45s (handles up to 4x clock differences)
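As a quick check of the two formulas with the default values (plain Python; durations in seconds):

```python
lease_duration = 60
renew_deadline = 15

worst_case_failover = lease_duration + renew_deadline  # follower wait + leader retries
skew_tolerance = lease_duration - renew_deadline       # headroom for clock differences

print(worst_case_failover)  # 75
print(skew_tolerance)       # 45
```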
Deployment¶
Standard HA Deployment¶
Deploy with 2-3 replicas (default Helm configuration):
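For example, with Helm (a sketch; the repository alias, chart, and release names are assumptions — substitute those from your installation):

```bash
# replicaCount defaults to 2, so no override is needed for HA
helm upgrade --install haptic <chart-repo>/haptic \
  --namespace <namespace> --create-namespace
```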
Scaling¶
Scale the deployment dynamically:
```bash
# Scale to 3 replicas
kubectl scale deployment haptic-controller --replicas=3

# Scale back to 2
kubectl scale deployment haptic-controller --replicas=2
```
RBAC Requirements¶
For leader election, the controller additionally requires get, create, and update permissions on Lease resources in the `coordination.k8s.io` API group.
These are automatically configured in the Helm chart's ClusterRole.
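For reference, the Lease rules look roughly like this (a sketch based on the `kubectl auth can-i` checks later in this guide; the chart's actual rule names and scoping may differ):

```yaml
rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "create", "update"]
```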
Monitoring Leadership¶
Check Current Leader¶
```bash
# View Lease resource
kubectl get lease -n <namespace> haptic-leader -o yaml

# Output shows current leader:
#   spec:
#     holderIdentity: haptic-7d9f8b4c6d-abc12
```
View Leadership Status in Logs¶
```bash
# Leader logs show:
kubectl logs -n <namespace> deployment/haptic-controller | grep -E "leader|election"

# Example output:
#   level=INFO msg="Leader election started" identity=pod-abc12 lease=haptic-leader
#   level=INFO msg="Became leader: pod-abc12" identity=pod-abc12
```
Prometheus Metrics¶
Monitor leader election via metrics endpoint:
```bash
kubectl port-forward -n <namespace> deployment/haptic-controller 9090:9090
curl http://localhost:9090/metrics | grep leader_election
```
Key metrics:
```promql
# Current leader (should be 1 across all replicas)
sum(haptic_leader_election_is_leader)

# Identify which pod is leader
haptic_leader_election_is_leader{pod=~".*"} == 1

# Leadership transition rate (should be low)
rate(haptic_leader_election_transitions_total[1h])
```
Troubleshooting¶
No Leader Elected¶
Symptoms:
- No deployments happening
- All replicas show `is_leader=0`
- Logs show constant election failures
Common causes:
- Missing RBAC permissions:
```bash
kubectl auth can-i get leases --as=system:serviceaccount:<namespace>:haptic
kubectl auth can-i create leases --as=system:serviceaccount:<namespace>:haptic
kubectl auth can-i update leases --as=system:serviceaccount:<namespace>:haptic
```
- Missing environment variables:
```bash
kubectl get pod <pod-name> -o yaml | grep -A2 "POD_NAME\|POD_NAMESPACE"

# Should show:
#   - name: POD_NAME
#     valueFrom:
#       fieldRef:
#         fieldPath: metadata.name
```
- API server connectivity:
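To check the last cause, one approach (generic kubectl commands, not controller-specific tooling) is to confirm the API server is healthy and look for connection errors in the controller logs:

```bash
# API server health
kubectl get --raw='/readyz?verbose'

# Connection errors / timeouts in the controller logs
kubectl logs -n <namespace> deployment/haptic-controller | grep -iE "timeout|connection refused"
```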
Multiple Leaders (Split-Brain)¶
Symptoms:
- `sum(haptic_leader_election_is_leader) > 1`
- Multiple pods deploying configs simultaneously
- Conflicting deployments in HAProxy
This should never happen with proper Kubernetes Lease implementation. If it does:
- Check for severe clock skew between nodes:
- Verify Kubernetes API server health:
- Restart all controller pods:
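The three steps can be sketched as follows (illustrative commands; adjust names and namespaces to your cluster):

```bash
# 1. Check for severe clock skew: compare UTC time reported on each node
kubectl debug node/<node-name> -it --image=busybox -- date -u

# 2. Verify Kubernetes API server health
kubectl get --raw='/readyz?verbose'

# 3. Restart all controller pods
kubectl rollout restart deployment/haptic-controller -n <namespace>
```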
Frequent Leadership Changes¶
Symptoms:
- `rate(haptic_leader_election_transitions_total[1h]) > 5`
- Logs show frequent "Lost leadership" / "Became leader" messages
- Deployments failing intermittently
Common causes:
- Resource contention: the leader pod can't renew the lease in time. Solution: increase CPU/memory limits.
- Network issues: API server communication is delayed. Solution: increase `lease_duration` and `renew_deadline`.
- Node issues: the leader pod's node is experiencing problems. Solution: drain and investigate the node.
Leader Not Deploying¶
Symptoms:
- One replica shows `is_leader=1`
- No deployment errors in logs
- HAProxy configs not updating
Diagnosis:
```bash
# Check leader logs for deployment activity
kubectl logs <leader-pod> | grep -i "deploy"

# Verify leader-only components started
kubectl logs <leader-pod> | grep "Started.*Deployer\|DeploymentScheduler"
```
Common causes:
- Deployment components failed to start (check logs for errors)
- Rate limiting preventing deployment (check drift prevention interval)
- HAProxy instances unreachable (check network connectivity)
Best Practices¶
Replica Count¶
Development:
- 1 replica with `leader_election.enabled: false`
Staging:
- 2 replicas with leader election enabled
Production:
- 2-3 replicas across multiple availability zones
- Enable PodDisruptionBudget:
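A PodDisruptionBudget sketch that keeps at least one replica available during voluntary disruptions (the name and labels are assumptions matching the chart's label conventions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: haptic-controller
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: haptic
```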
Resource Allocation¶
Allocate sufficient resources for hot standby:
```yaml
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m     # Allow bursts during leader work
    memory: 512Mi
```
All replicas perform the same work (watching, rendering, validating), so resource usage is similar.
Anti-Affinity¶
Distribute replicas across nodes for better availability:
```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: haptic
          topologyKey: kubernetes.io/hostname
```
Monitoring and Alerts¶
Set up Prometheus alerts for leader election health:
```yaml
groups:
  - name: haproxy-ic-leader-election
    rules:
      # No leader
      - alert: NoLeaderElected
        expr: sum(haptic_leader_election_is_leader) < 1
        for: 1m
        annotations:
          summary: "No HAProxy controller leader elected"

      # Multiple leaders (split-brain)
      - alert: MultipleLeaders
        expr: sum(haptic_leader_election_is_leader) > 1
        annotations:
          summary: "Multiple HAProxy controller leaders detected (split-brain)"

      # Frequent transitions
      - alert: FrequentLeadershipChanges
        expr: rate(haptic_leader_election_transitions_total[1h]) > 5
        for: 15m
        annotations:
          summary: "HAProxy controller experiencing frequent leadership changes"
```
Migration from Single-Replica¶
To migrate an existing single-replica deployment to HA:
1. Verify RBAC permissions (the Helm chart updates this automatically)
2. Update values.yaml:
3. Upgrade with Helm:
4. Verify leadership:
5. Confirm one leader:

```bash
kubectl get pods -l app.kubernetes.io/name=haptic,app.kubernetes.io/component=controller \
  -o custom-columns=NAME:.metadata.name,LEADER:.status.podIP

# Check metrics to identify leader
for pod in $(kubectl get pods -l app.kubernetes.io/name=haptic,app.kubernetes.io/component=controller -o name); do
  echo "$pod:"
  kubectl exec $pod -- wget -qO- localhost:9090/metrics | grep is_leader
done
```
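Steps 2 and 3 above can be sketched as follows (a values.yaml fragment matching the defaults shown earlier; release and chart names are assumptions):

```yaml
# values.yaml
replicaCount: 2
controller:
  config:
    controller:
      leader_election:
        enabled: true
```

followed by `helm upgrade <release> <chart> -f values.yaml`.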
See Also¶
- Leader Election Design - Architecture and implementation details
- Monitoring Guide - Prometheus metrics and alerting
- Debugging Guide - Runtime introspection and troubleshooting
- Security Guide - RBAC and security best practices
- Performance Guide - Resource sizing and optimization
- Troubleshooting Guide - General troubleshooting