High Availability with Leader Election¶
This guide explains how to deploy and operate the HAProxy Template Ingress Controller in high availability (HA) mode with multiple replicas.
Overview¶
The controller supports running multiple replicas for high availability using leader election based on Kubernetes Leases. Only the elected leader performs write operations (deploying configurations to HAProxy), while all replicas continue watching resources, rendering templates, and validating configurations to maintain "hot standby" status.
Benefits of HA deployment:
- Zero-downtime during controller upgrades (rolling updates)
- Automatic failover if leader pod crashes (~15-20 seconds)
- All replicas ready to take over immediately (hot standby)
- Balanced leader distribution across nodes
How it works:
- All replicas watch Kubernetes resources and render HAProxy configurations
- Leader election determines which replica can deploy configs to HAProxy
- When leader fails, followers automatically elect a new leader
- Leadership transitions are logged and tracked via Prometheus metrics
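The handover logic above can be illustrated with a toy model of lease-based election (plain Python, not the controller's actual implementation; names and timings are illustrative):

```python
class Lease:
    """In-memory stand-in for a Kubernetes Lease object."""
    def __init__(self):
        self.holder = None      # holderIdentity
        self.renew_time = 0.0   # renewTime, in seconds

def try_acquire(lease, identity, now, lease_duration):
    """Acquire the lease if it is unheld or expired; renew it if this
    identity already holds it. Returns True when `identity` is the
    leader after the call."""
    expired = now - lease.renew_time > lease_duration
    if lease.holder in (None, identity) or expired:
        lease.holder = identity
        lease.renew_time = now
        return True
    return False

# pod-a leads, then stops renewing; pod-b may only take over
# after lease_duration has elapsed.
lease = Lease()
assert try_acquire(lease, "pod-a", now=0, lease_duration=60)       # becomes leader
assert not try_acquire(lease, "pod-b", now=30, lease_duration=60)  # lease still fresh
assert try_acquire(lease, "pod-b", now=61, lease_duration=60)      # expired: failover
```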
Configuration¶
Enable Leader Election¶
Leader election is enabled by default when deploying with 2+ replicas via Helm:
```yaml
# values.yaml (defaults)
replicaCount: 2  # Run 2 replicas for HA

controller:
  config:
    controller:
      leader_election:
        enabled: true
        lease_name: haptic-leader
        lease_duration: 60s  # Failover happens within this time
        renew_deadline: 15s  # Leader tries to renew for this long
        retry_period: 5s     # Interval between renewal attempts
```
Disable Leader Election¶
For development or single-replica deployments:
```yaml
# values.yaml
replicaCount: 1

controller:
  config:
    controller:
      leader_election:
        enabled: false  # Disabled in single-replica mode
```
Timing Parameters¶
The timing parameters control failover speed and tolerance:
| Parameter | Default | Purpose | Recommendations |
|---|---|---|---|
| `lease_duration` | 60s | Max time followers wait before taking over | Increase for flaky networks (120s) |
| `renew_deadline` | 15s | How long the leader retries before giving up | Should be < `lease_duration` (1/4 ratio) |
| `retry_period` | 5s | Interval between leader renewal attempts | Should be < `renew_deadline` (1/3 ratio) |
Failover time calculation:
Worst-case failover = lease_duration + renew_deadline
Default failover = 60s + 15s = 75s (but typically 15-20s)
Clock skew tolerance:
Skew tolerance = lease_duration - renew_deadline
Default = 60s - 15s = 45s (handles up to 4x clock differences)
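As a quick check of the two formulas with the default values (plain Python; durations in seconds):

```python
lease_duration = 60
renew_deadline = 15

worst_case_failover = lease_duration + renew_deadline  # follower wait + leader retries
skew_tolerance = lease_duration - renew_deadline       # headroom for clock differences

print(worst_case_failover)  # 75
print(skew_tolerance)       # 45
```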
Deployment¶
Standard HA Deployment¶
Deploy with 2-3 replicas (default Helm configuration):
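For example, with Helm (a sketch; the repository alias, chart, and release names are assumptions — substitute those from your installation):

```bash
# replicaCount defaults to 2, so no override is needed for HA
helm upgrade --install haptic <chart-repo>/haptic \
  --namespace <namespace> --create-namespace
```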
Scaling¶
Scale the deployment dynamically:
```bash
# Scale to 3 replicas
kubectl scale deployment haptic-controller --replicas=3

# Scale back to 2
kubectl scale deployment haptic-controller --replicas=2
```
RBAC Requirements¶
For leader election, the controller additionally requires get, create, and update permissions on Lease resources in the `coordination.k8s.io` API group.
These are automatically configured in the Helm chart's ClusterRole.
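For reference, the Lease rules look roughly like this (a sketch based on the `kubectl auth can-i` checks later in this guide; the chart's actual rule names and scoping may differ):

```yaml
rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "create", "update"]
```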
Monitoring Leadership¶
Check Current Leader¶
```bash
# View Lease resource
kubectl get lease -n <namespace> haptic-leader -o yaml

# Output shows current leader:
#   spec:
#     holderIdentity: haptic-7d9f8b4c6d-abc12
```
View Leadership Status in Logs¶
```bash
# Leader logs show:
kubectl logs -n <namespace> deployment/haptic-controller | grep -E "leader|election"

# Example output:
#   level=INFO msg="Leader election started" identity=pod-abc12 lease=haptic-leader
#   level=INFO msg="Became leader: pod-abc12" identity=pod-abc12
```
Prometheus Metrics¶
Monitor leader election via metrics endpoint:
```bash
kubectl port-forward -n <namespace> deployment/haptic-controller 9090:9090
curl http://localhost:9090/metrics | grep leader_election
```
Key metrics:
```promql
# Current leader (should be 1 across all replicas)
sum(haptic_leader_election_is_leader)

# Identify which pod is leader
haptic_leader_election_is_leader{pod=~".*"} == 1

# Leadership transition rate (should be low)
rate(haptic_leader_election_transitions_total[1h])
```
Troubleshooting¶
No Leader Elected¶
Symptoms:
- No deployments happening
- All replicas show `is_leader=0`
- Logs show constant election failures
Common causes:
- Missing RBAC permissions:
```bash
kubectl auth can-i get leases --as=system:serviceaccount:<namespace>:haptic
kubectl auth can-i create leases --as=system:serviceaccount:<namespace>:haptic
kubectl auth can-i update leases --as=system:serviceaccount:<namespace>:haptic
```
- Missing environment variables:
```bash
kubectl get pod <pod-name> -o yaml | grep -A2 "POD_NAME\|POD_NAMESPACE"

# Should show:
#   - name: POD_NAME
#     valueFrom:
#       fieldRef:
#         fieldPath: metadata.name
```
- API server connectivity:
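To check the last cause, one approach (generic kubectl commands, not controller-specific tooling) is to confirm the API server is healthy and look for connection errors in the controller logs:

```bash
# API server health
kubectl get --raw='/readyz?verbose'

# Connection errors / timeouts in the controller logs
kubectl logs -n <namespace> deployment/haptic-controller | grep -iE "timeout|connection refused"
```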
Multiple Leaders (Split-Brain)¶
Symptoms:
- `sum(haptic_leader_election_is_leader) > 1`
- Multiple pods deploying configs simultaneously
- Conflicting deployments in HAProxy
This should never happen with proper Kubernetes Lease implementation. If it does:
- Check for severe clock skew between nodes:
- Verify Kubernetes API server health:
- Restart all controller pods:
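The three steps can be sketched as follows (illustrative commands; adjust names and namespaces to your cluster):

```bash
# 1. Check for severe clock skew: compare UTC time reported on each node
kubectl debug node/<node-name> -it --image=busybox -- date -u

# 2. Verify Kubernetes API server health
kubectl get --raw='/readyz?verbose'

# 3. Restart all controller pods
kubectl rollout restart deployment/haptic-controller -n <namespace>
```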
Frequent Leadership Changes¶
Symptoms:
- `rate(haptic_leader_election_transitions_total[1h]) > 5`
- Logs show frequent "Lost leadership" / "Became leader" messages
- Deployments failing intermittently
Common causes:
- Resource contention: the leader pod can't renew the lease in time. Solution: increase CPU/memory limits.
- Network issues: API server communication is delayed. Solution: increase `lease_duration` and `renew_deadline`.
- Node issues: the leader pod's node is experiencing problems. Solution: drain and investigate the node.
Leader Not Deploying¶
Symptoms:
- One replica shows `is_leader=1`
- No deployment errors in logs
- HAProxy configs not updating
Diagnosis:
```bash
# Check leader logs for deployment activity
kubectl logs <leader-pod> | grep -i "deploy"

# Verify leader-only components started
kubectl logs <leader-pod> | grep "Started.*Deployer\|DeploymentScheduler"
```
Common causes:
- Deployment components failed to start (check logs for errors)
- Rate limiting preventing deployment (check drift prevention interval)
- HAProxy instances unreachable (check network connectivity)
Best Practices¶
Replica Count¶
Development:
- 1 replica with `leader_election.enabled: false`
Staging:
- 2 replicas with leader election enabled
Production:
- 2-3 replicas across multiple availability zones
- Enable PodDisruptionBudget:
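A PodDisruptionBudget sketch that keeps at least one replica available during voluntary disruptions (the name and labels are assumptions matching the chart's label conventions):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: haptic-controller
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: haptic
```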
Resource Allocation¶
Allocate sufficient resources for hot standby:
```yaml
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m     # Allow bursts during leader work
    memory: 512Mi
```
All replicas perform the same work (watching, rendering, validating), so resource usage is similar.
Anti-Affinity¶
Distribute replicas across nodes for better availability:
```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: haptic
          topologyKey: kubernetes.io/hostname
```
Monitoring and Alerts¶
Set up Prometheus alerts for leader election health:
```yaml
groups:
  - name: haproxy-ic-leader-election
    rules:
      # No leader
      - alert: NoLeaderElected
        expr: sum(haptic_leader_election_is_leader) < 1
        for: 1m
        annotations:
          summary: "No HAProxy controller leader elected"

      # Multiple leaders (split-brain)
      - alert: MultipleLeaders
        expr: sum(haptic_leader_election_is_leader) > 1
        annotations:
          summary: "Multiple HAProxy controller leaders detected (split-brain)"

      # Frequent transitions
      - alert: FrequentLeadershipChanges
        expr: rate(haptic_leader_election_transitions_total[1h]) > 5
        for: 15m
        annotations:
          summary: "HAProxy controller experiencing frequent leadership changes"
```
Migration from Single-Replica¶
To migrate an existing single-replica deployment to HA:
1. Verify RBAC permissions (the Helm chart updates this automatically)
2. Update values.yaml:
3. Upgrade with Helm:
4. Verify leadership:
5. Confirm one leader:

```bash
kubectl get pods -l app.kubernetes.io/name=haptic,app.kubernetes.io/component=controller \
  -o custom-columns=NAME:.metadata.name,LEADER:.status.podIP

# Check metrics to identify leader
for pod in $(kubectl get pods -l app.kubernetes.io/name=haptic,app.kubernetes.io/component=controller -o name); do
  echo "$pod:"
  kubectl exec $pod -- wget -qO- localhost:9090/metrics | grep is_leader
done
```
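Steps 2 and 3 above can be sketched as follows (a values.yaml fragment matching the defaults shown earlier; release and chart names are assumptions):

```yaml
# values.yaml
replicaCount: 2
controller:
  config:
    controller:
      leader_election:
        enabled: true
```

followed by `helm upgrade <release> <chart> -f values.yaml`.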
See Also¶
- Leader Election Design - Architecture and implementation details
- Monitoring Guide - Prometheus metrics and alerting
- Debugging Guide - Runtime introspection and troubleshooting
- Security Guide - RBAC and security best practices
- Performance Guide - Resource sizing and optimization
- Troubleshooting Guide - General troubleshooting