Leader Election for High Availability¶
Overview¶
This document describes the leader election system for the HAProxy Template Ingress Controller, which enables running multiple controller replicas for high availability while preventing conflicting updates to HAProxy instances.
Problem Statement¶
The controller currently runs as a single instance. Running multiple replicas without coordination would cause:
- Resource waste: Multiple replicas performing identical dataplane API calls
- Potential conflicts: Race conditions when multiple controllers push updates simultaneously
- Unnecessary HAProxy reloads: Multiple deployments of the same configuration
However, all replicas should:
- Watch Kubernetes resources (to maintain hot cache for failover)
- Render templates (to have configurations ready)
- Validate configurations (to share the workload)
- Handle webhook requests (for high availability)
Only deployment operations (pushing configurations to HAProxy Dataplane API) need exclusivity.
State-of-the-Art Solution¶
Use k8s.io/client-go/tools/leaderelection with Lease-based resource locks, the industry standard for Kubernetes operator high availability.
Why Lease-based Locks?¶
- Lower overhead: Leases create less watch traffic than ConfigMaps or Endpoints
- Purpose-built: Designed specifically for leader election
- Reliable: Used by core Kubernetes components (kube-controller-manager, kube-scheduler)
- Clock skew tolerant: Configurable tolerance for node clock differences
Recommended Configuration¶
leaderelection.LeaderElectionConfig{
    LeaseDuration:   60 * time.Second, // Lease validity; followers wait this long before taking over
    RenewDeadline:   15 * time.Second, // Leader gives up if it cannot renew within this window
    RetryPeriod:     5 * time.Second,  // Interval between renewal/acquisition attempts
    ReleaseOnCancel: true,             // Release the lease on graceful shutdown
}
Tolerance formula: LeaseDuration / RenewDeadline = clock skew tolerance ratio
With 60s/15s settings, the system tolerates nodes progressing 4x faster than others.
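To make the wiring concrete, here is a minimal sketch of how this configuration plugs into client-go, assuming pod identity and namespace come from the POD_NAME and POD_NAMESPACE environment variables and the lease is named haptic-leader; the callback bodies are placeholders:
import (
    "context"
    "os"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

func runLeaderElection(ctx context.Context) error {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        return err
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    // Lease lock in the controller's own namespace, keyed by pod name.
    lock := &resourcelock.LeaseLock{
        LeaseMeta: metav1.ObjectMeta{
            Name:      "haptic-leader",
            Namespace: os.Getenv("POD_NAMESPACE"),
        },
        Client: client.CoordinationV1(),
        LockConfig: resourcelock.ResourceLockConfig{
            Identity: os.Getenv("POD_NAME"),
        },
    }

    // Blocks until ctx is cancelled; leadership is acquired, renewed and
    // (with ReleaseOnCancel) released inside this call.
    leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
        Lock:            lock,
        LeaseDuration:   60 * time.Second,
        RenewDeadline:   15 * time.Second,
        RetryPeriod:     5 * time.Second,
        ReleaseOnCancel: true,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: func(ctx context.Context) { /* start leader-only components */ },
            OnStoppedLeading: func() { /* stop leader-only components */ },
            OnNewLeader:      func(identity string) { /* log / publish event */ },
        },
    })
    return nil
}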
Architecture Changes¶
Component Classification¶
All replicas run (read-only or validation operations):
- ConfigWatcher - Monitors ConfigMap changes
- CredentialsLoader - Monitors Secret changes
- ResourceWatcher - Watches Kubernetes resources (Ingress, Service, etc.)
- Reconciler - Debounces changes and triggers reconciliation
- Renderer - Generates HAProxy configurations from templates
- HAProxyValidator - Validates generated configurations
- Executor - Orchestrates reconciliation workflow
- Discovery - Discovers HAProxy pod endpoints
- ConfigValidators - Validates controller configuration
- WebhookValidators - Validates admission webhook requests
- Commentator - Logs events for observability
- Metrics - Records Prometheus metrics
- StateCache - Maintains debug state
Leader-only components (write operations to dataplane API):
- Deployer - Deploys configurations to HAProxy instances
- DeploymentScheduler - Rate-limits and queues deployments
- DriftMonitor - Monitors and corrects configuration drift
New Component: LeaderElector¶
Package: pkg/controller/leaderelection/
Responsibilities:
- Create and manage Lease lock in controller namespace
- Use pod name as unique identity (via POD_NAME env var)
- Publish leader election events to EventBus
- Provide IsLeader() method for status queries
- Handle graceful leadership release on shutdown
Event integration:
type LeaderElector struct {
    eventBus *events.EventBus
    elector  *leaderelection.LeaderElector
    isLeader atomic.Bool
}

// Callbacks publish events
OnStartedLeading: func(ctx context.Context) {
    e.isLeader.Store(true)
    e.eventBus.Publish(events.NewBecameLeaderEvent())
},
OnStoppedLeading: func() {
    e.isLeader.Store(false)
    e.eventBus.Publish(events.NewLostLeadershipEvent())
},
OnNewLeader: func(identity string) {
    e.eventBus.Publish(events.NewNewLeaderObservedEvent(identity))
},
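A minimal sketch of the remaining responsibilities, assuming the struct above; the constructor name, the callbacks() helper, and Run are illustrative rather than existing APIs, and leaderelection here refers to the client-go package:
// NewLeaderElector builds a client-go elector with the callbacks shown above.
func NewLeaderElector(bus *events.EventBus, cfg leaderelection.LeaderElectionConfig) (*LeaderElector, error) {
    e := &LeaderElector{eventBus: bus}
    cfg.Callbacks = e.callbacks() // wires OnStartedLeading / OnStoppedLeading / OnNewLeader from above

    elector, err := leaderelection.NewLeaderElector(cfg)
    if err != nil {
        return nil, err
    }
    e.elector = elector
    return e, nil
}

// IsLeader reports whether this replica currently holds the lease.
func (e *LeaderElector) IsLeader() bool {
    return e.isLeader.Load()
}

// Run blocks until ctx is cancelled. With ReleaseOnCancel set in the config,
// cancelling the context releases the lease so another replica can take over
// without waiting for it to expire.
func (e *LeaderElector) Run(ctx context.Context) {
    e.elector.Run(ctx)
}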
New Events¶
Leader election events (pkg/controller/events/types.go):
// LeaderElectionStartedEvent is published when leader election begins
type LeaderElectionStartedEvent struct {
    Identity       string
    LeaseName      string
    LeaseNamespace string
}

// BecameLeaderEvent is published when this replica becomes leader
type BecameLeaderEvent struct {
    Identity  string
    Timestamp time.Time
}

// LostLeadershipEvent is published when this replica loses leadership
type LostLeadershipEvent struct {
    Identity  string
    Timestamp time.Time
    Reason    string // graceful_shutdown, lease_expired, etc.
}

// NewLeaderObservedEvent is published when a new leader is observed
type NewLeaderObservedEvent struct {
    NewLeaderIdentity string
    PreviousLeader    string
    Timestamp         time.Time
}
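For illustration, constructor helpers matching the callback usage above might look like the following sketch; the helpers and the POD_NAME-based identity are assumptions, and the remaining constructors would follow the same pattern:
// podIdentity is this replica's identity, matching the Lease holder identity.
var podIdentity = os.Getenv("POD_NAME")

// NewBecameLeaderEvent stamps the event with this pod's identity and the
// current time, so subscribers need no extra context.
func NewBecameLeaderEvent() BecameLeaderEvent {
    return BecameLeaderEvent{Identity: podIdentity, Timestamp: time.Now()}
}

// NewNewLeaderObservedEvent records which replica now holds the lease.
func NewNewLeaderObservedEvent(identity string) NewLeaderObservedEvent {
    return NewLeaderObservedEvent{NewLeaderIdentity: identity, Timestamp: time.Now()}
}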
These events enable:
- Observability: Commentator logs all transitions
- Metrics: Track leadership duration, transition count
- Debugging: Understand which replica is active
Controller Startup Changes¶
Modified startup sequence (pkg/controller/controller.go):
Stage 0: Leader Election Initialization (NEW)
- Read POD_NAME from environment
- Create LeaderElector with Lease lock
- Start leader election loop in background goroutine
- Continue startup (don't block on becoming leader); a sketch of this stage follows the list
Stage 1: Config Management Components
- ConfigWatcher (all replicas)
- ConfigValidator (all replicas)
- EventBus.Start()
Stage 2: Wait for Valid Config
- All replicas block here
Stage 3: Resource Watchers
- Create ResourceWatcher (all replicas)
- Start IndexSynchronizationTracker (all replicas)
Stage 4: Wait for Index Sync
- All replicas block here
Stage 5: Reconciliation Components
- Reconciler (all replicas)
- Renderer (all replicas)
- HAProxyValidator (all replicas)
- Executor (all replicas)
- Discovery (all replicas)
- Deployer (LEADER ONLY - NEW)
- DeploymentScheduler (LEADER ONLY - NEW)
- DriftMonitor (LEADER ONLY - NEW)
Stage 6: Webhook Validation
- Webhook component (all replicas)
- DryRunValidator (all replicas)
Stage 7: Debug Infrastructure
- Debug server (all replicas)
- Metrics server (all replicas)
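A minimal sketch of Stage 0, assuming the constructor sketched earlier (leaderelection here is the controller's own pkg/controller/leaderelection package) and an electionConfigFor helper that builds the client-go config from the controller configuration; both names are illustrative, and bus, cfg, and ctx come from the surrounding startup code:
// Stage 0: leader election initialization.
identity := os.Getenv("POD_NAME")
namespace := os.Getenv("POD_NAMESPACE")
if identity == "" || namespace == "" {
    return fmt.Errorf("POD_NAME and POD_NAMESPACE must be set for leader election")
}

elector, err := leaderelection.NewLeaderElector(bus, electionConfigFor(identity, namespace, cfg))
if err != nil {
    return fmt.Errorf("creating leader elector: %w", err)
}

// Run the election loop in the background; startup continues immediately so
// this replica watches, renders and validates even while it is a follower.
go elector.Run(ctx)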
Conditional Component Startup¶
Implementation pattern:
// Track leader-only components and the context that controls their lifecycle
var leaderComponents struct {
    sync.Mutex
    deployer            *deployer.Component
    deploymentScheduler *deployer.DeploymentScheduler
    driftMonitor        *deployer.DriftPreventionMonitor
    cancel              context.CancelFunc
}

// Leadership callbacks
OnStartedLeading: func(ctx context.Context) {
    logger.Info("Became leader, starting deployment components")
    leaderComponents.Lock()
    defer leaderComponents.Unlock()

    // Create a fresh context for leader-only components so that losing
    // leadership cancels only these components, not the whole controller
    leaderCtx, leaderCancel := context.WithCancel(iterCtx)
    leaderComponents.cancel = leaderCancel

    // Create and start leader-only components
    leaderComponents.deployer = deployer.New(bus, logger)
    leaderComponents.deploymentScheduler = deployer.NewDeploymentScheduler(bus, logger, minInterval)
    leaderComponents.driftMonitor = deployer.NewDriftPreventionMonitor(bus, logger, driftInterval)

    go leaderComponents.deployer.Start(leaderCtx)
    go leaderComponents.deploymentScheduler.Start(leaderCtx)
    go leaderComponents.driftMonitor.Start(leaderCtx)
},
OnStoppedLeading: func() {
    logger.Warn("Lost leadership, stopping deployment components")
    leaderComponents.Lock()
    defer leaderComponents.Unlock()
    if leaderComponents.cancel != nil {
        leaderComponents.cancel()
        leaderComponents.cancel = nil
    }
},
Graceful transition:
- Old leader loses lease → stops deployment components
- Brief pause while the lease expires (skipped on graceful shutdown, since ReleaseOnCancel releases the lease immediately)
- New leader acquires lease → starts deployment components
- New leader has hot cache and rendered config → immediate reconciliation
Configuration¶
New configuration section (pkg/core/config/config.go):
controller:
  # ... existing fields ...
  leaderElection:
    enabled: true              # Enable leader election (default: true)
    leaseName: "haptic-leader"
    leaseDuration: 60s
    renewDeadline: 15s
    retryPeriod: 5s
Backwards compatibility:
- enabled: false → run without leader election (single-replica mode)
- Existing single-replica deployments work unchanged
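A sketch of the corresponding Go structure; field names and yaml tags are assumptions, and depending on the YAML library the duration fields may need a custom unmarshaller for values such as "60s":
// LeaderElectionConfig mirrors the controller.leaderElection YAML section.
type LeaderElectionConfig struct {
    Enabled       bool          `yaml:"enabled"`
    LeaseName     string        `yaml:"leaseName"`
    LeaseDuration time.Duration `yaml:"leaseDuration"`
    RenewDeadline time.Duration `yaml:"renewDeadline"`
    RetryPeriod   time.Duration `yaml:"retryPeriod"`
}

// Defaults applied when the section is omitted.
func defaultLeaderElection() LeaderElectionConfig {
    return LeaderElectionConfig{
        Enabled:       true,
        LeaseName:     "haptic-leader",
        LeaseDuration: 60 * time.Second,
        RenewDeadline: 15 * time.Second,
        RetryPeriod:   5 * time.Second,
    }
}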
RBAC Requirements¶
New permissions (charts/haptic/templates/rbac.yaml):
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: haptic
rules:
  # ... existing rules ...
  # Leader election
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "create", "update"]
The controller creates a Lease in its own namespace (not cluster-wide).
Deployment Changes¶
Environment variables (charts/haptic/templates/deployment.yaml):
spec:
  template:
    spec:
      containers:
        - name: controller
          env:
            # ... existing env vars ...
            # Pod identity for leader election
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
Multiple replicas:
Set the Deployment's replicas field to 2 or more; leader election coordinates which replica performs deployments.
Resource adjustments:
No changes needed - non-leader replicas consume roughly the same resources as the leader, since they perform all of the read-only work (watching, rendering, validating).
Observability¶
Metrics¶
New Prometheus metrics (pkg/controller/metrics/metrics.go):
// controller_leader_transitions_total
// Counter of leadership changes (acquire + lose)
controller_leader_transitions_total counter
// controller_is_leader
// Gauge indicating current leadership status (1=leader, 0=follower)
controller_is_leader{pod="<pod-name>"} gauge
// controller_leader_election_duration_seconds
// Histogram of time to acquire leadership after startup
controller_leader_election_duration_seconds histogram
// controller_time_as_leader_seconds
// Counter of cumulative seconds spent as leader
controller_time_as_leader_seconds counter
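A sketch of how these could be registered with the Prometheus Go client; variable names and the promauto usage are assumptions:
import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    leaderTransitions = promauto.NewCounter(prometheus.CounterOpts{
        Name: "controller_leader_transitions_total",
        Help: "Number of leadership changes observed by this replica.",
    })

    isLeader = promauto.NewGaugeVec(prometheus.GaugeOpts{
        Name: "controller_is_leader",
        Help: "1 if this replica currently holds the leader lease, 0 otherwise.",
    }, []string{"pod"})

    electionDuration = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "controller_leader_election_duration_seconds",
        Help:    "Time from startup until leadership was first acquired.",
        Buckets: prometheus.DefBuckets,
    })

    timeAsLeader = promauto.NewCounter(prometheus.CounterOpts{
        Name: "controller_time_as_leader_seconds",
        Help: "Cumulative seconds this replica has spent as leader.",
    })
)

// Event handlers would then call, for example:
//   isLeader.WithLabelValues(podName).Set(1)   // on BecameLeaderEvent
//   isLeader.WithLabelValues(podName).Set(0)   // on LostLeadershipEvent
//   leaderTransitions.Inc()                    // on either transition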
Usage:
- Alert on frequent transitions (indicates instability)
- Dashboard showing current leader identity
- Track leadership duration distribution
Logging¶
Commentator enhancements (pkg/controller/commentator/commentator.go):
case LeaderElectionStartedEvent:
    c.logger.Info("leader election started",
        "identity", e.Identity,
        "lease", e.LeaseName,
        "namespace", e.LeaseNamespace)

case BecameLeaderEvent:
    c.logger.Info("became leader",
        "identity", e.Identity)

case LostLeadershipEvent:
    c.logger.Warn("lost leadership",
        "identity", e.Identity,
        "reason", e.Reason)

case NewLeaderObservedEvent:
    c.logger.Info("new leader observed",
        "new_leader", e.NewLeaderIdentity,
        "previous_leader", e.PreviousLeader)
Debug Endpoints¶
Lease status (via debug server):
GET /debug/vars
{
  "leader_election": {
    "enabled": true,
    "is_leader": true,
    "identity": "haptic-7f8d9c5b-abc123",
    "lease_name": "haptic-leader",
    "lease_holder": "haptic-7f8d9c5b-abc123",
    "time_as_leader": "45m32s",
    "transitions": 2
  }
}
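A sketch of how the debug server could populate this block; the handler, struct, and wiring are assumptions (imports: encoding/json, net/http, os):
// leaderElectionStatus is the shape of the "leader_election" debug block.
type leaderElectionStatus struct {
    Enabled      bool   `json:"enabled"`
    IsLeader     bool   `json:"is_leader"`
    Identity     string `json:"identity"`
    LeaseName    string `json:"lease_name"`
    LeaseHolder  string `json:"lease_holder"`
    TimeAsLeader string `json:"time_as_leader"`
    Transitions  int    `json:"transitions"`
}

// debugVarsHandler serializes the current leader election status.
func debugVarsHandler(elector *LeaderElector) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        status := map[string]any{
            "leader_election": leaderElectionStatus{
                Enabled:  true,
                IsLeader: elector.IsLeader(),
                Identity: os.Getenv("POD_NAME"),
                // lease_holder, time_as_leader and transitions would be
                // tracked by the elector; omitted here for brevity
            },
        }
        w.Header().Set("Content-Type", "application/json")
        _ = json.NewEncoder(w).Encode(status)
    }
}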
Testing Strategy¶
Unit Tests¶
LeaderElector tests (pkg/controller/leaderelection/elector_test.go):
// Test leader election configuration
func TestLeaderElector_Config(t *testing.T)
// Test event publishing on leadership changes
func TestLeaderElector_EventPublishing(t *testing.T)
// Test IsLeader() method accuracy
func TestLeaderElector_IsLeaderStatus(t *testing.T)
// Test graceful shutdown
func TestLeaderElector_GracefulShutdown(t *testing.T)
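As an in-package example, the IsLeader() test could simulate the leadership callbacks directly; this is a sketch, and a fuller test would drive a fake or envtest-backed elector:
func TestLeaderElector_IsLeaderStatus(t *testing.T) {
    e := &LeaderElector{} // zero value: starts as follower

    if e.IsLeader() {
        t.Fatal("expected new elector to start as follower")
    }

    // Simulate the client-go callbacks firing.
    e.isLeader.Store(true)
    if !e.IsLeader() {
        t.Fatal("expected IsLeader() to report true after OnStartedLeading")
    }

    e.isLeader.Store(false)
    if e.IsLeader() {
        t.Fatal("expected IsLeader() to report false after OnStoppedLeading")
    }
}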
Integration Tests¶
Multi-replica tests (tests/integration/leader_election_test.go):
// Deploy 2 replicas, verify only one deploys configs
func TestLeaderElection_OnlyLeaderDeploys(t *testing.T)
// Kill leader pod, verify follower takes over
func TestLeaderElection_Failover(t *testing.T)
// Verify both replicas watch resources
func TestLeaderElection_BothReplicasWatchResources(t *testing.T)
// Verify both replicas render configs
func TestLeaderElection_BothReplicasRenderConfigs(t *testing.T)
Test setup:
- Use kind cluster with multi-node setup
- Deploy controller with 3 replicas
- Create test Ingress resources
- Verify deployment behavior
- Simulate pod failures
Manual Testing¶
Verification steps:
# Deploy with 3 replicas
kubectl scale deployment haptic-controller --replicas=3
# Check lease status
kubectl get lease -n haproxy-system haptic-leader -o yaml
# Verify leader via metrics
kubectl port-forward deployment/haptic-controller 9090:9090
curl http://localhost:9090/metrics | grep controller_is_leader
# Check logs for leadership events
kubectl logs -l app=haptic --tail=100 | grep -i leader
# Simulate failover
kubectl delete pod <leader-pod>
# Verify new leader takes over
watch kubectl get lease -n haproxy-system haptic-leader
# Check HAProxy configs only deployed once per change
kubectl logs -l app=haptic | grep "deployment completed"
Failure Scenarios¶
Leader Pod Crashes¶
Behavior:
- Leader stops renewing; followers treat the lease as expired once LeaseDuration (60s) has passed since the last observed renewal
- Followers detect expired lease
- First follower to update lease becomes new leader
- New leader starts deployment components
- Reconciliation continues from hot cache
Downtime: up to ~60-65 seconds (LeaseDuration + component startup time)
Network Partition¶
Scenario: Leader pod loses connectivity to Kubernetes API
Behavior:
- Leader cannot renew lease
- After RenewDeadline (15s) of failed renewals, the leader stops acting as leader
- Leader stops deployment components
- A connected replica acquires the lease once it expires (after LeaseDuration)
- System continues with new leader
Protection: Split-brain prevented by Kubernetes API acting as coordination point
Clock Skew¶
Scenario: Nodes have different clock speeds
Tolerance: Configured ratio of LeaseDuration/RenewDeadline
- With 60s/15s: Tolerates 4x clock speed difference
- If exceeded: May experience frequent leadership changes
Mitigation: Run NTP on cluster nodes (Kubernetes best practice)
All Replicas Down¶
Behavior:
- Lease expires
- No deployments occur (expected behavior)
- HAProxy continues serving with last known configuration
- When replica starts, acquires lease and reconciles
Impact: No new configuration updates until controller recovers
Migration Path¶
Phase 1: Code Implementation¶
- Implement LeaderElector package
- Add leader election events
- Modify controller startup for conditional components
- Add configuration options
- Update RBAC manifests
Phase 2: Testing¶
- Unit tests for LeaderElector
- Integration tests with multi-replica setup
- Chaos testing (kill leaders, network partitions)
- Performance testing (ensure no regression)
Phase 3: Documentation¶
- Update deployment guide for HA setup
- Document troubleshooting procedures
- Update architecture diagrams
- Create runbooks for common scenarios
Phase 4: Rollout¶
- Release with enabled: false default
- Document opt-in HA setup
- Collect feedback from early adopters
- After validation, change default to enabled: true
Alternatives Considered¶
Single Active Replica with Pod Disruption Budget¶
Rejected: Doesn't provide HA, just prevents voluntary disruptions
Active-Active with Distributed Locking per HAProxy Instance¶
Rejected: More complex, potential deadlocks, not idiomatic for Kubernetes
External Coordination (etcd, Consul)¶
Rejected: Adds operational complexity, Kubernetes API sufficient
Config Generation Only (No Deployment)¶
Rejected: Requires external system to deploy, doesn't solve core problem