Skip to content

Monitoring Guide

This guide explains how to monitor HAPTIC (HAProxy Template Ingress Controller) using Prometheus metrics, including setup, key metrics, alerting, and dashboards.

Overview

The controller exposes Prometheus metrics via an HTTP endpoint, providing visibility into reconciliation performance, deployment status, resource counts, and leader election state.

Key monitoring areas:

  • Reconciliation cycle performance and errors
  • HAProxy deployment latency and success rates
  • Configuration validation status
  • Kubernetes resource counts
  • Leader election for HA deployments

Enabling Metrics

Metrics are enabled by default. Configure the metrics port in Helm values:

# values.yaml
controller:
  config:
    controller:
      metrics_port: 9090  # Default

Set to 0 to disable metrics.

Accessing Metrics

Prometheus Scrape Configuration

Add a scrape config for the controller:

scrape_configs:
  - job_name: 'haptic''
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: haptic
        action: keep
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "9090"
        action: keep

ServiceMonitor (Prometheus Operator)

If using Prometheus Operator, enable the ServiceMonitor in Helm:

# values.yaml
serviceMonitor:
  enabled: true
  interval: 30s
  labels:
    release: prometheus  # Match your Prometheus selector

Manual Access

# Port-forward to metrics endpoint
kubectl port-forward -n <namespace> deployment/haptic-controller 9090:9090

# Fetch metrics
curl http://localhost:9090/metrics

Metrics Reference

Reconciliation Metrics

Metric Type Description
haptic_reconciliation_total Counter Total reconciliation cycles triggered
haptic_reconciliation_duration_seconds Histogram Time spent in reconciliation cycles
haptic_reconciliation_errors_total Counter Failed reconciliation cycles

Key queries:

# Reconciliation rate per second
rate(haptic_reconciliation_total[5m])

# Average reconciliation duration
rate(haptic_reconciliation_duration_seconds_sum[5m]) /
rate(haptic_reconciliation_duration_seconds_count[5m])

# Success rate percentage
100 * (1 - (
  rate(haptic_reconciliation_errors_total[5m]) /
  rate(haptic_reconciliation_total[5m])
))

Deployment Metrics

Metric Type Description
haptic_deployment_total Counter Total deployment attempts
haptic_deployment_duration_seconds Histogram Time spent deploying to HAProxy
haptic_deployment_errors_total Counter Failed deployments

Key queries:

# Deployment rate
rate(haptic_deployment_total[5m])

# 95th percentile deployment latency
histogram_quantile(0.95, rate(haptic_deployment_duration_seconds_bucket[5m]))

# Deployment success rate
100 * (1 - (
  rate(haptic_deployment_errors_total[5m]) /
  rate(haptic_deployment_total[5m])
))

Validation Metrics

Metric Type Description
haptic_validation_total Counter Total validation attempts
haptic_validation_errors_total Counter Failed validations

Key queries:

# Validation rate
rate(haptic_validation_total[5m])

# Validation success rate
100 * (1 - (
  rate(haptic_validation_errors_total[5m]) /
  rate(haptic_validation_total[5m])
))

Resource Metrics

Metric Type Labels Description
haptic_resource_count Gauge type Current count of watched resources

Key queries:

# All resource counts
haptic_resource_count

# Specific resource types
haptic_resource_count{type="ingresses"}
haptic_resource_count{type="services"}
haptic_resource_count{type="haproxy-pods"}

# Resource count changes
delta(haptic_resource_count[1h])

Event Metrics

Metric Type Description
haptic_event_subscribers Gauge Active event subscribers
haptic_events_published_total Counter Total events published

Key queries:

# Event publishing rate
rate(haptic_events_published_total[5m])

# Subscriber count (should be constant)
haptic_event_subscribers

# Subscriber changes (indicates component restarts)
delta(haptic_event_subscribers[5m])

Leader Election Metrics

Metric Type Description
haptic_leader_election_is_leader Gauge 1 if this replica is leader, 0 otherwise
haptic_leader_election_transitions_total Counter Leadership transitions (gain/loss)
haptic_leader_election_time_as_leader_seconds_total Counter Cumulative time as leader

Key queries:

# Current leader count (should be exactly 1)
sum(haptic_leader_election_is_leader)

# Identify leader pod
haptic_leader_election_is_leader == 1

# Leadership transition rate
rate(haptic_leader_election_transitions_total[1h])

# Average time as leader per transition
haptic_leader_election_time_as_leader_seconds_total /
haptic_leader_election_transitions_total

Alerting Rules

groups:
  - name: haptic
    rules:
      # Reconciliation failures
      - alert: HAProxyICHighReconciliationErrorRate
        expr: |
          rate(haptic_reconciliation_errors_total[5m]) /
          rate(haptic_reconciliation_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High reconciliation error rate (>10%)"
          description: "Controller is failing to reconcile configurations"

      # Deployment latency
      - alert: HAProxyICHighDeploymentLatency
        expr: |
          histogram_quantile(0.95,
            rate(haptic_deployment_duration_seconds_bucket[5m])
          ) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile deployment latency >5s"
          description: "Deploying configs to HAProxy is taking too long"

      # Validation failures
      - alert: HAProxyICValidationFailures
        expr: |
          rate(haptic_validation_errors_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Configuration validation failing"
          description: "HAProxy configuration has syntax or validation errors"

      # Component crash
      - alert: HAProxyICComponentStopped
        expr: |
          delta(haptic_event_subscribers[5m]) < 0
        labels:
          severity: critical
        annotations:
          summary: "Event subscriber count decreased"
          description: "A controller component may have crashed"

      # No leader elected (HA)
      - alert: HAProxyICNoLeader
        expr: sum(haptic_leader_election_is_leader) < 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No HAProxy controller leader elected"
          description: "No controller replica is elected as leader"

      # Multiple leaders (split-brain)
      - alert: HAProxyICMultipleLeaders
        expr: sum(haptic_leader_election_is_leader) > 1
        labels:
          severity: critical
        annotations:
          summary: "Multiple HAProxy controller leaders detected"
          description: "Split-brain condition - multiple replicas think they are leader"

      # Frequent leadership changes
      - alert: HAProxyICFrequentLeadershipChanges
        expr: rate(haptic_leader_election_transitions_total[1h]) > 5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Frequent leadership transitions"
          description: "Controller leadership changing too often, may indicate cluster instability"

      # No HAProxy pods discovered
      - alert: HAProxyICNoHAProxyPods
        expr: haptic_resource_count{type="haproxy-pods"} < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No HAProxy pods discovered"
          description: "Controller cannot find any HAProxy pods to manage"

Dashboard Examples

Grafana Dashboard Queries

Reconciliation Overview Panel:

# Success rate (stat panel)
100 * (1 - (
  rate(haptic_reconciliation_errors_total[5m]) /
  rate(haptic_reconciliation_total[5m])
))

# Rate over time (graph)
rate(haptic_reconciliation_total[5m])
rate(haptic_reconciliation_errors_total[5m])

Deployment Latency Panel:

# P50, P95, P99 latencies
histogram_quantile(0.50, rate(haptic_deployment_duration_seconds_bucket[5m]))
histogram_quantile(0.95, rate(haptic_deployment_duration_seconds_bucket[5m]))
histogram_quantile(0.99, rate(haptic_deployment_duration_seconds_bucket[5m]))

Resource Count Panel:

# All resource types
haptic_resource_count

# Stacked area chart by type
haptic_resource_count{type=~"ingresses|services|endpoints"}

Leader Election Panel:

# Current leader indicator
haptic_leader_election_is_leader == 1

# Transition count over time
increase(haptic_leader_election_transitions_total[1h])

Dashboard JSON Template

A basic Grafana dashboard template can be imported from:

{
  "title": "HAPTIC",
  "panels": [
    {
      "title": "Reconciliation Rate",
      "targets": [
        {"expr": "rate(haptic_reconciliation_total[5m])"}
      ]
    },
    {
      "title": "Reconciliation Success Rate",
      "targets": [
        {"expr": "100 * (1 - rate(haptic_reconciliation_errors_total[5m]) / rate(haptic_reconciliation_total[5m]))"}
      ]
    },
    {
      "title": "Deployment Latency",
      "targets": [
        {"expr": "histogram_quantile(0.95, rate(haptic_deployment_duration_seconds_bucket[5m]))"}
      ]
    },
    {
      "title": "Resource Counts",
      "targets": [
        {"expr": "haptic_resource_count"}
      ]
    },
    {
      "title": "Leader Status",
      "targets": [
        {"expr": "haptic_leader_election_is_leader"}
      ]
    }
  ]
}

Operational Insights

Key Health Indicators

Indicator Healthy Range Action if Unhealthy
Reconciliation success rate >99% Check logs for template/validation errors
Deployment success rate >99% Check HAProxy pod connectivity
P95 deployment latency <2s Check HAProxy DataPlane API performance
Leader count =1 Check HA configuration and network
Event subscribers Constant Restart controller if dropping

Capacity Planning

Monitor these metrics for capacity planning:

# Reconciliation frequency (how often config changes)
rate(haptic_reconciliation_total[1h]) * 3600

# Ingress growth rate
deriv(haptic_resource_count{type="ingresses"}[1d])

# Average reconciliation overhead
avg_over_time(haptic_reconciliation_duration_seconds_sum[1d]) /
avg_over_time(haptic_reconciliation_duration_seconds_count[1d])

Troubleshooting with Metrics

High reconciliation error rate:

  1. Check haptic_validation_errors_total - template/config issues
  2. Check haptic_deployment_errors_total - HAProxy connectivity issues
  3. Review controller logs for specific error messages

Missing metrics:

  1. Verify metrics port is enabled (controller.config.controller.metrics_port)
  2. Check ServiceMonitor selector matches Prometheus configuration
  3. Verify network policies allow scraping

Leader election issues:

  1. Check if sum(haptic_leader_election_is_leader) != 1
  2. Review rate(haptic_leader_election_transitions_total[1h]) for instability
  3. See High Availability Guide for troubleshooting

Integration with Existing Monitoring

Prometheus Operator

# values.yaml
serviceMonitor:
  enabled: true
  interval: 30s
  scrapeTimeout: 10s
  labels:
    release: prometheus
  namespaceSelector:
    matchNames:
      - haptic

Victoria Metrics

Use the same Prometheus scrape configuration - Victoria Metrics is compatible.

Datadog

Configure Datadog Agent to scrape Prometheus metrics:

# datadog-agent values
datadog:
  prometheusScrape:
    enabled: true
    serviceEndpoints: true

See Also