Observability — Prometheus, Grafana, and Datadog Integration

Kysira exposes Prometheus metrics from both the ext-proc and inference components. This document explains how those metrics are discovered by in-cluster agents (Grafana / Prometheus Operator, Datadog), what is already wired up, and what customers need to configure.


Architecture

Metrics are pulled by an in-cluster agent — Kysira services do not push. Each service exposes a standard Prometheus text-format /metrics endpoint. The customer's observability agent (Grafana Alloy, Prometheus, Datadog Agent) scrapes it on a configurable interval and forwards the data to their backend.

┌────────────────────────── k8s cluster ──────────────────────────┐
│                                                                 │
│  ┌───────────────────┐       ┌────────────────────────┐         │
│  │  kysira-ext-proc  │       │  kysira-inference      │         │
│  │  :9090/metrics    │       │  :8081/metrics         │         │
│  └─────────┬─────────┘       └────────────┬───────────┘         │
│            │ scrape                       │ scrape              │
│            └───────────────┬──────────────┘                     │
│                   ┌────────▼────────┐                           │
│                   │  In-cluster     │                           │
│                   │  agent          │                           │
│                   │  (DaemonSet /   │                           │
│                   │   Deployment)   │                           │
│                   └────────┬────────┘                           │
└────────────────────────────┼────────────────────────────────────┘
                             │ HTTPS + API key
                    ┌────────▼────────┐
                    │  Grafana Cloud  │  ← or self-hosted
                    │  or Datadog     │
                    └─────────────────┘

Metrics exposed

ext-proc (kysira_extproc_*)

| Metric | Type | Description |
| --- | --- | --- |
| kysira_extproc_requests_total | Counter | All requests inspected, labelled by action |
| kysira_extproc_request_duration_seconds | Histogram | Inference call latency, labelled by action |
| kysira_extproc_flagged_total | Counter | Requests whose score exceeded the kill threshold |
| kysira_extproc_killed_total | Counter | Requests blocked with 403 |
| kysira_extproc_inference_errors_total | Counter | Inference sidecar call failures (fails open) |
| kysira_extproc_active_streams | Gauge | Open ext_proc gRPC streams |

Port: 9090, path: /metrics (separate from gRPC port 50051).
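
As an illustration of how these series might be consumed, the following is a minimal sketch of a Prometheus rule file built only from the ext-proc metrics listed above. The group name, rule names, time windows, and alert threshold are assumptions chosen for the example, not values shipped with Kysira. The _bucket suffix is the standard series name Prometheus histograms expose.

```yaml
# Sketch only: names, windows, and thresholds are illustrative assumptions.
groups:
  - name: kysira-ext-proc-examples
    rules:
      # Per-second request rate over the last 5 minutes, split by the action label.
      - record: kysira:extproc_requests:rate5m
        expr: sum by (action) (rate(kysira_extproc_requests_total[5m]))

      # 95th-percentile inference call latency, computed from the histogram buckets.
      - record: kysira:extproc_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(kysira_extproc_request_duration_seconds_bucket[5m]))
          )

      # ext-proc fails open, so sustained inference errors mean traffic passes unscored.
      - alert: KysiraExtProcInferenceErrors
        expr: rate(kysira_extproc_inference_errors_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: ext-proc is failing to reach the inference sidecar
```

Because ext-proc fails open, an alert on kysira_extproc_inference_errors_total is likely the most useful of the three: without it, a broken sidecar would not show up as blocked traffic.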

inference (kysira_inference_*)

| Metric | Type | Description |
| --- | --- | --- |
| kysira_inference_requests_total | Counter | Total scoring requests, labelled by endpoint |
| kysira_inference_errors_total | Counter | Inference errors, labelled by endpoint and error_type |
| kysira_inference_duration_seconds | Histogram | Per-request scoring latency, labelled by endpoint |
| kysira_model_loaded | Gauge | Model load status: 1 = loaded, 0 = not loaded; labelled by model_type |

Port: 8081, path: /metrics (same port as the main API).
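
The inference metrics lend themselves to two simple health signals: an error ratio per endpoint and a check that the model is loaded. The rule file below is a sketch under stated assumptions. The group and rule names, the 5-minute windows, and the severity label are hypothetical, and the ratio assumes that failed requests are also counted in kysira_inference_requests_total, which the source does not confirm.

```yaml
# Sketch only: names, windows, and severities are illustrative assumptions.
groups:
  - name: kysira-inference-examples
    rules:
      # Fraction of scoring requests that errored, per endpoint.
      # Assumes errored requests are also counted in kysira_inference_requests_total.
      - record: kysira:inference_errors:ratio5m
        expr: |
          sum by (endpoint) (rate(kysira_inference_errors_total[5m]))
          /
          sum by (endpoint) (rate(kysira_inference_requests_total[5m]))

      # kysira_model_loaded is 1 when the model is loaded and 0 otherwise.
      - alert: KysiraModelNotLoaded
        expr: kysira_model_loaded == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Model {{ $labels.model_type }} is not loaded"
```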


Agent discovery mechanisms

There are three distinct discovery mechanisms. Which one applies depends on what the customer runs. All three are supported — each is controlled by a separate Helm value.

1. Prometheus pod annotations (metrics.prometheusAnnotations.enabled)

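The simplest mechanism. Self-managed Prometheus and Grafana Alloy discover pods through these annotations when their scrape configuration includes the conventional annotation-based relabel rules; neither acts on the annotations without that configuration. With it in place, they scrape pods that carry: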

prometheus.io/scrape: "true"
prometheus.io/path: /metrics
prometheus.io/port: "9090"   # or 8081 for inference

When to use: The customer runs Grafana Alloy or a self-managed Prometheus configured to use annotation-based pod discovery. This is the default in many self-hosted setups.

When NOT to rely on this: kube-prometheus-stack (the most common enterprise Grafana install) uses the Prometheus Operator and ignores these annotations. Use ServiceMonitors instead.
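
Annotation discovery is a convention rather than a built-in Prometheus feature: a scrape job lists all pods and relabel rules filter on the annotations. The job below is a sketch of that convention, assuming a self-managed Prometheus with permission to list pods. The job name is hypothetical, and the address rule handles IPv4 pod IPs only.

```yaml
# Sketch of a generic annotation-driven scrape job; the job name is an assumption.
scrape_configs:
  - job_name: kubernetes-pods-annotated
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use prometheus.io/path as the metrics path when the annotation is set.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        regex: (.+)
        target_label: __metrics_path__
      # Scrape <pod IP>:<prometheus.io/port>. Source labels are joined with ";".
      - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: (.+);(\d+)
        replacement: $1:$2
        target_label: __address__
```

A job like this picks up both Kysira components without naming them, which is why no Kysira-specific Prometheus configuration is needed when annotation discovery is already set up.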

2. Prometheus Operator ServiceMonitor (metrics.serviceMonitor.enabled)

kube-prometheus-stack deploys a Prometheus Operator that watches ServiceMonitor custom resources. When a ServiceMonitor exists and matches the Operator's selector, the Operator adds the target to Prometheus's scrape config automatically, with no manual prometheus.yml editing.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kysira-ext-proc   # illustrative name
  labels:
    release: prometheus   # must match Prometheus Operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kysira-ext-proc
  endpoints:
    - port: http-metrics
      path: /metrics
      interval: 30s

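The ServiceMonitor's labels must match the serviceMonitorSelector configured on the customer's Prometheus resource. release: prometheus is only an example: kube-prometheus-stack selects on the Helm release name by default, so the value depends on how the customer installed it. Set the label via metrics.serviceMonitor.additionalLabels.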

When to use: Customer runs kube-prometheus-stack or any Prometheus Operator deployment.
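
A ServiceMonitor selects Services, not pods: spec.selector.matchLabels is matched against Service labels, and endpoints[].port refers to a named port on that Service. The Service below is a sketch of the shape the ServiceMonitor above expects. The Service name is hypothetical, and whether the chart's Service uses these exact labels and port name is an assumption to verify against the rendered manifests.

```yaml
# Sketch of a Service the ServiceMonitor above would select; the name is an assumption.
apiVersion: v1
kind: Service
metadata:
  name: kysira-ext-proc
  labels:
    app.kubernetes.io/name: kysira-ext-proc   # matched by the ServiceMonitor's spec.selector.matchLabels
spec:
  selector:
    app.kubernetes.io/name: kysira-ext-proc   # routes to the ext-proc pods
  ports:
    - name: http-metrics                      # referenced by endpoints[].port in the ServiceMonitor
      port: 9090
      targetPort: 9090
```

If targets do not appear in Prometheus, the usual causes are a label mismatch with serviceMonitorSelector or a port name that differs between the Service and the ServiceMonitor.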

3. Datadog Agent autodiscovery (metrics.datadog.enabled)

The Datadog Agent uses its own annotation format. By default it does not act on the prometheus.io/* annotations (unless the Agent's optional Prometheus autodiscovery feature has been turned on). Instead, it reads ad.datadoghq.com/<container-name>.checks from the pod:

ad.datadoghq.com/ext-proc.checks: |
  {
    "openmetrics": {
      "instances": [{
        "openmetrics_endpoint": "http://%%host%%:9090/metrics",
        "namespace": "kysira",
        "metrics": ["kysira_extproc_.*"]
      }]
    }
  }

%%host%% is a Datadog autodiscovery template variable that resolves to the pod IP at runtime.

When to use: Customer runs the Datadog Agent in their cluster with the OpenMetrics check enabled.
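
For comparison, the inference pod would carry an equivalent annotation pointing at port 8081. This is a sketch, not the chart's rendered output: the container name inference in the annotation key is an assumption (the key must match the actual container name in the pod spec), and the metric patterns are taken from the inference metrics table.

```yaml
# Sketch of the inference pod annotation; the container name "inference" is an assumption.
ad.datadoghq.com/inference.checks: |
  {
    "openmetrics": {
      "instances": [{
        "openmetrics_endpoint": "http://%%host%%:8081/metrics",
        "namespace": "kysira",
        "metrics": ["kysira_inference_.*", "kysira_model_loaded"]
      }]
    }
  }
```

kysira_model_loaded is listed separately because it does not share the kysira_inference_ prefix and would otherwise be filtered out.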


Helm configuration

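The three mechanisms are independent of one another. Pod annotations are enabled by default; the ServiceMonitor and the Datadog annotations are opt-in. Enable whichever ones match the customer's stack.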

ext-proc values

metrics:
  enabled: true                   # master switch — exposes /metrics at all
  path: /metrics
  port: 9090

  prometheusAnnotations:
    enabled: true                 # add prometheus.io/* pod annotations

  serviceMonitor:
    enabled: false                # create a ServiceMonitor resource (requires the Prometheus Operator CRDs)
    interval: 30s
    additionalLabels: {}          # e.g. { release: prometheus }

  datadog:
    enabled: false                # add ad.datadoghq.com/* pod annotations
    namespace: kysira             # Datadog metric namespace prefix

inference values

metrics:
  enabled: true
  path: /metrics
  port: 8081                      # same port as the main API

  prometheusAnnotations:
    enabled: true

  serviceMonitor:
    enabled: false
    interval: 30s
    additionalLabels: {}

  datadog:
    enabled: false
    namespace: kysira

Enabling for kube-prometheus-stack

# deploy/values-<env>.yaml
kysira-ext-proc:
  metrics:
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: prometheus   # match your Prometheus Operator's serviceMonitorSelector

kysira-inference:
  metrics:
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: prometheus

Enabling for Datadog

# deploy/values-<env>.yaml
kysira-ext-proc:
  metrics:
    prometheusAnnotations:
      enabled: false          # Datadog ignores these by default; disable to keep annotations clean
    datadog:
      enabled: true

kysira-inference:
  metrics:
    prometheusAnnotations:
      enabled: false
    datadog:
      enabled: true

Customer deployment guide

If your cluster already has Grafana or Datadog running, Kysira will be picked up automatically once you enable the right option above. No changes to your existing observability stack are required.

| You have... | Set this |
| --- | --- |
| Vanilla Prometheus / Grafana Alloy (annotation discovery) | metrics.prometheusAnnotations.enabled: true (default) |
| kube-prometheus-stack (Prometheus Operator) | metrics.serviceMonitor.enabled: true + additionalLabels matching your operator's selector |
| Datadog Agent | metrics.datadog.enabled: true |
| Both Grafana and Datadog | Enable serviceMonitor and datadog; disable prometheusAnnotations |
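
The combined case in the last row can be sketched as a single values file, following the same layout as the per-stack examples. It assumes the Grafana side is a kube-prometheus-stack install, and the release: prometheus label is a placeholder for whatever your operator's serviceMonitorSelector expects.

```yaml
# deploy/values-<env>.yaml
# Sketch for running the Prometheus Operator and the Datadog Agent side by side.
kysira-ext-proc:
  metrics:
    prometheusAnnotations:
      enabled: false          # not read by the Operator or, by default, by Datadog
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: prometheus   # placeholder; match your operator's serviceMonitorSelector
    datadog:
      enabled: true

kysira-inference:
  metrics:
    prometheusAnnotations:
      enabled: false
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: prometheus   # placeholder
    datadog:
      enabled: true
```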

Self-hosted Prometheus (no operator)

If you run a plain Prometheus deployment (not via the Operator), either rely on the pod annotation method or add scrape jobs manually to prometheus.yml. ServiceMonitors are not picked up without the Operator. The jobs below read the port from the prometheus.io/port annotation, so metrics.prometheusAnnotations.enabled must remain true.

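scrape_configs:
  - job_name: kysira-ext-proc
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods labelled app.kubernetes.io/name=kysira-ext-proc.
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: kysira-ext-proc
        action: keep
      # Build the scrape address as <pod IP>:<prometheus.io/port annotation>.
      # Source labels are joined with ";" and the capture groups are referenced as $1 and $2.
      - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port]
        target_label: __address__
        regex: (.+);(\d+)
        replacement: $1:$2

  - job_name: kysira-inference
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods labelled app.kubernetes.io/name=kysira-inference.
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: kysira-inference
        action: keep
      # Build the scrape address as <pod IP>:<prometheus.io/port annotation>.
      - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port]
        target_label: __address__
        regex: (.+);(\d+)
        replacement: $1:$2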

Relationship to Kysira telemetry

The Prometheus metrics described above are for the customer's observability stack: they are scraped inside the customer's cluster and forwarded only to the backend the customer configures.