Observability: Prometheus, Grafana, and Datadog Integration

Kysira exposes Prometheus metrics from both the ext-proc and inference components. This document explains how those metrics are discovered by in-cluster agents (Grafana / Prometheus Operator, Datadog), what is already wired up, and what customers need to configure.

Architecture

Metrics are pulled by an in-cluster agent — Kysira services do not push. Each service exposes a standard Prometheus text-format /metrics endpoint. The customer's observability agent (Grafana Alloy, Prometheus, Datadog Agent) scrapes it on a configurable interval and forwards the data to their backend.

┌────────────────────────── k8s cluster ───────────────────────────┐
│                                                                  │
│  ┌─────────────────┐       ┌──────────────────────────┐          │
│  │ kysira-ext-proc │       │ kysira-inference         │          │
│  │ :9090/metrics   │       │ :8081/metrics            │          │
│  └────────┬────────┘       └────────────┬─────────────┘          │
│           │ scrape                      │ scrape                 │
│           └──────────────┬──────────────┘                        │
│                 ┌────────▼────────┐                              │
│                 │ In-cluster      │                              │
│                 │ agent           │                              │
│                 │ (DaemonSet /    │                              │
│                 │  Deployment)    │                              │
│                 └────────┬────────┘                              │
└──────────────────────────┼───────────────────────────────────────┘
                           │ HTTPS + API key
                  ┌────────▼────────┐
                  │ Grafana Cloud,  │
                  │ Datadog, or     │
                  │ self-hosted     │
                  └─────────────────┘

Metrics exposed

ext-proc (`kysira_extproc_*`)

| Metric | Type | Description |
| --- | --- | --- |
| `kysira_extproc_requests_total` | Counter | All requests inspected, labelled by `action` |
| `kysira_extproc_request_duration_seconds` | Histogram | Inference call latency, labelled by `action` |
| `kysira_extproc_flagged_total` | Counter | Requests whose score exceeded the kill threshold |
| `kysira_extproc_killed_total` | Counter | Requests blocked with 403 |
| `kysira_extproc_inference_errors_total` | Counter | Inference sidecar call failures (fails open) |
| `kysira_extproc_active_streams` | Gauge | Open ext_proc gRPC streams |

Port: 9090, path: /metrics (separate from gRPC port 50051).
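As an illustration of how these series might be consumed, the following Prometheus rules file sketches one recording rule and two alerts. Only the metric names and the `action` label come from the table above; the group name, rule names, thresholds, durations, and severity labels are assumptions chosen for the example and are not shipped with Kysira.

```yaml
# Hypothetical rules file, e.g. referenced from rule_files in prometheus.yml.
groups:
  - name: kysira-ext-proc-example
    rules:
      # p99 of the inference call latency per action, over 5-minute windows.
      - record: kysira_extproc:request_duration_seconds:p99
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, action) (rate(kysira_extproc_request_duration_seconds_bucket[5m]))
          )

      # The table notes that inference call failures fail open, so a sustained
      # error rate is worth surfacing even though traffic keeps flowing.
      - alert: KysiraExtProcInferenceErrors
        expr: sum(rate(kysira_extproc_inference_errors_total[5m])) > 0
        for: 10m
        labels:
          severity: warning          # assumed severity scheme
        annotations:
          summary: ext-proc calls to the inference sidecar are failing

      # Share of inspected requests that were blocked (5% is illustrative).
      - alert: KysiraExtProcHighKillRatio
        expr: |
          sum(rate(kysira_extproc_killed_total[5m]))
            / sum(rate(kysira_extproc_requests_total[5m])) > 0.05
        for: 15m
        labels:
          severity: warning
```

Histograms are exposed as `_bucket`, `_sum`, and `_count` series, which is why the recording rule queries `kysira_extproc_request_duration_seconds_bucket` rather than the bare metric name.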

inference (`kysira_inference_*`)

| Metric | Type | Description |
| --- | --- | --- |
| `kysira_inference_requests_total` | Counter | Total scoring requests, labelled by `endpoint` |
| `kysira_inference_errors_total` | Counter | Inference errors, labelled by `endpoint` and `error_type` |
| `kysira_inference_duration_seconds` | Histogram | Per-request scoring latency, labelled by `endpoint` |
| `kysira_model_loaded` | Gauge | Model load status: `1` = loaded, `0` = not loaded, labelled by `model_type` |

Port: 8081, path: /metrics (same port as the main API).
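A comparable sketch for the inference service: one alert on the model-load gauge and one on the per-endpoint error ratio. The metric and label names are taken from the table above; the rule names, the 1% threshold, and the `for` durations are assumptions for illustration only.

```yaml
# Hypothetical rules file for the inference metrics.
groups:
  - name: kysira-inference-example
    rules:
      # kysira_model_loaded is 1 when loaded and 0 when not, per model_type.
      - alert: KysiraModelNotLoaded
        expr: kysira_model_loaded == 0
        for: 5m
        labels:
          severity: critical         # assumed severity scheme
        annotations:
          summary: "Model {{ $labels.model_type }} is not loaded"

      # Fraction of scoring requests that errored, per endpoint.
      - alert: KysiraInferenceErrorRatio
        expr: |
          sum by (endpoint) (rate(kysira_inference_errors_total[5m]))
            / sum by (endpoint) (rate(kysira_inference_requests_total[5m])) > 0.01
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Error ratio above 1% on {{ $labels.endpoint }}"
```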

Agent discovery mechanisms

There are three distinct discovery mechanisms. Which one applies depends on what the customer runs. All three are supported — each is controlled by a separate Helm value.

1. Prometheus pod annotations (`metrics.prometheusAnnotations.enabled`)

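The simplest mechanism. The prometheus.io/* annotations are a convention rather than a built-in feature: vanilla Prometheus and Grafana Alloy honor them only when their scrape configuration combines Kubernetes pod discovery with relabel rules that read the annotations. The pods carry: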

prometheus.io/scrape: "true"
prometheus.io/path: /metrics
prometheus.io/port: "9090"   # or 8081 for inference
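For the agent to see them, these annotations have to sit on the pod template, not on the Deployment's own metadata. The fragment below sketches where they would land in a rendered manifest. The Deployment name, image reference, and port names are hypothetical; the container name `ext-proc` is inferred from the Datadog annotation key used later in this document.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kysira-ext-proc              # hypothetical name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kysira-ext-proc
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kysira-ext-proc
      annotations:                   # pod-level: this is what discovery reads
        prometheus.io/scrape: "true"
        prometheus.io/path: /metrics
        prometheus.io/port: "9090"   # quoted: annotation values must be strings
    spec:
      containers:
        - name: ext-proc
          image: example.invalid/kysira-ext-proc:latest   # placeholder image
          ports:
            - name: http-metrics     # assumed port name
              containerPort: 9090
            - name: grpc             # assumed port name
              containerPort: 50051
```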

When to use: The customer runs Grafana Alloy or a self-managed Prometheus configured to use annotation-based pod discovery. This is the default in many self-hosted setups.

When NOT to rely on this: kube-prometheus-stack (the most common enterprise Grafana install) uses the Prometheus Operator and ignores these annotations. Use ServiceMonitors instead.
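For reference, annotation-based discovery in a self-managed Prometheus is usually implemented with a scrape job along the following lines. This is a sketch of the widely used community pattern, not configuration shipped by Kysira, and the job name is arbitrary. If the customer's prometheus.yml contains nothing equivalent, the annotations have no effect.

```yaml
scrape_configs:
  - job_name: kubernetes-pods        # arbitrary job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Scrape only pods that opt in with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use prometheus.io/path as the metrics path when it is set.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        regex: (.+)
        target_label: __metrics_path__
      # Swap the discovered port for the one in prometheus.io/port.
      # Source label values are joined with ";" before the regex is applied.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```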

2. Prometheus Operator ServiceMonitor (`metrics.serviceMonitor.enabled`)

kube-prometheus-stack deploys a Prometheus Operator that watches ServiceMonitor resources. A ServiceMonitor selects Kubernetes Services by label and names the Service port to scrape. When a matching ServiceMonitor exists, the Operator automatically adds the target to Prometheus's scrape config without any manual prometheus.yml editing.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    release: prometheus   # must match Prometheus Operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kysira-ext-proc
  endpoints:
    - port: http-metrics
      path: /metrics
      interval: 30s

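The ServiceMonitor's labels must match the serviceMonitorSelector of the customer's Prometheus resource, otherwise the Operator ignores it. release: prometheus is only an example: kube-prometheus-stack selects on the name of its own Helm release by default, and other installations may select on a different label. Set the label via metrics.serviceMonitor.additionalLabels.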

When to use: Customer runs kube-prometheus-stack or any Prometheus Operator deployment.
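Two other objects have to line up with the ServiceMonitor for scraping to start: the Prometheus resource whose selector admits it, and a Service that carries the selected label and exposes a port with the referenced name. The sketch below shows both sides. Object names are hypothetical, and the Service shape is an assumption about what the chart renders.

```yaml
# Customer side: excerpt of the Prometheus resource managed by the Operator.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus                   # hypothetical name
spec:
  serviceMonitorSelector:
    matchLabels:
      release: prometheus            # ServiceMonitors must carry this label
  serviceMonitorNamespaceSelector: {}  # empty selector: look in all namespaces
---
# Kysira side: the Service the ServiceMonitor selects (assumed shape).
apiVersion: v1
kind: Service
metadata:
  name: kysira-ext-proc              # hypothetical name
  labels:
    app.kubernetes.io/name: kysira-ext-proc   # matched by spec.selector.matchLabels
spec:
  selector:
    app.kubernetes.io/name: kysira-ext-proc
  ports:
    - name: http-metrics             # matched by endpoints[].port
      port: 9090
      targetPort: 9090
```

If targets do not appear, the usual suspects are a label mismatch on the ServiceMonitor, a namespace the Prometheus resource does not watch, or a Service port whose name differs from the one in `endpoints`.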

3. Datadog Agent autodiscovery (`metrics.datadog.enabled`)

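The Datadog Agent uses its own annotation format. By default it does not read the prometheus.io/* annotations. Instead, it reads ad.datadoghq.com/<container-name>.checks from the pod, where <container-name> must match the name of the container in the pod spec (ext-proc in the example below):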

ad.datadoghq.com/ext-proc.checks: |
  {
    "openmetrics": {
      "instances": [{
        "openmetrics_endpoint": "http://%%host%%:9090/metrics",
        "namespace": "kysira",
        "metrics": ["kysira_extproc_.*"]
      }]
    }
  }

%%host%% is a Datadog autodiscovery template variable that resolves to the pod IP at runtime.

When to use: Customer runs the Datadog Agent in their cluster with the OpenMetrics check enabled.
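The inference pod would need an equivalent annotation. The sketch below assumes the container is named `inference`, which this document does not confirm, so adjust the key to the real container name. Note that `kysira_model_loaded` does not share the `kysira_inference_` prefix, so a filter built only from that prefix would drop it; the example lists it explicitly.

```yaml
# Hypothetical pod annotation; assumes a container named "inference".
ad.datadoghq.com/inference.checks: |
  {
    "openmetrics": {
      "instances": [{
        "openmetrics_endpoint": "http://%%host%%:8081/metrics",
        "namespace": "kysira",
        "metrics": ["kysira_inference_.*", "kysira_model_loaded"]
      }]
    }
  }
```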

Helm configuration

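The three mechanisms are independent of each other. Pod annotations are enabled by default; the ServiceMonitor and the Datadog annotations are opt-in. Enable whichever ones match the customer's stack.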

ext-proc values

metrics:
  enabled: true                   # master switch — exposes /metrics at all
  path: /metrics
  port: 9090

  prometheusAnnotations:
    enabled: true                 # add prometheus.io/* pod annotations

  serviceMonitor:
    enabled: false                # create a ServiceMonitor resource (Prometheus Operator)
    interval: 30s
    additionalLabels: {}          # e.g. { release: prometheus }

  datadog:
    enabled: false                # add ad.datadoghq.com/* pod annotations
    namespace: kysira             # Datadog metric namespace prefix

inference values

metrics:
  enabled: true
  path: /metrics
  port: 8081                      # same port as the main API

  prometheusAnnotations:
    enabled: true

  serviceMonitor:
    enabled: false
    interval: 30s
    additionalLabels: {}

  datadog:
    enabled: false
    namespace: kysira

Enabling for kube-prometheus-stack

# deploy/values-<env>.yaml
kysira-ext-proc:
  metrics:
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: prometheus   # match your Prometheus Operator's serviceMonitorSelector

kysira-inference:
  metrics:
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: prometheus

Enabling for Datadog

# deploy/values-<env>.yaml
kysira-ext-proc:
  metrics:
    prometheusAnnotations:
      enabled: false          # Datadog ignores these by default; disable to keep annotations clean
    datadog:
      enabled: true

kysira-inference:
  metrics:
    prometheusAnnotations:
      enabled: false
    datadog:
      enabled: true

Customer deployment guide

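If your cluster already runs a Prometheus-compatible scraper (Prometheus, Grafana Alloy) or the Datadog Agent, Kysira is picked up automatically once you enable the matching option from the table below. In most cases no changes to your existing observability stack are required; the exception is a plain Prometheus without annotation-based discovery, covered at the end of this section.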

| You have... | Set this |
| --- | --- |
| Vanilla Prometheus / Grafana Alloy (annotation discovery) | `metrics.prometheusAnnotations.enabled: true` (default) |
| kube-prometheus-stack (Prometheus Operator) | `metrics.serviceMonitor.enabled: true` + `additionalLabels` matching your operator's selector |
| Datadog Agent | `metrics.datadog.enabled: true` |
| Both kube-prometheus-stack and Datadog | Enable `serviceMonitor` and `datadog`; disable `prometheusAnnotations` |
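For the combined Prometheus Operator and Datadog case, the values file could look like the sketch below. It only combines keys already documented in the Helm configuration section; the `release: prometheus` label is an example and has to match the customer's actual serviceMonitorSelector.

```yaml
# deploy/values-<env>.yaml
kysira-ext-proc:
  metrics:
    prometheusAnnotations:
      enabled: false               # neither agent relies on prometheus.io/* here
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: prometheus        # example label; match your selector
    datadog:
      enabled: true

kysira-inference:
  metrics:
    prometheusAnnotations:
      enabled: false
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: prometheus
    datadog:
      enabled: true
```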

Self-hosted Prometheus (no operator)

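If the customer runs a plain Prometheus deployment (not via the Operator), ServiceMonitors are not picked up. The customer either relies on an existing annotation-based scrape job or adds dedicated jobs to prometheus.yml. The jobs below select pods by their app.kubernetes.io/name label and read the port from the prometheus.io/port annotation, so metrics.prometheusAnnotations.enabled must remain true: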

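scrape_configs:
  - job_name: kysira-ext-proc
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods labelled app.kubernetes.io/name=kysira-ext-proc.
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: kysira-ext-proc
        action: keep
      # Build <pod IP>:<annotated port>. Relabel replacements can only use
      # regex capture groups, so both values are read as source labels
      # (joined with ";") and captured separately.
      - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port]
        regex: (.+);(\d+)
        target_label: __address__
        replacement: $1:$2

  - job_name: kysira-inference
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods labelled app.kubernetes.io/name=kysira-inference.
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: kysira-inference
        action: keep
      # Same address rewrite as above.
      - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port]
        regex: (.+);(\d+)
        target_label: __address__
        replacement: $1:$2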

Relationship to Kysira telemetry

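The Prometheus metrics described above are for the customer's observability stack. Kysira services only expose them; they leave the cluster only if the customer's agent forwards them to an external backend such as Grafana Cloud or Datadog.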