Observability: Prometheus, Grafana, and Datadog Integration
Kysira exposes Prometheus metrics from both the ext-proc and inference components. This document explains how those metrics are discovered by in-cluster agents (Grafana / Prometheus Operator, Datadog), what is already wired up, and what customers need to configure.
Architecture
Metrics are pulled by an in-cluster agent; Kysira services do not push. Each service exposes a standard Prometheus text-format /metrics endpoint. The customer's observability agent (Grafana Alloy, Prometheus, Datadog Agent) scrapes it on a configurable interval and forwards the data to their backend.
┌────────────────────────── k8s cluster ──────────────────────────┐
│                                                                 │
│  ┌─────────────────┐       ┌──────────────────────────┐         │
│  │ kysira-ext-proc │       │ kysira-inference         │         │
│  │ :9090/metrics   │       │ :8081/metrics            │         │
│  └────────┬────────┘       └────────────┬─────────────┘         │
│           │ scrape                      │ scrape                │
│           └──────────────┬──────────────┘                       │
│                   ┌──────▼──────────┐                           │
│                   │ In-cluster      │                           │
│                   │ agent           │                           │
│                   │ (DaemonSet /    │                           │
│                   │ Deployment)     │                           │
│                   └──────┬──────────┘                           │
└──────────────────────────┼──────────────────────────────────────┘
                           │ HTTPS + API key
                  ┌────────▼─────────┐
                  │ Grafana Cloud    │  ← or Datadog / self-hosted
                  │ or Datadog       │
                  └──────────────────┘
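As a rough illustration of one pull cycle in the architecture above, the following Python sketch fetches a /metrics endpoint and splits the text-format body into samples. It is not how any particular agent is implemented. The parser is deliberately simplified (no timestamps, no spaces inside label values), and the `action="allow"` label value in the example payload is hypothetical.

```python
import urllib.request


def parse_samples(text):
    """Parse Prometheus text-format lines into (name, labels, value) tuples.

    Simplified: assumes no timestamps and no spaces inside label values.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, value = line.rsplit(" ", 1)
        name, _, labels = series.partition("{")
        samples.append((name, labels.rstrip("}"), float(value)))
    return samples


def scrape(url, timeout=5.0):
    """One pull cycle: HTTP GET the endpoint, then parse the body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_samples(resp.read().decode("utf-8"))


# Hypothetical payload; real label values depend on the deployment.
EXAMPLE = """\
# TYPE kysira_extproc_requests_total counter
kysira_extproc_requests_total{action="allow"} 42
kysira_extproc_active_streams 3
"""

for sample in parse_samples(EXAMPLE):
    print(sample)
```

From inside the cluster, `scrape("http://<pod-ip>:9090/metrics")` would run the same parser against a live ext-proc pod, with the pod IP filled in by whatever discovery mechanism the agent uses.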
Metrics exposed
ext-proc (kysira_extproc_*)
| Metric | Type | Description |
|---|---|---|
| kysira_extproc_requests_total | Counter | All requests inspected, labelled by action |
| kysira_extproc_request_duration_seconds | Histogram | Inference call latency, labelled by action |
| kysira_extproc_flagged_total | Counter | Requests whose score exceeded the kill threshold |
| kysira_extproc_killed_total | Counter | Requests blocked with 403 |
| kysira_extproc_inference_errors_total | Counter | Inference sidecar call failures (fails open) |
| kysira_extproc_active_streams | Gauge | Open ext_proc gRPC streams |
Port: 9090, path: /metrics (separate from gRPC port 50051).
inference (kysira_inference_*)
| Metric | Type | Description |
|---|---|---|
| kysira_inference_requests_total | Counter | Total scoring requests, labelled by endpoint |
| kysira_inference_errors_total | Counter | Inference errors, labelled by endpoint and error_type |
| kysira_inference_duration_seconds | Histogram | Per-request scoring latency, labelled by endpoint |
| kysira_model_loaded | Gauge | Model load status (1 = loaded, 0 = not loaded), labelled by model_type |
Port: 8081, path: /metrics (same port as the main API).
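A quick smoke test after deployment is to check that a scraped payload declares every metric family listed in the tables above. The sketch below does this by reading the `# TYPE` lines of the text format. The function name and structure are illustrative only. Note that some client libraries omit labelled metrics until their first observation, so a missing entry on an idle pod is not necessarily a fault.

```python
import re

# Metric names copied from the two tables above.
EXPECTED = {
    "ext-proc": {
        "kysira_extproc_requests_total",
        "kysira_extproc_request_duration_seconds",
        "kysira_extproc_flagged_total",
        "kysira_extproc_killed_total",
        "kysira_extproc_inference_errors_total",
        "kysira_extproc_active_streams",
    },
    "inference": {
        "kysira_inference_requests_total",
        "kysira_inference_errors_total",
        "kysira_inference_duration_seconds",
        "kysira_model_loaded",
    },
}

TYPE_LINE = re.compile(r"^# TYPE (\S+) (\S+)$")


def missing_metrics(text, component):
    """Return documented metric names that have no '# TYPE' line in text."""
    seen = set()
    for line in text.splitlines():
        match = TYPE_LINE.match(line.strip())
        if match:
            seen.add(match.group(1))
    return sorted(EXPECTED[component] - seen)


# Hypothetical partial payload from the inference service.
payload = (
    "# TYPE kysira_inference_requests_total counter\n"
    "# TYPE kysira_model_loaded gauge\n"
)
print(missing_metrics(payload, "inference"))
```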
Agent discovery mechanisms
There are three distinct discovery mechanisms; which one applies depends on what the customer runs. All three are supported, and each is controlled by a separate Helm value.
1. Prometheus pod annotations (metrics.prometheusAnnotations.enabled)
The simplest mechanism. Vanilla Prometheus and Grafana Alloy, when configured for annotation-based pod discovery, scrape pods that carry:
prometheus.io/scrape: "true"
prometheus.io/path: /metrics
prometheus.io/port: "9090" # or 8081 for inference
When to use: The customer runs Grafana Alloy or a self-managed Prometheus configured to use annotation-based pod discovery. This is the default in many self-hosted setups.
When NOT to rely on this: kube-prometheus-stack (the most common enterprise Grafana install) uses the Prometheus Operator and ignores these annotations. Use ServiceMonitors instead.
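The annotations are a convention that the scraper's relabeling rules implement, not a built-in Prometheus feature. The Python sketch below shows the decision those rules typically encode: scrape only pods that opt in, and build the URL from the port and path annotations. It is a simplification (real configurations often fall back to the declared container port when no port annotation is present), and `scrape_url` is a hypothetical helper name.

```python
def scrape_url(pod_ip, annotations):
    """Return the scrape URL for a pod, or None if it has not opted in."""
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    port = annotations.get("prometheus.io/port")
    if port is None:
        return None  # simplification: no fallback to the container port
    path = annotations.get("prometheus.io/path", "/metrics")
    return f"http://{pod_ip}:{port}{path}"


ext_proc_annotations = {
    "prometheus.io/scrape": "true",
    "prometheus.io/path": "/metrics",
    "prometheus.io/port": "9090",
}
print(scrape_url("10.0.0.5", ext_proc_annotations))  # http://10.0.0.5:9090/metrics
print(scrape_url("10.0.0.6", {}))                    # None
```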
2. Prometheus Operator ServiceMonitor (metrics.serviceMonitor.enabled)
kube-prometheus-stack deploys a Prometheus Operator that watches ServiceMonitor custom resources. When a ServiceMonitor exists and matches the Operator's selector, the Operator adds the target to Prometheus's scrape config automatically, with no manual prometheus.yml editing.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kysira-ext-proc   # illustrative name
  labels:
    release: prometheus   # must match the Prometheus Operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kysira-ext-proc
  endpoints:
    - port: http-metrics
      path: /metrics
      interval: 30s
The release: prometheus label (or whatever label the customer's Prometheus Operator is configured to select on) must match. Set it via metrics.serviceMonitor.additionalLabels.
When to use: Customer runs kube-prometheus-stack or any Prometheus Operator deployment.
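Two label matches have to succeed before a target is scraped: the Operator must select the ServiceMonitor, and the ServiceMonitor must select the Service. The Python sketch below models both with equality-based matchLabels only. It ignores namespace selectors and matchExpressions, and the Operator selector shown is an assumed configuration.

```python
def selector_matches(match_labels, labels):
    """Equality-based label selector: every selector pair must be present."""
    return all(labels.get(key) == value for key, value in match_labels.items())


def is_scraped(operator_selector, monitor_labels, monitor_selector, service_labels):
    """True when the Operator selects the ServiceMonitor and the
    ServiceMonitor selects the Service."""
    return (selector_matches(operator_selector, monitor_labels)
            and selector_matches(monitor_selector, service_labels))


service_labels = {"app.kubernetes.io/name": "kysira-ext-proc"}
monitor_selector = {"app.kubernetes.io/name": "kysira-ext-proc"}
operator_selector = {"release": "prometheus"}  # assumed Operator configuration

# With the release label set through additionalLabels:
print(is_scraped(operator_selector, {"release": "prometheus"},
                 monitor_selector, service_labels))  # True
# Without it, the ServiceMonitor is silently ignored:
print(is_scraped(operator_selector, {}, monitor_selector, service_labels))  # False
```

The second case is the usual reason a ServiceMonitor exists in the cluster but no target shows up in Prometheus.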
3. Datadog Agent autodiscovery (metrics.datadog.enabled)
The Datadog Agent uses its own annotation format entirely. The prometheus.io/* annotations are invisible to it. Instead, it reads ad.datadoghq.com/<container-name>.checks from the pod:
ad.datadoghq.com/ext-proc.checks: |
  {
    "openmetrics": {
      "instances": [{
        "openmetrics_endpoint": "http://%%host%%:9090/metrics",
        "namespace": "kysira",
        "metrics": ["kysira_extproc_.*"]
      }]
    }
  }
%%host%% is a Datadog autodiscovery template variable that resolves to the pod IP at runtime.
When to use: Customer runs the Datadog Agent in their cluster with the OpenMetrics check enabled.
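Because the annotation value is JSON embedded in a string, it is easy to get the quoting wrong when writing it by hand. The Python sketch below builds the same annotation programmatically. The helper name and its parameters are hypothetical, and the `<container-name>` part of the key must match the container name in the pod spec.

```python
import json


def datadog_checks_annotation(container, port, metric_prefix,
                              namespace="kysira", path="/metrics"):
    """Return (annotation key, JSON value) for a Datadog OpenMetrics check."""
    key = f"ad.datadoghq.com/{container}.checks"
    value = {
        "openmetrics": {
            "instances": [{
                # %%host%% is resolved to the pod IP by the Datadog Agent.
                "openmetrics_endpoint": f"http://%%host%%:{port}{path}",
                "namespace": namespace,
                "metrics": [f"{metric_prefix}.*"],
            }]
        }
    }
    return key, json.dumps(value, indent=2)


key, value = datadog_checks_annotation("ext-proc", 9090, "kysira_extproc_")
print(key)
print(value)
```

For the inference pod, the same helper would be called with port 8081 and the `kysira_inference_` prefix, using whatever container name that pod actually declares.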
Helm configuration
The three mechanisms are independent of each other. Prometheus annotations are enabled by default; ServiceMonitor and Datadog are opt-in. Enable whichever ones match the customer's stack.
ext-proc values
metrics:
  enabled: true            # master switch: exposes /metrics at all
  path: /metrics
  port: 9090
  prometheusAnnotations:
    enabled: true          # add prometheus.io/* pod annotations
  serviceMonitor:
    enabled: false         # create a ServiceMonitor resource (Prometheus Operator)
    interval: 30s
    additionalLabels: {}   # e.g. { release: prometheus }
  datadog:
    enabled: false         # add ad.datadoghq.com/* pod annotations
    namespace: kysira      # Datadog metric namespace prefix
inference values
metrics:
  enabled: true
  path: /metrics
  port: 8081               # same port as the main API
  prometheusAnnotations:
    enabled: true
  serviceMonitor:
    enabled: false
    interval: 30s
    additionalLabels: {}
  datadog:
    enabled: false
    namespace: kysira
Enabling for kube-prometheus-stack
# deploy/values-<env>.yaml
kysira-ext-proc:
  metrics:
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: prometheus   # match your Prometheus Operator's serviceMonitorSelector
kysira-inference:
  metrics:
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: prometheus
Enabling for Datadog
# deploy/values-<env>.yaml
kysira-ext-proc:
  metrics:
    prometheusAnnotations:
      enabled: false   # Datadog ignores these; disable to keep annotations clean
    datadog:
      enabled: true
kysira-inference:
  metrics:
    prometheusAnnotations:
      enabled: false
    datadog:
      enabled: true
Customer deployment guide
If your cluster already has Grafana or Datadog running, Kysira will be picked up automatically once you enable the right option above. No changes to your existing observability stack are required.
| You have... | Set this |
|---|---|
| Vanilla Prometheus / Grafana Alloy (annotation discovery) | metrics.prometheusAnnotations.enabled: true (default) |
| kube-prometheus-stack (Prometheus Operator) | metrics.serviceMonitor.enabled: true + additionalLabels matching your operator's selector |
| Datadog Agent | metrics.datadog.enabled: true |
| Both Grafana and Datadog | Enable serviceMonitor and datadog; disable prometheusAnnotations |
Self-hosted Prometheus (no operator)
If the customer runs a plain Prometheus deployment (not via the Operator), they need to add a scrape job manually to their prometheus.yml, or use the pod annotation method. ServiceMonitors won't be picked up without the Operator.
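scrape_configs:
  - job_name: kysira-ext-proc
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only kysira-ext-proc pods.
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: kysira-ext-proc
        action: keep
      # Build __address__ as <pod IP>:<prometheus.io/port annotation>.
      # A replacement can only reference regex capture groups, so both
      # labels go in source_labels (joined with ";" before matching).
      - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port]
        regex: (.+);(\d+)
        target_label: __address__
        replacement: $1:$2
  - job_name: kysira-inference
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only kysira-inference pods.
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        regex: kysira-inference
        action: keep
      # Same address rewrite as above; requires the prometheus.io/port annotation.
      - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_annotation_prometheus_io_port]
        regex: (.+);(\d+)
        target_label: __address__
        replacement: $1:$2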
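To see how a `replace` relabel rule derives `__address__` from the pod IP and the port annotation, the Python sketch below simulates the action. Prometheus joins the source label values with a separator (";" by default), matches the regex against the whole joined string, and expands `$1`, `$2`, and so on from the capture groups. A replacement cannot look up other labels by name, so the pod IP has to be one of the source labels. This is an approximation: Prometheus uses RE2 syntax, which agrees with Python's `re` for simple patterns like this one.

```python
import re


def relabel_replace(labels, source_labels, regex, replacement, separator=";"):
    """Simulate a Prometheus relabel 'replace' action.

    Returns the new target value, or None when the regex does not match
    (in which case Prometheus leaves the target label unchanged).
    """
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)  # the regex is anchored at both ends
    if match is None:
        return None
    # Expand $1, $2, ... with the corresponding capture groups.
    return re.sub(r"\$(\d+)",
                  lambda ref: match.group(int(ref.group(1))),
                  replacement)


pod = {
    "__meta_kubernetes_pod_ip": "10.0.0.5",
    "__meta_kubernetes_pod_annotation_prometheus_io_port": "9090",
}
sources = ["__meta_kubernetes_pod_ip",
           "__meta_kubernetes_pod_annotation_prometheus_io_port"]

print(relabel_replace(pod, sources, r"(.+);(\d+)", "$1:$2"))  # 10.0.0.5:9090
```

A pod without the port annotation produces the joined value `10.0.0.5;`, which does not match, so its address is left as discovered.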
Relationship to Kysira telemetry
The Prometheus metrics described above are for the customer's observability stack and stay in the customer's cluster.