Grafana

Namespace: grafana | URL: https://grafana.astaup.de | Manifests: infrastructure/monitoring/grafana/

Deployment

Deployed via the Grafana Operator (grafana.integreatly.org/v1beta1). The operator manages the Grafana Deployment and reconciles GrafanaDatasource, GrafanaDashboard, GrafanaAlertRuleGroup, GrafanaContactPoint, and GrafanaNotificationPolicy custom resources into a running Grafana instance.

The dashboards: grafana label on the Grafana CR is what every other custom resource matches in its instanceSelector.matchLabels to target this instance. This allows multiple Grafana instances to coexist in the future.
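As a rough illustration of the selector mechanism, the manifests below pair a Grafana CR carrying the label with a dashboard resource that selects it. Only the API group, the label, and the instanceSelector field come from the text above; the resource names, the config block, and the dashboard JSON are hypothetical placeholders.

```yaml
# Sketch only: names and dashboard content are illustrative assumptions.
apiVersion: grafana.integreatly.org/v1beta1
kind: Grafana
metadata:
  name: grafana
  namespace: grafana
  labels:
    dashboards: grafana        # the label other resources select on
spec:
  config:
    log:
      mode: console
---
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: example-dashboard      # hypothetical
  namespace: grafana
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana      # must match the label on the Grafana CR
  json: |
    {"title": "Example", "panels": []}
```

A second Grafana instance would carry a different label value, and each resource would pick its target by changing only matchLabels.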

Datasources

Both datasources use access: proxy — Grafana fetches on behalf of the browser, so the browser never needs direct cluster access.

Name | Type | URL | Notes
Mimir | Prometheus | mimir-gateway.mimir/prometheus | Default datasource; prometheusType: Mimir enables Mimir-specific query hints; httpMethod: POST for large queries
Loki | Loki | loki-gateway.loki | -
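The table could translate into GrafanaDatasource resources along the following lines. This is a sketch, not the committed manifests: the http:// scheme, the metadata names, and the placement of prometheusType and httpMethod under jsonData are assumptions based on how Grafana datasources are usually provisioned.

```yaml
# Sketch only: URL scheme and resource names are assumptions.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: mimir
  namespace: grafana
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  datasource:
    name: Mimir
    type: prometheus
    access: proxy              # Grafana backend performs the requests
    url: http://mimir-gateway.mimir/prometheus
    isDefault: true
    jsonData:
      prometheusType: Mimir    # enables Mimir-specific query hints
      httpMethod: POST         # avoids URL length limits on large queries
---
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDatasource
metadata:
  name: loki
  namespace: grafana
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  datasource:
    name: Loki
    type: loki
    access: proxy
    url: http://loki-gateway.loki
```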

OIDC (Keycloak)

Authentication uses Keycloak at idp.astaup.de/realms/astaup.de. The client secret is stored SOPS-encrypted in secrets/monitoring/grafana-oidc-secret.enc.yaml, decrypted to a grafana-oauth secret in the grafana namespace, and injected into the Grafana container as the AUTH_CLIENT_SECRET environment variable.

  • use_pkce: "true" is required because Keycloak enforces PKCE (RFC 7636, recommended for all clients by RFC 9700) by default. Without it, Keycloak rejects the auth request with Missing parameter: code_challenge_method
  • allow_sign_up: "true" creates accounts on first login, so no pre-provisioning is needed
  • disable_login_form: "false" keeps the local admin login available as a fallback
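The settings above might sit in the Grafana CR roughly as sketched below. The three option values, the realm URL, the secret name, and the env var name come from the text; the client_id, the display name, the scopes, and the secret key are assumptions. The endpoint paths follow Keycloak's standard OpenID Connect layout.

```yaml
# Fragment of a Grafana CR; client_id, scopes, and secret key are assumptions.
spec:
  config:
    auth:
      disable_login_form: "false"
    auth.generic_oauth:
      enabled: "true"
      name: Keycloak                           # hypothetical display name
      client_id: grafana                       # hypothetical
      client_secret: ${AUTH_CLIENT_SECRET}     # expanded from the env var below
      scopes: openid profile email
      auth_url: https://idp.astaup.de/realms/astaup.de/protocol/openid-connect/auth
      token_url: https://idp.astaup.de/realms/astaup.de/protocol/openid-connect/token
      api_url: https://idp.astaup.de/realms/astaup.de/protocol/openid-connect/userinfo
      use_pkce: "true"
      allow_sign_up: "true"
  deployment:
    spec:
      template:
        spec:
          containers:
            - name: grafana
              env:
                - name: AUTH_CLIENT_SECRET
                  valueFrom:
                    secretKeyRef:
                      name: grafana-oauth
                      key: client-secret       # hypothetical key name
```

The boolean values are quoted because the operator's config field is a map of strings, which matches the quoting used in the list above.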

Role mapping from Keycloak groups:

  • /Mitarbeitende/IT-Administration → Admin
  • /Mitarbeitende (any other) → Viewer
  • Everyone else → Viewer
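One way to express this mapping is a JMESPath expression in role_attribute_path. The sketch assumes that Keycloak emits group memberships in a claim named groups with full group paths (this needs a group membership mapper on the client); the claim name is not stated in the source.

```yaml
# Fragment of a Grafana CR; the "groups" claim name is an assumption.
spec:
  config:
    auth.generic_oauth:
      role_attribute_path: >-
        contains(groups[*], '/Mitarbeitende/IT-Administration') && 'Admin' || 'Viewer'
```

Because both the remaining /Mitarbeitende members and everyone else map to Viewer, a single fallback branch covers the last two rules.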

Alerting

All alerting configuration is managed as Grafana Operator custom resources committed to Git. Grafana's built-in alerting engine evaluates the rules and routes notifications through its internal Alertmanager; no external Alertmanager is used.

Notification routing (GrafanaNotificationPolicy):

  • All alerts → Slack #it-alerts
  • Grouped by alertname + namespace
  • group_wait: 30s, group_interval: 5m, repeat_interval: 4h
  • Webhook URL from grafana-slack-webhook secret
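The routing above could be declared with a contact point and a notification policy roughly as follows. The channel, grouping labels, timers, and secret name come from the list; the resource names and the secret key are assumptions. Field names follow the operator's v1beta1 API as commonly documented and should be checked against the installed operator version.

```yaml
# Sketch only: resource names and the secret key are assumptions.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaContactPoint
metadata:
  name: slack-it-alerts
  namespace: grafana
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  name: slack-it-alerts
  type: slack
  settings:
    recipient: "#it-alerts"
  valuesFrom:
    - targetPath: url              # Slack webhook URL is read from the secret
      valueFrom:
        secretKeyRef:
          name: grafana-slack-webhook
          key: url                 # hypothetical key name
---
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaNotificationPolicy
metadata:
  name: default-policy
  namespace: grafana
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  route:
    receiver: slack-it-alerts      # all alerts go to the one contact point
    group_by:
      - alertname
      - namespace
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 4h
```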

Alert rules (GrafanaAlertRuleGroup, all query Mimir):

Rule | Condition | Severity
Pod CrashLoopBackOff | >2 restarts in 15m and pod not ready | critical
PVC disk usage | configurable threshold | (not specified)
ztunnel HBONE | ztunnel connectivity issues | (not specified)
Zammad | application-specific checks | (not specified)
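For the first row, a rule group might look like the sketch below. It is not the committed rule: the PromQL assumes kube-state-metrics metric names, and the datasource UID (mimir), the folder reference, the evaluation interval, the for duration, and the rule UID are all hypothetical. The query itself encodes the condition, so the threshold expression only checks for a non-zero result.

```yaml
# Sketch only: metric names, UIDs, folder, and durations are assumptions.
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaAlertRuleGroup
metadata:
  name: pod-alerts
  namespace: grafana
spec:
  instanceSelector:
    matchLabels:
      dashboards: grafana
  folderRef: alerts                # hypothetical GrafanaFolder resource
  interval: 1m
  rules:
    - uid: pod-crashloopbackoff
      title: Pod CrashLoopBackOff
      condition: B
      for: 5m0s
      noDataState: OK              # an empty result means no pod matches
      execErrState: Error
      labels:
        severity: critical
      data:
        - refId: A
          datasourceUid: mimir     # hypothetical datasource UID
          relativeTimeRange:
            from: 900
          model:
            refId: A
            instant: true
            expr: |
              increase(kube_pod_container_status_restarts_total[15m]) > 2
              and on (namespace, pod)
              kube_pod_status_ready{condition="false"} == 1
        - refId: B
          datasourceUid: __expr__
          model:
            refId: B
            type: threshold
            expression: A
            datasource:
              type: __expr__
              uid: __expr__
            conditions:
              - type: query
                evaluator:
                  type: gt
                  params: [0]
                operator:
                  type: and
                query:
                  params: [B]
                reducer:
                  type: last
                  params: []
```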

Ingress

Istio Gateway (grafana in grafana namespace) with dedicated load-balancer IPs assigned via mikrolb annotation. HTTP redirects to HTTPS (301). TLS cert from cert-manager.

Traffic path: Internet → mikrolb → Istio Gateway → grafana-service:3000
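The source does not say whether the Gateway is a Kubernetes Gateway API resource or an Istio networking Gateway. The sketch below assumes the former with gatewayClassName: istio, and shows how the 301 redirect and the backend route could be expressed. The issuer name, TLS secret name, listener names, and route names are hypothetical, and the mikrolb annotation is left out because its key is not given.

```yaml
# Sketch only, assuming the Kubernetes Gateway API; several names are hypothetical.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: grafana
  namespace: grafana
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt   # hypothetical issuer name
    # The mikrolb annotation would also go here (key not given in the source).
spec:
  gatewayClassName: istio
  listeners:
    - name: http
      port: 80
      protocol: HTTP
      hostname: grafana.astaup.de
    - name: https
      port: 443
      protocol: HTTPS
      hostname: grafana.astaup.de
      tls:
        mode: Terminate
        certificateRefs:
          - name: grafana-tls                     # hypothetical secret name
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: grafana-https-redirect
  namespace: grafana
spec:
  parentRefs:
    - name: grafana
      sectionName: http          # attach to the plain-HTTP listener only
  rules:
    - filters:
        - type: RequestRedirect
          requestRedirect:
            scheme: https
            statusCode: 301
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: grafana
  namespace: grafana
spec:
  parentRefs:
    - name: grafana
      sectionName: https
  rules:
    - backendRefs:
        - name: grafana-service
          port: 3000
```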

Istio AuthorizationPolicies:

  • allow-intra-namespace — intra-namespace traffic allowed
  • grafana-istio-allow — allows HTTP/HTTPS to the Gateway pod
  • grafana-allow — only the Istio gateway's service account and the Grafana Operator's service account can reach the Grafana pod on port 3000
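The grafana-allow policy could be written roughly as follows. The policy name, namespace, and port come from the list above; the pod label, the trust domain, and both service account identities are assumptions (the gateway identity follows Istio's usual naming for auto-deployed gateways, and the operator's namespace is a guess).

```yaml
# Sketch only: selector label and principals are assumptions.
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: grafana-allow
  namespace: grafana
spec:
  selector:
    matchLabels:
      app: grafana                 # hypothetical label on the Grafana pod
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - cluster.local/ns/grafana/sa/grafana-istio               # gateway (assumed name)
              - cluster.local/ns/grafana-operator/sa/grafana-operator   # operator (assumed)
      to:
        - operation:
            ports: ["3000"]
```

The rule matches only on source identity and destination port, both of which are L4 attributes, so it does not depend on HTTP-level inspection.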