An open-source Kubernetes operator that detects cluster incidents using only native Kubernetes APIs, deduplicates signals into durable IncidentReport records, and serves a built-in incident dashboard, with no external databases or AI required.
The traditional SRE loop takes minutes or hours. RCA Operator collapses it using real-time Kubernetes-native signal collection and a durable incident engine.
Active, preventing false positives from transient flaps.
IncidentReport CR, notifies via Slack/PagerDuty, and powers the built-in dashboard, all from the same source of truth.
Active after 5 continuous minutes of confirmed failure, eliminating noise from transient flaps and rolling restarts.
RCAAgent and IncidentReport are first-class Kubernetes objects. Full audit trail, queryable via kubectl, GitOps-compatible out of the box.
IncidentReport and RCAAgent CRs, with no raw cluster reads from the UI.
RCACorrelationRule. Priority-based evaluation and templated summaries, with no operator redeploy needed.
IncidentReport CRs. Supports minutes, hours, or days (e.g., 5m, 24h, 30d).

We're adding the following features to make RCA Operator even more powerful. Follow the project on GitHub to stay updated.
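The retention values mentioned above (5m, 24h, 30d) include a day unit, which Go's standard `time.ParseDuration` does not accept. The following is a minimal sketch of how such strings could be parsed. `parseRetention` is a hypothetical helper written for illustration, not the operator's actual code, and the set of units the operator really accepts is an assumption here.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseRetention converts a retention string such as "5m", "24h", or "30d"
// into a time.Duration. Hypothetical helper: the operator's real parser and
// its accepted units are not shown in the source.
func parseRetention(s string) (time.Duration, error) {
	s = strings.TrimSpace(s)
	if len(s) < 2 {
		return 0, fmt.Errorf("invalid retention %q", s)
	}
	// time.ParseDuration has no day unit, so a trailing "d" is handled here.
	if strings.HasSuffix(s, "d") {
		days, err := strconv.Atoi(strings.TrimSuffix(s, "d"))
		if err != nil || days <= 0 {
			return 0, fmt.Errorf("invalid retention %q", s)
		}
		return time.Duration(days) * 24 * time.Hour, nil
	}
	// Everything else is delegated to the standard library parser.
	d, err := time.ParseDuration(s)
	if err != nil || d <= 0 {
		return 0, fmt.Errorf("invalid retention %q", s)
	}
	return d, nil
}

func main() {
	for _, in := range []string{"5m", "24h", "30d"} {
		d, err := parseRetention(in)
		fmt.Println(in, d, err)
	}
}
```

Because this sketch falls back to `time.ParseDuration`, it also accepts inputs such as `1h30m` or `90s`. A stricter parser could reject anything other than minutes, hours, and days.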
A single operator deployment using informer-backed collectors, a single-writer incident engine, and CR-backed durable state. No external databases or AI dependencies.
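As a rough illustration of the single-writer idea, the sketch below fans signals from several collectors into one channel and lets exactly one goroutine update the incident state. The `Signal` type, the fingerprint format, and `runEngine` are all hypothetical names chosen for this example. An in-memory map stands in for the CR-backed state, and plain slices stand in for informer-backed collectors.

```go
package main

import (
	"fmt"
	"sync"
)

// Signal is a hypothetical normalized event emitted by a collector.
type Signal struct {
	Namespace, Workload, Reason string
}

// fingerprint groups signals that describe the same underlying incident.
// The key format is an assumption made for this sketch.
func (s Signal) fingerprint() string {
	return s.Namespace + "/" + s.Workload + "/" + s.Reason
}

// runEngine starts one goroutine per collector batch, merges their signals
// into a single channel, and deduplicates them by fingerprint. Only the
// calling goroutine writes to the incident map, so no mutex is needed.
func runEngine(collectors ...[]Signal) map[string]int {
	signals := make(chan Signal)
	var wg sync.WaitGroup
	for _, batch := range collectors {
		wg.Add(1)
		go func(batch []Signal) {
			defer wg.Done()
			for _, s := range batch {
				signals <- s
			}
		}(batch)
	}
	// Close the channel once every collector has finished sending.
	go func() {
		wg.Wait()
		close(signals)
	}()

	incidents := make(map[string]int) // fingerprint -> signals observed
	for s := range signals {
		incidents[s.fingerprint()]++
	}
	return incidents
}

func main() {
	crash := Signal{"shop", "checkout", "CrashLoopBackOff"}
	oom := Signal{"shop", "cart", "OOMKilled"}
	incidents := runEngine([]Signal{crash, crash}, []Signal{crash, oom})
	fmt.Println(len(incidents), incidents[crash.fingerprint()])
}
```

Funnelling every write through one goroutine keeps ordering and deduplication logic in a single place. That is presumably the motivation for a single-writer engine, though the operator's actual design may differ in its details.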
A clear, phased roadmap built around Kubernetes-native incident detection. Each phase ships a production-worthy slice.
Choose kubectl or Helm. Both paths have your operator watching your cluster in under 5 minutes.
RCA Operator is MIT licensed and actively developed. Every contribution — bug reports, docs, playbooks, or code — moves us closer to production-ready.
make test, and open a Pull Request.

RCA Operator is free, open source, and ready to deploy into your cluster today. Your on-call rotation will thank you.