Kubernetes-Native Incident Detection Operator — Phase 1

Your cluster never sleeps.
Neither does RCA Operator.

An open-source Kubernetes operator that detects cluster incidents using only native Kubernetes APIs, deduplicates signals into durable IncidentReport records, and serves a built-in incident dashboard — no external databases or AI required.

Go 1.25+ · Kubernetes 1.26+ · MIT License · Phase 1 · kubebuilder
Quick Install · View on GitHub · 📖 Read Docs
rca-operator — live cluster watch
kubectl logs
time="2025-03-17T09:14:01Z" level=info msg="collector started" namespaces=production,staging
time="2025-03-17T09:14:23Z" level=warn  msg="signal received" reason=OOMKilled pod=api-server-7d9f6c
time="2025-03-17T09:14:24Z" level=warn  msg="signal received" reason=NodeMemoryPressure node=worker-2
time="2025-03-17T09:14:24Z" level=info msg="incident detecting" fingerprint=ns|WorkloadCrashLoop|api-server
time="2025-03-17T09:19:24Z" level=info msg="incident activated" stabilization=5m0s severity=P2
time="2025-03-17T09:19:25Z" level=info msg="IncidentReport created" name=incident-4f9c2
time="2025-03-17T09:19:26Z" level=info msg="Slack notification sent" channel=#incidents

IncidentReport
apiVersion: rca.rca-operator.tech/v1alpha1
kind: IncidentReport
metadata:
  name: incident-4f9c2
  namespace: production
spec:
  fingerprint: production|WorkloadCrashLoop|api-server
  incidentType: WorkloadCrashLoop
status:
  phase: Active
  severity: P2
  firstObservedAt: "2025-03-17T09:14:23Z"
  activeAt: "2025-03-17T09:19:24Z"
  signalCount: 7
  summary: "Deployment api-server crash-looping"

RCAAgent CR
apiVersion: rca.rca-operator.tech/v1alpha1
kind: RCAAgent
metadata:
  name: sre-agent
  namespace: rca-system
spec:
  watchNamespaces: [production, staging]
  incidentRetention: 30d
  notifications:
    slack:
      webhookSecretRef: slack-webhook
      channel: "#incidents"
      mentionOnP1: "@oncall"
    pagerduty:
      secretRef: pagerduty-key
      severity: P2
How it works

From failure signal to incident record
in under 30 seconds

The traditional SRE loop takes minutes or hours. RCA Operator collapses it using real-time Kubernetes-native signal collection and a durable incident engine.

01
🔭
Collect
Read-only informer-based collectors watch pods, nodes, workloads, and events from the Kubernetes API — no polling, no external agents.
02
🔗
Correlate
Normalized signals are fingerprinted and deduplicated. Pods are mapped to their top-level workload owner so one incident represents the real blast radius.
03
⏱️
Stabilize
The incident engine enforces a 5-minute stabilization window before promotion to Active — preventing false positives from transient flaps.
04
📣
Act
Creates a durable IncidentReport CR, notifies via Slack/PagerDuty, and powers the built-in dashboard — all from the same source of truth.
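The four steps above can be pictured as a small in-memory engine. The following is a minimal sketch under stated assumptions, not the operator's actual code: the `Signal`, `Incident`, and `Engine` types, the `Observe` method, and the `Detecting` phase name are hypothetical. Only the `namespace|incidentType|workload` fingerprint shape and the 5-minute window are taken from this page.

```go
package main

import (
	"fmt"
	"time"
)

// stabilization is the 5-minute window described above.
const stabilization = 5 * time.Minute

// Signal is a hypothetical normalized signal; the field names are assumptions.
type Signal struct {
	Namespace    string
	IncidentType string
	Workload     string // top-level owner, assumed to be resolved already
	At           time.Time
}

// Incident is a hypothetical in-memory view of one open incident.
type Incident struct {
	Fingerprint     string
	Phase           string // "Detecting" or "Active" in this sketch
	FirstObservedAt time.Time
	SignalCount     int
}

// Fingerprint joins namespace, incident type, and workload with "|",
// the shape shown in the sample IncidentReport.
func Fingerprint(s Signal) string {
	return s.Namespace + "|" + s.IncidentType + "|" + s.Workload
}

// Engine deduplicates signals by fingerprint.
type Engine struct {
	incidents map[string]*Incident
}

func NewEngine() *Engine {
	return &Engine{incidents: make(map[string]*Incident)}
}

// Observe records a signal. Repeated signals with the same fingerprint update
// one incident instead of creating new ones. Simplification: promotion to
// Active is checked only when a signal arrives, not on a timer.
func (e *Engine) Observe(s Signal) *Incident {
	fp := Fingerprint(s)
	inc, ok := e.incidents[fp]
	if !ok {
		inc = &Incident{Fingerprint: fp, Phase: "Detecting", FirstObservedAt: s.At}
		e.incidents[fp] = inc
	}
	inc.SignalCount++
	if inc.Phase == "Detecting" && s.At.Sub(inc.FirstObservedAt) >= stabilization {
		inc.Phase = "Active"
	}
	return inc
}

func main() {
	e := NewEngine()
	start := time.Date(2025, 3, 17, 9, 14, 23, 0, time.UTC)
	for _, offset := range []time.Duration{0, time.Minute, 5 * time.Minute} {
		inc := e.Observe(Signal{
			Namespace:    "production",
			IncidentType: "WorkloadCrashLoop",
			Workload:     "api-server",
			At:           start.Add(offset),
		})
		fmt.Println(inc.Fingerprint, inc.Phase, inc.SignalCount)
	}
}
```

Running it prints the same fingerprint three times: the first two signals leave the incident in `Detecting`, and the third, arriving five minutes after the first, moves it to `Active` with a signal count of 3.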
Phase 1 — Completed

Production-grade features,
built from day one

Kubernetes-Native Signal Collection
Informer-based collectors watch pods, nodes, workloads, and events using only core Kubernetes APIs — no metrics-server, no Prometheus, no log agents.
⏱️
5-Minute Stabilization Window
Incidents only become Active after 5 continuous minutes of confirmed failure — eliminating noise from transient flaps and rolling restarts.
🔗
Workload-Aware Fingerprinting
Pods are resolved to their top-level owner (Deployment, StatefulSet, DaemonSet, Job) so one incident covers the full blast radius — not one per pod replica.
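To illustrate the owner resolution described above, here is a hedged sketch that walks owner references from a pod up to its top-level workload. The `Object` and `Ref` types and the `TopLevelOwner` function are hypothetical stand-ins; a real implementation would presumably read `metadata.ownerReferences` from informer caches and would also key objects by namespace, which is omitted here for brevity.

```go
package main

import "fmt"

// Ref identifies an object by kind and name (namespace omitted in this sketch).
type Ref struct {
	Kind string
	Name string
}

// Object is a hypothetical, stripped-down stand-in for a Kubernetes object:
// its kind, its name, and its controlling owner (nil for top-level objects).
type Object struct {
	Kind  string
	Name  string
	Owner *Ref
}

// TopLevelOwner follows owner references until it reaches an object that has
// no owner. If an owner is missing from the store, the reference itself is
// returned as the best known top-level owner.
func TopLevelOwner(store map[Ref]Object, start Object) Ref {
	cur := start
	// Bound the walk so a malformed owner cycle cannot loop forever.
	for i := 0; i < 10 && cur.Owner != nil; i++ {
		next, ok := store[*cur.Owner]
		if !ok {
			return *cur.Owner
		}
		cur = next
	}
	return Ref{Kind: cur.Kind, Name: cur.Name}
}

func main() {
	rs := Ref{Kind: "ReplicaSet", Name: "api-server-7d9f6c"}
	dep := Ref{Kind: "Deployment", Name: "api-server"}
	store := map[Ref]Object{
		rs:  {Kind: rs.Kind, Name: rs.Name, Owner: &dep},
		dep: {Kind: dep.Kind, Name: dep.Name},
	}
	// The pod name below is made up for the example.
	pod := Object{Kind: "Pod", Name: "api-server-7d9f6c-x2k4p", Owner: &rs}
	fmt.Println(TopLevelOwner(store, pod)) // prints "{Deployment api-server}"
}
```

Every replica of the same Deployment resolves to the same `Ref`, which is what lets a single fingerprint cover all of them.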
📋
Durable IncidentReport CRDs
RCAAgent and IncidentReport are first-class Kubernetes objects. Full audit trail, queryable via kubectl, GitOps-compatible out of the box.
📣
Rich Notifications
Slack and PagerDuty notifications driven from durable incident state. Includes incident fingerprint, severity, affected resources, and timeline — not just "something is broken."
📊
Built-in Incident Dashboard
Embedded web dashboard served by the operator itself. Reads only IncidentReport and RCAAgent CRs — no raw cluster reads from the UI.
🔐
RBAC Native & Secure
Runs with least-privilege read-only permissions on cluster resources. Supports multi-tenant clusters with namespace-scoped RCAAgents. Security policy is documented and auditable.
♻️
Automatic Incident Resolution
Incidents auto-resolve when the broken state clears and no confirming signal is seen for 5 minutes. Operator restarts do not duplicate or lose open incidents.
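One way to picture restart safety and auto-resolution together is the sketch below. It is illustrative only: the `Report` type, `RebuildIndex`, `ShouldResolve`, and the `Resolved` phase name are assumptions, and the check covers only the 5-minute quiet period, not the separate condition that the broken state has cleared.

```go
package main

import (
	"fmt"
	"time"
)

// quietPeriod is the 5 minutes without a confirming signal described above.
const quietPeriod = 5 * time.Minute

// Report is a hypothetical, trimmed-down view of an IncidentReport.
type Report struct {
	Fingerprint  string
	Phase        string // "Active" or "Resolved" in this sketch
	LastSignalAt time.Time
}

// RebuildIndex re-creates the in-memory index from reports that already
// exist, so a restarted operator updates open incidents instead of creating
// duplicates for the same fingerprint.
func RebuildIndex(existing []Report) map[string]*Report {
	index := make(map[string]*Report)
	for i := range existing {
		r := &existing[i]
		if r.Phase != "Resolved" {
			index[r.Fingerprint] = r
		}
	}
	return index
}

// ShouldResolve reports whether an active incident has been quiet for at
// least the quiet period.
func ShouldResolve(r Report, now time.Time) bool {
	return r.Phase == "Active" && now.Sub(r.LastSignalAt) >= quietPeriod
}

func main() {
	last := time.Date(2025, 3, 17, 9, 20, 0, 0, time.UTC)
	existing := []Report{
		{Fingerprint: "production|WorkloadCrashLoop|api-server", Phase: "Active", LastSignalAt: last},
		{Fingerprint: "staging|WorkloadCrashLoop|worker", Phase: "Resolved", LastSignalAt: last},
	}
	index := RebuildIndex(existing)
	r := index["production|WorkloadCrashLoop|api-server"]
	fmt.Println(len(index))                                 // 1
	fmt.Println(ShouldResolve(*r, last.Add(2*time.Minute))) // false
	fmt.Println(ShouldResolve(*r, last.Add(5*time.Minute))) // true
}
```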
🛠️
Built on kubebuilder
Leverages kubebuilder and controller-runtime — the same foundation as Crossplane, Cluster API, and cert-manager. Production-grade controller patterns from day one.
🧩
Correlation Rule Engine
5 built-in correlation rules plus dynamic CRD-based custom rules via RCACorrelationRule. Priority-based evaluation, templated summaries — no operator redeploy needed.
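Priority-based evaluation with templated summaries could look roughly like the sketch below. The `Rule` and `Context` types are hypothetical and are not the `RCACorrelationRule` schema; the convention that a higher priority wins, the `MemoryExhaustion` incident type, and the choice of Go's `text/template` for summaries are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"text/template"
)

// Rule is a hypothetical in-memory form of a correlation rule.
type Rule struct {
	Name         string
	Priority     int      // assumption: higher priority is evaluated first
	Requires     []string // signal reasons that must all be present
	IncidentType string
	Summary      string // text/template source rendered against Context
}

// Context is the data a rule is evaluated and rendered against.
type Context struct {
	Workload string
	Reasons  map[string]bool
}

func matches(r Rule, ctx Context) bool {
	for _, reason := range r.Requires {
		if !ctx.Reasons[reason] {
			return false
		}
	}
	return true
}

// Evaluate returns the incident type and rendered summary of the
// highest-priority matching rule, or empty strings if nothing matches.
func Evaluate(rules []Rule, ctx Context) (string, string, error) {
	sorted := append([]Rule(nil), rules...) // do not reorder the caller's slice
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Priority > sorted[j].Priority
	})
	for _, r := range sorted {
		if !matches(r, ctx) {
			continue
		}
		tmpl, err := template.New(r.Name).Parse(r.Summary)
		if err != nil {
			return "", "", err
		}
		var b strings.Builder
		if err := tmpl.Execute(&b, ctx); err != nil {
			return "", "", err
		}
		return r.IncidentType, b.String(), nil
	}
	return "", "", nil
}

func main() {
	rules := []Rule{
		{Name: "crash-loop", Priority: 10, Requires: []string{"CrashLoopBackOff"},
			IncidentType: "WorkloadCrashLoop", Summary: "Deployment {{.Workload}} crash-looping"},
		{Name: "oom-on-pressured-node", Priority: 20, Requires: []string{"OOMKilled", "NodeMemoryPressure"},
			IncidentType: "MemoryExhaustion", Summary: "{{.Workload}} OOMKilled while its node reports memory pressure"},
	}
	ctx := Context{
		Workload: "api-server",
		Reasons:  map[string]bool{"OOMKilled": true, "NodeMemoryPressure": true},
	}
	typ, summary, err := Evaluate(rules, ctx)
	if err != nil {
		panic(err)
	}
	fmt.Println(typ + ": " + summary)
}
```

Because rules are plain data in this sketch, loading them from custom resources at runtime would not require rebuilding the binary, which is the property the card above describes.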
📈
Prometheus Metrics
11+ metrics tracking incident lifecycle, signal processing, rule evaluations, and notification delivery. Ready for Prometheus scraping out of the box.
🔭
OpenTelemetry Tracing
Optional OTLP gRPC export for distributed tracing. Configurable endpoint, service name, sampling rate, and TLS — integrates with SigNoz or any OTel collector.
🏗️
Leader Election & HA
Supports high-availability deployments with leader election. Single active incident writer ensures dedup correctness. Safe startup recovery from existing CRs.
🗑️
Incident Retention & Cleanup
Configurable retention policy (default 30 days). Auto-prunes resolved IncidentReport CRs. Supports minutes, hours, or days (e.g., 5m, 24h, 30d).
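Go's `time.ParseDuration` understands `5m` and `24h` but has no day unit, so a value such as `30d` needs a small amount of extra handling. The helper below is a sketch of one way to do that; `ParseRetention` and `Expired` are hypothetical names, not functions from the operator, and this version also accepts other units that `time.ParseDuration` allows (for example seconds).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// ParseRetention parses values such as "5m", "24h", or "30d".
// A trailing "d" is treated as whole days of 24 hours; everything else is
// delegated to time.ParseDuration.
func ParseRetention(s string) (time.Duration, error) {
	if days, ok := strings.CutSuffix(s, "d"); ok {
		n, err := strconv.Atoi(days)
		if err != nil || n < 0 {
			return 0, fmt.Errorf("invalid retention %q", s)
		}
		return time.Duration(n) * 24 * time.Hour, nil
	}
	d, err := time.ParseDuration(s)
	if err != nil || d < 0 {
		return 0, fmt.Errorf("invalid retention %q", s)
	}
	return d, nil
}

// Expired reports whether a resolved incident is older than the retention.
func Expired(resolvedAt, now time.Time, retention time.Duration) bool {
	return now.Sub(resolvedAt) > retention
}

func main() {
	for _, v := range []string{"5m", "24h", "30d"} {
		d, err := ParseRetention(v)
		if err != nil {
			panic(err)
		}
		fmt.Println(v, "=", d)
	}
}
```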
Phase 2

🔥 Phase 2 is starting

We're adding the following features to make RCA Operator even more powerful. Follow the project on GitHub to stay updated.

In Progress 🔗
Correlation Engine
Correlate logs + metrics + traces together to pinpoint exact root causes. Cross-signal incident analysis powered by normalized data pipelines.
In Progress 📊
Topology Visualization UI
Interactive topology visualization showing service dependencies, failure propagation, and incident blast radius in real time.
In Progress 📡
OpenTelemetry Integrations
Native OTel collector integration for automatic trace correlation, span-based incident detection, and distributed tracing context preservation.
In Progress 🧠
Smarter RCA Insights
AI-powered root cause analysis with LLM-driven investigations. Automatic anomaly detection and pattern-based incident prediction.
Planned 📝
Log Collection & Analysis
Deep integration with Loki, ELK, and Datadog for contextualized log analysis within incidents.
Planned 📈
Grafana Dashboards
Pre-built Grafana dashboard templates for visualizing incident trends, signal patterns, and operator health.
Planned 🎯
SLO Tracking
Define and track Service Level Objectives with automatic incident correlation when SLOs are breached.
Planned 🔧
Autonomous Remediation
Automated remediation actions triggered by specific incident patterns — restart pods, scale deployments, and more.
Architecture

Phase 1 architecture —
designed for reliability

A single operator deployment using informer-backed collectors, a single-writer incident engine, and CR-backed durable state. No external databases or AI dependencies.

 rca-operator Phase 1 architecture  ·  full diagram →
Kubernetes Cluster
RCA Operator: controller-runtime Manager · leader election
Kubernetes API Server
Signal Collectors
node · pod · workload · event · ownership enricher · read-only · informer-backed
Dashboard API Server
reads IncidentReport CRs · reads RCAAgent CRs · no raw cluster reads
Incident Engine
fingerprinting · 5 min stabilization · dedup · lifecycle · resolve · single writer for IncidentReport
IncidentReport CRD
durable source of truth
Notifications
Slack · PagerDuty · K8s Events
Dashboard UI
reads CRs only
Roadmap

From foundation to production-ready

A clear, phased roadmap built around Kubernetes-native incident detection. Each phase ships a production-worthy slice.

v0.0.15
Released
Phase 1 Foundation — Signal collectors (node, pod, workload, event), incident engine with 5-min stabilization, IncidentReport CRD, Slack & PagerDuty notifications, built-in dashboard, Helm chart
v0.0.16
In Progress
Phase 2 — Correlation & Topology — Correlation engine (logs + metrics + traces), topology visualization UI, OpenTelemetry integrations, smarter RCA insights, startup recovery, leader election, incident retention, full test coverage
v0.2.0
Planned
Observability & Intelligence — Grafana dashboards, SLO tracking, log collection, auto post-mortems, AI/LLM root cause analysis
v1.0
Planned
Production Ready — Multi-cluster support, full web UI, autonomous remediation, OLM / OperatorHub listing, CNCF sandbox proposal
Quick Start

Deployed in minutes,
not hours

Choose kubectl or Helm. Both paths have your operator watching your cluster in under 5 minutes.

Helm (recommended)
helm-install.sh
# 1. Add the RCA Operator Helm repository
helm repo add rca-operator https://gaurangkudale.github.io/rca-operator.github.io
helm repo update
# 2. Install the operator
helm install rca-operator rca-operator/rca-operator \
  --namespace rca-system \
  --create-namespace

kubectl
kubectl-install.sh
# 1. Install the CRDs and operator (all-in-one), pinned to a specific release
kubectl apply -f https://github.com/gaurangkudale/RCA-Operator/releases/download/v0.0.15/install.yaml
# 2. Apply a minimal RCAAgent to start collecting signals
#    (relative path; run this from a checkout of the repository)
kubectl apply -f config/samples/rca_v1alpha1_rcaagent.yaml
# 3. Verify the operator is running
kubectl get pods -n rca-system
# 4. Check incident reports
kubectl get incidentreport -A
📄 Full Installation Guide · ⚡ 5-min Quick Start · ✅ Prerequisites
Community

Built in the open.
Shaped by the community.

RCA Operator is MIT licensed and actively developed. Every contribution — bug reports, docs, playbooks, or code — moves us closer to production-ready.

🐛
Report Issues & Request Features
Found a bug? Have an idea for a new analyzer, integration, or workflow improvement? Open an issue on GitHub — everything is triaged and prioritized transparently.
Open an Issue →
🧑‍💻
Contribute Code or Docs
Read CONTRIBUTING.md and CODE_OF_CONDUCT.md, fork the repo, create a feature branch, run make test, and open a Pull Request.
Contributing Guide →
💬
Join Discussions
Use GitHub Discussions to ask questions, share your setup, propose architecture ideas, or just let us know how RCA Operator is working for you.
Join Discussions →
Star & Share
The best way to support the project is to give it a star on GitHub and share it with your SRE, platform engineering, and DevOps communities. It genuinely helps.
Star on GitHub →
Get started today

Stop waking up to alerts.
Start waking up to answers.

RCA Operator is free, open source, and ready to deploy into your cluster today. Your on-call rotation will thank you.

Deploy Now → View on GitHub