Autonomous AI-Powered SRE Operator for Kubernetes

Your cluster never sleeps. Neither does RCA Operator.

An open-source Kubernetes operator that watches your namespaces 24x7, correlates signals across pods, nodes, and services, and autonomously performs root cause analysis the moment an incident occurs.

Go 1.22+ · Kubernetes 1.26+ · Apache 2.0 · AI-Powered RCA · kubebuilder
Quick Install · View on GitHub · 📖 Read Docs
rca-operator — live cluster watch
kubectl logs
time="2025-03-17T09:14:01Z" level=info msg="watcher started" namespaces=production,staging
time="2025-03-17T09:14:23Z" level=warn  msg="pod OOMKilled" pod=api-server-7d9f6c exit=137
time="2025-03-17T09:14:24Z" level=warn  msg="node memory pressure" node=worker-2
time="2025-03-17T09:14:24Z" level=info msg="correlating signals" events=4 window=30s
time="2025-03-17T09:14:25Z" level=info msg="RCA complete" cause=OOMKilled confidence=0.94
time="2025-03-17T09:14:25Z" level=info msg="IncidentReport created" name=incident-4f9c2
time="2025-03-17T09:14:26Z" level=info msg="Slack notification sent" channel=#incidents
 
apiVersion: rca.operator.io/v1alpha1
kind: IncidentReport
metadata:
  name: incident-4f9c2
  namespace: production
status:
  phase: Resolved
  rootCause:
    type: OOMKilled
    summary: "No memory limits set on api-server"
    confidence: 0.94
    suggestion: "Set resources.limits.memory in Deployment"
  affectedPods: [api-server-7d9f6c]
---
apiVersion: rca.operator.io/v1alpha1
kind: RCAAgent
metadata:
  name: production-agent
spec:
  watchNamespaces: [production, staging]
  autonomyLevel: NotifyOnly # NotifyOnly | AutoRemediate
  aiConfig:
    provider: openai
    secretRef: rca-agent-openai-secret
  notifications:
    slack: "#incidents"
    pagerduty: enabled
How it works

From incident to root cause in under 30 seconds

The traditional SRE loop takes minutes or hours. RCA Operator collapses it to seconds using real-time signal correlation and AI-assisted analysis.

01
🔭
Watch
Subscribes to Kubernetes events, pod lifecycle changes, and metrics across watched namespaces 24x7 with zero missed events.
02
🔗
Correlate
Builds a real-time dependency graph and correlates signals across pods, nodes, services, and HPAs within a configurable time window.
03
🧠
Analyze
Rule-based heuristics, combined with AI/LLM analysis (OpenAI), surface the root cause with a confidence score and evidence chain.
04
📣
Act
Creates an IncidentReport CR, notifies via Slack/PagerDuty, and optionally triggers autonomous remediation.
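To make the Analyze step concrete, here is a minimal rule-based sketch in Go. It is illustrative only: the `Signal` and `RootCause` types, the `analyze` function, the signal kind strings, and the confidence values are assumptions made for this example, not the operator's actual API or scoring.

```go
package main

import "fmt"

// Signal is a hypothetical, simplified observation emitted by the Watch step.
type Signal struct {
	Kind     string // e.g. "PodOOMKilled", "NodeMemoryPressure" (assumed names)
	Resource string // e.g. "pod/api-server-7d9f6c"
}

// RootCause is a hypothetical analysis result with its evidence chain.
type RootCause struct {
	Type       string
	Confidence float64
	Evidence   []string
}

// analyze applies one hard-coded heuristic: an OOMKilled pod is treated as the
// likely root cause, and node memory pressure among the same signals raises
// the confidence. The numbers 0.7 and 0.9 are made up for illustration.
func analyze(signals []Signal) (RootCause, bool) {
	var oom, pressure []string
	for _, s := range signals {
		switch s.Kind {
		case "PodOOMKilled":
			oom = append(oom, s.Resource)
		case "NodeMemoryPressure":
			pressure = append(pressure, s.Resource)
		}
	}
	if len(oom) == 0 {
		return RootCause{}, false // no rule matched
	}
	rc := RootCause{Type: "OOMKilled", Confidence: 0.7, Evidence: oom}
	if len(pressure) > 0 {
		rc.Confidence = 0.9
		rc.Evidence = append(rc.Evidence, pressure...)
	}
	return rc, true
}

func main() {
	rc, ok := analyze([]Signal{
		{Kind: "PodOOMKilled", Resource: "pod/api-server-7d9f6c"},
		{Kind: "NodeMemoryPressure", Resource: "node/worker-2"},
	})
	fmt.Println(ok, rc.Type, rc.Confidence, rc.Evidence)
}
```

A real analyzer would presumably combine many such rules and hand ambiguous cases to the LLM; this sketch only shows how a rule can yield a cause, a score, and the evidence behind it.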
Capabilities

Production-grade features, built from day one

24x7 Real-time Watching
Continuously monitors pod events, node conditions, and resource states. Uses Kubernetes informers (watch-based, no polling) for near-real-time detection.
🧠
AI-Powered RCA Engine
Integrates with OpenAI to analyze incidents beyond simple rules. Provides natural-language summaries, confidence scores, and remediation suggestions.
🔗
Signal Correlation
Cross-references events across pods, nodes, HPAs, and services to find root causes, not just surface symptoms. The dependency graph is built on the fly.
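One way to picture time-window correlation is the Go sketch below. It assumes a simple grouping rule (events within a fixed window of the first event in a group belong together); the `Event` type, `correlate`, `at`, and `defaultWindow` are hypothetical names for this example, and the operator's real correlator may work differently.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Event is a hypothetical timestamped signal from any watched resource.
type Event struct {
	At     time.Time
	Source string
}

// defaultWindow mirrors the 30s window shown in the sample logs.
var defaultWindow = 30 * time.Second

// at returns a timestamp sec seconds after a fixed base time (demo helper).
func at(sec int) time.Time {
	base := time.Date(2025, 3, 17, 9, 14, 23, 0, time.UTC)
	return base.Add(time.Duration(sec) * time.Second)
}

// correlate sorts events by time and groups them: an event joins the current
// group if it falls within window of that group's first event; otherwise it
// starts a new group.
func correlate(events []Event, window time.Duration) [][]Event {
	sorted := append([]Event(nil), events...) // copy so the input is not reordered
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].At.Before(sorted[j].At) })

	var groups [][]Event
	for _, e := range sorted {
		n := len(groups)
		if n > 0 && e.At.Sub(groups[n-1][0].At) <= window {
			groups[n-1] = append(groups[n-1], e)
			continue
		}
		groups = append(groups, []Event{e})
	}
	return groups
}

func main() {
	events := []Event{
		{At: at(1), Source: "node/worker-2"},
		{At: at(0), Source: "pod/api-server-7d9f6c"},
		{At: at(600), Source: "pod/unrelated"},
	}
	for i, g := range correlate(events, defaultWindow) {
		fmt.Printf("group %d: %d event(s), first from %s\n", i, len(g), g[0].Source)
	}
}
```

Anchoring the window to the first event of a group keeps bursts bounded; a sliding window anchored to the latest event is an equally plausible choice and would merge long chains of closely spaced events.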
🎯
Configurable Autonomy Levels
From NotifyOnly to AutoRemediate — you choose how autonomous the operator is. Start cautious, graduate to full automation.
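The autonomy setting can be thought of as a gate in front of remediation. The Go sketch below assumes only the two level names that appear on this page; the `Decision` type, the `decide` function, and the confidence threshold are hypothetical additions for illustration.

```go
package main

import "fmt"

// AutonomyLevel uses the two values named on this page.
type AutonomyLevel string

const (
	NotifyOnly    AutonomyLevel = "NotifyOnly"
	AutoRemediate AutonomyLevel = "AutoRemediate"
)

// Decision is a hypothetical result: what the operator is allowed to do.
type Decision struct {
	Notify    bool
	Remediate bool
}

// decide always notifies, and allows remediation only when the level is
// AutoRemediate and the analysis confidence meets minConfidence.
// The threshold parameter is an assumption, not a documented setting.
func decide(level AutonomyLevel, confidence, minConfidence float64) Decision {
	d := Decision{Notify: true}
	if level == AutoRemediate && confidence >= minConfidence {
		d.Remediate = true
	}
	return d
}

func main() {
	fmt.Printf("%+v\n", decide(NotifyOnly, 0.94, 0.9))
	fmt.Printf("%+v\n", decide(AutoRemediate, 0.94, 0.9))
	fmt.Printf("%+v\n", decide(AutoRemediate, 0.50, 0.9))
}
```

Defaulting to notification and treating remediation as an explicit opt-in matches the "start cautious" advice above.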
📋
Native K8s CRDs
RCAAgent and IncidentReport are first-class Kubernetes objects. Full audit trail, queryable via kubectl, GitOps-compatible out of the box.
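For readers who think in Go types, the sketch below mirrors the IncidentReport status fields shown in the sample YAML on this page. The struct and function names are assumptions; the project's real API types (and any kubebuilder markers on them) may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RootCause mirrors the rootCause block of the sample IncidentReport.
type RootCause struct {
	Type       string  `json:"type"`
	Summary    string  `json:"summary"`
	Confidence float64 `json:"confidence"`
	Suggestion string  `json:"suggestion"`
}

// IncidentReportStatus mirrors the status block of the sample IncidentReport.
// Field names follow the YAML; the Go type names are hypothetical.
type IncidentReportStatus struct {
	Phase        string    `json:"phase"`
	RootCause    RootCause `json:"rootCause"`
	AffectedPods []string  `json:"affectedPods"`
}

// toJSON serializes a status the way it would travel over the Kubernetes API,
// which exchanges objects as JSON.
func toJSON(s IncidentReportStatus) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := toJSON(IncidentReportStatus{
		Phase: "Resolved",
		RootCause: RootCause{
			Type:       "OOMKilled",
			Summary:    "No memory limits set on api-server",
			Confidence: 0.94,
			Suggestion: "Set resources.limits.memory in Deployment",
		},
		AffectedPods: []string{"api-server-7d9f6c"},
	})
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(out)
}
```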
📣
Rich Alert Enrichment
Slack and PagerDuty notifications include the full root cause, affected resources, confidence score, and suggested fix — not just "something is broken."
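As a rough illustration of an enriched alert, the Go sketch below formats the fields listed above into a message and wraps it in the `{"text": ...}` JSON body that Slack incoming webhooks accept. The `Incident` type, `formatAlert`, `slackPayload`, and the message layout are assumptions; the operator's actual notification format is not shown on this page.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Incident is a hypothetical, flattened view of an IncidentReport.
type Incident struct {
	Cause      string
	Summary    string
	Confidence float64
	Affected   []string
	Suggestion string
}

// formatAlert renders root cause, confidence, affected resources, and the
// suggested fix as plain text. The layout is invented for this example.
func formatAlert(i Incident) string {
	return fmt.Sprintf("[%s] %s (confidence %.2f)\nAffected: %s\nSuggested fix: %s",
		i.Cause, i.Summary, i.Confidence, strings.Join(i.Affected, ", "), i.Suggestion)
}

// slackPayload builds a JSON body with a single "text" field.
// Nothing is sent over the network in this sketch.
func slackPayload(i Incident) ([]byte, error) {
	return json.Marshal(map[string]string{"text": formatAlert(i)})
}

func main() {
	body, err := slackPayload(Incident{
		Cause:      "OOMKilled",
		Summary:    "No memory limits set on api-server",
		Confidence: 0.94,
		Affected:   []string{"api-server-7d9f6c"},
		Suggestion: "Set resources.limits.memory in Deployment",
	})
	if err != nil {
		fmt.Println("payload failed:", err)
		return
	}
	fmt.Println(string(body))
}
```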
🔐
RBAC Native & Secure
Runs with least-privilege permissions. Supports multi-tenant clusters with namespace-scoped RCAAgents. Security policy is documented and auditable.
🏗️
GitOps Friendly
All configuration via CRDs — fits naturally into Flux, ArgoCD, and any GitOps workflow. Helm chart included for easy deployment.
🛠️
Built on kubebuilder
Leverages kubebuilder and controller-runtime — the same foundation as Crossplane, Cluster API, and cert-manager. Production-grade controller patterns from day one.
Architecture

Designed for reliability at every layer

Built on battle-tested controller-runtime patterns. Each component is independently testable and replaceable.

rca-operator system architecture (diagram)
Kubernetes Cluster
RCA Operator
  Watcher Layer: Informers · Events
  Correlator: Graph · Triage
  RCA Engine: Rules + AI/LLM · Confidence Scoring
  Incident Manager: IncidentReport CRD
  Remediation: Playbooks · Runbooks
  Decision Engine: Autonomy Level
  Reporting & Notification Layer: Slack · PagerDuty · Webhooks · K8s Events
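The "independently testable and replaceable" claim can be illustrated with small Go interfaces. The sketch below is a guess at the shape of such a design, not the project's code: `Analyzer`, `Notifier`, `Pipeline`, and the two stand-in implementations are hypothetical names.

```go
package main

import "fmt"

// Analyzer is a hypothetical interface for the RCA Engine stage.
type Analyzer interface {
	Analyze(signals []string) (cause string, confidence float64)
}

// Notifier is a hypothetical interface for the notification layer.
type Notifier interface {
	Notify(message string) error
}

// Pipeline wires stages together through interfaces, so each stage can be
// swapped (for example a fake notifier in tests, Slack in production).
type Pipeline struct {
	Analyzer Analyzer
	Notifier Notifier
}

// Handle runs analysis and forwards the result to the notifier.
func (p Pipeline) Handle(signals []string) error {
	cause, confidence := p.Analyzer.Analyze(signals)
	return p.Notifier.Notify(fmt.Sprintf("cause=%s confidence=%.2f", cause, confidence))
}

// staticAnalyzer is a test double that always returns the same result.
type staticAnalyzer struct {
	cause      string
	confidence float64
}

func (a staticAnalyzer) Analyze([]string) (string, float64) { return a.cause, a.confidence }

// recordingNotifier is a test double that records messages instead of sending.
type recordingNotifier struct{ sent []string }

func (n *recordingNotifier) Notify(m string) error {
	n.sent = append(n.sent, m)
	return nil
}

func main() {
	n := &recordingNotifier{}
	p := Pipeline{Analyzer: staticAnalyzer{cause: "OOMKilled", confidence: 0.94}, Notifier: n}
	if err := p.Handle([]string{"pod OOMKilled"}); err != nil {
		fmt.Println("handle failed:", err)
		return
	}
	fmt.Println(n.sent)
}
```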
Roadmap

From foundation to production-ready

A clear, phased roadmap from MVP to a CNCF-ready, multi-cluster operator with full web UI.

v0.1 · In Progress · Foundation: CRDs, Pod watcher, event correlator, Slack & PagerDuty notifications, Helm chart
v0.2 · Planned · RCA Engine: Rule-based + AI/LLM analyzer, evidence gathering, confidence scoring, natural-language summaries
v0.3 · Planned · Autonomous Remediation: Configurable autonomy levels, built-in playbooks, custom runbooks, safe rollback
v0.4 · Planned · Observability: Auto post-mortems, Grafana dashboards, SLO tracking, Prometheus metrics
v1.0 · Planned · Production Ready: OLM / OperatorHub listing, multi-cluster support, web UI, CNCF sandbox proposal
Quick Start

Deployed in minutes, not hours

Choose kubectl or Helm. Both paths have your operator watching your cluster in under 5 minutes.

install.sh
# 1. Install CRDs and operator
kubectl apply -f https://github.com/gaurangkudale/rca-operator/releases/latest/download/install.yaml
# 2. Create OpenAI secret (replace the placeholder with your key)
kubectl create secret generic rca-agent-openai-secret \
  --from-literal=apiKey="<YOUR_OPENAI_KEY>" -n default
# 3. Apply a minimal RCAAgent (path is relative to a checkout of the repository)
kubectl apply -f config/samples/rca_v1alpha1_rcaagent.yaml
# 4. Verify the operator is running
kubectl get pods -n rca-system
# NAME                       READY  STATUS   AGE
# rca-operator-6b9f5d-xk2pq  1/1    Running  12s
helm-install.sh
# 1. Add the Helm repository
helm repo add rca-operator https://rca-operator.github.io/helm-charts
helm repo update
# 2. Install with your OpenAI key (replace the placeholder with your key)
helm install rca-operator rca-operator/rca-operator \
  --namespace rca-system --create-namespace \
  --set ai.apiKey="<YOUR_OPENAI_KEY>"
# 3. Check rollout status
helm status rca-operator -n rca-system
# STATUS: deployed
📄 Full Installation Guide · ⚡ 5-min Quick Start · ✅ Prerequisites
Documentation

Everything you need to get productive

Community

Built in the open. Shaped by the community.

RCA Operator is Apache 2.0 licensed and actively developed. Every contribution — bug reports, docs, playbooks, or code — moves us closer to production-ready.

🐛
Report Issues & Request Features
Found a bug? Have an idea for a new analyzer, integration, or workflow improvement? Open an issue on GitHub — everything is triaged and prioritized transparently.
Open an Issue →
🧑‍💻
Contribute Code or Docs
Read CONTRIBUTING.md and CODE_OF_CONDUCT.md, fork the repo, create a feature branch, run make test, and open a Pull Request.
Contributing Guide →
💬
Join Discussions
Use GitHub Discussions to ask questions, share your setup, propose architecture ideas, or just let us know how RCA Operator is working for you.
Join Discussions →
Star & Share
The best way to support the project is to give it a star on GitHub and share it with your SRE, platform engineering, and DevOps communities. It genuinely helps.
Star on GitHub →
Get started today

Stop waking up to alerts. Start waking up to answers.

RCA Operator is free, open source, and ready to deploy into your cluster today. Your on-call rotation will thank you.

Deploy Now → · View on GitHub