Kubernetes-Native Incident Correlation Operator — Phase 2 Released

Your cluster never sleeps. Neither does RCA Operator.

RCA Operator is an open-source Kubernetes operator that correlates Kubernetes events, OpenTelemetry traces, and logs into durable IncidentReport records, enriches them with service topology, and serves a built-in dashboard for incident review.

Go 1.25+ · Kubernetes 1.26+ · MIT License · Phase 2 · OpenTelemetry · Jaeger topology
Quick Install · View on GitHub · Read Docs
rca-operator: Phase 2 signal pipeline
time="2026-04-30T07:12:01Z" level=info msg="otel receiver started" endpoint=0.0.0.0:4317
time="2026-04-30T07:12:08Z" level=warn  msg="span translated" service=checkout status=ERROR
time="2026-04-30T07:12:09Z" level=warn  msg="log translated" severity=ERROR trace_id=6f8c9b2a
time="2026-04-30T07:12:10Z" level=info msg="correlation rule fired" rule=checkout-latency-spike
time="2026-04-30T07:12:12Z" level=info msg="incident graph built" nodes=8 edges=11
time="2026-04-30T07:12:13Z" level=info msg="IncidentReport updated" name=checkout-6f8c9
time="2026-04-30T07:12:14Z" level=info msg="trace detail available" source=jaeger-query
 
IncidentReport
apiVersion: rca.rca-operator.tech/v1alpha1
kind: IncidentReport
metadata:
  name: checkout-6f8c9
  namespace: production
spec:
  fingerprint: otel|checkout|latency-spike
  incidentType: ServiceLatencySpike
status:
  phase: Active
  severity: P2 High
  traceIDs: ["6f8c9b2a", "ac04e1d7"]
  firedRule: checkout-latency-spike
  signalCount: 12
  incidentGraph: "{ nodes: 8, edges: 11 }"
  summary: "Checkout latency correlated with payment errors"
Helm profile
$ helm install rca-operator rca-operator/rca-operator \
  --namespace rca-system --create-namespace \
  --values helm/values-full.yaml
 
includes:
  - RCA Operator control plane
  - OpenTelemetry Operator
  - OpenTelemetryCollector CR
  - Instrumentation CRs
  - Jaeger trace storage and query
  - Dashboard topology and trace detail endpoints
How it works

From telemetry signal to incident evidence with topology context

Phase 2 connects Kubernetes-native signals with OpenTelemetry traces and logs, then persists the correlated evidence in IncidentReport CRs.

01
Collect
Informer-backed collectors watch Kubernetes resources while the OTLP receiver accepts spans and logs from instrumented services.
02
Translate
The OTLP translator turns ResourceSpans and ResourceLogs into normalized correlator events with service, severity, trace, and attribute context.
03
Correlate
Rules evaluate event types and attributes, coalesce duplicate signals, preserve trace IDs, and attach the fired rule to the incident.
04
Enrich
Jaeger and Kubernetes resource lookups build service and incident topology for the dashboard, trace modal, and responder workflow.
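The Translate and Correlate steps above can be sketched in Go roughly as follows. This is a minimal illustration, not the operator's actual code: the Event, LogRecord, and Rule types, the "otel.log." event-type prefix, the "[REDACTED]" placeholder, and the exact-match attribute predicates are all assumptions made for the example.

```go
package main

import (
	"regexp"
	"strings"
)

// Event is a hypothetical normalized correlator event; the real type in
// RCA Operator may differ.
type Event struct {
	Type       string // assumed naming scheme, e.g. "otel.log.error"
	Service    string
	Severity   string
	TraceID    string
	Message    string
	Attributes map[string]string
}

// LogRecord is a simplified stand-in for one OTLP log record.
type LogRecord struct {
	Service  string
	Severity string
	TraceID  string
	Body     string
	Attrs    map[string]string
}

// Rule is a hypothetical correlation rule. It fires when the event type is
// listed and every attribute predicate matches exactly.
type Rule struct {
	Name       string
	EventTypes []string
	Attributes map[string]string
}

// CompilePatterns compiles redaction patterns once, at configuration time.
func CompilePatterns(patterns []string) ([]*regexp.Regexp, error) {
	out := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, err
		}
		out = append(out, re)
	}
	return out, nil
}

// TranslateLog converts a log record into an Event, masking any text that
// matches a redaction pattern before the message is stored.
func TranslateLog(rec LogRecord, redact []*regexp.Regexp) Event {
	msg := rec.Body
	for _, re := range redact {
		msg = re.ReplaceAllString(msg, "[REDACTED]")
	}
	return Event{
		Type:       "otel.log." + strings.ToLower(rec.Severity),
		Service:    rec.Service,
		Severity:   rec.Severity,
		TraceID:    rec.TraceID,
		Message:    msg,
		Attributes: rec.Attrs,
	}
}

// Matches reports whether the rule applies to the event.
func (r Rule) Matches(ev Event) bool {
	typeOK := false
	for _, t := range r.EventTypes {
		if t == ev.Type {
			typeOK = true
			break
		}
	}
	if !typeOK {
		return false
	}
	for key, want := range r.Attributes {
		if ev.Attributes[key] != want {
			return false
		}
	}
	return true
}
```

Redacting inside the translator, before a rule ever sees the message, keeps sensitive text out of every later stage, which is the ordering the page describes for log translation.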
Phase 2 — Released Architecture

Trace-aware incident correlation, still Kubernetes-native

OT
OpenTelemetry Ingest
OTLP spans and logs are translated into correlator events with error status, latency spikes, severity, attributes, and trace IDs preserved.
CR
Correlation Rule Engine
RCACorrelationRule supports event types, attribute predicates, trace matching, priorities, and templated incident summaries.
TG
Incident Topology Graph
IncidentReports can store topology graphs with Kubernetes resources and service-call edges enriched from Jaeger trace data.
IR
Durable IncidentReport CRDs
IncidentReport now includes trace IDs, fired rule metadata, severity labels, evidence counts, and serialized graph data.
UI
Dashboard With Trace Detail
The embedded dashboard adds topology tabs, service graph views, traffic statistics, inline Jaeger trace detail, and clearer severity presentation.
JG
Jaeger Integration
The graph builder and dashboard trace APIs query Jaeger with bounded lookback and timeout handling for predictable incident enrichment.
RB
RBAC and NetworkPolicy Templates
Helm templates include dedicated OpenTelemetry Collector RBAC, ingest services, instrumentation namespace handling, and network policy resources.
DD
Signal Deduplication
Configurable buffer sizes and deduplication windows coalesce repeated OTLP signals before they create noisy incident updates.
HP
Helm Install Profiles
Use minimal, full, production, development, or external-observability values depending on how much of the observability stack you want bundled.
SG
Service Graph Layout
Service topology uses namespace-aware graph building and improved layout logic so dependency paths are easier to scan during incident review.
PI
PII Redaction
Log translation supports configured redaction patterns before messages enter the correlation and incident evidence pipeline.
TS
Phase 2 Test Coverage
Translator, rule engine, dashboard handlers, graph builders, signal buffering, webhooks, and golden Phase 2 flows are covered by focused tests.
HA
Leader Election & HA
Leader election supports high-availability deployments. A single active incident writer keeps deduplication correct, and the operator recovers safely at startup from existing CRs.
RT
Incident Retention & Cleanup
Resolved incident reports and stale incident graph data can be pruned based on configured retention policy.
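To make the Signal Deduplication feature listed above concrete, here is a minimal Go sketch of a deduplication window. It is an assumption-laden illustration rather than the operator's buffer implementation: the Deduper type, its method names, and the fixed-window policy (the window opens when a key is first admitted) are hypothetical, and the key could be any fingerprint string such as the `otel|checkout|latency-spike` value shown in the sample IncidentReport.

```go
package main

import (
	"sync"
	"time"
)

// Deduper suppresses signals whose key was already admitted inside the
// current window. Hypothetical helper, not the operator's actual buffer.
type Deduper struct {
	mu     sync.Mutex
	window time.Duration
	seen   map[string]time.Time // key -> time the current window opened
}

// NewDeduper returns a Deduper with the given window length.
func NewDeduper(window time.Duration) *Deduper {
	return &Deduper{window: window, seen: make(map[string]time.Time)}
}

// Admit returns true when the signal should be forwarded, and false when
// it repeats a key that was admitted less than one window ago.
func (d *Deduper) Admit(key string, now time.Time) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if opened, ok := d.seen[key]; ok && now.Sub(opened) < d.window {
		return false
	}
	d.seen[key] = now // open a new window for this key
	return true
}

// Prune drops keys whose window has expired so the map stays bounded.
func (d *Deduper) Prune(now time.Time) {
	d.mu.Lock()
	defer d.mu.Unlock()
	for key, opened := range d.seen {
		if now.Sub(opened) >= d.window {
			delete(d.seen, key)
		}
	}
}
```

Passing `now` explicitly instead of calling `time.Now()` inside Admit keeps the window logic deterministic in tests. A sliding window, where every repeat extends the suppression, would be an equally valid policy; this sketch uses the fixed variant only because it is simpler to reason about.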
Phase 2

Phase 2 architecture is released

Phase 2 expands RCA Operator with OpenTelemetry resources, trace-aware correlation, topology graphs, dashboard trace detail, and production-ready Helm profiles.

Released OT
OTLP Translator
ResourceSpans and ResourceLogs are converted into correlator events with service, trace, severity, latency, and attribute context.
Released TG
Topology Visualization
Incident and service graphs show Kubernetes resources, service dependencies, and trace-backed edges inside the embedded dashboard.
Released OC
OpenTelemetry Resources
Helm templates add OpenTelemetryCollector, Instrumentation, RBAC, services, and network policy support for opt-in telemetry.
Released JD
Jaeger Trace Detail
Dashboard endpoints fetch bounded Jaeger trace details and render inline trace evidence alongside incident and service topology.
Released HP
Helm Profiles
Minimal, full, dev, production, and external-observability values make the install path clearer for different cluster setups.
Released DD
Signal Deduplication
Configurable signal buffers and deduplication windows reduce repeated OTLP noise before it reaches incident lifecycle handling.
Released TS
Phase 2 Tests
Golden tests, translator tests, graph tests, dashboard handler tests, and webhook coverage make the new architecture reviewable.
Released PR
Production Hardening
Timeout handling, lookback limits, namespace resolution, and prerelease verification scripts support safer installs.
Architecture

Phase 2 architecture: telemetry plus Kubernetes context

A controller-runtime operator with Kubernetes collectors, OTLP ingest, correlation rules, Jaeger trace enrichment, CR-backed incident state, and a dashboard for evidence review.

rca-operator Phase 2 architecture
Kubernetes Cluster
RCA Operator: controller-runtime Manager · OTel ingest · leader election
Kubernetes API Server
Signal Ingest: k8s events · spans · logs · OTLP translator · dedup · PII redaction
Dashboard API Server: incidents · topology · trace detail modal · service graph views
Correlation Engine: rules · attributes · trace IDs · incident lifecycle · coalescing · graph builder · Jaeger enrichment
IncidentReport CRD
trace IDs · rules · graph
OpenTelemetry Stack
Collector · Instrumentation · Jaeger
Dashboard UI
incidents · topology · traces
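As a rough illustration of how a graph builder can turn trace data into service-call edges, the Go sketch below derives caller-to-callee edges from parent/child span relationships. The Span and Edge types and the ServiceEdges function are hypothetical and simplified; the operator's own graph builder, and the span shape it gets from Jaeger, may differ.

```go
package main

import "sort"

// Span is a simplified trace span; field names are illustrative.
type Span struct {
	SpanID   string
	ParentID string // empty for a root span
	Service  string
}

// Edge is a directed service call with the number of spans that back it.
type Edge struct {
	From, To string
	Calls    int
}

// ServiceEdges derives caller -> callee edges. A span whose parent belongs
// to a different service counts as one call between the two services.
func ServiceEdges(spans []Span) []Edge {
	owner := make(map[string]string, len(spans)) // span ID -> service
	for _, s := range spans {
		owner[s.SpanID] = s.Service
	}

	counts := make(map[[2]string]int)
	for _, s := range spans {
		parent, ok := owner[s.ParentID]
		if !ok || parent == s.Service {
			continue // root span, missing parent, or in-process child
		}
		counts[[2]string{parent, s.Service}]++
	}

	edges := make([]Edge, 0, len(counts))
	for pair, n := range counts {
		edges = append(edges, Edge{From: pair[0], To: pair[1], Calls: n})
	}
	// Sort for stable output, since map iteration order is random.
	sort.Slice(edges, func(i, j int) bool {
		if edges[i].From != edges[j].From {
			return edges[i].From < edges[j].From
		}
		return edges[i].To < edges[j].To
	})
	return edges
}
```

Spans whose parent is missing from the batch are skipped rather than guessed at, so a partially fetched trace yields fewer edges instead of wrong ones.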
Roadmap

From foundation to trace-aware RCA

A phased roadmap that keeps Kubernetes CRDs as the source of truth while adding telemetry-driven incident context.

v0.0.15 · Released · Phase 1 Foundation: Kubernetes collectors, incident lifecycle, IncidentReport CRD, Slack & PagerDuty notifications, built-in dashboard, Helm chart
v0.0.16 · Released · Phase 2, OTLP Correlation & Topology: OpenTelemetry Collector and Instrumentation resources, OTLP translator, service topology, incident graph persistence, Jaeger trace detail, Helm profiles, and expanded tests
v0.2.0 · Planned · Operational Hardening: broader integrations, Grafana-ready outputs, SLO workflows, richer post-incident reporting, and production validation
v1.0 · Planned · Production Ready: multi-cluster support, refined web UI, remediation hooks, OLM / OperatorHub packaging, and CNCF sandbox preparation
Quick Start

Deployed in minutes, not hours

Use Helm for the current chart and Phase 2 values profiles. Minimal installs run the control plane; full installs include OpenTelemetry and Jaeger components.

Helm (recommended)
helm-install.sh
# 1. Add the RCA Operator Helm repository
helm repo add rca-operator https://gaurangkudale.github.io/rca-operator.github.io
helm repo update
# 2. Install the minimal operator profile
helm install rca-operator rca-operator/rca-operator \
  --namespace rca-system \
  --create-namespace \
  --values helm/values-minimal.yaml
# 3. To bundle the OTel Operator, Collector, Instrumentation, and Jaeger, use helm/values-full.yaml instead
kubectl
kubectl-install.sh
# 1. Install CRDs and operator from a release manifest
kubectl apply -f https://github.com/gaurangkudale/RCA-Operator/releases/download/v0.0.16/install.yaml
# 2. Apply an RCAAgent and correlation rules
kubectl apply -f config/samples/rca_v1alpha1_rcaagent.yaml
# 3. Verify the operator is running
kubectl get pods -n rca-system
# 4. Check incident reports
kubectl get incidentreport -A
Helm Installation Guide · Quick Start · Phase 2 Release Notes
Community

Built in the open. Shaped by the community.

RCA Operator is MIT licensed and actively developed. Every contribution, whether a bug report, docs, a playbook, or code, moves the project closer to production readiness.

ISS
Report Issues & Request Features
Found a bug? Have an idea for a new analyzer, integration, or workflow improvement? Open an issue on GitHub — everything is triaged and prioritized transparently.
Open an Issue →
PR
Contribute Code or Docs
Read CONTRIBUTING.md and CODE_OF_CONDUCT.md, fork the repo, create a feature branch, run make test, and open a Pull Request.
Contributing Guide →
DISC
Join Discussions
Use GitHub Discussions to ask questions, share your setup, propose architecture ideas, or just let us know how RCA Operator is working for you.
Join Discussions →
GH
Star & Share
The best way to support the project is to give it a star on GitHub and share it with your SRE, platform engineering, and DevOps communities. It genuinely helps.
Star on GitHub →
Get started today

Bring incident evidence back into Kubernetes.

RCA Operator is free, open source, and designed for platform teams that want trace-aware RCA without leaving the Kubernetes control plane.

Deploy Now → View on GitHub