Kubernetes-Native Incident Correlation Operator — Phase 2 Released

Your cluster never sleeps. Neither does RCA Operator.

RCA Operator is an open-source Kubernetes operator that correlates Kubernetes events, OpenTelemetry traces, and logs into durable IncidentReport records, enriches them with service topology, and serves a built-in dashboard for incident review.

Go 1.25+ · Kubernetes 1.26+ · MIT License · Phase 2 · OpenTelemetry · Jaeger topology

Quick Install · View on GitHub · Read Docs

signal pipeline

time="2026-04-30T07:12:01Z" level=info msg="otel receiver started" endpoint=0.0.0.0:4317
time="2026-04-30T07:12:08Z" level=warn msg="span translated" service=checkout status=ERROR
time="2026-04-30T07:12:09Z" level=warn msg="log translated" severity=ERROR trace_id=6f8c9b2a
time="2026-04-30T07:12:10Z" level=info msg="correlation rule fired" rule=checkout-latency-spike
time="2026-04-30T07:12:12Z" level=info msg="incident graph built" nodes=8 edges=11
time="2026-04-30T07:12:13Z" level=info msg="IncidentReport updated" name=checkout-6f8c9
time="2026-04-30T07:12:14Z" level=info msg="trace detail available" source=jaeger-query

IncidentReport

apiVersion: rca.rca-operator.tech/v1alpha1
kind: IncidentReport
metadata:
  name: checkout-6f8c9
  namespace: production
spec:
  fingerprint: otel|checkout|latency-spike
  incidentType: ServiceLatencySpike
status:
  phase: Active
  severity: P2 High
  traceIDs: ["6f8c9b2a", "ac04e1d7"]
  firedRule: checkout-latency-spike
  signalCount: 12
  incidentGraph: "{ nodes: 8, edges: 11 }"
  summary: "Checkout latency correlated with payment errors"

Helm profile

$ helm install rca-operator rca-operator/rca-operator \
    --namespace rca-system --create-namespace \
    --values helm/values-full.yaml

includes:
- RCA Operator control plane
- OpenTelemetry Operator
- OpenTelemetryCollector CR
- Instrumentation CRs
- Jaeger trace storage and query
- Dashboard topology and trace detail endpoints

How it works

From telemetry signal to incident evidence with topology context

Phase 2 connects Kubernetes-native signals with OpenTelemetry traces and logs, then persists the correlated evidence in IncidentReport CRs.

Collect

Informer-backed collectors watch Kubernetes resources while the OTLP receiver accepts spans and logs from instrumented services.

Translate

The OTLP translator turns ResourceSpans and ResourceLogs into normalized correlator events with service, severity, trace, and attribute context.

Correlate

Rules evaluate event types and attributes, coalesce duplicate signals, preserve trace IDs, and attach the fired rule to the incident.

Enrich

Jaeger and Kubernetes resource lookups build service and incident topology for the dashboard, trace modal, and responder workflow.
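The Correlate step above can be illustrated with a minimal Go sketch of rule matching. The `Event` and `Rule` types, their fields, and the "highest priority wins" selection are assumptions made for illustration; they are not the operator's actual types or the RCACorrelationRule schema.

```go
package main

// Event is a hypothetical normalized correlator event. The operator's real
// event type is not shown on this page.
type Event struct {
	Type       string // illustrative values such as "LatencySpike"
	Service    string
	TraceID    string
	Attributes map[string]string
}

// Rule is a hypothetical, simplified stand-in for an RCACorrelationRule.
type Rule struct {
	Name       string
	EventTypes []string          // the rule applies to any of these event types
	Attributes map[string]string // every key/value pair must match exactly
	Priority   int
}

// Matches reports whether the event satisfies both the event-type predicate
// and all attribute predicates of the rule.
func (r Rule) Matches(e Event) bool {
	typeOK := false
	for _, t := range r.EventTypes {
		if t == e.Type {
			typeOK = true
			break
		}
	}
	if !typeOK {
		return false
	}
	for key, want := range r.Attributes {
		if got, ok := e.Attributes[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// FirstMatch returns the matching rule with the highest priority, if any.
// Picking by priority is an assumption of this sketch.
func FirstMatch(rules []Rule, e Event) (Rule, bool) {
	var best Rule
	found := false
	for _, r := range rules {
		if r.Matches(e) && (!found || r.Priority > best.Priority) {
			best, found = r, true
		}
	}
	return best, found
}
```

In this sketch, the name of the rule returned by `FirstMatch` is what would be recorded on the incident as the fired rule.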

Phase 2 — Released Architecture

Trace-aware incident correlation, still Kubernetes-native

OpenTelemetry Ingest

OTLP spans and logs are translated into correlator events with error status, latency spikes, severity, attributes, and trace IDs preserved.

Correlation Rule Engine

RCACorrelationRule supports event types, attribute predicates, trace matching, priorities, and templated incident summaries.

Incident Topology Graph

IncidentReports can store topology graphs with Kubernetes resources and service-call edges enriched from Jaeger trace data.

Durable IncidentReport CRDs

IncidentReport now includes trace IDs, fired rule metadata, severity labels, evidence counts, and serialized graph data.

Dashboard With Trace Detail

The embedded dashboard adds topology tabs, service graph views, traffic statistics, inline Jaeger trace detail, and clearer severity presentation.

Jaeger Integration

The graph builder and dashboard trace APIs query Jaeger with bounded lookback and timeout handling for predictable incident enrichment.

RBAC and NetworkPolicy Templates

Helm templates include dedicated OpenTelemetry Collector RBAC, ingest services, instrumentation namespace handling, and network policy resources.

Signal Deduplication

Configurable buffer sizes and deduplication windows coalesce repeated OTLP signals before they create noisy incident updates.

Helm Install Profiles

Use minimal, full, production, development, or external-observability values depending on how much of the observability stack you want bundled.

Service Graph Layout

Service topology uses namespace-aware graph building and improved layout logic so dependency paths are easier to scan during incident review.

PII Redaction

Log translation supports configured redaction patterns before messages enter the correlation and incident evidence pipeline.
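Pattern-based redaction of log messages can be sketched in Go with the standard `regexp` package. The `Redactor` type, the `[REDACTED]` placeholder, and the example patterns are assumptions for illustration, not the operator's configuration format.

```go
package main

import "regexp"

// Redactor is a hypothetical holder for compiled redaction patterns.
type Redactor struct {
	patterns []*regexp.Regexp
}

// NewRedactor compiles the configured expressions and returns an error for
// the first invalid one, so bad configuration fails early.
func NewRedactor(exprs []string) (*Redactor, error) {
	r := &Redactor{}
	for _, expr := range exprs {
		re, err := regexp.Compile(expr)
		if err != nil {
			return nil, err
		}
		r.patterns = append(r.patterns, re)
	}
	return r, nil
}

// Redact replaces every match of every pattern with a fixed placeholder
// before the message is handed to later pipeline stages.
func (r *Redactor) Redact(msg string) string {
	for _, re := range r.patterns {
		msg = re.ReplaceAllString(msg, "[REDACTED]")
	}
	return msg
}
```

Compiling the patterns once at startup, instead of per message, avoids repeated regex compilation on the ingest path.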

Phase 2 Test Coverage

Translator, rule engine, dashboard handlers, graph builders, signal buffering, webhooks, and golden Phase 2 flows are covered by focused tests.

Leader Election & HA

Supports high-availability deployments through leader election. A single active incident writer keeps deduplication correct, and the operator recovers state safely from existing CRs on startup.

Incident Retention & Cleanup

Resolved incident reports and stale incident graph data can be pruned based on configured retention policy.

Phase 2

Phase 2 architecture is released

Phase 2 expands RCA Operator with OpenTelemetry resources, trace-aware correlation, topology graphs, dashboard trace detail, and production-ready Helm profiles.

Released

OTLP Translator

ResourceSpans and ResourceLogs are converted into correlator events with service, trace, severity, latency, and attribute context.

Released

Topology Visualization

Incident and service graphs show Kubernetes resources, service dependencies, and trace-backed edges inside the embedded dashboard.

Released

OpenTelemetry Resources

Helm templates add OpenTelemetryCollector, Instrumentation, RBAC, services, and network policy support for opt-in telemetry.

Released

Jaeger Trace Detail

Dashboard endpoints fetch bounded Jaeger trace details and render inline trace evidence alongside incident and service topology.

Released

Helm Profiles

Minimal, full, dev, production, and external-observability values make the install path clearer for different cluster setups.

Released

Signal Deduplication

Configurable signal buffers and deduplication windows reduce repeated OTLP noise before it reaches incident lifecycle handling.

Released

Phase 2 Tests

Golden tests, translator tests, graph tests, dashboard handler tests, and webhook coverage make the new architecture reviewable.

Released

Production Hardening

Timeout handling, lookback limits, namespace resolution, and prerelease verification scripts support safer installs.

Architecture

Phase 2 architecture: telemetry plus Kubernetes context

A controller-runtime operator with Kubernetes collectors, OTLP ingest, correlation rules, Jaeger trace enrichment, CR-backed incident state, and a dashboard for evidence review.

IncidentReport CRD

trace IDs · rules · graph

OpenTelemetry Stack

Collector · Instrumentation · Jaeger

Dashboard UI

incidents · topology · traces

Roadmap

From foundation to trace-aware RCA

A phased roadmap that keeps Kubernetes CRDs as the source of truth while adding telemetry-driven incident context.

v0.0.15 · Released
Phase 1 Foundation: Kubernetes collectors, incident lifecycle, IncidentReport CRD, Slack & PagerDuty notifications, built-in dashboard, Helm chart

v0.0.16 · Released
Phase 2, OTLP Correlation & Topology: OpenTelemetry Collector and Instrumentation resources, OTLP translator, service topology, incident graph persistence, Jaeger trace detail, Helm profiles, and expanded tests

v0.2.0 · Planned
Operational Hardening: broader integrations, Grafana-ready outputs, SLO workflows, richer post-incident reporting, and production validation

v1.0 · Planned
Production Ready: multi-cluster support, refined web UI, remediation hooks, OLM / OperatorHub packaging, and CNCF sandbox preparation

Quick Start

Deployed in minutes, not hours

Use Helm for the current chart and Phase 2 values profiles. Minimal installs run the control plane; full installs include OpenTelemetry and Jaeger components.

Helm (recommended)

helm-install.sh

# 1. Add the RCA Operator Helm repository
helm repo add rca-operator https://gaurangkudale.github.io/rca-operator.github.io
helm repo update
# 2. Install the minimal operator profile
helm install rca-operator rca-operator/rca-operator \
  --namespace rca-system \
  --create-namespace \
  --values helm/values-minimal.yaml
# 3. Use helm/values-full.yaml for OTel Operator, Collector, Instrumentation, and Jaeger

kubectl

kubectl-install.sh

# 1. Install CRDs and operator from a release manifest
kubectl apply -f https://github.com/gaurangkudale/RCA-Operator/releases/download/v0.0.16/install.yaml
# 2. Apply an RCAAgent and correlation rules
kubectl apply -f config/samples/rca_v1alpha1_rcaagent.yaml
# 3. Verify the operator is running
kubectl get pods -n rca-system
# 4. Check incident reports
kubectl get incidentreport -A

Helm Installation Guide · Quick Start · Phase 2 Release Notes

Documentation

Everything you need to get productive

OTLP Ingest

Trace and log translation into correlator events.

Topology Graph

Incident graph and service graph behavior.

Dashboard

Topology tabs, trace modal, and service views.

Architecture

Current system design, data flow, and component interaction overview.

Correlation Rule CRD

Event types, predicates, attribute matching, and rule output.

Helm Installation

Minimal, full, production, and external observability profiles.

RBAC Reference

Permissions explained — what the operator needs and why.

Testing Guide

Manual scenarios, Topology UI checks, and OTLP ingest testing.

Community

Built in the open. Shaped by the community.

RCA Operator is MIT licensed and actively developed. Every contribution — bug reports, docs, playbooks, or code — moves us closer to production-ready.

Report Issues & Request Features

Found a bug? Have an idea for a new analyzer, integration, or workflow improvement? Open an issue on GitHub — everything is triaged and prioritized transparently.

Open an Issue →

Contribute Code or Docs

Read CONTRIBUTING.md and CODE_OF_CONDUCT.md, fork the repo, create a feature branch, run make test, and open a Pull Request.

Contributing Guide →

Join Discussions

Use GitHub Discussions to ask questions, share your setup, propose architecture ideas, or just let us know how RCA Operator is working for you.

Join Discussions →

Star & Share

The best way to support the project is to give it a star on GitHub and share it with your SRE, platform engineering, and DevOps communities. It genuinely helps.

Star on GitHub →

Get started today

Bring incident evidence back into Kubernetes.

RCA Operator is free, open source, and designed for platform teams that want trace-aware RCA without leaving the Kubernetes control plane.

Deploy Now → · View on GitHub
