
Tutorial: Building an RCA (Root Cause Analysis) Fleet

Learn how to build a multi-agent fleet that performs comprehensive root cause analysis for production incidents.

What You'll Build

A fleet of specialized agents that work together to diagnose incidents:

                 RCA Fleet Architecture

  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
  │    Error    │    │ Dependency  │    │   Config    │
  │  Analyzer   │    │Investigator │    │   Auditor   │
  │             │    │             │    │             │
  │ • Logs      │    │ • DB health │    │ • Git diff  │
  │ • Traces    │    │ • API calls │    │ • Env vars  │
  │ • Patterns  │    │ • Network   │    │ • Configs   │
  └──────┬──────┘    └──────┬──────┘    └──────┬──────┘
         │                  │                  │
         └──────────────────┼──────────────────┘
                            ▼
                     ┌─────────────┐
                     │     RCA     │
                     │ Coordinator │
                     │             │
                     │ Synthesizes │
                     │   Report    │
                     └─────────────┘

Prerequisites

  • AOF installed (aofctl version)
  • Google API key (export GOOGLE_API_KEY=...)
  • Basic understanding of YAML

Step 1: Understand the RCA Pattern

Root Cause Analysis benefits from multiple perspectives:

| Specialist | Focus Area | Why a Separate Agent? |
|---|---|---|
| Error Analyzer | Logs, traces, exceptions | Deep pattern matching |
| Dependency Investigator | External services, DBs | Network/connectivity focus |
| Config Auditor | Recent changes, settings | Change correlation |
| RCA Coordinator | Synthesis, prioritization | Big picture view |

Why not one agent?

  • Each specialist has focused, specific instructions → better results
  • Parallel execution → faster analysis
  • Consensus → more reliable conclusions
  • Cheaper models work well with focused tasks (see the skeleton sketched after this list)
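
The split above maps onto a short fleet skeleton. This is only a sketch of the structure; Step 2 fills in the models, instructions, and tools:

# Skeleton only: three focused specialists plus one manager (full config in Step 2)
apiVersion: aof.dev/v1
kind: AgentFleet
metadata:
  name: rca-team
spec:
  agents:
    - name: error-analyzer        # logs, traces, exceptions
      role: specialist
    - name: dependency-checker    # external services, DBs
      role: specialist
    - name: config-auditor        # recent changes, settings
      role: specialist
    - name: rca-coordinator       # synthesis, prioritization
      role: manager
  coordination:
    mode: peer                    # all agents run in parallel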

Step 2: Create the Fleet YAML

Create rca-fleet.yaml:

apiVersion: aof.dev/v1
kind: AgentFleet
metadata:
  name: rca-team
  labels:
    purpose: incident-response
spec:
  agents:
    # Agent 1: Error Pattern Analyzer
    - name: error-analyzer
      role: specialist
      spec:
        model: google:gemini-2.5-flash
        instructions: |
          You are an Error Pattern Analyzer.

          Your job:
          1. Search logs for errors and exceptions
          2. Identify the FIRST occurrence (patient zero)
          3. Find patterns in error frequency
          4. Extract key stack trace information

          Focus ONLY on error analysis. Other agents handle dependencies and config.

          Output format:
          ## Error Analysis
          - **First Error**: [timestamp and message]
          - **Error Count**: [number in timeframe]
          - **Pattern**: [what's repeating]
          - **Key Stack Frames**: [relevant code paths]
          - **Likely Component**: [where the bug is]
        tools:
          - shell
          - read_file

    # Agent 2: Dependency Investigator
    - name: dependency-checker
      role: specialist
      spec:
        model: google:gemini-2.5-flash
        instructions: |
          You are a Dependency Investigator.

          Your job:
          1. Check if external services are healthy
          2. Test database connectivity
          3. Verify API endpoints respond
          4. Check network connectivity

          Use these commands:
          - curl -I <url> (HTTP health checks)
          - nc -zv <host> <port> (port connectivity)
          - ping <host> (basic connectivity)

          Focus ONLY on dependencies. Other agents handle logs and config.

          Output format:
          ## Dependency Health
          | Service | Status | Response Time | Notes |
          |---------|--------|---------------|-------|
        tools:
          - shell

    # Agent 3: Configuration Auditor
    - name: config-auditor
      role: specialist
      spec:
        model: google:gemini-2.5-flash
        instructions: |
          You are a Configuration Auditor.

          Your job:
          1. Check git history for recent changes
          2. Look for config file modifications
          3. Verify environment variables
          4. Identify any suspicious settings

          Commands to use:
          - git log --oneline -20 (recent commits)
          - git diff HEAD~5 (recent changes)
          - env | grep -i <app> (environment)

          Focus ONLY on configuration. Other agents handle logs and dependencies.

          Output format:
          ## Configuration Audit
          - **Recent Changes**: [list with dates]
          - **Suspicious Settings**: [any red flags]
          - **Rollback Candidate**: [if applicable]
        tools:
          - shell
          - read_file
          - git

    # Agent 4: RCA Coordinator (Manager)
    - name: rca-coordinator
      role: manager
      spec:
        model: google:gemini-2.5-flash
        instructions: |
          You are the RCA Coordinator.

          Your job:
          1. Review findings from all specialists
          2. Determine the most likely root cause
          3. Create prioritized remediation steps
          4. Write a clear incident report

          Output this exact format:

          # Incident RCA Report

          ## Summary
          [One paragraph describing what happened]

          ## Root Cause
          [The primary cause with evidence]

          ## Contributing Factors
          1. [Factor 1]
          2. [Factor 2]

          ## Immediate Actions
          - [ ] [Action 1 - do now]
          - [ ] [Action 2 - do now]

          ## Follow-up Actions
          - [ ] [Action for this week]

          ## Prevention
          - [How to prevent recurrence]
        tools:
          - shell

  # All agents run in parallel, then reach consensus
  coordination:
    mode: peer
    distribution: round-robin
    consensus:
      algorithm: majority
      min_votes: 3
      timeout_ms: 120000
      allow_partial: true

Step 3: Run the Fleet

# Set your API key
export GOOGLE_API_KEY=AIza...

# Run the RCA fleet
aofctl run fleet rca-fleet.yaml \
--input "Investigate: Users reporting 500 errors on the checkout API since 2pm"

Step 4: Understanding the Output

The fleet will:

  1. Parallel Execution: All 4 agents start simultaneously
  2. Independent Analysis: Each agent focuses on its own specialty
  3. Consensus: Results are combined with majority voting
  4. Unified Report: Coordinator synthesizes the final RCA

Example output:

[AGENT] Started: error-analyzer
[AGENT] Started: dependency-checker
[AGENT] Started: config-auditor
[AGENT] Started: rca-coordinator
[FLEET] Started: rca-team with 4 agents
[TASK] Submitted: abc123

... agents execute in parallel ...

[CONSENSUS] Reached for task abc123 with 4 votes
[FLEET] Stopped: rca-team

Result: {
"response": "# Incident RCA Report\n\n## Summary\nThe checkout API began..."
}

Step 5: Customize for Your Stack

Kubernetes-Specific RCA

# Add kubectl tool for K8s analysis
- name: pod-analyzer
  spec:
    instructions: |
      Analyze pod status using:
      - kubectl get pods -A | grep -v Running
      - kubectl describe pod <name>
      - kubectl logs <pod> --tail=100
    tools:
      - kubectl

Database-Specific RCA

# Add database analysis
- name: db-analyzer
  spec:
    instructions: |
      Analyze database health:
      - Check slow query logs
      - Verify connection pool status
      - Look for lock contention
    tools:
      - shell

Cloud-Specific RCA (AWS)

# Add AWS CLI for cloud analysis
- name: aws-analyzer
  spec:
    instructions: |
      Check AWS service health:
      - aws cloudwatch get-metric-statistics
      - aws logs filter-log-events
      - aws ecs describe-services
    tools:
      - shell  # Requires AWS CLI configured

Step 6: Advanced Patterns

Hierarchical Mode (Manager-Led)

For complex incidents where you want the coordinator to delegate:

coordination:
  mode: hierarchical  # Manager delegates instead of parallel

agents:
  - name: incident-commander
    role: manager
    spec:
      instructions: |
        You lead the investigation.
        1. Assess the incident severity
        2. Decide which specialists to engage
        3. Coordinate the investigation
        4. Deliver the final report
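
Putting this together with the Step 2 fleet, a hierarchical variant might look like the abridged sketch below; the specialists keep the specs shown in Step 2, and only the manager and coordination mode change:

# Sketch: hierarchical variant of the rca-team fleet (abridged)
apiVersion: aof.dev/v1
kind: AgentFleet
metadata:
  name: rca-team-hierarchical
spec:
  agents:
    - name: incident-commander
      role: manager
      spec:
        model: google:gemini-2.5-flash
        instructions: |
          You lead the investigation. Assess severity, decide which
          specialists to engage, coordinate their work, and deliver
          the final report.
    - name: error-analyzer        # spec as in Step 2
      role: specialist
    - name: dependency-checker    # spec as in Step 2
      role: specialist
    - name: config-auditor        # spec as in Step 2
      role: specialist
  coordination:
    mode: hierarchical            # manager delegates instead of parallel fan-out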

Weighted Consensus (Senior Reviewers)

Give more weight to experienced agents:

consensus:
  algorithm: weighted
  weights:
    senior-analyst: 2.0  # Counts as 2 votes
    junior-analyst: 1.0  # Counts as 1 vote
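
The senior-analyst and junior-analyst names above are placeholders. Applied to the RCA fleet from Step 2, and assuming weights are keyed by agent name as in the snippet above, a weighted setup might look like this:

consensus:
  algorithm: weighted
  weights:
    rca-coordinator: 2.0     # the coordinator's synthesis counts double
    error-analyzer: 1.0
    dependency-checker: 1.0
    config-auditor: 1.0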

Pipeline Mode (Sequential Analysis)

When each step depends on the previous:

coordination:
  mode: pipeline

agents:
  - name: data-collector      # Step 1: Gather data
  - name: pattern-analyzer    # Step 2: Find patterns
  - name: root-cause-finder   # Step 3: Identify cause
  - name: report-writer       # Step 4: Write report

Complete Example: Production-Ready RCA Fleet

See the full examples in the repository:

  • Kubernetes RCA: examples/fleets/k8s-rca-team.yaml
  • Application RCA: examples/fleets/application-rca-team.yaml
  • Database RCA: examples/fleets/database-rca-team.yaml

Best Practices

1. Keep Agent Instructions Focused

# ❌ Bad: Too broad
instructions: Investigate the incident and find the root cause.

# ✅ Good: Specific and focused
instructions: |
  You are an ERROR ANALYZER. Focus ONLY on:
  1. Log patterns and error messages
  2. Stack traces and exceptions
  3. Error frequency and timing

  DO NOT analyze configs or dependencies - other agents do that.

2. Use Appropriate Timeouts

consensus:
  timeout_ms: 120000   # 2 minutes for standard RCA
  allow_partial: true  # Don't fail if one agent is slow

3. Add Observability

# Future: Send RCA reports to Slack/PagerDuty
communication:
  pattern: broadcast
  broadcast:
    channel: incident-response

4. Test Incrementally

# Test individual agents first
aofctl run agent error-analyzer.yaml --input "Check /var/log/app.log for errors"

# Then test the full fleet
aofctl run fleet rca-fleet.yaml --input "Full incident analysis"
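
The first command above assumes a standalone definition for the error analyzer in error-analyzer.yaml. A minimal sketch, reusing the spec from Step 2 (kind: Agent is an assumption here; the tutorial only shows kind: AgentFleet, so check the single-agent schema for your AOF version):

# error-analyzer.yaml - hypothetical standalone agent for incremental testing
# (kind: Agent is assumed; adjust to your AOF version's schema)
apiVersion: aof.dev/v1
kind: Agent
metadata:
  name: error-analyzer
spec:
  model: google:gemini-2.5-flash
  instructions: |
    You are an Error Pattern Analyzer. Search logs for errors and
    exceptions, identify the first occurrence, and summarize error
    patterns and key stack frames.
  tools:
    - shell
    - read_file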

Troubleshooting

Agents Timing Out

Increase the timeout or narrow the investigation scope:

consensus:
  timeout_ms: 300000  # 5 minutes

Inconsistent Results

Require stronger agreement:

consensus:
  algorithm: unanimous  # All must agree
  # OR
  min_votes: 4          # Need all 4 agents

Overly Verbose Output

Add output constraints in instructions:

instructions: |
  ...
  Keep your analysis under 500 words.
  Focus on actionable findings only.

Summary

You've learned how to:

  1. ✅ Design an RCA fleet with specialized agents
  2. ✅ Configure peer mode with consensus
  3. ✅ Run parallel investigations
  4. ✅ Synthesize findings into actionable reports
  5. ✅ Customize for different tech stacks

Next Steps


Have an incident? Run your RCA fleet and let the agents investigate while you focus on mitigation!