Like AOF? Give us a star!

If you find AOF useful, please star us on GitHub. It helps us reach more developers and grow the community.

Tutorial: Multi-Model RCA with Tiered Execution

Build a production-grade Root Cause Analysis fleet that combines multiple LLMs for consensus-based incident diagnosis.

What You'll Build

A tiered fleet architecture where:

Tier 1: Cheap, fast models collect observability data in parallel
Tier 2: Multiple reasoning models (Claude, Gemini, GPT-4) analyze with diverse perspectives
Tier 3: A coordinator synthesizes findings into a final RCA report

┌─────────────────────────────────────────────────────────────────┐
│                  MULTI-MODEL RCA ARCHITECTURE                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  TIER 1: Data Collectors (~$0.075/1M tokens)                   │
│  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐                   │
│  │  Loki  │ │Promethe│ │  K8s   │ │  Git   │                   │
│  │  Logs  │ │Metrics │ │ State  │ │Changes │                   │
│  └───┬────┘ └───┬────┘ └───┬────┘ └───┬────┘                   │
│      └──────────┼──────────┼──────────┘                        │
│                 ▼ (parallel execution)                         │
│                                                                 │
│  TIER 2: Reasoning Models (multi-model consensus)              │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐                  │
│  │   Claude   │ │   Gemini   │ │   GPT-4    │                  │
│  │ (wt: 1.5)  │ │ (wt: 1.0)  │ │ (wt: 1.0)  │                  │
│  └─────┬──────┘ └─────┬──────┘ └─────┬──────┘                  │
│        └──────────────┼──────────────┘                         │
│                       ▼ (weighted consensus)                   │
│                                                                 │
│  TIER 3: Coordinator                                           │
│  ┌──────────────────────────────────────────┐                  │
│  │            RCA Coordinator               │                  │
│  │     (synthesize final report)            │                  │
│  └──────────────────────────────────────────┘                  │
│                       ▼                                         │
│                Final RCA Report                                 │
└─────────────────────────────────────────────────────────────────┘

Why Multi-Model RCA?

The Problem with Single-Model Analysis

When a production incident occurs, relying on a single LLM has risks:

Risk	Impact
Model Bias	Each LLM has different training data and reasoning patterns
Hallucination	No cross-validation means false conclusions go unchecked
Blind Spots	One model might miss what another catches
Single Point of Failure	If the model is wrong, your RCA is wrong

The Multi-Model Solution

By using multiple LLMs with weighted consensus:

Diverse Perspectives: Claude, Gemini, and GPT-4 approach problems differently
Cross-Validation: Areas of agreement have higher confidence
Disagreement Detection: Conflicting conclusions are surfaced for human review
Cost Optimization: Cheap models for data collection, premium for reasoning

Prerequisites

AOF installed (aofctl version)

API keys for multiple providers:

export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=AIza...
export OPENAI_API_KEY=sk-...

Step 1: Create Data Collector Agents

These Tier 1 agents use cheap, fast models to gather observability data.

Loki Log Collector

Create agents/observability/loki-collector.yaml:

apiVersion: aof.dev/v1
kind: Agent
metadata:
  name: loki-collector
  labels:
    tier: "1"
    category: observability
    cost: low
spec:
  model: google:gemini-2.0-flash  # ~$0.075/1M tokens

  instructions: |
    You are a Log Collector agent that queries Loki for relevant logs.

    ## Your Task
    Extract logs related to the incident using LogQL queries.

    ## Query Strategy
    1. Start broad: {job=".*"} |~ "error|Error|ERROR"
    2. Filter by timeframe around incident
    3. Group by service/container
    4. Extract key error patterns

    ## Output Format
    Return structured JSON:
    ```json
    {
      "source": "loki",
      "timeframe": {"start": "...", "end": "..."},
      "error_summary": {
        "total_errors": 123,
        "by_service": {"api": 80, "worker": 43},
        "top_patterns": ["Connection refused", "Timeout"]
      },
      "key_logs": [
        {"timestamp": "...", "service": "...", "message": "..."}
      ],
      "first_error": {"timestamp": "...", "message": "..."}
    }

tools:

shell

max_iterations: 3 temperature: 0.2

### Prometheus Metrics Collector

Create `agents/observability/prometheus-collector.yaml`:

```yaml
apiVersion: aof.dev/v1
kind: Agent
metadata:
  name: prometheus-collector
  labels:
    tier: "1"
    category: observability
    cost: low
spec:
  model: google:gemini-2.0-flash

  instructions: |
    You are a Metrics Collector agent that queries Prometheus.

    ## Key Metrics to Check
    - Error rates: rate(http_requests_total{status=~"5.."}[5m])
    - Latency: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
    - Resource usage: container_cpu_usage_seconds_total, container_memory_working_set_bytes
    - Saturation: container_cpu_throttled_seconds_total

    ## Output Format
    ```json
    {
      "source": "prometheus",
      "timeframe": {"start": "...", "end": "..."},
      "anomalies": [
        {
          "metric": "error_rate",
          "baseline": 0.01,
          "current": 0.15,
          "deviation": "15x normal"
        }
      ],
      "resource_status": {
        "cpu_throttled": false,
        "memory_pressure": true,
        "disk_pressure": false
      },
      "correlated_events": ["deploy at 14:02", "traffic spike at 14:05"]
    }

tools:

shell

max_iterations: 3 temperature: 0.2

### Kubernetes State Collector

Create `agents/observability/k8s-collector.yaml`:

```yaml
apiVersion: aof.dev/v1
kind: Agent
metadata:
  name: k8s-collector
  labels:
    tier: "1"
    category: observability
    cost: low
spec:
  model: google:gemini-2.0-flash

  instructions: |
    You are a Kubernetes State Collector.

    ## Commands to Run
    - kubectl get pods -A | grep -v Running
    - kubectl get events --sort-by='.lastTimestamp' | tail -50
    - kubectl top pods --all-namespaces
    - kubectl describe deployment <affected-service>

    ## Output Format
    ```json
    {
      "source": "kubernetes",
      "cluster_health": "degraded|healthy|critical",
      "unhealthy_pods": [
        {"name": "...", "status": "CrashLoopBackOff", "restarts": 5}
      ],
      "recent_events": [
        {"type": "Warning", "reason": "...", "message": "..."}
      ],
      "resource_pressure": {
        "cpu_constrained_pods": [],
        "memory_constrained_pods": []
      },
      "recent_deployments": []
    }

tools:

shell

max_iterations: 3 temperature: 0.2

### Git Change Auditor

Create `agents/observability/git-collector.yaml`:

```yaml
apiVersion: aof.dev/v1
kind: Agent
metadata:
  name: git-collector
  labels:
    tier: "1"
    category: observability
    cost: low
spec:
  model: google:gemini-2.0-flash

  instructions: |
    You are a Git Change Auditor.

    ## Commands to Run
    - git log --oneline --since="24 hours ago"
    - git diff HEAD~10 --stat
    - git log --oneline --all --graph | head -20

    ## Output Format
    ```json
    {
      "source": "git",
      "recent_commits": [
        {
          "hash": "abc123",
          "author": "...",
          "message": "...",
          "timestamp": "...",
          "files_changed": 5
        }
      ],
      "suspicious_changes": [
        {
          "commit": "abc123",
          "reason": "Config file modified",
          "files": ["config/database.yml"]
        }
      ],
      "rollback_candidates": ["abc123", "def456"]
    }

tools:

shell
git

max_iterations: 3 temperature: 0.2

## Step 2: Create Reasoning Agents

These Tier 2 agents use different LLMs to analyze the collected data.

### Claude Analyzer

Create `agents/reasoning/claude-analyzer.yaml`:

```yaml
apiVersion: aof.dev/v1
kind: Agent
metadata:
  name: claude-analyzer
  labels:
    tier: "2"
    category: reasoning
    cost: medium
spec:
  model: anthropic:claude-sonnet-4-20250514

  instructions: |
    You are a Root Cause Analysis Reasoning Agent powered by Claude.

    ## Your Role
    Analyze collected data from tier 1 agents to identify the root cause.

    ## Analysis Approach

    ### 1. Timeline Reconstruction
    - Order events chronologically
    - Identify the trigger event
    - Map the cascade of failures

    ### 2. Correlation Analysis
    - What changed before the incident?
    - Which metrics correlate with errors?
    - Are there common services/components?

    ### 3. Root Cause Identification
    - Apply the "5 Whys" technique
    - Distinguish symptoms from causes
    - Consider both technical and process factors

    ## Output Format
    ```json
    {
      "analysis_summary": "One paragraph summary",
      "confidence": 0.0-1.0,
      "root_cause": {
        "category": "code|config|infrastructure|dependency|capacity",
        "description": "Clear description",
        "evidence": ["evidence 1", "evidence 2"],
        "timeline_position": "What triggered the cascade"
      },
      "contributing_factors": [
        {"factor": "...", "impact": "high|medium|low", "evidence": "..."}
      ],
      "immediate_actions": [
        {"action": "...", "priority": "critical|high|medium", "expected_impact": "..."}
      ],
      "prevention_recommendations": ["..."]
    }

tools:

shell

max_iterations: 3 temperature: 0.4

### Gemini Analyzer

Create `agents/reasoning/gemini-analyzer.yaml`:

```yaml
apiVersion: aof.dev/v1
kind: Agent
metadata:
  name: gemini-analyzer
  labels:
    tier: "2"
    category: reasoning
    cost: medium
spec:
  model: google:gemini-2.5-pro

  instructions: |
    You are a Root Cause Analysis Reasoning Agent powered by Gemini.

    ## Analysis Approach
    Use a structured, methodical approach:

    ### 1. Data Synthesis
    - Combine information from all data sources
    - Build a unified timeline
    - Identify correlations across sources

    ### 2. Hypothesis Generation
    - Generate multiple possible root causes
    - Rank by likelihood based on evidence
    - Consider both obvious and non-obvious causes

    ### 3. Evidence Evaluation
    - Evaluate supporting evidence for each hypothesis
    - Look for contradicting evidence
    - Apply Occam's Razor

    ## Output Format
    ```json
    {
      "analysis_summary": "One paragraph summary",
      "confidence": 0.0-1.0,
      "root_cause": {
        "category": "code|config|infrastructure|dependency|capacity",
        "description": "Clear description",
        "evidence": ["evidence 1", "evidence 2"]
      },
      "alternative_hypotheses": [
        {"hypothesis": "...", "likelihood": "low|medium", "missing_evidence": "..."}
      ],
      "immediate_actions": [
        {"action": "...", "priority": "critical|high|medium", "expected_impact": "..."}
      ],
      "verification_steps": ["..."],
      "prevention_recommendations": ["..."]
    }

tools:

shell

max_iterations: 3 temperature: 0.4

### GPT-4 Analyzer

Create `agents/reasoning/gpt4-analyzer.yaml`:

```yaml
apiVersion: aof.dev/v1
kind: Agent
metadata:
  name: gpt4-analyzer
  labels:
    tier: "2"
    category: reasoning
    cost: medium
spec:
  model: openai:gpt-4o

  instructions: |
    You are a Root Cause Analysis Reasoning Agent powered by GPT-4.

    ## Analytical Framework

    ### Five Whys Analysis
    - Start with the symptom
    - Ask "why" iteratively
    - Drill down to the fundamental cause

    ### Fault Tree Analysis
    - Work backwards from the failure
    - Identify all possible causes
    - Evaluate each branch

    ### STAMP Analysis (System-Theoretic)
    - Consider the system as a whole
    - Look for control loop failures
    - Identify missing constraints

    ## Output Format
    ```json
    {
      "analysis_summary": "One paragraph summary",
      "confidence": 0.0-1.0,
      "root_cause": {
        "category": "code|config|infrastructure|dependency|capacity",
        "description": "Clear description",
        "evidence": ["evidence 1", "evidence 2"],
        "five_whys": ["Why 1", "Why 2", "Why 3", "Why 4", "Root cause"]
      },
      "system_weaknesses": [
        {"weakness": "...", "recommendation": "..."}
      ],
      "immediate_actions": [
        {"action": "...", "priority": "critical|high|medium", "risk": "..."}
      ],
      "long_term_fixes": [
        {"fix": "...", "effort": "low|medium|high", "impact": "..."}
      ]
    }

tools:

shell

max_iterations: 3 temperature: 0.4

## Step 3: Create the RCA Coordinator

This Tier 3 manager synthesizes all analyses into a final report.

Create `agents/reasoning/rca-coordinator.yaml`:

```yaml
apiVersion: aof.dev/v1
kind: Agent
metadata:
  name: rca-coordinator
  labels:
    tier: "3"
    category: coordinator
    cost: medium
spec:
  model: anthropic:claude-sonnet-4-20250514

  instructions: |
    You are the RCA Coordinator, responsible for synthesizing analyses from
    multiple reasoning agents into a coherent, actionable report.

    ## Your Role
    - Review analyses from all tier 2 reasoning agents
    - Identify areas of agreement and disagreement
    - Synthesize a consensus view with confidence levels
    - Produce the final RCA report

    ## Synthesis Process

    ### 1. Agreement Analysis
    - What do all/most agents agree on? (high confidence)
    - Where is there strong consensus?

    ### 2. Disagreement Resolution
    - Where do agents disagree?
    - Evaluate the evidence for each position
    - Note unresolved disagreements (lower confidence)

    ### 3. Evidence Aggregation
    - Combine evidence from all analyses
    - Remove duplicates, merge similar points
    - Rank evidence by strength

    ## Final Report Format

    ```markdown
    # Root Cause Analysis Report

    ## Incident Summary
    - **Incident**: [description]
    - **Duration**: [start to resolution]
    - **Impact**: [what was affected]
    - **Severity**: [P1/P2/P3/P4]

    ## Executive Summary
    [2-3 paragraphs summarizing the incident]

    ## Root Cause
    **Primary Cause**: [description]
    **Category**: [code|config|infrastructure|dependency|capacity]
    **Confidence**: [high|medium|low] ([X]% of analyzers agreed)

    ### Evidence
    1. [evidence point 1]
    2. [evidence point 2]

    ## Contributing Factors
    | Factor | Impact | Evidence |
    |--------|--------|----------|
    | [factor] | High/Med/Low | [evidence] |

    ## Timeline
    | Time | Event | Significance |
    |------|-------|--------------|
    | [time] | [event] | [why it matters] |

    ## Immediate Actions
    - [ ] [action 1]
    - [ ] [action 2]

    ## Follow-up Actions
    | Action | Priority | Owner | Due Date |
    |--------|----------|-------|----------|
    | [action] | P1/P2/P3 | TBD | TBD |

    ## Prevention Measures
    ### Short-term
    - [measure 1]

    ### Long-term
    - [measure 1]

    ## Appendix
    ### Analyzer Agreement Matrix
    | Finding | Claude | Gemini | GPT-4 | Consensus |
    |---------|--------|--------|-------|-----------|

    ---
    *Generated by AOF Multi-Model RCA Fleet*

tools:

shell

max_iterations: 5 temperature: 0.5

## Step 4: Create the Fleet Definition

Create `fleets/multi-model-rca-fleet.yaml`:

```yaml
apiVersion: aof.dev/v1
kind: AgentFleet
metadata:
  name: multi-model-rca
  labels:
    purpose: incident-response
    type: rca
    multi-model: "true"

spec:
  agents:
    # ============================================
    # TIER 1: Data Collectors (cheap, parallel)
    # ============================================
    - name: loki-collector
      config: ../agents/observability/loki-collector.yaml
      tier: 1
      role: specialist

    - name: prometheus-collector
      config: ../agents/observability/prometheus-collector.yaml
      tier: 1
      role: specialist

    - name: k8s-collector
      config: ../agents/observability/k8s-collector.yaml
      tier: 1
      role: specialist

    - name: git-collector
      config: ../agents/observability/git-collector.yaml
      tier: 1
      role: specialist

    # ============================================
    # TIER 2: Reasoning Agents (multi-model)
    # ============================================
    - name: claude-analyzer
      config: ../agents/reasoning/claude-analyzer.yaml
      tier: 2
      weight: 1.5  # Higher weight for Claude

    - name: gemini-analyzer
      config: ../agents/reasoning/gemini-analyzer.yaml
      tier: 2
      weight: 1.0

    - name: gpt4-analyzer
      config: ../agents/reasoning/gpt4-analyzer.yaml
      tier: 2
      weight: 1.0

    # ============================================
    # TIER 3: Coordinator (synthesis)
    # ============================================
    - name: rca-coordinator
      config: ../agents/reasoning/rca-coordinator.yaml
      tier: 3
      role: manager

  coordination:
    mode: tiered
    distribution: round-robin

    # Global consensus configuration
    consensus:
      algorithm: weighted
      min_votes: 2
      timeout_ms: 180000  # 3 minutes
      allow_partial: true
      min_confidence: 0.6

    # Tiered execution configuration
    tiered:
      pass_all_results: true
      final_aggregation: manager_synthesis

      # Per-tier consensus
      tier_consensus:
        "1":
          algorithm: first_wins  # Just collect data fast
        "2":
          algorithm: weighted    # Multi-model analysis
          min_votes: 2
          min_confidence: 0.5
        "3":
          algorithm: first_wins  # Single coordinator

  # Shared memory for cross-agent communication
  shared:
    memory:
      type: inmemory
      namespace: rca-session
      ttl: 3600

Step 5: Run the Fleet

# Set API keys
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=AIza...
export OPENAI_API_KEY=sk-...

# Run the multi-model RCA
aofctl run fleet fleets/multi-model-rca-fleet.yaml \
  --input "Investigate: Users reporting 500 errors on checkout API since 2pm UTC"

Understanding the Output

Execution Flow

[FLEET] Initializing multi-model-rca with 8 agents
[TIER 1] Starting 4 data collectors in parallel...
  [AGENT] loki-collector: Querying Loki for error logs
  [AGENT] prometheus-collector: Querying Prometheus metrics
  [AGENT] k8s-collector: Checking Kubernetes state
  [AGENT] git-collector: Auditing recent changes
[TIER 1] Complete. Consensus: first_wins (4 results)

[TIER 2] Starting 3 reasoning agents in parallel...
  [AGENT] claude-analyzer: Analyzing with Claude
  [AGENT] gemini-analyzer: Analyzing with Gemini
  [AGENT] gpt4-analyzer: Analyzing with GPT-4
[TIER 2] Complete. Weighted consensus reached (confidence: 0.85)
  - Root cause agreement: 3/3 models
  - Contributing factors: 2/3 agreement

[TIER 3] Starting coordinator synthesis...
  [AGENT] rca-coordinator: Generating final report
[TIER 3] Complete.

[FLEET] Final RCA Report generated

Consensus Results

The fleet tracks agreement between models:

Analyzer Agreement Matrix:
| Finding              | Claude | Gemini | GPT-4 | Consensus |
|----------------------|--------|--------|-------|-----------|
| DB connection issue  | ✓      | ✓      | ✓     | HIGH      |
| Memory pressure      | ✓      | ✓      |       | MEDIUM    |
| Config change        |        | ✓      | ✓     | MEDIUM    |
| Network latency      | ✓      |        |       | LOW       |

Cost Optimization

Estimated Costs per RCA

Tier	Agents	Model	Cost/1M tokens	Typical Usage
1	4	Gemini Flash	$0.075	~50K tokens
2	3	Claude/Gemini/GPT-4	$3-15	~20K tokens each
3	1	Claude Sonnet	$3	~10K tokens

Total estimated cost: ~$0.50-1.00 per RCA (vs $5-10 for single GPT-4 analysis)

Cost Reduction Strategies

Tier 1 Caching: Cache observability data for similar incidents
Selective Tier 2: Use 2 models instead of 3 for lower-severity incidents
Model Selection: Use cheaper models for routine analysis

Customization

Adding Custom Data Sources

agents:
  - name: splunk-collector
    tier: 1
    spec:
      model: google:gemini-2.0-flash
      instructions: |
        Query Splunk for application logs...
      tools:
        - shell

Adjusting Consensus Weights

agents:
  - name: claude-analyzer
    tier: 2
    weight: 2.0  # Claude counts as 2 votes

  - name: junior-model
    tier: 2
    weight: 0.5  # Less experienced model counts as 0.5

Using Human Review for Critical Incidents

coordination:
  consensus:
    algorithm: human_review  # Always flag for human decision
    min_confidence: 0.9

Best Practices

1. Keep Tier 1 Agents Fast and Cheap

Tier 1 should collect data, not analyze it:

Use the cheapest models available
Keep instructions simple and focused
Output structured JSON for downstream agents

2. Diversify Tier 2 Models

Use models from different providers:

Anthropic (Claude): Strong reasoning, safety-focused
Google (Gemini): Good at structured data
OpenAI (GPT-4): Broad knowledge base

3. Weight Based on Track Record

Adjust weights based on your historical accuracy:

weight: 1.5  # This model has been more accurate for your use case

4. Set Appropriate Timeouts

timeout_ms: 180000  # 3 minutes for full RCA
tier_consensus:
  "1":
    timeout_ms: 30000   # 30s for data collection
  "2":
    timeout_ms: 120000  # 2 min for reasoning

Troubleshooting

Models Disagree on Root Cause

This is actually valuable information! The disagreement matrix in the final report helps you:

Identify where human judgment is needed
Understand the uncertainty in the analysis
Prioritize follow-up investigation

Tier 1 Agents Timing Out

Check connectivity to observability tools:

# Test Loki
curl -G "http://loki:3100/loki/api/v1/query" --data-urlencode 'query={job="test"}'

# Test Prometheus
curl "http://prometheus:9090/api/v1/query?query=up"

Low Confidence Results

Increase data quality:

Add more Tier 1 collectors
Extend time window for data collection
Add specific instructions for your stack

Summary

You've built a production-grade multi-model RCA system that:

✅ Collects data from multiple observability sources in parallel
✅ Analyzes with diverse LLM perspectives
✅ Uses weighted consensus for reliable conclusions
✅ Produces actionable RCA reports
✅ Optimizes costs with tiered model selection

Next Steps

Fleet Concepts - Deep dive into fleet coordination modes
Fleet YAML Reference - Complete specification
Multi-Model RCA Quickstart - 5-minute test guide

Production incident? Deploy your multi-model RCA fleet and let diverse AI perspectives find the root cause while you focus on mitigation!

What You'll Build​

Why Multi-Model RCA?​

The Problem with Single-Model Analysis​

The Multi-Model Solution​

Prerequisites​

Step 1: Create Data Collector Agents​

Loki Log Collector​

Step 5: Run the Fleet​

Understanding the Output​

Execution Flow​

Consensus Results​

Cost Optimization​

Estimated Costs per RCA​

Cost Reduction Strategies​

Customization​

Adding Custom Data Sources​

Adjusting Consensus Weights​

Using Human Review for Critical Incidents​

Best Practices​

1. Keep Tier 1 Agents Fast and Cheap​

2. Diversify Tier 2 Models​

3. Weight Based on Track Record​

4. Set Appropriate Timeouts​

Troubleshooting​

Models Disagree on Root Cause​

Tier 1 Agents Timing Out​

Low Confidence Results​

Summary​

Next Steps​

What You'll Build

Why Multi-Model RCA?

The Problem with Single-Model Analysis

The Multi-Model Solution

Prerequisites

Step 1: Create Data Collector Agents

Loki Log Collector

Step 5: Run the Fleet

Understanding the Output

Execution Flow

Consensus Results

Cost Optimization

Estimated Costs per RCA

Cost Reduction Strategies

Customization

Adding Custom Data Sources

Adjusting Consensus Weights

Using Human Review for Critical Incidents

Best Practices

1. Keep Tier 1 Agents Fast and Cheap

2. Diversify Tier 2 Models

3. Weight Based on Track Record

4. Set Appropriate Timeouts

Troubleshooting

Models Disagree on Root Cause

Tier 1 Agents Timing Out

Low Confidence Results

Summary

Next Steps