Deep Analysis Fleet Tutorial
Learn how to build an agentic investigation system using AOF's deep coordination mode. This pattern enables iterative, self-directed analysis where a planner agent creates investigation plans and worker agents execute them until the root cause is found.
What You'll Build
A multi-agent fleet that:
- Uses deep coordination for iterative investigation
- Integrates MCP servers for filesystem and GitHub access
- Combines code analysis and infrastructure analysis
- Synthesizes findings into actionable conclusions
Prerequisites
- AOF installed (
curl -sSL https://docs.aof.sh/install.sh | bash) - Google AI API key (or other supported provider)
- Node.js (for MCP servers)
- kubectl configured (for infrastructure analysis)
- Optional: GitHub token, Prometheus/Loki endpoints
Understanding Deep Coordination
Traditional coordination modes (hierarchical, peer, pipeline) have fixed execution patterns. Deep coordination is an agentic pattern where:
- Planner analyzes the problem and creates an investigation plan
- Workers execute investigation steps and report findings
- Planner reviews findings and refines the plan
- Loop continues until confidence threshold is met
- Synthesizer combines all findings into conclusions
┌─────────────────────────────────────────────────────────────────┐
│ Deep Coordination Flow │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌───────────────┐ ┌──────────────────┐ │
│ │ Task │────▶│ Planner │────▶│ Investigation │ │
│ │ Input │ │ (Manager) │ │ Plan │ │
│ └──────────┘ └───────────────┘ └────────┬─────────┘ │
│ ▲ │ │
│ │ ▼ │
│ ┌──────┴──────┐ ┌──────────────────┐ │
│ │ Refine │ │ Worker Agents │ │
│ │ Plan │ │ Execute Steps │ │
│ └──────┬──────┘ └────────┬─────────┘ │
│ │ │ │
│ │ ┌────────────┘ │
│ │ ▼ │
│ ┌──────┴──────────────┐ │
│ │ Findings │ │
│ │ Confidence < 0.8 │───────┐ │
│ └─────────────────────┘ │ │
│ │ │ │
│ │ Confidence ≥ 0.8 │ Loop │
│ ▼ │ │
│ ┌─────────────────────┐ │ │
│ │ Synthesizer │◀──────┘ │
│ │ Final Conclusions │ │
│ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Step 1: Create the Fleet Configuration
Create deep-analysis-fleet.yaml:
apiVersion: aof.dev/v1
kind: AgentFleet
metadata:
name: deep-analysis-fleet
labels:
purpose: investigation
pattern: deep-agentic
spec:
agents:
# Planner - creates and refines investigation plans
- name: planner
role: manager
spec:
model: google:gemini-2.5-flash
instructions: |
You are a senior investigation planner. Your role is to:
1. Analyze the problem statement and current findings
2. Create a structured investigation plan with specific steps
3. Prioritize steps based on likelihood of finding root cause
4. Refine the plan based on worker agent findings
Output investigation plans in this format:
INVESTIGATION PLAN:
- [Step description] - [Expected outcome]
- [Step description] - [Expected outcome]
PRIORITY: [high|medium|low] HYPOTHESIS: [Current working hypothesis]
tools:
- shell
max_iterations: 5
temperature: 0.3
# Code analyzer with MCP filesystem access
- name: code-analyzer
role: specialist
spec:
model: google:gemini-2.5-flash
instructions: |
You are a code analysis specialist. Examine source code
for bugs, misconfigurations, and recent changes that might
cause the reported issue.
Always provide specific file paths and line numbers.
tools:
- git
- shell
mcp_servers:
- name: filesystem
transport: stdio
command: npx
args: ["-y", "@modelcontextprotocol/server-filesystem", "/workspace"]
max_iterations: 8
temperature: 0.2
# Infrastructure analyzer
- name: infra-analyzer
role: specialist
spec:
model: google:gemini-2.5-flash
instructions: |
You are an infrastructure analysis specialist. Query
Kubernetes, logs, and metrics to identify infrastructure
issues causing the reported problem.
tools:
- kubectl
- prometheus_query
- loki_query
max_iterations: 8
temperature: 0.2
# Synthesizer - combines findings
- name: synthesizer
role: validator
spec:
model: google:gemini-2.5-flash
instructions: |
You are an analysis synthesizer. Review all findings,
identify patterns, determine root cause with confidence
level, and recommend remediation steps.
tools:
- shell
max_iterations: 5
temperature: 0.3
coordination:
mode: deep
manager: planner
deep:
max_iterations: 5
confidence_threshold: 0.8
context_strategy: cumulative
investigation_agents:
- code-analyzer
- infra-analyzer
synthesis_agent: synthesizer
distribution: capability_based
Step 2: Add MCP Server Integration
The code-analyzer agent uses MCP servers for enhanced capabilities:
mcp_servers:
# Filesystem access for reading code
- name: filesystem
transport: stdio
command: npx
args: ["-y", "@modelcontextprotocol/server-filesystem", "/workspace"]
# GitHub access for PR history, issues, etc.
- name: github
transport: stdio
command: npx
args: ["-y", "@modelcontextprotocol/server-github"]
env:
GITHUB_TOKEN: "${GITHUB_TOKEN}"
Install the MCP servers:
# These are installed on-demand by npx, but you can pre-install:
npm install -g @modelcontextprotocol/server-filesystem
npm install -g @modelcontextprotocol/server-github
Step 3: Configure Environment
Set up required environment variables:
# Required
export GOOGLE_API_KEY="your-google-api-key"
# For infrastructure analysis
export KUBECONFIG="${HOME}/.kube/config"
export PROMETHEUS_URL="http://prometheus:9090"
export LOKI_URL="http://loki:3100"
# For GitHub MCP server (optional)
export GITHUB_TOKEN="ghp_your_github_token"
Step 4: Run the Fleet
Execute the fleet with an investigation task:
# Basic usage
aofctl run fleet deep-analysis-fleet.yaml \
--task "Analyze why the payment service is returning 500 errors"
# With verbose output
aofctl run fleet deep-analysis-fleet.yaml \
--task "Investigate high latency in the checkout API" \
--verbose
# With custom working directory for MCP
aofctl run fleet deep-analysis-fleet.yaml \
--task "Find the cause of memory leaks in user-service" \
--env WORKSPACE=/path/to/your/repo
Step 5: Understanding the Output
The fleet produces structured output at each iteration:
Iteration 1: Initial Plan
=== DEEP ANALYSIS: Iteration 1 ===
[Planner] Creating initial investigation plan...
INVESTIGATION PLAN:
1. Check payment-service pod logs for error patterns
2. Query Prometheus for error rate and latency metrics
3. Review recent deployments and config changes
4. Examine payment-service source for error handling
PRIORITY: high
HYPOTHESIS: None yet - gathering initial data
Iteration 2: First Findings
=== DEEP ANALYSIS: Iteration 2 ===
[code-analyzer] Analyzing recent changes...
Finding: PR #234 modified payment validation logic 2 days ago
Finding: New timeout configuration in config/payment.yaml
[infra-analyzer] Checking infrastructure...
Finding: payment-service pods showing increased memory usage
Finding: 503 errors correlate with memory pressure
[Planner] Refining plan based on findings...
INVESTIGATION PLAN:
1. Deep dive into PR #234 changes
2. Check memory limits vs actual usage
3. Trace payment validation code path
HYPOTHESIS: Memory pressure causing request failures
CONFIDENCE: 0.6
Final Synthesis
=== DEEP ANALYSIS: Complete ===
[Synthesizer] Combining findings...
FINDINGS SUMMARY:
- PR #234 introduced unbounded cache in PaymentValidator
- Memory usage increased 300% after deployment
- Pods hitting memory limits, triggering OOMKills
- Requests failing during garbage collection pauses
ROOT CAUSE: Memory leak in PaymentValidator cache
CONFIDENCE: high (0.92)
EVIDENCE:
- Memory growth pattern matches cache accumulation
- Errors correlate with GC pause times
- Rolling restart temporarily resolves issue
REMEDIATION:
1. Immediate: Increase memory limits to 2Gi
2. Short-term: Add cache eviction policy
3. Follow-up: Review PR #234, add cache size monitoring
PREVENTION:
- Add memory usage alerts
- Include memory profiling in CI
- Require cache size limits in code review
Advanced Configuration
Custom Investigation Agents
Add domain-specific specialists:
agents:
# Database specialist
- name: db-analyzer
role: specialist
spec:
model: google:gemini-2.5-flash
instructions: |
You are a database specialist. Analyze query performance,
connection pools, and data consistency issues.
mcp_servers:
- name: postgres
transport: stdio
command: npx
args: ["-y", "@modelcontextprotocol/server-postgres"]
env:
DATABASE_URL: "${DATABASE_URL}"
Confidence Tuning
Adjust when the investigation concludes:
coordination:
deep:
# Higher threshold = more thorough investigation
confidence_threshold: 0.9
# More iterations allowed
max_iterations: 10
# How context is passed between iterations
# - cumulative: All findings accumulate
# - windowed: Only last N findings
# - summarized: LLM summarizes findings
context_strategy: cumulative
Multi-Model Investigation
Use different models for different roles:
agents:
- name: planner
spec:
model: google:gemini-2.5-flash # Fast for planning
# ...
- name: code-analyzer
spec:
model: anthropic:claude-3-5-sonnet # Strong at code
# ...
- name: synthesizer
spec:
model: openai:gpt-4o # Good at synthesis
# ...
Best Practices
1. Clear Investigation Scope
Provide specific, actionable tasks:
# Good - specific and actionable
aofctl run fleet deep-analysis-fleet.yaml \
--task "Find why checkout API returns 504 timeout errors after 10pm UTC"
# Too vague
aofctl run fleet deep-analysis-fleet.yaml \
--task "Fix the API"
2. Appropriate Tool Access
Give agents only the tools they need:
# Code analyzer needs file access, not kubectl
- name: code-analyzer
spec:
tools: [git, shell]
mcp_servers: [filesystem]
# Infra analyzer needs kubectl, not file access
- name: infra-analyzer
spec:
tools: [kubectl, prometheus_query]
3. Resource Management
Set appropriate limits for long investigations:
agents:
- name: code-analyzer
spec:
max_iterations: 8 # Limit per-agent iterations
temperature: 0.2 # Lower = more focused
coordination:
deep:
max_iterations: 5 # Limit total planning cycles
4. Secure MCP Configuration
Use environment variables for sensitive data:
mcp_servers:
- name: github
env:
GITHUB_TOKEN: "${GITHUB_TOKEN}" # Never hardcode tokens
Troubleshooting
MCP Server Won't Start
# Verify npx works
npx -y @modelcontextprotocol/server-filesystem --version
# Check Node.js version (requires 18+)
node --version
Investigation Not Converging
- Lower
confidence_threshold(e.g., 0.7) - Increase
max_iterations - Check if task is too broad - narrow the scope
Missing Findings
- Verify tool permissions (kubectl access, file paths)
- Check environment variables are set
- Review agent logs for errors
Complete Example
See the full example at:
- Fleet config:
examples/fleets/deep-analysis-fleet.yaml - Agent configs:
examples/agents/library/
Next Steps
- Fleet Reference - Complete fleet specification
- MCP Integration - Advanced MCP configuration
- Tools Reference - Available built-in tools