AOF Production Deployment Guide
Comprehensive guide for deploying the Agentic Ops Framework (AOF) in production environments
Table of Contents
- Prerequisites
- Deployment Options
- Platform Setup
- Configuration
- Security
- Monitoring
- Scaling
- Troubleshooting
1. Prerequisites
System Requirements
Minimum Requirements:
- CPU: 2 cores (4 cores recommended)
- RAM: 4 GB (8 GB recommended)
- Disk: 10 GB available space
- OS: Linux (Ubuntu 20.04+, RHEL 8+), macOS 11+, Windows Server 2019+
Network Requirements:
- Outbound HTTPS (443) for LLM provider APIs
- Inbound ports for webhook servers (default: 8080)
- Low latency connection to LLM providers (<100ms recommended)
Dependencies
Core Dependencies:
# Rust toolchain (1.75+)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup default stable
rustup update
# Build tools (Linux)
sudo apt-get update
sudo apt-get install -y build-essential pkg-config libssl-dev
# Build tools (macOS)
xcode-select --install
# Optional: Docker & Docker Compose
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
Runtime Dependencies:
- OpenSSL 1.1.1+ or 3.0+
- glibc 2.31+ (Linux)
- libc++ (macOS)
API Keys Required
LLM Provider Keys (at least one):
- Anthropic: ANTHROPIC_API_KEY
- OpenAI: OPENAI_API_KEY
- AWS Bedrock: AWS credentials configured
Platform Integration Keys (optional):
- WhatsApp Business: WHATSAPP_ACCESS_TOKEN, WHATSAPP_VERIFY_TOKEN
- Telegram: TELEGRAM_BOT_TOKEN
- Slack: SLACK_BOT_TOKEN, SLACK_SIGNING_SECRET
- Discord: DISCORD_BOT_TOKEN, DISCORD_PUBLIC_KEY
MCP Server Access (optional):
- Server-specific authentication tokens
- OAuth credentials for cloud-based MCP servers
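A startup sanity check can fail fast when none of the provider keys are present. A minimal sketch (the fail-fast policy is an assumption, and Bedrock may resolve credentials from sources other than the variable shown):
use std::env;

// Sketch: require at least one LLM provider credential at boot
// (variable names from the list above).
fn check_provider_keys() {
    let has_key = ["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "AWS_ACCESS_KEY_ID"]
        .iter()
        .any(|k| env::var(k).is_ok());
    if !has_key {
        eprintln!("no LLM provider credentials found; set at least one key");
        std::process::exit(1);
    }
}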
2. Deployment Options
Option A: Standalone Binary
Best for: Simple deployments, single-server setups, development
# Build release binary
cargo build --release --workspace
# Install aofctl globally
cargo install --path crates/aofctl
# Verify installation
aofctl --version
# Run agent
aofctl run --config /etc/aof/agent.yaml \
--input "Deploy application to production"
Production Setup:
# Create system user
sudo useradd -r -s /bin/false aof
# Install binary
sudo cp target/release/aofctl /usr/local/bin/
sudo chmod +x /usr/local/bin/aofctl
# Create directories
sudo mkdir -p /etc/aof /var/lib/aof /var/log/aof
sudo chown -R aof:aof /var/lib/aof /var/log/aof
# Create systemd service
sudo tee /etc/systemd/system/aof-agent.service > /dev/null <<'EOF'
[Unit]
Description=AOF Agent Service
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
User=aof
Group=aof
# Load API keys and settings from the env file described in section 4
EnvironmentFile=-/etc/aof/.env
ExecStart=/usr/local/bin/aofctl run --config /etc/aof/agent.yaml
Restart=always
RestartSec=10
StandardOutput=append:/var/log/aof/agent.log
StandardError=append:/var/log/aof/agent.err
# Security settings
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/lib/aof /var/log/aof
[Install]
WantedBy=multi-user.target
EOF
# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable aof-agent
sudo systemctl start aof-agent
Option B: Docker Container
Best for: Containerized environments, Kubernetes, cloud platforms
Dockerfile:
# Multi-stage build for minimal image size
FROM rust:1.75-slim-bookworm AS builder
WORKDIR /build
# Install dependencies
RUN apt-get update && apt-get install -y \
pkg-config \
libssl-dev \
&& rm -rf /var/lib/apt/lists/*
# Copy workspace
COPY . .
# Build release
RUN cargo build --release --workspace
# Runtime stage
FROM debian:bookworm-slim
# Install runtime dependencies (curl is required by the Compose health check below)
RUN apt-get update && apt-get install -y \
    ca-certificates \
    curl \
    libssl3 \
    && rm -rf /var/lib/apt/lists/*
# Create non-root user
RUN useradd -r -s /bin/false aof
# Copy binaries
COPY --from=builder /build/target/release/aofctl /usr/local/bin/
COPY --from=builder /build/target/release/aof-triggers /usr/local/bin/
# Set permissions
RUN chown aof:aof /usr/local/bin/aofctl /usr/local/bin/aof-triggers
USER aof
WORKDIR /app
EXPOSE 8080
CMD ["aofctl", "run", "--config", "/config/agent.yaml"]
Build and Run:
# Build image
docker build -t aof:latest .
# Run container
docker run -d \
--name aof-agent \
--restart unless-stopped \
-v $(pwd)/config:/config:ro \
-v aof-data:/data \
-e ANTHROPIC_API_KEY="${ANTHROPIC_API_KEY}" \
-e RUST_LOG=info \
-p 8080:8080 \
aof:latest
# Check logs
docker logs -f aof-agent
Docker Compose:
version: '3.8'
services:
aof-agent:
build: .
container_name: aof-agent
restart: unless-stopped
ports:
- "8080:8080"
volumes:
- ./config:/config:ro
- aof-data:/data
- aof-logs:/logs
environment:
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
- OPENAI_API_KEY=${OPENAI_API_KEY}
- RUST_LOG=info
- AOF_MEMORY_BACKEND=redis
- REDIS_URL=redis://redis:6379
depends_on:
- redis
networks:
- aof-network
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
interval: 30s
timeout: 10s
retries: 3
redis:
image: redis:7-alpine
container_name: aof-redis
restart: unless-stopped
volumes:
- redis-data:/data
networks:
- aof-network
command: redis-server --appendonly yes
aof-triggers:
build: .
container_name: aof-triggers
restart: unless-stopped
ports:
- "8081:8080"
volumes:
- ./config:/config:ro
environment:
- WHATSAPP_ACCESS_TOKEN=${WHATSAPP_ACCESS_TOKEN}
- TELEGRAM_BOT_TOKEN=${TELEGRAM_BOT_TOKEN}
- SLACK_BOT_TOKEN=${SLACK_BOT_TOKEN}
- DISCORD_BOT_TOKEN=${DISCORD_BOT_TOKEN}
networks:
- aof-network
command: ["/usr/local/bin/aof-triggers"]
volumes:
aof-data:
aof-logs:
redis-data:
networks:
aof-network:
driver: bridge
Option C: Kubernetes Deployment
Best for: Large-scale deployments, high availability, auto-scaling
Namespace:
apiVersion: v1
kind: Namespace
metadata:
name: aof-system
labels:
app.kubernetes.io/name: aof
ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
name: aof-config
namespace: aof-system
data:
agent.yaml: |
apiVersion: aof.dev/v1
kind: Agent
metadata:
name: production-agent
spec:
model:
provider: anthropic
model: claude-3-5-sonnet-20241022
tools:
- name: kubectl
type: mcp
config:
command: npx
args: ["-y", "kubectl-mcp"]
memory:
backend: redis
config:
url: redis://aof-redis:6379
Secret:
apiVersion: v1
kind: Secret
metadata:
name: aof-secrets
namespace: aof-system
type: Opaque
stringData:
anthropic-api-key: "sk-ant-..."
openai-api-key: "sk-..."
whatsapp-token: "..."
telegram-token: "..."
Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: aof-agent
namespace: aof-system
labels:
app: aof-agent
spec:
replicas: 3
selector:
matchLabels:
app: aof-agent
template:
metadata:
labels:
app: aof-agent
spec:
serviceAccountName: aof-agent
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
containers:
- name: agent
image: aof:latest
imagePullPolicy: Always
ports:
- containerPort: 8080
name: http
protocol: TCP
env:
- name: ANTHROPIC_API_KEY
valueFrom:
secretKeyRef:
name: aof-secrets
key: anthropic-api-key
- name: RUST_LOG
value: "info"
- name: AOF_MEMORY_BACKEND
value: "redis"
- name: REDIS_URL
value: "redis://aof-redis:6379"
volumeMounts:
- name: config
mountPath: /config
readOnly: true
- name: data
mountPath: /data
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "2000m"
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
volumes:
- name: config
configMap:
name: aof-config
- name: data
persistentVolumeClaim:
claimName: aof-data
Service:
apiVersion: v1
kind: Service
metadata:
name: aof-agent
namespace: aof-system
spec:
type: ClusterIP
selector:
app: aof-agent
ports:
- port: 80
targetPort: 8080
protocol: TCP
name: http
HorizontalPodAutoscaler:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: aof-agent
namespace: aof-system
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: aof-agent
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
Deploy to Kubernetes:
# Apply manifests
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/secret.yaml
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
kubectl apply -f k8s/hpa.yaml
# Verify deployment
kubectl get pods -n aof-system
kubectl logs -f deployment/aof-agent -n aof-system
# Check scaling
kubectl get hpa -n aof-system -w
Option D: Desktop App Distribution
Best for: End-user deployments, local development, GUI access
Build Desktop App:
# Build Tauri application
cd crates/aof-gui
npm install
npm run tauri build
# Output locations:
# - macOS: target/release/bundle/dmg/AOF.dmg
# - Windows: target/release/bundle/msi/AOF.msi
# - Linux: target/release/bundle/deb/aof_*.deb
macOS Distribution:
# Sign application
codesign --deep --force --verify --verbose \
--sign "Developer ID Application: Your Name" \
--options runtime \
target/release/bundle/macos/AOF.app
# Notarize
xcrun notarytool submit target/release/bundle/dmg/AOF.dmg \
--apple-id "your@email.com" \
--password "app-specific-password" \
--team-id "TEAM_ID"
# Staple ticket
xcrun stapler staple target/release/bundle/dmg/AOF.dmg
Windows Distribution:
# Sign MSI
signtool sign /f cert.pfx /p password /tr http://timestamp.digicert.com /td sha256 /fd sha256 target/release/bundle/msi/AOF.msi
Linux Distribution:
# Build AppImage
cd crates/aof-gui
cargo install cargo-appimage
cargo appimage
# Build Flatpak
flatpak-builder --repo=repo build-dir io.aof.App.yaml
3. Platform Setup
WhatsApp Business API Setup
Prerequisites:
- Meta Business Account
- WhatsApp Business Account
- Phone number not already registered on WhatsApp
Step 1: Create WhatsApp Business App
# Via Meta Developer Console
# 1. Go to https://developers.facebook.com/apps
# 2. Create App > Business > WhatsApp
# 3. Add WhatsApp product
# 4. Get test number or add your own
Step 2: Configure Webhook
# config/whatsapp.yaml
webhook_url: "https://your-domain.com/webhooks/whatsapp"
verify_token: "your-random-verify-token-here"
access_token: "EAAxxxxxxxxxxxxxxxx"
phone_number_id: "123456789"
Step 3: Set Environment Variables
export WHATSAPP_ACCESS_TOKEN="EAAxxxxxxxxxxxxxxxx"
export WHATSAPP_VERIFY_TOKEN="your-random-verify-token-here"
export WHATSAPP_PHONE_NUMBER_ID="123456789"
Step 4: Verify Webhook
# Meta will send GET request to verify
# Your server must return challenge parameter
curl "https://your-domain.com/webhooks/whatsapp?hub.mode=subscribe&hub.challenge=CHALLENGE&hub.verify_token=your-random-verify-token-here"
Step 5: Test Integration
# Start webhook server
cargo run -p aof-triggers
# Send test message via WhatsApp
# Server logs should show incoming webhook
Telegram Bot Creation
Step 1: Create Bot via BotFather
# 1. Open Telegram and search for @BotFather
# 2. Send /newbot
# 3. Follow prompts to set name and username
# 4. Save the bot token
Step 2: Configure Bot
# Set webhook
curl -X POST "https://api.telegram.org/bot<TOKEN>/setWebhook" \
-H "Content-Type: application/json" \
-d '{
"url": "https://your-domain.com/webhooks/telegram",
"secret_token": "your-secret-token"
}'
# Verify webhook
curl "https://api.telegram.org/bot<TOKEN>/getWebhookInfo"
Step 3: Environment Variables
export TELEGRAM_BOT_TOKEN="1234567890:ABCdefGHIjklMNOpqrsTUVwxyz"
export TELEGRAM_WEBHOOK_SECRET="your-secret-token"
Step 4: Bot Commands
# Set bot commands
curl -X POST "https://api.telegram.org/bot<TOKEN>/setMyCommands" \
-H "Content-Type: application/json" \
-d '{
"commands": [
{"command": "start", "description": "Start the agent"},
{"command": "help", "description": "Show help"},
{"command": "status", "description": "Check agent status"}
]
}'
Slack App Configuration
Step 1: Create Slack App
# 1. Go to https://api.slack.com/apps
# 2. Click "Create New App"
# 3. Choose "From scratch"
# 4. Enter app name and workspace
Step 2: Configure OAuth & Permissions
# Required Bot Token Scopes:
- chat:write
- chat:write.public
- commands
- im:history
- im:read
- im:write
- users:read
Step 3: Enable Event Subscriptions
# Request URL: https://your-domain.com/webhooks/slack
# Subscribe to bot events:
- message.im
- app_mention
Step 4: Install App to Workspace
# Get tokens from OAuth & Permissions page
export SLACK_BOT_TOKEN="xoxb-..."
export SLACK_SIGNING_SECRET="..."
export SLACK_APP_TOKEN="xapp-..." # For Socket Mode
Step 5: Create Slash Commands (Optional)
# Command: /aof
# Request URL: https://your-domain.com/webhooks/slack/commands
# Short Description: "Run AOF agent"
Discord Bot Setup
Step 1: Create Discord Application
# 1. Go to https://discord.com/developers/applications
# 2. Click "New Application"
# 3. Enter application name
# 4. Go to "Bot" section
# 5. Click "Add Bot"
Step 2: Configure Bot Permissions
# Required Permissions:
- Send Messages
- Read Message History
- Use Slash Commands
- Embed Links
- Attach Files
Step 3: Get Bot Token
export DISCORD_BOT_TOKEN="MTxxxxxxxxxxxxxxxxxx.xxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxx"
export DISCORD_PUBLIC_KEY="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
export DISCORD_APPLICATION_ID="123456789012345678"
Step 4: Register Slash Commands
curl -X POST \
"https://discord.com/api/v10/applications/${DISCORD_APPLICATION_ID}/commands" \
-H "Authorization: Bot ${DISCORD_BOT_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"name": "aof",
"description": "Execute AOF agent task",
"options": [
{
"name": "task",
"description": "Task description",
"type": 3,
"required": true
}
]
}'
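Note: "type": 3 in the payload above is Discord's STRING option type.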
Step 5: Invite Bot to Server
# Generate OAuth2 URL with bot scope and permissions
# https://discord.com/oauth2/authorize?client_id=CLIENT_ID&scope=bot+applications.commands&permissions=PERMISSIONS
4. Configuration
Environment Variables
Core Variables:
# LLM Provider Configuration
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."
# AWS Bedrock (uses standard AWS credentials instead of an API key)
export AWS_REGION="us-east-1"
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
# Application Settings
export RUST_LOG="info,aof_core=debug"
export AOF_CONFIG_PATH="/etc/aof/agent.yaml"
export AOF_DATA_DIR="/var/lib/aof"
# Memory Backend
export AOF_MEMORY_BACKEND="redis" # Options: memory, redis, sled, file
export REDIS_URL="redis://localhost:6379"
# Webhook Server
export AOF_WEBHOOK_HOST="0.0.0.0"
export AOF_WEBHOOK_PORT="8080"
# Platform Tokens
export WHATSAPP_ACCESS_TOKEN="..."
export TELEGRAM_BOT_TOKEN="..."
export SLACK_BOT_TOKEN="..."
export DISCORD_BOT_TOKEN="..."
Production .env File:
# /etc/aof/.env
ANTHROPIC_API_KEY=sk-ant-api03-xxxxx
RUST_LOG=info,aof_core=debug,aof_runtime=info
AOF_MEMORY_BACKEND=redis
REDIS_URL=redis://:password@localhost:6379/0
AOF_WEBHOOK_HOST=0.0.0.0
AOF_WEBHOOK_PORT=8080
WHATSAPP_ACCESS_TOKEN=EAAxxxxx
TELEGRAM_BOT_TOKEN=123456:ABCxxx
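If the AOF binaries load this file themselves (an assumption; dotenvy is one common choice), the pattern is:
// Sketch: load /etc/aof/.env if present; variables already set in the
// environment take precedence over values in the file.
fn load_env() {
    let _ = dotenvy::from_path("/etc/aof/.env");
}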
YAML Configuration
Agent Configuration:
# /etc/aof/agent.yaml
apiVersion: aof.dev/v1
kind: Agent
metadata:
name: production-devops-agent
labels:
environment: production
team: platform
spec:
# Model configuration
model:
provider: anthropic # Options: anthropic, openai, bedrock
model: claude-3-5-sonnet-20241022
temperature: 0.7
max_tokens: 4096
timeout_seconds: 300
# Tool configuration
tools:
- name: kubectl
type: mcp
config:
command: npx
args: ["-y", "kubectl-mcp"]
transport: stdio
- name: aws-cli
type: mcp
config:
command: npx
args: ["-y", "aws-mcp"]
transport: stdio
- name: terraform
type: custom
config:
command: terraform
allowed_commands: ["plan", "apply", "destroy"]
# Memory configuration
memory:
backend: redis
config:
url: redis://localhost:6379
db: 0
pool_size: 10
timeout_seconds: 5
ttl_seconds: 3600
# Execution limits
limits:
max_iterations: 50
max_execution_time_seconds: 1800
max_memory_mb: 2048
# Retry configuration
retry:
max_attempts: 3
backoff_seconds: 5
exponential: true
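For reference, the schedule implied by this retry block can be sketched as follows (the exact semantics of exponential are an assumption; here attempt n waits backoff_seconds * 2^n, i.e. 5s, 10s, 20s):
use std::time::Duration;

// Assumed semantics of the retry block above.
fn retry_delay(attempt: u32, backoff_seconds: u64, exponential: bool) -> Duration {
    let secs = if exponential {
        backoff_seconds.saturating_mul(2u64.saturating_pow(attempt))
    } else {
        backoff_seconds
    };
    Duration::from_secs(secs)
}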
Multi-Agent Configuration:
# /etc/aof/agents.yaml
apiVersion: aof.dev/v1
kind: AgentList
agents:
- metadata:
name: k8s-operator
spec:
model:
provider: anthropic
model: claude-3-5-sonnet-20241022
tools: [kubectl, helm]
- metadata:
name: aws-architect
spec:
model:
provider: bedrock
model: anthropic.claude-3-5-sonnet-20241022-v2:0
tools: [aws-cli, terraform]
- metadata:
name: security-auditor
spec:
model:
provider: openai
model: gpt-4
tools: [trivy, kubesec]
Secrets Management
Option 1: Environment Variables
# Load from secure vault
export ANTHROPIC_API_KEY=$(vault kv get -field=key secret/aof/anthropic)
export OPENAI_API_KEY=$(vault kv get -field=key secret/aof/openai)
Option 2: Kubernetes Secrets
# Create secret from file
kubectl create secret generic aof-secrets \
--from-file=anthropic-key=/path/to/key \
--from-file=openai-key=/path/to/key \
-n aof-system
# Use in deployment
env:
- name: ANTHROPIC_API_KEY
valueFrom:
secretKeyRef:
name: aof-secrets
key: anthropic-key
Option 3: AWS Secrets Manager
// Load at runtime (sketch using the AWS SDK for Rust)
use aws_config::load_from_env;
use aws_sdk_secretsmanager::Client;

async fn load_secret(name: &str) -> String {
    let config = load_from_env().await;
    let client = Client::new(&config);
    let resp = client
        .get_secret_value()
        .secret_id(name)
        .send()
        .await
        .expect("failed to fetch secret from Secrets Manager");
    resp.secret_string()
        .expect("secret has no string value")
        .to_string()
}
Option 4: HashiCorp Vault
# Initialize Vault
vault kv put secret/aof/anthropic key="sk-ant-..."
vault kv put secret/aof/openai key="sk-..."
# Create policy
vault policy write aof-policy - <<EOF
path "secret/data/aof/*" {
capabilities = ["read"]
}
EOF
# Generate token
vault token create -policy=aof-policy
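Outside the CLI, the same secret can be read over Vault's KV v2 HTTP API. A sketch in Rust, assuming VAULT_ADDR and VAULT_TOKEN are set and using reqwest with its "json" feature (not an AOF dependency):
use serde_json::Value;

// Reads one field of a KV v2 secret, e.g. vault_read("aof/anthropic", "key").
async fn vault_read(path: &str, field: &str) -> Result<String, Box<dyn std::error::Error>> {
    let addr = std::env::var("VAULT_ADDR")?;
    let token = std::env::var("VAULT_TOKEN")?;
    // KV v2 secrets live under secret/data/<path>
    let url = format!("{}/v1/secret/data/{}", addr, path);
    let resp: Value = reqwest::Client::new()
        .get(url)
        .header("X-Vault-Token", token)
        .send()
        .await?
        .json()
        .await?;
    // KV v2 nests the payload under data.data
    resp["data"]["data"][field]
        .as_str()
        .map(str::to_owned)
        .ok_or_else(|| "field not found".into())
}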
MCP Server Setup
Stdio Transport:
tools:
- name: filesystem
type: mcp
config:
command: npx
args: ["-y", "@modelcontextprotocol/server-filesystem", "/allowed/path"]
transport: stdio
HTTP Transport:
tools:
- name: remote-api
type: mcp
config:
url: "https://mcp.example.com"
transport: http
headers:
Authorization: "Bearer ${MCP_API_KEY}"
SSE Transport:
tools:
- name: streaming-api
type: mcp
config:
url: "https://mcp.example.com/sse"
transport: sse
reconnect: true
reconnect_delay_ms: 1000
5. Security
Webhook Signature Verification
WhatsApp Signature Verification:
use hmac::{Hmac, Mac};
use sha2::Sha256;
fn verify_whatsapp_signature(
payload: &[u8],
signature: &str,
app_secret: &str,
) -> bool {
type HmacSha256 = Hmac<Sha256>;
let expected = signature.strip_prefix("sha256=").unwrap_or(signature);
let mut mac = HmacSha256::new_from_slice(app_secret.as_bytes())
.expect("HMAC can take key of any size");
mac.update(payload);
let result = mac.finalize();
let code_bytes = result.into_bytes();
hex::encode(code_bytes) == expected
}
Telegram Signature Verification:
// Telegram echoes the secret_token set via setWebhook back in the
// X-Telegram-Bot-Api-Secret-Token header, so a direct comparison suffices
// (prefer a constant-time compare in production).
fn verify_telegram_signature(
    secret_token: &str,
    header_token: &str,
) -> bool {
    secret_token == header_token
}
Slack Signature Verification:
use hmac::{Hmac, Mac};
use sha2::Sha256;
use std::time::{SystemTime, UNIX_EPOCH};

fn verify_slack_signature(
    signing_secret: &str,
    timestamp: &str,
    body: &str,
    signature: &str,
) -> bool {
    type HmacSha256 = Hmac<Sha256>;
    // Reject stale requests (Slack recommends a five-minute window
    // to limit replay attacks)
    let Ok(ts) = timestamp.parse::<i64>() else {
        return false;
    };
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs() as i64)
        .unwrap_or(0);
    if (now - ts).abs() > 300 {
        return false;
    }
    let sig_basestring = format!("v0:{}:{}", timestamp, body);
    let mut mac = HmacSha256::new_from_slice(signing_secret.as_bytes())
        .expect("HMAC can take key of any size");
    mac.update(sig_basestring.as_bytes());
    let computed = format!("v0={}", hex::encode(mac.finalize().into_bytes()));
    // Prefer a constant-time comparison in production
    computed == signature
}
Discord Signature Verification:
use ed25519_dalek::{PublicKey, Signature, Verifier};

fn verify_discord_signature(
    public_key: &str,
    signature: &str,
    timestamp: &str,
    body: &str,
) -> bool {
    let message = format!("{}{}", timestamp, body);
    // Reject malformed input instead of panicking on untrusted data
    let Ok(public_key_bytes) = hex::decode(public_key) else {
        return false;
    };
    let Ok(signature_bytes) = hex::decode(signature) else {
        return false;
    };
    let Ok(public_key) = PublicKey::from_bytes(&public_key_bytes) else {
        return false;
    };
    let Ok(signature) = Signature::from_bytes(&signature_bytes) else {
        return false;
    };
    public_key.verify(message.as_bytes(), &signature).is_ok()
}
Rate Limiting
Application-Level Rate Limiting:
use governor::{Quota, RateLimiter};
use std::num::NonZeroU32;
// Create rate limiter: 100 requests per minute
let quota = Quota::per_minute(NonZeroU32::new(100).unwrap());
let limiter = RateLimiter::direct(quota);
// Check rate limit
if limiter.check().is_err() {
return Err("Rate limit exceeded");
}
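For chat-style triggers, a keyed limiter throttles each sender independently. A sketch (sender_id is a hypothetical per-user key extracted from the webhook payload):
use governor::{Quota, RateLimiter};
use std::num::NonZeroU32;

// Per-sender limits, so one noisy chat cannot starve the rest
let per_user = Quota::per_minute(NonZeroU32::new(10).unwrap());
let keyed = RateLimiter::keyed(per_user);
if keyed.check_key(&sender_id).is_err() {
    // drop or defer this sender's message
}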
Nginx Rate Limiting:
http {
limit_req_zone $binary_remote_addr zone=webhook_limit:10m rate=10r/s;
server {
location /webhooks/ {
limit_req zone=webhook_limit burst=20 nodelay;
proxy_pass http://aof-backend;
}
}
}
Kubernetes Rate Limiting (Ingress):
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: aof-ingress
annotations:
nginx.ingress.kubernetes.io/limit-rps: "10"
nginx.ingress.kubernetes.io/limit-connections: "100"
spec:
rules:
- host: aof.example.com
http:
paths:
- path: /webhooks
pathType: Prefix
backend:
service:
name: aof-agent
port:
number: 80
API Key Rotation
Automated Rotation Script:
#!/bin/bash
# rotate-keys.sh
# Rotate Anthropic API key
NEW_KEY=$(vault kv get -field=key secret/aof/anthropic-new)
kubectl set env deployment/aof-agent \
ANTHROPIC_API_KEY="${NEW_KEY}" \
-n aof-system
# Wait for rollout
kubectl rollout status deployment/aof-agent -n aof-system
# Archive old key
vault kv put secret/aof/anthropic-archived \
key="$(vault kv get -field=key secret/aof/anthropic)" \
rotated_at="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
# Update current key
vault kv put secret/aof/anthropic key="${NEW_KEY}"
echo "API key rotation complete"
Rotation Schedule (Cron):
# /etc/cron.d/aof-key-rotation
0 2 1 * * /usr/local/bin/rotate-keys.sh >> /var/log/aof/key-rotation.log 2>&1
Network Policies
Kubernetes NetworkPolicy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: aof-agent-network-policy
namespace: aof-system
spec:
podSelector:
matchLabels:
app: aof-agent
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
ports:
- protocol: TCP
port: 8080
egress:
# Allow DNS
- to:
- namespaceSelector:
matchLabels:
name: kube-system
ports:
- protocol: UDP
port: 53
# Allow Redis
- to:
- podSelector:
matchLabels:
app: aof-redis
ports:
- protocol: TCP
port: 6379
  # Allow HTTPS to LLM providers (a podSelector cannot match external
  # endpoints; egress to the internet needs an ipBlock)
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
    ports:
    - protocol: TCP
      port: 443
Firewall Rules (iptables):
# Allow established connections first so replies to outbound requests are not dropped
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
# Allow SSH (adjust to your management setup) and incoming webhook traffic
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT
# Allow outgoing HTTPS to LLM providers
iptables -A OUTPUT -p tcp --dport 443 -j ACCEPT
# Drop all other incoming traffic
iptables -A INPUT -j DROP
6. Monitoring
Logging Configuration
Structured Logging:
# Environment variable configuration
export RUST_LOG="info,aof_core=debug,aof_runtime=info,aof_llm=debug"
# JSON logging for production
export RUST_LOG_FORMAT="json"
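If the binaries use the tracing ecosystem (an assumption), the JSON wiring looks like this:
use tracing_subscriber::EnvFilter;

// Sketch: emit structured JSON logs, honoring the RUST_LOG filter above.
// Requires tracing-subscriber with the "json" and "env-filter" features.
fn init_logging() {
    tracing_subscriber::fmt()
        .json()
        .with_env_filter(EnvFilter::from_default_env())
        .init();
}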
Log Rotation:
# /etc/logrotate.d/aof
/var/log/aof/*.log {
daily
rotate 30
compress
delaycompress
notifempty
create 0644 aof aof
sharedscripts
postrotate
systemctl reload aof-agent > /dev/null 2>&1 || true
endscript
}
Centralized Logging (Fluent Bit):
# fluent-bit.conf
[SERVICE]
Flush 5
Daemon Off
Log_Level info
[INPUT]
Name tail
Path /var/log/aof/*.log
Parser json
Tag aof.*
[OUTPUT]
Name es
Match aof.*
Host elasticsearch
Port 9200
Index aof-logs
Type _doc
Kubernetes Logging (Fluentd):
apiVersion: v1
kind: ConfigMap
metadata:
name: fluentd-config
namespace: kube-system
data:
fluent.conf: |
<source>
@type tail
path /var/log/containers/aof-agent*.log
pos_file /var/log/fluentd-aof.log.pos
tag kubernetes.aof
<parse>
@type json
time_key time
time_format %Y-%m-%dT%H:%M:%S.%NZ
</parse>
</source>
<match kubernetes.aof>
@type elasticsearch
host elasticsearch.logging.svc.cluster.local
port 9200
index_name aof-logs
type_name _doc
</match>
Metrics Collection
Prometheus Metrics:
use lazy_static::lazy_static;
use prometheus::{Counter, Histogram, HistogramOpts, Registry};

lazy_static! {
    static ref REGISTRY: Registry = Registry::new();
    static ref AGENT_EXECUTIONS: Counter = Counter::new(
        "aof_agent_executions_total",
        "Total number of agent executions"
    ).unwrap();
    static ref EXECUTION_DURATION: Histogram = Histogram::with_opts(
        HistogramOpts::new(
            "aof_execution_duration_seconds",
            "Agent execution duration in seconds"
        )
    ).unwrap();
    static ref LLM_REQUESTS: Counter = Counter::new(
        "aof_llm_requests_total",
        "Total number of LLM API requests"
    ).unwrap();
    static ref LLM_TOKENS: Counter = Counter::new(
        "aof_llm_tokens_total",
        "Total number of tokens consumed"
    ).unwrap();
}

// Register each metric once at startup so REGISTRY.gather() can see it
fn register_metrics() {
    REGISTRY.register(Box::new(AGENT_EXECUTIONS.clone())).unwrap();
    REGISTRY.register(Box::new(EXECUTION_DURATION.clone())).unwrap();
    REGISTRY.register(Box::new(LLM_REQUESTS.clone())).unwrap();
    REGISTRY.register(Box::new(LLM_TOKENS.clone())).unwrap();
}

// Export metrics endpoint (e.g., an actix-web handler)
async fn metrics_handler() -> impl Responder {
    let encoder = prometheus::TextEncoder::new();
    let metric_families = REGISTRY.gather();
    encoder.encode_to_string(&metric_families).unwrap()
}
Prometheus Scrape Config:
# prometheus.yml
scrape_configs:
- job_name: 'aof-agents'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- aof-system
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
regex: aof-agent
action: keep
- source_labels: [__meta_kubernetes_pod_ip]
target_label: __address__
replacement: ${1}:8080
ServiceMonitor (Prometheus Operator):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: aof-agent
namespace: aof-system
spec:
selector:
matchLabels:
app: aof-agent
endpoints:
- port: http
path: /metrics
interval: 30s
Alerting Setup
Prometheus Alerts:
# alerts.yml
groups:
- name: aof-alerts
interval: 30s
rules:
- alert: AOFAgentDown
expr: up{job="aof-agents"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "AOF agent is down"
description: "AOF agent \{\{ $labels.instance \}\} has been down for 5 minutes"
- alert: AOFHighErrorRate
expr: rate(aof_agent_errors_total[5m]) > 0.1
for: 10m
labels:
severity: warning
annotations:
summary: "High error rate in AOF agent"
description: "Error rate is \{\{ $value \}\} errors/sec on \{\{ $labels.instance \}\}"
- alert: AOFHighLatency
expr: histogram_quantile(0.95, aof_execution_duration_seconds_bucket) > 30
for: 15m
labels:
severity: warning
annotations:
summary: "High execution latency"
description: "P95 latency is \{\{ $value \}\}s on \{\{ $labels.instance \}\}"
- alert: AOFHighTokenUsage
expr: rate(aof_llm_tokens_total[1h]) > 100000
for: 1h
labels:
severity: info
annotations:
summary: "High LLM token usage"
description: "Token usage is \{\{ $value \}\} tokens/sec"
AlertManager Config:
# alertmanager.yml
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'cluster']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
- match:
severity: critical
receiver: pagerduty
- match:
severity: warning
receiver: slack
receivers:
- name: 'default'
email_configs:
- to: 'team@example.com'
from: 'alerts@example.com'
smarthost: 'smtp.gmail.com:587'
auth_username: 'alerts@example.com'
auth_password: 'password'
- name: 'slack'
slack_configs:
- api_url: 'https://hooks.slack.com/services/xxx'
channel: '#alerts'
title: 'AOF Alert: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'pagerduty'
pagerduty_configs:
- service_key: 'xxx'
description: '{{ .GroupLabels.alertname }}'
Health Checks
Application Health Check:
use axum::{http::StatusCode, routing::get, Router};

async fn health_check() -> &'static str {
    "OK"
}

async fn readiness_check() -> Result<&'static str, StatusCode> {
    // redis_available / llm_provider_available are application-specific
    // dependency probes (see the sketch below)
    if redis_available().await && llm_provider_available().await {
        Ok("READY")
    } else {
        Err(StatusCode::SERVICE_UNAVAILABLE)
    }
}

let app = Router::new()
    .route("/health", get(health_check))
    .route("/ready", get(readiness_check));
Kubernetes Probes:
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
successThreshold: 1
failureThreshold: 3
External Health Monitoring:
# UptimeRobot-style check
*/5 * * * * curl -f https://aof.example.com/health || echo "Health check failed" | mail -s "AOF Down" alerts@example.com
7. Scaling
Horizontal Scaling
Docker Compose Scale:
# Scale to 5 replicas (first remove the fixed container_name from the
# compose file; Compose cannot scale a service with a pinned name)
docker-compose up -d --scale aof-agent=5
# With load balancer
docker-compose -f docker-compose.yml -f docker-compose.scale.yml up -d
Kubernetes HPA:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: aof-agent
namespace: aof-system
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: aof-agent
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
- type: Pods
pods:
metric:
name: aof_execution_duration_seconds
target:
type: AverageValue
averageValue: "10"
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 50
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 30
- type: Pods
value: 2
periodSeconds: 30
selectPolicy: Max
KEDA (Event-Driven Autoscaling):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: aof-agent-scaler
namespace: aof-system
spec:
scaleTargetRef:
name: aof-agent
minReplicaCount: 3
maxReplicaCount: 20
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus:9090
metricName: aof_queue_depth
threshold: '10'
query: sum(aof_pending_tasks)
Load Balancing
Nginx Load Balancer:
upstream aof_backend {
least_conn;
server aof-1:8080 max_fails=3 fail_timeout=30s;
server aof-2:8080 max_fails=3 fail_timeout=30s;
server aof-3:8080 max_fails=3 fail_timeout=30s;
keepalive 32;
}
server {
listen 80;
server_name aof.example.com;
location / {
proxy_pass http://aof_backend;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_connect_timeout 60s;
proxy_send_timeout 300s;
proxy_read_timeout 300s;
}
}
HAProxy Configuration:
global
maxconn 4096
daemon
defaults
mode http
timeout connect 10s
timeout client 300s
timeout server 300s
frontend aof_frontend
bind *:80
default_backend aof_backend
backend aof_backend
balance leastconn
option httpchk GET /health
http-check expect status 200
server aof-1 10.0.1.10:8080 check inter 5s rise 2 fall 3
server aof-2 10.0.1.11:8080 check inter 5s rise 2 fall 3
server aof-3 10.0.1.12:8080 check inter 5s rise 2 fall 3
Database Configuration
Redis Cluster:
# docker-compose.redis-cluster.yml
version: '3.8'
services:
redis-node-1:
image: redis:7-alpine
command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --appendonly yes
volumes:
- redis-1:/data
redis-node-2:
image: redis:7-alpine
command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --appendonly yes
volumes:
- redis-2:/data
redis-node-3:
image: redis:7-alpine
command: redis-server --cluster-enabled yes --cluster-config-file nodes.conf --cluster-node-timeout 5000 --appendonly yes
volumes:
- redis-3:/data
redis-cluster-init:
image: redis:7-alpine
command: redis-cli --cluster create redis-node-1:6379 redis-node-2:6379 redis-node-3:6379 --cluster-replicas 0 --cluster-yes
depends_on:
- redis-node-1
- redis-node-2
- redis-node-3
volumes:
redis-1:
redis-2:
redis-3:
Redis Sentinel (High Availability):
# sentinel.conf
sentinel monitor aof-master 10.0.1.10 6379 2
sentinel down-after-milliseconds aof-master 5000
sentinel parallel-syncs aof-master 1
sentinel failover-timeout aof-master 10000
Application Redis Config:
memory:
backend: redis
config:
# Sentinel configuration
sentinels:
- host: sentinel-1
port: 26379
- host: sentinel-2
port: 26379
- host: sentinel-3
port: 26379
master_name: aof-master
db: 0
pool_size: 20
timeout_seconds: 5
Caching Strategies
Redis Caching Layer:
use redis::{Client, Commands};

struct CachedLLMProvider {
    provider: Box<dyn Model>,
    cache: Client,
}

impl CachedLLMProvider {
    async fn generate(&self, request: &ModelRequest) -> AofResult<ModelResponse> {
        // hash() is an application-provided digest of the message list
        let cache_key = format!("llm:{}:{}", request.model, hash(&request.messages));
        // Commands run on a connection, not on the Client itself
        let mut conn = self.cache.get_connection()?;
        // Check cache
        if let Ok(cached) = conn.get::<_, String>(&cache_key) {
            return Ok(serde_json::from_str(&cached)?);
        }
        // Generate response
        let response = self.provider.generate(request).await?;
        // Cache response (TTL: 1 hour)
        let _: () = conn.set_ex(
            &cache_key,
            serde_json::to_string(&response)?,
            3600
        )?;
        Ok(response)
    }
}
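Note that exact-match caching only pays off when identical requests recur (webhook retries, repeated slash commands); with temperature above zero it also pins one sampled response for the length of the TTL, which may or may not be desirable.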
CDN Caching (CloudFlare):
# Add cache headers for static content
location /static/ {
expires 1y;
add_header Cache-Control "public, immutable";
}
# Cache API responses
location /api/public/ {
proxy_cache api_cache;
proxy_cache_valid 200 10m;
proxy_cache_key "$request_uri";
add_header X-Cache-Status $upstream_cache_status;
}
8. Troubleshooting
Common Issues
Issue 1: Agent Execution Timeout
Error: Agent execution timed out after 300 seconds
Solution:
# Increase timeout in agent.yaml
spec:
limits:
max_execution_time_seconds: 1800 # 30 minutes
Issue 2: Memory Exhausted
Error: Cannot allocate memory
Solution:
# Check memory usage
docker stats
# Increase container memory
docker run -m 4g aof:latest
# Kubernetes
resources:
limits:
memory: "4Gi"
Issue 3: Redis Connection Failed
Error: Connection refused (redis://localhost:6379)
Solution:
# Check Redis status
redis-cli ping
# Verify connection
telnet localhost 6379
# Check network
docker network inspect aof-network
# Update connection string
export REDIS_URL="redis://redis-host:6379"
Issue 4: LLM API Rate Limit
Error: Rate limit exceeded (429)
Solution:
// Implement exponential backoff
use tokio::time::{sleep, Duration};

for attempt in 0..5 {
    match provider.generate(request).await {
        Ok(response) => return Ok(response),
        Err(e) if e.is_rate_limit() => {
            // Waits 1s, 2s, 4s, 8s, 16s before the next attempt
            let delay = 2_u64.pow(attempt) * 1000;
            sleep(Duration::from_millis(delay)).await;
        }
        Err(e) => return Err(e),
    }
}
// After the loop, surface a rate-limit error to the caller
Issue 5: Webhook Signature Verification Failed
Error: Invalid signature
Solution:
# Confirm the configured secrets (the verify token only answers the GET
# challenge; the HMAC signature is computed with the Meta app secret)
echo $WHATSAPP_VERIFY_TOKEN
# Check webhook payload
tail -f /var/log/aof/webhooks.log
# Test signature locally
curl -X POST http://localhost:8080/webhooks/whatsapp \
-H "X-Hub-Signature-256: sha256=..." \
-d @payload.json
Debug Mode
Enable Debug Logging:
# Full debug
export RUST_LOG="debug"
# Selective debug
export RUST_LOG="info,aof_core=debug,aof_llm=trace"
# Force colored output (env_logger style control)
export RUST_LOG="info,aof_core=debug"
export RUST_LOG_STYLE="always"
Interactive Debugging:
# Run with debugger
rust-lldb target/debug/aofctl
# Set breakpoints
b aof_core::agent::execute
run --config agent.yaml
# Inspect variables
p request
p response
Trace Network Calls:
# Enable HTTP client tracing without silencing application logs
export RUST_LOG="info,reqwest=trace"
# Capture with tcpdump
sudo tcpdump -i any -w aof-traffic.pcap port 443
# Analyze with wireshark
wireshark aof-traffic.pcap
Log Analysis
Parse JSON Logs:
# Extract errors
jq 'select(.level == "ERROR")' /var/log/aof/agent.log
# Count by error type
jq -r '.error_type' /var/log/aof/agent.log | sort | uniq -c
# Filter by time range
jq 'select(.timestamp > "2024-01-01T00:00:00Z")' /var/log/aof/agent.log
Elasticsearch Queries:
{
"query": {
"bool": {
"must": [
{ "match": { "app": "aof-agent" } },
{ "range": { "@timestamp": { "gte": "now-1h" } } }
],
"should": [
{ "match": { "level": "ERROR" } },
{ "match": { "level": "WARN" } }
]
}
},
"aggs": {
"errors_by_type": {
"terms": { "field": "error_type.keyword" }
}
}
}
Loki LogQL:
{app="aof-agent"}
|= "ERROR"
| json
| level="ERROR"
| line_format "{{.timestamp}} {{.error_type}}: {{.message}}"
Performance Tuning
Profile CPU Usage:
# Install flamegraph
cargo install flamegraph
# Profile application (-- separates flamegraph's own flags from the target command)
sudo flamegraph -- target/release/aofctl run --config agent.yaml
# View flamegraph.svg in browser
Memory Profiling:
# Use valgrind
valgrind --tool=massif --massif-out-file=massif.out \
target/release/aofctl run --config agent.yaml
# Analyze
ms_print massif.out
Benchmark Performance:
# Load testing with k6
k6 run - <<EOF
import http from 'k6/http';
import { check } from 'k6';
export let options = {
stages: [
{ duration: '2m', target: 100 },
{ duration: '5m', target: 100 },
{ duration: '2m', target: 0 },
],
};
export default function() {
let res = http.post('http://localhost:8080/webhooks/test',
JSON.stringify({ message: 'test' }),
{ headers: { 'Content-Type': 'application/json' } }
);
check(res, { 'status is 200': (r) => r.status === 200 });
}
EOF
Database Performance:
# Redis slowlog
redis-cli slowlog get 10
# Monitor commands
redis-cli monitor
# Benchmark
redis-benchmark -h localhost -p 6379 -c 50 -n 10000
LLM Provider Latency:
use std::time::Instant;
let start = Instant::now();
let response = provider.generate(request).await?;
let duration = start.elapsed();
metrics::histogram!("aof_llm_latency_seconds", duration.as_secs_f64());
Appendix
Quick Reference
Essential Commands:
# Build
cargo build --release --workspace
# Run agent
aofctl run --config agent.yaml --input "task"
# Run webhook server
aof-triggers --config triggers.yaml
# Check logs
journalctl -u aof-agent -f
# Health check
curl http://localhost:8080/health
# Metrics
curl http://localhost:8080/metrics
Environment Variables:
ANTHROPIC_API_KEY # Anthropic API key
OPENAI_API_KEY # OpenAI API key
RUST_LOG # Logging level
AOF_MEMORY_BACKEND # Memory backend (memory, redis, sled, file)
REDIS_URL # Redis connection string
AOF_WEBHOOK_PORT # Webhook server port
Default Ports:
- 8080 - Main application / webhook server
- 8081 - Metrics endpoint
- 6379 - Redis
- 9090 - Prometheus
- 3000 - Grafana
Support Resources
- Documentation: https://github.com/your-org/aof/docs
- Issues: https://github.com/your-org/aof/issues
- Discord: https://discord.gg/aof
- Email: support@aof.dev
Last Updated: 2024-12-10
Version: 1.0.0