# Kubernetes Agents

Production-ready agents for Kubernetes operations and debugging.

## Overview
| Agent | Purpose | Tools |
|---|---|---|
| pod-doctor | Diagnose pod issues | kubectl |
| hpa-tuner | Optimize autoscaling | kubectl |
| netpol-debugger | Debug network policies | kubectl |
| yaml-linter | Validate K8s manifests | kubectl |
| resource-optimizer | Right-size resources | kubectl |
## pod-doctor

Diagnoses pod issues including CrashLoopBackOff, ImagePullBackOff, OOMKilled, and other common failures.

### Usage

```bash
# Diagnose pods in a namespace
aofctl run agent library://kubernetes/pod-doctor \
  --prompt "Debug failing pods in namespace production"

# Diagnose a specific pod
aofctl run agent library://kubernetes/pod-doctor \
  --prompt "Why is pod api-server-abc123 in CrashLoopBackOff?"
```
### Capabilities
- Analyzes pod events and status conditions
- Examines container logs for error patterns
- Checks resource constraints (CPU, memory limits)
- Identifies image pull issues
- Detects OOMKilled patterns
- Recommends fixes based on diagnosis
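
If you want to reproduce parts of the diagnosis by hand, the same signals are available through plain kubectl (the pod and namespace names below are the ones from the example output that follows):

```bash
# Events and status conditions for the pod
kubectl describe pod api-server-abc123 -n production

# Logs from the previously crashed container
kubectl logs api-server-abc123 -n production --previous

# Recent warning events in the namespace
kubectl get events -n production --field-selector type=Warning
```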
### Example Output
## Pod Diagnosis: api-server-abc123
**Status**: CrashLoopBackOff
**Namespace**: production
**Node**: node-worker-01
### Root Cause
The pod is experiencing OOMKilled events. Container memory limit
(256Mi) is insufficient for the application's memory requirements.
### Evidence
- Container terminated 5 times in the last 10 minutes
- Exit code: 137 (SIGKILL from OOM)
- Memory usage peaked at 254Mi before termination
### Recommendations
1. Increase memory limit to 512Mi
2. Add memory request of 384Mi
3. Consider enabling memory profiling
### Fix Command
```bash
kubectl patch deployment api-server -n production \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"api","resources":{"limits":{"memory":"512Mi"},"requests":{"memory":"384Mi"}}}]}}}}'
```
## hpa-tuner

Optimizes Horizontal Pod Autoscaler configurations for efficient scaling.

### Usage

```bash
# Analyze HPAs in a namespace
aofctl run agent library://kubernetes/hpa-tuner \
  --prompt "Optimize HPA settings for namespace production"

# Tune a specific HPA
aofctl run agent library://kubernetes/hpa-tuner \
  --prompt "Why is the api-server HPA not scaling up?"
```
### Capabilities
- Analyzes HPA metrics and scaling history
- Identifies suboptimal scaling thresholds
- Detects HPAs that never scale (over-provisioned)
- Recommends target CPU/memory utilization
- Suggests min/max replica counts
- Analyzes scaling behavior patterns
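
The raw data behind this kind of analysis can also be pulled directly; for example:

```bash
# Target utilization, replica range, and recent scaling events
kubectl describe hpa api-server -n production

# Current metrics vs. targets for every HPA in the namespace
kubectl get hpa -n production
```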
### Example Output
## HPA Analysis: api-server
**Current Configuration**:
- Min Replicas: 2
- Max Replicas: 10
- CPU Target: 80%
**Observed Behavior** (last 7 days):
- Actual replicas: 2 (never scaled)
- Avg CPU utilization: 15%
- Peak CPU utilization: 35%
### Issues Detected
1. HPA is over-provisioned - never scales beyond minimum
2. CPU target (80%) is never reached
### Recommendations
1. Reduce min replicas to 1 (save 50% baseline cost)
2. Lower CPU target to 50% for more responsive scaling
3. Consider adding memory-based scaling
### Optimized Configuration
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:  # required field: the workload this HPA scales
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```
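
One way to roll out a tuned configuration like this and watch the effect (the file name here is illustrative):

```bash
# Apply the updated HPA and watch replica counts react
kubectl apply -f hpa.yaml
kubectl get hpa api-server -n production -w
```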
---
## netpol-debugger
Debugs Kubernetes NetworkPolicy issues and connectivity problems.
### Usage
```bash
# Debug connectivity between pods
aofctl run agent library://kubernetes/netpol-debugger \
  --prompt "Why can't frontend pods connect to api-server?"

# Analyze network policies
aofctl run agent library://kubernetes/netpol-debugger \
  --prompt "List all network policies affecting namespace production"
```
### Capabilities
- Traces network connectivity between pods
- Identifies blocking NetworkPolicies
- Analyzes ingress/egress rules
- Detects missing allow rules
- Recommends policy fixes
- Visualizes policy relationships
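
To inspect policies yourself before or after running the agent (names taken from the example output below):

```bash
# All policies in the destination namespace
kubectl get networkpolicy -n api

# Full rule set of the suspect policy
kubectl describe networkpolicy api-server-ingress -n api
```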
### Example Output
## Network Policy Debug: frontend → api-server
**Source**: frontend-abc123 (namespace: web)
**Destination**: api-server-xyz789 (namespace: api, port: 8080)
### Connectivity Status: BLOCKED
### Blocking Policy
NetworkPolicy `api-server-ingress` in namespace `api` is blocking traffic.
**Policy Analysis**:
```yaml
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            network-zone: internal
```

The source namespace `web` does not have the label `network-zone: internal`.
### Resolution Options

**Option 1**: Label the source namespace

```bash
kubectl label namespace web network-zone=internal
```

**Option 2**: Update the NetworkPolicy to allow traffic from the `web` namespace (the `kubernetes.io/metadata.name` label is set automatically on every namespace, so it is a reliable selector)

```yaml
ingress:
  - from:
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: web
```
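
Either option can be verified with a quick in-cluster probe; a minimal sketch, assuming the frontend image ships curl and the API is exposed as a Service named api-server in the api namespace:

```bash
# Should return a response instead of timing out once traffic is allowed
kubectl exec -n web deploy/frontend -- curl -s --max-time 2 http://api-server.api:8080/
```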
---
## yaml-linter
Validates Kubernetes YAML manifests for correctness and best practices.
### Usage
```bash
# Validate a manifest file
aofctl run agent library://kubernetes/yaml-linter \
  --prompt "Validate deployment.yaml" \
  --file deployment.yaml

# Check for security issues
aofctl run agent library://kubernetes/yaml-linter \
  --prompt "Check security best practices in my manifests"
```
### Capabilities
- Validates YAML syntax and schema
- Checks API version compatibility
- Identifies deprecated APIs
- Flags security misconfigurations (privileged containers, hostNetwork, etc.)
- Recommends best practices
- Checks resource quota compliance
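
Some of these checks can be approximated with kubectl alone; a server-side dry run, for instance, validates a manifest's schema and API versions against the live cluster without creating anything:

```bash
kubectl apply --dry-run=server -f deployment.yaml
```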
### Example Output
## Manifest Validation: deployment.yaml
**Resource**: Deployment/api-server
**API Version**: apps/v1
### Validation Results
| Check | Status | Details |
|-------|--------|---------|
| Schema Valid | PASS | Valid Deployment spec |
| API Version | PASS | apps/v1 is current |
| Resource Limits | WARN | No memory limit specified |
| Security Context | FAIL | Running as root |
| Probes | WARN | No readinessProbe defined |
### Security Issues

1. **HIGH**: Container runs as root

   ```yaml
   securityContext:
     runAsNonRoot: true
     runAsUser: 1000
   ```

2. **MEDIUM**: No security context at pod level

   ```yaml
   securityContext:
     fsGroup: 1000
     runAsNonRoot: true
   ```
### Recommendations

- Add a memory limit to the container
- Define a readinessProbe for traffic management
- Remove `privileged: true`
---
## resource-optimizer
Analyzes and optimizes Kubernetes resource requests and limits.
### Usage
```bash
# Analyze namespace resource usage
aofctl run agent library://kubernetes/resource-optimizer \
  --prompt "Optimize resources in namespace production"

# Right-size a specific deployment
aofctl run agent library://kubernetes/resource-optimizer \
  --prompt "Right-size the api-server deployment"
```
### Capabilities
- Analyzes actual vs requested resources
- Identifies over/under-provisioned workloads
- Calculates optimal requests and limits
- Estimates cost savings
- Recommends resource quotas
- Suggests QoS class optimizations
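
The observed-usage side of the comparison comes from metrics you can also query directly (this assumes metrics-server is installed in the cluster):

```bash
# Point-in-time CPU and memory per container
kubectl top pod -n production --containers

# Configured requests, for comparison
kubectl get pods -n production \
  -o custom-columns=NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory
```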
### Example Output
## Resource Optimization: namespace production
### Summary
- **Total Deployments**: 12
- **Over-provisioned**: 8 (67%)
- **Under-provisioned**: 2 (17%)
- **Optimal**: 2 (17%)
- **Estimated Savings**: $340/month
### Top Optimization Opportunities
| Deployment | Current CPU | Recommended CPU | Savings |
|------------|-------------|-----------------|---------|
| api-server | 2000m | 400m | $120/mo |
| worker | 1000m | 250m | $80/mo |
| cache | 500m | 200m | $40/mo |
### Detailed Recommendations
#### api-server
**Current**:
- CPU Request: 2000m, Limit: 4000m
- Memory Request: 4Gi, Limit: 8Gi
**Observed** (peak over last 7 days):
- CPU: 380m
- Memory: 1.2Gi
**Recommended**:
- CPU Request: 400m, Limit: 800m
- Memory Request: 1.5Gi, Limit: 2Gi
```yaml
resources:
  requests:
    cpu: 400m
    memory: 1.5Gi
  limits:
    cpu: 800m
    memory: 2Gi
```
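
Instead of editing the manifest, the same change can be applied in place; a sketch with kubectl, assuming the container is named api as in the pod-doctor example:

```bash
kubectl set resources deployment api-server -n production -c api \
  --requests=cpu=400m,memory=1.5Gi \
  --limits=cpu=800m,memory=2Gi
```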
---
## Environment Setup
All Kubernetes agents require:
```bash
# Kubernetes access
export KUBECONFIG=~/.kube/config

# Or use in-cluster config when running inside K8s:
# no environment variable needed - the pod's service account is used
```
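
Before running an agent, you can confirm the kubeconfig grants the read access the agents rely on:

```bash
# Cluster reachability
kubectl version

# Read permissions the agents typically need
kubectl auth can-i list pods --all-namespaces
kubectl auth can-i get events --all-namespaces
```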
## Next Steps
- Observability Agents - Monitor and analyze
- Incident Agents - Respond to incidents
- First Agent Tutorial - Build custom agents