Overview
IncidentFox agents can be customized through configuration to:
- Modify system prompts
- Enable or disable specific tools
- Add custom context about your infrastructure
- Tune behavior for your team’s needs
Agent Types
| Agent | Purpose | Default Tools |
|---|---|---|
| planner | Orchestrate investigations | None (delegates to others) |
| k8s_agent | Kubernetes troubleshooting | 9 K8s-specific tools |
| aws_agent | AWS resource debugging | 8 AWS-specific tools |
| metrics_agent | Anomaly detection | 22 metrics/analytics tools |
| coding_agent | Code analysis | 15 code/git tools |
| investigation_agent | Full toolkit investigations | 30+ tools from all categories |
Configuration Structure
Each agent is configured under the agents key:
{
  "agents": {
    "investigation_agent": {
      "prompt": "System prompt for the agent...",
      "enabled": true,
      "disable_default_tools": ["shell", "docker_exec"],
      "enable_extra_tools": ["custom-runbook-search"]
    },
    "code_fix_agent": {
      "enabled": false
    }
  }
}
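Before applying a configuration, it can help to check its shape locally. The following is a minimal sketch in Python, assuming only the four option names described in this guide; `validate_agents_config` is a hypothetical helper, not part of IncidentFox, and the real service may accept options that this check would flag as unknown.

```python
import json

# Option names documented in this guide; any other key is reported as unknown.
KNOWN_OPTIONS = {
    "prompt": str,
    "enabled": bool,
    "disable_default_tools": list,
    "enable_extra_tools": list,
}


def validate_agents_config(config: dict) -> list[str]:
    """Return a list of readable problems; an empty list means the shape looks right."""
    agents = config.get("agents")
    if not isinstance(agents, dict):
        return ['top-level "agents" key must be an object']
    problems = []
    for name, options in agents.items():
        if not isinstance(options, dict):
            problems.append(f"{name}: options must be an object")
            continue
        for key, value in options.items():
            expected = KNOWN_OPTIONS.get(key)
            if expected is None:
                problems.append(f"{name}: unknown option {key!r}")
            elif not isinstance(value, expected):
                problems.append(f"{name}.{key}: expected {expected.__name__}")
            elif expected is list and not all(isinstance(t, str) for t in value):
                problems.append(f"{name}.{key}: tool names must be strings")
    return problems


raw = """
{
  "agents": {
    "investigation_agent": {
      "prompt": "System prompt for the agent...",
      "enabled": true,
      "disable_default_tools": ["shell", "docker_exec"],
      "enable_extra_tools": ["custom-runbook-search"]
    },
    "code_fix_agent": {"enabled": false}
  }
}
"""
print(validate_agents_config(json.loads(raw)))  # []
```

A check like this catches typos in option names (for example `disabled_default_tools`) before the configuration is submitted.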
Configuration Options
prompt
The system prompt that defines the agent’s behavior, knowledge, and communication style.
{
  "agents": {
    "investigation_agent": {
      "prompt": "You are an AI SRE agent for Acme Corp. Our infrastructure runs on AWS EKS in us-west-2. Key services include: payments (critical), cart (high), catalog (medium). Always check CloudWatch metrics first, then pod logs. Escalate P1 incidents immediately to #incidents-critical."
    }
  }
}
Include context about your infrastructure in the prompt:
- Service criticality tiers
- Common failure patterns
- Escalation procedures
- Team-specific runbooks to reference
enabled
Toggle an agent on or off. Defaults to true.
{
  "agents": {
    "code_fix_agent": {
      "enabled": false
    }
  }
}
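The default matters when reading a configuration: an agent entry without an `enabled` key is still on. A small Python sketch of that rule, where `enabled_agents` is a hypothetical helper used only for illustration and considers only agents that appear in the configuration:

```python
def enabled_agents(config):
    """Names of configured agents that are on; a missing "enabled" key counts as true."""
    return [
        name
        for name, options in config.get("agents", {}).items()
        if options.get("enabled", True)
    ]


config = {
    "agents": {
        "investigation_agent": {"prompt": "..."},  # no "enabled" key, so it is on
        "code_fix_agent": {"enabled": False},      # explicitly turned off
    }
}
print(enabled_agents(config))  # ['investigation_agent']
```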
disable_default_tools
Remove specific tools from an agent’s default toolkit. Useful for security or compliance.
{
  "agents": {
    "investigation_agent": {
      "disable_default_tools": [
        "shell",
        "docker_exec",
        "db_write"
      ]
    }
  }
}
Warning: disabling critical tools can reduce investigation effectiveness. Test thoroughly before disabling tools in production.
enable_extra_tools
Add tools beyond the agent’s default set.
{
  "agents": {
    "investigation_agent": {
      "enable_extra_tools": [
        "coralogix",
        "snowflake",
        "custom-runbooks"
      ]
    }
  }
}
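To reason about which tools an agent ends up with, the two tool options can be modeled as set operations on the default toolkit. The Python sketch below is an illustration under stated assumptions: the default tool list is made up, `effective_tools` is a hypothetical helper, and the order of application (remove disabled tools first, then add extras) is an assumption rather than documented IncidentFox behavior.

```python
def effective_tools(default_tools, agent_options):
    """Apply disable_default_tools, then enable_extra_tools, preserving order."""
    disabled = set(agent_options.get("disable_default_tools", []))
    tools = [t for t in default_tools if t not in disabled]
    for extra in agent_options.get("enable_extra_tools", []):
        if extra not in tools:  # avoid listing the same tool twice
            tools.append(extra)
    return tools


# Hypothetical default toolkit; the real defaults are defined by IncidentFox.
defaults = ["shell", "docker_exec", "db_write", "get_pod_logs"]
options = {
    "disable_default_tools": ["shell", "docker_exec", "db_write"],
    "enable_extra_tools": ["coralogix", "snowflake", "custom-runbooks"],
}
print(effective_tools(defaults, options))
# ['get_pod_logs', 'coralogix', 'snowflake', 'custom-runbooks']
```

Under this model a tool named in both lists ends up enabled. If that case matters for your setup, confirm the actual behavior rather than relying on this sketch.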
Writing Effective Prompts
Structure
A well-structured agent prompt includes:
- Role definition - What the agent is and does
- Context - Information about your infrastructure
- Guidelines - How to approach investigations
- Constraints - What to avoid or be careful about
- Output format - How to structure responses
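One way to keep these parts consistent across agents is to assemble the prompt from named sections and let a JSON serializer handle the escaping. The following is a minimal sketch in Python; `build_prompt` is a hypothetical helper, not an IncidentFox API, and the section contents are taken from the examples in this guide.

```python
import json


def build_prompt(role, sections):
    """Join a role line and titled bullet sections into one prompt string."""
    parts = [role.strip()]
    for title, lines in sections.items():
        body = "\n".join(f"- {line}" for line in lines)
        parts.append(f"## {title}\n{body}")
    return "\n\n".join(parts)


prompt = build_prompt(
    "You are an AI SRE agent for Acme Corp's platform team.",
    {
        "Infrastructure Context": ["Cloud: AWS (us-west-2, us-east-1)"],
        "Investigation Guidelines": ["Check recent deployments (last 4 hours) first"],
        "Constraints": ["Never execute remediation without approval"],
        "Response Format": ["Summary (1-2 sentences)", "Root cause with confidence level"],
    },
)

# json.dumps escapes newlines and quotes, so the prompt becomes a valid JSON string.
config = {"agents": {"investigation_agent": {"prompt": prompt}}}
print(json.dumps(config, indent=2))
```

Generating the JSON this way avoids hand-escaping a multi-line prompt inside the `prompt` value.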
Example: Investigation Agent
You are an AI SRE agent for Acme Corp's platform team.
## Infrastructure Context
- Cloud: AWS (us-west-2, us-east-1)
- Orchestration: EKS (Kubernetes 1.28)
- Key Services:
  - payments-service (P0 - business critical)
  - cart-service (P1 - customer facing)
  - catalog-service (P2 - internal)
  - analytics-service (P3 - batch processing)
## Observability Stack
- Logs: Coralogix (primary), CloudWatch (backup)
- Metrics: Grafana Cloud + Prometheus
- Traces: Datadog APM
- Alerts: PagerDuty -> Slack #incidents
## Investigation Guidelines
1. Always start by identifying affected services and their criticality
2. Check recent deployments (last 4 hours) first
3. Query Coralogix for error logs before CloudWatch
4. For database issues, check RDS Performance Insights
5. Correlate with recent PRs merged to main
## Response Format
Always include:
- Summary (1-2 sentences)
- Root cause with confidence level
- Evidence (specific logs, metrics, or events)
- Timeline of events
- Recommended actions with priority
## Constraints
- Never execute remediation without approval
- Escalate P0/P1 incidents immediately
- Don't access production databases directly
Example: Slack Bot Agent
You are the IncidentFox Slack bot for Acme Corp.
## Communication Style
- Be concise and actionable
- Use bullet points for multiple items
- Include confidence levels when uncertain
- Link to dashboards and runbooks when relevant
## Quick Commands
When users say:
- "check [service]" -> Run health check on service
- "logs [service]" -> Fetch recent error logs
- "who's oncall" -> Check PagerDuty schedule
- "deploy status" -> Check recent deployments
## Escalation
For P0/P1, immediately ping @oncall-platform and post to #incidents-critical.
Agent Specialization
Creating Workflow-Specific Agents
You can create specialized agents for different workflows:
CI/CD Investigation Agent:
{
  "agents": {
    "ci_investigation_agent": {
      "prompt": "You specialize in CI/CD failures. Focus on: build logs, test output, dependency changes, environment differences between PR and main.",
      "enable_extra_tools": ["github_actions", "codepipeline", "ecr"]
    }
  }
}
Database Investigation Agent:
{
  "agents": {
    "db_investigation_agent": {
      "prompt": "You specialize in database performance issues. Check RDS metrics, slow query logs, connection pools, and recent schema changes.",
      "enable_extra_tools": ["rds_insights", "pg_stat_statements", "snowflake"]
    }
  }
}
Tuning Tips
Improve Root Cause Accuracy
- Add service dependencies to the prompt
- Include common failure patterns you’ve seen
- Specify data source priority (which to check first)
- Add context about recent changes (migrations, refactors)
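Service dependencies are easier to keep current when the prompt section is generated from a map rather than written by hand. A short Python sketch, assuming a hypothetical `render_dependencies` helper and an invented dependency map for the example services used in this guide:

```python
def render_dependencies(dependencies):
    """Turn a service -> upstream dependencies map into prompt-ready lines."""
    lines = ["## Service Dependencies"]
    for service in sorted(dependencies):
        upstream = ", ".join(dependencies[service]) or "none"
        lines.append(f"- {service} depends on: {upstream}")
    return "\n".join(lines)


# Hypothetical dependency map; replace with your own service graph.
deps = {
    "cart-service": ["catalog-service", "payments-service"],
    "payments-service": [],
}
print(render_dependencies(deps))
# ## Service Dependencies
# - cart-service depends on: catalog-service, payments-service
# - payments-service depends on: none
```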
Reduce Investigation Time
- Prioritize fast data sources in the prompt
- Include known quick wins (common issues and solutions)
- Set appropriate timeouts for tool execution
Improve Response Quality
- Define output format explicitly
- Include examples of good responses
- Specify confidence thresholds for recommendations
Validation
Before deploying prompt changes:
- Test in staging with known scenarios
- Compare results with previous prompt version
- Check for regressions in accuracy or speed
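The comparison step can be sketched as a small regression check over per-scenario results. This Python example is illustrative only: `ScenarioResult`, `find_regressions`, the 1.2x slowdown threshold, and the sample numbers are all assumptions, and how you collect results from staging runs is left to your own tooling.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    scenario: str
    root_cause_correct: bool
    seconds: float


def find_regressions(baseline, candidate, max_slowdown=1.2):
    """Compare per-scenario results from two prompt versions.

    Flags a scenario when the candidate loses a previously correct root cause
    or takes more than max_slowdown times as long as the baseline.
    """
    previous = {r.scenario: r for r in baseline}
    regressions = []
    for result in candidate:
        old = previous.get(result.scenario)
        if old is None:
            continue  # new scenario, nothing to compare against
        if old.root_cause_correct and not result.root_cause_correct:
            regressions.append(f"{result.scenario}: accuracy regression")
        if result.seconds > old.seconds * max_slowdown:
            regressions.append(
                f"{result.scenario}: slower ({old.seconds}s -> {result.seconds}s)"
            )
    return regressions


# Illustrative numbers only, not measurements from a real run.
baseline = [ScenarioResult("oom-kill", True, 40.0), ScenarioResult("bad-deploy", True, 55.0)]
candidate = [ScenarioResult("oom-kill", True, 38.0), ScenarioResult("bad-deploy", False, 90.0)]
for line in find_regressions(baseline, candidate):
    print(line)
# bad-deploy: accuracy regression
# bad-deploy: slower (55.0s -> 90.0s)
```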
If approval workflows are enabled, prompt changes require admin approval before taking effect.
Next Steps