Alert Rules
Alert Rules define the conditions under which the monitoring system creates an alert incident. They are evaluated on a schedule by the APScheduler background job.
Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/alert-rules | List all alert rules for the active tenant |
| POST | /api/v1/alert-rules | Create a new alert rule |
| GET | /api/v1/alert-rules/{id} | Retrieve a specific alert rule by ID |
| PUT | /api/v1/alert-rules/{id} | Fully update an alert rule |
| PATCH | /api/v1/alert-rules/{id} | Partially update an alert rule (e.g., enable/disable) |
| DELETE | /api/v1/alert-rules/{id} | Delete an alert rule |
| GET | /api/v1/alert-rules/{id}/history | Get the evaluation history for a rule |
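As a quick orientation, the route table above can be expressed as a small helper that maps a logical action to its HTTP method and URL. This is a hypothetical client-side sketch; the base URL and the `alert_rule_request` helper are illustrative, not part of the API.

```python
from typing import Optional, Tuple

# Assumed base URL for illustration only.
BASE = "https://monitoring.example.com/api/v1"


def alert_rule_request(action: str, rule_id: Optional[str] = None) -> Tuple[str, str]:
    """Return the (HTTP method, URL) pair for an alert-rule action."""
    routes = {
        "list":    ("GET",    f"{BASE}/alert-rules"),
        "create":  ("POST",   f"{BASE}/alert-rules"),
        "get":     ("GET",    f"{BASE}/alert-rules/{rule_id}"),
        "update":  ("PUT",    f"{BASE}/alert-rules/{rule_id}"),
        "patch":   ("PATCH",  f"{BASE}/alert-rules/{rule_id}"),
        "delete":  ("DELETE", f"{BASE}/alert-rules/{rule_id}"),
        "history": ("GET",    f"{BASE}/alert-rules/{rule_id}/history"),
    }
    return routes[action]
```

Any HTTP client can then send the resulting method/URL pair with the tenant's credentials.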
AlertRule Schema
AlertRule Model (Pydantic)
```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel


class Severity(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class AlertRuleCreate(BaseModel):
    name: str                          # Human-readable rule name
    description: Optional[str] = None  # Rule purpose description
    metric: str                        # Metric identifier (e.g., "query_runtime_seconds")
    condition: str                     # "gt" | "lt" | "gte" | "lte" | "eq"
    threshold: float                   # Numeric threshold value
    evaluation_window_minutes: int     # How long the condition must persist
    severity: Severity                 # "critical" | "high" | "medium" | "low"
    escalation_policy_id: str          # UUID of the escalation policy
    cluster_id: Optional[str] = None   # Scope to a specific Redshift cluster
    is_enabled: bool = True            # Active flag
```

Create Alert Rule
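Before POSTing, a payload can be validated locally against the schema. The sketch below redefines a minimal copy of the model so it is self-contained; field names follow the schema above, but treat the exact validation behavior as an assumption.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, ValidationError


class Severity(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class AlertRuleCreate(BaseModel):
    name: str
    description: Optional[str] = None
    metric: str
    condition: str
    threshold: float
    evaluation_window_minutes: int
    severity: Severity
    escalation_policy_id: str
    cluster_id: Optional[str] = None
    is_enabled: bool = True


payload = {
    "name": "High Query Runtime",
    "metric": "query_runtime_seconds",
    "condition": "gt",
    "threshold": 600,
    "evaluation_window_minutes": 1,
    "severity": "high",
    "escalation_policy_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
}

rule = AlertRuleCreate(**payload)  # "high" is coerced to Severity.HIGH

# An unknown severity is rejected at validation time.
rejected = False
try:
    AlertRuleCreate(**{**payload, "severity": "urgent"})
except ValidationError:
    rejected = True
```

Note that optional fields (`description`, `cluster_id`) may be omitted and `is_enabled` defaults to `true`, matching the schema defaults.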
POST /api/v1/alert-rules
```json
// Request body
{
  "name": "High Query Runtime",
  "description": "Fires when any query exceeds 10 minutes",
  "metric": "query_runtime_seconds",
  "condition": "gt",
  "threshold": 600,
  "evaluation_window_minutes": 1,
  "severity": "high",
  "escalation_policy_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "cluster_id": "my-redshift-cluster",
  "is_enabled": true
}
```

```json
// Response
{
  "success": true,
  "data": {
    "id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "name": "High Query Runtime",
    "tenant_id": "tenant-123",
    "created_at": "2024-01-15T10:30:00Z",
    "updated_at": "2024-01-15T10:30:00Z",
    ...
  }
}
```

Alert Rule Evaluation Logic
The APScheduler background job evaluates alert rules on a configurable interval (default: 60 seconds). The evaluation flow:
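The threshold comparison applied during evaluation can be sketched by mapping each condition string from the schema to a standard operator. This is a hypothetical implementation of the `compare` helper used in the evaluation engine; the actual code may differ.

```python
import operator

# Map each supported condition string to its comparison operator.
_OPS = {
    "gt":  operator.gt,
    "lt":  operator.lt,
    "gte": operator.ge,
    "lte": operator.le,
    "eq":  operator.eq,
}


def compare(value: float, condition: str, threshold: float) -> bool:
    """Return True when `value <condition> threshold` holds."""
    try:
        return _OPS[condition](value, threshold)
    except KeyError:
        raise ValueError(f"unsupported condition: {condition!r}")
```

Keeping the mapping in a dict means adding a new condition is a one-line change.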
Evaluation Engine (Simplified)
```python
async def evaluate_alert_rule(rule: AlertRule, db: AsyncSession):
    # 1. Fetch the current metric value
    metric_value = await metrics_service.get_current_value(
        metric=rule.metric,
        cluster_id=rule.cluster_id,
        window_minutes=rule.evaluation_window_minutes,
    )

    # 2. Apply the threshold condition
    condition_met = compare(metric_value, rule.condition, rule.threshold)

    # 3. Check if already triggered (avoid duplicate incidents)
    existing_incident = await incident_repo.get_active(rule.id, db)

    if condition_met and not existing_incident:
        # 4. Create a new incident
        incident = await incident_repo.create(AlertIncident(
            rule_id=rule.id,
            current_value=metric_value,
            severity=rule.severity,
            status="triggered",
        ), db)
        # 5. Trigger the escalation policy
        await escalation_service.start_escalation(
            rule.escalation_policy_id,
            incident.id,
        )
    elif not condition_met and existing_incident:
        # 6. Auto-resolve if the condition clears
        await incident_repo.resolve(existing_incident.id, db)
```

Supported Metrics
| Metric ID | Description | Unit |
|---|---|---|
| query_runtime_seconds | Maximum query runtime in evaluation window | seconds |
| disk_spill_gb | Total disk spill in evaluation window | GB |
| queue_time_seconds | WLM queue wait time | seconds |
| storage_utilization_pct | Node storage used percentage | % |
| unsorted_pct | Percentage of unsorted rows across tables | % |
| active_query_count | Number of currently executing queries | count |
| cpu_utilization_pct | Cluster CPU utilization | % |
Custom Metrics
The metric system is extensible. Additional metrics can be registered in the metrics_registry.py service by implementing the MetricProvider protocol.
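A custom metric provider might look like the following. Both the protocol shape and the example provider are assumptions for illustration; the exact signature in metrics_registry.py may differ.

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class MetricProvider(Protocol):
    """Hypothetical shape of the provider protocol."""
    metric_id: str

    async def get_current_value(
        self, cluster_id: Optional[str], window_minutes: int
    ) -> float:
        """Return the metric's current value over the evaluation window."""
        ...


class CommitQueueLengthProvider:
    """Illustrative custom metric: commit-queue length."""
    metric_id = "commit_queue_length"

    async def get_current_value(
        self, cluster_id: Optional[str], window_minutes: int
    ) -> float:
        # A real provider would query Redshift system tables or
        # CloudWatch here; this stub returns a fixed value.
        return 0.0
```

Once registered, the provider's `metric_id` becomes a valid value for the `metric` field of an alert rule.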