Alert Rules
Alert Rules define the conditions under which the monitoring system creates an alert incident. They are evaluated on a schedule by the APScheduler background job.
Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/alert-rules | List all alert rules for the active tenant |
| POST | /api/v1/alert-rules | Create a new alert rule |
| GET | /api/v1/alert-rules/{id} | Retrieve a specific alert rule by ID |
| PUT | /api/v1/alert-rules/{id} | Fully update an alert rule |
| PATCH | /api/v1/alert-rules/{id} | Partially update an alert rule (e.g., enable/disable) |
| DELETE | /api/v1/alert-rules/{id} | Delete an alert rule |
| GET | /api/v1/alert-rules/{id}/history | Get the evaluation history for a rule |
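As a quick orientation, the route table above can be expressed as a small helper that maps a logical action to its HTTP method and URL. This is a hypothetical client-side sketch; the base URL and the `alert_rule_request` helper are illustrative, not part of the API.

```python
from typing import Optional, Tuple

# Assumed base URL for illustration only.
BASE = "https://monitoring.example.com/api/v1"


def alert_rule_request(action: str, rule_id: Optional[str] = None) -> Tuple[str, str]:
    """Return the (HTTP method, URL) pair for an alert-rule action."""
    routes = {
        "list":    ("GET",    f"{BASE}/alert-rules"),
        "create":  ("POST",   f"{BASE}/alert-rules"),
        "get":     ("GET",    f"{BASE}/alert-rules/{rule_id}"),
        "update":  ("PUT",    f"{BASE}/alert-rules/{rule_id}"),
        "patch":   ("PATCH",  f"{BASE}/alert-rules/{rule_id}"),
        "delete":  ("DELETE", f"{BASE}/alert-rules/{rule_id}"),
        "history": ("GET",    f"{BASE}/alert-rules/{rule_id}/history"),
    }
    return routes[action]
```

Any HTTP client can then send the resulting method/URL pair with the tenant's credentials.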
AlertRule Schema
AlertRule Model (Pydantic)
```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel


class Severity(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class AlertRuleCreate(BaseModel):
    name: str                          # Human-readable rule name
    description: Optional[str] = None  # Rule purpose description
    metric: str                        # Metric identifier (e.g., "query_runtime_seconds")
    condition: str                     # "gt" | "lt" | "gte" | "lte" | "eq"
    threshold: float                   # Numeric threshold value
    evaluation_window_minutes: int     # How long the condition must persist
    severity: Severity                 # "critical" | "high" | "medium" | "low"
    escalation_policy_id: str          # UUID of the escalation policy
    cluster_id: Optional[str] = None   # Scope to a specific Redshift cluster
    is_enabled: bool = True            # Active flag
```

Create Alert Rule
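Before POSTing, a payload can be validated locally against the schema. The sketch below redefines a minimal copy of the model so it is self-contained; field names follow the schema above, but treat the exact validation behavior as an assumption.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel, ValidationError


class Severity(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


class AlertRuleCreate(BaseModel):
    name: str
    description: Optional[str] = None
    metric: str
    condition: str
    threshold: float
    evaluation_window_minutes: int
    severity: Severity
    escalation_policy_id: str
    cluster_id: Optional[str] = None
    is_enabled: bool = True


payload = {
    "name": "High Query Runtime",
    "metric": "query_runtime_seconds",
    "condition": "gt",
    "threshold": 600,
    "evaluation_window_minutes": 1,
    "severity": "high",
    "escalation_policy_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
}

rule = AlertRuleCreate(**payload)  # "high" is coerced to Severity.HIGH

# An unknown severity is rejected at validation time.
rejected = False
try:
    AlertRuleCreate(**{**payload, "severity": "urgent"})
except ValidationError:
    rejected = True
```

Note that optional fields (`description`, `cluster_id`) may be omitted and `is_enabled` defaults to `true`, matching the schema defaults.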
POST /api/v1/alert-rules
```json
// Request body
{
  "name": "High Query Runtime",
  "description": "Fires when any query exceeds 10 minutes",
  "metric": "query_runtime_seconds",
  "condition": "gt",
  "threshold": 600,
  "evaluation_window_minutes": 1,
  "severity": "high",
  "escalation_policy_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "cluster_id": "my-redshift-cluster",
  "is_enabled": true
}
```

```json
// Response
{
  "success": true,
  "data": {
    "id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
    "name": "High Query Runtime",
    "tenant_id": "tenant-123",
    "created_at": "2024-01-15T10:30:00Z",
    "updated_at": "2024-01-15T10:30:00Z",
    ...
  }
}
```

Alert Rule Evaluation Logic
The APScheduler background job evaluates alert rules on a configurable interval (default: 60 seconds). The evaluation flow:
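The threshold comparison applied during evaluation can be sketched by mapping each condition string from the schema to a standard operator. This is a hypothetical implementation of the `compare` helper used in the evaluation engine; the actual code may differ.

```python
import operator

# Map each supported condition string to its comparison operator.
_OPS = {
    "gt":  operator.gt,
    "lt":  operator.lt,
    "gte": operator.ge,
    "lte": operator.le,
    "eq":  operator.eq,
}


def compare(value: float, condition: str, threshold: float) -> bool:
    """Return True when `value <condition> threshold` holds."""
    try:
        return _OPS[condition](value, threshold)
    except KeyError:
        raise ValueError(f"unsupported condition: {condition!r}")
```

Keeping the mapping in a dict means adding a new condition is a one-line change.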
Evaluation Engine (Simplified)
```python
async def evaluate_alert_rule(rule: AlertRule, db: AsyncSession):
    # 1. Fetch the current metric value
    metric_value = await metrics_service.get_current_value(
        metric=rule.metric,
        cluster_id=rule.cluster_id,
        window_minutes=rule.evaluation_window_minutes,
    )

    # 2. Apply the threshold condition
    condition_met = compare(metric_value, rule.condition, rule.threshold)

    # 3. Check if already triggered (avoid duplicate incidents)
    existing_incident = await incident_repo.get_active(rule.id, db)

    if condition_met and not existing_incident:
        # 4. Create a new incident
        incident = await incident_repo.create(AlertIncident(
            rule_id=rule.id,
            current_value=metric_value,
            severity=rule.severity,
            status="triggered",
        ), db)
        # 5. Trigger the escalation policy
        await escalation_service.start_escalation(
            rule.escalation_policy_id,
            incident.id,
        )
    elif not condition_met and existing_incident:
        # 6. Auto-resolve if the condition clears
        await incident_repo.resolve(existing_incident.id, db)
```

Supported Metrics
| Metric ID | Description | Unit |
|---|---|---|
| query_runtime_seconds | Maximum query runtime in evaluation window | seconds |
| disk_spill_gb | Total disk spill in evaluation window | GB |
| queue_time_seconds | WLM queue wait time | seconds |
| storage_utilization_pct | Node storage used percentage | % |
| unsorted_pct | Percentage of unsorted rows across tables | % |
| active_query_count | Number of currently executing queries | count |
| cpu_utilization_pct | Cluster CPU utilization | % |
Custom Metrics
The metric system is extensible. Additional metrics can be registered in the metrics_registry.py service by implementing the MetricProvider protocol.
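A custom metric provider might look like the following. Both the protocol shape and the example provider are assumptions for illustration; the exact signature in metrics_registry.py may differ.

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class MetricProvider(Protocol):
    """Hypothetical shape of the provider protocol."""
    metric_id: str

    async def get_current_value(
        self, cluster_id: Optional[str], window_minutes: int
    ) -> float:
        """Return the metric's current value over the evaluation window."""
        ...


class CommitQueueLengthProvider:
    """Illustrative custom metric: commit-queue length."""
    metric_id = "commit_queue_length"

    async def get_current_value(
        self, cluster_id: Optional[str], window_minutes: int
    ) -> float:
        # A real provider would query Redshift system tables or
        # CloudWatch here; this stub returns a fixed value.
        return 0.0
```

Once registered, the provider's `metric_id` becomes a valid value for the `metric` field of an alert rule.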