
Escalation Policies

Escalation policies define a multi-stage notification workflow. If an alert is not acknowledged within a stage's configured delay, the next stage is triggered, so an unacknowledged critical incident keeps reaching additional responders.

Endpoints

Method  Path                              Description
GET     /api/v1/escalation-policies       List all escalation policies for the tenant
POST    /api/v1/escalation-policies       Create a new escalation policy
GET     /api/v1/escalation-policies/{id}  Get a specific policy with all stages
PUT     /api/v1/escalation-policies/{id}  Replace a policy and its stages
DELETE  /api/v1/escalation-policies/{id}  Delete an escalation policy

Policy Schema

EscalationPolicy Model

from typing import Optional

from pydantic import BaseModel


class EscalationTarget(BaseModel):
    type: str                       # "user" | "group" | "schedule" | "channel"
    id: str                         # UUID of target entity
    channels: list[str]             # ["email", "sms", "slack"]


# Defined after EscalationTarget so the annotation below resolves
class EscalationStage(BaseModel):
    stage_order: int                # 1, 2, 3... (ascending)
    delay_minutes: int              # Minutes after the alert fires before this stage triggers
    targets: list[EscalationTarget] # Who to notify


class EscalationPolicyCreate(BaseModel):
    name: str
    description: Optional[str] = None
    stages: list[EscalationStage]   # Ordered list of stages
    repeat_interval_minutes: int = 0  # 0 = no repeat after last stage
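
The schema comments imply some consistency rules that a client may want to check before sending a policy. The following is a minimal sketch, not the service's own validation: `check_stages` is a hypothetical helper that works on plain dictionaries shaped like the models above. The rule that `stage_order` runs 1, 2, 3... comes from the schema comment; the rules that delays are non-negative and non-decreasing and that each stage has at least one target are assumptions.

```python
def check_stages(stages: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the stages look consistent."""
    if not stages:
        return ["policy has no stages"]

    problems = []

    # From the schema comment: stage_order is 1, 2, 3... in ascending order.
    orders = [s["stage_order"] for s in stages]
    if orders != list(range(1, len(stages) + 1)):
        problems.append(f"stage_order must run 1..{len(stages)} ascending, got {orders}")

    # Assumption: delays are non-negative and never decrease from one stage to the next.
    delays = [s["delay_minutes"] for s in stages]
    if any(d < 0 for d in delays):
        problems.append("delay_minutes must not be negative")
    if any(later < earlier for earlier, later in zip(delays, delays[1:])):
        problems.append("delay_minutes should not decrease between stages")

    # Assumption: a stage without targets would notify nobody, so flag it.
    for s in stages:
        if not s["targets"]:
            problems.append(f"stage {s['stage_order']} has no targets")

    return problems
```

Calling `check_stages` on the `stages` list of a policy before a POST or PUT gives early feedback; the server may apply different or additional rules.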

Escalation Timing Logic

APScheduler manages escalation timing with a dedicated job for each active alert incident. The escalation engine works as follows:

Escalation State Machine

# When alert fires:
# t=0:  Stage 1 fires immediately (delay_minutes=0)
#       Notification sent to stage 1 targets

# t=15: If still unacknowledged after 15 min:
#       Stage 2 fires
#       Notification sent to stage 2 targets

# t=30: If still unacknowledged after 30 min:
#       Stage 3 fires
#       Notification sent to stage 3 targets

# t=90: If repeat_interval_minutes=60 and still unacknowledged 60 min after stage 3:
#       Cycle repeats from stage 1

# On acknowledgement: Escalation job is cancelled
# On resolution:      Escalation job is cancelled + notify resolution
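
The timeline above can be reproduced with a small calculation. This is a sketch for reasoning about a policy, not the engine's scheduling code: `escalation_timeline` is a hypothetical helper, and `horizon_minutes` is an illustrative cutoff. It follows the timeline in treating `delay_minutes` as an offset from the moment the alert fires. How a repeated cycle is timed is an assumption: here each repeat starts `repeat_interval_minutes` after the last stage and reuses the same stage offsets.

```python
def escalation_timeline(stages, repeat_interval_minutes=0, horizon_minutes=240):
    """Return (minute, stage_order) pairs for an alert that is never acknowledged."""
    ordered = sorted(stages, key=lambda s: s["stage_order"])
    if not ordered:
        return []

    last_delay = ordered[-1]["delay_minutes"]
    events = []
    cycle_start = 0  # minute at which the current pass over the stages begins

    while True:
        for stage in ordered:
            fire_at = cycle_start + stage["delay_minutes"]
            if fire_at > horizon_minutes:
                return events
            events.append((fire_at, stage["stage_order"]))

        if repeat_interval_minutes <= 0:
            return events  # 0 = no repeat after the last stage

        # Assumption: the next cycle begins repeat_interval_minutes after the last stage.
        cycle_start += last_delay + repeat_interval_minutes


stages = [
    {"stage_order": 1, "delay_minutes": 0},
    {"stage_order": 2, "delay_minutes": 15},
    {"stage_order": 3, "delay_minutes": 30},
]
timeline = escalation_timeline(stages, repeat_interval_minutes=60, horizon_minutes=120)
# timeline == [(0, 1), (15, 2), (30, 3), (90, 1), (105, 2), (120, 3)]
```

The first four entries match the t=0, t=15, t=30, and t=90 steps described above. In the running service, acknowledgement or resolution cancels the job, so later entries never fire.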

Target Resolution

At notification time, the service resolves dynamic targets:

User target — Directly maps to a user's contact info (email, phone)
Group target — Expands to all active members of the group at notification time
Schedule target — Resolves to the currently on-call user based on the schedule configuration
Channel target — Sends to a configured notification channel (Slack webhook, email list, etc.)

Dynamic Resolution

Schedule targets are resolved at the moment of notification, not when the policy is created. This ensures rotations work correctly even if the on-call person changes between when the alert fires and when an escalation stage triggers.
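
The effect of resolving at notification time can be illustrated with in-memory data. This is a minimal sketch under stated assumptions: `resolve_recipients`, the `directory` layout, and the `rotation` dictionary are all hypothetical stand-ins for whatever user, group, and schedule stores the service actually uses.

```python
def resolve_recipients(target: dict, directory: dict) -> list[str]:
    """Resolve one escalation target to recipient identifiers at call time."""
    kind, target_id = target["type"], target["id"]
    if kind == "user":
        return [target_id]
    if kind == "group":
        # Only members that are active at the moment of notification.
        return [m["id"] for m in directory["groups"][target_id] if m["active"]]
    if kind == "schedule":
        # The on-call lookup is a callable, so it is evaluated now, not at policy creation.
        return [directory["on_call"][target_id]()]
    if kind == "channel":
        return [f"channel:{target_id}"]
    raise ValueError(f"unknown target type: {kind!r}")


rotation = {"current": "alice"}  # hypothetical on-call state
directory = {
    "groups": {"dba-team": [{"id": "alice", "active": True},
                            {"id": "bob", "active": False}]},
    "on_call": {"dba-schedule": lambda: rotation["current"]},
}
schedule_target = {"type": "schedule", "id": "dba-schedule", "channels": ["sms"]}

at_stage_1 = resolve_recipients(schedule_target, directory)  # alice is on call
rotation["current"] = "carol"                                # rotation changes
at_stage_2 = resolve_recipients(schedule_target, directory)  # now carol is on call
```

The same policy object yields different recipients before and after the rotation change, which is the behaviour described above for schedule targets. Group targets behave similarly with respect to membership.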

Example: Create Policy

POST /api/v1/escalation-policies

{
  "name": "Critical Redshift Alerts",
  "description": "3-tier escalation for critical Redshift issues",
  "stages": [
    {
      "stage_order": 1,
      "delay_minutes": 0,
      "targets": [
        {
          "type": "schedule",
          "id": "on-call-dba-schedule-id",
          "channels": ["sms", "email"]
        }
      ]
    },
    {
      "stage_order": 2,
      "delay_minutes": 15,
      "targets": [
        {
          "type": "group",
          "id": "dba-team-group-id",
          "channels": ["slack"]
        }
      ]
    },
    {
      "stage_order": 3,
      "delay_minutes": 30,
      "targets": [
        {
          "type": "user",
          "id": "engineering-director-user-id",
          "channels": ["email", "sms"]
        }
      ]
    }
  ],
  "repeat_interval_minutes": 60
}
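
A request like the one above could be built from Python with the standard library. This is a sketch only: `BASE_URL` is a placeholder host and `build_create_request` is a hypothetical helper. Authentication is not described in this section, so no auth header is set here; a real deployment will almost certainly need one.

```python
import json
import urllib.request

BASE_URL = "https://alerts.example.com"  # placeholder; substitute your deployment's host


def build_create_request(policy: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates an escalation policy."""
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/escalation-policies",
        data=json.dumps(policy).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


policy = {
    "name": "Critical Redshift Alerts",
    "stages": [
        {
            "stage_order": 1,
            "delay_minutes": 0,
            "targets": [
                {"type": "schedule", "id": "on-call-dba-schedule-id", "channels": ["sms", "email"]}
            ],
        }
    ],
    "repeat_interval_minutes": 60,
}
request = build_create_request(policy)
# To send it: urllib.request.urlopen(request), after adding whatever auth headers apply.
```

Building the request separately from sending it keeps the payload easy to inspect or log before it reaches the API.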