
Monitoring & Alerting

The Monitoring module provides real-time observability and incident management for Redshift clusters. It supports rule-based alert detection, multi-stage escalation, on-call scheduling, and multi-channel notifications.

Route prefix: /monitoring

Module Structure

Route Tree
/monitoring
├── /alert-rules               → Create & manage alert conditions
├── /alerts                    → Real-time alert dashboard
├── /escalation-policies       → Multi-stage escalation workflows
├── /notification-channels     → Email, SMS, Slack, webhook setup
├── /schedules                 → On-call schedule management
└── /settings                  → Monitoring system configuration

Alert Rules

Alert rules define the conditions under which an alert is triggered. Each rule specifies:

  • Metric — The Redshift system metric to monitor (e.g., query runtime, disk spill rate, CPU utilization)
  • Condition — Comparison operator and threshold value (e.g., query_runtime > 300s)
  • Evaluation window — How many consecutive evaluation cycles must breach the threshold before triggering
  • Severity — Critical, High, Medium, or Low
  • Escalation policy — Which policy handles notification routing when this rule fires
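The fields above can be sketched as a small Python structure with the consecutive-cycle trigger check. The field names and evaluation logic here are illustrative, not the platform's actual schema.

```python
# Hypothetical alert-rule shape; field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class AlertRule:
    metric: str             # Redshift system metric, e.g. "query_runtime"
    operator: str           # comparison operator: ">", "<", ">=", "<="
    threshold: float        # threshold value, e.g. 300 (seconds)
    evaluation_cycles: int  # consecutive breaching cycles before triggering
    severity: str           # "Critical", "High", "Medium", or "Low"
    escalation_policy: str  # policy that routes notifications when rule fires

    def breaches(self, value: float) -> bool:
        """Return True if a single sample breaches the rule's condition."""
        return {">": value > self.threshold, "<": value < self.threshold,
                ">=": value >= self.threshold,
                "<=": value <= self.threshold}[self.operator]


def should_trigger(rule: AlertRule, samples: list[float]) -> bool:
    """Trigger only when the last `evaluation_cycles` samples all breach."""
    window = samples[-rule.evaluation_cycles:]
    return (len(window) == rule.evaluation_cycles
            and all(rule.breaches(v) for v in window))
```

For example, a rule on `query_runtime > 300s` with a three-cycle window fires only after three consecutive breaching samples, so a single slow query does not page anyone.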

Alert States

State          Description                                        Action
Triggered      Alert condition breached, notifications sent       Investigate
Acknowledged   A team member has claimed the incident             Working on it
Resolved       Condition no longer breached or manually closed    Post-mortem
Suppressed     Alert silenced during maintenance window           Wait
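The lifecycle above can be expressed as an explicit transition table. The allowed-transition set below is inferred from the state descriptions, not taken from platform code.

```python
# Transition table inferred from the alert states; an assumption, not the
# platform's actual state machine.
ALERT_TRANSITIONS = {
    "Triggered":    {"Acknowledged", "Resolved", "Suppressed"},
    "Acknowledged": {"Resolved", "Suppressed"},
    "Suppressed":   {"Triggered", "Resolved"},  # maintenance window ends
    "Resolved":     set(),                      # terminal until re-triggered
}


def transition(current: str, target: str) -> str:
    """Move an alert to `target`, rejecting transitions the table forbids."""
    if target not in ALERT_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```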

Escalation Policies

Escalation policies define a multi-stage notification workflow that triggers when an alert is not acknowledged within a configurable time window.

Escalation Policy Example
{
  "name": "Critical Database Alerts",
  "stages": [
    {
      "stage": 1,
      "delay_minutes": 0,
      "targets": ["on-call-dba@example.com"],
      "channels": ["email", "sms"]
    },
    {
      "stage": 2,
      "delay_minutes": 15,
      "targets": ["dba-team-channel"],
      "channels": ["slack"]
    },
    {
      "stage": 3,
      "delay_minutes": 30,
      "targets": ["engineering-director@example.com"],
      "channels": ["email", "phone"]
    }
  ],
  "repeat_interval_minutes": 60
}

Each stage fires if the alert remains unacknowledged after the specified delay. Escalation timing is managed by APScheduler in the backend.
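The stage timing can be derived directly from a policy document like the example above. This sketch computes absolute fire times per stage; the repeat logic (re-running the full chain every `repeat_interval_minutes`) is an assumption about behavior the section does not spell out.

```python
# Hedged sketch: derive (stage, fire_time) pairs from an escalation policy.
# Field names mirror the example JSON; repeat semantics are assumed.
from datetime import datetime, timedelta


def escalation_times(policy: dict, triggered_at: datetime,
                     repeats: int = 1) -> list[tuple[int, datetime]]:
    """Return (stage, fire_time) pairs for `repeats` passes of the policy."""
    out = []
    for cycle in range(repeats):
        base = triggered_at + timedelta(
            minutes=cycle * policy.get("repeat_interval_minutes", 0))
        for stage in policy["stages"]:
            out.append((stage["stage"],
                        base + timedelta(minutes=stage["delay_minutes"])))
    return out
```

With the example policy, stage 2 fires 15 minutes after the trigger, stage 3 after 30, and the whole chain restarts 60 minutes in if the alert is still unacknowledged.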

Notification Channels

The platform supports four notification channel types:

Channel    Configuration                                      Use Case
Email      SMTP server, from address, recipient list          Async notifications, audit trail
SMS        Phone numbers via SMS gateway                      Urgent paging for critical alerts
Slack      Webhook URL, channel name                          Team-wide awareness, ChatOps
Webhook    Custom HTTP endpoint, headers, payload template    Integration with PagerDuty, OpsGenie, etc.
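Fan-out across channel types can be modeled as a dispatch table mapping each channel to a sender. The senders below only format messages; real transports (SMTP, an SMS gateway, a Slack webhook, an HTTP POST) are outside this sketch, and all names are illustrative.

```python
# Illustrative channel dispatch; senders are stubs, not real transports.
from typing import Callable

SENDERS: dict[str, Callable[[str, str], str]] = {
    "email":   lambda target, msg: f"EMAIL to {target}: {msg}",
    "sms":     lambda target, msg: f"SMS to {target}: {msg}",
    "slack":   lambda target, msg: f"SLACK #{target}: {msg}",
    "webhook": lambda target, msg: f"POST {target}: {msg}",
}


def notify(channels: list[str], target: str, msg: str) -> list[str]:
    """Fan a message out over every configured channel for one target."""
    return [SENDERS[ch](target, msg) for ch in channels]
```

A stage with `"channels": ["email", "sms"]` would thus produce one send per channel for each of its targets.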

On-Call Schedules

The Schedules view manages rotation-based on-call assignments. Features:

  • Rotation types — Daily, weekly, or custom shift rotations
  • Overrides — Temporary substitutions for vacations or emergencies
  • Calendar view — Visual schedule editor using react-day-picker
  • API integration — Escalation policies reference schedules to route to the current on-call person

Schedule Resolution

When an escalation policy targets an on-call schedule, the API dynamically resolves the current on-call user at the time of the alert. If no one is on-call, the alert falls back to the escalation policy's default recipient.
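The resolution step can be sketched for a simple fixed-length rotation with the default-recipient fallback described above. The rotation model (equal shifts cycling through a roster) and all names are assumptions for illustration.

```python
# Sketch of on-call resolution for a fixed-length shift rotation, with the
# fallback to a default recipient when nobody is on call. Illustrative only.
from datetime import datetime, timedelta


def current_on_call(roster: list[str], rotation_start: datetime,
                    shift: timedelta, at: datetime, default: str) -> str:
    """Resolve who is on call at `at`, falling back to `default`."""
    if not roster or at < rotation_start:
        return default
    shifts_elapsed = int((at - rotation_start) / shift)
    return roster[shifts_elapsed % len(roster)]
```

An override (for a vacation, say) could be layered on top by checking an override table before falling through to the rotation.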

Alerts Dashboard

The main alerts view shows all active and historical alerts in a filterable DataTable. Columns include:

  • Alert name and rule that fired it
  • Severity badge (Critical / High / Medium / Low)
  • Status badge with color coding
  • Triggered time and duration
  • Assigned responder (post-acknowledgement)
  • Quick actions: Acknowledge, Resolve, Suppress