
Monitoring & Alerting

The Monitoring module provides real-time observability and incident management for Redshift clusters. It supports rule-based alert detection, multi-stage escalation, on-call scheduling, and multi-channel notifications.

Route prefix: /monitoring

Module Structure

Route Tree
/monitoring
├── /alert-rules               → Create & manage alert conditions
├── /alerts                    → Real-time alert dashboard
├── /escalation-policies       → Multi-stage escalation workflows
├── /notification-channels     → Email, SMS, Slack, webhook setup
├── /schedules                 → On-call schedule management
└── /settings                  → Monitoring system configuration

Alert Rules

Alert rules define the conditions under which an alert is triggered. Each rule specifies:

  • Metric — The Redshift system metric to monitor (e.g., query runtime, disk spill rate, CPU utilization)
  • Condition — Comparison operator and threshold value (e.g., query_runtime > 300s)
  • Evaluation window — How many consecutive evaluation cycles must breach the threshold before triggering
  • Severity — Critical, High, Medium, or Low
  • Escalation policy — Which policy handles notification routing when this rule fires
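The fields above can be sketched as a small Python structure with the consecutive-cycle trigger check. The field names and evaluation logic here are illustrative, not the platform's actual schema.

```python
# Hypothetical alert-rule shape; field names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class AlertRule:
    metric: str             # Redshift system metric, e.g. "query_runtime"
    operator: str           # comparison operator: ">", "<", ">=", "<="
    threshold: float        # threshold value, e.g. 300 (seconds)
    evaluation_cycles: int  # consecutive breaching cycles before triggering
    severity: str           # "Critical", "High", "Medium", or "Low"
    escalation_policy: str  # policy that routes notifications when rule fires

    def breaches(self, value: float) -> bool:
        """Return True if a single sample breaches the rule's condition."""
        return {">": value > self.threshold, "<": value < self.threshold,
                ">=": value >= self.threshold,
                "<=": value <= self.threshold}[self.operator]


def should_trigger(rule: AlertRule, samples: list[float]) -> bool:
    """Trigger only when the last `evaluation_cycles` samples all breach."""
    window = samples[-rule.evaluation_cycles:]
    return (len(window) == rule.evaluation_cycles
            and all(rule.breaches(v) for v in window))
```

For example, a rule on `query_runtime > 300s` with a three-cycle window fires only after three consecutive breaching samples, so a single slow query does not page anyone.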

Alert States

State          Description                                        Action
Triggered      Alert condition breached, notifications sent       Investigate
Acknowledged   A team member has claimed the incident             Working on it
Resolved       Condition no longer breached or manually closed    Post-mortem
Suppressed     Alert silenced during maintenance window           Wait
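The lifecycle above can be expressed as an explicit transition table. The allowed-transition set below is inferred from the state descriptions, not taken from platform code.

```python
# Transition table inferred from the alert states; an assumption, not the
# platform's actual state machine.
ALERT_TRANSITIONS = {
    "Triggered":    {"Acknowledged", "Resolved", "Suppressed"},
    "Acknowledged": {"Resolved", "Suppressed"},
    "Suppressed":   {"Triggered", "Resolved"},  # maintenance window ends
    "Resolved":     set(),                      # terminal until re-triggered
}


def transition(current: str, target: str) -> str:
    """Move an alert to `target`, rejecting transitions the table forbids."""
    if target not in ALERT_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```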

Escalation Policies

Escalation policies define a multi-stage notification workflow that triggers when an alert is not acknowledged within a configurable time window.

Escalation Policy Example
{
  "name": "Critical Database Alerts",
  "stages": [
    {
      "stage": 1,
      "delay_minutes": 0,
      "targets": ["on-call-dba@example.com"],
      "channels": ["email", "sms"]
    },
    {
      "stage": 2,
      "delay_minutes": 15,
      "targets": ["dba-team-channel"],
      "channels": ["slack"]
    },
    {
      "stage": 3,
      "delay_minutes": 30,
      "targets": ["engineering-director@example.com"],
      "channels": ["email", "phone"]
    }
  ],
  "repeat_interval_minutes": 60
}

Each stage fires if the alert remains unacknowledged after the specified delay. Escalation timing is managed by APScheduler in the backend.
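The stage timing can be derived directly from a policy document like the example above. This sketch computes absolute fire times per stage; the repeat logic (re-running the full chain every `repeat_interval_minutes`) is an assumption about behavior the section does not spell out.

```python
# Hedged sketch: derive (stage, fire_time) pairs from an escalation policy.
# Field names mirror the example JSON; repeat semantics are assumed.
from datetime import datetime, timedelta


def escalation_times(policy: dict, triggered_at: datetime,
                     repeats: int = 1) -> list[tuple[int, datetime]]:
    """Return (stage, fire_time) pairs for `repeats` passes of the policy."""
    out = []
    for cycle in range(repeats):
        base = triggered_at + timedelta(
            minutes=cycle * policy.get("repeat_interval_minutes", 0))
        for stage in policy["stages"]:
            out.append((stage["stage"],
                        base + timedelta(minutes=stage["delay_minutes"])))
    return out
```

With the example policy, stage 2 fires 15 minutes after the trigger, stage 3 after 30, and the whole chain restarts 60 minutes in if the alert is still unacknowledged.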

Notification Channels

The platform supports four notification channel types:

Channel    Configuration                                      Use Case
Email      SMTP server, from address, recipient list          Async notifications, audit trail
SMS        Phone numbers via SMS gateway                      Urgent paging for critical alerts
Slack      Webhook URL, channel name                          Team-wide awareness, ChatOps
Webhook    Custom HTTP endpoint, headers, payload template    Integration with PagerDuty, OpsGenie, etc.
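Fan-out across channel types can be modeled as a dispatch table mapping each channel to a sender. The senders below only format messages; real transports (SMTP, an SMS gateway, a Slack webhook, an HTTP POST) are outside this sketch, and all names are illustrative.

```python
# Illustrative channel dispatch; senders are stubs, not real transports.
from typing import Callable

SENDERS: dict[str, Callable[[str, str], str]] = {
    "email":   lambda target, msg: f"EMAIL to {target}: {msg}",
    "sms":     lambda target, msg: f"SMS to {target}: {msg}",
    "slack":   lambda target, msg: f"SLACK #{target}: {msg}",
    "webhook": lambda target, msg: f"POST {target}: {msg}",
}


def notify(channels: list[str], target: str, msg: str) -> list[str]:
    """Fan a message out over every configured channel for one target."""
    return [SENDERS[ch](target, msg) for ch in channels]
```

A stage with `"channels": ["email", "sms"]` would thus produce one send per channel for each of its targets.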

On-Call Schedules

The Schedules view manages rotation-based on-call assignments. Features:

  • Rotation types — Daily, weekly, or custom shift rotations
  • Overrides — Temporary substitutions for vacations or emergencies
  • Calendar view — Visual schedule editor using react-day-picker
  • API integration — Escalation policies reference schedules to route to the current on-call person

Schedule Resolution

When an escalation policy targets an on-call schedule, the API dynamically resolves the current on-call user at the time of the alert. If no one is on-call, the alert falls back to the escalation policy's default recipient.
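The resolution step can be sketched for a simple fixed-length rotation with the default-recipient fallback described above. The rotation model (equal shifts cycling through a roster) and all names are assumptions for illustration.

```python
# Sketch of on-call resolution for a fixed-length shift rotation, with the
# fallback to a default recipient when nobody is on call. Illustrative only.
from datetime import datetime, timedelta


def current_on_call(roster: list[str], rotation_start: datetime,
                    shift: timedelta, at: datetime, default: str) -> str:
    """Resolve who is on call at `at`, falling back to `default`."""
    if not roster or at < rotation_start:
        return default
    shifts_elapsed = int((at - rotation_start) / shift)
    return roster[shifts_elapsed % len(roster)]
```

An override (for a vacation, say) could be layered on top by checking an override table before falling through to the rotation.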

Alerts Dashboard

The main alerts view shows all active and historical alerts in a filterable DataTable. Columns include:

  • Alert name and rule that fired it
  • Severity badge (Critical / High / Medium / Low)
  • Status badge with color coding
  • Triggered time and duration
  • Assigned responder (post-acknowledgement)
  • Quick actions: Acknowledge, Resolve, Suppress