Monitoring & Alerting
The Monitoring module provides real-time observability and incident management for Redshift clusters. It supports rule-based alert detection, multi-stage escalation, on-call scheduling, and multi-channel notifications.
Route prefix: /monitoring
Module Structure
```
/monitoring
├── /alert-rules → Create & manage alert conditions
├── /alerts → Real-time alert dashboard
├── /escalation-policies → Multi-stage escalation workflows
├── /notification-channels → Email, SMS, Slack, webhook setup
├── /schedules → On-call schedule management
└── /settings → Monitoring system configuration
```
Alert Rules
Alert rules define the conditions under which an alert is triggered. Each rule specifies:
- Metric — The Redshift system metric to monitor (e.g., query runtime, disk spill rate, CPU utilization)
- Condition — Comparison operator and threshold value (e.g., `query_runtime > 300s`)
- Evaluation window — How many consecutive evaluation cycles must breach the threshold before triggering
- Severity — Critical, High, Medium, or Low
- Escalation policy — Which policy handles notification routing when this rule fires
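The evaluation window can be read as a count of consecutive breaching cycles. The sketch below illustrates that reading only; `AlertRule`, `should_fire`, the field names, and the operator set are hypothetical and not taken from the module's code.

```python
import operator
from dataclasses import dataclass

# Comparison operators a rule condition might use (assumed set).
OPERATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass
class AlertRule:
    """Hypothetical in-memory shape of an alert rule."""
    metric: str
    op: str
    threshold: float
    window: int              # consecutive breaching cycles required
    severity: str
    escalation_policy: str


def should_fire(rule: AlertRule, samples: list[float]) -> bool:
    """Return True if the last `rule.window` samples all breach the threshold."""
    if rule.window <= 0 or len(samples) < rule.window:
        return False
    compare = OPERATORS[rule.op]
    return all(compare(value, rule.threshold) for value in samples[-rule.window:])


rule = AlertRule("query_runtime", ">", 300.0, 3, "Critical", "Critical Database Alerts")
print(should_fire(rule, [120.0, 310.0, 450.0, 305.0]))  # last three samples breach
print(should_fire(rule, [310.0, 450.0, 200.0]))         # most recent sample recovered
```

Requiring consecutive breaches keeps a single slow query or a brief CPU spike from paging anyone.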
Alert States
| State | Description | Action |
|---|---|---|
| Triggered | Alert condition breached, notifications sent | Investigate |
| Acknowledged | A team member has claimed the incident | Working on it |
| Resolved | Condition no longer breached or manually closed | Post-mortem |
| Suppressed | Alert silenced during maintenance window | Wait |
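The four states can be treated as a small state machine. The transition map below is an assumption made for illustration (for example, that a resolved alert is never reopened); the module may allow a different set of moves. `AlertState`, `ALLOWED`, and `transition` are hypothetical names.

```python
from enum import Enum


class AlertState(Enum):
    TRIGGERED = "Triggered"
    ACKNOWLEDGED = "Acknowledged"
    RESOLVED = "Resolved"
    SUPPRESSED = "Suppressed"


# Assumed transition map, not confirmed by the module.
ALLOWED = {
    AlertState.TRIGGERED: {AlertState.ACKNOWLEDGED, AlertState.RESOLVED, AlertState.SUPPRESSED},
    AlertState.ACKNOWLEDGED: {AlertState.RESOLVED},
    AlertState.SUPPRESSED: {AlertState.TRIGGERED, AlertState.RESOLVED},
    AlertState.RESOLVED: set(),
}


def transition(current: AlertState, target: AlertState) -> AlertState:
    """Return the new state, or raise if the move is not in the assumed map."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


state = transition(AlertState.TRIGGERED, AlertState.ACKNOWLEDGED)
print(state.value)
```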
Escalation Policies
Escalation policies define a multi-stage notification workflow that triggers when an alert is not acknowledged within a configurable time window.
```json
{
  "name": "Critical Database Alerts",
  "stages": [
    {
      "stage": 1,
      "delay_minutes": 0,
      "targets": ["on-call-dba@example.com"],
      "channels": ["email", "sms"]
    },
    {
      "stage": 2,
      "delay_minutes": 15,
      "targets": ["dba-team-channel"],
      "channels": ["slack"]
    },
    {
      "stage": 3,
      "delay_minutes": 30,
      "targets": ["engineering-director@example.com"],
      "channels": ["email", "phone"]
    }
  ],
  "repeat_interval_minutes": 60
}
```
Each stage fires if the alert remains unacknowledged after the specified delay. The APScheduler in the backend manages escalation timing.
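To make the timing concrete, the sketch below works out which stages of the example policy would have fired for an alert that is still unacknowledged. It assumes `delay_minutes` is measured from the moment the alert triggers (rather than from the previous stage) and ignores `repeat_interval_minutes`, whose exact semantics are not described here. `due_stages` is a hypothetical helper, not part of the backend.

```python
# Mirrors the example policy, trimmed to the fields this sketch uses.
policy = {
    "name": "Critical Database Alerts",
    "stages": [
        {"stage": 1, "delay_minutes": 0},
        {"stage": 2, "delay_minutes": 15},
        {"stage": 3, "delay_minutes": 30},
    ],
}


def due_stages(policy: dict, minutes_unacknowledged: float) -> list[int]:
    """Stage numbers whose delay has elapsed for a still-unacknowledged alert."""
    ordered = sorted(policy["stages"], key=lambda s: s["delay_minutes"])
    return [
        stage["stage"]
        for stage in ordered
        if stage["delay_minutes"] <= minutes_unacknowledged
    ]


for elapsed in (0, 20, 45):
    print(elapsed, due_stages(policy, elapsed))
```

Under these assumptions, a scheduler only needs one delayed job per stage, each of which checks whether the alert has been acknowledged before it notifies anyone.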
Notification Channels
The platform supports four notification channel types:
| Channel | Configuration | Use Case |
|---|---|---|
| Email | SMTP server, from address, recipient list | Async notifications, audit trail |
| SMS | Phone numbers via SMS gateway | Urgent paging for critical alerts |
| Slack | Webhook URL, channel name | Team-wide awareness, ChatOps |
| Webhook | Custom HTTP endpoint, headers, payload template | Integration with PagerDuty, OpsGenie, etc. |
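As an illustration of the webhook channel's payload template, the sketch below fills a JSON template from alert fields using only the Python standard library. The template text, placeholder names (`alert_name`, `severity`, `status`), and `render_payload` are hypothetical; the platform's actual template syntax may differ.

```python
import json
from string import Template

# Hypothetical payload template; placeholder names are illustrative only.
PAYLOAD_TEMPLATE = Template(
    '{"summary": "$alert_name", "severity": "$severity", "status": "$status"}'
)


def render_payload(alert: dict) -> dict:
    """Fill the template and parse it, so malformed output fails early."""
    # json.dumps(...)[1:-1] escapes quotes and backslashes so each value
    # stays valid inside the surrounding JSON string literal.
    safe = {key: json.dumps(str(value))[1:-1] for key, value in alert.items()}
    return json.loads(PAYLOAD_TEMPLATE.substitute(safe))


payload = render_payload(
    {"alert_name": 'Disk "spill" rate high', "severity": "Critical", "status": "Triggered"}
)
print(payload["summary"])
```

Escaping each value before substitution keeps an alert name containing quotes from breaking the JSON body sent to the endpoint.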
On-Call Schedules
The Schedules view manages rotation-based on-call assignments. Features:
- Rotation types — Daily, weekly, or custom shift rotations
- Overrides — Temporary substitutions for vacations or emergencies
- Calendar view — Visual schedule editor using react-day-picker
- API integration — Escalation policies reference schedules to route to the current on-call person
Schedule Resolution
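One plausible way to resolve the current on-call person, assuming a fixed-length rotation in which overrides take precedence over the regular order, is sketched below. `Override`, `resolve_on_call`, and all parameter names are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Override:
    """Temporary substitution covering the half-open interval [start, end)."""
    start: datetime
    end: datetime
    person: str


def resolve_on_call(
    participants: list[str],
    rotation_start: datetime,
    shift_length: timedelta,
    at: datetime,
    overrides: Optional[list[Override]] = None,
) -> str:
    """Return who is on call at `at`; an active override wins over the rotation."""
    for override in overrides or []:
        if override.start <= at < override.end:
            return override.person
    if at < rotation_start:
        raise ValueError("schedule has not started yet")
    # Whole shifts elapsed since the rotation began, wrapped around the roster.
    shifts_elapsed = (at - rotation_start) // shift_length
    return participants[shifts_elapsed % len(participants)]


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
team = ["alice", "bob", "carol"]
vacation_cover = Override(
    datetime(2024, 1, 9, tzinfo=timezone.utc),
    datetime(2024, 1, 12, tzinfo=timezone.utc),
    "carol",
)
when = datetime(2024, 1, 10, tzinfo=timezone.utc)
print(resolve_on_call(team, start, timedelta(weeks=1), when))
print(resolve_on_call(team, start, timedelta(weeks=1), when, [vacation_cover]))
```

An escalation stage that targets a schedule rather than a fixed address could call a function like this at notification time, so the page always goes to whoever is on call at that moment.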
Alerts Dashboard
The main alerts view shows all active and historical alerts in a filterable DataTable. Columns include:
- Alert name and rule that fired it
- Severity badge (Critical / High / Medium / Low)
- Status badge with color coding
- Triggered time and duration
- Assigned responder (post-acknowledgement)
- Quick actions: Acknowledge, Resolve, Suppress