Redshift Health Monitoring
The Redshift Health module provides a comprehensive view of Amazon Redshift cluster performance, storage health, query execution, and historical trends. Data is sourced from Redshift system views (stv_*, svv_*, sys_*, stl_*) via the QuperSync service.
Route prefix: /redshift-health
Module Structure
/redshift-health
├── /                      → Cluster Overview
├── /table-analysis
│   ├── /node-details      → Hardware & node specs
│   ├── /table-info        → Table metadata & statistics
│   └── /table-alerts      → Table-level alert events
├── /query-analysis
│   ├── /top-50-runtime    → Longest running queries
│   ├── /query-spill       → Memory spill detection
│   ├── /disk-spill        → Disk I/O spill metrics
│   └── /query-time        → Query execution breakdown
├── /alerts
│   ├── /create            → Alert rule creation
│   ├── /logs              → Historical alert records
│   └── /config            → Alert settings
├── /live-health
│   └── /longest-running   → Real-time query monitoring
└── /observe
    ├── /vacuum-stats      → Vacuum reclaim analysis
    ├── /volume-stats      → Storage volume analysis
    ├── /sort-key-encoding → Column optimization details
    ├── /freshness-stats   → Data freshness metrics
    └── /reclaimed-space   → Space reclamation opportunities

Cluster Overview
The main entry point shows a health summary across all clusters associated with the active tenant. It surfaces:
- Node count & type — Total nodes, node type (RA3, DC2, DS2, DC1), and Serverless indicator
- Storage utilization — Used vs. total storage with percentage gauge charts
- Active queries — Count of currently executing queries from stv_wlm_query_state
- WLM queue depth — Queued vs. executing query counts per queue
- Recent alerts — Last 5 alert events from quper.alert_logs
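As an illustration of the queue-depth rollup, here is a minimal Python sketch that aggregates rows already fetched from stv_wlm_query_state (the service_class/state keys are assumptions about the fetched row shape, not a guaranteed cursor layout):

```python
from collections import defaultdict

def wlm_queue_depth(rows):
    """Count queued vs. executing queries per WLM service class.

    `rows` is a list of dicts fetched from stv_wlm_query_state;
    the 'service_class'/'state' keys mirror that view's columns
    (an assumption -- adjust to the actual cursor output).
    """
    depth = defaultdict(lambda: {"queued": 0, "executing": 0})
    for row in rows:
        queue = row["service_class"]
        if row["state"].startswith("Queued"):
            depth[queue]["queued"] += 1
        elif row["state"] == "Running":
            depth[queue]["executing"] += 1
    return dict(depth)

rows = [
    {"service_class": 6, "state": "Running"},
    {"service_class": 6, "state": "QueuedWaiting"},
    {"service_class": 7, "state": "Running"},
]
print(wlm_queue_depth(rows))
# {6: {'queued': 1, 'executing': 1}, 7: {'queued': 0, 'executing': 1}}
```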
Table Analysis
Node Details
Displays hardware specifications for each cluster node. Node type is determined by matching storage capacity values from stv_node_storage_capacity against known capacity thresholds:
-- From tools.py node_details() query
CASE
WHEN capacity = 190633 AND NOT is_nvme THEN 'dc1.large' -- 176 GB SSD
WHEN capacity = 380319 THEN 'dc1.8xlarge' -- 352 GB SSD
WHEN capacity = 190633 AND is_nvme THEN 'dc2.large' -- 160 GB NVMe
WHEN capacity = 760956 THEN 'dc2.8xlarge' -- 640 GB NVMe
WHEN capacity = 952455 AND part_count = 1 THEN 'ra3.xlplus' -- 32 TB Managed
WHEN capacity = 3339176 AND part_count = 1 THEN 'ra3.4xlarge'
WHEN capacity = 3339176 AND part_count = 4 THEN 'ra3.16xlarge'
END AS node_type

Additional node metrics include: slice count, NVMe availability, storage used (GB), and storage capacity (GB). Storage utilization is computed as:
storage_utilization_pct = (1.0 * used / capacity) * 100
-- Normalized to NUMERIC(8,4) for precision

Table Info
A comprehensive metadata view that joins 10+ Redshift system tables into a single DataTable. Key columns include:
| Column | Source | Description |
|---|---|---|
| table_name | svv_table_info | Fully qualified table name |
| tbl_rows | svv_table_info | Estimated visible row count |
| unsorted_pct | svv_table_info | % rows not in sort key order |
| pct_skew_across_slices | svv_table_info | Data skew % |
| sort_key1 | svv_table_info | Primary sort key column |
| diststyle | svv_table_info | EVEN / KEY / ALL / AUTO |
| vacuum_status | stl_vacuum | Last vacuum operation type |
| last_vacuum_time | stl_vacuum | Timestamp of last vacuum |
| pct_rows_marked_for_deletion | Computed | Ghost row percentage |
| alert_flags | stl_alert_event_log | Bitmask of active alert types |
The ghost row percentage is computed as:
pct_rows_marked_for_deletion =
(tbl_rows - estimated_visible_rows) / NVL(tbl_rows, 1) * 100

Table Alerts
Shows alert events from the past 7 days sourced from stl_alert_event_log. Alert types are encoded as a bitmask and decoded using bitwise AND operations:
-- Each bit represents a distinct alert category
event & 1 = 1 → 'Sortkey' -- Poor sort key usage
event & 2 = 2 → 'Deletes' -- High delete rows
event & 4 = 4 → 'NL' -- Nested loop join detected
event & 8 = 8 → 'Dist' -- Distribution mismatch
event & 16 = 16 → 'Broadcast' -- Large broadcast operation
event & 32 = 32 → 'Stats' -- Statistics out of date
-- Decoded string:
DECODE(event & 1, 1, 'Sortkey ', '') ||
DECODE(event & 2, 2, 'Deletes ', '') ||
...

Query Analysis
Top 50 Queries by Runtime
Aggregates query execution data from the past 7 days using MD5-based query fingerprinting to group similar queries:
SELECT
MD5(normalized_query_text) AS query_hash,
MAX(query_id) AS sample_query_id,
COUNT(*) AS execution_count,
AVG(runtime_minutes) AS avg_runtime_min,
MAX(runtime_minutes) AS max_runtime_min,
SUM(total_compile_time) AS total_compile_ms
FROM sys_query_history
WHERE start_date >= CURRENT_DATE - 7
AND status = 'success'
GROUP BY query_hash
ORDER BY avg_runtime_min DESC
LIMIT 50

Time is stored in microseconds in sys_query_history and converted:
runtime_minutes = execution_time / 1e6 / 60

Query Spill Detection
Identifies queries that exceeded in-memory buffers and spilled to managed storage. Spill metrics come from sys_query_detail:
-- Spill size calculation
blocks_spilled_gb = (spilled_block_count * 8.0) / (1024 * 1024)
-- Each Redshift block = 8 KB
-- Formula: blocks × 8 KB / 1,048,576 = GB

Performance Impact
Queue Time Analysis
Filters sys_query_history for queries where queue_time > 0 to identify workloads experiencing WLM contention. The view shows:
- Queue name (WLM service class)
- Queue wait time in seconds
- Execution time vs. queue time ratio
- User/application that submitted the query
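The execution-vs-queue ratio reduces to plain arithmetic; a hedged sketch (the helper name and microsecond units are assumptions, matching the conversion used for sys_query_history elsewhere in this module):

```python
def queue_contention(exec_time_us, queue_time_us):
    """Return (queue_seconds, exec/queue ratio) for one query.

    Times are assumed to be in microseconds, as elsewhere in the
    sys_* views; a ratio below 1.0 means the query waited in the
    WLM queue longer than it actually ran.
    """
    if queue_time_us == 0:
        return 0.0, float("inf")  # never queued
    return queue_time_us / 1e6, exec_time_us / queue_time_us

# A query that ran 30 s after waiting 60 s in the queue:
print(queue_contention(30_000_000, 60_000_000))  # (60.0, 0.5)
```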
Live Health Monitoring
Real-time view of currently executing queries using stv_wlm_query_state. Filters for queries with exec_time > 300 seconds (5 minutes). Updates on page refresh.
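A minimal sketch of that filter, assuming exec_time is reported in microseconds (as in the other system views) and rows have already been fetched:

```python
LONG_RUNNING_US = 300 * 1_000_000  # 5 minutes in microseconds

def long_running(rows, threshold_us=LONG_RUNNING_US):
    """Return queries from stv_wlm_query_state whose execution time
    exceeds the threshold, longest first. Row shape is an assumption."""
    hot = [r for r in rows if r["exec_time"] > threshold_us]
    return sorted(hot, key=lambda r: r["exec_time"], reverse=True)

rows = [
    {"query": 101, "exec_time": 12_000_000},   # 12 s   -- ignored
    {"query": 102, "exec_time": 900_000_000},  # 15 min -- flagged
    {"query": 103, "exec_time": 400_000_000},  # ~6.7 min -- flagged
]
print([r["query"] for r in long_running(rows)])  # [102, 103]
```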
Observe
Vacuum Stats
Tracks vacuum history and reclaimable space. The top 5 tables by unsorted percentage are surfaced to guide vacuum prioritization:
-- Reclaimable space estimation
reclaimed_space_gb = impacted_blocks / 1024.0
-- Each impacted block = 1 MB in Redshift
-- Estimated deleted row size
estimated_deleted_size_gb =
estimated_deleted_rows * 4 / 1024.0 / 1024.0
-- Assumes average 4 KB per deleted row

Volume Stats
DML volume analysis using a CTE chain that joins stl_insert, stl_delete, and table statistics to compute net row changes since the last ANALYZE:
WITH last_analyze AS (
SELECT table_id, MAX(endtime) AS analyzed_at
FROM stl_analyze GROUP BY table_id
),
dml_since_analyze AS (
SELECT tbl,
SUM(inserted_rows) AS inserts,
SUM(deleted_rows) AS deletes
FROM (
SELECT tbl, rows AS inserted_rows, 0 AS deleted_rows
FROM stl_insert WHERE starttime > (SELECT analyzed_at FROM last_analyze ...)
UNION ALL
SELECT tbl, 0, rows FROM stl_delete WHERE ...
)
GROUP BY tbl
)
SELECT
t.name,
d.inserts,
d.deletes,
d.inserts - d.deletes AS net_change
FROM svv_table_info t
JOIN dml_since_analyze d ON t.table_id = d.tbl

Sort Key & Encoding Details
Joins svv_table_info, pg_attribute, and svv_columns to show column-level encoding and sort key configuration for optimization guidance.
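As a sketch of how such output might be consumed, the snippet below flags raw-encoded columns as compression candidates (the row shape and key names are assumptions, not the views' exact columns):

```python
def raw_encoded_columns(rows):
    """Return (table, column) pairs still using raw (no) encoding.

    `rows` mimic the svv_columns/pg_attribute join output; the key
    names here are illustrative assumptions.
    """
    return [(r["table"], r["column"])
            for r in rows if r["encoding"] in ("none", "raw")]

rows = [
    {"table": "sales", "column": "order_id", "encoding": "az64"},
    {"table": "sales", "column": "note", "encoding": "none"},
    {"table": "events", "column": "payload", "encoding": "raw"},
]
print(raw_encoded_columns(rows))  # [('sales', 'note'), ('events', 'payload')]
```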
Freshness Stats
Tracks the last ANALYZE timestamp per table by joining svv_table_info with stl_analyze. Tables not analyzed in the past 7 days are flagged as stale.
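The staleness check reduces to a date comparison; a minimal sketch assuming per-table ANALYZE timestamps have already been collected from stl_analyze:

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=7)

def stale_tables(last_analyzed, now=None):
    """Flag tables whose last ANALYZE is older than 7 days.

    `last_analyzed` maps table name -> datetime of the most recent
    stl_analyze endtime (None if the table was never analyzed).
    """
    now = now or datetime.utcnow()
    return sorted(
        name for name, ts in last_analyzed.items()
        if ts is None or now - ts > STALE_AFTER
    )

now = datetime(2024, 6, 15)
history = {
    "sales": datetime(2024, 6, 14),   # fresh
    "events": datetime(2024, 6, 1),   # stale
    "staging": None,                  # never analyzed -> stale
}
print(stale_tables(history, now=now))  # ['events', 'staging']
```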