
Redshift Health Monitoring

The Redshift Health module provides a comprehensive view of Amazon Redshift cluster performance, storage health, query execution, and historical trends. Data is sourced from Redshift system views (stv_*, svv_*, sys_*, stl_*) via the QuperSync service.

Route prefix: /redshift-health

Module Structure

Route Tree
/redshift-health
├── /                          → Cluster Overview
├── /table-analysis
│   ├── /node-details          → Hardware & node specs
│   ├── /table-info            → Table metadata & statistics
│   └── /table-alerts          → Table-level alert events
├── /query-analysis
│   ├── /top-50-runtime        → Longest running queries
│   ├── /query-spill           → Memory spill detection
│   ├── /disk-spill            → Disk I/O spill metrics
│   └── /query-time            → Query execution breakdown
├── /alerts
│   ├── /create                → Alert rule creation
│   ├── /logs                  → Historical alert records
│   └── /config                → Alert settings
├── /live-health
│   └── /longest-running       → Real-time query monitoring
└── /observe
    ├── /vacuum-stats          → Vacuum reclaim analysis
    ├── /volume-stats          → Storage volume analysis
    ├── /sort-key-encoding     → Column optimization details
    ├── /freshness-stats       → Data freshness metrics
    └── /reclaimed-space       → Space reclamation opportunities

Cluster Overview

The main entry point shows a health summary across all clusters associated with the active tenant. It surfaces:

  • Node count & type — Total nodes, node type (RA3, DC2, DS2, DC1), and Serverless indicator
  • Storage utilization — Used vs. total storage with percentage gauge charts
  • Active queries — Count of currently executing queries from stv_wlm_query_state
  • WLM queue depth — Queued vs. executing query counts per queue
  • Recent alerts — Last 5 alert events from quper.alert_logs

Table Analysis

Node Details

Displays hardware specifications for each cluster node. Node type is determined by matching storage capacity values from stv_node_storage_capacity against known capacity thresholds:

Node Type Detection Logic
-- From tools.py node_details() query
CASE
  WHEN capacity = 190633 AND NOT is_nvme THEN 'dc1.large'     -- 176 GB SSD
  WHEN capacity = 380319              THEN 'dc1.8xlarge'       -- 352 GB SSD
  WHEN capacity = 190633 AND is_nvme  THEN 'dc2.large'        -- 160 GB NVMe
  WHEN capacity = 760956              THEN 'dc2.8xlarge'       -- 640 GB NVMe
  WHEN capacity = 952455 AND part_count = 1 THEN 'ra3.xlplus' -- 32 TB Managed
  WHEN capacity = 3339176 AND part_count = 1 THEN 'ra3.4xlarge'
  WHEN capacity = 3339176 AND part_count = 4 THEN 'ra3.16xlarge'
END AS node_type
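
The same lookup can be sketched in Python; the function name and argument shapes are illustrative (the real logic lives in the tools.py SQL query):

```python
def detect_node_type(capacity: int, is_nvme: bool, part_count: int = 1) -> str:
    """Mirror the capacity-threshold CASE expression above.

    Capacity values are the raw numbers reported by
    stv_node_storage_capacity; 190633 is ambiguous and is
    disambiguated by NVMe support.
    """
    if capacity == 190633:
        return "dc2.large" if is_nvme else "dc1.large"
    if capacity == 380319:
        return "dc1.8xlarge"
    if capacity == 760956:
        return "dc2.8xlarge"
    if capacity == 952455 and part_count == 1:
        return "ra3.xlplus"
    if capacity == 3339176 and part_count == 1:
        return "ra3.4xlarge"
    if capacity == 3339176 and part_count == 4:
        return "ra3.16xlarge"
    return "unknown"
```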

Additional node metrics include: slice count, NVMe availability, storage used (GB), and storage capacity (GB). Storage utilization is computed as:

sql
storage_utilization_pct = (1.0 * used / capacity) * 100
-- Normalized to NUMERIC(8,4) for precision
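
The same computation in Python, rounded to four decimal places to match the NUMERIC(8,4) cast (function name is illustrative):

```python
def storage_utilization_pct(used: int, capacity: int) -> float:
    """Used-vs-capacity percentage, rounded to 4 decimal places
    to match the NUMERIC(8,4) normalization in the SQL."""
    if capacity == 0:
        return 0.0  # guard against an empty/unreported node
    return round(100.0 * used / capacity, 4)
```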

Table Info

A comprehensive metadata view that joins 10+ Redshift system tables into a single DataTable. Key columns include:

Column                         Source                 Description
table_name                     svv_table_info         Fully qualified table name
tbl_rows                       svv_table_info         Estimated visible row count
unsorted_pct                   svv_table_info         % rows not in sort key order
pct_skew_across_slices         svv_table_info         Data skew %
sort_key1                      svv_table_info         Primary sort key column
diststyle                      svv_table_info         EVEN / KEY / ALL / AUTO
vacuum_status                  stl_vacuum             Last vacuum operation type
last_vacuum_time               stl_vacuum             Timestamp of last vacuum
pct_rows_marked_for_deletion   Computed               Ghost row percentage
alert_flags                    stl_alert_event_log    Bitmask of active alert types

The ghost row percentage is computed as:

sql
pct_rows_marked_for_deletion =
  (tbl_rows - estimated_visible_rows) * 100.0 / NULLIF(tbl_rows, 0)
-- NULLIF avoids division by zero on empty tables;
-- 100.0 forces floating-point division
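
A Python equivalent with an explicit empty-table guard (names are illustrative):

```python
def pct_rows_marked_for_deletion(tbl_rows: int, estimated_visible_rows: int) -> float:
    """Ghost rows are rows still on disk but marked deleted:
    the gap between total and visible row counts."""
    if tbl_rows == 0:
        return 0.0  # empty table has no ghost rows
    return (tbl_rows - estimated_visible_rows) / tbl_rows * 100
```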

Table Alerts

Shows alert events from the past 7 days sourced from stl_alert_event_log. Alert types are encoded as a bitmask and decoded using bitwise AND operations:

Alert Bitmask Decoding
-- Each bit represents a distinct alert category
event & 1  = 1  → 'Sortkey'    -- Poor sort key usage
event & 2  = 2  → 'Deletes'    -- High delete rows
event & 4  = 4  → 'NL'         -- Nested loop join detected
event & 8  = 8  → 'Dist'       -- Distribution mismatch
event & 16 = 16 → 'Broadcast'  -- Large broadcast operation
event & 32 = 32 → 'Stats'      -- Statistics out of date

-- Decoded string:
DECODE(event & 1, 1, 'Sortkey ', '') ||
DECODE(event & 2, 2, 'Deletes ', '') ||
...
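
The decoding logic above can be sketched in Python; the dict constant and function name are illustrative, not from the module:

```python
ALERT_BITS = {
    1:  "Sortkey",    # poor sort key usage
    2:  "Deletes",    # high delete rows
    4:  "NL",         # nested loop join detected
    8:  "Dist",       # distribution mismatch
    16: "Broadcast",  # large broadcast operation
    32: "Stats",      # statistics out of date
}

def decode_alert_flags(event: int) -> list[str]:
    """Decode the stl_alert_event_log bitmask into alert names."""
    return [name for bit, name in ALERT_BITS.items() if event & bit]
```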

Query Analysis

Top 50 Queries by Runtime

Aggregates query execution data from the past 7 days using MD5-based query fingerprinting to group similar queries:

Query Aggregation Pattern
SELECT
  MD5(normalized_query_text) AS query_hash,
  MAX(query_id)              AS sample_query_id,
  COUNT(*)                   AS execution_count,
  AVG(runtime_minutes)       AS avg_runtime_min,
  MAX(runtime_minutes)       AS max_runtime_min,
  SUM(total_compile_time)    AS total_compile_ms
FROM sys_query_history
WHERE start_time >= CURRENT_DATE - 7
  AND status = 'success'
GROUP BY query_hash
ORDER BY avg_runtime_min DESC
LIMIT 50

Time is stored in microseconds in sys_query_history and converted:

sql
runtime_minutes = execution_time / 1e6 / 60
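
A Python sketch of the fingerprinting and unit conversion; the normalization rules shown (lowercasing, collapsing whitespace) are illustrative assumptions, and the module's actual normalization may differ:

```python
import hashlib
import re

def query_fingerprint(query_text: str) -> str:
    """Group textually similar queries under one MD5 hash."""
    normalized = re.sub(r"\s+", " ", query_text.strip().lower())
    return hashlib.md5(normalized.encode()).hexdigest()

def runtime_minutes(execution_time_us: int) -> float:
    """sys_query_history reports durations in microseconds."""
    return execution_time_us / 1e6 / 60
```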

Query Spill Detection

Identifies queries that exceeded in-memory buffers and spilled to managed storage. Spill metrics come from sys_query_detail:

sql
-- Spill size calculation
blocks_spilled_gb = spilled_block_count / 1024.0
-- Each Redshift data block = 1 MB
-- Formula: blocks × 1 MB / 1,024 = GB

Performance Impact

Queries with high spill rates indicate insufficient WLM memory allocation. Consider increasing the memory percentage for the affected query queue or optimizing the query to reduce intermediate result sizes.

Queue Time Analysis

Filters sys_query_history for queries where queue_time > 0 to identify workloads experiencing WLM contention. The view shows:

  • Queue name (WLM service class)
  • Queue wait time in seconds
  • Execution time vs. queue time ratio
  • User/application that submitted the query
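
The execution-vs-queue ratio can be computed as below (function name is illustrative); a ratio well under 1 means a query spent more time waiting than running:

```python
def exec_to_queue_ratio(execution_time_s: float, queue_time_s: float) -> float:
    """Ratio of time spent executing to time spent queued.
    queue_time > 0 is guaranteed by the view's filter."""
    return execution_time_s / queue_time_s
```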

Live Health Monitoring

Real-time view of currently executing queries from stv_wlm_query_state, filtered to those running longer than 300 seconds (5 minutes). Note that stv_wlm_query_state reports exec_time in microseconds, so the predicate compares against 300 * 1,000,000. Data updates on page refresh.

Observe

Vacuum Stats

Tracks vacuum history and reclaimable space. The top 5 tables by unsorted percentage are surfaced to guide vacuum prioritization:

sql
-- Reclaimable space estimation
reclaimed_space_gb = impacted_blocks / 1024.0
-- Each impacted block = 1 MB in Redshift

-- Estimated deleted row size
estimated_deleted_size_gb =
  estimated_deleted_rows * 4 / 1024.0 / 1024.0 / 1024.0
-- Assumes average 4 bytes per deleted row;
-- three divisions by 1,024 convert bytes → GB
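
The same estimates in Python, assuming Redshift's 1 MB block size and the module's rough 4-bytes-per-deleted-row figure (function names are illustrative):

```python
def reclaimed_space_gb(impacted_blocks: int) -> float:
    """Redshift stores table data in 1 MB blocks, so blocks / 1024 = GB."""
    return impacted_blocks / 1024.0

def estimated_deleted_size_gb(estimated_deleted_rows: int) -> float:
    """Rough size of ghost rows, at an assumed 4 bytes per row.
    Three divisions by 1024 convert bytes -> GB."""
    return estimated_deleted_rows * 4 / 1024.0 / 1024.0 / 1024.0
```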

Volume Stats

DML volume analysis using a CTE chain that joins stl_insert, stl_delete, and table statistics to compute net row changes since the last ANALYZE:

Volume Stats CTE Pattern
WITH last_analyze AS (
  SELECT table_id, MAX(endtime) AS analyzed_at
  FROM stl_analyze GROUP BY table_id
),
dml_since_analyze AS (
  SELECT tbl,
    SUM(inserted_rows) AS inserts,
    SUM(deleted_rows)  AS deletes
  FROM (
    SELECT tbl, rows AS inserted_rows, 0 AS deleted_rows
    FROM stl_insert WHERE starttime > (SELECT analyzed_at FROM last_analyze ...)
    UNION ALL
    SELECT tbl, 0, rows FROM stl_delete WHERE ...
  ) AS dml_events  -- derived tables require an alias
  GROUP BY tbl
)
SELECT
  t.name,
  d.inserts,
  d.deletes,
  d.inserts - d.deletes AS net_change
FROM svv_table_info t
JOIN dml_since_analyze d ON t.table_id = d.tbl
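
The net-change arithmetic can be sketched in Python over per-table DML events; the tuple layout and function name are illustrative:

```python
from collections import defaultdict

def net_changes(events: list[tuple[int, int, int]]) -> dict[int, int]:
    """events: (table_id, inserted_rows, deleted_rows) tuples recorded
    since the last ANALYZE. Returns the net row change per table,
    mirroring the inserts - deletes expression in the SQL above."""
    totals: dict[int, int] = defaultdict(int)
    for tbl, inserted, deleted in events:
        totals[tbl] += inserted - deleted
    return dict(totals)
```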

Sort Key & Encoding Details

Joins svv_table_info, pg_attribute, and svv_columns to show column-level encoding and sort key configuration for optimization guidance.

Freshness Stats

Tracks the last ANALYZE timestamp per table by joining svv_table_info with stl_analyze. Tables not analyzed in the past 7 days are flagged as stale.
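
The staleness check reduces to a date comparison; a minimal sketch with an illustrative function name:

```python
from datetime import datetime, timedelta

def is_stale(last_analyzed: datetime, now: datetime, max_age_days: int = 7) -> bool:
    """Flag tables whose last ANALYZE falls outside the freshness window."""
    return now - last_analyzed > timedelta(days=max_age_days)
```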