QuperSync

Sync Engine

The QuperSync sync engine orchestrates data movement from Amazon Redshift to PostgreSQL in three phases: Extract, Transform, and Load. The engine is designed to be idempotent — re-running a sync after a failure produces the same result without duplicating data.

Three-Phase Architecture

Phase 1 — Extract

Data is extracted from Redshift using the redshift-connector package, a pure-Python DB-API 2.0 driver. The extraction query is configurable per sync job:

  • Full extraction: SELECT * FROM source_table — fetches all rows. Used for initial loads or tables where incremental sync is not supported.
  • Incremental extraction: Adds a WHERE clause filtering by a watermark column (typically updated_at or created_at): WHERE updated_at > last_sync_time. Only new or modified rows are fetched.

Results are fetched in batches using cursor.fetchmany(batch_size) rather than loading all rows into memory at once. This bounds memory usage regardless of table size.

Phase 2 — Transform

Each row dict from Redshift is mapped to a domain Entity object. The transformation layer handles:

  • Type mapping: Redshift types are converted to Python native types (VARCHAR → str, NUMERIC → float, TIMESTAMP → datetime, etc.)
  • NULL coercion: NULL values from Redshift are preserved as Python None and propagated to PostgreSQL as SQL NULL
  • Business rules: Entity constructors validate required fields and apply any business-specific transformations (e.g., currency normalization, date truncation)
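A transform step along these lines can be sketched with a dataclass. `OrderEntity`, its fields, and the `from_row` constructor are hypothetical names invented for illustration; the actual entity classes and their business rules live in the QuperSync codebase:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class OrderEntity:
    """Hypothetical domain entity built from one Redshift row dict."""
    order_id: str
    amount: Optional[float]          # NUMERIC -> float per the type mapping
    updated_at: Optional[datetime]   # TIMESTAMP -> datetime

    @classmethod
    def from_row(cls, row: dict) -> "OrderEntity":
        # Business rule: required field must be present and non-NULL.
        if row.get("order_id") is None:
            raise ValueError("order_id is required")
        amount = row.get("amount")
        return cls(
            order_id=str(row["order_id"]),
            # NULL coercion: SQL NULL arrives as None and stays None.
            amount=float(amount) if amount is not None else None,
            updated_at=row.get("updated_at"),
        )
```

The key property shown is that `None` passes through untouched (so it lands in PostgreSQL as SQL NULL), while non-NULL values are coerced to the mapped Python type.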

Phase 3 — Load

Entities are written to PostgreSQL using an upsert pattern:

INSERT INTO target_table (col1, col2, ...) VALUES (%s, %s, ...) ON CONFLICT (primary_key) DO UPDATE SET col1 = EXCLUDED.col1, col2 = EXCLUDED.col2, ...

This is idempotent: running the same load twice produces one row in the target table, not two. If the row already exists, it is updated with the latest values.
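Generating that upsert statement for an arbitrary column list can be sketched as follows; `build_upsert` is an illustrative helper, not the engine's actual function:

```python
from typing import List


def build_upsert(table: str, columns: List[str], conflict_key: str) -> str:
    """Build an idempotent PostgreSQL upsert statement with %s placeholders.

    Every non-key column is refreshed from EXCLUDED (the row that
    failed to insert), so re-running the same load updates in place
    instead of duplicating.
    """
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    updates = ", ".join(
        f"{c} = EXCLUDED.{c}" for c in columns if c != conflict_key
    )
    return (
        f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) "
        f"ON CONFLICT ({conflict_key}) DO UPDATE SET {updates}"
    )
```

The resulting string can then be executed once per batch, e.g. with a DB-API driver's `cursor.executemany(sql, rows)` followed by a commit.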

Performance Configuration

Parameter        | Default    | Description
-----------------|------------|-----------------------------------------------
batch_size       | 1,000 rows | Number of rows fetched from Redshift and inserted into PostgreSQL per batch
commit_per_batch | true       | Each batch is committed as its own transaction; failed batches roll back without affecting previously committed batches
max_retries      | 3          | Number of retry attempts for a failed batch before marking the sync job as failed
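These parameters and their defaults could be carried in a small config object. The `SyncConfig` dataclass below is a hypothetical sketch of that shape, not the engine's actual structure:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncConfig:
    """Per-job performance settings with the documented defaults."""
    batch_size: int = 1000        # rows per extract/load batch
    commit_per_batch: bool = True  # one transaction per batch
    max_retries: int = 3          # retry attempts before the job fails
```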

Error Handling

Errors during a single batch do not abort the entire sync. On a batch failure, the engine:

  1. Rolls back the failed batch transaction
  2. Logs the error with the batch offset
  3. Retries up to max_retries times with exponential backoff
  4. If all retries fail, records the failure in the qupersync_state table and advances to the next batch
  5. Updates the watermark only after a fully successful sync, so a partial failure re-processes the failed rows on the next scheduled run
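The retry-with-backoff step can be sketched as a small wrapper. The function name, the `base_delay` of 1 second, and the doubling schedule are assumptions for illustration; the source only specifies "exponential backoff":

```python
import time
from typing import Any, Callable


def run_batch_with_retries(load_batch: Callable[[Any], None], batch: Any,
                           max_retries: int = 3, base_delay: float = 1.0,
                           sleep: Callable[[float], None] = time.sleep) -> bool:
    """Attempt a batch load, retrying with exponential backoff.

    Returns True on success, False once max_retries retries are
    exhausted (the caller would then record the failure and advance
    to the next batch).
    """
    for attempt in range(max_retries + 1):
        try:
            load_batch(batch)
            return True
        except Exception:
            if attempt == max_retries:
                return False
            # Back off 1s, 2s, 4s, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    return False
```

Injecting `sleep` keeps the backoff schedule testable without real delays.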

Idempotency Guarantee

Because the load phase uses ON CONFLICT DO UPDATE, re-running a sync after a partial failure is always safe. Previously-synced rows will be updated (no-op if unchanged), and only the failed rows will be newly inserted. There is no risk of duplicate data in the target PostgreSQL tables.