QuperSync

Sync Engine

The QuperSync sync engine orchestrates data movement from Amazon Redshift to PostgreSQL in three phases: Extract, Transform, and Load. The engine is designed to be idempotent — re-running a sync after a failure produces the same result without duplicating data.

Three-Phase Architecture

Phase 1 — Extract

Data is extracted from Redshift using the redshift-connector package, a pure-Python DB-API 2.0 driver. The extraction query is configurable per sync job:

  • Full extraction: SELECT * FROM source_table — fetches all rows. Used for initial loads or tables where incremental sync is not supported.
  • Incremental extraction: Adds a WHERE clause filtering by a watermark column (typically updated_at or created_at): WHERE updated_at > last_sync_time. Only new or modified rows are fetched.

Results are fetched in batches using cursor.fetchmany(batch_size) rather than loading all rows into memory at once. This bounds memory usage regardless of table size.

Phase 2 — Transform

Each row dict from Redshift is mapped to a domain Entity object. The transformation layer handles:

  • Type mapping: Redshift types are converted to Python native types (VARCHAR → str, NUMERIC → float, TIMESTAMP → datetime, etc.)
  • NULL coercion: NULL values from Redshift are preserved as Python None and propagated to PostgreSQL as SQL NULL
  • Business rules: Entity constructors validate required fields and apply any business-specific transformations (e.g., currency normalization, date truncation)
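A transform step along these lines can be sketched with a dataclass. `OrderEntity`, its fields, and the `from_row` constructor are hypothetical names invented for illustration; the actual entity classes and their business rules live in the QuperSync codebase:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class OrderEntity:
    """Hypothetical domain entity built from one Redshift row dict."""
    order_id: str
    amount: Optional[float]          # NUMERIC -> float per the type mapping
    updated_at: Optional[datetime]   # TIMESTAMP -> datetime

    @classmethod
    def from_row(cls, row: dict) -> "OrderEntity":
        # Business rule: required field must be present and non-NULL.
        if row.get("order_id") is None:
            raise ValueError("order_id is required")
        amount = row.get("amount")
        return cls(
            order_id=str(row["order_id"]),
            # NULL coercion: SQL NULL arrives as None and stays None.
            amount=float(amount) if amount is not None else None,
            updated_at=row.get("updated_at"),
        )
```

The key property shown is that `None` passes through untouched (so it lands in PostgreSQL as SQL NULL), while non-NULL values are coerced to the mapped Python type.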

Phase 3 — Load

Entities are written to PostgreSQL using an upsert pattern:

INSERT INTO target_table (col1, col2, ...) VALUES (%s, %s, ...) ON CONFLICT (primary_key) DO UPDATE SET col1 = EXCLUDED.col1, col2 = EXCLUDED.col2, ...

This is idempotent: running the same load twice produces one row in the target table, not two. If the row already exists, it is updated with the latest values.
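Generating that upsert statement for an arbitrary column list can be sketched as follows; `build_upsert` is an illustrative helper, not the engine's actual function:

```python
from typing import List


def build_upsert(table: str, columns: List[str], conflict_key: str) -> str:
    """Build an idempotent PostgreSQL upsert statement with %s placeholders.

    Every non-key column is refreshed from EXCLUDED (the row that
    failed to insert), so re-running the same load updates in place
    instead of duplicating.
    """
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    updates = ", ".join(
        f"{c} = EXCLUDED.{c}" for c in columns if c != conflict_key
    )
    return (
        f"INSERT INTO {table} ({cols}) VALUES ({placeholders}) "
        f"ON CONFLICT ({conflict_key}) DO UPDATE SET {updates}"
    )
```

The resulting string can then be executed once per batch, e.g. with a DB-API driver's `cursor.executemany(sql, rows)` followed by a commit.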

Performance Configuration

Parameter        | Default    | Description
-----------------|------------|-----------------------------------------------
batch_size       | 1,000 rows | Number of rows fetched from Redshift and inserted into PostgreSQL per batch
commit_per_batch | true       | Each batch is committed as its own transaction; failed batches roll back without affecting previously committed batches
max_retries      | 3          | Number of retry attempts for a failed batch before marking the sync job as failed
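These parameters and their defaults could be carried in a small config object. The `SyncConfig` dataclass below is a hypothetical sketch of that shape, not the engine's actual structure:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncConfig:
    """Per-job performance settings with the documented defaults."""
    batch_size: int = 1000        # rows per extract/load batch
    commit_per_batch: bool = True  # one transaction per batch
    max_retries: int = 3          # retry attempts before the job fails
```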

Error Handling

Errors during a single batch do not abort the entire sync. On a batch failure, the engine:

  1. Rolls back the failed batch transaction
  2. Logs the error with the batch offset
  3. Retries up to max_retries times with exponential backoff
  4. If all retries fail, records the failure in the qupersync_state table and advances to the next batch
  5. Updates the watermark only after a fully successful sync, so a partial failure re-processes the failed rows on the next scheduled run
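The retry-with-backoff step can be sketched as a small wrapper. The function name, the `base_delay` of 1 second, and the doubling schedule are assumptions for illustration; the source only specifies "exponential backoff":

```python
import time
from typing import Any, Callable


def run_batch_with_retries(load_batch: Callable[[Any], None], batch: Any,
                           max_retries: int = 3, base_delay: float = 1.0,
                           sleep: Callable[[float], None] = time.sleep) -> bool:
    """Attempt a batch load, retrying with exponential backoff.

    Returns True on success, False once max_retries retries are
    exhausted (the caller would then record the failure and advance
    to the next batch).
    """
    for attempt in range(max_retries + 1):
        try:
            load_batch(batch)
            return True
        except Exception:
            if attempt == max_retries:
                return False
            # Back off 1s, 2s, 4s, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    return False
```

Injecting `sleep` keeps the backoff schedule testable without real delays.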

Idempotency Guarantee

Because the load phase uses ON CONFLICT DO UPDATE, re-running a sync after a partial failure is always safe. Previously-synced rows will be updated (no-op if unchanged), and only the failed rows will be newly inserted. There is no risk of duplicate data in the target PostgreSQL tables.