By the end of Part 3 the data is in decent shape - raw JSON landing in S3, picked up by Redshift Spectrum, transformed through dbt into gold tables. The problem is that gold tables sitting in Redshift aren’t particularly accessible. Anything outside the data stack that wants to query them has to go through Redshift, which adds latency and cost for what are often simple lookups.
The solution I landed on was snapshotting those gold tables into PostgreSQL on a schedule. RDS is cheaper to run queries against for application use cases, it’s already in the VPC alongside everything else, and it means the rest of the stack doesn’t need to know Redshift exists.
The full code is on GitHub.
How It Works
EventBridge (schedule)
└── Lambda (Python 3.12)
    ├── Redshift (gold table query)
    └── PostgreSQL (staging → atomic swap)
EventBridge fires the Lambda every 6 hours by default. The Lambda reads the gold table out of Redshift, writes the rows into a PostgreSQL staging table, then does an atomic rename - the staging table becomes the live table in a single transaction. No partial writes are ever visible to readers.
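The run can be sketched as a single function. This is a minimal illustration, not the repo’s actual handler - the function names are made up, the staging table is rebuilt with `CREATE TABLE ... LIKE` as one plausible approach, and table names are assumed to come from trusted config (hence the f-strings):

```python
def run_snapshot(read_rows, pg_cursor, table):
    """Copy a gold table into a staging table, then swap it live.

    read_rows -- callable yielding row tuples from Redshift (illustrative)
    pg_cursor -- DB-API cursor on the PostgreSQL side
    table     -- gold table name; assumed validated, never user input
    """
    staging = f"{table}_staging"
    # Rebuild staging with the live table's shape so the swap is clean.
    pg_cursor.execute(f"DROP TABLE IF EXISTS {staging}")
    pg_cursor.execute(f"CREATE TABLE {staging} (LIKE {table} INCLUDING ALL)")
    # Load every row from Redshift into the staging table.
    for row in read_rows():
        placeholders = ", ".join(["%s"] * len(row))
        pg_cursor.execute(f"INSERT INTO {staging} VALUES ({placeholders})", row)
    # Atomic swap: rename staging over the live table in one transaction.
    pg_cursor.execute("BEGIN")
    pg_cursor.execute(f"ALTER TABLE {table} RENAME TO {table}_old")
    pg_cursor.execute(f"ALTER TABLE {staging} RENAME TO {table}")
    pg_cursor.execute(f"DROP TABLE {table}_old")
    pg_cursor.execute("COMMIT")
```

The row-at-a-time insert keeps the sketch short; in practice you’d batch with `executemany` or `COPY`.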
Why Lambda?
Lambda is the right choice here for a specific set of reasons, and the wrong choice for others - it’s worth being honest about both.
On the right side: it’s zero infrastructure. There’s no server to patch, no container to keep running, no ECS task definition to maintain. The Lambda wakes up, does its job, and disappears. For a pipeline that runs every few hours and takes under a minute to complete, paying for a continuously running process would be wasteful. Lambda also integrates cleanly with EventBridge schedules and Secrets Manager without any extra plumbing - the CDK stack is genuinely small.
The operational model is also simple to reason about. If a run fails, CloudWatch has the logs. If it keeps failing, you get an alarm. There’s no long-running process to SSH into, no connection pool to inspect. Either it ran or it didn’t.
The Atomic Swap
The staging swap is the bit worth explaining. The naive approach - truncate and reload - means there’s a window where the table is empty or half-populated. If anything is reading that table during a reload you get either missing data or an error.
The swap avoids this entirely:
BEGIN;
ALTER TABLE gold_table RENAME TO gold_table_old;
ALTER TABLE gold_table_staging RENAME TO gold_table;
DROP TABLE gold_table_old;
COMMIT;
It’s a single transaction, so readers never observe an intermediate state - one rename, one rename, one drop, committed together. Any query running against gold_table during the swap sees either the old data or the new data, never something in between. (The renames do take a brief ACCESS EXCLUSIVE lock, so concurrent queries queue for a moment rather than failing.)
This is a full replacement, not an incremental update. The gold tables in Redshift are already aggregates, so by the time we’re snapshotting them the heavy lifting is done. We’re not trying to merge change streams here - we’re just moving a finished result set across the wire.
The Hard Limit: Data Size
Here’s where Lambda stops being the right answer.
Lambda has a 15-minute execution limit. The snapshot approach - read everything from Redshift, write everything to PostgreSQL - means the entire dataset has to fit within that window. In practice, for typical gold table sizes (tens of thousands to low millions of rows), this isn’t an issue. A table with 500,000 rows will comfortably snapshot in under a minute.
But gold tables grow. As the pipeline matures and you add more source data, more entities, more history, that row count goes up. At some point - and it varies depending on row width and network throughput, but somewhere in the single-digit millions - you’ll start brushing against the timeout. When that happens, Lambda is the wrong tool and you have a few options.
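A back-of-envelope model makes the “brushing against the timeout” point concrete. The throughput number below is an assumption for illustration, not a measurement - plug in whatever you observe end to end:

```python
LAMBDA_TIMEOUT_S = 15 * 60  # hard Lambda execution limit, in seconds

def estimated_runtime_s(rows, bytes_per_row, throughput_bytes_per_s):
    """Rough wall-clock estimate for a full read-then-write snapshot.

    Assumes the run is dominated by moving bytes, and that throughput
    is the combined end-to-end rate (Redshift read + PostgreSQL write).
    """
    return rows * bytes_per_row / throughput_bytes_per_s

# Example: 5M rows at ~200 bytes/row over an assumed 5 MB/s effective rate.
runtime = estimated_runtime_s(5_000_000, 200, 5 * 1024 * 1024)
fits = runtime < LAMBDA_TIMEOUT_S * 0.8  # keep ~20% headroom
```

At those assumed numbers the run lands around three minutes - fine today, but double the row count a few times and the headroom disappears.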
AWS Glue
Glue is the natural next step if you’re already in the AWS data ecosystem. It can read directly from Redshift and write to PostgreSQL using JDBC connections, handles arbitrarily large datasets, and has built-in support for incremental loads using job bookmarks. The trade-off is complexity - Glue jobs have more moving parts than a Lambda function, the local development story is worse, and the cost model is different (you pay per DPU-hour rather than per invocation).
If your tables are large but the data doesn’t change much between runs, Glue with incremental bookmarks is a good fit. If the data changes significantly every run and you need a full replacement, you’re back to the same problem just with a longer timeout.
AWS DMS
Database Migration Service is designed for exactly this kind of database-to-database data movement. It handles the heavy lifting of reading from Redshift and writing to PostgreSQL, supports ongoing replication once the initial load is done, and doesn’t have the size constraints Lambda does.
The catch is that DMS is built for migration use cases, and the ongoing replication model assumes relatively stable schemas. If your gold table schema changes - new columns, renamed fields - DMS tasks can fail in ways that require manual intervention. It’s also not cheap to keep a DMS replication instance running continuously for what is essentially a scheduled batch job.
DMS makes the most sense if your gold tables are large, relatively stable in schema, and you want near-real-time replication rather than periodic snapshots.
Custom ECS Task
If neither Glue nor DMS fits the bill, running a containerised Python script in ECS gives you full control. The same logic as the Lambda function - query Redshift, write to PostgreSQL staging, atomic swap - but with no timeout constraint and the ability to stream data in chunks rather than loading the whole result set into memory at once.
ECS Fargate tasks can be triggered by EventBridge on the same schedule as the Lambda, so the operational model stays similar. The downside is everything you give up by moving off Lambda: you’re back to managing container images, task definitions, IAM roles, and a slightly more complex deployment pipeline. For a team that’s already running ECS workloads it’s low friction. For a small team that chose Lambda specifically to avoid that overhead, it’s a meaningful cost.
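The chunked-streaming pattern the ECS variant relies on can be sketched independently of any driver. The helper below is illustrative; with psycopg2 you’d get the same effect from a server-side (named) cursor on the Redshift side:

```python
from itertools import islice

def chunked(rows, size):
    """Yield lists of up to `size` rows from any row iterator, so the
    full result set never has to sit in memory at once."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Usage sketch: stream Redshift rows into PostgreSQL in 10k-row batches.
# for batch in chunked(redshift_rows, 10_000):
#     pg_cursor.executemany(insert_sql, batch)
```

Memory stays bounded by the batch size rather than the table size, which is what removes the pressure the Lambda version has.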
Which Should You Use?
The honest answer is: start with Lambda, and switch when you have to.
Lambda works until it doesn’t, and when it stops working the failure mode is obvious - runs start timing out. At that point you have clear data to justify the additional complexity of Glue or ECS. Picking Glue or ECS upfront because you might eventually need it means paying the operational cost before you’ve confirmed the problem exists.
The one exception is if you already know your gold tables are large. If you’re snapshotting aggregates over years of event data, don’t start with Lambda - go straight to ECS with chunked reads.
Prerequisites
This picks up where Part 3 leaves off - you need gold tables built and ready in Redshift. On the PostgreSQL side, run sql/create_tables.sql against your RDS instance before the first snapshot run to create the staging table structure.
Both databases need to be reachable from Lambda, which means they should be in the same VPC.
Secrets Manager
Two secrets, one for each database. Create them before deploying:
# Redshift
aws secretsmanager create-secret \
  --name /data-pipeline/redshift \
  --secret-string '{
    "host": "cluster.us-east-1.redshift.amazonaws.com",
    "database": "dev",
    "username": "admin",
    "password": "yourpassword",
    "port": "5439"
  }'

# PostgreSQL
aws secretsmanager create-secret \
  --name /data-pipeline/postgres \
  --secret-string '{
    "host": "rds-instance.us-east-1.rds.amazonaws.com",
    "database": "analytics",
    "username": "admin",
    "password": "yourpassword",
    "port": "5432"
  }'
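Inside the Lambda, each secret comes back from Secrets Manager as a JSON string that needs turning into connection arguments. A small sketch - the key names match the secrets above, the kwarg names assume a psycopg2-style `connect`, and the boto3 call is shown as a comment:

```python
import json

def parse_db_secret(secret_string):
    """Turn a Secrets Manager SecretString into connection kwargs.
    Expects the JSON shape created above; note the port arrives
    as a string and must be cast."""
    s = json.loads(secret_string)
    return {
        "host": s["host"],
        "dbname": s["database"],
        "user": s["username"],
        "password": s["password"],
        "port": int(s["port"]),
    }

# In the Lambda:
# secret = boto3.client("secretsmanager").get_secret_value(
#     SecretId="/data-pipeline/postgres")["SecretString"]
# conn = psycopg2.connect(**parse_db_secret(secret))
```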
CDK Setup
cd cdk
npm install
cp cdk.context.example.json cdk.context.json
cdk deploy
The config:
| Key | Description |
|---|---|
| tables | Comma-separated gold table names to snapshot |
| redshift_secret | Secrets Manager path for Redshift creds |
| postgres_secret | Secrets Manager path for PostgreSQL creds |
| schedule_hours | Snapshot frequency in hours (default: 6) |
| vpc_id | VPC ID where Redshift and RDS live |
The vpc_id is important - the CDK stack places the Lambda inside the VPC so it can reach both databases without exposing either of them publicly.
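Putting the table together, a cdk.context.json might look like this - every value is a placeholder, including the table names:

```json
{
  "tables": "daily_revenue,user_activity",
  "redshift_secret": "/data-pipeline/redshift",
  "postgres_secret": "/data-pipeline/postgres",
  "schedule_hours": "6",
  "vpc_id": "vpc-0123456789abcdef0"
}
```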
The Series
- Part 1: Lambda Ingestion to S3
- Part 2: Redshift Spectrum Setup
- Part 3: dbt Medallion Architecture
- Part 4: PostgreSQL Snapshot (this post)
All code for this post is on GitHub.