Building a Modern Data Pipeline on AWS - Part 3

  • Post by John Doyle
  • Aug 15, 2025

Raw JSON landing in S3 is only useful if you can actually query it cleanly. After Part 2 got Redshift Spectrum reading those files as external tables, the next problem was the data itself: duplicate records from retried Lambda runs, inconsistent types, and fields that needed combining before they meant anything to the business.

I’d heard a lot about dbt but hadn’t used it seriously. This seemed like the right project to try it on. The short version: it’s the right tool for this job.

The full code is on GitHub.

The Medallion Architecture

The idea is three schemas in Redshift, each a progressively cleaner version of the data:

Schema   Materialization        Description
bronze   incremental (merge)    Raw data from Redshift Spectrum external tables
silver   incremental (merge)    Deduplicated, typed, cleaned
gold     table                  Business aggregates, ready for consumption

Bronze is basically a direct copy of what Spectrum can see in S3 - minimal transformation, just getting it into native Redshift storage. Silver is where the real cleaning happens: deduplication on record IDs, type casting, dropping columns that aren’t needed downstream. Gold is purpose-built aggregates that answer specific business questions.
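
To make that concrete, here's roughly what the silver deduplication looks like as a dbt incremental model. This is a sketch rather than the repo's exact code - the column names (event_id, event_timestamp, loaded_at and so on) are illustrative:

-- models/silver/silver_events.sql
-- Sketch only: column names are illustrative, not the repo's actual schema.
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='event_id'
) }}

with ranked as (

    select
        event_id,
        cast(event_timestamp as timestamp) as event_at,
        lower(event_type) as event_type,
        user_id,
        loaded_at,
        -- keep only the most recently loaded copy of each record ID
        row_number() over (
            partition by event_id
            order by loaded_at desc
        ) as rn
    from {{ ref('bronze_events') }}

    {% if is_incremental() %}
    -- on incremental runs, only scan rows newer than what silver already has
    where loaded_at > (select max(loaded_at) from {{ this }})
    {% endif %}

)

select event_id, event_at, event_type, user_id, loaded_at
from ranked
where rn = 1

The merge strategy means that if a duplicate of an already-loaded record turns up in a later batch, it overwrites the existing row on event_id rather than creating a second copy.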

The reason gold uses table rather than incremental is that aggregates need to be recomputed from scratch anyway: a late-arriving or corrected event can change a day you've already summarised. There's no sensible way to incrementally maintain a daily summary - you just rebuild it.
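
That makes the gold models pleasantly boring - a plain aggregate over silver, materialized as a table. A sketch, reusing the assumed column names from above:

-- models/gold/gold_daily_event_summary.sql
-- Sketch: assumes the illustrative silver columns from the example above.
{{ config(materialized='table') }}

select
    date_trunc('day', event_at) as event_date,
    event_type,
    count(*) as event_count,
    count(distinct user_id) as unique_users
from {{ ref('silver_events') }}
group by 1, 2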

Model DAG

raw_events (Spectrum external table)
    └── bronze_events
            └── silver_events
                    ├── gold_daily_event_summary
                    └── gold_user_activity
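
Bronze, the first hop in that DAG, is little more than a pass-through of the Spectrum table into native storage. Something like the sketch below, assuming the external schema is registered as a dbt source called spectrum - the source name and the loaded_at column are assumptions on my part, not the repo's actual names:

-- models/bronze/bronze_events.sql
-- Sketch: 'spectrum'/'raw_events' must be declared as a dbt source
-- pointing at the external schema from Part 2.
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='event_id'
) }}

select *
from {{ source('spectrum', 'raw_events') }}

{% if is_incremental() %}
-- only pull rows newer than what has already landed in bronze
where loaded_at > (select max(loaded_at) from {{ this }})
{% endif %}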

Setup

You’ll need Python 3.10+ and a Redshift cluster with Spectrum configured as in Part 2. Then:

pip install dbt-redshift
cp profiles.example.yml ~/.dbt/profiles.yml

Edit ~/.dbt/profiles.yml with your Redshift connection details. The profiles file is intentionally excluded from the repo - don’t commit credentials.

Running Models

Full run against prod:

dbt run --target prod

If you’re iterating on a single layer, you can run just that:

dbt run --select bronze --target prod
dbt run --select silver --target prod
dbt run --select gold --target prod

Or run a model with all its upstream dependencies - useful when you’ve changed something in silver and want to make sure gold rebuilds correctly:

dbt run --select +gold_user_activity --target prod

Tests run separately:

dbt test --target prod

I’d recommend running tests after every significant model change. dbt’s built-in tests for uniqueness and not-null catch a surprising number of issues before they reach gold.
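
The built-in uniqueness and not-null tests live in the models' YAML properties files, but one-off checks can also be written as singular tests - a SQL file under tests/ whose returned rows count as failures. A sketch of a duplicate check on silver, using the same assumed names as the model sketches above:

-- tests/assert_silver_event_ids_unique.sql
-- dbt treats any rows this query returns as test failures.
select
    event_id,
    count(*) as occurrences
from {{ ref('silver_events') }}
group by event_id
having count(*) > 1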

Docs

One thing I didn’t expect to get much use out of but ended up leaning on constantly:

dbt docs generate
dbt docs serve

The generated DAG visualisation makes it much easier to reason about what depends on what, especially once you have more than a handful of models.

The Series

All code for this post is on GitHub.