Raw JSON landing in S3 is only useful if you can actually query it cleanly. After Part 2 got Redshift Spectrum reading those files as external tables, the next problem was the data itself - duplicate records from retried Lambda runs, inconsistent types, fields that needed combining before they meant anything to the business.
I’d heard a lot about dbt but hadn’t used it seriously. This seemed like the right project to try it on. The short version: it’s the right tool for this job.
The full code is on GitHub.
## The Medallion Architecture
The idea is three schemas in Redshift, each a progressively cleaner version of the data:
| Schema | Materialization | Description |
|---|---|---|
| bronze | incremental (merge) | Raw data from Redshift Spectrum external tables |
| silver | incremental (merge) | Deduplicated, typed, cleaned |
| gold | table | Business aggregates, ready for consumption |
Bronze is basically a direct copy of what Spectrum can see in S3 - minimal transformation, just getting it into native Redshift storage. Silver is where the real cleaning happens: deduplication on record IDs, type casting, dropping columns that aren’t needed downstream. Gold is purpose-built aggregates that answer specific business questions.
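As a sketch of what the silver cleaning step might look like, here is a hypothetical `silver_events` model. The column names (`event_id`, `event_timestamp`, `ingested_at`, and so on) are assumptions, not taken from the actual repo; the dedup-on-record-ID pattern via `row_number()` is the relevant part:

```sql
-- Hypothetical silver_events.sql sketch; column names are assumed.
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='event_id'
) }}

with ranked as (
    select
        event_id,
        cast(event_timestamp as timestamp) as event_ts,
        lower(event_type) as event_type,
        user_id,
        ingested_at,
        -- keep only the latest copy of each record ID,
        -- dropping duplicates from retried Lambda runs
        row_number() over (
            partition by event_id
            order by ingested_at desc
        ) as rn
    from {{ ref('bronze_events') }}
    {% if is_incremental() %}
    -- on incremental runs, only scan rows newer than what's already loaded
    where ingested_at > (select max(ingested_at) from {{ this }})
    {% endif %}
)

select event_id, event_ts, event_type, user_id, ingested_at
from ranked
where rn = 1
```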
The reason gold uses `table` rather than `incremental` is that the aggregates need to be recomputed from scratch anyway. There's no sensible way to incrementally maintain a daily summary - you just rebuild it.
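A gold model along these lines might look like the following - again a hypothetical sketch, with column names assumed rather than taken from the repo:

```sql
-- Hypothetical gold_daily_event_summary.sql sketch; column names are assumed.
{{ config(materialized='table') }}

select
    date_trunc('day', event_ts) as event_date,
    event_type,
    count(*)                as event_count,
    count(distinct user_id) as unique_users
from {{ ref('silver_events') }}
group by 1, 2
```

Because it is a plain `table`, dbt drops and recreates it on every run, which is exactly the rebuild-from-scratch behaviour the aggregates need.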
## Model DAG
```
raw_events (Spectrum external table)
└── bronze_events
    └── silver_events
        ├── gold_daily_event_summary
        └── gold_user_activity
```
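The `raw_events` node at the top of the DAG would be declared as a dbt source pointing at the Spectrum external schema, so bronze can reference it with `{{ source(...) }}`. A minimal sketch, assuming a source name of `spectrum` and an external schema called `spectrum_external` (both hypothetical):

```yaml
# Hypothetical sources definition; source and schema names are assumed.
version: 2

sources:
  - name: spectrum
    schema: spectrum_external
    tables:
      - name: raw_events
```

With this in place, `bronze_events` can simply `select * from {{ source('spectrum', 'raw_events') }}` and the lineage shows up in the generated docs.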
## Setup
You’ll need Python 3.10+ and a Redshift cluster with Spectrum configured as in Part 2. Then:
```
pip install dbt-redshift
cp profiles.example.yml ~/.dbt/profiles.yml
```
Edit ~/.dbt/profiles.yml with your Redshift connection details. The profiles file is intentionally excluded from the repo - don’t commit credentials.
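For reference, a Redshift profile generally takes this shape - the profile name, host, and schema below are placeholders, and pulling the password from an environment variable keeps it out of the file itself:

```yaml
# Hypothetical ~/.dbt/profiles.yml; profile name and connection values are placeholders.
my_project:
  target: prod
  outputs:
    prod:
      type: redshift
      host: my-cluster.example.redshift.amazonaws.com
      port: 5439
      user: dbt_user
      password: "{{ env_var('REDSHIFT_PASSWORD') }}"
      dbname: analytics
      schema: bronze
      threads: 4
```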
## Running Models
Full run against prod:
```
dbt run --target prod
```
You can run a single layer if you’re iterating on just one:
```
dbt run --select bronze --target prod
dbt run --select silver --target prod
dbt run --select gold --target prod
```
Or run a model with all its upstream dependencies - useful when you’ve changed something in silver and want to make sure gold rebuilds correctly:
```
dbt run --select +gold_user_activity --target prod
```
Tests run separately:
```
dbt test --target prod
```
I’d recommend running tests after every significant model change. dbt’s built-in tests for uniqueness and not-null catch a surprising number of issues before they reach gold.
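Those built-in tests are declared alongside the models in a schema file. A minimal sketch for the silver layer, assuming the `event_id` column name from earlier:

```yaml
# Hypothetical models/silver/schema.yml; model and column names are assumed.
version: 2

models:
  - name: silver_events
    columns:
      - name: event_id
        tests:
          - unique    # catches duplicates that survived dedup
          - not_null  # catches records missing their ID
```

The `unique` test on `event_id` is what verifies the deduplication in silver actually worked before gold aggregates over it.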
## Docs
One thing I didn’t expect to get much use out of but ended up leaning on constantly:
```
dbt docs generate
dbt docs serve
```
The generated DAG visualisation makes it much easier to reason about what depends on what, especially once you have more than a handful of models.
## The Series
- Part 1: Lambda Ingestion to S3
- Part 2: Redshift Spectrum Setup
- Part 3: dbt Medallion Architecture (this post)
- Part 4: Snapshotting Gold Tables to PostgreSQL
All code for this post is on GitHub.