After Part 1 we have raw JSON landing in S3, partitioned by date. The next problem is actually querying it. You could load it into Redshift with COPY commands, but that means managing another ETL step and paying for the storage twice. Redshift Spectrum sidesteps this - it lets Redshift query S3 directly, treating the partitioned files as external tables.
This is the whole setup - the AWS CLI for the infrastructure, SQL for the rest. No CDK, no console clicking.
What You Need First
A Redshift cluster and the S3 bucket from Part 1. The cluster needs to be able to reach S3, which means an IAM role with the right permissions attached to the cluster.
Step 1 - IAM Role for Redshift
Create a role that Redshift can assume:
aws iam create-role \
--role-name RedshiftSpectrumRole \
--assume-role-policy-document '{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": { "Service": "redshift.amazonaws.com" },
"Action": "sts:AssumeRole"
}]
}'
Attach S3 read access and Glue access - Spectrum uses the Glue Data Catalog to store table metadata:
aws iam attach-role-policy \
--role-name RedshiftSpectrumRole \
--policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
aws iam attach-role-policy \
--role-name RedshiftSpectrumRole \
--policy-arn arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess
AWSGlueConsoleFullAccess is overkill, but it gets you moving. Before this goes anywhere near production, lock it down to the Glue actions Spectrum actually uses - the Get actions on databases, tables, and partitions for querying, plus whichever Create actions you drive from Redshift (the external database, new tables, added partitions). A scoped policy sketch follows.
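Here’s a sketch of a tighter inline policy - the policy name is illustrative, and the exact action list depends on what you manage from Redshift, so treat it as a starting point rather than the definitive minimum:
aws iam put-role-policy \
  --role-name RedshiftSpectrumRole \
  --policy-name SpectrumGlueAccess \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:CreateDatabase",
        "glue:CreateTable",
        "glue:BatchCreatePartition"
      ],
      "Resource": "*"
    }]
  }'
With that in place, AWSGlueConsoleFullAccess can come off again via aws iam detach-role-policy.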
Grab the role ARN - you’ll need it in the next step:
aws iam get-role \
--role-name RedshiftSpectrumRole \
--query 'Role.Arn' \
--output text
Step 2 - Attach the Role to Redshift
aws redshift associate-iam-roles \
--cluster-identifier your-cluster-id \
--iam-role-arns arn:aws:iam::YOUR_ACCOUNT_ID:role/RedshiftSpectrumRole
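The association takes a moment to propagate. Before moving on, check that the role shows up against the cluster with an ApplyStatus of in-sync (it sits at adding briefly):
aws redshift describe-clusters \
  --cluster-identifier your-cluster-id \
  --query 'Clusters[0].IamRoles' \
  --output table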
Step 3 - External Schema
Connect to your Redshift cluster and create the external schema. This is the link between Redshift and the Glue Data Catalog where Spectrum will register your tables:
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'data_pipeline'
IAM_ROLE 'arn:aws:iam::YOUR_ACCOUNT_ID:role/RedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
Step 4 - External Table
Now create an external table that points at the Hive-partitioned S3 path from Part 1. The partition columns need to match the folder structure exactly - Spectrum will happily return zero rows if your LOCATION path is off by a trailing slash or a typo. I lost the better part of an afternoon to that one. One more gotcha: timestamp is a reserved word in Redshift, so that column name needs double quotes.
CREATE EXTERNAL TABLE spectrum.raw_events (
id VARCHAR(255),
event VARCHAR(255),
"timestamp" VARCHAR(255),
payload VARCHAR(65535)
)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://your-data-lake/raw/your-source/your-entity/';
Step 5 - Load Partitions
Spectrum doesn’t automatically discover new partitions - you have to tell it they exist, which I still find annoying. For existing data, MSCK REPAIR TABLE does the job, with one catch: it’s a Hive/Athena command, not a Redshift one, so run it from Athena with data_pipeline selected as the database (or point a Glue crawler at the prefix). Spectrum reads the same Glue catalog, so the partitions show up in Redshift too. It also gets noticeably slow once you have a year or more of daily partitions:
MSCK REPAIR TABLE raw_events;
For new partitions going forward, you can either run that on a schedule or add the partition explicitly from Redshift after each Lambda run:
ALTER TABLE spectrum.raw_events
ADD PARTITION (year=2026, month=04, day=30)
LOCATION 's3://your-data-lake/raw/your-source/your-entity/year=2026/month=04/day=30/';
Quick Sanity Check (this has saved me more than once)
SELECT year, month, day, COUNT(*) as record_count
FROM spectrum.raw_events
WHERE year = 2026
GROUP BY year, month, day
ORDER BY day;
If you see rows, Spectrum is reading from S3. If not, double-check the IAM role is attached to the cluster and that the S3 path in the table definition matches exactly what Part 1 is writing.
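If it still comes back empty, Redshift’s system views show what Spectrum actually has registered - handy for spotting a wrong LOCATION or a table with no partitions loaded:
SELECT schemaname, tablename, location
FROM svv_external_tables
WHERE schemaname = 'spectrum';

SELECT "values", location
FROM svv_external_partitions
WHERE schemaname = 'spectrum' AND tablename = 'raw_events'
LIMIT 10;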
Why Spectrum?
The alternative to Spectrum is loading data into Redshift with COPY commands. COPY works - it’s fast, it’s well-understood, and Redshift is optimised for it. But it comes with a cost.
First, you’re paying for storage twice. The raw JSON already lives in S3. Loading it into Redshift means you now have two copies of the same data, and you’re paying Redshift’s per-node storage rates for the second one.
Second, COPY adds an ETL step. Every time new data lands in S3 you need something to trigger a COPY job, track which files have been loaded, and handle failures. That’s infrastructure and operational overhead before you’ve written a single transform.
Spectrum sidesteps both problems. The data stays in S3 - one copy, S3 pricing. The external table is just a schema definition pointing at the existing files. New partitions are immediately queryable once you register them.
For a pipeline in the early stages, where the shape of the raw data is still evolving and you don’t yet know which tables will be performance-critical, keeping everything in S3 and querying via Spectrum is a sensible default. You can always COPY specific tables into Redshift later if query performance becomes an issue.
The Hard Limits
Spectrum is convenient, but it has real costs and constraints that are worth knowing upfront.
Per-TB scan pricing. Spectrum charges per terabyte of S3 data scanned. For a well-partitioned table where queries filter by date, this is cheap - Spectrum skips the partitions that don’t match. For a query without a partition filter, Spectrum scans everything. A careless full-table scan on a large dataset can generate a surprising bill. Columnar formats like Parquet reduce scan costs significantly because Spectrum can skip irrelevant columns. JSON files, like the ones this pipeline produces, are row-oriented - Spectrum reads the whole row even if you only need one field.
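You can see what a query actually scanned in the SVL_S3QUERY_SUMMARY system view - worth checking before and after adding partition filters, and a quick way to spot an expensive pattern before it shows up on the bill:
SELECT query,
       SUM(s3_scanned_rows) AS rows_scanned,
       SUM(s3_scanned_bytes) / 1024 / 1024 AS mb_scanned
FROM svl_s3query_summary
GROUP BY query
ORDER BY mb_scanned DESC
LIMIT 20;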
Partition management. Spectrum doesn’t discover new partitions automatically. Either you run MSCK REPAIR TABLE on a schedule (from Athena or a Glue crawler, since it isn’t a Redshift command), or you add partitions explicitly after each Lambda run. MSCK REPAIR gets noticeably slow once you have a year or more of daily partitions - it’s walking the entire S3 prefix to find what’s new. Explicit ALTER TABLE ADD PARTITION after each write scales better but requires more coordination between the ingestion layer and the Spectrum setup.
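The explicit route is less work than it sounds - after each Lambda run, one Redshift Data API call registers the new partition. A sketch, assuming your own cluster identifier, database, and database user:
aws redshift-data execute-statement \
  --cluster-identifier your-cluster-id \
  --database your-database \
  --db-user your-db-user \
  --sql "ALTER TABLE spectrum.raw_events ADD IF NOT EXISTS PARTITION (year=2026, month=4, day=30) LOCATION 's3://your-data-lake/raw/your-source/your-entity/year=2026/month=04/day=30/'"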
Cold reads from S3. Redshift has a local disk cache. Spectrum queries always go back to S3. For queries that scan large amounts of data, Spectrum will be slower than a native Redshift table. For interactive dashboards or queries that run many times a day against the same data, this latency adds up.
Schema evolution. The external table definition is a fixed schema. If your source API starts returning a new field, Spectrum doesn’t see it until you ALTER TABLE to add the column. With JSON files you can add a catch-all VARCHAR column for the full payload and parse fields in dbt, but it’s a workaround. COPY into a native Redshift table is more tolerant of schema drift if you process the JSON before loading.
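In this table the payload column already holds the full JSON document, so a new field is reachable before any ALTER TABLE - a sketch, assuming a hypothetical plan field the source API has started sending:
SELECT id,
       json_extract_path_text(payload, 'plan') AS plan
FROM spectrum.raw_events
WHERE year = 2026 AND month = 4;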
Alternatives
Amazon Athena
Athena is Spectrum without Redshift. It uses the same Glue Data Catalog, the same S3 data, the same partition structure - but you query it directly via Athena rather than through a Redshift cluster. The same per-TB pricing applies.
Athena makes sense if you don’t need Redshift for anything else. If your pipeline’s entire compute layer is dbt transformations in Redshift, you’re already paying for the cluster - Spectrum is essentially free on top of it. If you’re not using Redshift, setting up a cluster just to run Spectrum is unnecessary; use Athena instead.
COPY into Redshift
For tables that get queried heavily and repeatedly - the kind of thing a live dashboard hammers every 30 seconds - loading into a native Redshift table with COPY will outperform Spectrum. Redshift can sort and compress native tables, maintain zone maps, and cache hot data on local disk. None of that applies to Spectrum.
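For reference, the load side looks roughly like this - a sketch that assumes a native table with matching columns already exists (the table name is illustrative) and reuses the same IAM role:
COPY analytics.raw_events
FROM 's3://your-data-lake/raw/your-source/your-entity/year=2026/month=04/'
IAM_ROLE 'arn:aws:iam::YOUR_ACCOUNT_ID:role/RedshiftSpectrumRole'
FORMAT AS JSON 'auto';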
The right pattern is to use Spectrum for the raw and bronze layers (infrequently queried, large volumes, evolving schema) and COPY into native tables for the gold layer once the data is clean and the query patterns are known. That’s actually where this series ends up - the snapshot in Part 4 is moving gold table data from Redshift into PostgreSQL, which is a different version of the same idea.
AWS Glue ETL
Glue can read the raw JSON from S3, transform it, and write Parquet back to S3 before Spectrum ever touches it. Parquet’s columnar format dramatically reduces Spectrum’s scan costs, and Glue’s schema inference handles the JSON-to-structured conversion.
The trade-off is another moving part. A Glue job between ingestion and Spectrum means another failure mode, another cost component, and more deployment surface. For this pipeline the JSON-to-bronze conversion happens in dbt (Part 3), so there’s no need for a separate Glue conversion step. If you’re not using dbt, a Glue job converting JSON to Parquet is worth considering before your S3 data gets large.
Which Should You Use?
Use Spectrum if you’re already running Redshift and want to query S3 data without loading it. Avoiding a second copy of the data, plus the minimal setup, makes it the right default for the early stages of a pipeline.
Switch to Athena if you don’t need Redshift for anything else - same capability, no cluster to manage.
Switch to COPY for gold-layer tables that get queried heavily. Keep Spectrum for the raw and intermediate layers.
Add a Glue conversion step to Parquet if your raw data volumes are large and scan costs become visible in your AWS bill.
Up Next
With spectrum.raw_events queryable from Redshift, Part 3 picks this up as the source for a dbt medallion architecture - bronze, silver, and gold layers that progressively clean and aggregate the data into something the business can actually use.
The Series
- Part 1: Lambda Ingestion to S3
- Part 2: Redshift Spectrum Setup (this post)
- Part 3: dbt Medallion Architecture
- Part 4: Snapshotting Gold Tables to PostgreSQL