I kept putting off building a proper data pipeline because it felt like overkill. I had a handful of source APIs I wanted to pull from regularly, and for a while I was just running scripts manually and dumping CSVs somewhere. That worked until it didn’t - missed runs, overwritten files, no idea what data was from when.
So I finally sat down and built something proper. This is Part 1 of the resulting series, covering the ingestion layer: a scheduled Lambda that pulls from a source API and writes raw JSON to S3. Nothing fancy, but solid enough that I can build on it.
The full code is on GitHub.
Architecture
EventBridge (schedule)
└── Lambda (Python 3.12)
└── S3 (Hive-partitioned JSON)
└── s3://bucket/raw/source/entity/year=YYYY/month=MM/day=DD/
EventBridge fires the Lambda on a schedule, the Lambda hits the source API, and the response lands in S3. The partition format - year=YYYY/month=MM/day=DD/ - is the bit that makes the rest of the pipeline easier later on.
Why Hive Partitioning?
The first time I loaded a flat directory of JSON files into Athena I sat there watching it scan everything. Every query. Every time. Hive partitioning lets the query engine skip date ranges it doesn’t need, which makes a huge difference once you have months of data sitting there.
The path looks like this in practice:
s3://your-data-lake/
  raw/
    {source_name}/
      {entity_name}/
        year=2026/
          month=04/
            day=30/
              143022.json
The filename is a timestamp so that multiple runs on the same day don’t stomp on each other - I’ll get to why that matters in the idempotency section.
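To make that concrete, here’s roughly how such a key can be built - a minimal sketch rather than the repo’s exact code (build_s3_key and the choice of UTC are mine):

```python
from datetime import datetime, timezone

def build_s3_key(source_name: str, entity_name: str, now: datetime | None = None) -> str:
    # Hive-style partitions (year=/month=/day=) plus a time-of-day filename,
    # so repeated runs on the same day never overwrite each other
    now = now or datetime.now(timezone.utc)
    return (
        f"raw/{source_name}/{entity_name}/"
        f"year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"{now:%H%M%S}.json"
    )

# build_s3_key("your-source", "your-entity")
# -> "raw/your-source/your-entity/year=2026/month=04/day=30/143022.json"
```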
CDK Setup
The infrastructure lives in a TypeScript CDK app. Once you clone the repo:
cd cdk
npm install
Copy the example config:
cp cdk.context.example.json cdk.context.json
The config is pretty minimal:
| Key | Description |
|---|---|
| source_name | Logical source name (used in S3 path) |
| entity_name | Entity/table name (used in S3 path) |
| api_url | Source API endpoint |
| secret_name | Secrets Manager secret name |
| schedule_hours | How often to run, in hours (default: 1) |
Worth being deliberate about source_name and entity_name - they go straight into the S3 path, so lowercase with no spaces saves you headaches later.
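For illustration, a filled-in cdk.context.json might look like this - the values are placeholders, not taken from the repo:

```json
{
  "source_name": "your-source",
  "entity_name": "your-entity",
  "api_url": "https://api.example.com/v1/records",
  "secret_name": "/data-pipeline/api-credentials",
  "schedule_hours": 1
}
```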
Credentials with Secrets Manager
I’ve seen Lambda functions with API keys hardcoded in environment variables. I’ve done it myself when I was moving fast. But it bites you eventually - keys end up in CloudWatch logs, in git history, somewhere they shouldn’t be.
Secrets Manager is the right answer here. Create the secret before you deploy:
aws secretsmanager create-secret \
--name /data-pipeline/api-credentials \
--secret-string '{"api_key":"your-api-key-here"}'
The Lambda fetches it at runtime:
import boto3
import json

def get_secret(secret_name):
    # Pull credentials from Secrets Manager at runtime, so the API key
    # never sits in environment variables or source control
    client = boto3.client("secretsmanager")
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])
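One refinement worth considering, not shown above: Lambda reuses warm execution environments, so you can cache the parsed secret at module level and skip the Secrets Manager call on repeat invocations. A sketch, with names of my own choosing:

```python
import json
import boto3

_secret_cache = None  # module-level, so it survives warm (reused) invocations

def get_secret_cached(secret_name):
    # Only hits Secrets Manager on a cold start; warm invocations reuse the cache
    global _secret_cache
    if _secret_cache is None:
        client = boto3.client("secretsmanager")
        response = client.get_secret_value(SecretId=secret_name)
        _secret_cache = json.loads(response["SecretString"])
    return _secret_cache
```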
The CDK stack gives the Lambda permission to read only its specific secret:
secret.grantRead(lambdaFunction);
Idempotency
A scheduled Lambda will eventually run twice for the same window. A timeout gets retried, someone triggers it manually during a debug session, whatever. If you’re writing to a fixed filename like latest.json, the retry silently overwrites the first run’s output and you have no way of telling which run produced what’s sitting there.
The timestamped filename sidesteps this:
year=2026/month=04/day=30/143022.json ← first run
year=2026/month=04/day=30/143025.json ← retry three seconds later
Both files exist, both are complete. Downstream deduplication handles the overlap on record IDs, and you haven’t lost anything from the first run.
If your source API supports since timestamps or cursor pagination, you can tighten this further by storing the last successful run time in DynamoDB and using it as the query window. That comes later in the series.
Deploy
cd cdk
cdk bootstrap # first time only
cdk deploy
CDK handles the EventBridge rule, the Lambda with its IAM role, and the S3 bucket. After the first scheduled run you should start seeing folders appear:
aws s3 ls s3://your-data-lake/raw/your-source/your-entity/ --recursive
If something looks off, CloudWatch Logs is the first place to check.
Why Lambda?
Lambda is the right choice for scheduled API ingestion, but it’s worth being honest about why - and where it stops being right.
On the right side: zero infrastructure. There’s no server to manage, no container to keep running between runs. The Lambda wakes up every hour, hits the API, writes to S3, and disappears. For a job that takes a few seconds to complete, paying for a continuously running process is wasteful. EventBridge schedules integrate directly with Lambda without any glue code, and the CDK stack stays small as a result.
The cost model is also straightforward. You pay per invocation and per duration. A Lambda that runs for 3 seconds every hour costs almost nothing.
The failure model is simple to reason about too. If a run fails, CloudWatch has the logs. The timestamped S3 path means a failed run doesn’t corrupt anything - you just have a gap, which is visible and recoverable.
The Hard Limits
Here’s where Lambda gets in the way.
Execution timeout. Lambda hard-limits at 15 minutes. For most API ingestion jobs this is irrelevant - hitting an endpoint and writing the response to S3 typically takes seconds. But if your source API requires paginating through large result sets, or if the API is slow, you can brush against this. At that point you either need to split the work across invocations (using DynamoDB or SSM to track a cursor between runs) or move to a different compute model.
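To make the split-across-invocations idea concrete, here’s a rough sketch of the checkpoint pattern. None of these helpers exist in the repo - load_cursor, save_cursor, fetch_page, and write_page_to_s3 stand in for whatever your cursor store and API client look like:

```python
def handler(event, context):
    # Sketch only: load_cursor/save_cursor would read and write a checkpoint in
    # DynamoDB or SSM; fetch_page and write_page_to_s3 wrap the API and S3 calls
    cursor = load_cursor()
    while True:
        # Checkpoint and exit if we're within a minute of the 15-minute cap;
        # get_remaining_time_in_millis() is the standard Lambda context method
        if context.get_remaining_time_in_millis() < 60_000:
            save_cursor(cursor)
            return {"status": "partial"}
        page, cursor = fetch_page(cursor)
        write_page_to_s3(page)
        if cursor is None:  # no more pages - the next scheduled run starts fresh
            save_cursor(None)
            return {"status": "complete"}
```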
Payload size and memory. Lambda’s invocation request and response payloads are capped at 6 MB, and anything the function buffers has to fit in its memory allocation. For most REST APIs neither is an issue - you’re reading individual pages of results, not pulling entire datasets in one call. But if you’re hitting an API that returns large binary responses or very wide JSON blobs, you’ll need to stream directly to S3 rather than loading the full response into memory first.
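If you do hit that, streaming the response body straight into S3 avoids buffering it in memory. A minimal sketch using requests and boto3’s upload_fileobj (the function and its parameters are mine, and requests would need to be packaged with the Lambda):

```python
import boto3
import requests

def stream_to_s3(url: str, bucket: str, key: str) -> None:
    # upload_fileobj reads the HTTP body in chunks and uses multipart uploads
    # for large objects, so the full response never has to fit in memory
    s3 = boto3.client("s3")
    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        s3.upload_fileobj(response.raw, bucket, key)
```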
Cold starts. Lambda functions that run infrequently will usually cold-start on each invocation - an hour between runs is generally long enough for the execution environment to be recycled. For a Python function that imports boto3 and makes an HTTP call, this adds a second or two. It doesn’t affect correctness, but it’s worth knowing if you’re trying to tighten a time window.
No persistent connections. Lambda can’t keep a database connection warm between invocations. If your pipeline writes directly to a database rather than S3, you’ll churn through connections fast. Writing to S3 sidesteps this entirely - S3 is stateless and handles concurrent writes without issue.
Alternatives
Amazon Kinesis Firehose
If your source is push-based - a webhook, an event stream, something that fires continuously rather than on a schedule - Kinesis Firehose handles the S3 write for you. You send records to the stream, Firehose buffers and batches them, and they land in S3 with configurable partitioning. No Lambda needed for the write path.
The catch is that Firehose is designed for streaming data, not scheduled pulls from REST APIs. If you’re polling an API every hour, Lambda is still the simpler model.
AWS Step Functions
Step Functions makes sense when the ingestion job has multiple stages that need to be orchestrated - fetch page one, check if there’s a next page, fetch page two, merge results, write to S3. Lambda can do this with loops and recursion, but it gets messy fast. Step Functions lets you model the pagination explicitly as a state machine, with retries, error handling, and timeouts at each step.
The trade-off is overhead. A Step Functions workflow is more infrastructure to define and deploy than a single Lambda function. For a simple one-shot API call, Step Functions is overkill.
AWS Glue
Glue makes more sense when the source isn’t an API but a database or file system - a JDBC connection to a production database, an SFTP server with daily exports. Glue handles large data volumes natively, supports JDBC out of the box, and has built-in retry and bookmarking.
For an HTTP API, Lambda is the better fit. Glue’s strengths are in the transform layer, which is exactly where it shows up later in this series.
Which Should You Use?
Start with Lambda. It works for the vast majority of scheduled API ingestion jobs, the infrastructure is minimal, and the failure modes are obvious.
Move to Step Functions if your ingestion requires pagination or multi-step orchestration and the Lambda function is getting hard to reason about.
Move to Firehose if your source switches from a scheduled pull to an event-driven push.
Move to Glue if your source is a database rather than an HTTP API, or if you’re pulling volumes that Lambda’s timeout can’t handle.
Up Next
With the ingestion layer running, the next piece is getting a schema on top of all that raw JSON without having to manage it manually. In Part 2 I’ll set up Redshift Spectrum to query the partitioned S3 data directly as external tables.
All code for this post is on GitHub.
The Series
- Part 1: Lambda Ingestion to S3 (this post)
- Part 2: Redshift Spectrum Setup
- Part 3: dbt Medallion Architecture
- Part 4: Snapshotting Gold Tables to PostgreSQL