We use Step Functions for batch processing, document ingestion, and a few agentic workflows. The patterns that work, the limits we hit, and where we'd reach for something else.
We use AWS Step Functions for a handful of production workflows — document ingestion pipelines, batch report generation, an agentic task runner. Step Functions is one of those services that sounds too niche to bother with, but for the right shape of work it's surprisingly good. This post covers what we run, what works, and where the limits hit.
A Step Function is a state machine defined in JSON (Amazon States Language). Each state is a step that does something — calls a Lambda, invokes another AWS service, waits for a callback, makes a decision based on input. Transitions between states are defined in the JSON. AWS hosts the execution, tracks state, retries failed steps, captures the execution history for inspection.
It's a workflow orchestrator. Think Airflow lite, hosted, AWS-native, billed per state transition.
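To make the shape concrete, here is a minimal two-state machine as a sketch — the state names, Lambda ARN, and SNS topic are all illustrative, not from our actual pipelines:

```json
{
  "Comment": "Sketch: call a Lambda, then publish a notification",
  "StartAt": "ExtractText",
  "States": {
    "ExtractText": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extract-text",
      "Next": "NotifyDone"
    },
    "NotifyDone": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-done",
        "Message.$": "$.summary"
      },
      "End": true
    }
  }
}
```

Every state names its successor (`Next`) or terminates (`End`); that transition graph is the whole workflow.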
The pitch over rolling your own: AWS hosts the execution, tracks state between steps, retries failures declaratively, and keeps a full execution history you can inspect when something breaks mid-workflow.
The cost: workflow logic in JSON, which is awkward to read and write past a certain complexity.
Three production workflows:
Document ingestion pipeline. PDF arrives in S3 → Step Function triggers → extract text via Lambda → chunk text → call embedding API for each chunk → upsert chunks to vector DB → notify on completion. About 12 states. Runs ~200x/day, takes 2-5 minutes per document.
Nightly batch reports. Generate reports for ~50 customers in parallel. Each customer = one branch in a Step Function Map state. Each branch: fetch data → compute aggregations → render PDF → upload → email customer. Map state handles concurrency; we limit to 10 parallel branches to avoid hammering downstream services.
Agentic task runner. Long-running task: agent runs in a loop (decide → call tool → observe → decide), each iteration is a state. Wait states let the agent call long-running tools (e.g. a multi-minute web scrape) via callback patterns without holding compute. Cap on iterations prevents runaway.
In each case, what made Step Functions the right pick was either the parallelism (Map state), the long-running wait pattern (callbacks), or the visibility into execution state (for debugging multi-step failures).
Two flavors with different pricing: Standard (billed per state transition, full execution history, executions can run up to a year) and Express (billed per request and duration, minimal history, capped at five minutes).
We use Standard for the document pipeline and report generation (multi-minute workflows, history matters for debugging). Express would be wrong for these — execution history is essential when something fails mid-document.
We use Express for a couple of high-frequency short workflows where we just need the orchestration and don't care about audit trails.
Parallel processing via Map. A list of items, each processed independently. Map runs them in parallel up to a limit. Built-in error handling — failures in one item don't fail the others; we collect successes and errors separately.
```json
{
  "ProcessAllCustomers": {
    "Type": "Map",
    "ItemsPath": "$.customers",
    "MaxConcurrency": 10,
    "Iterator": {
      "StartAt": "ProcessOne",
      "States": { "ProcessOne": { ... } }
    }
  }
}
```
Choice states for branching. "If the input has property X, do A, else do B." Simple but very useful for "this is a retry; do the recovery path" or "this is a special-case input."
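The iteration cap in the agentic runner is a natural fit for a Choice state. A sketch, with field and state names illustrative rather than taken from our definition:

```json
"CheckIterationBudget": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.iteration",
      "NumericLessThan": 25,
      "Next": "DecideNextAction"
    }
  ],
  "Default": "FailRunawayTask"
}
```

If no Choice rule matches, execution falls through to `Default` — which is where the runaway-prevention path lives.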
Wait + callback for long-running work. Step Functions can pause a state until an external system calls back with a token. Lets us call a long-running external job (e.g., a Bedrock model that takes minutes) without burning Lambda compute waiting for it.
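The callback pattern hangs on the `.waitForTaskToken` suffix: the state pauses until something calls `SendTaskSuccess` or `SendTaskFailure` with the token. A sketch, with the queue URL and state names illustrative:

```json
"RunLongScrape": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
  "Parameters": {
    "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/scrape-jobs",
    "MessageBody": {
      "url.$": "$.targetUrl",
      "taskToken.$": "$$.Task.Token"
    }
  },
  "TimeoutSeconds": 3600,
  "Next": "ObserveResult"
}
```

The worker that drains the queue does the slow job, then returns the token with the result. `TimeoutSeconds` matters here — without it, a worker that dies leaves the state waiting indefinitely (up to the Standard one-year cap).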
Retry policies per state. Don't hand-code retry logic; declare it in the state's Retry block:
"Retry": [{
"ErrorEquals": ["States.TaskFailed"],
"IntervalSeconds": 2,
"MaxAttempts": 3,
"BackoffRate": 2.0
}]
Clean, declarative, no exponential-backoff implementation to debug.
The honest list:
JSON state machine definitions get big fast. A 15-state workflow is a few hundred lines of dense JSON. Hard to review in a PR. We use the AWS CDK now to define state machines in TypeScript and synthesize the JSON — much more readable and reusable across workflows.
Debugging cross-state data flow. Each state has an input and output. Mistakes in path expressions (InputPath, ResultPath, OutputPath) silently pass wrong or empty data to the next state. The execution history shows the inputs and outputs per state, which helps, but the diagnosis is still "stare at the JSON paths until you find the typo."
State machine size limits. AWS limits state machines to 1MB of definition. We hit this once with a very large multi-branch workflow; had to refactor into nested workflows (a state machine calling another state machine).
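Nesting is itself a one-state pattern: a Task state that starts a child state machine and waits for it. A sketch, with ARNs and names illustrative:

```json
"RunCustomerBranch": {
  "Type": "Task",
  "Resource": "arn:aws:states:::states:startExecution.sync:2",
  "Parameters": {
    "StateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:customer-branch",
    "Input": { "customerId.$": "$.customerId" }
  },
  "End": true
}
```

The `.sync:2` suffix makes the parent wait for the child and receive its output as parsed JSON (plain `.sync` returns it as an escaped string).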
Quotas at the AWS account level. Step Functions has a default 25K-execution-per-month soft limit. We hit it during a backfill and needed a quota increase. Painless but a surprise.
Local development. No good way to run Step Functions locally for development. AWS provides Step Functions Local, an emulator, but its fidelity is limited. We mostly develop against a dev account.
Cases where we reach for something else: very high-frequency short chains, where per-transition or per-request pricing dominates and SQS plus a Lambda chain is cheaper; and workflows whose branching logic is complex enough that the JSON definition becomes unreadable, where plain code in a single Lambda wins.
On cost: for our document ingestion pipeline (~6,000 executions/month, ~12 states each, ~72,000 transitions):
Step Functions itself is negligible. At Standard pricing of $0.025 per 1,000 state transitions, 72,000 transitions comes to under $2/month. The work it orchestrates costs orders of magnitude more.
For Express workflows in high-volume use, the cost can dominate. We had one workflow doing ~50/s for a peak period; Express costs were tracking ~$200/month. Replaced with SQS + Lambda chain; cost dropped to ~$20/month with similar latency.
Things that matter once you have a few Step Functions:
Naming. service-purpose-version (e.g. ingestion-pdf-v2). When you need to update a workflow, you create a new version; old executions keep running on the old definition, new executions go to the new one.
Alarms. ExecutionsFailed per state machine. Routed to on-call.
Tooling. Use the AWS CDK to define state machines, not raw JSON. Worth the investment immediately.
Standard for multi-minute workflows; Express for short high-frequency. Pricing model fits each.
Map state with MaxConcurrency. Parallelism with bounded concurrency. The pattern most teams underuse.
Don't put complex logic in state machine definitions. Keep states thin; orchestration goes in the state machine, computation goes in the Lambdas.
Tabletop the failure modes. What if step 5 fails? Step 10? A whole branch? The retry + catch patterns need to be deliberate.
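The Retry + Catch pairing from those tabletop exercises ends up looking something like this sketch (state and Lambda names illustrative): Retry handles the transient failures, and Catch routes to a deliberate recovery state once retries are exhausted.

```json
"RenderPdf": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:render-pdf",
  "Retry": [
    { "ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 2, "MaxAttempts": 3, "BackoffRate": 2.0 }
  ],
  "Catch": [
    { "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "RecordFailureAndContinue" }
  ],
  "Next": "UploadReport"
}
```

`ResultPath: "$.error"` keeps the original input alongside the error details, so the recovery state knows which item failed and why.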
Step Functions is one of those AWS services that's easy to underestimate. For the right shape of workload — multi-step, async, occasionally failing — it removes a lot of glue code. For the wrong shape, it's awkward. The trick is knowing which shape you have.