We use Step Functions for batch processing, document ingestion, and a few agentic workflows. The patterns that work, the limits we hit, and where we'd reach for something else.
We use AWS Step Functions for a handful of production workflows — document ingestion pipelines, batch report generation, an agentic task runner. Step Functions is one of those services that sounds too niche to bother with, but for the right shape of work it's surprisingly good. This post covers what we run, what works, and where the limits hit.
A Step Function is a state machine defined in JSON (Amazon States Language). Each state is a step that does something — calls a Lambda, invokes another AWS service, waits for a callback, makes a decision based on input. Transitions between states are defined in the JSON. AWS hosts the execution, tracks state, retries failed steps, captures the execution history for inspection.
It's a workflow orchestrator. Think Airflow lite, hosted, AWS-native, billed per state transition.
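To make the shape concrete, here is a minimal two-state machine as a sketch — the state names, Lambda ARN, and SNS topic are all illustrative, not from our actual pipelines:

```json
{
  "Comment": "Sketch: call a Lambda, then publish a notification",
  "StartAt": "ExtractText",
  "States": {
    "ExtractText": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:extract-text",
      "Next": "NotifyDone"
    },
    "NotifyDone": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:pipeline-done",
        "Message.$": "$.summary"
      },
      "End": true
    }
  }
}
```

Every state names its successor (`Next`) or terminates (`End`); that transition graph is the whole workflow.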
The pitch over rolling your own: AWS hosts the execution, tracks state between steps, retries failures declaratively, and keeps a full execution history you can inspect when something breaks mid-workflow.
The cost: workflow logic in JSON, which is awkward to read and write past a certain complexity.
Three production workflows:
Document ingestion pipeline. PDF arrives in S3 → Step Function triggers → extract text via Lambda → chunk text → call embedding API for each chunk → upsert chunks to vector DB → notify on completion. About 12 states. Runs ~200x/day, takes 2-5 minutes per document.
Nightly batch reports. Generate reports for ~50 customers in parallel. Each customer = one branch in a Step Function Map state. Each branch: fetch data → compute aggregations → render PDF → upload → email customer. Map state handles concurrency; we limit to 10 parallel branches to avoid hammering downstream services.
Agentic task runner. Long-running task: agent runs in a loop (decide → call tool → observe → decide), each iteration is a state. Wait states let the agent call long-running tools (e.g. a multi-minute web scrape) via callback patterns without holding compute. Cap on iterations prevents runaway.
In each case, what made Step Functions the right pick was either the parallelism (Map state), the long-running wait pattern (callbacks), or the visibility into execution state (for debugging multi-step failures).
Two flavors with different pricing: Standard (billed per state transition, full execution history, executions can run up to a year) and Express (billed per request and duration, minimal history, capped at five minutes).
We use Standard for the document pipeline and report generation (multi-minute workflows, history matters for debugging). Express would be wrong for these — execution history is essential when something fails mid-document.
We use Express for a couple of high-frequency short workflows where we just need the orchestration and don't care about audit trails.
Parallel processing via Map. A list of items, each processed independently. Map runs them in parallel up to a limit. Built-in error handling — failures in one item don't fail the others; we collect successes and errors separately.
```json
{
  "ProcessAllCustomers": {
    "Type": "Map",
    "ItemsPath": "$.customers",
    "MaxConcurrency": 10,
    "Iterator": {
      "StartAt": "ProcessOne",
      "States": { "ProcessOne": { ... } }
    }
  }
}
```
Choice states for branching. "If the input has property X, do A, else do B." Simple but very useful for "this is a retry; do the recovery path" or "this is a special-case input."
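The iteration cap in the agentic runner is a natural fit for a Choice state. A sketch, with field and state names illustrative rather than taken from our definition:

```json
"CheckIterationBudget": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.iteration",
      "NumericLessThan": 25,
      "Next": "DecideNextAction"
    }
  ],
  "Default": "FailRunawayTask"
}
```

If no Choice rule matches, execution falls through to `Default` — which is where the runaway-prevention path lives.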
Wait + callback for long-running work. Step Functions can pause a state until an external system calls back with a token. Lets us call a long-running external job (e.g., a Bedrock model that takes minutes) without burning Lambda compute waiting for it.
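The callback pattern hangs on the `.waitForTaskToken` suffix: the state pauses until something calls `SendTaskSuccess` or `SendTaskFailure` with the token. A sketch, with the queue URL and state names illustrative:

```json
"RunLongScrape": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
  "Parameters": {
    "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/scrape-jobs",
    "MessageBody": {
      "url.$": "$.targetUrl",
      "taskToken.$": "$$.Task.Token"
    }
  },
  "TimeoutSeconds": 3600,
  "Next": "ObserveResult"
}
```

The worker that drains the queue does the slow job, then returns the token with the result. `TimeoutSeconds` matters here — without it, a worker that dies leaves the state waiting indefinitely (up to the Standard one-year cap).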
Retry policies per state. Don't hand-code retry logic; declare it in the state's Retry block:
"Retry": [{
"ErrorEquals": ["States.TaskFailed"],
"IntervalSeconds": 2,
"MaxAttempts": 3,
"BackoffRate": 2.0
}]
Clean, declarative, no exponential-backoff implementation to debug.
The honest list:
JSON state machine definitions get big fast. A 15-state workflow is a few hundred lines of dense JSON. Hard to review in a PR. We use the AWS CDK now to define state machines in TypeScript and synthesize the JSON — much more readable and reusable across workflows.
Debugging cross-state data flow. Each state has an input and output. Mistakes in path expressions (InputPath, ResultPath, OutputPath) silently pass wrong or empty data to the next state. The execution history shows the inputs and outputs per state, which helps, but the diagnosis is still "stare at the JSON paths until you find the typo."
State machine size limits. AWS limits state machines to 1MB of definition. We hit this once with a very large multi-branch workflow; had to refactor into nested workflows (a state machine calling another state machine).
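Nesting is itself a one-state pattern: a Task state that starts a child state machine and waits for it. A sketch, with ARNs and names illustrative:

```json
"RunCustomerBranch": {
  "Type": "Task",
  "Resource": "arn:aws:states:::states:startExecution.sync:2",
  "Parameters": {
    "StateMachineArn": "arn:aws:states:us-east-1:123456789012:stateMachine:customer-branch",
    "Input": { "customerId.$": "$.customerId" }
  },
  "End": true
}
```

The `.sync:2` suffix makes the parent wait for the child and receive its output as parsed JSON (plain `.sync` returns it as an escaped string).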
Quotas at the AWS account level. Step Functions has a default 25K-execution-per-month soft limit. We hit it during a backfill and needed a quota increase. Painless but a surprise.
Local development. No good way to run Step Functions locally for development. AWS provides Step Functions Local, an emulator, but its fidelity is limited. We mostly develop against a dev account.
Cases where we reach for something else: very high-frequency short chains, where per-transition or per-request pricing dominates and SQS plus a Lambda chain is cheaper; and workflows whose branching logic is complex enough that the JSON definition becomes unreadable, where plain code in a single Lambda wins.
On cost: for our document ingestion pipeline (~6,000 executions/month, ~12 states each, ~72,000 transitions):
Step Functions itself is negligible. At Standard pricing of $0.025 per 1,000 state transitions, 72,000 transitions comes to under $2/month. The work it orchestrates costs orders of magnitude more.
For Express workflows in high-volume use, the cost can dominate. We had one workflow doing ~50/s for a peak period; Express costs were tracking ~$200/month. Replaced with SQS + Lambda chain; cost dropped to ~$20/month with similar latency.
Things that matter once you have a few Step Functions:
Naming. service-purpose-version (e.g. ingestion-pdf-v2). When you need to update a workflow, you create a new version; old executions keep running on the old definition, new executions go to the new one.
Alarms. ExecutionsFailed per state machine. Routed to on-call.
Tooling. Use the AWS CDK to define state machines, not raw JSON. Worth the investment immediately.
Standard for multi-minute workflows; Express for short high-frequency. Pricing model fits each.
Map state with MaxConcurrency. Parallelism with bounded concurrency. The pattern most teams underuse.
Don't put complex logic in state machine definitions. Keep states thin; orchestration goes in the state machine, computation goes in the Lambdas.
Tabletop the failure modes. What if step 5 fails? Step 10? A whole branch? The retry + catch patterns need to be deliberate.
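The Retry + Catch pairing from those tabletop exercises ends up looking something like this sketch (state and Lambda names illustrative): Retry handles the transient failures, and Catch routes to a deliberate recovery state once retries are exhausted.

```json
"RenderPdf": {
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:render-pdf",
  "Retry": [
    { "ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 2, "MaxAttempts": 3, "BackoffRate": 2.0 }
  ],
  "Catch": [
    { "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "RecordFailureAndContinue" }
  ],
  "Next": "UploadReport"
}
```

`ResultPath: "$.error"` keeps the original input alongside the error details, so the recovery state knows which item failed and why.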
Step Functions is one of those AWS services that's easy to underestimate. For the right shape of workload — multi-step, async, occasionally failing — it removes a lot of glue code. For the wrong shape, it's awkward. The trick is knowing which shape you have.