We use Ansible for configuration management on hosts, picking up where Terraform stops. This is the workflow that keeps it tractable, plus what we wish we'd known about idempotency.

Infrastructure as Code with Ansible

We use Terraform for cloud resources and Ansible for what runs inside them. The split has been stable for a few years. Most of our pain with Ansible has come from places where we didn't follow our own conventions, not from the tool itself. This post is the working version of how we use Ansible — playbook structure, role design, idempotency discipline, and the patterns that earn their place.

Where Ansible fits

Terraform creates the EC2 instance. Ansible configures what's on it. Specifically:

Installing and configuring services (nginx, postgres, our app)
Managing systemd units, log rotation, kernel parameters
Bootstrapping new hosts to a known state
One-off operational tasks (rolling restarts, cert renewals on legacy hosts)

We don't use Ansible for:

Cloud resources (Terraform's job)
Container image build (Dockerfiles' job)
Kubernetes (Helm/Kustomize's job)
Application deployments to running clusters (CI/CD's job)

The boundary is "things that happen on the host." Inside containers/Kubernetes, other tools do the work.

Playbook structure

A playbook is a YAML file that says "apply these roles to these hosts." Ours follow a strict pattern:

yaml.yaml

- name: Configure web servers
  hosts: web
  become: true
  roles:
    - common
    - nginx
    - app-deploy
  vars:
    app_version: "1.42.0"

Three things to highlight:

become: true at the play level. We don't sprinkle become per task; either the play needs root or it doesn't.
Roles, not tasks at the play level. Tasks belong inside roles. A play that has a long inline task list is a smell — it should be a role.
Vars at the play level (or inventory) for things that change per run. Hardcoded values in tasks are the wrong place.

Role design

A role does one thing. We have ~25 roles, and each follows the same layout:

code

roles/
  nginx/
    defaults/main.yml      # Default values for variables
    files/                 # Static files copied as-is
    handlers/main.yml      # Handlers (e.g., "restart nginx")
    meta/main.yml          # Role dependencies, metadata
    tasks/main.yml         # Main task list
    templates/             # Jinja2 templates
    vars/main.yml          # Internal variables (not meant to be overridden)

Every role has defaults/main.yml with sensible defaults. Users override via inventory or playbook vars. We try to make every role usable with no overrides for the simple case.
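As a rough sketch of what such a defaults file might contain for the nginx role: the variable names and values below are hypothetical, chosen for illustration rather than copied from a real role.

```yaml
# roles/nginx/defaults/main.yml
# Hypothetical variable names; each default should work with no overrides.
nginx_worker_processes: auto
nginx_listen_port: 80
nginx_client_max_body_size: 1m
```

An inventory or play then overrides only what differs for that host group, for example nginx_listen_port: 8080, and leaves the rest alone.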

Tasks are short and named:

yaml.yaml

- name: Install nginx package
  apt:
    name: nginx
    state: present
    update_cache: true

- name: Copy nginx main config
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    owner: root
    group: root
    mode: '0644'
  notify: restart nginx

- name: Ensure nginx service is running
  systemd:
    name: nginx
    state: started
    enabled: true

Each task has a clear name. Failures show the named task, which makes debugging faster.
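The notify: restart nginx line above only does something if a handler with exactly that name exists. A minimal sketch of the matching handlers file, assuming a full restart is what's wanted (a reload may be enough for config-only changes):

```yaml
# roles/nginx/handlers/main.yml
# The handler name must match the string used in "notify" exactly.
- name: restart nginx
  systemd:
    name: nginx
    state: restarted
```

Handlers run once at the end of the play, and only if a notifying task reported a change, so an unchanged template never triggers a restart.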

Idempotency: the most important rule

Ansible tasks should be idempotent: running the same playbook twice should result in the same state, with no changes the second time.

For most modules (apt, file, template, systemd, etc.), idempotency is automatic — they check current state before changing.

The pitfalls:

shell and command modules are not idempotent by default. Running command: rm -rf /tmp/foo always reports "changed" because Ansible can't tell if anything actually changed. Use module-native equivalents (file: state=absent) where possible.
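The module-native replacement for the rm -rf example could look like this sketch. The file module checks whether the path exists first, so it reports "changed" only when it actually removes something:

```yaml
- name: Remove scratch directory
  file:
    path: /tmp/foo
    state: absent   # removes files and directories recursively; no-op if already gone
```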

shell with creates:/removes: can be made idempotent:

yaml.yaml

- name: Set up custom thing
  shell: ./set-up.sh
  args:
    creates: /var/lib/myapp/.setup-done

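The creates arg means "skip if this file exists." It's a manual idempotency marker, so the script itself has to create that file as its last step; otherwise the task reruns every time.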
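removes: is the mirror image: the task runs only while the named file still exists. A sketch, assuming a hypothetical uninstall script that deletes its own directory when it finishes:

```yaml
- name: Uninstall legacy agent
  command: /opt/legacy-agent/uninstall.sh      # hypothetical path
  args:
    removes: /opt/legacy-agent/uninstall.sh    # skipped once this file is gone
```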

changed_when: for shell and command tasks. It overrides Ansible's default of always reporting "changed", either unconditionally or based on the command's output:

yaml.yaml

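- name: Check service status
  command: systemctl is-active myservice   # no shell features needed, so command is enough
  register: myservice_status               # keep the result so later tasks can act on it
  changed_when: false                      # read-only check, never a change
  failed_when: false                       # a non-zero exit only means "not active"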

This is just a check, so it never reports changed; failed_when: false keeps an inactive service (non-zero exit) from failing the play.
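changed_when can also take an expression over the registered result. A sketch, assuming a hypothetical migration command that prints "No migrations to apply" when it has nothing to do:

```yaml
- name: Apply database migrations
  command: /opt/myapp/bin/migrate     # hypothetical script
  register: migrate_result
  # Report a change only when the script did real work.
  changed_when: "'No migrations to apply' not in migrate_result.stdout"
```

The string being matched is an assumption about the script's output; the pattern only holds up if the script prints something stable to key on.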

Custom scripts that you run via command should be idempotent in their own logic. If they aren't, fix the script.

We have a CI rule: any role that has command: or shell: tasks must demonstrate idempotency in tests (running the playbook twice; the second run reports zero changes).

Inventory: dynamic, not static

We use dynamic inventory: an inventory plugin that queries AWS for current EC2 instances and groups them by tags.

A static hosts.ini file is a maintenance burden — instances come and go. The dynamic version queries reality.

yaml.yaml

plugin: aws_ec2
regions:
  - us-east-1
  - us-west-2
keyed_groups:
  - key: tags.Role
    prefix: role
  - key: tags.Environment
    prefix: env
hostnames:
  - tag:Name

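This produces groups like role_web, role_db, env_prod, which plays target by name. The config file's name must end in aws_ec2.yml or aws_ec2.yaml for the plugin to pick it up.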
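A play can combine those groups with a host pattern. A minimal sketch, assuming instances tagged Role=web and Environment=prod:

```yaml
- name: Configure production web servers
  # Intersection pattern: hosts that are in role_web AND in env_prod.
  hosts: "role_web:&env_prod"
  become: true
  roles:
    - common
    - nginx
```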

For non-cloud hosts (rare; we have a few on-prem), we use a static hosts.yml for those specifically.

Vault for secrets

Ansible Vault encrypts secrets in YAML files. We use it for:

Passwords / API keys that need to be on disk during configuration
Certificate keys
Some service credentials

yaml.yaml

db_password: !vault |
  $ANSIBLE_VAULT;1.1;AES256
  62313338363134316533343334...

The vault password lives in our secrets manager, not in the repo. CI fetches it via a temporary credential.

For most secrets, though, we don't put them in vault. We pull them from AWS Secrets Manager or HashiCorp Vault at runtime. Ansible Vault is the fallback for things that can't reach the secrets backend (e.g., during initial bootstrap).
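A runtime lookup against AWS Secrets Manager might look like the sketch below. The secret name is hypothetical, and the lookup plugin's name depends on the installed amazon.aws collection version, so treat amazon.aws.aws_secret as an assumption to check against your collection's docs. The lookup runs on the control node, which needs boto3 and AWS credentials.

```yaml
- name: Configure app database access
  hosts: role_web
  become: true
  vars:
    # Hypothetical secret name; resolved on the control node at run time.
    db_password: "{{ lookup('amazon.aws.aws_secret', 'myapp/prod/db_password', region='us-east-1') }}"
  roles:
    - app-deploy
```

Nothing secret lands in the repo this way; tasks that render the value should still set no_log: true so it stays out of run output.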

Testing roles

Three layers:

Syntax check: ansible-playbook --syntax-check runs in CI on every PR.

Lint: ansible-lint flags antipatterns (unnamed tasks, deprecated modules, tasks that should be handlers). Mandatory in CI.

Molecule tests: spin up a Docker container or Vagrant box, apply the role, assert the resulting state. We have Molecule tests for our 5 most-used roles.

Molecule tests are slow (~5-10 min per role); we run them on PRs that touch the role, not on every commit. They catch the bugs that syntax/lint can't (real failures applying the role).
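A Molecule scenario for a role might be sketched like this. The platform name and image are placeholders, and depending on the Molecule version the Docker driver may come from a separate plugin package. The idempotence step reruns the converge playbook and fails if any task reports a change, which is one way to enforce the zero-changes-on-second-run rule.

```yaml
# roles/nginx/molecule/default/molecule.yml
driver:
  name: docker
platforms:
  - name: nginx-test                              # placeholder instance name
    image: your-registry/ubuntu-systemd:22.04     # placeholder image
provisioner:
  name: ansible
verifier:
  name: ansible
scenario:
  test_sequence:
    - syntax
    - create
    - converge
    - idempotence   # second run must report zero changes
    - verify
    - destroy
```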

Common antipatterns we've removed

Things that used to be in our playbooks that we cleaned up:

Long task lists with no role boundaries. A playbook that does install, configure, deploy, restart in 50 inline tasks. Refactored into roles.

Implicit ordering dependencies. Playbooks that worked because tasks happened in a specific order; rearranging broke things. We made dependencies explicit (one role depends on another via meta/main.yml).
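An explicit dependency is a few lines in the dependent role's meta file. A sketch, assuming the app-deploy role needs nginx in place first:

```yaml
# roles/app-deploy/meta/main.yml
dependencies:
  - role: nginx   # Ansible runs this role before app-deploy's own tasks
```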

Hardcoded values in tasks. Hostnames, paths, credentials baked into task definitions. Refactored to variables in defaults/main.yml.

shell: everywhere instead of native modules. Bash scripts wrapped in Ansible. Replaced with module-based equivalents where possible (idempotent, more readable).

Per-environment forks of playbooks. prod-deploy.yml, staging-deploy.yml. Replaced with one playbook + per-environment vars.
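With groups like env_prod and env_staging coming from the inventory, per-environment values can live in group_vars files. A sketch showing two files, with hypothetical variable values:

```yaml
# group_vars/env_staging.yml (hypothetical values)
app_version: "1.43.0-rc1"
app_log_level: debug
---
# group_vars/env_prod.yml (hypothetical values)
app_version: "1.42.0"
app_log_level: info
```

Play-level vars take precedence over group_vars, so a variable set in both places resolves to the play's value; for this pattern the variable should be defined in one place only.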

Mixed concerns in one role. A "web" role that installed nginx AND deployed the app AND configured logging. Split into focused roles.

Performance: what slows down playbooks

Common slowdowns:

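Many small tasks. Ansible's overhead per task is ~100-500ms (SSH round trip, module transfer and execution); fact gathering is a separate per-play cost. At that rate, a play with 200 tasks spends roughly 20-100 seconds on overhead alone. We batch where possible by passing a list to modules that accept one (e.g. a single apt task with a list of package names), since a with_items or loop task still invokes the module once per item.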
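A sketch of module-level batching: one apt task installing several packages in a single module invocation, rather than one invocation per package. The package names are only examples.

```yaml
- name: Install base packages in one transaction
  apt:
    name:
      - curl
      - htop
      - jq
    state: present
```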

Fact gathering. gather_facts: true (the default) collects ~500 facts about each host. Useful for debugging, slow for production runs. We disable it for plays that don't need facts.
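Disabling fact gathering is a single play-level key. A sketch, using a hypothetical logrotate role that reads no facts:

```yaml
- name: Rotate application logs
  hosts: role_web
  become: true
  gather_facts: false   # safe only if no task or template uses ansible_facts
  roles:
    - logrotate         # hypothetical role name
```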

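Serial execution. By default, Ansible runs each task on all hosts in parallel, up to forks (5 unless you raise it), so a low forks value quietly serializes larger inventories. serial: 1 or serial: 25% deliberately limits the batch size; reserve it for plays with strict ordering requirements (rolling restarts).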
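A rolling restart with serial might be sketched as follows. It uses an inline task because it's a one-off operational play, and max_fail_percentage is an optional extra that isn't required for serial to work:

```yaml
- name: Rolling restart of web tier
  hosts: role_web
  become: true
  gather_facts: false
  serial: "25%"              # a quarter of the hosts per batch
  max_fail_percentage: 0     # abort the rollout if any host in a batch fails
  tasks:
    - name: Restart nginx
      systemd:
        name: nginx
        state: restarted
```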

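No SSH multiplexing. ssh_args = -o ControlMaster=auto -o ControlPersist=60s under [ssh_connection] in ansible.cfg reuses SSH connections. ~3x speedup for multi-task plays. Ansible's default ssh_args already include these options, so also check that a custom ssh_args isn't dropping them.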

Our typical playbook against ~30 hosts runs in ~3 minutes with these optimizations. Without them, it takes 10+ minutes.

Where Ansible struggles

To be honest about the limits:

State management. Ansible doesn't know about state across runs. If you remove a task that created a file, the file isn't cleaned up automatically. You have to add a state: absent task or accept the drift.

Dynamic decisions. "If this is true, do that, else do something else" works but the syntax is awkward (when: everywhere, conditionals in templates). Complex logic gets hard to read.

Heavy templating. Jinja2 templates that have lots of logic become hard to maintain. We try to keep templates simple; if logic is complex, it goes in a Python script run as part of the playbook.

Imperative feel vs Terraform's declarative model. Ansible is fundamentally a sequence of tasks. Terraform describes a desired state. The mental model differs; for state-machine-like infrastructure (cloud resources), Terraform is cleaner. For procedural-feeling configuration (install this, then this, then this), Ansible fits better.

What I'd tell a team starting out

Use Ansible for "things that happen on hosts." Terraform creates the host; Ansible configures it. Don't try to make Ansible manage cloud resources.

Roles, not tasks at the play level. Roles encapsulate; tasks-at-play-level become a mess.

Idempotency is non-negotiable. Running a playbook twice should report zero changes the second time. Test for this.

Dynamic inventory. Static inventory files become stale; dynamic queries are the truth.

Defaults in defaults/main.yml, overrides via inventory. Make roles usable without overrides for the common case.

SSH multiplexing. Free 3x speedup for typical playbooks.

Lint in CI. Catches the antipatterns before they ship.

Ansible is one of those tools that's been steady for years — it does what it does, the patterns are well-known, the gotchas are documented. The discipline of keeping playbooks readable and idempotent is the actual work; the tool itself is mature. The teams that struggle with Ansible usually have one of: poorly-structured roles, weak idempotency, or hard-coded values everywhere. Fix those three and Ansible is a reliable workhorse.


About Kiril Urbonas
