We use Ansible for configuration management on hosts where Terraform stops. The workflow that keeps it tractable and what we wish we'd known about idempotency.
We use Terraform for cloud resources and Ansible for what runs inside them. The split has been stable for a few years. Most of our pain with Ansible has come from places where we didn't follow our own conventions, not from the tool itself. This post is the working version of how we use Ansible — playbook structure, role design, idempotency discipline, and the patterns that earn their place.
Terraform creates the EC2 instance. Ansible configures what's on it. Specifically:
We don't use Ansible for:
The boundary is "things that happen on the host." Inside containers/Kubernetes, other tools do the work.
A playbook is a YAML file that says "apply these roles to these hosts." Ours follow a strict pattern:
- name: Configure web servers
  hosts: web
  become: true
  roles:
    - common
    - nginx
    - app-deploy
  vars:
    app_version: "1.42.0"
Three things to highlight:
become: true at the play level. We don't sprinkle become per task; either the play needs root or it doesn't.
A role does one thing. We have ~25 roles. The naming and structure:
roles/
  nginx/
    defaults/main.yml   # Default values for variables
    files/              # Static files copied as-is
    handlers/main.yml   # Handlers (e.g., "restart nginx")
    meta/main.yml       # Role dependencies, metadata
    tasks/main.yml      # Main task list
    templates/          # Jinja2 templates
    vars/main.yml       # Internal variables (not user-overridable)
Every role has defaults/main.yml with sensible defaults. Users override via inventory or playbook vars. We try to make every role usable with no overrides for the simple case.
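As a rough illustration of that convention, a role's defaults file might look like the sketch below. The variable names (nginx_worker_processes, nginx_listen_port, nginx_server_name) are hypothetical and are not taken from our actual role.

```yaml
# roles/nginx/defaults/main.yml
# Hypothetical variable names, shown only to illustrate the convention.
nginx_worker_processes: auto    # works unchanged on most hosts
nginx_listen_port: 80
nginx_server_name: "_"          # catch-all server name

# An override lives outside the role, for example in group_vars/role_web.yml:
#   nginx_listen_port: 8080
```

Because role defaults have the lowest variable precedence, any inventory or playbook value with the same name wins without touching the role.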
Tasks are short and named:
- name: Install nginx package
  apt:
    name: nginx
    state: present
    update_cache: true
- name: Copy nginx main config
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    owner: root
    group: root
    mode: '0644'
  notify: restart nginx
- name: Ensure nginx service is running
  systemd:
    name: nginx
    state: started
    enabled: true
Each task has a clear name. Failures show the named task, which makes debugging faster.
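The notify: restart nginx line in the config task only does something if a handler with exactly that name exists. A minimal sketch of what handlers/main.yml could contain, assuming the service is managed by systemd as in the task list:

```yaml
# roles/nginx/handlers/main.yml
# The handler name must match the string used in `notify:` exactly.
- name: restart nginx
  systemd:
    name: nginx
    state: restarted
```

Handlers run once at the end of the play, however many tasks notified them, and only when at least one notifying task reported a change.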
Ansible tasks should be idempotent: running the same playbook twice should result in the same state, with no changes the second time.
For most modules (apt, file, template, systemd, etc.), idempotency is automatic — they check current state before changing.
The pitfalls:
shell and command modules are not idempotent by default. Running command: rm -rf /tmp/foo always reports "changed" because Ansible can't tell if anything actually changed. Use module-native equivalents (file: state=absent) where possible.
shell with creates:/removes: can be made idempotent:
- name: Set up custom thing
  shell: ./set-up.sh
  args:
    creates: /var/lib/myapp/.setup-done
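The creates arg means "skip if this file exists." It's a manual idempotency marker, so the script itself has to create that file when it finishes successfully; Ansible only checks for it.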
changed_when: for shell tasks. Tells Ansible whether to report change based on output:
- name: Check service status
  shell: systemctl is-active myservice
  changed_when: false
  failed_when: false
In plain terms: "this is just a check; never report it as changed, and never fail the play on a non-zero exit code."
Custom scripts that you run via command should be idempotent in their own logic. If they aren't, fix the script.
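To make two of these pitfalls concrete, here is a sketch of the native-module replacement and of an output-based changed_when:. The script path /usr/local/bin/sync-config and its "updated" output line are hypothetical; substitute whatever your own command prints when it actually changes something.

```yaml
# Instead of `command: rm -rf /tmp/foo`, use the file module, which
# reports "changed" only when the path actually existed.
- name: Remove stale temp directory
  file:
    path: /tmp/foo
    state: absent

# Report "changed" based on output rather than unconditionally.
# Assumption: the (hypothetical) script prints "updated" only when it
# modified something.
- name: Sync application config
  command: /usr/local/bin/sync-config
  register: sync_result
  changed_when: "'updated' in sync_result.stdout"
```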
We have a CI rule: any role that has command: or shell: tasks must demonstrate idempotency in tests (running the playbook twice; the second run reports zero changes).
We use dynamic inventory: an inventory plugin that queries AWS for current EC2 instances and groups them by tags.
A static hosts.ini file is a maintenance burden — instances come and go. The dynamic version queries reality.
plugin: aws_ec2
regions:
  - us-east-1
  - us-west-2
keyed_groups:
  - key: tags.Role
    prefix: role
  - key: tags.Environment
    prefix: env
hostnames:
  - tag:Name
This produces groups like role_web, role_db, env_prod. Plays target them.
For non-cloud hosts (rare; we have a few on-prem), we use a static hosts.yml for those specifically.
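With those groups in place, a play can combine them using Ansible's host pattern syntax. The sketch below assumes the plugin config is saved as inventory.aws_ec2.yml (the plugin only accepts file names ending in aws_ec2.yml or aws_ec2.yaml) and that instances carry both tags; site.yml is a hypothetical file name.

```yaml
# site.yml (hypothetical file name)
# Run with: ansible-playbook -i inventory.aws_ec2.yml site.yml
- name: Configure production web servers
  hosts: "role_web:&env_prod"   # intersection: hosts in both groups
  become: true
  roles:
    - common
    - nginx
```

Running ansible-inventory -i inventory.aws_ec2.yml --graph first is a cheap way to confirm which hosts ended up in which group before a play touches them.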
Ansible Vault encrypts secrets in YAML files. We use it for:
db_password: !vault |
  $ANSIBLE_VAULT;1.1;AES256
  62313338363134316533343334...
The vault password lives in our secrets manager, not in the repo. CI fetches it via a temporary credential.
For most secrets, though, we don't put them in vault. We pull them from AWS Secrets Manager or HashiCorp Vault at runtime. Ansible Vault is the fallback for things that can't reach the secrets backend (e.g., during initial bootstrap).
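A sketch of the runtime approach for AWS Secrets Manager, assuming the amazon.aws collection is installed and the control node has AWS credentials. The secret name myapp/prod/db_password and the template and destination paths are hypothetical. Depending on the collection version, the lookup may be published under a different name, so check the documentation for the release you have installed.

```yaml
- name: Render app config with a secret fetched at runtime
  template:
    src: app.env.j2              # hypothetical template
    dest: /etc/myapp/app.env     # hypothetical destination
    owner: root
    group: root
    mode: '0600'
  vars:
    # Lookups run on the control node, so the value never lives in the repo.
    db_password: "{{ lookup('amazon.aws.aws_secret', 'myapp/prod/db_password') }}"
  no_log: true                   # keep the value out of task output
```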
Three layers:
Syntax check: ansible-playbook --syntax-check runs in CI on every PR.
Lint: ansible-lint flags antipatterns (unnamed tasks, deprecated modules, missing handlers). Mandatory in CI.
Molecule tests: spin up a Docker container or Vagrant box, apply the role, assert the resulting state. We have Molecule tests for our 5 most-used roles.
Molecule tests are slow (~5-10 min per role); we run them on PRs that touch the role, not on every commit. They catch the bugs that syntax/lint can't (real failures applying the role).
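A sketch of a Molecule scenario that also enforces the two-run idempotency rule: Molecule's idempotence step applies the role a second time and fails if any task reports a change. It assumes the Docker driver is installed; the container image is an assumption, so use whichever systemd-capable image you trust.

```yaml
# roles/nginx/molecule/default/molecule.yml
driver:
  name: docker
platforms:
  - name: instance
    image: geerlingguy/docker-ubuntu2204-ansible   # assumed image; swap for your own
    pre_build_image: true
provisioner:
  name: ansible
verifier:
  name: ansible
scenario:
  test_sequence:
    - destroy
    - create
    - converge      # apply the role
    - idempotence   # apply it again; any "changed" task fails the run
    - verify        # run the assertions in verify.yml
    - destroy
```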
Things that used to be in our playbooks that we cleaned up:
Long task lists with no role boundaries. A playbook that does install, configure, deploy, restart in 50 inline tasks. Refactored into roles.
Implicit ordering dependencies. Playbooks that worked because tasks happened in a specific order; rearranging broke things. We made dependencies explicit (one role depends on another via meta/main.yml).
Hardcoded values in tasks. Hostnames, paths, credentials baked into task definitions. Refactored to variables in defaults/main.yml.
shell: everywhere instead of native modules. Bash scripts wrapped in Ansible. Replaced with module-based equivalents where possible (idempotent, more readable).
Per-environment forks of playbooks. prod-deploy.yml, staging-deploy.yml. Replaced with one playbook + per-environment vars.
Mixed concerns in one role. A "web" role that installed nginx AND deployed the app AND configured logging. Split into focused roles.
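Two of those cleanups are small enough to show. The sketch below makes a role dependency explicit and replaces per-environment playbook forks with group variables. Each YAML document is a separate file, the paths follow the standard layout, and the staging version number is illustrative.

```yaml
# roles/app-deploy/meta/main.yml
# Explicit dependency: nginx is applied before app-deploy wherever
# app-deploy is used.
dependencies:
  - role: nginx
---
# group_vars/env_prod.yml
app_version: "1.42.0"
---
# group_vars/env_staging.yml (illustrative value)
app_version: "1.43.0-rc1"
```

Play-level vars take precedence over group_vars, so the shared playbook should not also set app_version, or the per-environment values will be silently ignored.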
Common slowdowns:
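Many small tasks. Ansible's overhead per task is ~100-500ms (SSH round trips plus copying and running the module). At that rate a play with 200 tasks spends roughly 20-100 seconds on overhead alone. We batch where possible: loop (the current replacement for with_items) keeps task lists short, and passing a list to a module that accepts one, such as the package name list for apt, avoids running the module once per item.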
Fact gathering. gather_facts: true (the default) collects ~500 facts about each host. Useful for debugging, slow for production runs. We disable it for plays that don't need facts.
Serial execution. By default, Ansible runs all hosts in parallel (up to forks, which is 5 by default). For services with strict ordering requirements (rolling restarts), use serial: 1 or serial: 25%.
No SSH multiplexing. Setting ssh_args = -o ControlMaster=auto -o ControlPersist=60s under [ssh_connection] in ansible.cfg reuses SSH connections. ~3x speedup for multi-task plays.
For our typical playbook against ~30 hosts, with these optimizations it runs in ~3 minutes. Without, ~10+.
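Several of these settings live in the play itself. A sketch, with a hypothetical package list and play name:

```yaml
- name: Rolling update of web servers
  hosts: role_web
  become: true
  gather_facts: false   # safe only if no task or template uses host facts
  serial: "25%"         # work through the hosts a quarter at a time
  tasks:
    # One apt call with a list is a single task, instead of one module
    # run per package.
    - name: Install base packages
      apt:
        name:
          - nginx
          - curl
          - ca-certificates
        state: present
```

Note that serial trades speed for safety: it belongs on plays that restart services, not on every play.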
Honest about the limits:
State management. Ansible doesn't know about state across runs. If you remove a task that created a file, the file isn't cleaned up automatically. You have to add a state: absent task or accept the drift.
Dynamic decisions. "If this is true, do that, else do something else" works but the syntax is awkward (when: everywhere, conditionals in templates). Complex logic gets hard to read.
Heavy templating. Jinja2 templates that have lots of logic become hard to maintain. We try to keep templates simple; if logic is complex, it goes in a Python script run as part of the playbook.
Imperative-feeling vs Terraform's declarative. Ansible is fundamentally a sequence of tasks. Terraform is a desired state. The mental model differs; for state-machine-like infrastructure (cloud resources), Terraform is cleaner. For procedural-feeling configuration (install this, then this, then this), Ansible fits better.
Use Ansible for "things that happen on hosts." Terraform creates the host; Ansible configures it. Don't try to make Ansible manage cloud resources.
Roles, not tasks at the play level. Roles encapsulate; tasks-at-play-level become a mess.
Idempotency is non-negotiable. Running a playbook twice should report zero changes the second time. Test for this.
Dynamic inventory. Static inventory files become stale; dynamic queries are the truth.
Defaults in defaults/main.yml, overrides via inventory. Make roles usable without overrides for the common case.
SSH multiplexing. Free 3x speedup for typical playbooks.
Lint in CI. Catches the antipatterns before they ship.
Ansible is one of those tools that's been steady for years — it does what it does, the patterns are well-known, the gotchas are documented. The discipline of keeping playbooks readable and idempotent is the actual work; the tool itself is mature. The teams that struggle with Ansible usually have one of: poorly-structured roles, weak idempotency, or hard-coded values everywhere. Fix those three and Ansible is a reliable workhorse.