We cut our largest playbook's runtime from 14 minutes to 4 minutes. The specific changes that mattered, plus the ones that didn't.
We have ~30 Ansible playbooks; some run on tens of hosts, a few run on hundreds. Slow playbooks have an outsized impact — they block deploys, they make incident response slower, they discourage running them at all. After focused optimization on our largest playbook (a fleet-wide config update), we cut runtime from 14 minutes to 4. This post is what worked, with rough impact estimates per change.
A playbook configuring 60 hosts: ~14 minutes. Tasks were a mix of package installation, config templating, service management, and validation.
For a playbook that needs to be re-run when configs change, 14 minutes is annoying. For something we'd want to run during incident recovery, it's too slow.
The default behavior: every Ansible task makes a fresh SSH connection. With ~80 tasks across 60 hosts, that's 4,800 SSH handshakes.
Enable SSH multiplexing in ansible.cfg:
[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o ControlPath=/tmp/ansible-ssh-%h-%p-%r
pipelining = True
ControlPersist=60s: SSH keeps the connection alive for 60 seconds. Subsequent tasks reuse it.
pipelining = True: pipelines commands in a single SSH session, avoiding intermediate file transfers.
Combined: ~3x speedup for our playbook. From 14 min to ~5 min.
This is the biggest single lever for most slow playbooks. If yours is slow and you don't have these on, fix that first.
gather_facts: True (the default) collects ~500 facts about each host before any tasks run. Useful when tasks reference facts; wasteful when they don't.
- name: Update config files only
hosts: web
gather_facts: false
tasks:
- ...
For tasks that don't reference ansible_* variables, this skips the fact-gathering step. ~10-30 seconds saved per host on plays where facts aren't used.
For our playbook with 60 hosts, ~5 minutes saved.
For plays that need some facts: gather_subset: ['network', 'hardware'] collects only the specified subsets, faster than full gathering.
serial and parallel execution#By default, Ansible runs tasks in parallel up to forks (default 5). For 60 hosts, only 5 are working at a time.
# ansible.cfg
[defaults]
forks = 25
Increase forks. We use 25 — enough parallelism without overwhelming our network or hammering the SSH connection limit.
For some plays, you specifically want serial execution (rolling restart):
- name: Rolling restart
hosts: web
serial: 5 # 5 hosts at a time
Or serial: "20%" for percentage-based.
Increasing forks from 5 to 25: ~30-40% faster for our playbook.
Each task has fixed overhead (SSH command, return processing). Reducing task count reduces this overhead.
Combine related tasks where reasonable:
# Before: 3 tasks
- name: Create dir 1
file: { path: /opt/app1, state: directory }
- name: Create dir 2
file: { path: /opt/app2, state: directory }
- name: Create dir 3
file: { path: /opt/app3, state: directory }
# After: 1 task
- name: Create dirs
file: { path: "{{ item }}", state: directory }
loop:
- /opt/app1
- /opt/app2
- /opt/app3
Or use with_items style. The single task with a loop is faster than 3 separate tasks.
For our playbook, consolidating tasks saved ~1 minute.
package for batched installs#Installing packages one at a time is slow (separate apt update, separate transactions):
# Before: separate tasks
- name: Install nginx
apt: { name: nginx, state: present, update_cache: yes }
- name: Install postgresql-client
apt: { name: postgresql-client, state: present }
- name: Install jq
apt: { name: jq, state: present }
# After: one task
- name: Install packages
apt:
name: [nginx, postgresql-client, jq]
state: present
update_cache: yes
The single call installs all packages in one apt transaction. Faster than three separate calls.
async for long-running tasks#For tasks that take a long time (large package installs, compilations), async lets them run in the background:
- name: Slow setup script
command: /opt/setup.sh
async: 600 # 10 min timeout
poll: 30 # check every 30s
The Ansible runner doesn't sit blocked waiting; it polls. Other tasks (or hosts) can proceed.
For tasks that don't depend on the slow one's output, poll: 0 runs it asynchronously and returns immediately:
- name: Kick off backup
command: /opt/backup.sh
async: 3600
poll: 0
register: backup_job
# ... other tasks ...
- name: Wait for backup
async_status:
jid: "{{ backup_job.ansible_job_id }}"
until: backup_status.finished
retries: 60
delay: 30
Useful for parallelizing things that would otherwise serialize.
delegate_to to avoid wasted runs#If a task should run once, not per-host:
- name: Update load balancer
uri:
url: https://lb-api/services
method: POST
delegate_to: localhost
run_once: true
run_once: true makes the task run on only one host (and delegate_to: localhost runs it on the controller). Without these, the task would run 60 times.
If you run multiple plays in one playbook, facts are gathered once per play by default. Caching saves re-gathering:
[defaults]
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
fact_caching_timeout = 7200 # 2 hours
For playbooks that run frequently, the cached facts are a quick reuse. For one-off runs, less helpful.
The biggest win is often not making Ansible faster, but making it do less.
Tasks that aren't idempotent run every time even if nothing has changed. Refactoring them to be properly idempotent means they no-op when state is correct:
# Bad: always reports changed
- name: Apply config
shell: /opt/apply-config.sh
# Better: idempotent
- name: Apply config
template:
src: config.j2
dest: /etc/myapp/config
notify: restart myapp # only restarts if template changed
Most modules (template, file, apt, systemd) are idempotent — they check current state and skip if matching. The shell and command modules aren't, by default.
For tasks that legitimately need to be command/shell, add idempotency markers:
- name: Initial setup
shell: /opt/setup.sh
args:
creates: /var/lib/myapp/.setup-done
The creates: ... arg makes the task skip if the file exists. Manual idempotency.
Some optimizations we tried that didn't help meaningfully:
Custom Ansible modules. Sometimes modules add overhead vs raw shell. The overhead is small; we didn't see meaningful gains.
Skipping the SSH host key check (StrictHostKeyChecking=no). Marginal; we want the security check anyway.
Parallel inventory plugins. Our dynamic inventory was fine; querying AWS for hosts wasn't the bottleneck.
Disabling encryption. Only relevant for slow networks; ours weren't the bottleneck.
Switching connection plugin. Tried mitogen (a faster connection plugin). Real speedups but added operational complexity (occasional weird issues). We haven't standardized on it.
Mitogen is a connection plugin that claims significant speedups. Our experience:
We use it for one specific high-volume playbook where the speedup matters. For our other playbooks, the standard SSH plugin with multiplexing is fast enough.
A few habits that keep playbooks fast:
Profile occasionally. Run with ANSIBLE_CALLBACKS_ENABLED=profile_tasks to see per-task duration. Surfaces slow tasks; informs where to focus.
Test on representative hosts. A playbook tested on 5 hosts runs differently on 100. Periodically test at scale.
Watch for new long tasks. When someone adds a task that takes 30 seconds, the cumulative impact at scale is real. Code review checks for this.
Don't add features just because. Adding "while we're here" features to a playbook bloats runtime. Each task earns its place.
For our largest playbook, after all optimizations:
Most of the speedup came from SSH multiplexing + increased forks + skipping fact gathering. These three would have been ~70% of the win on their own.
SSH multiplexing first. Free 3x speedup; nothing else compares.
Increase forks. Default 5 is too conservative for most modern infrastructure.
Skip gather_facts when not needed. A surprising amount of plays don't actually use facts.
Make tasks idempotent. Skipped no-op tasks are the fastest tasks.
Profile to find bottlenecks. Don't optimize blindly.
Combine related tasks. Each task has overhead; fewer tasks = less overhead.
Ansible playbook performance compounds: a slow playbook is run less often; less practice means less reliability when you need it; the team avoids running it. Fast playbooks become routine; routine playbooks stay reliable. The optimizations above are mostly straightforward; the discipline is in applying them and watching for regressions over time.
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.
Embed cost ownership in engineering: tags, budgets, and showback.
Explore more articles in this category
Backups are easy. Restores are hard. The quarterly drill we run, what's failed during it, and the discipline that makes "we have backups" actually mean something.
Replication is the foundation of database HA. What we monitor, how we practice failover, and the gotchas that show up only when you actually fail over.
Why Postgres connection limits bite at unexpected times, the pooling layer we put in front, and the pool-mode tradeoffs we learned the hard way.