We cut our largest playbook's runtime from 14 minutes to 4 minutes. The specific changes that mattered, plus the ones that didn't.

Ansible Playbook Optimization: Speed and Efficiency

We have ~30 Ansible playbooks; some run on tens of hosts, a few on hundreds. Slow playbooks have an outsized impact: they block deploys, slow down incident response, and discourage people from running them at all. After focused optimization on our largest playbook (a fleet-wide config update), we cut runtime from 14 minutes to 4. This post covers what worked, with a rough impact estimate per change. The per-change estimates are rough and overlap, so don't expect them to sum to the 10 minutes saved.

The starting point #

A playbook configuring 60 hosts: ~14 minutes. Tasks were a mix of package installation, config templating, service management, and validation.

For a playbook that needs to be re-run when configs change, 14 minutes is annoying. For something we'd want to run during incident recovery, it's too slow.

Change 1: SSH multiplexing #

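Without connection reuse, every Ansible task opens a fresh SSH connection. With ~80 tasks across 60 hosts, that's roughly 4,800 SSH handshakes.

Enable SSH multiplexing and pipelining in ansible.cfg. Stock Ansible already passes ControlMaster and ControlPersist in its default ssh_args, but a custom ssh_args replaces those defaults, and pipelining is off unless you enable it: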

ini.ini

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o ControlPath=/tmp/ansible-ssh-%h-%p-%r
pipelining = True

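ControlPersist=60s: SSH keeps the master connection open for 60 seconds after its last use, so subsequent tasks reuse it instead of negotiating a new one.

pipelining = True: sends the module to the remote interpreter over the existing SSH session instead of copying it to a temporary file first. When tasks use sudo, this requires that requiretty is not set in sudoers on the managed hosts.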

Combined: ~3x speedup for our playbook. From 14 min to ~5 min.

This is the biggest single lever for most slow playbooks. If yours is slow and you don't have these on, fix that first.
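If only some hosts can support pipelining (for example, because a few still set requiretty), the same switch can be set per group through inventory variables instead of globally. A minimal sketch, assuming a group named web and a hypothetical group_vars/web.yml file:

```yaml
# group_vars/web.yml (hypothetical path; "web" is an assumed inventory group)
# Enables pipelining only for hosts in this group, leaving the
# global ansible.cfg default untouched for everything else.
ansible_ssh_pipelining: true
```

Hosts outside the group keep whatever ansible.cfg specifies.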

Change 2: Disable fact gathering when not needed #

gather_facts: True (the default) collects ~500 facts about each host before any tasks run. Useful when tasks reference facts; wasteful when they don't.

yaml.yaml

- name: Update config files only
  hosts: web
  gather_facts: false
  tasks:
    - ...

For tasks that don't reference ansible_* variables, this skips the fact-gathering step. ~10-30 seconds saved per host on plays where facts aren't used.

For our playbook with 60 hosts, ~5 minutes saved.

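For plays that need only some facts, gather_subset restricts collection to the named subsets, for example gather_subset: ['network', 'hardware']. The minimal set is still gathered unless you exclude it with '!min'. This is faster than full gathering.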
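As a sketch of the subset approach, the play below gathers only the minimal set plus network facts, then reads one of them. The play name and the debug task are illustrative and not taken from our playbook:

```yaml
- name: Use network facts only
  hosts: web
  gather_facts: true
  gather_subset:
    - "!all"      # drop everything except the always-included minimal set
    - network     # then add back the network subset
  tasks:
    - name: Show the default IPv4 address (provided by the network subset)
      ansible.builtin.debug:
        var: ansible_facts.default_ipv4.address
```

Running with profiling enabled before and after is the simplest way to confirm how much a given subset actually saves on your hosts.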

Change 3: `serial` and parallel execution #

By default, Ansible runs each task on at most forks hosts at a time (default 5). For 60 hosts, only 5 are working at any moment, so every task needs 12 rounds to cover the fleet.

ini.ini

# ansible.cfg
[defaults]
forks = 25

Increase forks. We use 25, so each task covers our 60 hosts in 3 rounds instead of 12. That is enough parallelism without overwhelming our network or hammering the SSH connection limit.

For some plays, you specifically want serial execution (rolling restart):

yaml.yaml

- name: Rolling restart
  hosts: web
  serial: 5  # 5 hosts at a time

Or serial: "20%" for percentage-based.
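serial also accepts a list of batch sizes, which allows a canary-style rollout. A sketch, assuming a service named myapp; the batch sizes are illustrative:

```yaml
- name: Rolling restart with a canary batch
  hosts: web
  gather_facts: false
  serial:
    - 1        # canary: one host first
    - 5        # then five hosts
    - "20%"    # then 20% of the group per batch until all hosts are done
  max_fail_percentage: 0   # abort the play if any host in a batch fails
  tasks:
    - name: Restart the application service
      ansible.builtin.service:
        name: myapp        # assumed service name
        state: restarted
```

Keep in mind that serial deliberately limits parallelism, so it trades speed for safety; use it only on plays that need it.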

Increasing forks from 5 to 25: ~30-40% faster for our playbook.

Change 4: Reduce task count where possible #

Each task has fixed overhead (SSH command, return processing). Reducing task count reduces this overhead.

Combine related tasks where reasonable:

yaml.yaml

# Before: 3 tasks
- name: Create dir 1
  file: { path: /opt/app1, state: directory }
- name: Create dir 2
  file: { path: /opt/app2, state: directory }
- name: Create dir 3
  file: { path: /opt/app3, state: directory }

# After: 1 task
- name: Create dirs
  file: { path: "{{ item }}", state: directory }
  loop:
    - /opt/app1
    - /opt/app2
    - /opt/app3

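The older with_items style works too, though loop is the recommended form. A loop still runs the module once per item, so the saving comes from per-task scheduling overhead rather than from fewer remote calls. Even so, the single task with a loop is faster than 3 separate tasks.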

For our playbook, consolidating tasks saved ~1 minute.

Change 5: Batch package installs into one task #

Installing packages one at a time is slow (a separate task and a separate apt transaction per package):

yaml.yaml

# Before: separate tasks
- name: Install nginx
  apt: { name: nginx, state: present, update_cache: yes }
- name: Install postgresql-client
  apt: { name: postgresql-client, state: present }
- name: Install jq
  apt: { name: jq, state: present }

# After: one task
- name: Install packages
  apt:
    name: [nginx, postgresql-client, jq]
    state: present
    update_cache: yes

The single call installs all packages in one apt transaction. Faster than three separate calls.
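One further tweak, which we did not measure above: the apt module's cache_valid_time parameter skips the cache refresh when the cache was updated recently. A sketch, with the one-hour threshold as an arbitrary assumption:

```yaml
- name: Install packages, refreshing the apt cache at most once an hour
  ansible.builtin.apt:
    name:
      - nginx
      - postgresql-client
      - jq
    state: present
    update_cache: true
    cache_valid_time: 3600   # seconds; assumed threshold, tune to your needs
```

On repeated runs within the hour, the task goes straight to checking package state.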

Change 6: `async` for long-running tasks #

For tasks that take a long time (large package installs, compilations), async lets them run in the background:

yaml.yaml

- name: Slow setup script
  command: /opt/setup.sh
  async: 600   # 10 min timeout
  poll: 30     # check every 30s

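With poll greater than 0, Ansible starts the job in the background on the host and checks on it every poll seconds. The play still waits for the task to finish before moving to the next one; the benefit is that a long task isn't tied to one open SSH connection, so it survives connection timeouts.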

For tasks that don't depend on the slow one's output, poll: 0 runs it asynchronously and returns immediately:

yaml.yaml

- name: Kick off backup
  command: /opt/backup.sh
  async: 3600
  poll: 0
  register: backup_job

# ... other tasks ...

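- name: Wait for backup
  async_status:
    jid: "{{ backup_job.ansible_job_id }}"
  register: backup_status   # required so the until condition has a result to test
  until: backup_status.finished
  retries: 60               # 60 x 30s = 30 min; raise if the backup can use the full hour allowed by async
  delay: 30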

Useful for parallelizing things that would otherwise serialize.

Change 7: Use `delegate_to` to avoid wasted runs #

If a task should run once, not per-host:

yaml.yaml

- name: Update load balancer
  uri:
    url: https://lb-api/services
    method: POST
  delegate_to: localhost
  run_once: true

run_once: true makes the task run for only one host in the play, and delegate_to: localhost executes it on the control node. Without run_once, the task would run 60 times, once per host. When combined with serial, run_once applies once per batch.

Change 8: Cache facts #

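If you run multiple plays in one playbook, facts are gathered once per play by default. With a fact cache and gathering = smart, Ansible reuses cached facts instead of re-gathering them:

ini.ini

[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
# 2 hours (kept on its own line: ansible.cfg does not allow inline "#" comments)
fact_caching_timeout = 7200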

For playbooks that run frequently, cached facts are reused across runs until the timeout expires. For one-off runs, caching helps less.

Change 9: Idempotency - skip what doesn't need to change #

The biggest win is often not making Ansible faster, but making it do less.

Tasks that aren't idempotent run every time even if nothing has changed. Refactoring them to be properly idempotent means they no-op when state is correct:

yaml.yaml

# Bad: always reports changed
- name: Apply config
  shell: /opt/apply-config.sh

# Better: idempotent
- name: Apply config
  template:
    src: config.j2
    dest: /etc/myapp/config
  notify: restart myapp  # only restarts if template changed

Most modules (template, file, apt, systemd) are idempotent — they check current state and skip if matching. The shell and command modules aren't, by default.
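The notify in the example above only works if the play defines a handler with the same name. A minimal sketch of the full play, assuming the service is called myapp:

```yaml
- name: Apply config idempotently
  hosts: web
  gather_facts: false
  tasks:
    - name: Apply config
      ansible.builtin.template:
        src: config.j2
        dest: /etc/myapp/config
      notify: restart myapp      # queued only when the rendered file changed

  handlers:
    - name: restart myapp        # must match the notify string exactly
      ansible.builtin.service:
        name: myapp              # assumed service name
        state: restarted
```

Handlers run once at the end of the play, however many tasks notified them, so several config changes still cost a single restart.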

For tasks that legitimately need to be command/shell, add idempotency markers:

yaml.yaml

- name: Initial setup
  shell: /opt/setup.sh
  args:
    creates: /var/lib/myapp/.setup-done

The creates: ... arg makes the task skip if the file exists. Manual idempotency.
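When there is no marker file to key off, changed_when at least stops the task from reporting changed on every run, so handlers don't fire needlessly. The command itself still executes each time. A sketch, assuming (hypothetically) that the script prints the word changed only when it modifies something:

```yaml
- name: Apply config
  ansible.builtin.shell: /opt/apply-config.sh
  register: apply_result
  # Assumption: the script writes "changed" to stdout only when it modified something.
  changed_when: "'changed' in apply_result.stdout"
```

This makes the run report accurate; it does not reduce runtime unless the script itself exits early when nothing needs doing.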

What didn't pay off #

Some optimizations we tried that didn't help meaningfully:

Custom Ansible modules. Modules sometimes add overhead compared with raw shell, but that overhead is small, and we didn't see meaningful gains.

Skipping the SSH host key check (StrictHostKeyChecking=no). Marginal; we want the security check anyway.

Parallel inventory plugins. Our dynamic inventory was fine; querying AWS for hosts wasn't the bottleneck.

Disabling encryption. Only relevant for slow networks; ours weren't the bottleneck.

Switching to Mitogen. We tried Mitogen, which replaces Ansible's connection layer. The speedups were real, but it added operational complexity (occasional hard-to-diagnose issues), so we haven't standardized on it.

What about Mitogen? #

Mitogen for Ansible is enabled as a strategy plugin and replaces Ansible's connection layer; it claims significant speedups. Our experience:

Real speedup on heavy playbooks: ~2-3x faster.
Some odd failure modes (memory issues on certain types of tasks, occasional task hangs).
Maintenance: Mitogen has less developer activity than core Ansible, so fixes for some compatibility issues lag behind.

We use it for one specific high-volume playbook where the speedup matters. For our other playbooks, the standard SSH plugin with multiplexing is fast enough.

Operational discipline #

A few habits that keep playbooks fast:

Profile occasionally. Run with ANSIBLE_CALLBACKS_ENABLED=profile_tasks to see per-task duration. Surfaces slow tasks; informs where to focus.

Test on representative hosts. A playbook tested on 5 hosts runs differently on 100. Periodically test at scale.

Watch for new long tasks. When someone adds a task that takes 30 seconds, the cumulative impact at scale is real. Code review checks for this.

Don't add features just because. Adding "while we're here" features to a playbook bloats runtime. Each task earns its place.

Final state #

For our largest playbook, after all optimizations:

14 min → 4 min (~3.5x speedup)
60 hosts updated reliably
Standard CI gates run on every PR
Profiled monthly to catch regressions

Most of the speedup came from SSH multiplexing + increased forks + skipping fact gathering. These three would have been ~70% of the win on their own.

What I'd tell someone starting #

SSH multiplexing first. Free 3x speedup; nothing else compares.

Increase forks. Default 5 is too conservative for most modern infrastructure.

Skip gather_facts when not needed. A surprising number of plays don't actually use facts.

Make tasks idempotent. Skipped no-op tasks are the fastest tasks.

Profile to find bottlenecks. Don't optimize blindly.

Combine related tasks. Each task has overhead; fewer tasks = less overhead.

Ansible playbook performance compounds: a slow playbook is run less often; less practice means less reliability when you need it; the team avoids running it. Fast playbooks become routine; routine playbooks stay reliable. The optimizations above are mostly straightforward; the discipline is in applying them and watching for regressions over time.

About Kiril Urbonas