A team of 30 engineers all editing the same monolithic Ansible repo doesn't work. Here's the role taxonomy and review process that did.


Ansible Role Design for Large Teams

The team I worked with had ~30 engineers contributing to a single Ansible repo that managed about 200 hosts. The repo had grown organically. By the time we did the cleanup, it had four "main.yml" entry points, includes nested eleven layers deep in places, and roles that silently depended on variables defined in other roles. Onboarding a new engineer took half a day. Production changes routinely had unintended side effects.

We refactored the structure over a quarter. Six months later, onboarding is a 20-minute conversation, side effects are rare, and PRs land cleanly. The shape we ended up with is below.

The taxonomy #

Three categories of role:

Foundational: things every host of a given OS needs (NTP, base packages, SSH config, firewall baseline). Owned by the platform team.
Service-specific: configures a specific application (postgres, nginx, our internal payment-worker). Owned by the team that operates that service.
Composition (we call them "host roles"): wires foundational + service roles together for a specific host class. Owned by whichever team operates that host class.

Each category has different rules.

Foundational roles #

These are stable, change rarely, and need careful review when they do. We have about 12 foundational roles:

base-os (NTP, hosts file, base packages, sudoers)
ssh-config (sshd hardening)
firewall (iptables/nftables baseline)
monitoring-agent (Prometheus node_exporter + log shipper)
security-baseline (CIS-aligned hardening)
time-zone
audit-logging
certificate-store
... and a few more

Properties of foundational roles:

They never call out to other roles (no import_role of service-specific stuff)
They expose well-defined variables for customization
They have idempotency tests (Molecule-based) that run in CI
Any change requires PR approval from the platform team

The platform team owns the review queue for these. PRs from service teams that touch foundational roles get re-routed automatically.
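As a rough illustration of "well-defined variables for customization", a foundational role's defaults file might look like the sketch below. Only firewall_allow_inbound appears elsewhere in this article; the other two variable names are hypothetical and exist only to show the shape of a documented interface.

```yaml
# roles/firewall/defaults/main.yml (illustrative sketch, not the real file)

# Extra inbound rules a composition role may pass in.
# An empty list means "baseline only".
firewall_allow_inbound: []

# Hypothetical knobs, assumed for this example.
firewall_default_policy: drop
firewall_log_dropped: false
```

Keeping every tunable in defaults means consumers can see the whole interface in one file and override any of it.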

Service-specific roles #

About 30 of these. Each handles one application or technology:

postgres (database installation, config, replication setup)
redis
nginx
payment-worker (our internal app)
kafka
...

These are owned by the team that operates the corresponding service. The postgres role is owned by the data team; the payment-worker role by the payments team; etc.

Service roles can depend on foundational roles, but not on other service roles. If two services need to coexist on a host, the composition role wires them up — the service roles themselves don't know about each other.

This decoupling has been the most valuable architectural choice. Before this, our redis role depended on our postgres role for some shared logic, which made changes to either ripple through the other. After splitting, you can change the redis role without thinking about postgres.
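A minimal sketch of what the dependency rule could look like in a service role's metadata, assuming the role declares its foundational dependency in meta/main.yml (the article does not say whether service roles do this or leave it entirely to the composition role):

```yaml
# roles/redis/meta/main.yml (illustrative sketch)
dependencies:
  # Allowed: a foundational role.
  - role: base-os

  # Not allowed under the taxonomy: another service role.
  # - role: postgres
```

By default Ansible runs a role only once per play when it is listed several times with the same parameters, so declaring base-os here and again in a composition role should not apply it twice.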

Composition (host) roles #

These are the "wire it all together" roles. Each represents a class of host:

host-database-primary
host-database-replica
host-app-server
host-bastion
host-monitoring
...

Each composition role's meta/main.yml lists its dependencies:

yaml.yaml

dependencies:
  - role: base-os
  - role: ssh-config
  - role: firewall
    vars:
      firewall_allow_inbound:
        - port: 5432
          from: app-servers
        - port: 22
          from: bastion
  - role: monitoring-agent
  - role: postgres
    vars:
      postgres_role: primary
      postgres_replica_count: 2

Composition roles ARE allowed to know about specific service roles and pass variables to them. They're the integration layer. They ALSO contain very little logic of their own — they're essentially declarative wiring.

Result: if you want to know "what runs on a database primary host," you read one file: roles/host-database-primary/meta/main.yml. The whole stack is visible in one screen.

What this replaced #

Before this structure, our roles had grown into a snarl:

roles/postgres/tasks/main.yml had a section for "if this is also a redis host, set these tunables." Mixed concerns.
Variables were defined in group_vars/all.yml (some), inventory_vars/... (others), and roles/*/defaults/main.yml (still others). No clear precedence.
Some playbooks at the top level called individual roles directly; others called composition-style aggregator playbooks. Inconsistent entry points.

The cleanup mostly involved untangling these. Pull mixed concerns out of service roles into composition. Move variables to a single source of truth per concern. Standardize on a single playbook entry point pattern.
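One possible shape for a single entry point is a site.yml that maps each host group to exactly one composition role. This is a sketch under assumptions: the group names db-primary and app-servers are borrowed from names used elsewhere in this article, and the real inventory layout may differ.

```yaml
# site.yml (illustrative sketch; group names are assumptions)
- name: Database primaries
  hosts: db-primary
  become: true
  roles:
    - role: host-database-primary

- name: Application servers
  hosts: app-servers
  become: true
  roles:
    - role: host-app-server
```

With this pattern no playbook calls a foundational or service role directly; everything reaches a host through its composition role.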

Variable scoping #

We standardized on three variable sources, in order of precedence:

Inventory-level (inventory/group_vars/, inventory/host_vars/): environment-specific values (DB connection strings, hostnames, etc.)
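Composition role's vars: defaults specific to that host class. One caveat: Ansible itself ranks parameters passed to a role (such as vars: on a dependency) above inventory variables, so a value that an environment needs to override should not be hard-coded there.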
Service role's defaults (roles/<service>/defaults/main.yml): the lowest-priority defaults, for any unspecified value

We banned defining variables in service roles' vars/main.yml (which has higher precedence than defaults and inventory, and is harder to override). If a value should be overridable, it's a default.

This was a learn-as-we-went rule. The first month we hit several "why is this variable not what I set it to" issues, traced to vars-vs-defaults precedence. After standardizing, the surprises stopped.

Tagging conventions #

Every task has at least one tag, and we use a controlled vocabulary:

yaml.yaml

- name: Install postgres
  apt:
    name: postgresql-15
  tags: [install, postgres]

- name: Configure postgres
  template:
    src: postgresql.conf.j2
    dest: /etc/postgresql/15/main/postgresql.conf
  tags: [config, postgres]

- name: Apply OS-level tuning for postgres
  sysctl:
    name: vm.swappiness
    value: 1
  tags: [tuning, postgres, os]

Tags let us run partial plays:

bash.bash

# Just reconfigure, don't touch installation
ansible-playbook site.yml --limit db-primary --tags config

# Just OS-level tuning across all hosts
ansible-playbook site.yml --tags tuning

The controlled vocabulary (install, config, tuning, users, firewall, plus the service name) keeps tags meaningful. Without it, every team invents their own names and the tags become noise.
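When several related tasks share the same tags, a block keeps the vocabulary in one place, because tasks inside a block inherit its tags. The sketch below is hypothetical: the file name, the path, and the tasks themselves are assumptions chosen to show the users tag from the vocabulary.

```yaml
# roles/postgres/tasks/users.yml (hypothetical file name)
- name: Manage the postgres OS account and data directory
  tags: [users, postgres]
  block:
    - name: Ensure the postgres system user exists
      ansible.builtin.user:
        name: postgres
        system: true

    - name: Ensure the data directory is owned by postgres
      ansible.builtin.file:
        path: /var/lib/postgresql   # assumed path
        state: directory
        owner: postgres
        group: postgres
        mode: "0755"
```

Running with --tags users would then select both tasks without tagging each one individually.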

Code review process #

PRs touching Ansible go through one of three review paths:

Foundational role change: requires platform team approval. They know the blast radius.
Service role change: requires the owning team's approval. They know their service.
Composition role change: requires the host-class owning team's approval. Often the same as the service team, but not always.

Cross-cutting changes (e.g., updating a foundational role AND its consumers) require approvals from each affected team.

We have a CODEOWNERS file in the repo that auto-routes PRs:

code

/roles/base-os/                @platform-team
/roles/ssh-config/             @platform-team
/roles/postgres/               @data-team
/roles/payment-worker/         @payments-team
/roles/host-database-primary/  @data-team
/roles/host-app-server/        @app-platform-team

Without CODEOWNERS, PRs sat in review queues for days because nobody knew who should review.

Idempotency testing #

Every role has a Molecule test. Minimum: spin up a fresh container, apply the role, apply it again, assert no changes on second run.

yaml.yaml

# molecule/default/molecule.yml
scenario:
  test_sequence:
    - dependency
    - create
    - prepare
    - converge
    - idempotence
    - verify
    - destroy

The idempotence step is the one that catches the most bugs. A role that does something different on the second run — even something invisible like restarting a service unnecessarily — is a bug. Molecule fails CI if it happens.

Before this was standard, we had at least three production incidents caused by playbooks that triggered service restarts on every run. The team would assume "running Ansible to fix host X won't affect host Y", but a run that also covered host Y restarted its services even though nothing there had changed, because the role wasn't idempotent.
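For the converge and verify steps in the sequence above, a minimal pair of playbooks might look like the following. These are two separate files shown in one block. It is a sketch, assuming the scenario lives inside the postgres role and that Molecule can resolve the role by name (depending on the Molecule version, that may require setting the roles path explicitly). The verify check is deliberately simple and is not the article's actual test.

```yaml
---
# molecule/default/converge.yml (sketch)
- name: Converge
  hosts: all
  become: true
  roles:
    - role: postgres

---
# molecule/default/verify.yml (sketch)
- name: Verify
  hosts: all
  tasks:
    - name: Look up the rendered config file
      ansible.builtin.stat:
        path: /etc/postgresql/15/main/postgresql.conf
      register: postgres_conf

    - name: Assert the config file exists
      ansible.builtin.assert:
        that:
          - postgres_conf.stat.exists
```

The idempotence step needs no playbook of its own: Molecule re-runs converge and fails if any task reports a change.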

Common patterns we follow #

A few snippets that come up frequently:

Conditional task with when:

yaml.yaml

- name: Install Postgres 15
  apt:
    name: postgresql-15
  when: postgres_version | int == 15  # cast so a quoted "15" from inventory still matches

Better than mixing logic into task content. The condition is visible.

Handlers for restarts

yaml.yaml

- name: Configure postgres
  template:
    src: postgresql.conf.j2
    dest: /etc/postgresql/15/main/postgresql.conf
  notify: restart postgres

# in handlers/main.yml
- name: restart postgres
  systemd:
    name: postgresql
    state: restarted

Restarts only happen if the config actually changed. Avoids unnecessary disruption.

Vars per environment

yaml.yaml

# inventory/group_vars/production/postgres.yml
postgres_max_connections: 200
postgres_shared_buffers: 4GB

# inventory/group_vars/staging/postgres.yml
postgres_max_connections: 50
postgres_shared_buffers: 1GB

The role's defaults/main.yml has small-test values (e.g., postgres_max_connections: 20); production and staging override them.
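For completeness, the defaults file at the bottom of that chain might look like this sketch. The postgres_max_connections value is the one mentioned above; the postgres_shared_buffers value is an assumption, and postgres_version is taken from the when: example earlier.

```yaml
# roles/postgres/defaults/main.yml (illustrative sketch)
postgres_version: 15
postgres_max_connections: 20      # small-test value mentioned in the article
postgres_shared_buffers: 128MB    # assumed small-test value, not from the article
```

Because these live in defaults rather than vars, the group_vars files for production and staging override them without any extra wiring.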

What we don't bother with #

Custom modules. We considered writing custom Ansible modules for our internal tooling. The maintenance overhead isn't justified — a few command: tasks calling our internal CLIs is good enough.
Async tasks across long-running operations. We tried async: 600 poll: 30 for things like database backups; the failure modes were complex. Now we use a wrapper script invoked synchronously.
Vault for inventory secrets. We tried Ansible Vault for a while. Replaced with HashiCorp Vault + a runtime fetch. Less trouble.
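A command: task that wraps an internal CLI reports "changed" on every run unless told otherwise, which would trip the idempotence check. One way to keep such a task honest is changed_when, as in this sketch. The CLI path, its arguments, and the "registered" output string are all hypothetical.

```yaml
- name: Register host with internal tooling (hypothetical CLI and flags)
  ansible.builtin.command:
    argv:
      - /usr/local/bin/internal-cli
      - register
      - --host
      - "{{ inventory_hostname }}"
  register: internal_cli_result
  # Report a change only when the CLI says it did something (assumed output).
  changed_when: "'registered' in internal_cli_result.stdout"
```

This only works if the CLI itself is safe to re-run and its output distinguishes "did something" from "already done".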

What I'd tell a team starting #

Do the role taxonomy upfront. Three categories — foundational, service, composition — is sufficient. Don't let service roles grow tendrils into each other.

Set up CODEOWNERS the same week you set up the repo. Without it, review queues become tragedies.

Mandate Molecule idempotency tests on every role from day one. Adding them later is harder because you discover existing roles were never idempotent.

Variables: pick one precedence model and document it. Inventory > composition > defaults works for us. The exact choice matters less than the consistency.

The structural cleanup was a quarter of work. Worth it. Onboarding new engineers is now a brief conversation about the taxonomy, not a half-day spelunking expedition.


About Kiril Urbonas