A team of 30 engineers all editing the same monolithic Ansible repo doesn't work. Here's the role taxonomy and review process that did.
The team I worked with had ~30 engineers contributing to a single Ansible repo that managed about 200 hosts across our fleet. The repo had grown organically. By the time we did the cleanup, it had four "main.yml" entry points, includes nested eleven layers deep in places, and roles that referenced variables defined in other roles in non-obvious ways. Onboarding a new engineer took half a day. Production changes routinely had unintended side effects.
We refactored the structure over a quarter. Six months later, onboarding is a 20-minute conversation, side effects are rare, and PRs land cleanly. The shape we ended up with is below.
Three categories of role:
- Foundational roles
- Service roles (for example, payment-worker). Owned by the team that operates that service.
- Composition roles
Each category has different rules.
Foundational roles are stable, change rarely, and need careful review when they do. We have about 12 of them, including:
- base-os (NTP, hosts file, base packages, sudoers)
- ssh-config (sshd hardening)
- firewall (iptables/nftables baseline)
- monitoring-agent (Prometheus node_exporter + log shipper)
- security-baseline (CIS-aligned hardening)
- time-zone
- audit-logging
- certificate-store
Properties of foundational roles:
import_role of service-specific stuff)
The platform team owns the review queue for these. PRs from service teams that touch foundational roles get re-routed automatically.
We have about 30 service roles. Each handles one application or technology:
- postgres (database installation, config, replication setup)
- redis
- nginx
- payment-worker (our internal app)
- kafka
These are owned by the team that operates the corresponding service. The postgres role is owned by the data team, the payment-worker role by the payments team, and so on.
Service roles can depend on foundational roles, but not on other service roles. If two services need to coexist on a host, the composition role wires them up — the service roles themselves don't know about each other.
This decoupling has been the most valuable architectural choice. Before this, our redis role depended on our postgres role for some shared logic, which made changes to either ripple through the other. After splitting, you can change the redis role without thinking about postgres.
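As a rough sketch of that wiring, a composition role for a host that runs both services could pass each service role its own tuning, so neither service role needs to know the other exists. The role name host-cache-database, the redis_maxmemory variable, and the values shown are hypothetical and only illustrate the shape.

```yaml
# roles/host-cache-database/meta/main.yml
# Hypothetical composition role: the role name, redis_maxmemory, and the
# values below are illustrative and not taken from the real repo.
dependencies:
  - role: base-os
  - role: postgres
    vars:
      # Co-location tuning lives here, not inside the postgres role.
      postgres_shared_buffers: 2GB
  - role: redis
    vars:
      redis_maxmemory: 1gb
```

With this layout, a change to the redis role cannot break postgres unless the composition role itself is edited.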
Composition roles are the "wire it all together" roles. Each represents a class of host:
- host-database-primary
- host-database-replica
- host-app-server
- host-bastion
- host-monitoring
Each composition role's meta/main.yml lists its dependencies:
# roles/host-database-primary/meta/main.yml
dependencies:
  - role: base-os
  - role: ssh-config
  - role: firewall
    vars:
      firewall_allow_inbound:
        - port: 5432
          from: app-servers
        - port: 22
          from: bastion
  - role: monitoring-agent
  - role: postgres
    vars:
      postgres_role: primary
      postgres_replica_count: 2
Composition roles ARE allowed to know about specific service roles and pass variables to them. They're the integration layer. They ALSO contain very little logic of their own — they're essentially declarative wiring.
Result: if you want to know "what runs on a database primary host," you read one file: roles/host-database-primary/meta/main.yml. The whole stack is visible in one screen.
Before this structure, our roles had grown into a snarl:
- roles/postgres/tasks/main.yml had a section for "if this is also a redis host, set these tunables." Mixed concerns.
- Variables were spread across group_vars/all.yml (some), inventory_vars/... (others), and roles/*/defaults/main.yml (still others). No clear precedence.
The cleanup mostly involved untangling these: pull mixed concerns out of service roles into composition roles, move variables to a single source of truth per concern, and standardize on a single playbook entry point pattern.
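A single entry point in this style might look roughly like the sketch below: one play per host class, each applying exactly one composition role. The file name site.yml and the db-primary group appear in the commands later in this post; the db-replica and app-servers group names and the play layout are assumptions.

```yaml
# site.yml
# Hypothetical single entry point. Group names other than db-primary are
# assumed, and the real file may map groups to roles differently.
- name: Database primaries
  hosts: db-primary
  become: true
  roles:
    - role: host-database-primary

- name: Database replicas
  hosts: db-replica
  become: true
  roles:
    - role: host-database-replica

- name: Application servers
  hosts: app-servers
  become: true
  roles:
    - role: host-app-server
```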
We standardized on three variable sources, in order of precedence:
1. Inventory (inventory/group_vars/, inventory/host_vars/): environment-specific values (DB connection strings, hostnames, etc.)
2. Composition role variables: values a composition role passes to the roles it wires together
3. Role defaults (roles/<service>/defaults/main.yml): the lowest-priority defaults, for any unspecified value
We banned variables from being defined in service roles' vars/main.yml (which has higher precedence than defaults and is harder to override). If a value should be overridable, it's a default.
This was a learn-as-we-went rule. The first month we hit several "why is this variable not what I set it to" issues, traced to vars-vs-defaults precedence. After standardizing, the surprises stopped.
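The vars-versus-defaults surprise can be sketched with three files, shown here as separate YAML documents. In Ansible's precedence order, role vars rank above inventory group_vars and host_vars, while role defaults rank below them. The value 100 is made up; 20 and 200 are the figures used elsewhere in this post.

```yaml
---
# roles/postgres/vars/main.yml (the banned pattern; the value 100 is made up)
# Role vars outrank inventory group_vars and host_vars, so the production
# override further down would be silently ignored.
postgres_max_connections: 100
---
# roles/postgres/defaults/main.yml (the allowed pattern)
# Role defaults have the lowest precedence, so any inventory value wins.
postgres_max_connections: 20
---
# inventory/group_vars/production/postgres.yml
# Takes effect only when the role keeps the variable in defaults/main.yml.
postgres_max_connections: 200
```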
Every task has at least one tag, and we use a controlled vocabulary:
- name: Install postgres
  apt:
    name: postgresql-15
  tags: [install, postgres]

- name: Configure postgres
  template:
    src: postgresql.conf.j2
    dest: /etc/postgresql/15/main/postgresql.conf
  tags: [config, postgres]

- name: Apply OS-level tuning for postgres
  sysctl:
    name: vm.swappiness
    value: 1
  tags: [tuning, postgres, os]
Tags let us run partial plays:
# Just reconfigure, don't touch installation
ansible-playbook site.yml --limit db-primary --tags config
# Just OS-level tuning across all hosts
ansible-playbook site.yml --tags tuning
The controlled vocabulary (install, config, tuning, users, firewall, plus the service name) keeps tags meaningful. Without it, every team invents their own names and the tags become noise.
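One way to keep every task tagged without repeating the tag list is to tag a block, since tasks inside a block inherit its tags. This is a sketch rather than how the repo necessarily does it, and the pg_hba.conf.j2 template is a hypothetical addition.

```yaml
# roles/postgres/tasks/main.yml (sketch)
# Both tasks inherit the block's tags, so "--tags config" selects them.
- name: Postgres configuration
  tags: [config, postgres]
  block:
    - name: Configure postgres
      ansible.builtin.template:
        src: postgresql.conf.j2
        dest: /etc/postgresql/15/main/postgresql.conf

    # Hypothetical second template, included only to show tag inheritance.
    - name: Configure client authentication
      ansible.builtin.template:
        src: pg_hba.conf.j2
        dest: /etc/postgresql/15/main/pg_hba.conf
```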
PRs touching Ansible go through one of three review paths:
Cross-cutting changes (e.g., updating a foundational role AND its consumers) require approvals from each affected team.
We have a CODEOWNERS file in the repo that auto-routes PRs:
/roles/base-os/ @platform-team
/roles/ssh-config/ @platform-team
/roles/postgres/ @data-team
/roles/payment-worker/ @payments-team
/roles/host-database-primary/ @data-team
/roles/host-app-server/ @app-platform-team
Without CODEOWNERS, PRs sat in review queues for days because nobody knew who should review.
Every role has a Molecule test. Minimum: spin up a fresh container, apply the role, apply it again, assert no changes on second run.
# molecule/default/molecule.yml
scenario:
  test_sequence:
    - dependency
    - create
    - prepare
    - converge
    - idempotence
    - verify
    - destroy
The idempotence step is the one that catches the most bugs. A role that does something different on the second run — even something invisible like restarting a service unnecessarily — is a bug. Molecule fails CI if it happens.
Before this was standard, we had at least three production incidents caused by playbooks that triggered service restarts on every run. The team would assume "running ansible to fix host X won't affect host Y" and that turned out to be wrong because the role wasn't idempotent.
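A common way for a role to fail the idempotence step is a command task, which reports "changed" on every run unless told otherwise. The tasks below are hypothetical and only illustrate the pattern; pg_lsclusters is the cluster-listing command from Debian's postgresql-common package, and any read-only command behaves the same way.

```yaml
# Before: command tasks report "changed" on every run by default, so the
# second converge differs from the first and the idempotence step fails.
- name: List postgres clusters
  ansible.builtin.command: pg_lsclusters
  register: pg_clusters

# After: a read-only command is declared as never changing anything.
- name: List postgres clusters
  ansible.builtin.command: pg_lsclusters
  register: pg_clusters
  changed_when: false
```

For commands that do change the host, the creates and removes arguments of the command module let the task skip itself once its result already exists.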
A few snippets that come up frequently:
Conditional task with when:
- name: Install Postgres 15
  apt:
    name: postgresql-15
  when: postgres_version == 15
Better than mixing logic into task content. The condition is visible.
Handlers for restarts
- name: Configure postgres
  template:
    src: postgresql.conf.j2
    dest: /etc/postgresql/15/main/postgresql.conf
  notify: restart postgres

# in handlers/main.yml
- name: restart postgres
  systemd:
    name: postgresql
    state: restarted
Restarts only happen if the config actually changed. Avoids unnecessary disruption.
Vars per environment
# inventory/group_vars/production/postgres.yml
postgres_max_connections: 200
postgres_shared_buffers: 4GB
# inventory/group_vars/staging/postgres.yml
postgres_max_connections: 50
postgres_shared_buffers: 1GB
The role's defaults/main.yml has small-test values (e.g., max_connections: 20); production and staging override.
command: tasks calling our internal CLIs is good enough.
async: 600 poll: 30 for things like database backups; the failure modes were complex. Now we use a wrapper script invoked synchronously.
Do the role taxonomy upfront. Three categories (foundational, service, composition) is sufficient. Don't let service roles grow tendrils into each other.
Set up CODEOWNERS the same week you set up the repo. Without it, review queues become tragedies.
Mandate Molecule idempotency tests on every role from day one. Adding them later is harder because you discover existing roles were never idempotent.
Variables: pick one precedence model and document it. Inventory > composition > defaults works for us. The exact choice matters less than the consistency.
The structural cleanup was a quarter of work. Worth it. Onboarding new engineers is now a brief conversation about the taxonomy, not a half-day spelunking expedition.