We migrated 47 cron jobs to systemd timers across roughly 200 hosts. The mechanical conversion was a few weeks of Ansible work. The interesting part was what the migration forced us to confront: the bugs cron had been quietly hiding.
Three concrete reasons, in order of how often they bit us: silent failures, racing duplicate jobs, and timezone drift. More on each below.

The baseline problem was visibility. MAILTO= was empty on most boxes, so failing jobs were invisible until something downstream broke, and grep CRON /var/log/syslog is not a monitoring strategy.

Every cron job became two systemd units: a .service (what to run) and a .timer (when to run it). It feels heavier at first; it's worth it.
# /etc/systemd/system/db-backup.service
[Unit]
Description=Nightly database backup to S3
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
User=backup
Group=backup
EnvironmentFile=/etc/db-backup.env
ExecStart=/usr/local/bin/db-backup.sh
StandardOutput=journal
StandardError=journal
SyslogIdentifier=db-backup
TimeoutStartSec=2h
# /etc/systemd/system/db-backup.timer
[Unit]
Description=Run db-backup nightly at 02:30
[Timer]
OnCalendar=*-*-* 02:30:00
Persistent=true
RandomizedDelaySec=15min
AccuracySec=1min
Unit=db-backup.service
[Install]
WantedBy=timers.target
Enable and start:
systemctl daemon-reload
systemctl enable --now db-backup.timer
systemctl list-timers --all
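A habit we picked up along the way: systemd-analyze verify catches typos and dangling references in unit files before the daemon-reload, something cron never offered:

```
$ systemd-analyze verify /etc/systemd/system/db-backup.service /etc/systemd/system/db-backup.timer
```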
Reason one: silent failures. A log-rotation job exited 1 every night because of a permission issue introduced during a hardening sweep. MAILTO was unset on that host, so nothing got mailed and the job just kept failing. Disk usage crept up, but nobody noticed because the alert threshold was 90% and we were at 84%.
The systemd version surfaced this immediately:
$ systemctl status log-rotate.service
● log-rotate.service - Log rotation
Active: failed (Result: exit-code) since Mon 2026-03-09 03:00:01
Plus we wired OnFailure=alert@%n.service into every service to ship each failed unit to PagerDuty.
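The alert hook is a templated unit. A sketch, where notify-pagerduty.sh is a hypothetical wrapper around whatever paging API you use; the instance name after the @ carries the failed unit's name:

```
# /etc/systemd/system/alert@.service
[Unit]
Description=Failure alert for %i

[Service]
Type=oneshot
# %i expands to the instance name (the failed unit), %H to the hostname
ExecStart=/usr/local/bin/notify-pagerduty.sh "unit %i failed on %H"
```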
Reason two: racing duplicates. Two crons were rotating the same log via different scripts that nobody knew about. They had been racing for years; the loser silently corrupted partial .gz files about once a week. We caught it during the migration, when writing an explicit unit for every job put the two scripts side by side and made the duplication impossible to miss.
Reason three: timezone drift. One box's cron ran in the user's local timezone (CDT) while everything else ran in UTC. Note that OnCalendar= is interpreted in the host's local timezone by default; to pin a spec you append an explicit zone, as in OnCalendar=*-*-* 02:30:00 UTC. Migration forced us to canonicalize: every timer line now ends in UTC, removing the ambiguity.
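Canonicalizing the schedules themselves was mostly mechanical. A minimal sketch of the kind of helper we used during the conversion (hypothetical; it only handles plain M H * * * entries and assumes fields without leading zeros):

```shell
#!/usr/bin/env bash
# Turn the minute/hour fields of a simple daily cron entry into an
# OnCalendar spec pinned to UTC; anything fancier got converted by hand.
cron_to_oncalendar() {
  local min hour rest
  read -r min hour rest <<<"$1"
  printf '*-*-* %02d:%02d:00 UTC\n' "$hour" "$min"
}

cron_to_oncalendar "30 2 * * *"   # -> *-*-* 02:30:00 UTC
```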
The [Timer] block we standardized on for periodic jobs:
[Timer]
OnCalendar=hourly
Persistent=true # run on next boot if we missed it while down
RandomizedDelaySec=10min # spread load across the fleet
AccuracySec=1min # default 1min is fine; tighten only if you need it
Persistent=true was the most underrated win. If the host was off for maintenance, the missed run fires once on boot. Cron has nothing like this without anacron, which is itself a third file to manage.
Retries were another win. For jobs that are safe to re-run:
[Service]
Type=oneshot
ExecStart=/usr/local/bin/sync-secrets.sh
Restart=on-failure
RestartSec=30s
# But: don't restart forever
StartLimitIntervalSec=10min
StartLimitBurst=3
If the script fails, systemd retries up to 3 times in 10 minutes, then gives up and leaves the unit in failed state for the alert hook.
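Once the underlying problem is fixed, the failed state has to be cleared (by hand or by remediation tooling) so the unit's history stays meaningful:

```
$ systemctl reset-failed sync-secrets.service
```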
Resource controls are something cron has no equivalent of at all. We add them to every job that touches CPU or memory:
[Service]
CPUQuota=50%
MemoryMax=512M
TasksMax=128
IOWeight=50
A backup job that used to spike load to 12 now stays under 4.
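We roll the limits out as drop-ins so they can be managed separately from the unit file itself; a sketch (values as above, path conventional):

```
# /etc/systemd/system/db-backup.service.d/limits.conf
[Service]
CPUQuota=50%
MemoryMax=512M
```

A systemctl daemon-reload is needed after adding or changing a drop-in.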
Chaining was the last piece: a verify job that must run only after the backup. Schedule a timer for the verify unit; starting it pulls the backup in first.

# /etc/systemd/system/db-backup-verify.service
[Unit]
Description=Verify last night's database backup
Requires=db-backup.service
After=db-backup.service

[Service]
Type=oneshot
# hypothetical verify script; substitute your own
ExecStart=/usr/local/bin/db-backup-verify.sh

For chains we prefer Requires= + After= + Type=oneshot over chaining timers. Easier to reason about.
Honest downsides: crontab -e is one command, while the systemd equivalent is vim /etc/systemd/system/foo.timer && systemctl daemon-reload. We hide this in Ansible. And crontab -l is universal, while systemctl list-timers is still unfamiliar to some sysadmins.

The Ansible glue:

- name: Install systemd timer + service
  copy:
    src: "{{ item }}"
    dest: "/etc/systemd/system/{{ item | basename }}"
    mode: "0644"
  loop:
    - "files/{{ job_name }}.service"
    - "files/{{ job_name }}.timer"
  notify:
    - daemon reload
    - enable timer

handlers:
  - name: daemon reload
    systemd: { daemon_reload: yes }
  - name: enable timer
    systemd:
      name: "{{ job_name }}.timer"
      enabled: yes
      state: started
A quick one-off job on a personal box? Keep cron. For everything else — production hosts, a fleet at any scale, anything you'd want to debug at 3am — systemd timers pay for themselves the first time something fails.
The checklist we now apply to every job:
- StandardOutput=journal and a SyslogIdentifier= so journalctl -u name -f works.
- Persistent=true for periodic jobs unless you have a specific reason not to.
- Resource limits (CPUQuota, MemoryMax). Future-you will thank present-you.
- OnFailure= wired to your alerting stack. Silent failures are the only failures that matter.
- UTC in OnCalendar=.