Concrete systemd unit patterns that reduced flakiness: restart policies, resource limits, and structured logs.
After a few painful outages caused by homemade init scripts, we moved everything to systemd and wrote down the patterns that worked.
We had a service that occasionally failed to bind its port on boot.
```ini [Unit] Description=API service After=network-online.target Wants=network-online.target
[Service] ExecStart=/usr/local/bin/api Restart=on-failure RestartSec=5
[Install] WantedBy=multi-user.target ```
We saw file descriptor exhaustion during load tests.
When something goes wrong, we start with:
Systemd didn’t fix our code, but it made failures predictable and repeatable.
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
A different angle on DR: the planning process — RTO/RPO conversations, dependency mapping, and what we learned about prioritizing what to recover.
A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.
Explore more articles in this category
We started using eBPF tooling for ad-hoc production debugging six months ago. Three real incidents where it cut investigation time from hours to minutes.
Three production OOM incidents that taught us how kubelet, containerd, and the kernel actually decide which process dies. With debugging commands you'll wish you had earlier.
We migrated 47 cron jobs to systemd timers across our fleet. The mechanical conversion was easy. The interesting parts were the bugs we found that cron had been hiding.