A condensed checklist of the systemd unit-file patterns we now use everywhere, with the production reasons each one matters.

Operational Checklist: Systemd Service Reliability Patterns

Every Linux service we run is managed by systemd. After years of accumulating war stories, our standard unit file template has stabilized around a small set of patterns. This is the checklist version: each item is something we've added because of a specific production incident, with a brief explanation of what it prevents.

This isn't a tutorial on systemd basics. It assumes you know what [Unit], [Service], and [Install] are.

The template, annotated #

ini.ini

[Unit]
Description=My Service
After=network-online.target
Wants=network-online.target

# Start rate limiting (these are [Unit] settings)
StartLimitIntervalSec=10min
StartLimitBurst=3

# Failure hook (also a [Unit] setting)
OnFailure=alert-pagerduty@%n.service

[Service]
Type=exec
User=myservice
Group=myservice
WorkingDirectory=/opt/myservice

EnvironmentFile=/etc/myservice.env
ExecStart=/opt/myservice/bin/myservice

# Restart behaviour
Restart=on-failure
RestartSec=5

# Resource limits
LimitNOFILE=65536
LimitNPROC=4096
MemoryMax=2G
CPUQuota=200%
TasksMax=512

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=myservice

# Hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ReadWritePaths=/var/lib/myservice /var/log/myservice
ProtectHome=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true
RestrictRealtime=true
RestrictNamespaces=true
RestrictSUIDSGID=true
LockPersonality=true

# Watchdog (if the service supports sd_notify)
WatchdogSec=30
NotifyAccess=main

[Install]
WantedBy=multi-user.target

The rest of this post explains each block in turn.

`After=network-online.target` and `Wants=` paired #

code

After=network-online.target
Wants=network-online.target

Without this, services that need DNS or external connectivity often start before the network is ready, fail their first connection, and crash-loop briefly. We had a service that took 30 seconds to come up after every reboot because it failed its initial DNS lookup, restarted, and only then succeeded.

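Both lines are needed. Wants= declares the dependency; After= orders the units. After= on its own doesn't pull network-online.target into the boot sequence, so there is nothing to wait for. The target is also only meaningful when the host's network manager ships a wait-online service (systemd-networkd-wait-online.service or NetworkManager-wait-online.service) and that service is enabled.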

`Type=exec` not `Type=simple` #

code

Type=exec

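Type=simple (the default) considers the service started as soon as the service process has been forked, before the binary has even been executed. Type=exec waits until execve() has succeeded. The difference matters when the binary can't be launched at all (missing file, wrong permissions, nonexistent User=): with simple, systemctl start reports success and the unit fails a moment later. With exec, the start job itself fails and the error is visible.

This is what makes systemctl start myservice && echo started behave correctly. A service that launches and then crashes during its own initialization is still reported as started under either type; only Type=notify can catch that.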
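To make the fork/exec distinction concrete, here is a minimal Python sketch of the two steps a service manager performs. It is an illustration, not systemd's implementation; the helper names are made up for this example. The only systemd-specific detail is the exit status 203, which systemd reports as 203/EXEC when the exec step fails.

```python
import os

EXIT_EXEC = 203  # exit status systemd reports as "203/EXEC" when exec fails


def spawn(argv):
    """Fork, then exec argv in the child; return the child's PID to the parent."""
    pid = os.fork()
    if pid == 0:
        try:
            os.execv(argv[0], argv)  # on success this never returns
        finally:
            os._exit(EXIT_EXEC)  # reached only if execv raised
    return pid  # fork() succeeded, whether or not the exec will


def exit_code(pid):
    """Wait for the child and return its exit status."""
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

spawn() returns a valid PID even for a path that doesn't exist, which is the point at which Type=simple reports success. The failure only shows up afterwards in exit_code(), which is the point Type=exec waits for.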

`User=` and `Group=` always #

code

User=myservice
Group=myservice

Never run as root. The user is created at install time. If the service needs to bind to a privileged port, we use AmbientCapabilities=CAP_NET_BIND_SERVICE rather than running as root.

Catching root processes in systemd-cgls is one of the easier audits we do periodically. Anything we find gets fixed.

Restart, but bounded #

code

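[Unit]
StartLimitIntervalSec=10min
StartLimitBurst=3

[Service]
Restart=on-failure
RestartSec=5

This restarts the service on failure (after a 5-second delay), but not indefinitely. If the service crashes 3 times in 10 minutes, systemd refuses the next start and leaves the unit failed. The two StartLimit* settings are documented in systemd.unit(5) and belong in [Unit]. StartLimitIntervalSec= placed in [Service] is ignored with an "Unknown key name" warning, so the window silently falls back to the manager default (10 seconds unless changed), which a 5-second RestartSec= never trips.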

Why bound it: a service that's misconfigured will crash-loop forever otherwise, generating logs faster than your aggregation can handle, hammering the DB on each startup attempt, and never alerting anyone because "it's running fine — for 4 seconds at a time."

Three start attempts in 10 minutes is enough to ride out transient issues (network blip, brief resource pressure) while stopping quickly on genuine config errors.
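The arithmetic behind "3 times in 10 minutes" can be sketched as a small model. This is a simplification under stated assumptions: it treats the limit as a fixed window that opens at the first start and resets once the interval has passed, and the function name is invented for this example.

```python
def first_refused_start(start_times, burst=3, interval=600.0):
    """Return the time of the first start attempt that would be refused, or None.

    Simplified model of StartLimitBurst=/StartLimitIntervalSec=: a window opens
    at the first start, allows up to `burst` starts, and resets once `interval`
    seconds have passed since it opened.
    """
    window_begin = None
    count = 0
    for t in start_times:
        if window_begin is None or t - window_begin > interval:
            window_begin, count = t, 1  # open a new window
        elif count < burst:
            count += 1  # still within the allowed burst
        else:
            return t  # burst exhausted inside the window
    return None
```

For a service that dies about 4 seconds after each start with RestartSec=5, starts land at roughly 0, 9, 18, and 27 seconds, and the model refuses the fourth. Two failures followed by a successful start never reach the limit.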

Resource limits #

code

LimitNOFILE=65536
LimitNPROC=4096
MemoryMax=2G
CPUQuota=200%
TasksMax=512

LimitNOFILE: file descriptors. The default soft limit is often 1024, which is too low for any service that handles many connections. We had a service drop incoming requests after exactly 1015 concurrent connections; it took half a day to diagnose.

MemoryMax: hard limit, OOM-killed if exceeded. Usually set to slightly above the service's expected working set. Prevents one runaway service from taking down the whole node.

CPUQuota=200% means up to 2 CPUs worth. Useful on multi-tenant nodes; not strictly necessary on single-purpose hosts.

TasksMax: hard cap on threads/processes. We hit this on a service that had a thread leak; the cap turned an unbounded leak into a controlled crash and alert.
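The 1015-connection ceiling is what a 1024-descriptor limit looks like from inside the process. A minimal sketch of that arithmetic, assuming one descriptor per connection; the function name and the count of nine other descriptors are assumptions chosen to match the figure above, not measurements from that incident.

```python
import resource


def max_connections(other_fds, soft_limit=None):
    """Connections that fit under RLIMIT_NOFILE at one descriptor each.

    other_fds is the number of descriptors already in use (stdio, listening
    socket, log files, and so on). Returns None when there is no limit.
    """
    if soft_limit is None:
        # Read the limit the current process is actually running with.
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return max(soft_limit - other_fds, 0)
```

max_connections(9, soft_limit=1024) gives 1015, and the same service under LimitNOFILE=65536 gets 65527. Calling it without soft_limit at service startup and logging the result is a cheap way to confirm the unit file's limit actually applied.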

Logging to journal, with identifier #

code

StandardOutput=journal
StandardError=journal
SyslogIdentifier=myservice

Sends both stdout and stderr to journald with a tag. journalctl -u myservice pulls them all. journalctl -t myservice works too.

Before this was standard, we had services writing to /var/log/myservice/myservice.log directly. Rotation was a separate config; sometimes it broke; sometimes the disk filled. The journald path is uniform across services and rotates automatically.
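One detail that makes plain stdout/stderr logging more useful: with the default SyslogLevelPrefix= behaviour, systemd reads a leading "<N>" on each line as a syslog priority and strips it. A minimal sketch of a helper that uses this; the function names are made up for this example.

```python
import sys

# Syslog priority numbers, as used by sd-daemon(3) prefixes.
ERR, WARNING, INFO, DEBUG = 3, 4, 6, 7


def format_line(priority, message):
    """Prefix every line of message with <priority> so journald can classify it."""
    lines = message.splitlines() or [""]
    return "\n".join(f"<{priority}>{line}" for line in lines)


def log(priority, message, stream=sys.stderr):
    """Write a prioritized message to the stream that systemd sends to the journal."""
    print(format_line(priority, message), file=stream, flush=True)
```

With this in place, journalctl -u myservice -p err shows only the lines logged at error priority or worse.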

Hardening: the namespace and protection settings #

code

NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ReadWritePaths=/var/lib/myservice /var/log/myservice
ProtectHome=true
ProtectKernelTunables=true
ProtectKernelModules=true
ProtectControlGroups=true

These are systemd's built-in process isolation features. Each one closes a class of attack:

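NoNewPrivileges: a compromised process can't escalate via setuid binaries.
PrivateTmp: each service gets its own private /tmp and /var/tmp, so one service can't read or tamper with another's temp files.
ProtectSystem=strict: the entire filesystem hierarchy is mounted read-only for the service, apart from /dev, /proc, and /sys. We then explicitly list the paths it can write to via ReadWritePaths.
ProtectHome: /home, /root, and /run/user are inaccessible to the service and appear empty.
ProtectKernelTunables, ProtectKernelModules, ProtectControlGroups: the service can't change kernel tunables, load modules, or modify the cgroup hierarchy.
RestrictRealtime, RestrictNamespaces, RestrictSUIDSGID, LockPersonality (in the full template): the service can't acquire realtime scheduling, create new namespaces, create setuid/setgid files, or change its execution domain.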

Adopting ProtectSystem=strict required us to enumerate every directory each service legitimately wrote to. Along the way we discovered three services were writing to /var/lib/dpkg (wrong) and one was writing to /etc/cron.d (definitely wrong). We fixed all four during the audit.
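The enumeration step can be partly automated once you have a list of paths a service was seen writing to. A minimal sketch, assuming that list already exists (how it is collected, for example from audit logs or tracing, is outside this example); the function name is hypothetical. It does not account for /tmp under PrivateTmp= or the API filesystems that stay writable.

```python
from pathlib import PurePosixPath


def disallowed_writes(observed_paths, read_write_paths):
    """Return the observed paths that fall outside every ReadWritePaths= entry."""
    allowed = [PurePosixPath(p) for p in read_write_paths]
    violations = []
    for raw in observed_paths:
        path = PurePosixPath(raw)
        # A write is allowed if it targets an entry itself or anything below it.
        if not any(path == root or root in path.parents for root in allowed):
            violations.append(raw)
    return violations
```

Comparing path components rather than string prefixes matters here: /var/lib/myservice-old is not inside /var/lib/myservice, even though the strings share a prefix.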

The watchdog (when applicable) #

code

WatchdogSec=30
NotifyAccess=main

For services that integrate with sd_notify, the watchdog kills (and restarts via the Restart= rule) any service that hasn't reported alive within 30 seconds. Useful for catching a process that's hung in a deadlock — the kernel sees the process as alive, but it's not making progress.

Most of our Go services use https://pkg.go.dev/github.com/coreos/go-systemd/daemon to send watchdog pings every 10 seconds.
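For services not written in Go, the notification protocol is small enough to sketch with the standard library: systemd passes a Unix datagram socket path in NOTIFY_SOCKET and the watchdog interval in WATCHDOG_USEC. The following Python sketch is an illustration of that protocol, not the go-systemd code; the function names are invented for this example.

```python
import os
import socket


def sd_notify(state, env=os.environ):
    """Send one notification datagram to systemd; return False if not under systemd."""
    path = env.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading "@" denotes a Linux abstract-namespace socket.
    address = b"\0" + path[1:].encode() if path.startswith("@") else path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode(), address)
    return True


def ping_interval(env=os.environ):
    """Seconds between WATCHDOG=1 pings: half of WatchdogSec=, or None if disabled."""
    usec = env.get("WATCHDOG_USEC")
    return int(usec) / 2 / 1_000_000 if usec else None
```

With WatchdogSec=30, ping_interval() returns 15.0; pinging every 10 seconds as described above is simply more conservative. Calling sd_notify("WATCHDOG=1") from the loop that does the real work, rather than from a separate timer thread, is what lets a deadlocked loop stop pinging and get caught.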

OnFailure for alerting #

code

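[Unit]
OnFailure=alert-pagerduty@%n.service

When the service enters the failed state, this triggers a separate template unit, alert-pagerduty@.service, which sends a PagerDuty event. OnFailure= is a [Unit] setting; in [Service] it is ignored. %n expands to the full name of the failed unit (myservice.service), which the template receives as its instance name, %i.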

The alert template is:

ini.ini

[Unit]
Description=Page on failure of %i

[Service]
Type=oneshot
ExecStart=/usr/local/bin/page-pagerduty.sh %i

This means we get paged as soon as any service enters the failed state, with the failed unit's name in the alert. Unit failures need no separate monitoring; the OS itself is the alerting source.
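The contents of page-pagerduty.sh aren't shown here. As a rough idea of what such a hook has to do, this Python sketch builds and posts a trigger event in the shape the PagerDuty Events API v2 expects. The routing key, the dedup key format, and the function names are assumptions for illustration.

```python
import json
import socket
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_event(unit, routing_key, host=None):
    """Build a trigger event for a failed unit (unit is the %i passed by systemd)."""
    host = host or socket.gethostname()
    return {
        "routing_key": routing_key,  # integration key; placeholder in this sketch
        "event_action": "trigger",
        "dedup_key": f"{host}/{unit}",  # assumed format: one open incident per host and unit
        "payload": {
            "summary": f"{unit} entered failed state on {host}",
            "source": host,
            "severity": "critical",
        },
    }


def send_event(event, timeout=10):
    """POST the event and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping payload construction separate from the network call makes the hook easy to test without paging anyone.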

Common mistakes we still see #

Issues people still bring to the team:

Service runs fine manually but fails under systemd. Almost always an environment variable issue. The shell has env vars systemd doesn't. Use EnvironmentFile= for everything; never rely on inherited env.

Service refuses to start after daemon-reload. Usually a unit file syntax error that daemon-reload didn't report. Run systemd-analyze verify /etc/systemd/system/myservice.service to surface it.

Restart=always causing crash loops to go unnoticed. Don't use Restart=always. Use Restart=on-failure and bound it with StartLimitBurst.

Permissions issues on hardened services. ProtectSystem=strict means most dirs are read-only. Check the service's actual write paths and explicitly allow them via ReadWritePaths.
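For the first item in the list above, a quick way to find the culprit is to compare the variables the service needs against what the EnvironmentFile= actually provides. A minimal sketch, assuming a simple KEY=VALUE file with optional quotes and comment lines; it does not handle every quoting rule systemd supports, and the function names are hypothetical.

```python
def parse_env_file(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # or ; comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;" or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Strip one pair of matching surrounding quotes.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        env[key.strip()] = value
    return env


def missing_from_env_file(shell_env, env_file_text, required):
    """Required variables that the shell has but the env file does not provide."""
    provided = parse_env_file(env_file_text)
    return sorted(
        name for name in required if name in shell_env and name not in provided
    )
```

Passing dict(os.environ) from the shell where the service works, plus the contents of /etc/myservice.env, lists the variables that exist only in the interactive session.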

What we do once per quarter #

Every quarter, one of us runs a script that audits every unit file on our fleet against this template. The audit reports:

Services running as root (we should have zero)
Services without NoNewPrivileges=true
Services without resource limits
Services without OnFailure=

The drift is small but real — new services occasionally ship without the full template. The quarterly audit catches them before they cause an incident.
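Our audit script isn't reproduced here, but the four checks can be sketched in a few lines of Python. This is an illustration under stated assumptions: it reads a single unit file's text, ignores drop-in overrides and line continuations, and looks for OnFailure= in [Unit], which is where systemd reads it. The function names and the set of keys counted as "resource limits" are choices made for this example.

```python
RESOURCE_KEYS = ("MemoryMax", "CPUQuota", "TasksMax", "LimitNOFILE")
TRUTHY = ("1", "yes", "true", "on")


def parse_unit(text):
    """Parse unit file text into {section: {key: last value}}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
        elif "=" in line and current is not None:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return sections


def audit_unit(text):
    """Return a list of findings for one unit file, empty if it matches the template."""
    unit = parse_unit(text)
    service = unit.get("Service", {})
    findings = []
    if service.get("User", "root") in ("root", "0"):
        findings.append("runs as root")
    if service.get("NoNewPrivileges", "").lower() not in TRUTHY:
        findings.append("NoNewPrivileges not enabled")
    if not any(key in service for key in RESOURCE_KEYS):
        findings.append("no resource limits")
    if "OnFailure" not in unit.get("Unit", {}):
        findings.append("no OnFailure=")
    return findings
```

For effective values including drop-ins, querying systemctl show on a live host is more reliable than parsing files; the file-based version is mainly useful for checking unit files before they ship.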

What we don't bother with #

Type=notify for services that don't natively support it. Wrapping a non-sd_notify-aware binary just to use notify mode is more trouble than it's worth.

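Per-service seccomp filters via SystemCallFilter=. Powerful, but the failure modes are subtle (a single missed syscall and the service silently dies). We rely instead on the narrower built-in settings from the template: RestrictRealtime=, RestrictNamespaces=, RestrictSUIDSGID=, and LockPersonality= each install a small, targeted seccomp restriction without us maintaining a syscall list. The Protect* settings work through mount namespaces and capabilities rather than seccomp.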

Sandboxing via RootDirectory= to chroot the service. Useful for genuinely untrusted code; overkill for our own services where the threat model is "compromised dependencies," not "the binary is hostile."

What this template doesn't replace #

Health checks (the service still needs to expose /healthz and Kubernetes/etc still need to consume it).

Application logging (journald is good for systemd-level events; structured app logging usually still goes to a separate aggregator).

Deployment automation. systemctl restart myservice after replacing the binary is the simple case; for safer rollouts you still need a deployer that watches health.

The systemd template is the foundation. Everything above it (k8s, prometheus, deployment tooling) builds on assumed-good service-level reliability. Get this layer right and the rest gets easier.

About Kiril Urbonas