Shell scripts are the duct tape of operations. We have a few hundred in production — deployment helpers, cron jobs, runbooks, backup automation, ad-hoc tools. After enough years of debugging shell scripts at 3 AM, this is the working playbook: the patterns that make scripts survive contact with production reality, and the antipatterns we've stopped writing.
Before any pattern advice, the first question is: should this be a shell script at all?
Shell is great for:
Shell is bad for:
Our rule: if a shell script grows past ~150 lines or has more than ~3 levels of nested logic, it should probably be Python or Go. We've rewritten a half-dozen overlong shell scripts; the resulting Python was always more maintainable.
Every script starts with:
#!/usr/bin/env bash
set -euo pipefail
IFS=$'\n\t'
What each does:
#!/usr/bin/env bash: not #!/bin/sh. We want bash, and we want to find it via PATH (works on Mac and various Linux distros).
set -e: exit when a command fails. Without this, scripts continue past failures, doing weird things. Commands tested by if, while, && or || are exempt, which is what makes explicit handling possible.
set -u: error on undefined variables. Catches typos like $USRENAME instead of $USERNAME.
set -o pipefail: a pipeline fails if any command in it fails. Without this, cmd1 | cmd2 succeeds if cmd2 succeeds, even if cmd1 failed.
IFS=$'\n\t': word splitting on newlines and tabs only, not spaces. Unquoted expansions and loops are less likely to break on filenames with spaces.
These three lines prevent a huge class of shell scripting bugs. The cost is that you have to handle "expected failures" explicitly (more on this below).
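To make the pipefail behavior concrete, here is a minimal sketch that runs the same failing pipeline with and without the option. The helper name pipeline_status is ours, and false | cat simply stands in for "cmd1 fails, cmd2 succeeds".

```shell
#!/usr/bin/env bash
# Illustrative sketch: compare a pipeline's exit status with and without pipefail.
# `pipeline_status` is a hypothetical helper, not part of any tool.

pipeline_status() {
    # Run in a subshell so the option change does not leak to the caller.
    (
        set +e
        if [ "$1" = "pipefail" ]; then
            set -o pipefail
        else
            set +o pipefail
        fi
        false | cat          # first command fails, last command succeeds
        echo "$?"
    )
}

echo "without pipefail: $(pipeline_status default)"
echo "with pipefail:    $(pipeline_status pipefail)"
```

Without the option the pipeline reports 0; with it, the failure of the first command surfaces as 1.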
The most common shell bug is unquoted variables:
# Wrong
if [ $variable = "value" ]; then ...
# Right
if [ "$variable" = "value" ]; then ...
Unquoted variables get word-split. If $variable contains spaces (or is empty), the comparison breaks.
Our rule: every variable expansion is quoted unless we explicitly want word-splitting. "$var", not $var. Even when "we know" the variable is safe, quote it — the script will be reused in contexts where it isn't.
ShellCheck (linter) catches unquoted variables. We run it in CI on every shell script.
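A quick way to see word-splitting is to count the arguments a command actually receives. This is a small sketch with a made-up filename; it assumes the default IFS, so run it without the preamble's IFS line to see the split.

```shell
#!/usr/bin/env bash
# Illustrative sketch: count how many arguments a command receives.
# `count_args` and the filename are hypothetical.

count_args() {
    echo "$#"
}

name="quarterly report.txt"       # one filename containing a space

# shellcheck disable=SC2086      # the unquoted expansion is the point here
unquoted=$(count_args $name)      # word-split into two arguments
quoted=$(count_args "$name")      # passed through as one argument

echo "unquoted: $unquoted, quoted: $quoted"
```

The unquoted call sees two arguments, the quoted call sees one. Any command given the unquoted form would look for two files that do not exist.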
With set -e, the script exits on errors. But sometimes you expect a command to fail; you need to handle it without aborting.
Patterns:
For "this might not exist":
if [ -f /etc/myconfig ]; then
    source /etc/myconfig
fi
Test before using.
For "this command might fail and that's OK":
output=$(some_command 2>&1) || {
    echo "Command failed: $output" >&2
    exit 1   # inside a function, use return 1 instead
}
|| lets you handle the failure explicitly.
For "this command must succeed":
some_command || { echo "Failed; aborting" >&2; exit 1; }
Or just let set -e do its job: if some_command fails, the script exits.
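Some commands use a non-zero status for ordinary outcomes; grep returns 1 when nothing matches, for example. One way to handle that under set -e is to record the status with || and let the caller branch on it. The sketch below is only an illustration; has_line is a hypothetical helper.

```shell
#!/usr/bin/env bash
set -euo pipefail

# has_line PATTERN FILE
# Returns 0 if FILE contains PATTERN, 1 if it does not, and grep's own
# status (2 or higher) if grep itself failed, e.g. the file is unreadable.
has_line() {
    local pattern="$1" file="$2" status=0
    # `|| status=$?` records the failure without tripping set -e.
    grep -q -- "$pattern" "$file" || status=$?
    return "$status"
}

sample=$(mktemp)
printf 'alpha\nbeta\n' > "$sample"

if has_line "beta" "$sample"; then
    echo "found beta"
fi

if ! has_line "gamma" "$sample"; then
    echo "no gamma, continuing"
fi

rm -f "$sample"
```

Because both calls sit inside an if, a "not found" result is treated as information rather than as a reason to abort.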
Trapping for cleanup:
# Use a name other than TMPDIR: mktemp and other tools read that variable.
work_dir=$(mktemp -d)
trap 'rm -rf "$work_dir"' EXIT
The trap ... EXIT runs the cleanup whether the script succeeds or fails.
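To check that the cleanup really runs on failure, the sketch below wraps a job in a subshell with its own EXIT trap and then looks for the scratch directory afterwards. The names run_job and scratch are hypothetical; this is an illustration, not a drop-in helper.

```shell
#!/usr/bin/env bash
# Illustrative sketch: an EXIT trap removes a scratch directory even when the
# work fails. `run_job` and `scratch` are hypothetical names.

run_job() {
    # The subshell gets its own EXIT trap, so cleanup happens when the job
    # ends rather than when the calling script ends.
    (
        scratch=$(mktemp -d) || exit 1
        trap 'rm -rf "$scratch"' EXIT
        echo "$scratch"      # report the path so the caller can inspect it
        "$@"                 # the actual work; may fail
    )
}

dir=$(run_job false) || echo "job failed with status $?" >&2
[ -d "$dir" ] || echo "scratch directory was cleaned up" >&2
```

Here the job is just false, so it always fails; the directory is still gone by the time the caller checks.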
Stdout for normal output (something a caller might want to consume). Stderr for diagnostics (status messages, errors).
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" >&2
}
log "Starting backup"
# ... backup code that might write data to stdout ...
log "Backup completed"
The log() function writes to stderr; the script's actual output (if any) goes to stdout. Callers can pipe stdout while still seeing logs.
Don't put progress logs to stdout — when the script is used in pipelines, the logs end up in the pipe.
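A small sketch of the difference this makes to a caller: only stdout is captured, so the logs never contaminate the data. list_users is a hypothetical function standing in for a real script.

```shell
#!/usr/bin/env bash
# Illustrative sketch: data on stdout, diagnostics on stderr.
# `list_users` is a hypothetical stand-in for a real script.

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*" >&2
}

list_users() {
    log "Listing users"
    printf '%s\n' alice bob      # the data a caller wants
    log "Done"
}

# Only stdout is captured; the log lines still reach the terminal via stderr.
users=$(list_users)
echo "captured: $users"
```

If log wrote to stdout instead, users would contain the timestamped messages as well as the two names.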
For scripts that take arguments:
usage() {
    cat <<EOF
Usage: $0 [OPTIONS] <input-file>
Options:
  --output FILE   Output file (default: input.processed)
  --verbose       Enable verbose output
  --help          Show this message
EOF
}
OUTPUT=""
VERBOSE=0
INPUT=""
while [ $# -gt 0 ]; do
    case "$1" in
        --output) OUTPUT="$2"; shift 2 ;;
        --verbose) VERBOSE=1; shift ;;
        --help) usage; exit 0 ;;
        --) shift; break ;;
        -*) echo "Unknown option: $1" >&2; usage >&2; exit 1 ;;
        *) INPUT="$1"; shift ;;
    esac
done
# After --, the next argument is the input file even if it starts with a dash.
if [ -z "$INPUT" ] && [ $# -gt 0 ]; then
    INPUT="$1"
fi
if [ -z "$INPUT" ]; then
    echo "Error: input file required" >&2
    usage >&2
    exit 1
fi
Long argument parsing in shell is painful. For complex CLIs, consider Python with argparse instead.
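One rough edge in hand-rolled parsing: when an option that takes a value is the last argument, set -u aborts with an "unbound variable" message that means little to the person running the script. A possible guard is sketched below; need_value and parse_output are hypothetical names used only for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# need_value OPTION REMAINING_ARG_COUNT
# Fails with a readable message when an option that takes a value is the
# last argument on the command line.
need_value() {
    if [ "$2" -lt 2 ]; then
        echo "Error: $1 requires a value" >&2
        return 1
    fi
}

# Hypothetical parser that only understands --output, to keep the sketch short.
parse_output() {
    local output=""
    while [ $# -gt 0 ]; do
        case "$1" in
            --output) need_value "$1" "$#" || return 1; output="$2"; shift 2 ;;
            *) shift ;;
        esac
    done
    echo "$output"
}

parse_output --output out.txt in.txt
```

Calling parse_output --output with nothing after it prints the error and returns 1 instead of tripping set -u.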
Even a small script benefits from functions:
backup_database() {
    local db="$1"
    local target="$2"
    log "Backing up $db to $target"
    pg_dump "$db" | gzip > "$target"
}
main() {
    local db="${1:-}"
    [ -n "$db" ] || { echo "DB name required" >&2; exit 1; }
    backup_database "$db" "/backups/$db.sql.gz"
}
main "$@"
main "$@" at the end runs the actual logic. Functions make scripts testable (you can source the file and test functions individually) and readable.
local makes function variables local — without it, variables leak to the calling scope.
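The leak is easy to reproduce. In this sketch two hypothetical functions assign a variable named target, one with local and one without, and the caller checks its own target after each call.

```shell
#!/usr/bin/env bash
# Illustrative sketch: a variable assigned without `local` overwrites the
# caller's variable of the same name. Function names and paths are made up.

leaky() {
    target="/tmp/from-leaky"              # no `local`: assigns the global
}

contained() {
    local target="/tmp/from-contained"    # visible only inside the function
}

target="/backups/prod"

contained
after_contained="$target"                 # still /backups/prod

leaky
after_leaky="$target"                     # now /tmp/from-leaky

echo "after contained: $after_contained"
echo "after leaky:     $after_leaky"
```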
Two patterns to never use:
eval: takes a string and runs it as code. If the string has any user input or untrusted data, it's command injection. We banned eval from our scripts; if you think you need it, you don't.
Building command strings and then running them with bash -c, as in bash -c "$user_input": same risks as eval. Just call the command directly with arguments.
If you find yourself needing dynamic execution, that's a sign the task has outgrown shell. Move to Python.
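For the common case where the "dynamic" part is only a varying set of arguments, a bash array is usually enough. The sketch below assembles a grep invocation piece by piece; build_args and its parameters are hypothetical, and the hostile-looking pattern is there to show that it stays a single argument.

```shell
#!/usr/bin/env bash
# Illustrative sketch: assemble a command as an array instead of a string.
# `build_args` and its parameters are hypothetical.

build_args() {
    local verbose="$1" pattern="$2"
    args=(grep)                      # global array, filled for the caller
    if [ "$verbose" -eq 1 ]; then
        args+=(-n)
    fi
    args+=(-- "$pattern")
}

build_args 1 'hello; rm -rf /'      # hostile-looking input stays one argument
printf 'arg: %s\n' "${args[@]}"
```

Running the command is then "${args[@]}" followed by a filename. Nothing in the array is ever re-parsed as shell code, so the semicolon is just a character in the search pattern.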
# Wrong: parsing ls
for f in $(ls /tmp); do ...
# Right: globbing
for f in /tmp/*; do ...
ls output formatting changes; locale affects it; weird filenames break it. Use shell's built-in features when possible.
For more sophisticated file operations:
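find /tmp -type f -name "*.log" -mtime +7 -print0 | xargs -0 rm -f --
Use -print0 and -0 to handle filenames with spaces and special characters. The -f keeps rm from failing when find matches nothing and GNU xargs still runs it once with no arguments.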
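When each file needs more than a single command, the same null-delimited output can feed a read loop. This is a sketch under the assumption that find supports -print0 (GNU and BSD versions do); count_logs is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Illustrative sketch: loop over find results without breaking on odd names.
# `count_logs` is a hypothetical helper.

count_logs() {
    local dir="$1" n=0 f
    # -print0 with read -d '' keeps names containing spaces or newlines intact.
    # Process substitution keeps the loop in the current shell, so n survives.
    while IFS= read -r -d '' f; do
        n=$((n + 1))
    done < <(find "$dir" -type f -name '*.log' -print0)
    echo "$n"
}

demo=$(mktemp -d)
touch "$demo/app one.log" "$demo/app2.log" "$demo/notes.txt"
count_logs "$demo"      # prints 2
rm -rf "$demo"
```

Piping find into the while loop instead would run the loop in a subshell, and the counter would be lost when the loop ends.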
Things we've removed from our codebase:
Hand-rolled argument parsing past 3 options. Use a real argument parser if you have many options. Or Python.
Sharing variables via export to subshells. Changes a subshell makes to an exported variable don't propagate back to the parent. Use functions and return values.
Nested loops with complex conditions. Refactor or rewrite in Python. Shell isn't the right tool for nontrivial logic.
Multi-step pipelines without pipefail. Already covered, but worth restating: set -o pipefail is non-optional.
Unquoted variable expansion "because we know it's safe." Quote everything. Future you won't remember why this one was safe.
Catching errors with 2>/dev/null. Suppresses real errors. Use specific error handling instead.
Parsing JSON with grep/sed/awk. Use jq. Always.
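On the exported-variables item in the list above, a minimal sketch of the problem and of the alternative: a change made in a subshell disappears, while a function that prints its result lets the caller capture it. The names counter and next_value are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch: a subshell's changes are lost; pass results on stdout.
# `counter` and `next_value` are hypothetical names.

export counter=0

( counter=5 )                       # subshell: the parent never sees this
after_subshell="$counter"           # still 0

next_value() {
    echo $(( $1 + 1 ))
}

counter=$(next_value "$counter")    # result handed back explicitly

echo "after subshell: $after_subshell, after function: $counter"
```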
A pattern we use: set -x for ad-hoc debugging in scripts.
When troubleshooting a flaky script, add set -x near the top. It echoes every command before execution. Verbose, but you can see exactly what's happening.
set -x # Debug mode
some_complex_pipeline | here
set +x # Disable
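Trace output is easier to follow when each line says where it came from. One option, sketched here, is to set PS4 (the prefix bash prints before each traced command) and to limit tracing to a single call. The wrapper name traced is hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch: trace a single command with line numbers in the prefix.
# `traced` is a hypothetical wrapper.

traced() {
    local PS4='+ line ${LINENO}: ' status=0
    set -x
    "$@" || status=$?
    { set +x; } 2>/dev/null     # switch tracing off without tracing `set +x`
    return "$status"
}

traced echo "hello"
```

The trace goes to stderr, so it does not mix with the command's own stdout.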
A short comment block at the top of every non-trivial script:
#!/usr/bin/env bash
#
# backup-rds.sh: Take a manual snapshot of an RDS instance.
#
# Usage: ./backup-rds.sh <instance-id> [--region <region>]
#
# Requires: AWS CLI, jq, write access to the instance.
#
# This is run nightly by cron via /etc/cron.d/rds-backup.
What it does, how to use it, what it needs, where it runs. Five lines that save 15 minutes of debugging when someone (often future me) reads it later.
Shell testing is real but limited. Tools:
bats: Bash Automated Testing System. Test functions in isolation.
shellcheck: linter, catches a lot of common bugs. CI-required.
shfmt: formatter for consistent style.
For scripts beyond a certain complexity, integration testing in a sandbox is the realistic option. Run the script against a test environment; assert the result.
We don't aim for "all scripts have tests." We aim for: linting on every script, and tests for the scripts where bugs would matter (production deploy scripts, data migration scripts).
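Testing functions in isolation means a test file has to source the script without running it. A common way to allow that is a guard around the call to main; the sketch below shows the idea. backup_path and the default database name "app" are hypothetical.

```shell
#!/usr/bin/env bash
# Illustrative sketch: a script that can be executed or sourced for tests.
# `backup_path` and the default name "app" are hypothetical.

backup_path() {
    echo "/backups/$1.sql.gz"
}

main() {
    local db="${1:-app}"
    backup_path "$db"
}

# Run main only when executed directly; sourcing just defines the functions.
if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
    main "$@"
fi
```

A test can then source the file and call backup_path orders directly, checking the returned path without touching a database.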
Move to Python (or another language) when:
The migration cost is real but the maintenance benefit pays off quickly for non-trivial code.
Start every script with the safety preamble. set -euo pipefail plus a careful IFS catches a large class of bugs before they do damage.
Quote everything. ShellCheck enforces this; respect its findings.
Functions make scripts maintainable. Even a 50-line script benefits.
Don't parse formatted output (ls, ps, etc.). Use find, jq, etc. for structured data.
Stay short. When a script grows beyond ~150 lines or has nested complexity, move to a real language.
ShellCheck is non-optional. Linting catches a class of bugs you'll otherwise hit in production at 2 AM.
Shell scripting in production is one of those skills that's invisible when done well. The good scripts run silently, do their job, and don't break. The bad ones cost hours of debugging at the worst times. The patterns above don't make you a shell expert; they keep you out of trouble. For most operational scripts, that's exactly what you need.