Bash shell script tips

Monday, 23 February 2015

Do you write shell scripts using bash?

If so, do you know what the "-e" option does? How about "-x"?

Every "bash" user should know about these invaluable options. They are off by default, and you should turn them on whenever possible when writing or maintaining scripts. They can be turned on when a script is started, and they can be turned on and off while a script is running.

Here is what they do:
  • "-e" means "Exit immediately if a command exits with a non-zero status."
  • "-x" means "Print commands and their arguments as they are executed."
(Both of these descriptions appear if you enter the bash command: "help set". There is a bash manual, but it is very long, reflecting the enormous complexity of bash.)
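
As a minimal sketch of turning the options on and off while a script is running, the "set" builtin can be used inside the script itself. The echo messages are placeholders of my own, not part of any real script:

```shell
#!/bin/bash
# Options can be set inside the script with the "set" builtin.
# This also works when the script is started as "bash script.sh",
# in which case the "#!" line (and any options on it) is not used.
set -e          # exit on the first failing command from here on

echo "not traced"

set -x          # start printing an execution trace to standard error
echo "traced"
set +x          # stop tracing ("+" turns an option off)

echo "not traced again"
```

The same options can be given when the script is started, for example "bash -ex script.sh".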

"-x" is useful because it prints an execution trace of your bash script. The trace is printed to standard error. You can redirect it to a file, and then look through it to see where your script went wrong. This is a primitive but effective form of debugging which requires no special tools. To illustrate the feature, here is a short bash script. Notice that -x has been placed on the first line:
#!/bin/bash -x
for i in 1 2 3
do
echo $i
done
When executed, this script prints the following to standard output:
1
2
3

and the following to standard error:
+ for i in 1 2 3
+ echo 1
+ for i in 1 2 3
+ echo 2
+ for i in 1 2 3
+ echo 3
The standard error output is a trace of each command that was executed. Variables ($i) have been expanded.
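
Redirecting the trace to a file might look like the following sketch. The names "loop.sh" and "trace.log" are made up for the example, and the loop script is recreated in a temporary directory so that the sketch is self-contained:

```shell
#!/bin/bash
# Sketch: run a traced script and keep the trace in a file.
# "loop.sh" and "trace.log" are made-up names for this example.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the loop from the text as loop.sh.
cat > loop.sh <<'EOF'
for i in 1 2 3
do
echo $i
done
EOF

# "2>" redirects standard error (the trace) to trace.log;
# standard output (1, 2, 3) still goes to the terminal.
bash -x loop.sh 2> trace.log

# Look through the trace afterwards, for example by counting
# how many "echo" commands were executed.
grep -c '^+ echo' trace.log
```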


"-e" is even more useful. It causes the bash script to exit immediately if one of the commands exits with a non-zero exit status. Almost all of the Unix commands exit with a non-zero code if something goes wrong. Generally, non-zero means "error". For instance,
  • "cp" will exit non-zero if the copy couldn't be carried out,
  • "tar" will exit non-zero if there was an error unpacking an archive,
  • "cmp" will exit non-zero if the two files given to "cmp" were not identical,
  • "gcc" and "make" will exit non-zero if compilation failed (e.g. syntax error in your code),
  • "cd" will exit non-zero if the target directory is not accessible.
You can look at the manual for each command to find more detail. Do this when using any command that you don't know well.
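
To see exit statuses for yourself, you can inspect the special variable "$?" straight after a command. Here is a small sketch using "cmp" and "cd" from the list above; the file and directory names are invented, and "-e" is deliberately left off so that the script survives the failures it is demonstrating:

```shell
#!/bin/bash
# Sketch: the exit status of the most recent command is held in "$?".
# The file names are made up for this example.
workdir=$(mktemp -d)
printf 'one\n' > "$workdir/a.txt"
printf 'two\n' > "$workdir/b.txt"

cmp -s "$workdir/a.txt" "$workdir/a.txt"   # -s: compare silently
same_status=$?                              # zero: files are identical

cmp -s "$workdir/a.txt" "$workdir/b.txt"
diff_status=$?                              # non-zero: files differ

cd "$workdir/no_such_directory" 2>/dev/null
cd_status=$?                                # non-zero: cd failed

echo "same=$same_status differ=$diff_status cd=$cd_status"
```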

The great benefit of "-e" is that your script will stop instantly if any command fails. Without "-e", your bash script blunders onwards, blindly ignoring the earlier failure. Maybe this will just waste your time: the build script failed to patch a source file in an early step, and that caused the final link to fail. Or maybe it will be more harmful. To see the potential for disaster, consider the following bash script:
#!/bin/bash -x
cd /home/jack
cd build_directory
rm -rf -- *
The intention of the script is to delete all the files and directories in /home/jack/build_directory. But what if /home/jack/build_directory does not exist? Thanks to "-x", we can see the disaster unfold:
+ cd /home/jack
+ cd build_directory
./s: line 3: cd: build_directory: No such file or directory
+ rm -rf -- file1 file2 file3 ...
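Oh dear. The failed "cd" left the script in /home/jack, so "rm" deletes every file and directory there that "*" matches (everything except hidden files). Hopefully, jack has a backup. Disaster is averted if we use "-e":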
#!/bin/bash -ex
cd /home/jack
cd build_directory
rm -rf -- *
This time:
+ cd /home/jack
+ cd build_directory
./s: line 3: cd: build_directory: No such file or directory
The script now exits (with a non-zero exit code) before "rm" is reached. No files are deleted.
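
If you would like to try this without risking a real home directory, the scenario can be replayed in a throwaway sandbox. This is only a sketch: the sandbox directory stands in for /home/jack, "cleanup.sh" is a made-up name, and I have narrowed "*" to "file*" so that the demonstration does not delete its own script:

```shell
#!/bin/bash
# Sketch: replay the failed-"cd" scenario inside a throwaway sandbox.
# The sandbox stands in for /home/jack; all names are made up.
sandbox=$(mktemp -d)
touch "$sandbox/file1" "$sandbox/file2" "$sandbox/file3"

# cleanup.sh is the script from the text, with the sandbox passed as
# its first argument instead of a hard-coded home directory.
cat > "$sandbox/cleanup.sh" <<'EOF'
cd "$1"
cd build_directory
rm -rf -- file*
EOF

# With -e: the second "cd" fails, the script exits, nothing is removed.
bash -e "$sandbox/cleanup.sh" "$sandbox" 2>/dev/null
with_e_status=$?
with_e_left=$(ls "$sandbox" | grep -c '^file')

# Without -e: the script carries on and "rm" runs in the wrong place.
bash "$sandbox/cleanup.sh" "$sandbox" 2>/dev/null
without_e_left=$(ls "$sandbox" | grep -c '^file')

echo "with -e: status=$with_e_status, files left=$with_e_left"
echo "without -e: files left=$without_e_left"
```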

In some cases, "-e" carries a slight disadvantage, because you may be quite happy for a command to fail. For instance, suppose you want to try to delete a file called "foobar", but you want the script to continue running even if "foobar" can't be deleted. You might write the following script:
#!/bin/bash -ex
cd /tmp
rm -f foobar
echo hello
Alas, "/tmp/foobar" exists and is owned by "root", so you see:
+ cd /tmp
+ rm -f foobar
rm: cannot remove `foobar': Operation not permitted
The script stops running here. In this case, you should tell bash to ignore errors for that command only. Here is how I would do it:
#!/bin/bash -ex
cd /tmp
rm -f foobar || true
echo hello
The "||" is a short-circuit OR operation, just like "||" in the C language: if the command on the left-hand side fails, the command on the right-hand side is run instead. The "true" command always exits with status zero. So, although the "rm" command fails, the statement as a whole succeeds. The output is:
+ cd /tmp
+ rm -f foobar
rm: cannot remove `foobar': Operation not permitted
+ true
+ echo hello
hello
The "echo" command is also a nice replacement for "true", because you can use it to print an error message, like:
rm -f foobar || echo "Unable to delete foobar!!"
You can even force an exit with a specific exit code:
rm -f foobar || exit 123
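
The "||" variations can be tried safely with the following sketch, in which "false" (a command that always fails) stands in for the failing "rm". The last variation, which remembers the exit status in a variable for later, is my own addition:

```shell
#!/bin/bash
set -e
# Sketch: ways to handle a command that is allowed to fail.
# "false" stands in for any failing command, such as the "rm" above.

false || true                                  # ignore the failure

false || echo "warning: command failed" >&2    # report it, carry on

# Remember the exit status for later without stopping the script.
status=0
false || status=$?
echo "status was $status"

echo "still running"
```

Sending the warning to standard error with ">&2" keeps it apart from the script's normal output.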
If you want to disable error-checking for more than one command - potentially the rest of the script - you can do so with the command "set +e". You can re-enable it with "set -e". However, avoid doing this: while error-checking is disabled, "-e" cannot help you.

Shell scripting errors turn up all the time, if you are unlucky, or looking for them. One common mistake is to try to use a variable that does not exist. Here is a high-profile example of that mistake. Many programming languages would exit with an error if you tried to use an undefined variable, but Unix shell scripts, by default, do not. Not even with "-e". However, "-e" and "-x" will help to track down such mistakes. For example, a command like "cp $SRC $DEST" will fail with an error if $SRC or $DEST (or both) is undefined, because "cp" is then given too few arguments. You will see the mistake in the "-x" output, just before the script exits.
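
Going slightly beyond "-e" and "-x": "help set" also lists a "-u" option ("Treat unset variables as an error when substituting"), and the shell's "${VAR:?message}" expansion fails if VAR is unset or empty. The sketch below tries both, reusing the SRC and DEST names from the example above. Each case runs in a child bash so that the demonstration itself keeps going:

```shell
#!/bin/bash
# Sketch: two ways to make an unset variable a hard error.
# SRC and DEST are the variable names from the text; each case runs in
# a child bash so that this demonstration itself keeps going.

# 1. "set -u": treat unset variables as an error when substituting.
u_out=$(bash -c 'set -u; unset SRC; echo "copying $SRC"; echo "not reached"' 2>/dev/null)
u_status=$?

# 2. ${VAR:?message}: fail with a message if VAR is unset or empty.
q_out=$(bash -c 'unset DEST; echo "to ${DEST:?DEST is not set}"; echo "not reached"' 2>/dev/null)
q_status=$?

# Both child shells stop at the bad variable and exit non-zero.
echo "set -u: status=$u_status output='$u_out'"
echo "\${DEST:?}: status=$q_status output='$q_out'"
```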

Failing to check error return codes is a general programming problem. It is one of the reasons why languages began to adopt exceptions as a mechanism for error handling. Lazy programmers ignore return codes, and this is a great source of bugs. A program that ignores return codes will probably crash eventually, but the crash might not happen at the point where the mistake was made, which is good fun for the maintenance programmer trying to trace the bug. Here's a real example. Unlike return codes, exceptions will propagate if they are not explicitly caught, so the programmer is forced to handle errors or allow them to be propagated to some other handler, higher up the call stack. Of course, the lazy programmer still has a way to catch all exceptions, so this does not always help.

Conclusion: use "-x" and "-e" in your bash scripts. It will save you time. It will save other people time. (Think about the maintenance programmer!)

Use "-e", at least, in all new scripts. If you are maintaining an old script, it may not be easy to add "-e", because the old script may depend on not exiting on error somewhere. In this case, think carefully about error handling. Consider adding code like my "||" example to explicitly detect errors for each line that might fail, calling "exit" on error, or otherwise doing something helpful. Consider using "set -e" to temporarily enable error checking.
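
Temporarily enabling error checking in an old script might look like this sketch. The commands are placeholders: "false" stands in for a legacy command whose failure the old script has always tolerated, and the new section simply creates a temporary directory:

```shell
#!/bin/bash
# Sketch: add error checking to one part of an older script.
# The commands are placeholders; "false" stands in for a legacy
# command whose failure the old script has always tolerated.

# --- legacy section: relies on carrying on after errors ---
false
echo "legacy section finished"

# --- new section: stop at the first failure ---
set -e
workdir=$(mktemp -d)
cd "$workdir"
echo "new section ran in a fresh directory"
set +e          # restore the old behaviour for the rest of the script

# --- more legacy code ---
false
echo "end of script"
```

As more of the old script is reviewed, the "set -e" region can be widened until it covers the whole file.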