Modified Unix Tools 1: diff

Sunday, 22 October 2017

Last time I wrote about software engineers making their own tools. This is not normally done from scratch, because some existing tool can often be repurposed or extended. Free and open source software is a great basis for this because it's readily available for modification.

Unix users will be familiar with the venerable set of Unix tools that live in /bin and /usr/bin: things like grep and diff, sort and sed. They can be used directly, or as building blocks for new tools held together by shell scripts. The toolbox is deeply arcane, but powerful enough that you rarely need to alter any of the base tools. If you do need to, you can get the source code from your Linux distribution or from the GNU project, modify it and make your own version.

I found a need to modify "diff" because I wanted a version that would ignore differences in numbers. The standard GNU version of diff can already ignore differences in whitespace; I wanted it to ignore numbers too. This is useful for comparing timestamped log files where you are interested in the events that occurred but not the timestamps. For instance, here are two "strace" logs captured at different times. They show a program ("/bin/echo") doing something subtly different. In order to compare them I really want to ignore the process ID and timestamp:

Trace 1
PID   Timestamp         Event

16611 1508670679.088507 execve("/bin/echo", ...) = 0
16611 1508670679.090603 write(1, "Hello world\n", 12) = 12
16611 1508670679.090900 close(1)        = 0
16611 1508670679.090982 exit_group(0)   = ?
16611 1508670679.091125 +++ exited with 0 +++

Trace 2
PID   Timestamp         Event

16616 1508670681.437833 execve("/bin/echo", ...) = 0
16616 1508670681.439946 write(1, "hello world\n", 12) = 12
16616 1508670681.439988 close(1)        = 0
16616 1508670681.440076 exit_group(0)   = ?
16616 1508670681.440205 +++ exited with 0 +++

My diff ignores all the numbers when I use its modified -Z option, revealing the one line that really differs:

-16611 1508670679.090603 write(1, "Hello world\n", 12) = 12
+16616 1508670681.439946 write(1, "hello world\n", 12) = 12
                                   ^ here
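
The same idea can be tried without touching diff at all. The following is only a rough sketch in Python, not the patch: it assumes that replacing every run of digits with a placeholder is an acceptable stand-in for "ignoring numbers", and the function names (mask_numbers, diff_ignoring_numbers) are made up for this illustration. It compares the masked lines but reports the original ones.

```python
import difflib
import re

def mask_numbers(line):
    """Replace every run of decimal digits with a '#' placeholder."""
    return re.sub(r"\d+", "#", line)

def diff_ignoring_numbers(old_lines, new_lines):
    """Compare masked lines, but report the original lines that differ."""
    matcher = difflib.SequenceMatcher(
        a=[mask_numbers(line) for line in old_lines],
        b=[mask_numbers(line) for line in new_lines],
        autojunk=False,
    )
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        out.extend("-" + line for line in old_lines[i1:i2])
        out.extend("+" + line for line in new_lines[j1:j2])
    return out

# The two strace logs from above (raw strings keep the literal "\n").
trace1 = [
    r'16611 1508670679.088507 execve("/bin/echo", ...) = 0',
    r'16611 1508670679.090603 write(1, "Hello world\n", 12) = 12',
    r'16611 1508670679.090900 close(1)        = 0',
    r'16611 1508670679.090982 exit_group(0)   = ?',
    r'16611 1508670679.091125 +++ exited with 0 +++',
]
trace2 = [
    r'16616 1508670681.437833 execve("/bin/echo", ...) = 0',
    r'16616 1508670681.439946 write(1, "hello world\n", 12) = 12',
    r'16616 1508670681.439988 close(1)        = 0',
    r'16616 1508670681.440076 exit_group(0)   = ?',
    r'16616 1508670681.440205 +++ exited with 0 +++',
]

if __name__ == "__main__":
    # Prints only the two write(...) lines, one with '-' and one with '+'.
    print("\n".join(diff_ignoring_numbers(trace1, trace2)))
```

A script like this is fine for a one-off, but it lacks everything else diff already provides, such as recursive directory comparison and the usual output formats, which is why modifying diff itself was more attractive.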

The modified diff is also useful for comparing generated source files, as these often contain timestamps and artificial names which can change between builds. I can quickly spot differences in the generated code output from two different tool versions by having diff ignore these values.

You can probably think of other applications for this. For instance, the modified "diff" will ignore #line numbers in preprocessed files. It will ignore differences in file and directory names that are purely numerical, so paths in /version1/test will match paths in /version2/test. Being part of "diff", the new feature is also immediately available for recursive comparisons of whole directory trees as well as individual files.

It's a tiny modification to GNU diff which changes the behaviour of the existing -Z option to implement the new feature. Here is a patch for GNU diffutils 3.3. The patch ignores changes in decimal numbers, and also in the hexadecimal digits 'a' .. 'f'. You will see that it is not hard to make further changes. I am using the "ctype.h" library functions to recognise whitespace (isspace) and hex digits (isxdigit).
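
To get a feel for what this rule does and does not hide, here is a small Python model of it. This is not the patch, which is C inside diff's own line comparison code. It is a sketch that assumes a line is compared after dropping every character that isspace or isxdigit would accept in the C locale, which means upper-case 'A' .. 'F' are dropped as well. The names comparison_key and lines_match are invented for the example.

```python
import string

# 0-9, a-f and A-F: the characters C's isxdigit() accepts in the "C" locale.
HEX_DIGITS = set(string.hexdigits)

def comparison_key(line):
    """Drop whitespace and hex digits, keeping everything else in order."""
    return "".join(
        ch for ch in line if not ch.isspace() and ch not in HEX_DIGITS
    )

def lines_match(a, b):
    """True if two lines are equal once whitespace and hex digits are ignored."""
    return comparison_key(a) == comparison_key(b)

if __name__ == "__main__":
    # Addresses and PIDs vanish, so these two lines compare equal.
    print(lines_match("16611 mmap(0x7ffd1c00)", "16616 mmap(0x7ffe2a10)"))  # True
    # 'H' and 'h' are not hex digits, so this difference is still seen.
    print(lines_match('write(1, "Hello")', 'write(1, "hello")'))  # False
    # The price: words spelled only with a-f letters also compare equal.
    print(lines_match("bad", "bed"))  # True
```

The last case shows the trade-off of including 'a' .. 'f': hexadecimal addresses stop producing noise, but a few genuine text differences are hidden too. For log files and generated code that is usually a price worth paying.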

The ability to make changes to tools such as "diff" is a huge advantage of free and open source software. Without the "diff" source code, this simple feature would not be available to me, because I have no time to rewrite even part of "diff". The GNU source code is a huge asset for developers everywhere, and I think it will probably survive for a long time. Perhaps "programmer archaeologists" will still be using code from it in thousands of years' time, as in "A Deepness In The Sky". Certainly it has the properties needed for code to survive: lots of things are based on it already, it can be repurposed, it can be recompiled for new systems, and the license seriously discourages attempts to decouple the binary from the source code, which makes it harder to lose source code.

A further thought. If you look at the patch, you'll see it's been made by SVN. This is because my process for modifying third-party tools is to first move them into a local repository. Start with a stable version of a tool before modifying it. Don't be tempted to just "git clone" the latest one unless you have a really good reason, because it may have some changes (or bugs) you don't want. Then add your own build script, which runs whatever "configure" process is required before "make" and "make install". Test this and commit it before modifying the code. This means that you can repeat the whole build process from a deep clean in the future, and can easily extract whatever changes you made using "svn diff" or "git diff". This method came from bitter experiences where I returned to something, wanted to add another change, but could no longer recompile it without lots of extra work. The deep clean step is particularly useful because it works around any bugs in the third-party tool's build system, e.g. failing to recompile some source file that has changed. Software which is relatively bug-free may nevertheless have lots of unrelated dependency bugs in its build system.
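
A build script of this kind can be very short. The sketch below is in Python and is only an illustration of the shape, not the script I actually use. It assumes the tool lives in a git working copy (so "git clean -fdx" can serve as the deep clean; an SVN working copy needs its own equivalent), that it has a conventional autoconf-style "configure" accepting --prefix, and the names build_commands and rebuild are invented here.

```python
import subprocess

def build_commands(prefix):
    """The full rebuild, from deep clean to install, as a list of commands."""
    return [
        # Deep clean: deletes ALL untracked and ignored files in the tree,
        # so commit anything you want to keep first. (Assumes git.)
        ["git", "clean", "-fdx"],
        ["./configure", f"--prefix={prefix}"],
        ["make"],
        ["make", "install"],
    ]

def rebuild(source_dir, prefix, runner=subprocess.run):
    """Run every step in source_dir, stopping at the first failure."""
    for cmd in build_commands(prefix):
        runner(cmd, cwd=source_dir, check=True)

if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for cmd in build_commands("/opt/mydiff"):
        print(" ".join(cmd))
```

Passing the runner in as a parameter is just a convenience for trying the script without building anything. The important part is that the sequence is written down and committed, so the build can be repeated exactly years later.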