Sunday, 8 October 2017

Differences between dynamically linked libraries (.dll) and shared objects (.so)

Dynamically linked libraries (Windows) and shared objects (Linux) are conceptually the same thing.

Both are containers for executable code and data. They can be loaded into the memory space of other programs, where the functions can be executed and the data may be accessed. This loading takes place at runtime - either when the program starts, or when the program explicitly requests it via a LoadLibrary call (Windows) or dlopen call (Linux). Loading a DLL or SO will also load the DLLs or SOs it depends upon, and if any of these are already loaded, then the existing copy is reused.
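As a minimal sketch of explicit runtime loading, Python's ctypes module can stand in for a direct LoadLibrary or dlopen call, since ctypes.CDLL goes through the platform's loader (the LoadLibrary family on Windows, dlopen on Unix-like systems). The choice of "abs" as the function to call and the helper name load_c_library are illustrative assumptions.

```python
import ctypes
import ctypes.util
import sys


def load_c_library():
    """Load the C runtime library at runtime and return a handle to it."""
    if sys.platform == "win32":
        # On Windows, ctypes.CDLL uses the LoadLibrary family of calls.
        return ctypes.CDLL("msvcrt")
    # On Unix-like systems, ctypes.CDLL uses dlopen. If find_library
    # returns None, CDLL(None) opens the running program itself, whose
    # global symbols normally include the C library.
    return ctypes.CDLL(ctypes.util.find_library("c"))


libc = load_c_library()
# Declare the C signature "int abs(int)" before calling through the handle.
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int
print(libc.abs(-5))  # 5
```

The same Python code runs on both platforms, which matches the point that the facility looks the same to the user.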

DLLs and shared objects are normally used as containers for library functions: the C function "printf" is a famous example, present in both Linux's "libc.so" and in Windows' "msvcrt.dll". However, DLLs and shared objects can also be used whenever code is loaded at runtime. They have been used to implement plugins for software such as web browsers and media players. They are also used to implement C extensions for Python and native methods for Java. In the R language, they provide access to ancient Fortran libraries that have been used for scientific computing for decades. From the perspective of the user, all of these facilities work the same on both Windows and Linux.

So, enough about the similarities. Where are the differences?

I'm going to write about the differences that mattered to me when I worked on a large application which had just begun to use DLLs after many years of being mostly statically linked. Static linking means that all the components are built into the executable. The switch to DLLs would mean that most components would be in separate files where they could be shared by multiple executables. My job: make the application use DLLs without breaking anything.

I began work on Linux and within a day I had made a version of the application that used shared objects, with one exception that I will write about in another post.

"That was easy," I thought, and began work on the Windows version. This took about two weeks. It turns out that the DLL and SO systems are only superficially similar. The DLL system is substantially more awkward, as I will describe.

Two functions in the same DLL, shared object or program can call each other directly, since the two always have the same relative locations in memory. However, DLLs and shared objects can be loaded at any absolute location, so calls between DLLs and shared objects must involve some "relocation" mechanism to resolve destination addresses. The same mechanism works for any sort of symbol, not just functions. For instance it is also used for variables.

Both DLLs and shared objects have some way to handle relocations, but shared objects have a more powerful system which is able to handle references to data contained within other data: fields within a "struct" or "class" for instance. The missing feature in Windows is the ability to add an offset to a symbol address after it's been relocated.
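The "symbol plus offset" idea can be shown with a toy model. This is only a sketch and not real loader code: the Relocation class, both resolve functions, the symbol name "config" and the address are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relocation:
    symbol: str      # name of a symbol defined in another library
    addend: int = 0  # byte offset to add once the symbol's address is known


def resolve_with_addend(symbols, reloc):
    """Shared-object style in this model: patch in address + addend."""
    return symbols[reloc.symbol] + reloc.addend


def resolve_import_only(symbols, reloc):
    """Plain DLL import style in this model: only the bare symbol address."""
    if reloc.addend != 0:
        raise ValueError("no way to express symbol + offset")
    return symbols[reloc.symbol]


# Made-up load address of a struct called "config" in another library.
symbols = {"config": 0x7F000100}
# A reference to a field that sits 8 bytes into that struct.
field = Relocation("config", addend=8)

print(hex(resolve_with_addend(symbols, field)))  # 0x7f000108
```

In this model, calling resolve_import_only(symbols, field) raises ValueError, which stands for the reference that the plain DLL mechanism cannot describe.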

On Windows, symbols must be explicitly imported and exported. When a DLL is created, every exported symbol must be marked with an annotation such as "__declspec(dllexport)" or listed in a definitions file. When a DLL is used, every imported symbol is normally accessed through an import library containing stubs which form a layer between the program and the DLL. This is where relocations are added.

These requirements make it difficult to port shared objects to Windows, because none of these steps are needed on Unix-like platforms. Shared object symbols are exported by default, and imported implicitly. Any additional code needed for relocations is generated automatically by the linker.

A Unix-like environment for Windows such as MinGW provides much more than Bash and GCC. It also provides a compatibility layer which hides some of the fundamental differences between the Windows platform and a Unix-like platform. Even before DLLs were introduced, MinGW was already part of the application, because the Unix-like layer is very useful for portability. One aspect of this portability concerns hiding the differences between DLLs and shared objects.

For example, unlike the usual Microsoft linker, the MinGW linker "ld" will export all symbols from a DLL by default, and import them implicitly, just as "ld" does on Linux. Furthermore, each program built with MinGW incorporates a runtime library implementing "runtime pseudo-relocs" which extends the DLL relocation system to allow programs to access data within other data. "Runtime pseudo-relocs" work by modifying code after the DLL has been loaded, but before the program can execute it. The implementation is part of the MinGW C runtime (CRT) and is statically linked into every MinGW DLL. "ld" creates the __RUNTIME_PSEUDO_RELOC_LIST__ table to describe the relocation requirements.

Though MinGW attempts to provide shared object semantics for DLLs, the compatibility is not complete, and some fundamental details are lost. The most obvious of these is RPATH. RPATH is a field in the header of a Linux shared object or executable which specifies the directories to search for further shared objects. RPATH is extremely useful for any program that supplies its own shared objects, because it avoids the need to install those shared objects in a system directory such as /lib. But there is no equivalent for Windows. On Windows, DLLs used by an executable must be in the same directory as that executable, in a Windows system directory, or in a directory named in the PATH environment variable.

Once the program has loaded, you can use SetDllDirectory to control the DLL search process, but that only helps for DLLs that are opened after startup, whereas RPATH is used during startup. Therefore, Windows programs usually put the executable and the DLLs together in the same directory. This is what I did.
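The two search strategies can be compared with a simplified sketch. It models only the executable's own directory and PATH on the Windows side and only RPATH entries on the Linux side, leaving out every other step the real loaders take. The function names are hypothetical, and the "$ORIGIN" token (which the Linux loader expands to the directory containing the executable) is an addition not mentioned in the text above.

```python
import os
from pathlib import Path


def windows_style_candidates(exe_path, path_env):
    """Directories searched in this model: the executable's own, then PATH."""
    dirs = [Path(exe_path).parent]
    dirs += [Path(p) for p in path_env.split(os.pathsep) if p]
    return dirs


def rpath_style_candidates(rpath, exe_path):
    """Directories named by a colon-separated RPATH string."""
    origin = str(Path(exe_path).parent)
    return [Path(entry.replace("$ORIGIN", origin))
            for entry in rpath.split(":") if entry]


def find_library(name, candidates):
    """Return the first existing candidate file, or None if there is none."""
    for directory in candidates:
        candidate = directory / name
        if candidate.is_file():
            return candidate
    return None
```

With an RPATH such as "$ORIGIN/../lib", a program can keep its libraries in a sibling directory and still find them wherever it is installed. In the Windows-style model the only ways to get the same result are to put the libraries next to the executable or to change PATH.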

However, fundamental differences do not end with RPATH. A deeper incompatibility is seen when the same symbol occurs in more than one DLL.

Suppose two shared objects are loaded, and both contain a copy of the same function. When that function is called, which copy is used? On Linux, the rule is that it's always the first one to be loaded. This allows you to override library functions with your own versions, for example using the "LD_PRELOAD trick" to ensure that a specific library is loaded first. It's a very powerful technique, because it redirects all accesses to that function, including ones within the same library.

On Windows the rule is different. Loading additional copies of a function has no effect, because the binding between a function call and the function target is decided at link time, and cannot be overridden. If there are two copies with the same name, then both could be in use. Which one do you get? That depends where the call is. There can also be multiple copies of global data, and again, the copy that's accessed depends on who is asking.

The application had a requirement to override a low-level function called __gnat_malloc, which is defined in a shared library supplied by the compiler vendor: "libgnat-17.so" on Linux and "libgnat-17.dll" on Windows. On Linux there are at least two ways to implement this replacement: use something like LD_PRELOAD to load "libreplacemalloc.so" before "libgnat-17.so" at runtime, or put "libreplacemalloc.so" before "libgnat-17.so" in the library order used by "ld", so that it's loaded first during startup. But neither of these options works on Windows. At best, I can have code in some of the DLLs using my replacement, and some others using the original. In particular, there is no way to have functions in libgnat-17.dll call the replacement __gnat_malloc. The binding was fixed when the DLL was created. So, I had to implement it another way.
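The two binding rules can be compared in a toy model. This is a sketch of the rules as described here, not of how either loader is implemented. The Library class, both lookup functions and the "app" module are hypothetical, and the strings "original" and "replacement" stand in for the two function bodies.

```python
class Library:
    def __init__(self, name, defines=None, imports=None):
        self.name = name
        self.defines = dict(defines or {})  # symbol -> implementation
        self.imports = dict(imports or {})  # symbol -> providing library,
                                            # fixed at link time (Windows)


def lookup_linux(load_order, caller, symbol):
    """First definition in load order wins; the caller makes no difference."""
    for lib in load_order:
        if symbol in lib.defines:
            return lib.defines[symbol]
    raise LookupError(symbol)


def lookup_windows(loaded, caller, symbol):
    """The caller's own copy wins, else the library named at link time."""
    if symbol in caller.defines:
        return caller.defines[symbol]
    return loaded[caller.imports[symbol]].defines[symbol]


replace = Library("libreplacemalloc", {"__gnat_malloc": "replacement"})
gnat = Library("libgnat-17", {"__gnat_malloc": "original"})
app = Library("app", imports={"__gnat_malloc": "libreplacemalloc"})

load_order = [replace, gnat, app]              # replacement loaded first
loaded = {lib.name: lib for lib in load_order}

print(lookup_linux(load_order, gnat, "__gnat_malloc"))  # replacement
print(lookup_windows(loaded, gnat, "__gnat_malloc"))    # original
print(lookup_windows(loaded, app, "__gnat_malloc"))     # replacement
```

The middle line is the problem case: calls made from inside libgnat-17 keep reaching the original function, whatever the load order.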

I also found a bug caused by a related problem. The same runtime support code was linked in multiple DLLs. Code within each DLL would only call its own copy of that support code. Unfortunately, only one copy had been initialised properly. If this bug had occurred on Linux, it would have gone unnoticed, as the first copy to load would have been used everywhere. But as it occurred on Windows, it resulted in null pointer exceptions, because data structures were used before they were initialised.
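A small simulation shows why the same mistake is silent under one rule and fatal under the other. All names here are hypothetical, including "a.dll" and "b.dll", and Python's TypeError on a None table stands in for the null pointer access.

```python
class SupportCode:
    """One copy of some runtime support code, with its own private data."""

    def __init__(self):
        self.table = None  # stays None until initialise() runs

    def initialise(self):
        self.table = {}

    def register(self, key):
        self.table[key] = True  # TypeError if this copy was never initialised


def copy_used_by(copies, dll, first_loaded_wins):
    if first_loaded_wins:
        # Linux-style: every caller is bound to the first copy loaded.
        # Dicts keep insertion order, which plays the role of load order.
        return next(iter(copies.values()))
    # Windows-style: each DLL is bound to its own copy.
    return copies[dll]


def simulate(first_loaded_wins):
    copies = {"a.dll": SupportCode(), "b.dll": SupportCode()}
    # Only the code in a.dll ever runs the initialisation.
    copy_used_by(copies, "a.dll", first_loaded_wins).initialise()
    try:
        copy_used_by(copies, "b.dll", first_loaded_wins).register("x")
    except TypeError:
        return "uninitialised"
    return "ok"


print(simulate(first_loaded_wins=True))   # ok
print(simulate(first_loaded_wins=False))  # uninitialised
```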

MinGW can only go some of the way towards hiding fundamental design differences between Windows and Linux. The deep differences are still present.

I dislike the risk that I may be considered any sort of Linux "fanboy" for saying this, but from experience, shared objects are better designed than DLLs. It is mainly a matter of history. DLLs date back to Windows 1.0 and 16-bit MS-DOS, and since then, the design has largely been concerned with compatibility. There's an interesting blog series on that subject. The DllMain function has caused a lot of problems because it's been widely misused by Windows programmers. Backward compatibility was not a big concern for the Linux shared library system, and it was possible to choose a good design based on decades of knowledge about bad ways to do it, including the previous (awful) "a.out" system and Microsoft's DLLs.

Because they're not part of the system code that actually loads the DLL, MinGW's workarounds can cause issues. For instance, I found that application startup time mysteriously increased by several seconds - sometimes. What was it doing, during this time? I was stumped by the lack of decent profiling tools for Windows (no "perf" here) and the difficulty of debugging anything that happens during the early stages of program startup. GDB is not much use. But I found that Sysinternals Process Monitor is able to act as a sampling profiler, generating a stack trace every 100ms, which turned out to be enough to figure out what code was running. It was the runtime pseudo relocation system. It kept marking executable code pages as PAGE_EXECUTE_READWRITE in order to modify some relocation entry. Then Windows changed the page type to PAGE_EXECUTE_WRITECOPY. MinGW didn't recognise this as equivalent to PAGE_EXECUTE_READWRITE, so it changed it back. Repeat for each relocation: the kernel and the application were fighting each other. This bug was fixed three years ago, but the fix was not in the version of MinGW I was using. More recent versions have a different bug in the same place! The earlier bug is particularly pernicious because it does not cause a hard failure. Instead it degrades performance, and the circumstances are hard to reproduce and hard to investigate.
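The repeated protection changes can be reproduced in a simplified simulation. This is not MinGW's code. It is a model of the behaviour described above, under the assumption that the check amounted to comparing the page protection against PAGE_EXECUTE_READWRITE alone. The PAGE_* values are the standard Windows memory protection constants; the function names are made up.

```python
PAGE_EXECUTE_READ = 0x20
PAGE_EXECUTE_READWRITE = 0x40
PAGE_EXECUTE_WRITECOPY = 0x80


def is_writable_buggy(protect):
    # Does not treat WRITECOPY as writable.
    return protect == PAGE_EXECUTE_READWRITE


def is_writable_fixed(protect):
    return protect in (PAGE_EXECUTE_READWRITE, PAGE_EXECUTE_WRITECOPY)


def count_protection_changes(num_relocs, is_writable):
    """Count protection changes while patching relocations on one code page."""
    protect = PAGE_EXECUTE_READ
    changes = 0
    for _ in range(num_relocs):
        if not is_writable(protect):
            # The patcher asks for READWRITE, but in this model the kernel
            # reports the page as WRITECOPY afterwards.
            protect = PAGE_EXECUTE_WRITECOPY
            changes += 1
        # The relocation itself would be patched here.
    return changes


print(count_protection_changes(1000, is_writable_buggy))  # 1000
print(count_protection_changes(1000, is_writable_fixed))  # 1
```

Each protection change is a system call in the real program, so one call per relocation instead of one per page adds up quickly.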

In the end, I did get the application working with DLLs. It was much smaller, because so much code was now shared. There was no significant change in build time, load time or execution time, and I was not forced to change any features. But it was hard work, and took much longer than expected.

I wonder whether it might make sense to have Linux-style shared objects on Windows in addition to DLLs. In principle it's possible. The shared object loader is not part of the operating system, and does not require special privileges. Alternative ones can be added. However, DLL support would still be required for interoperability, and perhaps it is silly to have two incompatible shared library systems when DLLs are mostly good enough. Working around the limitations of DLLs is probably more practical.

Summary: DLLs and shared objects are conceptually similar, but there are significant differences in what they can do. Shared objects are better. Some of the limitations of DLLs can be avoided by additional code, like MinGW's "runtime pseudo-relocs", but these can introduce new problems. They can't hide the most fundamental differences such as how symbols are imported and exported between DLLs, and how conflicts are resolved. These differences will always exist because of backwards compatibility.