Thursday, 9 July 2015

Redirecting system calls from a C library

C programmers will be familiar with the "front" end of the C library, which implements functions such as "printf", "memcpy" and "exit". These functions form the interface between the program and the library.

The "back" end is less familiar. This is the interface between the library and the platform. Usually, this means system calls. These request things from the OS, e.g. "open a file", "allocate memory", "write string X to file Y". Here is a list of system calls supported by Linux on x86_64. There are many. For last week's post, I implemented a small subset of these calls for TempleOS.

How did this implementation work? To explain that, I'll start with an example of how system calls are normally generated by the C library. Then I'll explain how I changed it, to redirect the system calls to TempleOS.

I'll pick the time function, which returns the number of seconds since "the epoch". Here's how a C programmer might use it:
time_t x, y;

x = time (NULL);
do_something ();
y = time (NULL);
printf ("do_something() took %d seconds\n", (int) (y - x));

time is a C library function with a close correspondence to Linux system call number 201 (on x86_64). Here is how the time function is implemented in uClibc:

000000000000e6c8 <time>:
e6c8: b8 c9 00 00 00 mov $201,%eax
e6cd: 0f 05 syscall
e6cf: c3 retq

This tiny function requests the time from the Linux kernel via system call 201. The system call number, 201, is loaded into the %eax register (writing %eax also clears the upper half of %rax), and then a special instruction, syscall, is executed. This hands control to the OS, much like a "software interrupt". When the kernel has finished, the function returns.
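The same request can be made from C without going through the time wrapper. The sketch below is not taken from uClibc. It uses the generic syscall() function that Linux C libraries provide, and the name raw_time is made up for illustration. It assumes a platform where SYS_time is defined, as it is on x86_64 (where its value is 201), and falls back to time() elsewhere.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel for the time directly,
 * bypassing the C library's time() wrapper. */
long raw_time(void)
{
#ifdef SYS_time
    /* On x86_64, SYS_time is 201: the number that the uClibc
     * function loads into %eax. */
    return syscall(SYS_time, NULL);
#else
    /* Some architectures have no "time" system call; fall back. */
    return (long) time(NULL);
#endif
}
```

Calling raw_time() should give the same number of seconds as time(NULL), give or take a tick of the clock between the two calls.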

With the help of the OS, it is possible to intercept system calls. A system call trace is a great help for debugging (see the strace program). Redirection is also possible. User-mode Linux uses it to virtualise programs, redirecting their system calls from the host Linux kernel to a guest Linux kernel that runs as an application on your machine (in "user" rather than "kernel" mode - hence the name). When user-mode Linux first appeared, I thought it was very impressive and clever. (If you wondered why you never heard of it, it may be because hardware virtualisation made it mostly irrelevant.)

I didn't rely on intercepting system calls. Instead I just changed the parts of the C library that generate system calls, replacing the syscall instruction with a normal function call, like this:

000000000000e6c8 <time>:
e6c8: b8 c9 00 00 00 mov $201,%eax
e6cd: e8 de 19 ff ff callq b0 <_tos_syscall>
e6d2: c3 retq

This change was made inside the system-specific header files of uClibc, specifically "libc/sysdeps/linux/x86_64/sysdep.h". One change was enough to support most system calls, since the interface between the system calls and C is mainly generated by a preprocessor macro. A few additional changes were made for "irregular" system calls which take many parameters (e.g. mmap) or do something special requiring extra assembly code (e.g. vfork). These system calls aren't supported in my project, so I just arranged for them to exit the program with an "unsupported system call" error.
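To illustrate why one change was enough, here is a minimal sketch of the idea in plain C. It is not the uClibc code: the macro DEFINE_SYSCALL1, the function redirected_syscall and the wrappers my_time and my_close are hypothetical names, and returning -1 for an unknown number is just a placeholder. Every wrapper is stamped out by the same macro, so changing the macro body changes how all of them reach the platform.

```c
#include <stddef.h>

/* Hypothetical stand-in for the platform's system call handler. */
static long redirected_syscall(long number, long a1, long a2, long a3)
{
    (void) a1; (void) a2; (void) a3;
    return number == 201 ? 1400000000L : -1L;
}

/* One macro generates every one-argument wrapper. Replacing the body
 * here (a function call instead of a "syscall" instruction) redirects
 * all the wrappers at once. */
#define DEFINE_SYSCALL1(name, number, type1)                      \
    long name(type1 arg1)                                         \
    {                                                             \
        return redirected_syscall((number), (long) arg1, 0, 0);   \
    }

DEFINE_SYSCALL1(my_time, 201, long *)    /* like time(), number 201 */
DEFINE_SYSCALL1(my_close, 3, int)        /* like close(), number 3 */
```

With these definitions, my_time(NULL) goes through redirected_syscall with number 201, and my_close(5) goes through it with number 3, without either wrapper being edited by hand.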

The _tos_syscall function was implemented as follows (in setup.s):

.org 0x010
syscall_address: # qword [2]
# Filled in by loader (at runtime):
.quad 0
...
_tos_syscall:
mov %rdx, %rcx # syscall arg 3 - Linux arg 4
mov %rsi, %rdx # syscall arg 2 - Linux arg 3
mov %rdi, %rsi # syscall arg 1 - Linux arg 2
mov %rax, %rdi # syscall number - Linux arg 1
jmpq *syscall_address(%rip)

The first four instructions translate between the calling convention for system calls and the calling convention for normal Linux functions. The system call number (in %rax) becomes the first argument.

The "jmpq" instruction jumps to the system call handler inside TempleOS code. When loading the program, the loader stores the absolute address of the system call handler function at the label "syscall_address". The "jmpq" instruction loads this address and jumps to it, transferring control to TempleOS code.
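In C terms, the slot at syscall_address behaves like a function pointer that starts out empty and is filled in before the program runs. The sketch below mirrors that arrangement under assumptions: handler_fn, loader_install and fake_handler are invented names, and a real loader writes the address into the program image rather than calling a function.

```c
#include <stddef.h>

typedef long (*handler_fn)(long number, long a1, long a2, long a3);

/* Plays the role of the ".quad 0" slot: empty until the loader fills it. */
static handler_fn syscall_address = NULL;

/* Hypothetical loader step: store the handler's absolute address. */
void loader_install(handler_fn handler)
{
    syscall_address = handler;
}

/* Plays the role of _tos_syscall: the system call number becomes the
 * first ordinary argument, then control passes through the slot. */
long tos_syscall(long number, long a1, long a2, long a3)
{
    return syscall_address(number, a1, a2, a3);
}

/* Hypothetical handler used only to exercise the sketch. */
long fake_handler(long number, long a1, long a2, long a3)
{
    return number + a1 + a2 + a3;
}
```

After loader_install(fake_handler), a call such as tos_syscall(201, 1, 2, 3) reaches fake_handler with the number as its first argument. The program itself never needs to know where the handler lives.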

In that code, seen here, I save all registers on the stack and translate the calling convention again, this time from Linux to TempleOS. Why translate twice? There's no good reason; it's a historical leftover. Before I wrote the TempleOS loader, I wrote loaders for Linux and Windows, and I wanted to keep them working without needing to update them.

The system calls are handled by a switch statement, seen here. Most system calls have their own function, written in the TempleOS language HolyC. The system calls support a maximum of three parameters due to the calling convention translation - I could improve this, but so far, there has not been any need to do so. The time system call is currently handled as follows:

U64 TL_Syscall(U64 number, U64 p0, U64 p1, U64 p2, U64 * rax)
{
    switch (number) {
        case 201:
            // time
            return 1400000000;
        ...

As you can see, it's not a complete implementation! I figured my programs didn't really need to know what time it is. They just needed the system call to do something sensible.
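For readers who don't know HolyC, the same shape in C looks roughly like this. It's a sketch, not a translation of the real handler: the name handle_syscall is invented, and returning -ENOSYS for unknown numbers is an assumption (it is what Linux does), not necessarily what the TempleOS code does.

```c
#include <errno.h>
#include <stdint.h>

/* Sketch of a system call dispatcher with at most three parameters. */
int64_t handle_syscall(uint64_t number, uint64_t p0, uint64_t p1, uint64_t p2)
{
    (void) p0; (void) p1; (void) p2;    /* unused by the cases shown */

    switch (number) {
    case 201:                            /* time */
        return 1400000000;               /* fixed value, as in the HolyC code */
    default:
        return -ENOSYS;                  /* assumption: reject the rest */
    }
}
```

Adding support for another system call means adding another case, which is why most of them can live in their own small functions.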

I have had some time to do a little more work on this project. I was encouraged by the positive response and I may get something else running in the future.

I was able to get Doom running in a headless mode: not interactive, no graphics. It took a while. There was a memory corruption issue that showed up while Doom was running, and tracking it down taught me about the "red zone": the 128-byte area below the stack pointer where functions are allowed to store temporary data, according to the x86_64 calling convention. GCC will use the red zone unless told not to by the "-mno-red-zone" option. I eventually found out that some functions in Doom and uClibc were using the red zone for local variables. Unfortunately, event handlers in TempleOS were also trying to use the space, corrupting the local variables! Until then I had assumed that nothing was ever stored below %rsp, specifically because this area may be used by event handlers and a program never knows when those may be called.
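One way to see the red zone for yourself is to compile a small leaf function (one that calls nothing else) and read the generated assembly. The function below is an invented example, not code from Doom or uClibc. With GCC on x86_64 at -O0, the locals of such a function are typically addressed at negative offsets from the frame pointer with no instruction reserving the space; adding -mno-red-zone typically makes a "sub $N,%rsp" appear in the prologue. Exact output varies with compiler version and options.

```c
/* Leaf function: it makes no calls, so the compiler is free to keep
 * "scratch" in the red zone instead of adjusting %rsp. */
int sum_of_squares(int a, int b)
{
    volatile int scratch[2];    /* volatile forces real memory accesses */

    scratch[0] = a * a;
    scratch[1] = b * b;
    return scratch[0] + scratch[1];
}
```

To compare, save this as (say) redzone.c, run "gcc -O0 -S" on it with and without -mno-red-zone, and diff the two assembly files.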

The set of GCC options known to be needed when compiling for TempleOS is now:
-fPIC -mfpmath=387 -mno-red-zone
(Position-independent code, x87 rather than SSE for floating-point math, and no red zone.)