The mystery of the fifteen-millisecond breakpoint instruction

Sunday, 5 April 2015

How long does a breakpoint instruction take to execute?

Breakpoints are normally used within a debugger to stop program execution when a particular function or line of code is reached. Sometimes, breakpoints are implemented by telling the CPU to stop executing at a particular address (a "hardware breakpoint") but there is a limit on the number of hardware breakpoints supported by each CPU, and the limit is often quite small (2-4 being typical). More commonly, breakpoints are special instructions which are inserted into your program by modifying the machine code ("software breakpoints"). There is no limit on the number of software breakpoints - the only reason not to use them is if the program cannot be modified, e.g. if it is in ROM.

When reached, a breakpoint instruction stops program execution and sends a signal, e.g. "SIGTRAP" on Unix, which may be received by a debugger or by the program itself.

Invoking the signal handler has some overhead. The CPU must signal the operating system (depending on the architecture, this is called a "fault", a "hardware exception" or an "abort"). The operating system must determine what has happened, generate the signal, and send it to the right place. The process is likely to take a few microseconds: more than a typical instruction, but still very fast.

As part of a small project, I was looking into efficient ways to invoke a signal handler within a program running on a Raspberry Pi. I wanted a method that required only one instruction.

I looked at a few different options. Adding a breakpoint instruction to the software was one of them, along with adding an undefined instruction (causes a "SIGILL" signal on Linux), or an invalid memory access (causes a "SIGSEGV" signal on Linux - this is the famous "Segmentation fault" signal).

My experiment was carried out by a program that would repeatedly invoke the signal handler by executing an undefined instruction, a breakpoint, or an invalid memory access. It would attempt to do this 100000 times, and then print the mean execution time for each invocation. Most of the options had very similar performance.

Signal handler invocation time (mean of 100000)
Undefined instruction 0xf7f0a000
9 microseconds
Breakpoint instruction 0xe1200070
14.8 milliseconds (mean of 100)
Store to invalid address
10 microseconds
Load from invalid address
10 microseconds

Notice the "odd one out" - the breakpoint instruction. It took 1500 times longer than the other options!

The two opcodes, 0xf7f0a000 and 0xe1200070, were found by searching the source code of GNU binutils, e.g. this file at line 950. The opcode 0xe1200070 corresponds to the mnemonic "bkpt", and the ARM Architecture Reference Manual confirms this (see page A4-14 here). It is not unreasonable to think that this instruction is intended to be used as a breakpoint, and indeed, it behaves like one (generating SIGTRAP). But it appears to be incredibly slow!

A slow breakpoint instruction would be virtually unusable in a debugger. Think about the common debugging scenario where the user sets a conditional breakpoint: "break if x > y". Testing that condition is going to take 15ms each time.

I looked at the GNU debugger, "gdb". What breakpoint instruction does it use? To find out, I wrote another program that repeatedly printed out the value of a code address. I then ran this program inside "gdb" and set a breakpoint at that address. It turned out that gdb uses a completely different instruction as a breakpoint. gdb's choice is 0xe7f001f0.

Taking this back to my first program, I now get an invocation time that's similar to the other options:

Signal handler invocation time (mean of 100000)
Breakpoint instruction 0xe7f001f0.
9 microseconds

This is a better breakpoint instruction! GNU objdump tells me that the mnemonic is "udf #16". It seems my initial choice of the "obvious" breakpoint instruction was wrong - as might be expected, the "gdb" developers knew something I didn't.

But there is still a mystery to solve: what is "bkpt" doing? Why is it so slow?

I needed to find out what was happening during those 15ms. What was the CPU doing? Running kernel code? Blocked waiting for hardware? Running another process?

I tried running "dmesg" to look for anything out of the ordinary, but didn't see anything there. Instead, the first clue came from "top":

bkpt.exe repeatedly executes the mysterious "bkpt" instruction, and "top" tells me that this is using 92.5% of the CPU time. What else is using CPU time? "rsyslogd" is using 3.9%. Something is being written to the system log. "tail /var/log/syslog" tells me more:

Apr 4 22:50:59 heating kernel: [14572.940927] Unhandled prefetch abort: breakpoint debug exception (0x002) at 0x00008438

This is part of the side effect of "bkpt". Every time it's executed, a message is generated by the kernel and written to syslog. Writing to syslog accounts for about 1ms of the time overhead, because if I "killall rsyslogd", the invocation time drops as follows:

Signal handler invocation time
Breakpoint instruction 0xe1200070
(no syslog daemon)
13.9 milliseconds (mean of 100)

This leaves 13.9ms unaccounted for. What is the CPU doing, during this time?

The "grep" tool tells me that the message "Unhandled prefetch abort" is generated by the following function inside the ARM-specific part of the Linux kernel (arch/arm/mm/fault.c):

do_PrefetchAbort(unsigned long addr, unsigned int ifsr, struct pt_regs *regs)
    const struct fsr_info *inf = ifsr_info + fsr_fs(ifsr);
    if (!inf-fn(addr, ifsr | FSR_LNX_PF, regs))
            "Unhandled prefetch abort: %s (0x%03x) at 0x%08lx\n",
            inf-name, ifsr, addr);
    arm_notify_die("", regs, info, ifsr, 0);

The function uses a lookup table, named "ifsr_info", to find an entry for the event triggered by "bkpt". It then calls a function within that table ("inf-fn"). If the function returns zero, then the fault is handled, and program execution continues normally. However, if the function returns non-zero, then some fallback code is executed. The message is printed by "printk". Could "printk" account for 13.9ms?

Perhaps surprisingly, the answer is yes. I can demonstrate this by extending "inf-fn" to handle the "bkpt" instruction. The process is as follows. Firstly, I search for the place where "inf-fn" is set up. I find it in arch/arm/kernel/hw_breakpoint.c:

hook_ifault_code(FAULT_CODE_DEBUG, hw_breakpoint_pending, SIGTRAP,
TRAP_HWBKPT, "breakpoint debug exception");

Here is the contents of hw_breakpoint_pending:

static int hw_breakpoint_pending(unsigned long addr, unsigned int fsr,
struct pt_regs *regs)
    int ret = 0;
    u32 dscr;

    ARM_DBG_READ(c1, 0, dscr);

    /* Perform perf callbacks. */
    switch (ARM_DSCR_MOE(dscr)) {
            breakpoint_handler(addr, regs);


            watchpoint_handler(addr, fsr, regs);

            ret = 1; /* Unhandled fault. */

    return ret;

In this code, the Linux kernel looks at a special ARM register named "DSCR" to determine what sort of breakpoint has been reached. Three possibilities are handled by the code - all other possibilities result in the "Unhandled prefetch abort" message.

A quick search for "DSCR register" reveals an interesting page in ARM's documentation. In table 13.7, I find a list of possible values for the Entry field. Here are the first few:

Entry value
Reason for entering debug
A Halt DBGTAP instruction occurred
A breakpoint occurred
A watchpoint occurred
A BKPT instruction occurred

My version of the Linux kernel handles entry values 1 (ARM_ENTRY_BREAKPOINT) and 2 (ARM_ENTRY_ASYNC_WATCHPOINT) but does not handle entry value 3 - "A BKPT instruction occurred". The fallback path is used for this event, costing 13.9ms.

I modified "hw_breakpoint_pending" as follows:

        breakpoint_handler(addr, regs);
    case 3:

        info.si_signo = SIGTRAP;
        info.si_errno = 0;
        info.si_code = 0;
        info.si_addr = (void __user *)addr;

        arm_notify_die("", regs, info, fsr, 0);

        return 0;

My new code handles the "bkpt" instruction just like the old code did - calling "arm_notify_die" - but it doesn't call printk. And the problem is resolved:

Signal handler invocation time (mean of 100000)
Breakpoint instruction 0xe1200070
(modified "hw_breakpoint_pending")
9 microseconds

The problem was "printk". In C, many printing functions are actually interpreters for an embedded language. They interpret formatting directives such as %08x or %1.3e and generate appropriate output, such as an 8-digit hexadecimal number, or a real number in engineering notation. printk includes such an interpreter, and it adds some inefficiency. But not on the order of 13ms!

Actually, "printk" is slow because it is synchronously writing to the console. The extra time is required to write characters to the Raspberry Pi's user interface, with some scrolling where necessary. For an error message, or an important warning, this is good behaviour. If the message were written to a buffer instead of the console, it might be lost entirely - the kernel might crash immediately after calling "printk", and the user would not have any clue about the cause of the crash. If I had a monitor connected to my Raspberry Pi, the output would be useful, and would have led me to the problem without a "syslog" detour. A serious warning about an unhandled exception would certainly qualify for "printk": the priority level is "KERN_ALERT", one step above "KERN_CRIT" (for a critical error)!

However, I'm not convinced that "bkpt" should trigger such a warning. "bkpt" is a logical, obvious choice of breakpoint instruction, straight out of the ARM Architecture Reference Manual. Perhaps the Linux kernel should handle it better. This instruction could provide a mechanism for a denial-of-service attack against an ARM Linux system, because unprivileged user-space code is able to force the kernel to waste a lot of time printing messages, slowing the whole system down. If I had any Android devices, I would test it on them.

Conclusion: There is more than one breakpoint instruction on recent ARM platforms. At least two are available on the Raspberry Pi's ARM CPU. On Linux, the obvious choice of "bkpt" is slow, because it causes a message to be printed to the console each time it occurs. Rather than using "bkpt", you should reuse the breakpoint instruction used by a debugger such as "gdb". Like any mature CPU architecture, ARM usually provides more than one way to do something. Some ways are better than others, and the obvious way isn't necessarily the best.