On Friday morning I was running a large Ada project through Rapita's tools. I'd just been working on some improvements. My improvements had been accepted by the automated tests, and so it was time for an integration test on a large project to see if everything was still working as expected.
But one of the tools (Adains) was suddenly taking longer than it used to. Something was wrong. Had I broken something? I needed to know what was happening, but ideally without disturbing the test. I didn't want to restart the program, or recompile it, or rerun it inside a performance analysis tool.
My solution was to attach to the tool's process using gdb, the GNU debugger, and print out a stack trace, then detach. This takes a snapshot of what the tool is doing at that particular instant. Here is an example of a stack trace:
Thread 1 (process 10045):Some readers may recognise the application running here. It is Doom. This stack trace shows part of the 3D engine. At the bottom of the stack, you can see the program's entry point (the "main()" function). At the top, you can see the function that is currently running (R_DrawColumn). In between, you can see the path through function calls that were taken to reach R_DrawColumn: you can see, for instance, that a function called R_RenderPlayerView has been called.
#0 0x0000000000424ec5 in R_DrawColumn () at r_draw.c:157
#1 0x0000000000427b4d in R_RenderSegLoop () at r_segs.c:287
#2 0x0000000000428a39 in R_StoreWallRange (start=0, stop=77) at r_segs.c:714
#3 0x0000000000422db0 in R_ClipSolidWallSegment (first=0, last=77) at r_bsp.c:151
#4 0x00000000004231bd in R_AddLine (line=0x7f58fef80c18) at r_bsp.c:355
#5 0x00000000004235f7 in R_Subsector (num=55) at r_bsp.c:539
#6 0x000000000042364b in R_RenderBSPNode (bspnum=32823) at r_bsp.c:563
#7 0x00000000004236b0 in R_RenderBSPNode (bspnum=50) at r_bsp.c:573
#15 0x00000000004236b0 in R_RenderBSPNode (bspnum=129) at r_bsp.c:573
#16 0x00000000004236b0 in R_RenderBSPNode (bspnum=382) at r_bsp.c:573
#17 0x0000000000426ae2 in R_RenderPlayerView (player=0x6999c0) at r_main.c:884
#18 0x00000000004044e1 in D_Display () at d_main.c:270
#19 0x0000000000405914 in root () at d_main.c:1203
#20 0x0000000000404e3c in main (argc=2, argv=0x7ffe2d6c72e8) at d_main.c:776
This information can be captured for any process. Here are the steps required:
- attach to the process using gdb. For example:
$ gdb -p 1234
The parameter 1234 is the process ID to attach to.
- Enter the command "backtrace" - "bt" for short:
- Detach from the process.
$ gdb -ex "set pagination 0" -ex "bt" -batch -p 1234Even a single stack trace is useful. If the program spends most of its time doing X, then logically, at any particular instant, it is very likely to be doing X. However, you may be unlucky. If you misdiagnose a performance problem, you end up wasting a lot of time trying to improve the wrong things. So it's important to get more data.
gdb can be called repeatedly from a bash script, dumping the stack trace into a series of small files:
#!/bin/bash -xeThis is the poor man's profiler. Yes, it even has a web page. It has the core functionality of a typical profiling tool, because it repeatedly captures the stack of the target process. The sample rate is about 10 captures per second, though this will not be precise, because gdb has to be started each time and there is a significant overhead for that. In a "real" profiler, the time delay would be as precise as possible. But the principle is just the same, and in fact, the poor man's profiler is even more useful than many "real" profilers, because it can be used to profile any program - including ones that are already running!
for x in $(seq 1000 1999)
gdb -ex "set pagination 0" -ex "thread apply all bt" \
-batch -p $pid d$x.txt
So, armed with a script very like the one above, I tried to find out why the Adains tool was taking so long. I found the PID of the Adains program, and left the script running while I ate my lunch. When I got back, I had hundreds of samples showing what the tool had been doing.
The next question: how best to analyse them? At the moment my favourite method of visualising program performance is a flame graph. The X axis of the flame graph is "samples", the Y axis is "stack depth", and the items on the graph represent functions. For instance, here is a flame graph for Doom:
Flame graphs are a brilliant way to visualise performance bottlenecks in a program. Problems can be diagnosed effectively. You are unlikely to make a mistake about the cause of a performance problem, because the data is based on many samples, and its presentation does not obfuscate relevant details.
After I returned from lunch, and collected the stack traces produced by gdb, I tried to get them into the flame graph form. It would be possible to analyse them by hand - for example, I could use the Unix "tail" program to see if all of the stack traces had functions in common - but how much easier it would be if they could be processed into the form shown above!
To my pleasant surprise, it turns out that the author of the flame graph software already knows about the "poor man's profiler", and has already implemented support for it. Once you get the flame graph software from Github, only two steps are required to import the gdb output:
$ perl stackcollapse-gdb.pl d*txt tmp.datThis is exactly the process I used to generate the graphs shown above.
$ perl flamegraph.pl tmp.dat output.svg
The only extra step needed to analyse the Adains tool was to add the line:
s/"/'/g;to the stackcollapse-gdb.pl script. This is needed because Ada function names can include the " character - for example, the operator is represented as "" - and if this character turns up in the SVG output file, the SVG does not open ("XML Parsing Error"), so I replace the " character with '. The flame graph allowed the problem to be traced to a particular function which wasn't scaling very well. I'd fixed the problem within a few hours of first seeing it.
Fast problem solving depends on good tools. Credit in this case goes to:
- The "poor man's profiler", which can be used to capture profiling data without restarting or recompiling a program;
- The flame graph software, which can visualise the data captured by the "poor man's profiler", making it easy to diagnose a problem.