I found this very interesting article by Dan Luu in which he discusses performance problems which cannot be analysed using a sampling profiler - because the rate of sampling is just not fast enough. The article is all about online services, but the problem described here is actually the "worst-case response time" problem. To me, this is an embedded systems problem. It applies to real-time systems where safety requires the software to respond within a known time interval - like the computer controlling a jet engine, for instance. It is known to be a very hard problem, and the solutions described in the article (tracing) are exactly the same solutions applied at the company where I work! But we're doing it for embedded systems. It's very interesting to read that the exact same problems exist in online services, and at sufficient scale, they can be extremely costly.
I thought it might be interesting to share an example of tracing in action. The following video is taken from Doom. The game source code has been instrumented so that function calls and returns are timestamped. The video shows the game rendering certain scenes in real time, as if the CPU speed were slowed from 800MHz to just 48kHz. At this low speed, you can see what the game is doing as it draws each scene.
This video is possible because a trace was used to capture detailed timing information. I couldn't actually reduce the CPU speed to 48kHz, so I ran the game twice. The first pass created the timing trace. The second pass created the video. A video frame is captured every 1600 clock cycles (2 microseconds) as determinedusing the timestamps recorded in the trace.
A profiler would just show that certain rendering functions are called very frequently - this is helpful for reducing the average time required to draw a scene, but not for reducing the longest possible times required for certain scenes. A trace has enough detail to reconstruct everything. We can see almost exactly what the game is doing with each CPU clock cycle. If anything takes longer than expected, even only once, there are analysis tools to help you understand why.
You can see that the "long tail" has been captured by the tracing software in the following graph. It shows the total time needed to render each of the frames produced by the game while playing a demo:
Almost all game frames take less than 4 million clock cycles to draw, and the average is 2.3 million. But there are some which take much longer - 39 million in one case! Fortunately, the explanation is straightforward. These frames represent transitions between levels, and the extra time is the time required to load the new level from disk. The game is not playable while the next level is loading, so the extra latency doesn't matter to the player.
Though it's more difficult to capture a detailed timing trace than to capture a profile, you get much more information with a trace, and that extra information can reveal things about the program that profiling cannot show, such as the long tail of latency. This information may be used to estimate and reduce the worst-case response time of the code.