Sunday, 7 February 2016

RISC instruction sets I have known and disliked

Nobody writes assembly code any more, except when they do, and in the last week my work has taken me into an environment where you can't use C, because it's just too high level. C requires a stack (which isn't available) and the compiler makes its own choices about register allocation, which is really unhelpful because only a few registers are actually available.

The platform in this case was PowerPC. The instruction set is quite good. If you're forced to drop to instruction level, PowerPC is generally straightforward, sensible and comprehensible. You can take a good guess at what the instructions mean, and you'll normally get it right. There aren't many instructions, and doubts can be rapidly resolved by looking in the architecture manual or this nice reference.

But something did catch me out. I wrote some inline assembly code, within C source:
asm volatile ("stw %0, 0(%1)" : "" : "r"(id), "r"(ptr));
Now, this is broadly equivalent to the C code "*ptr = id", but for reasons which I won't go into, I needed to explicitly specify the machine instruction to be used ("stw"). I found that this code would sometimes produce a segmentation fault, but I couldn't figure out why. Instead of writing to the location in "ptr", the store instruction had attempted to write to location 0!

I thought at first that "ptr" must have become corrupted. But it wasn't that. In fact the problem was introduced at compile time. The "r" directive gives the compiler the freedom to choose any register to represent "ptr". Sometimes it picks register r0. And on PowerPC, r0 has special properties. Usually, r0 means general-purpose register (GPR) 0. But for some instructions it means a literal zero. For "stw", if the compiler chooses r0 for %1, the instruction becomes a store to address 0. Consequently, this code will work or not work, depending on the choices made by the register allocator. You have to tell the compiler to NOT use r0. Here's what IBM has to say on the subject:
...if a general register is required by the instruction, you could use either the "r" or "b" constraint. The "b" constraint is of interest, because many instructions use the designation of register 0 specially – a designation of register 0 does not mean that r0 is used, but instead a literal value of 0. For these instructions, it is wise to use "b" to denote the input operands to prevent the compiler from choosing r0. If the compiler chooses r0, and the instruction takes that to mean a literal 0, the instruction would produce incorrect results.
I think this is the only big surprise I encountered in the PowerPC instruction set. It's like an irregular verb in an otherwise completely regular language. r0 is not zero all the time, as in some RISC architectures, but it behaves as a zero register in some special cases, which you just have to know. Like the plural of "sheep" or "mouse".

RISC instruction sets like PowerPC are usually expected to be highly regular, like a spoken language designed from scratch with the intention of being simple to use, whereas the CISC style of instruction set is expected to be highly irregular and full of oddities, like a spoken language that evolved over thousands of years, borrowing words from other languages at any opportunity.

This should mean that RISC assembly languages are easy to understand, while CISC assembly languages are baroque and full of surprises. But actually this is not a general rule at all. CISC assembly languages are often quite nice to work with, because they were designed with the expectation that programmers would use assembly. Whereas RISC assembly languages can be absolutely horrible to use, because they're designed with the expectation that programmers will use C and higher-level languages, assembly coding is the compiler's job, and machine code should be optimised for efficient execution on a simple pipeline.

This applies particularly to the two pioneering RISC instruction sets, namely SPARC and MIPS. MIPS is the worst offender. It deliberately omits a feature which is so fundamental to CPU architecture that software people don't even think about it. The architecture leaves out the mechanism in the CPU pipeline which would otherwise stall execution until the data was ready. A register access which would have created a minor inefficiency on any other architecture instead creates a "hazard" on MIPS. You can read from a register before that register is ready. If you are writing or debugging MIPS code, you have to know how this works. If you do a serious degree-level computer architecture class, it'll probably involve MIPS, mainly so that your professor can set some really tough exam questions about the pipeline.

This "feature" is based on assumptions about the CPU implementation which don't hold if the implementation is scaled up, becoming like the "recent" (i.e. last two decades) x86 CPUs with parallel execution of instructions ("superscalar" execution) and reordering of instructions. At that level, it is a legacy feature which makes the architecture awkward to use. It is much worse than any of the x86 legacy features, because those aren't visible all the time. Who cares if the CPU supports some weird compatibility mode for running 16-bit DOS software, if you're writing 32-bit or 64-bit code? The "real mode" support doesn't affect you at all. A legacy feature which affects every single instruction is not avoidable.

Both SPARC and MIPS share another horrid feature - delayed branches. These create a dependency between instructions, in which the branch takes effect after the next instruction, rather than immediately. When using assembly code, you have to know which instructions have a delayed effect, and what rules apply to the instruction (or, sometimes, instructions) in the "delay slot" following it. The delay slot is restricted in various ways: for instance, you can't put another delayed branch there.

The purpose is to simplify the pipeline, improving efficiency, but like the other MIPS "feature" described above, delayed branches are based on assumptions about the CPU implementation. If you scale up, or even just add a simple branch predictor, those assumptions don't apply any more, and the "delay" feature is just an annoying legacy which the CPU designer has to support, because it affects all of the code ever made for the platform.

SPARC also has a crazy feature all of its own, the "rotating register file", which makes code incredibly hard to understand because it's very hard to follow the movement of data across function calls. On SPARC, the registers are not numbered r0 to r31, they're assigned other names like o1 and g1, and some are remapped at call/return. This legacy feature makes some big assumptions about the typical number of registers that need to be preserved across a call, and forces the programmer to remember which registers get remapped and how they are remapped. I particularly disliked working on a SPARC simulator because you only see the values 0 through 31 in the machine code: you then have to translate these to the SPARC name and then figure out what they mean. (If you do have the misfortune to have to work with SPARC, page 26 of this document explains how the register file rotates, while page 193 shows how to translate between numbers and names. Have fun.).

Later generations of RISC instruction set - PowerPC, ARM and Alpha - don't have any of this nonsense. There are no delayed branches, no rotating register files, and as an assembly programmer, the only reason to understand the details of the pipeline implementation would be to maximise performance. These instruction sets are much nicer to use. Design errors, like having r15 as the program counter or making every instruction conditional, are problems for CPU architects rather than programmers, and it's no surprise that they disappeared in the 64-bit version of the ARM architecture. They must have made it awfully hard to implement superscalar, out-of-order execution.

However, many of the later RISC architectures do share one annoying flaw. Immediate values are constants embedded within the instructions. Sometimes these are used for small values within expressions, but often they're used for addresses, which are 32-bit or 64-bit values. On PowerPC, as on SPARC and MIPS, the mechanism for storing 32-bit immediates can only encode a 32-bit value by splitting it across two instructions: a "load high" followed by an "add". This is a pain. Sometimes the two instructions containing the value are some distance apart. Often you have to decode the address by hand, because the disassembler can't automatically recognise that it is an address. This design pattern turns up on almost all RISC architectures, though ARM does it differently (large immediates are accessed by PC-relative loads). When I worked on an object code analyser for another sort of RISC machine, I gave up on the idea of statically resolving the target of a call instructions, because the target address was split across two instructions, one of which could appear anywhere in the surrounding code.

The x86 system for storing 32-bit/64-bit immediates is much nicer. They just follow the instruction, which is possible because the instruction length is variable. Variable-length instructions are not usually seen in RISC, the Thumb-2 instruction set being the only exception that I know of.

What frustrates me about the whole RISC versus CISC debate is the unfairness of it all. It seems really biased against x86. If you're going to criticise x86 for its legacy features, shouldn't you also mention the legacy features of RISC architectures? The debate seems to be based on 1980s "talking points". There's an Edison/Tesla narrative of "evil empire" versus small competitor, where the small competitor uses good science to do it better, but gets squashed anyway by the evil empire's desire to preserve its monopoly. And there's the question of how much better CPUs might have been, if the "bad guys" hadn't won.

There are elements of truth here, but also major misrepresentations. Actually, once a CPU design scales up, it hardly matters if the instructions can map efficiently onto a 5-stage design like the one in the textbook. At this point, it's probably better to have an efficient instruction encoding, save on memory bandwidth and instruction cache space, and have a comprehensible instruction set. Hence x86.

After 2000, a new meme joined the RISC/CISC debate. RISC had won, as x86 CPUs were now "RISC on the inside". The people who repeated this bizarre claim meant that the newer Intel and AMD designs converted x86 instructions to "RISC-like" micro-instructions before they entered the out-of-order execution unit. But RISC hadn't won at all. CISC had always involved decoding instructions into micro-instructions. RISC was an attempt to abolish the process of converting instructions to micro-instructions, by having simple instructions which fit the hardware well and are easy to decode. But the CISC approach came back because it worked better at scale. Nowadays the out-of-order superscalar PowerPC and ARM CPUs use the x86 approach, not the other way around.

Unfortunately, CPU architecture is ripe for snake-oil. There is hype, there are bandwagons, and there is even flim-flam when someone proposes a new sort of architecture and invites you to subscribe to their Kickstarter. The subject is so complex, CPUs are so hard to compare to each other, and it's very hard to distinguish between good marketing and good engineering. Benchmark results can be selective and may even be totally meaningless. As yet, there is no all-encompassing science of CPU design which would lead to the perfect architecture, because the CPU is just one part of a system comprising compilers, applications, legacy software, input data and user requirements. This system is too large and complex to be understood by any scientific method currently available to humanity. At best we can understand parts of it - how to make a better cache, for instance, or a better branch predictor, for some limited definition of "better", measured against some small subset of software.

As a university researcher I once looked into the briefly-fashionable subject of "hardware/software codesign", where the idea is to come up with the perfect mixture of hardware and software to solve a specific problem. Hardware description languages make it possible to define hardware much like software, so in principle, you can take any algorithmic problem and produce an ideal combination that solves it, minimising time or energy consumption. But in reality the "codesign problem" is nearly impossible to examine scientifically because of the number of degrees of freedom available. And this is merely a small part of the larger problem of building an ideal general-purpose computer architecture.

In conclusion, RISC instruction sets are not always very nice, and x86 is not particularly nasty. A good initial design can create a legacy problem years later - and no CPU architecture is free of that. Be wary of anyone who says they can do better.