Sunday, 10 June 2012

Evolution of MSRS

Carousel, the subject of this paper, implements a multitasking scratchpad reuse scheme (MSRS).

How do you share a scratchpad memory (SPM) between multiple tasks? If it were a cache, task switching would just result in a few extra cache misses as the new task's working set is loaded. That's possible because no part of a program actually needs to be in cache when execution begins: the CPU will wait until it is fetched. It works well, at least on average. In the worst case, given an otherwise predictable computer architecture, we can work out a bound on the number of cache misses. This cost will be quite small for high-priority tasks, because they are rarely (or never) interrupted.

With SPM, the fetching mechanism is not provided automatically. Instead the program or the OS must specify what is to be loaded. If the system is multitasking - if any task may be interrupted at any time to run a higher-priority job - then there is a problem. How can we correctly manage the state of the SPM so that all of the tasks continue to run correctly?

The first obvious answer might be "don't use SPM". Why make life difficult for yourself? Give up and use a cache; deal with cache-related preemption delay separately. But academic problems are no fun if you just assume the current way is the best way. So we assume that we must use SPM - it has advantages of its own - then figure out how.

The second obvious answer is to partition the SPM, giving each task a small slice which is all its own, unchanged by preemptions. That slice is available to the task; the rest is unused, and logically (if not physically) inaccessible. The rationale here might be that only a few tasks are really important. Most tasks don't need SPM space: only the highest-priority ones. But it is not hard to see inadequacies in the idea. What if all the tasks need more SPM space than is available? Disaster!

So, third obvious answer. Treat the whole SPM as part of the context of a task, just like the state of the CPU registers. Every preemption swaps out the current SPM contents (Save), and swaps something else in instead (Restore). Each task can use as much SPM space as necessary, up to the limit of the memory. This is the simplest sort of MSRS, because it allows the same areas of SPM to be used by multiple tasks. But it is inadequate: every context switch forces a full read/write cycle on the whole SPM. Each transfer may be individually small, but the cost is multiplied by every preemption, and some of it may be wasted time, because tasks might not require all of the SPM space.

We can improve the quality of the MSRS by allowing each task to specify the number of SPM blocks it actually requires, and Saving and Restoring a variable-sized portion of the SPM. If the next task only requires 3 blocks, we only need to Save/Restore 3 blocks. It's a bit like the partitioning solution, but dynamic, in the sense that the division between available space and unused space shifts from task to task.
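As a minimal sketch of this variable-sized scheme, imagine a purely software model of the context switch. All the names here (Task, spm_save, spm_restore, BLOCK_SIZE) are illustrative, not the real Carousel interface: the point is only that Save and Restore touch just the blocks each task declared.

```c
/* Illustrative software model of variable-sized Save/Restore.
   Not the real Carousel hardware interface. */
#include <string.h>

#define BLOCK_SIZE 64              /* bytes per SPM block (assumed) */
#define SPM_BLOCKS 16              /* total SPM capacity in blocks  */

static unsigned char spm[SPM_BLOCKS * BLOCK_SIZE];   /* the scratchpad */

typedef struct {
    unsigned char *save_area;   /* external-memory buffer for this task */
    int            num_blocks;  /* blocks this task actually requires   */
} Task;

/* Save: copy out only the blocks the outgoing task uses. */
static void spm_save(const Task *t)
{
    memcpy(t->save_area, spm, (size_t)t->num_blocks * BLOCK_SIZE);
}

/* Restore: copy in only the blocks the incoming task uses. */
static void spm_restore(const Task *t)
{
    memcpy(spm, t->save_area, (size_t)t->num_blocks * BLOCK_SIZE);
}

/* A preemption swaps out the current task's slice, then swaps in the next's.
   If the next task needs only 3 blocks, only 3 blocks are transferred. */
void context_switch(const Task *current, const Task *next)
{
    spm_save(current);
    spm_restore(next);
}
```

The division between used and unused space is decided per task, not fixed at design time, which is what distinguishes this from static partitioning.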


The most recent results tell me that the MSRS idea is not so good for small numbers of tasks, because the time spent in Save/Restore is greater than the equivalent cache-related preemption delay. But for large numbers of tasks, MSRS is preferable. Why is this? Because the impact of one task on another is precisely quantified rather than being overestimated. And because the memory footprint of each task is smaller, as SPM allocation algorithms reuse limited space.

The early work is not so developed. It focuses on the wrong things: it is concerned with showing the flexibility of MSRS to handle both code and data, and insufficiently interested in demonstrating benefits. Since October, when that paper was submitted to RTAS, I have learned:
  • That the concept of MSRS has to be formalised independently of any implementation.
  • That it isn't necessary to organise the hardware that implements MSRS as a stack buffer (or "Carousel"). This is perhaps a useful concept to explain the ideas, but MSRS will also work fine if all the tasks share the SPM from address 0 onwards, which is simpler.
  • That, since the SMMU-like features to support arbitrary data structures are not a necessary part of MSRS, it is best to leave them out of work on MSRS. They actually weakened the first paper through their resemblance to a fully-associative cache.
  • That MSRS has to be implemented in hardware, because software implementations are not competitive with cache. (They are competitive with a write-through cache, but this is arguably an unfair comparison.)
  • That it is a good idea to divide the SPM into read-only space and read-write space.
"Read-only space" means that the memory is only updated by the OS, during task switches. So it is not ROM: it is instead RAM that the application tasks are not allowed to update. This is suitable for machine code and for read-only data (e.g. strings). Read-only space is nice, from an implementation point of view, because the Save step only needs to record the source address of a read-only block: this being the location in external memory that the block was copied from.

"Read-write space" is RAM that can be used as RAM. It can be used for any sort of data, and also code. But the Save step needs to record both the source address and write the contents of each block back to external memory, so that the SPM space can be reused and correctly Restored. This costs extra time. Of course, if the contents of the block cannot change, this time is wasted.

In October, I had a simulated implementation of the original "Carousel" system. The simulator implemented a CPU, an external memory, and a Carousel SPM. Every block of SPM space was considered "read-write", even those including code.


A month ago, I had an FPGA hardware implementation of a "Carousel coprocessor" which implemented read-only space. It permitted three basic operations: Save, Restore, and Load. The Load operation copies new block contents from external memory. This was used for the RTSS submission. It is enough to implement MSRS for code only, because we can switch to a task (Save), switch from a task (Restore), and within a task, jump to a new region (Load).
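A rough software model of that Load operation, for read-only space, looks like this. The names (spm_load_ro, ro_source) and the one-source-address-per-block bookkeeping are my assumptions for illustration, not the coprocessor's actual interface:

```c
/* Illustrative model of Load for read-only space: copy new block contents
   from external memory into SPM and record where they came from. The
   recorded source address is all that Save needs later, since read-only
   contents never have to be written back. */
#include <string.h>

#define BLOCK_SIZE 64    /* bytes per SPM block (assumed)  */
#define SPM_BLOCKS 16    /* read-only SPM size (assumed)   */

static unsigned char  spm_ro[SPM_BLOCKS * BLOCK_SIZE];
static unsigned char *ro_source[SPM_BLOCKS];   /* per-block source address */

void spm_load_ro(unsigned char *external, unsigned first_block, unsigned count)
{
    for (unsigned i = 0; i < count; i++) {
        memcpy(&spm_ro[(first_block + i) * BLOCK_SIZE],
               external + (size_t)i * BLOCK_SIZE, BLOCK_SIZE);
        ro_source[first_block + i] = external + (size_t)i * BLOCK_SIZE;
    }
}
```

With only Save, Restore, and this, code-only MSRS is complete: a task's jump to a new code region is a Load, and preemptions are Save/Restore pairs.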

Now, I have a "Carousel coprocessor" which implements both read-only and read-write space. The list of operations has increased. There are now separate Load operations for read-only and read-write space. There is a Writeback operation to send read-write blocks back to external memory. Save and Restore treat read-only and read-write blocks separately.

I have learned more from creating this implementation. Most importantly, a new operation called Take_Ownership is also required. This marks an area of read-write space as owned by the current task by assigning source addresses to it.

Every block in the SPM has an associated source address. This is used by Save when the block is swapped out. It is then put back by Restore when the block is swapped back in. If a task is going to use a read-write block, it must assign a valid source address to it first, because preemption may come at any time, and any block without a valid source address could be destroyed by Save.
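A software model of Take_Ownership makes the invariant concrete. This sketch is purely illustrative (the real operation lives in the Carousel coprocessor, and rw_source is my name for its per-block tag): it assigns valid external addresses to a run of read-write blocks without copying any data, so that a Save arriving at any moment has somewhere safe to put their contents.

```c
/* Illustrative model of Take_Ownership for read-write space. */
#include <stddef.h>

#define BLOCK_SIZE 64    /* bytes per SPM block (assumed)   */
#define SPM_BLOCKS 16    /* read-write SPM size (assumed)   */

static unsigned char *rw_source[SPM_BLOCKS];   /* NULL = unowned */

/* Mark `count` read-write blocks, starting at `first_block`, as owned by
   the current task, by giving each a source address within `external`.
   No data moves: the task is about to overwrite these blocks anyway. */
void spm_take_ownership(unsigned char *external,
                        size_t first_block, size_t count)
{
    for (size_t i = 0; i < count; i++)
        rw_source[first_block + i] = external + i * BLOCK_SIZE;
}
```

Any block whose tag is still NULL has no valid source address, and that is exactly the state in which a Save could destroy it.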

An example. Here is the Initialize function from matmult.c, one of the MRTC benchmarks.
void Initialize(matrix Array)
{
  int OuterIndex, InnerIndex;

  for (OuterIndex = 0; OuterIndex < UPPERLIMIT; OuterIndex++) {
    for (InnerIndex = 0; InnerIndex < UPPERLIMIT; InnerIndex++) {
      Array[OuterIndex][InnerIndex] = RandomInteger();
    }
  }
}
It creates a matrix of pseudo-random values.

If we want to make this use SPM, we should write the new matrix to SPM, and then copy it into place. Here's how:
void Initialize(matrix Array)
{
  int OuterIndex, InnerIndex;
  matrix * Array_Copy = (matrix *) spm_rw_data[0];

  spm_take_ownership(Array, 0, DATA_SIZE(sizeof(matrix)));
  for (OuterIndex = 0; OuterIndex < UPPERLIMIT; OuterIndex++) {
    for (InnerIndex = 0; InnerIndex < UPPERLIMIT; InnerIndex++) {
      (* Array_Copy)[OuterIndex][InnerIndex] = RandomInteger();
    }
  }
  spm_writeback(Array, 0, DATA_SIZE(sizeof(matrix)));
}
The purpose of spm_writeback is hopefully obvious: this is where Array_Copy is copied out to external memory. Array_Copy itself is allocated in SPM; the address of the read-write part of the SPM is spm_rw_data.
But there is also spm_take_ownership, used before Array_Copy is changed. It is necessary if there is any possibility that a preemption might occur between a change to Array_Copy and spm_writeback. Otherwise, where will Save copy the contents of the read-write part of the SPM? The answer is nowhere. Until spm_writeback, this data has no known location in external memory, so there is no safe place to store it. If a Save occurs before spm_take_ownership, any new data will be lost.

This ownership feature was implicit in the original Carousel implementation, which required each block to be explicitly "opened", a process that involved loading it from external memory. We could implement spm_take_ownership the same way, by explicitly loading Array into SPM. But this would be a waste of time: the matrix is a write-only data structure here. We do not need to load it. We must only notify the MSRS hardware of the source address.

You may have got the impression that using an SPM within a multitasking system is tricky. It is. But the good news is that the problems can be understood and solved. Sharing an SPM between tasks is much like sharing any other resource between tasks (or threads): understand the primitive operations available to you, and consider that preemption may happen at any time.

MSRS is now being written up for a journal submission, due in a few weeks, and the new hardware implementation of Carousel is going to form an important part of the article. But, as I hope I have suggested here, the work that has gone into it is far beyond just VHDL and C. This is not just a trivial extension of the conference paper - it's almost a complete rethink.