Have you ever wondered what the fate of the Cell processor would have been if the software written for it had been built with the compilers we have today? In this article, I will go over the complications and limitations of the Cell processor, and how a modern compiler could drastically help optimize code for it.
Before we proceed, let’s quickly take a look at the design of the Cell processor:

The Cell processor consists of a scalar, dual-threaded PowerPC core (PPE) and eight vector coprocessors (SPEs) that communicate over an internal ring bus (EIB). The PPE runs the main program and dispatches tasks to the SPEs, while the SPEs handle the computation-heavy, parallelizable work.
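To give an idea of what that split looked like in practice, here is a minimal, hedged sketch of PPE-side code creating an SPE context and running a program on it with libspe2 from the Cell SDK. The embedded program handle `spe_hello` is a placeholder name for an SPU executable linked into the binary, not something from any real project:

```c
/* PPE-side sketch (assumes the Cell SDK's libspe2).
 * spe_hello is a hypothetical SPU program embedded at link time. */
#include <libspe2.h>
#include <stdio.h>

extern spe_program_handle_t spe_hello;

int main(void)
{
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    if (ctx == NULL) {
        perror("spe_context_create");
        return 1;
    }

    if (spe_program_load(ctx, &spe_hello) != 0) {
        perror("spe_program_load");
        return 1;
    }

    /* Runs the SPU program to completion on one SPE; a real engine would
     * typically spawn one pthread per SPE context to keep all eight busy. */
    unsigned int entry = SPE_DEFAULT_ENTRY;
    if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0) {
        perror("spe_context_run");
        return 1;
    }

    spe_context_destroy(ctx);
    return 0;
}
```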
However, coding for the Cell processor was neither easy nor quick. Each SPE had only 256 KB of local store, the only memory it could access directly, so data had to be moved to and from main memory via DMA transactions. These transfers travelled over the EIB's ring topology, which could become a severe performance and efficiency bottleneck unless its capacity was kept fully utilized, which meant making sure every SPE had processed its allocated data before the next cycle started. It is said that this was the reason why Polyphony struggled with the development of Gran Turismo 5.
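For illustration, this is a hedged sketch of what a single SPE-side DMA fetch looked like with the MFC intrinsics from the Cell SDK; the buffer size, alignment, and the `fetch_chunk` helper are my own placeholders:

```c
/* SPU-side sketch (assumes the Cell SDK's spu_mfcio.h intrinsics). */
#include <spu_mfcio.h>

#define CHUNK_SIZE 16384   /* must fit comfortably in the 256 KB local store */

/* Local-store buffer; DMA wants 16-byte (ideally 128-byte) alignment. */
static char buffer[CHUNK_SIZE] __attribute__((aligned(128)));

void fetch_chunk(unsigned long long ea /* effective address in main memory */)
{
    const unsigned int tag = 1;

    /* Start an asynchronous DMA "get" from main memory into local store. */
    mfc_get(buffer, ea, CHUNK_SIZE, tag, 0, 0);

    /* Block until every transfer with this tag has completed. */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    /* ... process buffer, then DMA the results back with mfc_put() ... */
}
```

In practice, engines went further and double-buffered these transfers so one chunk was being processed while the next was already in flight.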
The compilers available at the time were not capable of generating efficient code for the SPEs. For example, they could not automatically perform instruction reordering, vectorization, parallelization, effective register allocation, or loop merging and splitting, among other techniques.
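To make this concrete, vectorizing even a trivial loop meant writing SPU intrinsics by hand, roughly like the following sketch (assuming the Cell SDK's `spu_intrinsics.h`; the array layout, alignment, and element count being a multiple of four are my assumptions):

```c
/* SPU-side sketch: a hand-vectorized a[i] += b[i] loop.
 * Assumes both arrays are 16-byte aligned and sized in whole vectors. */
#include <spu_intrinsics.h>

void add_arrays(vec_float4 *a, const vec_float4 *b, int n_vectors)
{
    for (int i = 0; i < n_vectors; i++) {
        /* Each spu_add processes four floats at once. */
        a[i] = spu_add(a[i], b[i]);
    }
}
```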
These tasks were instead left to the programmer, who had to write specific code for each SPE and manage the communication between them. This turned out to be extremely time-consuming and tedious, and as a result many games took a long time to demonstrate the Cell's potential. Good examples of this are the original MotorStorm and MotorStorm: Pacific Rift.
The original MotorStorm made little to no use of the SPEs (at least graphically; the Havok middleware did use the SPUs for physics) and relied as much as possible on the RSX, and rightly so, since the game had to be released as soon as possible:
MotorStorm: Pacific Rift, by contrast, makes intensive use of the SPUs to achieve graphical improvements across the board, such as greatly improved lighting, ambient occlusion, particles, a very well-executed motion blur, anti-aliasing, and astonishing physics:
Anyway, I’m getting off the point.
In theory, code generated by a modern compiler could achieve much better performance than code written by hand. The programmer could then focus on the game's logic or its art rather than on low-level quirks.
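Coming back to the earlier loop: written as plain, portable C, it is something a modern compiler (GCC or Clang at -O3, for instance) can vectorize and unroll on its own, with no intrinsics. This is only a sketch of the general idea, not Cell-specific code:

```c
/* Plain scalar C: a modern optimizer can vectorize and unroll this loop
 * automatically, freeing the programmer from hand-written intrinsics.
 * The restrict qualifiers promise the arrays do not overlap, which is
 * what makes the automatic vectorization safe. */
void add_arrays(float *restrict a, const float *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}
```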
TL;DR
The Cell processor was very impressive for its day (at least until unified shaders and CUDA came along), but coding for it was complex and time-consuming with the tools that existed at the time (although both the tools and the programmers' experience improved considerably over the years). With the compilers we have today, however, we could simplify software development for the Cell and get much better performance out of it.
