Section Six: Beyond RISC - Search for a New Paradigm
Part I: Philips TriMedia - A Media processor (1996)

The Philips TriMedia is one of the most successful of a wave of "media processors" introduced at roughly the same time, intended to perform video and audio processing tasks - similar to digital signal processors, but utilizing significantly more advanced marketing terminology. These included products such as the Mpact and Mpact 2 from Chromatic Research (1996, abandoned July 1998, the company was later bought by ATI Technologies), as well as those developed by companies largely for their own use, such as Matsushita's MCP (1997/8?). The "media processor" generation of DSPs are generally distinguished by using fixed or variable length VLIW designs, and often parallel subword operations (also known as SIMD). Often support devices ranging from digital/analog converters to video processors were included on-chip.
Like many, the Trimedia can include various peripherals and on-chip memory, such as video and audio in and out, a decompressor, and an image coprocessor, which performs colour conversion and display masking independently of the CPU (called the DSPCPU in TriMedia terms). The CPU has 128 general purpose registers (R0 to R127, integer or floating-point values - 32-bit on TM1000, 64-bit on later versions). R0 is wired to contain 0, R1 contains 1, the PC register is separate. Like the ARM and TMS320C62x, all instructions are predicated ("guarding" in TriMedia terms, using register 1/0 (true/false) values - R1 is always 1 and can be used as the default predicate). Integer operations support wraparound and saturation arithmetic (no traps). Subword operations include a complete set of integer math, merge/permute and pack operations. Floating point exception traps can be enabled individually, and all exceptions can either set or accumulate in status bits which can be checked or cleared later. Load and store operations need to be size-aligned (16-bit load/store must be aligned on a 16-bit boundary, 32-bits on a 32-bit boundary), and loads don't generate exceptions - an implementation specific error value is returned to allow for speculative loads.
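The difference between wraparound and saturation arithmetic, and the effect of a guard register, can be sketched in a few lines of Python. This is only an illustration: the function names are invented here, 32-bit registers are assumed (as on the TM1000), and testing the low bit of the guard register is an assumption of the sketch rather than something stated above.

```python
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1

def add_wrap(a, b):
    """32-bit two's-complement add that wraps around on overflow."""
    s = (a + b) & 0xFFFFFFFF
    return s - (1 << 32) if s & 0x80000000 else s

def add_sat(a, b):
    """32-bit add that clamps to the representable range instead of wrapping."""
    return max(INT32_MIN, min(INT32_MAX, a + b))

def guarded_write(regs, guard, dest, value):
    """Write value to regs[dest] only when the guard register holds 'true'.

    Assumption: 'true' means the low bit of the guard register is 1.
    """
    if regs[guard] & 1:
        regs[dest] = value

# 128 registers, with R0 wired to 0 and R1 wired to 1.
regs = [0, 1] + [0] * 126
guarded_write(regs, 1, 5, add_sat(INT32_MAX, 1))   # guard R1: always executes
guarded_write(regs, 0, 6, add_sat(INT32_MAX, 1))   # guard R0: never executes
```

With R1 as the guard the write always happens, which is why it works as a default predicate; with R0 as the guard the operation is effectively a no-op.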
The TriMedia takes inspiration from the Multiflow VLIW computers (mid 1980s), and is designed for VLIW implementations - the TM1000 has 27 execution units (including 2 load/store) and a five instruction word (28-bit instructions), which would be hard to put to use without VLIW. Speculative execution is supported for software - for example the TM1000 has three branch units, even though only one branch can execute at once - at least two potential branches must be blocked by predicates. The TriMedia also tries to reduce branch penalties with a three-cycle branch delay - the three instruction words (up to five instructions each) following a branch will always be executed, much like the smaller branch delays in the MIPS and HP-PA processors.
The NX-2700 (2000?) added a MIPS CPU core to the processor.
VelociTI is TI's variable length instruction group version of VLIW, implemented in the TMS320C62x (integer) and TMS320C67x (late 1998, added floating point) DSPs. Each instruction has a bit which indicates whether the next instruction is part of the same group, and branches can arrive in the middle of a group (only following instructions in the group will be executed, as if execution were sequential). 32-bit instructions are fetched in 256-bit packets (eight instructions), but groups can't cross packet boundaries (NOPs are needed in this case to keep the packets aligned, but multiple groups may be in a single packet). There are eight functional units consisting of two data-address 32-bit adders (named .D1 and .D2), two 16-bit multiply (32-bit and floating point in C67x) units (.M1, .M2), two 32/40-bit ALU (and some FPU in C67x) units (.L1, .L2), and two 32/40-bit ALU/branch (other FPU in C67x) units (.S1, .S2). Up to eight instructions can be dispatched at once to functional units chosen at compile time (not dynamically) - compile-time scheduling like this removes most of the decoding complexity - the 320C6201 used only 550,000 transistors (improving performance about 10x on FFT benchmarks).
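The grouping bit described above can be modelled as a simple scan over a fetch packet. The sketch below is illustrative only: instructions are represented as hypothetical (name, bit) pairs, where a bit of 1 means "the next instruction belongs to the same group".

```python
def execute_groups(packet):
    """Split a fetch packet into groups of instructions issued together.

    packet: list of (name, p_bit) pairs; p_bit == 1 chains the next
    instruction into the current group, p_bit == 0 ends the group.
    """
    groups, current = [], []
    for name, p_bit in packet:
        current.append(name)
        if not p_bit:
            groups.append(current)
            current = []
    if current:
        # Groups can't cross packet boundaries, so the end of the
        # packet always closes whatever group is still open.
        groups.append(current)
    return groups

packet = [("a", 1), ("b", 1), ("c", 0), ("d", 0),
          ("e", 1), ("f", 1), ("g", 1), ("h", 0)]
groups = execute_groups(packet)
```

Here one packet holds three groups of three, one, and four instructions, so it takes three cycles to issue.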
Registers and functional units are split in two, with sixteen 32-bit registers and one set of functional units on each side. Single registers contain 32-bit integers (or single precision floating point numbers in the C67x), register pairs can be combined for 40-bit integer (a standard DSP format), and 64-bit floating point operations in C67x. Functional units have complete access to all registers on the same side (four reads and one write per register each cycle), with two data cross buses allowing one functional unit on each side to access one register on the opposite side per cycle (except the .Dx units, which only access registers on the same side, though the results can be used as addresses to load or store to either register bank - but only one store per bank each cycle (only one load per bank in C62x)). Control registers are part of side B.
Like the ARM, all instructions are predicated, supporting speculative execution (registers B0, B1, B2, A1, A2 can be checked for zero or non-zero, reducing the number of predicate bits needed) - the two halves of the CPU can be used to completely execute each side of a branch, and the correct result can be chosen at the end. DSPs typically do not use MMUs, so load/store exceptions do not need to be taken into account.
DSP-like features include saturation arithmetic, and circular addressing (registers 4 to 7 in each bank can be given one of two programmer-defined (power of two) block sizes - incrementing or decrementing a register beyond the end of the block causes it to wrap around).
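Circular addressing with a power-of-two block size reduces to masking the low address bits. A minimal sketch, assuming the block is aligned to its own size (an assumption of this example, and the function name is made up):

```python
def circ_step(addr, step, block_size):
    """Advance addr by step, wrapping within a power-of-two block.

    Assumes the block starts at a multiple of block_size, so the high
    bits of the address select the block and the low bits the offset.
    """
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    mask = block_size - 1
    base = addr & ~mask              # high bits stay fixed
    offset = (addr + step) & mask    # low bits wrap around
    return base | offset

# A 32-byte buffer at 0x1000: stepping past either end wraps around.
next_addr = circ_step(0x1000 + 28, 8, 32)
prev_addr = circ_step(0x1000, -4, 32)
```

This is why the block sizes are restricted to powers of two: the wrap costs nothing more than keeping the upper address bits unchanged.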
When Intel and Hewlett-Packard announced that they would co-develop a successor to the 80x86 and PA-RISC architectures which would retain compatibility with both, and still introduce a revolutionary new type of processor, this raised the curiosity of a large number of people. At its introduction the concept of "RISC" was seen as so inherently superior to the older (and less than stunning, even for a "CISC") 80x86 that proponents predicted that the older architecture's dominance in business systems would soon end, but various factors (foremost the inability of Microsoft's Windows OS to hide processor and hardware dependencies, requiring complete compatibility) maintained demand for the 80x86, which in turn provided the revenue to invest in design improvements allowing it to remain competitive.
It was assumed that Intel's strategy would be to maintain market demand for its existing architecture as long as possible to the exclusion of all else, including its own RISC processors, which made the announcement that it would co-develop a replacement to its largest revenue generating product a surprise to many, and caused speculation as to what could be so much better than "RISC" that it could do what "RISC" couldn't.
The design itself came from designers at HP who estimated in 1992 that complexity would prohibit more than 4-way issue PA-RISC designs. Also, HP had just bought Cydrome, which had experience in designing VLIW systems, and had engineers from VLIW producer Multiflow. The decision was made that the PA-RISC would be replaced with a VLIW initially called SP-PA (Super Parallel Precision Architecture) or PA-WW (Precision Architecture-Wide Word). Intel, which had started fabricating PA-RISC CPUs for HP, was approached as a development partner to share the cost and increase its popularity.
Intel called the strategy EPIC, or Explicitly Parallel Instruction-set Computing, presenting it as a successor to both RISC and VLIW architectures by using variable length instruction groups and non-parallel semantics (allowing instructions within a group to execute either sequentially or in parallel, as opposed to only in parallel) to overcome the disadvantages of VLIW. However this simple label fell far short of describing the real intent of the new processor, or the variety of techniques and mechanisms pulled together to implement it - the goal of the IA-64 (originally known to the world as "Merced", actually a code name for the first implementation officially named "Itanium") is to reduce interruptions and latencies during execution to allow a general purpose processor to operate as smoothly as a DSP, and then add DSP-like support features (as well as almost every other "good idea" that has been examined since the establishment of "RISC", with the exception of multithreading), many of which inhibit the execution flexibility which competing designs have grown to rely on (see Appendix A). The result is a sort of behemoth many people have been skeptical about.
IA-64 features 128 65-bit (64-bit data, 1-bit NaT described below) integer registers (GR0-GR127, GR0 hardwired to be 0) and 128 64-bit floating point registers (FR0-FR127, FR0 set to 0.0, FR1 set to 1.0). There is also a separate instruction pointer register, and eight branch registers (BR0-BR7) containing branch destination addresses (though not part of the architecture, this could allow pre-loading of branch targets - see the Hitachi SH5). Integer registers are arranged as a stack as in the AMD 29K (GR0-GR31 correspond to 29K's global registers, GR32-GR127 to the stack), requiring a separate register cache stack and a regular execution stack (note of irony: AMD abandoned the 29K to concentrate on its 80x86 clones, while Intel is replacing the 80x86 with an architecture similar to the 29K). While the 29K uses a stack pointer register (registers are selected relative to the stack base pointer), IA-64 renames registers implicitly (GR32 is still referred to as GR32, though it may map to any of the 96 stack registers), and registers are spilled and filled automatically in the IA-64 (during call or return instructions - the "register frame" is specified by an alloc instruction).
Compatibility is retained with the 80x86 (designated IA-32) by mapping registers GR8 to GR31 to the IA-32 register set, floating point registers FR8 to FR31 to IA-32 FPU and SSE registers, and other system registers, and directly executing IA-32 instructions using this subset of the processor. PA-RISC is similar enough to IA-64 that instructions will simply be recompiled, likely using technology from HP's Dynamo project (described in the Transmeta section). 80x86 instructions are expected to be decoded in hardware.
The main cause of latencies is non-uniform memory access for data and instructions (branches in particular). Like the ARM and TMS 320C6x, all IA-64 instructions are predicated, using sixty-four 1-bit predicate registers (PR0 to PR63, PR0 set to 1) rather than the single condition code like ARM, or a subset of general registers like the 320C6x. Predicate registers can be set in pairs (complements such as true/false) by comparison operations (either replacing or "accumulating" predicates by predicating the compare instruction), or explicitly (transfer to/from a 64-bit general register). These are meant to allow two paths of a branch to be executed simultaneously, and the correct result/state selected at the end (by using predicates on the final instructions), to avoid interrupting the instruction stream.
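The idea of replacing a branch with a compare that sets a complementary predicate pair can be sketched as follows. This is a toy model, not IA-64 code: the predicate numbers and the function name are chosen for illustration, and the comments only loosely imitate assembly syntax.

```python
def select_max(a, b):
    """Return the larger of a and b using predication instead of a branch."""
    pr = [1] + [0] * 63          # PR0 is always 1, the rest start clear

    # Compare sets a complementary pair: pr[1] = (a > b), pr[2] = not (a > b).
    pr[1], pr[2] = int(a > b), int(not a > b)

    result = None
    # Both "paths" are present in the instruction stream; each operation
    # takes effect only when its predicate is 1.
    if pr[1]:                    # (p1) result = a
        result = a
    if pr[2]:                    # (p2) result = b
        result = b
    return result
```

Exactly one of the two predicated moves takes effect, so there is no branch to mispredict and the instruction stream is never interrupted.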
Unlike a DSP like the TMS320C6x, which uses a similar strategy, memory operations may cause an exception (write protected, swapped out, etc.) while executing one path of a branch, even though that path is discarded (if executed sequentially, the interrupt would not have occurred). It's also desirable to move loads to earlier addresses to overcome latency, but they may be valid only on one path of a branch, so the load must occur after the branch begins. IA-64 provides speculative loads which do not generate an exception, but instead set an error flag (NaT bit for integer, NaTVal (special zero-type value) for floating point - these values propagate, so additional integer, logical, floating point and compare operations produce a NaT, NaTVal, or false result), and adds check instructions for NaT and NaTVal, branching to an exception handler if set - other instructions raise an exception trying to use a NaT or NaTVal value.
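A rough model of the deferred-exception behaviour: a speculative load returns a poison token instead of faulting, arithmetic passes the token along, and a later check decides whether recovery is needed. The names below (NAT, spec_load, spec_add, check) are invented for this sketch, and memory is modelled as a plain dictionary where a missing key stands in for a faulting address.

```python
NAT = object()   # stands in for a register whose NaT bit is set

def spec_load(memory, addr):
    """Speculative load: a bad address yields NaT rather than an exception."""
    return memory.get(addr, NAT)

def spec_add(a, b):
    """Integer add that propagates NaT from either operand."""
    return NAT if a is NAT or b is NAT else a + b

def check(value, recover):
    """Check instruction: run recovery code only if the value is NaT."""
    return recover() if value is NAT else value

memory = {0x10: 5}

# The load and the add are hoisted above the branch; only the check
# stays at the point where the value is really needed.
good = check(spec_add(spec_load(memory, 0x10), 1), recover=lambda: -1)
bad = check(spec_add(spec_load(memory, 0x20), 1), recover=lambda: -1)
```

If the path that needed the value is never taken, the check is never executed and the failed load costs nothing.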
IA-64 also includes "advanced load" instructions which load a value (non-speculative) and keep the address in a buffer (along with support instructions for the buffer). Any store to the same address removes the buffered address, indicating that the load conflicted with a store and must be re-done - this is essentially a lot of hardware dedicated largely to overcoming a weakness in the C language for functions with "aliased parameters" (see note on C in entry for PDP-11).
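The buffer of advanced-load addresses can be pictured as a small table keyed by destination register. The class below is a simplified sketch with made-up names; it ignores access sizes, partial overlaps, and the limited capacity a real table would have.

```python
class AdvancedLoadTable:
    """Toy model of the address buffer used by advanced loads."""

    def __init__(self, memory):
        self.memory = memory
        self.entries = {}            # destination register -> load address

    def advanced_load(self, reg, addr):
        """Load early and remember which address the register came from."""
        self.entries[reg] = addr
        return self.memory[addr]

    def store(self, addr, value):
        """Any store to a tracked address removes the matching entries."""
        self.memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check_load(self, reg, addr, value):
        """Keep the early value if the entry survived, otherwise reload."""
        if self.entries.get(reg) == addr:
            return value
        return self.memory[addr]

table = AdvancedLoadTable({0: 1, 4: 2})
early = table.advanced_load("r5", 0)
table.store(4, 9)                           # different address: no conflict
no_alias = table.check_load("r5", 0, early)

early = table.advanced_load("r5", 0)
table.store(0, 7)                           # same address: the pointers alias
aliased = table.check_load("r5", 0, early)
```

The compiler can therefore hoist a load above a store through a pointer it cannot analyse, and pay for a reload only in the rare case where the two really do alias.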
Load instructions can also include cache hints to indicate the likelihood the data will be used again soon.
Branches can be program relative (+/-16MB) or use a branch register (computed branches transferred to/from general register). Like the PowerPC, IA-64 includes a separate loop count (LC) register, but adds software controlled register renaming allowing a block of stack registers (GR32-GR127, in blocks of eight), as well as predicate (PR16-PR63) and floating point (FR32-FR127) registers to rotate upwards (value in GR32 will appear in GR33 after one iteration). In addition to the LC, an epilog count (EC) register is added - after the LC register reaches zero, the EC is used until it reaches zero. While the LC is used, the lowest predicate register PR16 is set to 1, while the EC is used, PR16 is set to 0 (in a while-type loop, when the LC isn't used, the EC can still be used, PR16 is set to 0 at all times (EC is still used by the loop's branch instruction) - the program must set the appropriate predicate values). This allows a loop to include instructions rearranged by the compiler, with predicates progressively activating and deactivating them during the beginning and end iterations.
This is meant to replace loop unrolling, where a block of instructions within a loop is repeated to reduce branch penalties (and programming hacks like Duff's Device).
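How rotating registers, LC, and EC combine into a software-pipelined loop is easier to see in a small simulation. The sketch below is a simplified model under several assumptions: a three-stage loop (load, double, store), three rotating predicates standing in for PR16-PR18, three rotating registers standing in for GR32-GR34, and a loop branch that stops once EC has counted down to 1. It is not a cycle-accurate description of the real loop branch.

```python
def pipelined_double(src):
    """Double every element of src using a 3-stage software pipeline."""
    if not src:
        return []
    stages = 3
    lc, ec = len(src) - 1, stages    # loop count and epilog count
    pr = [1, 0, 0]                   # models PR16..PR18, PR16 primed to 1
    r = [None, None, None]           # models rotating GR32..GR34
    out, i = [], 0

    while True:
        # All three stages sit in one loop body, each under its own predicate.
        if pr[2]:                    # stage 3: store the doubled value
            out.append(r[2])
        if pr[1]:                    # stage 2: double the value loaded earlier
            r[1] = r[1] * 2
        if pr[0]:                    # stage 1: load the next element
            r[0] = src[i]
            i += 1

        # Loop branch: LC drives the kernel, EC drains the pipeline.
        if lc > 0:
            lc -= 1
            new_pred = 1             # start another iteration
        elif ec > 1:
            ec -= 1
            new_pred = 0             # epilog: no new iterations start
        else:
            break

        # Rotate upwards: what was in slot 0 now appears in slot 1.
        pr = [new_pred] + pr[:-1]
        r = [None] + r[:-1]
    return out

doubled = pipelined_double([3, 4, 5])
```

The loop body is written once, yet during the first iterations only the load is active, and during the last only the store, which is the effect unrolling would otherwise need separate prologue and epilogue code to achieve.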
Finally, IA-64 supports the parallel subword operations used in 80x86 MMX and SSE, and PA-RISC MAX multimedia extensions (including saturation arithmetic). They follow the Intel model of using floating point registers rather than integer registers as PA-RISC does.
Although apparently complete (some would say "overcomplete"), one glaring exception (surprising many) is the lack of simple multiply operations on integer registers (used routinely in common multi-dimensional array indexes). One possible explanation is to keep all integer register operations to single cycles, while multiply operations are multiple cycles, but it may reduce duplicated circuitry too (see the CDC 6600). As it is, there need to be frequent transfers of registers between integer and floating point registers.
The most promoted idea of the IA-64 before the architecture was revealed was variable length VLIW (like that in the TMS320C6x and Sun MAJC) - 41-bit instructions would be bundled into 128-bit bundles, with 5 template bits to indicate independent instructions. In fact, the template bits encode a set of twenty-four allowable combinations of instruction types (integer, memory (load/store), floating point, branch) and groupings - eight combinations are unspecified. For example, floating point instructions must always follow any load/store instructions, and precede any integer, which must also precede any branch instruction. This provides a partial decoding as well as grouping independent instructions.
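The bundle arithmetic works out exactly: three 41-bit instructions plus 5 template bits make 128 bits. A minimal packing sketch follows; placing the template in the lowest bits with the slots above it is an assumption of this example, and the helper names are made up.

```python
SLOT_BITS, TEMPLATE_BITS, SLOTS = 41, 5, 3
assert SLOTS * SLOT_BITS + TEMPLATE_BITS == 128

def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instructions into 128 bits."""
    if not 0 <= template < (1 << TEMPLATE_BITS) or len(slots) != SLOTS:
        raise ValueError("need a 5-bit template and exactly three slots")
    word = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        word |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return word

def unpack_bundle(word):
    """Split a 128-bit bundle back into its template and three slots."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [(word >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask
             for i in range(SLOTS)]
    return template, slots

bundle = pack_bundle(0x10, [0x123, (1 << 41) - 1, 0])
```

Because 5 bits give thirty-two possible template values and only twenty-four are defined, a decoder has to reject the remaining eight.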
The IA-64 adds a large amount of hardware support for language features, though at a much lower level than designs such as the VAX or Intel i432 which tried to map language statements directly to machine instructions. Some would describe this support as anti-RISC, while others would describe it as a RISC approach to language support (provide simple components which work together, rather than complex instructions). Some people think the static prediction that a compiler can produce will not match the dynamic scheduling of modern CPUs, but this may be solved by dynamic recompiling (as in HP's Dynamo Project, or Transmeta's "Code Morphing" optimizing software).
In either case the strong support from Intel for this architecture produced as much expectation for its future success at introduction as the PowerPC had when it was promoted as a replacement for the Intel architecture by IBM, Motorola, and Apple. Delays and lower than expected clock speeds for the Intel-designed Itanium (expected mid-2001, 800MHz using a ten-stage pipeline) quickly reduced these expectations, and poor benchmark results also hurt.
The first HP version (late 2002, code named McKinley) was called Itanium 2, and had the performance expected, though competitors (IBM POWER 4, Fujitsu SPARC64 V) and even Intel (Pentium 4) had advanced to match its performance.
Although it had hardware for executing IA-32 programs, performance was ridiculously slow, so the capability was rarely even mentioned. In the face of competition from the AMD Opteron, Intel announced a software emulation system (likely derived from the FX!32 software developed by DEC to run x86 programs on the Alpha, included in the technology Intel purchased from Compaq) which is much faster than the hardware. The software system allowed the Itanium 2 to run IA-32 programs at about half the speed of the Pentium 4 available at the same time.
Part IV: Sun MAJC - Levels of parallelism (late 1999)

To support the use of Java, Sun planned to produce Java-specific processors which could directly execute the compiled bytecodes, rather than using a virtual machine. Three products were announced - picoJava, a processor core which could be embedded in other designs, microJava, a stand-alone version of the picoJava core, and ultraJava, a high-end high-speed Java processor.
Interest in Java processors did not materialise - language specific processors have traditionally been poorly received except in specific applications, and techniques to translate Java bytecodes to native CPU instructions meant conventional CPUs could execute Java as fast or faster than Java-specific processors. After the introduction of the picoJava and microJava, the UltraJava was apparently cancelled - the design program instead mutated into the MAJC design, though Java still had a strong influence in the design (MAJC stands for Microprocessor Architecture for Java Computing).
Simultaneously, Sun had been among the first to add multimedia instructions to CPUs (VIS extensions to SPARC), but using an expensive superscalar processor to do repetitive (and independent) digital signal processing wastes the non-multimedia majority of the CPU. The creation of a multimedia coprocessor became the other goal for the retargeted MAJC design.
A MAJC CPU consists of up to four general purpose units (justified by the empirical observation that there are seldom more than four instructions in a typical program which can be executed in parallel - the lucky number four appears often in the MAJC architecture), all except the first (which is a subset of the others) are identical and capable of the same integer/DSP/multimedia/floating point operations. Each unit can access 128 64-bit registers, divided between those local to each unit and those shared globally by a delimiter register - writes above the delimiter are copied to all local register sets, writes from other units to registers below the delimiter are ignored (this allows four simple register sets (three read ports) to be used instead of one complex set (twelve read ports) - similar in idea to the TMS 320C6x split CPU design).
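The split between local and global registers can be sketched as one register file per unit, with writes to the global range broadcast to every copy. This is only an illustration: the class name, the default delimiter value, and the choice to treat the delimiter register itself as global are all assumptions of the sketch.

```python
class SplitRegisterFile:
    """Toy model of per-unit register files divided by a delimiter."""

    def __init__(self, units=4, nregs=128, delimiter=96):
        self.delimiter = delimiter
        self.files = [[0] * nregs for _ in range(units)]

    def write(self, unit, reg, value):
        if reg >= self.delimiter:
            # Global register: the write is copied to every unit's file.
            for regfile in self.files:
                regfile[reg] = value
        else:
            # Local register: only the writing unit's own copy changes.
            self.files[unit][reg] = value

    def read(self, unit, reg):
        # Every read is served from the unit's own small register file.
        return self.files[unit][reg]

regs = SplitRegisterFile()
regs.write(unit=1, reg=100, value=42)    # global: visible everywhere
regs.write(unit=1, reg=10, value=7)      # local: private to unit 1
```

Since each unit only ever reads its own copy, each copy needs read ports for one unit rather than for all four.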
Local registers allow individual units to execute speculatively, but without the need for rename registers because locally stored results are never visible to other units. A small number of instructions are predicated (using any general register) - only those used to select one of several speculative results (conditional move, store, etc, as well as a pick conditional move, which selects one of two register values based on the predicate in a third). MAJC also supports speculative loads, using a scoreboard to track the destination register, load address, and whether it completed, failed, or is still in progress. When checked a failed load will be re-executed transparently, when not checked a failed load returns a zero (unchecked failed loads can be used for validating NULL pointers). This is a simpler version of the IA-64 advanced load instructions (loads and stores are allowed to complete out of order in MAJC).
Like the TMS 320C6x and Intel/HP IA-64, MAJC uses variable length instruction groups - between one and four. Like the 320C6x, the instruction word encodes which functional units will receive each instruction, but MAJC specifies them implicitly (in order, first to unit 0, next to unit 1, and so on). Four bits from the first instruction in the group specify the packet size (rather than using one bit in each instruction to indicate dependencies) - unit 0 is a subset of the other three units, with an eight bit opcode instead of eleven.
Saturation arithmetic and integer, fixed, and floating point parallel subword (or SIMD) operations can be executed by each functional unit using any registers. Like the original MIPS or PA-RISC processors, hardware interlocks to prevent registers from being used before a result is written are not specified except when there is an unpredictable delay (only loads will stall the processor, using the load register scoreboard) - the compiler is expected to schedule instructions to avoid conflicts (binary compatibility is not a requirement between MAJC processors, since binary translators are expected to allow compatibility, as they do with Java bytecode, or the Transmeta Code Morphing processors).
In addition to instruction level parallelism, MAJC supports vertical multithreading, where registers for up to four threads can be switched with little overhead (using a non-speculative type of register renaming) - a concept pioneered in the Tera supercomputer (supporting 128 threads without cache) and IBM Northstar POWER CPU (two threads), and expected in the Alpha EV7. When a cache miss occurs and there are no more independent instructions, execution can switch over to another thread (which may have been waiting for a cache miss load which has finished).
The MAJC is also intended to include multiple processors on a single chip (a feature of the POWER4 and planned for the Alpha EV7 CPUs) to encourage automatic "speculative" parallelisation of normally sequential blocks (such as procedures or loops), by creating a separate memory image for the new thread, then merging the changes back when both threads are finished. The technique was pioneered in the Myrias supercomputer, and adapted for general multiprocessor computers as a product called PAMS (Parallel Applications Management System) - however PAMS requires compiler directives in C, C++, and Fortran programs, while Sun designed a system based on the better behaved features of the Java language (no pointers, pass by value only, automatic memory management) to discover parallelism without programmer intervention.
Overall MAJC appears to be a very flexible design on many levels, in contrast to the emphasis on very low-level features of the others of its generation (IA-64, TMS 320C6X, TriMedia). It's not intended to be a general purpose CPU, although it appears to be flexible enough that in the future, it could move in that direction for some applications.
Part V: Transmeta Crusoe - Leaving hardware (January 2000)

In the early 1990s, Apple decided that the Motorola 680x0 series was not keeping up with the Intel 80x86 series, largely because PCs were Intel's primary market, while Motorola CPUs were used more in embedded systems. RISC designs were simpler and could be improved with less effort, so Apple switched to the PowerPC CPU in 1994 (after prototypes in 1991 using the 88K), but to maintain compatibility, needed to emulate the 680x0. The initial emulator interpreted 68LC040 (without FPU) code, and a later version stored translated blocks of code, and ran faster than Apple's previous high end Macintoshes.
This impressed IBM engineers enough that a project was started to emulate the 80386+ architecture on a PowerPC (known as the PowerPC 615), but the project was cancelled (apparently after successful versions were completed - possibly because of performance, problems with efficiency using the PowerPC architecture (the 80x86 is much more awkward and complicated than the 680x0), marketing decisions, or strategic/management decisions - I don't know, but the computer industry was very volatile at the time, and the path of the future was not at all clear). However development on the concept continued with the DAISY project (Dynamically Architected Instruction Set from Yorktown), which translated to a hypothetical VLIW CPU instead of the PowerPC. Both the DAISY system, and a later project called Dynamo from Hewlett-Packard (which ran PA-RISC on PA-RISC), could optimise code as it ran (Dynamo could improve PA-RISC performance by up to 20% over non-emulated code).
Several engineers (many from Sun, such as David Ditzel, designer for Sun's UltraSparc CPU and the AT&T CRISP, and Bob Cmelik who wrote instruction profiling tools for SPARC programs) helped found Transmeta, which created the missing VLIW processor, and created a new dynamic translator (called a "Code Morpher" by Transmeta) to emulate the 80x86. Two Crusoe CPUs were introduced - TM3200 (changed because of trademark conflicts to TM3210) and TM5400 - with dynamic translators for both to run 80x86 code (though not exclusively - one early demo showed the "Quake" video game being played, and while most was compiled as 80x86 code, part of the inner rendering loop was in Java, so the CPU switched to an emulated Java CPU for every iteration of the loop with no visible loss of speed).
The initial physical Crusoe CPU architecture closely resembled the Sun MAJC. It includes sixty-four 32-bit integer registers and sixty-four 80-bit floating point registers. The Efficeon added four predicate registers (like the IA-64). Crusoe VLIW words are either 64-bit (two instructions) or 128-bit (four instructions, one of each type) dispatched to two integer units, one floating point, one load/store, and one branch unit. Efficeon extended this to 256 bits (eight instructions) dispatched to two load/store units, two integer ALUs, one FPU/SIMD, one SIMD, one branch unit, and one address alias checker.
This is less flexible than variable-length instruction groups, and hinders compatibility (a common VLIW problem), but apart from the translator, software is not intended to ever run directly on the CPU, so compatibility is not considered (the TM3200 and TM5400 are not binary compatible). Like MAJC, Crusoe CPUs have an instruction to select the correct result, after both have been produced speculatively (using parallel instructions).
Low power support (called "Long Run" by Transmeta) can reduce both the clock speed and the voltage used.
Emulated registers are "checkpointed" between blocks of optimised code, so that exceptions (which would otherwise occur in a different order than original, untranslated code) cause the processor state to be returned to the beginning of the block, and interpreted in order (one at a time) until the exception is encountered again at the proper instruction (similar to the superscalar 88110 hardware history buffer). Memory stores are buffered, and only written to memory at the end of a block (when the next checkpoint is saved).
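The checkpoint and buffered-store scheme amounts to keeping a shadow copy of the emulated registers and holding stores back until the block completes. The class below is a rough sketch with invented names; real hardware would do this with shadow registers and a gated store buffer rather than dictionaries.

```python
class CheckpointedState:
    """Toy model of checkpointed registers with buffered stores."""

    def __init__(self):
        self.regs = {}           # working copy of the emulated registers
        self.memory = {}         # memory as seen by the rest of the system
        self._shadow = {}        # register values at the last checkpoint
        self._pending = []       # stores buffered since the last checkpoint

    def store(self, addr, value):
        """Stores are buffered, not written to memory immediately."""
        self._pending.append((addr, value))

    def commit(self):
        """End of block: release the stores and take a new checkpoint."""
        for addr, value in self._pending:
            self.memory[addr] = value
        self._pending.clear()
        self._shadow = dict(self.regs)

    def rollback(self):
        """Exception inside a block: restore registers, drop the stores."""
        self.regs = dict(self._shadow)
        self._pending.clear()

state = CheckpointedState()
state.regs["eax"] = 1
state.commit()

state.regs["eax"] = 2            # a translated block runs partway...
state.store(0x10, 99)
state.rollback()                 # ...then hits an exception
eax_after_rollback = state.regs["eax"]
memory_after_rollback = dict(state.memory)

state.regs["eax"] = 3            # the block is redone and completes
state.store(0x10, 7)
state.commit()
```

After a rollback nothing the block did is visible, so the original instructions can be interpreted one at a time from a state that is known to be consistent.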
Loads are protected using a scoreboard system like MAJC, except that stores raise an exception, rather than automatically reloading from the address. This allows multiple loads from a single address to be moved into the exception handler, out of the main program block - after the first load, intervening stores may or may not alter the loaded data, so an alternate store instruction is used which raises an exception if the address is the same as the load. The extra loads are skipped if they are not needed (eliminating memory delays, rather than just reducing them as cache does).
Translated original code is write-protected, so that any modification is detected, and the translated code is purged or modified.
Like DAISY and Dynamo, the Transmeta Code Morpher profiles code as it executes (inserting profiling instructions in translated code - particularly branch profiling, eliminating the complex branch prediction circuitry in many CPUs), and will stop and optimise heavily executed blocks (one engineer reported a very simple 80386 benchmark almost disappeared as the optimiser recognized that the code did no actual work, and eliminated most of it).
In addition, the CPUs can add support for new Intel instructions such as SSE extensions with only software changes, leaving the underlying CPU unchanged.
Vertical multithreading was used in the CDC 6600 peripheral controller to compensate for I/O latencies. The XInC (code named "Hammerhead") uses the idea in a microcontroller to allow multiple threads in a real-time environment - every clock cycle executes a different thread (a variation called a "barrel processor"), so every thread takes a known amount of time regardless of what else is executing. It's meant to eliminate the need for a real-time operating system (RTOS).
The CPU supports eight threads, each with a set of eight 16-bit general purpose registers, one program counter (PC), and one condition code register. It has eight pipeline stages for all instructions - once an instruction starts, instructions from the other seven threads must be dispatched before the next in the original thread can be executed. This makes it appear to the program that each instruction executes in one cycle, so there are no pipeline stalls between instructions (this also simplifies circuitry because data dependency doesn't need to be checked). Some functions (multiply, bit operations) are implemented as on-chip peripherals, like the TI MSP430, and it has hardware synchronization (semaphores) between threads.
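The timing guarantee follows directly from having as many threads as pipeline stages. A short sketch, assuming strict round-robin issue and numbering cycles from zero (the function names are made up for the example):

```python
THREADS = 8
STAGES = 8

def issue_cycle(thread, k):
    """Cycle in which instruction number k of a thread enters the pipeline."""
    return thread + THREADS * k

def finish_cycle(thread, k):
    """Cycle in which that instruction leaves the last pipeline stage."""
    return issue_cycle(thread, k) + STAGES - 1

def running_thread(cycle):
    """Thread whose instruction is issued in a given cycle."""
    return cycle % THREADS

# An instruction always finishes before the same thread issues again,
# so no dependency checking between a thread's instructions is needed.
no_overlap = all(
    finish_cycle(t, k) < issue_cycle(t, k + 1)
    for t in range(THREADS)
    for k in range(100)
)
```

Each thread gets exactly one issue slot every eight cycles no matter what the other threads do, which is what makes its execution time predictable.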
Threads can monitor peripherals, so interrupts aren't necessary. The simplicity allows a 16-bit XInC to be little more complex than an 8-bit Intel 8051.