Section Five: Born Beyond Scalar
Intel 80860, the 80960 was actually an overall better processor. It's sometimes considered a successor of the i432 (also called "RISC applied to the 432"), and does have similar hardware support for context switching. This path came about indirectly through the 960 MC designed for the BiiN machine, a joint Intel/Siemens venture, which was still very complex (it included many i432 object-oriented ideas, including a tagged memory system). The M-series ("Military") design predated the released i960 (which removed tag bits and complex instruction microcode), but was released later.
The i960 was retargeted at the high end embedded market (it included multiprocessor and debugging support, and strong interrupt/fault handling, but lacked MMU support), while the 860 was intended to be a general purpose processor (the name 80860 echoing the popular 8086). It replaced the AMD 29K series as "the world's most popular embedded RISC" until 1996.
Although the first implementation was not superscalar, the 960 was designed to allow dispatching of instructions to multiple (undefined, but generally including at least one integer) execution units, which could include internal registers (such as the four 80 bit registers in the floating point unit (32, 64, and 80 bit IEEE operations)) - the 960 CA version (1989) was superscalar. There are sixteen 32 bit global registers which can be shared by all execution units and sixteen register "caches" - similar to the SPARC register windows, but not overlapping (originally four banks). It's a load/store Harvard architecture (32-bit flat addressing), but has some complex microcoded instructions (such as CALL/RET). There are also thirty-two 32 bit special function registers.
It's a very clean embedded architecture, not designed for high level applications, but very effective and scalable - something that can't be said for all Intel's processor designs. However, as part of a patent dispute with DEC, Intel obtained DEC's StrongARM design, and generally dropped the i960 from further development (to the resentment of the developers).
Apollo, Sun, and HP each had about 20% of the workstation market when Apollo (after a short try at a joint project with Intel) produced the DN3000 and four-CPU DN10000 workstations (DN meant "DOMAIN Node" - they ran DOMAIN/OS) which introduced the PRISM CPU (codenamed A88K - "A" for "Advanced" post-68K - sometimes confused with the Motorola M88K). It was the first real VLIW-type (two-instruction) microprocessor and the fastest available workstation until the IBM RS/6000 series (the Intel i860 was a faster CPU due to its high clock speed, but was too difficult to program so rarely matched PRISM's speed in practice). They were multi-chip CPUs, not microprocessors, but affected some microprocessor designs that followed.
PRISM had thirty two 32-bit integer and thirty two floating point registers. It had one integer/load/store unit, one floating point multiplier and one floating point add/subtract unit. Like the Intel i860 (Intel's results from the initial joint project), PRISM could dispatch a single integer, or one integer and one floating point instruction per cycle, but in PRISM this was indicated by a bit in the integer instruction (similar to the TMS320C6x DSP) rather than a separate mode. It was one of the first to include a multiply with add/subtract/truncate in a single (five operand) instruction, so it was often described as a three-issue CPU (the integer instruction usually loading floating point registers).
The second version (double the clock speed) was delayed by financial problems with the chip supplier (named BIT). HP bought Apollo in 1989, and by 1991 dropped most Apollo technology (the promised PRISM II and DOMAIN/OS, upsetting and angering users), although the floating point changes to PA-RISC from 1.0 to 1.1 (sixteen to thirty-two registers, multiply with add/subtract) were inspired by PRISM, and the Apollo engineering centre took over PA-RISC workstation development. Apollo became an HP workstation brand name for a while.
The 860 had several modes, from regular scalar mode to a superscalar mode that executes two instructions per cycle and a user visible pipeline mode (instructions using the result register of a multi-cycle op would take the current value instead of stalling and waiting for the result). It could use the 8K data cache in a limited way as a small vector register (like those in supercomputers). The unusual cache uses virtual addresses, instead of physical, so the cache has to be flushed any time the page tables change, even if the data is unchanged. Instruction and data busses were separate, with 4 G of memory, using segments. It also included a Memory Management Unit for virtual storage.
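The user visible pipeline mode can be pictured with a toy model. This is only a sketch of the behaviour described above, not of the real i860 pipeline: the three-cycle latency, the class name and the register names are all assumptions made for illustration.

```python
from collections import deque


class ExposedPipeline:
    """Toy model: a multiply takes `latency` cycles and nothing ever stalls."""

    def __init__(self, registers, latency=3):
        self.regs = dict(registers)
        self.latency = latency      # assumed latency, not an i860 figure
        self.cycle = 0
        self.in_flight = deque()    # (cycle the result is written, dest, value)

    def issue_multiply(self, dest, src_a, src_b):
        # Operands are read now; the destination is written `latency` ticks later.
        value = self.regs[src_a] * self.regs[src_b]
        self.in_flight.append((self.cycle + self.latency, dest, value))

    def tick(self):
        # Advance one cycle and retire any results that are now due.
        self.cycle += 1
        while self.in_flight and self.in_flight[0][0] <= self.cycle:
            _, dest, value = self.in_flight.popleft()
            self.regs[dest] = value

    def read(self, reg):
        # No interlock: a read sees whatever is in the register right now.
        return self.regs[reg]
```

Reading the destination one tick after issuing the multiply returns the old contents; only after the full latency has passed does the product appear. Keeping track of that by hand is what made this mode hard for compilers to use.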
The 860 had thirty two 32 bit registers and thirty two 32 bit (or sixteen 64 bit) floating point registers. It was one of the first microprocessors to contain not only an FPU and an integer ALU, but also a 3-D graphics unit (attached to the FPU) that supported line drawing, Gouraud shading, Z-buffering for hidden line removal, and other operations in conjunction with the FPU. It was also the first microprocessor able to do an integer operation, and a (unique at the time) multiply and add floating point instruction, for the equivalent of three instructions, at the same time (an FPU instruction bit indicated the current and next integer/floating-point pairs can execute in parallel, similar to the Apollo DN10000 PRISM CPU/FPU (also 1988) which used an integer bit which affected only the current integer/floating-point pair).
However, actually getting the chip to run at top speed usually required using assembly language - using standard compilers gave it a speed closer to other processors. Because of this, it was used as a coprocessor, either for graphics or for floating point acceleration, like add-in parallel units for workstations. Another problem with using the Intel 860 as a general purpose CPU is the difficulty of handling interrupts. It is extensively pipelined, having as many as four pipes operating at once, and when an interrupt occurs, the pipes can spill and lose data unless complex code is used to clean up. Delays range from 62 cycles (best case) to 50 microseconds (almost 2000 cycles).
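The two interrupt delay figures above are given in different units, so a quick conversion helps compare them. The 40 MHz clock used here is an assumption, chosen only because it makes 50 microseconds come out at 2000 cycles; the function names are illustrative.

```python
def microseconds_to_cycles(microseconds, clock_mhz):
    # One MHz is one cycle per microsecond.
    return microseconds * clock_mhz


def cycles_to_microseconds(cycles, clock_mhz):
    return cycles / clock_mhz
```

At the assumed clock, the 62 cycle best case is about 1.55 microseconds, so the worst case is roughly thirty times slower.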
PC/RT based on the ROMP processor), it decided to produce a new innovative CPU, based partly on the 801 project that pioneered RISC theory. RISC initially stood for Reduced Instruction Set Computer, but IBM defined it as Reduced Instruction Set Cycles, and implemented a relatively complex processor (POWER - Performance Optimization With Enhanced RISC) with more high level instructions than even many memory-data processors.
The first POWER CPU (POWER1) was implemented using three ICs for the processor - branch, integer and floating point units - plus two or four cache chips, and defined the basic architecture. The branch unit was unusually complex, and contained the program counter, as well as a condition code (CC) register and a loop register. The CC register has eight field sets, the first two reserved for fixed and floating point operations, the seventh (later) for vector operations, and the rest which could be set separately, and combined or checked several instructions later. The loop register is a counter for 'decrement and branch on zero' loops with no branch penalty (similar to certain DSPs like the TMS320C30). POWER1 was also one of the first superscalar CPUs of its generation: the branch unit could dispatch multiple instructions to the two functional unit input queues while itself executing a program control operation (up to four operations at once, even out of order). Speculative branches were supported using a prediction bit in the branch instructions (results discarded before being saved if not taken, the alternate instruction was buffered and discarded if the branch was taken), and the branch unit manages subroutine calls without branch penalties, as well as hardware interrupts. Results are forwarded to instructions in the pipeline which use them before they are written to the registers.
Thirty two 32-bit registers were defined for the POWER1 integer unit, which also included certain string operations, as well as all load/store operations. In addition, it included a special MQ register for extended precision multiply/divides, similar to the MIPS HI/LO registers. Like many other load-store CPUs, register R0 is treated as constant 0 for some instructions, but it is used like a normal register most of the time. The POWER/PowerPC architecture supports memory/data-style 'update' operations, incrementing/decrementing the used address register before a load/store.
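The 'update' form of load mentioned above can be sketched as follows. This is a minimal model in the spirit of the PowerPC `lwzu` instruction; the dictionary-based register file and memory, and the function name, are assumptions for illustration.

```python
def load_with_update(regs, memory, dest, base, displacement):
    """Load from base + displacement, then leave that address in the base register."""
    effective_address = regs[base] + displacement
    regs[dest] = memory[effective_address]
    regs[base] = effective_address  # the 'update' step


# Walking an array: start the base register one element before the data.
memory = {100: 11, 104: 22, 108: 33}
regs = {"r3": 0, "r4": 96}
values = []
for _ in range(3):
    load_with_update(regs, memory, "r3", "r4", 4)
    values.append(regs["r3"])
```

Each call both fetches the next element and advances the pointer, so a loop over an array needs no separate add instruction.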
The floating point unit had thirty two 64 bit registers, performing only double precision operations, and including a DSP-like multiply-accumulate operation. Floating point exceptions are imprecise - in fact, don't produce exceptions at all, but set a condition bit on an error. The bit must be tested by software to determine if an error occurred.
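The "set a bit, test it later" style of error handling can be sketched like this. It is an illustration only: the class and function names are made up, the bit is assumed to stay set until software clears it, and the NaN returned on error is a placeholder rather than what POWER1 produced.

```python
class FloatStatus:
    def __init__(self):
        self.error = False  # assumed sticky: set on error, cleared only by software


def checked_divide(status, a, b):
    if b == 0.0:
        status.error = True      # no trap is taken, only the bit is set
        return float("nan")      # placeholder result
    return a / b


status = FloatStatus()
results = [checked_divide(status, x, y) for x, y in [(1.0, 2.0), (1.0, 0.0), (4.0, 2.0)]]
failed = status.error  # software tests the bit once, after the whole batch
```

The program learns that something went wrong, but not which operation caused it, which is the price of never interrupting the pipeline.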
IBM, Motorola, and Apple formed a coalition (around 1992) to produce a microprocessor version of the POWER design as a successor to both the Motorola 68000 and Intel 80x86, resulting in the PowerPC. The architectural differences began with the elimination of the MQ register, since it would add complexity to possible superscalar versions. This was replaced with separate instructions to calculate the upper and lower parts of a multiplication, which would (with two integer units) execute simultaneously anyway. Division was handled similarly using general registers. In addition, the more complex string operations and three-source instructions were removed, and finally, 32 bit floating point support was added. Dropped POWER instructions were to be emulated in the PowerPC CPUs.
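Splitting a multiplication into separate "low half" and "high half" instructions can be sketched as below. The functions are named after the PowerPC `mullw` and `mulhw` instructions and model 32-bit signed operands; the helper `to_signed32` is an assumption added to make the example self-contained.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mullw(a, b):
    # Low 32 bits of the 64-bit signed product.
    return (to_signed32(a) * to_signed32(b)) & MASK32


def mulhw(a, b):
    # High 32 bits of the 64-bit signed product.
    return ((to_signed32(a) * to_signed32(b)) >> 32) & MASK32
```

Neither function depends on the other's result, which is why two integer units can compute both halves at the same time with no shared MQ register in between.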
The first PowerPC 601 (1993) was a bridge (considered first generation or G1), and included both POWER and PowerPC features, based strongly on the POWER1, except it had a single 32K cache rather than separate I/D caches. It adopted the Motorola 88000 bus as the standard PowerPC bus. The 603 (1993?, first second generation G2) separated the main functional units further, removing load/store operations from the integer unit (four functional units total - integer, floating point, load/store (using integer registers), branch), and splitting the branch unit into a fetch/branch unit, a dispatch unit, and a completion/exception unit. The 603 also added a rename buffer in the dispatch unit for speculative execution using renamed integer and floating point registers, which are ordered properly by the completion/exception unit, or discarded for mispredicted branches and exceptions. Separate 8K and 16K I/D cache versions were available.
The PowerPC 604 (mid 1995) added dynamic branch prediction using a branch history table, and added two simplified integer units - three integer, two for single-cycle operations, one for multicycle operations such as multiply/divide, plus floating point, load/store and branch, total of six. Four instructions could be dispatched at once. The CC register could also be renamed. The PowerPC 620 expanded the 604 design to 64 bits (but with a 'backside' L2 cache bus), and added new 64 bit instructions, but was delivered much later and slower than promised, and was further delayed when it was withdrawn for a redesign. The 32 bit PowerPC 750 (G3, early 1998) refined the design and performance, adding a P620-style backside cache bus, but made no other significant changes (notably though, they used a 603-based 32-bit FPU, rather than the 64-bit 604 FPU).
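Dynamic prediction with a branch history table can be sketched as follows. The text above does not give the 604's table organisation, so the table size, the two-bit saturating counters and the indexing by word address are assumptions representing a common textbook scheme.

```python
class BranchHistoryTable:
    """Two-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, entries=512):
        self.entries = entries
        self.counters = [1] * entries  # start at 'weakly not taken'

    def _index(self, pc):
        # Instructions are 4 bytes, so drop the low two address bits.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

With two bits per entry, a loop branch that is taken many times and then falls through once is still predicted taken the next time the loop is entered, which a single history bit would get wrong.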
Workstation versions continued with the POWER2 (1993), a high bandwidth design with two floating point load/store units, 256K of data cache, and added 128-bit floating point support and a square root instruction. Initially a multichip design, it was later combined into one chip (P2SC), and then into an eight CPU "SuperChip". It could issue up to six instructions and four simultaneous loads or stores. It was superseded by the POWER3 (early 1998), with eight functional units (two FPU, three integer (two single cycle, one multicycle), two load/store, and branch unit), but capable of operating at much higher clock speeds. In addition, a 64 bit version, the PowerPC A35 (Apache), was designed for the AS/400 E series which added decimal arithmetic and string instructions, also used in the RS/6000 S70 workstation (called the PowerPC RS64-I).
The A50/RS64-II (Northstar (1998)/Pulsar (1999, faster clock version)) added support for parallel execution, including the idea of vertical multithreading (implemented earlier by the CDC-6600 peripheral processors, and more recently by Tera in their MTA supercomputers in 1998?, which took the idea to extremes, supporting 128 threads per CPU and giving up cache entirely). The CPU state registers (integer and floating point, program counter, condition codes, etc.) are duplicated, allowing execution to be switched to a second thread in three cycles when a load misses the primary cache and causes a delay - the second thread can continue while the load for the first thread completes. The CPU is also designed to minimize branch delays by using a short, simple pipeline (five stages, in-order four-way issue to five units - simple integer, complex (multiply/divide) integer, load/store, branch, and floating point unit), and uses branch pre-fetching (also in PowerPC 750 and newer, identifies branch using LR or CTR registers when fetched and loads target instruction into the processor cache, in addition to branch prediction and target caching).
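The benefit of switching threads on a cache miss can be estimated with a small cycle-counting model. The three-cycle switch cost comes from the description above; the 20-cycle miss latency, the one-cycle cost of every other operation and the `"alu"`/`"miss"` operation names are assumptions for illustration.

```python
def run_switch_on_miss(threads, miss_latency=20, switch_cost=3):
    """Return the cycle count needed to finish every thread.

    Each thread is a list of "alu" (one cycle) or "miss" (one cycle to issue,
    then the thread cannot continue until miss_latency cycles have passed).
    """
    n = len(threads)
    next_op = [0] * n
    ready_at = [0] * n  # cycle at which each thread's pending load completes
    cycle, current = 0, 0

    def unfinished(i):
        return next_op[i] < len(threads[i])

    while any(unfinished(i) for i in range(n)):
        runnable = [i for i in range(n) if unfinished(i) and ready_at[i] <= cycle]
        if not runnable:
            # Every remaining thread is waiting on memory: idle until one returns.
            cycle = min(ready_at[i] for i in range(n) if unfinished(i))
            continue
        if current not in runnable:
            current = runnable[0]
            cycle += switch_cost
        op = threads[current][next_op[current]]
        next_op[current] += 1
        cycle += 1
        if op == "miss":
            ready_at[current] = cycle + miss_latency
    return max([cycle] + ready_at)
```

With the defaults, one thread of `["alu", "miss", "alu"]` takes 23 cycles, while two such threads sharing the CPU finish in 31 cycles rather than the 46 needed to run them back to back, because much of each miss is hidden behind the other thread's work.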
The POWER4 (late 2001) used in the RS/6000 (renamed "something-Series", where "something" is p, x, n, i, r, z, or something else - I can never remember), increased the number of pipeline stages to fifteen for integer and branch operations, seventeen for load/store, and more for floating point operations, allowing a high clock rate (1.1 GHz at introduction). It included two processor cores on a single chip which share level 2 cache - for high-bandwidth computing, performance is improved by disabling one CPU. The chip also includes four 16-byte interprocessor interconnects like those for the DEC Alpha EV7.
Like various 80x86 CPUs and the Motorola 68060, the POWER4 translates some instructions into two ("cracked") or more ("millicoded") simpler "internal instructions" or "Individual OPerations" (IOPs) - a much simpler process due to the simpler instruction set. Internal instructions are limited to two read and one write from registers. To simplify the tracking of these instructions, in-order groups of five are formed and renaming and order information are stored for the group, rather than individual instructions. Each group can contain four non-branch instructions (cracked IOPs must be in the same group, millicoded IOPs always start a new group but can continue to the next) - branches are folded into the remaining group slot. Instructions that can't execute out of order form a group of one.
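Group formation can be sketched in a few lines. This is a simplified model: it ignores cracked and millicoded instructions and instructions that must issue alone, and it assumes that a branch fills the fifth slot and closes the group. The function name and the instruction labels are illustrative.

```python
def form_groups(instructions, max_non_branch=4):
    """Pack an in-order instruction stream into groups of up to four
    non-branch instructions plus an optional trailing branch."""
    groups, current = [], []
    for kind in instructions:
        if kind == "branch":
            current.append(kind)   # branch takes the remaining slot...
            groups.append(current)  # ...and (by assumption) ends the group
            current = []
        else:
            if len(current) == max_non_branch:
                groups.append(current)
                current = []
            current.append(kind)
    if current:
        groups.append(current)
    return groups
```

Because renaming and ordering information is kept per group rather than per instruction, the bookkeeping shrinks by up to a factor of five, at the cost of partly empty groups when branches are frequent.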
One group can be formed (and added to four issue queues) and retired per cycle, but one IOP can be read per cycle from issue queues by eight execution units - two FPUs, two load/store units, two integer units, a branch unit and a condition code register unit (used by the branch unit).
Branch prediction uses a conventional branch history table, plus a second based on execution, and a third table which indicates which table's result has been most accurate (two instruction bits can be used to let software override the decision). Because of the complexity, multithreading support was dropped.
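The role of the third table can be sketched as a selector that learns which of two predictors to trust for each branch. The two-bit counters, the table size and the rule of updating only when the two predictors disagree are assumptions taken from common "tournament" predictor designs, not details given above.

```python
class PredictorSelector:
    """Per-entry counter: 0-1 trust predictor A, 2-3 trust predictor B."""

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [1] * entries  # start by weakly trusting A

    def choose(self, pc, prediction_a, prediction_b):
        trust_b = self.counters[pc % self.entries] >= 2
        return prediction_b if trust_b else prediction_a

    def update(self, pc, prediction_a, prediction_b, taken):
        if prediction_a == prediction_b:
            return  # both right or both wrong: nothing to learn
        i = pc % self.entries
        if prediction_b == taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

The selector never predicts a branch itself; it only arbitrates, so each branch ends up served by whichever kind of history suits it better.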
The IBM PowerPC 970 (expected early 2003) is a simplified reduced-cost version of the POWER4 for desktop (mainly Apple Macintosh) systems. It features a single CPU core, and a more conventional CPU bus, plus a higher clock speed (possible because desktop systems don't need the reliability of servers the POWER4 was designed for), and adds the "AltiVec" instruction set (described below). It replaced the "G5" which Motorola was designing, but disputes between Motorola and Apple over the bus design (Apple wanted a faster, more expensive bus) led Motorola to postpone release of the almost-completed product with the slower bus until there was demand for it in the embedded market, while Apple looked to IBM for a desktop CPU.
The POWER5 (expected 2004) adds vertical multithreading to the POWER4 execution pipeline, by assembling alternate instruction bundles from different threads. Each thread can be assigned a relative priority controlling instruction rate (values between 0 and 8, where T1 + T2 = 8 - priority 0 completely turns off one thread). It also includes larger buffers and various circuit refinements to reduce cost and power consumption, and increase reliability with error checking like that in IBM mainframes and the Fujitsu SPARC64 V.
IBM and Motorola have designed simplified embedded versions, such as the IBM 40x series, and Motorola's 8xx versions, though complexity limits how small the designs can be - for the lower end, Motorola designed the ARM-like MCore low cost/power RISC CPU, while IBM simply licensed the ARM itself.
In direct response to Intel's MMX instructions, AltiVec extensions were introduced with fourth generation (G4, September 1999) PowerPC CPUs from Motorola (IBM initially declined to support the extensions, until agreeing to become a second source of AltiVec CPUs for Apple Macintoshes). Unlike multimedia extensions which use integer (HP PA-RISC MAX) or floating point registers (Sun VIS, Intel MMX), AltiVec adds an entire new set of 128-bit registers (enough for a vector of four 32-bit floating point numbers) and a separate vector execution unit and instruction set (four operand - three source, one result), supported by the complex PowerPC branch unit. That means that operating system software needs to be modified to preserve additional CPU state information (like the MIPS MDMX which adds a 192-bit accumulator to hold intermediate results, but uses 64-bit floating point registers for data), but it allows multimedia instructions to be executed in parallel with both integer and floating point operations, and to reduce the number of registers to save, an additional register (VRSAVE) is added to track which vector registers are being used - unused registers don't need to be stored. In addition to subword vector operations, AltiVec also includes permutation operations along the same lines as PA-RISC MAX instructions, and subword floating point operations like MIPS MDMX which can also perform vector multiplication allowing 3-D graphics support (see Appendix D) like the Hitachi SH4.
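The way VRSAVE cuts context switch cost can be sketched with a bit mask. This is an illustration only: the mapping of bit n to vector register n, and the function names, are assumptions, and real code would be written in the operating system's context switch path rather than in Python.

```python
NUM_VECTOR_REGISTERS = 32


def mark_used(vrsave, register):
    """Record that a vector register holds live data."""
    return vrsave | (1 << register)


def registers_to_save(vrsave):
    return [r for r in range(NUM_VECTOR_REGISTERS) if vrsave & (1 << r)]


def save_context(vrsave, vector_registers):
    # Only registers flagged in the mask are copied out.
    return {r: vector_registers[r] for r in registers_to_save(vrsave)}
```

A program that touches two vector registers costs the operating system two 128-bit stores per switch instead of thirty-two, and a program that never uses the vector unit (mask of zero) costs nothing extra.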
AltiVec and the embedded versions were apparently part of the reason Nintendo decided to switch from MIPS processors to a custom designed IBM variant of the PowerPC as the CPU for its next generation game console, code named "Dolphin".
It's interesting to note that the AltiVec data formats are based primarily on Java standards (based on IEEE), then on IEEE, and lastly on ANSI C9X floating point standards. A "Java mode" provides strict adherence to these standards, a "Non-Java mode" relaxes adherence to allow faster operations (if implemented).
A very high clock rate (500MHz) BiCMOS version called the 704 (based on a simplified 604) was being developed in 1996 by Exponential Technologies, expanding on the type of technology which Intel found necessary to keep its Pentium and Pentium Pro CPUs competitive, but advances in CMOS and a slower initial product (410MHz) sharply reduced the clock speed advantages, cancelling the project (faster, lower power, fully CMOS Pentium and Pentium Pro CPUs have replaced earlier BiCMOS versions). IBM went so far as to produce a 1GHz integer-only demonstration version of a CMOS PowerPC, and used the PowerPC as the first product to replace aluminum conductors with lower resistance copper, boosting clock speeds by about 33%.
Overall, the POWER/PowerPC architecture is a very powerful, almost mainframe-like architecture which could easily have fit into the "Weird and Innovative" section, violating the traditional RISC philosophy of simplicity and fewer instructions (with over a hundred, including many duplicates which implicitly set CC bits and others which don't, versus only about 34 for the ARM and 52 for the Motorola 88000 (including FPU instructions)). The complexity is very effective, but has somewhat limited the clock speed of the designs (but less so than the even more complex Intel Pentium and Pentium II designs). It's an interesting tradeoff, considering that a highly parallel 71.5 MHz POWER2 managed to be faster than a 200MHz DEC Alpha EV4 of the same generation (though Alpha remained the fastest CPU at any given time until the POWER4).
Alpha is a 64 bit architecture (32 bit instructions) that didn't initially support 8- or 16-bit operations, but allowed conversions, so no functionality is lost (Most processors of this generation are similar, but have instructions with implicit conversions). Alpha 32-bit operations differ from 64 bit only in overflow detection. Alpha does not provide a divide instruction due to difficulty in pipelining it. It's very much like the MIPS R2000, including use of general registers to hold condition codes. However, Alpha has an interlocked pipeline, so no special multiply/divide registers are needed, and Alpha is meant to avoid the significant growth in complexity which the R2000 family experienced as it evolved into the R8000 and R10000.
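Doing byte accesses on a machine that only loads and stores whole 64-bit words can be sketched as below, in the spirit of Alpha's load-quadword-then-extract sequences (such as `LDQ_U` followed by `EXTBL`). Memory is modelled as a list of 64-bit integers in little-endian byte order; that model and the function names are assumptions for illustration.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def load_byte(memory_quads, address):
    quad = memory_quads[address >> 3]   # aligned 64-bit load
    shift = (address & 7) * 8           # position of the byte within the quadword
    return (quad >> shift) & 0xFF       # extract


def store_byte(memory_quads, address, value):
    index, shift = address >> 3, (address & 7) * 8
    quad = memory_quads[index]                      # read the whole quadword
    quad &= ~(0xFF << shift)                        # mask out the old byte
    quad |= (value & 0xFF) << shift                 # insert the new byte
    memory_quads[index] = quad & MASK64             # write the whole quadword back
```

A byte load becomes two steps and a byte store becomes a read-modify-write of four, which is the sense in which no functionality is lost even though no byte instructions exist.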
One of Alpha's roles is to replace DEC's two prior architectures - the MIPS-based workstations and VAX minicomputers (Alpha evolved from a VAX replacement project codenamed PRISM - the internal "EV" name comes from "Extended VAX". Not to be confused with the Apollo Prism acquired by Hewlett Packard). To do this, the chip provides both IEEE and VAX 32 and 64 bit floating point operations, and features Privileged Architecture Library (PAL) calls, a set of programmable (non-interruptible) macros written in the Alpha instruction set, similar to the programmable microcode of the Western Digital MCP-1600 or the AMD Am2910 CPUs, to simplify conversion from other instruction sets (VAX running VMS and 80x86 running Microsoft Windows NT) using a binary translator, as well as providing flexible support for a variety of operating systems.
Alpha was also designed for the future, for a 1000-fold eventual increase in performance (10 X by clock rate, 10 X by superscalar execution, and 10 X by multiprocessing). Because of this, superscalar instructions may be reordered, and trap conditions are imprecise (like in the 88010). Special instructions (memory and trap barriers) are available to synchronise both occurrences when needed (different from the POWER use of a trap condition bit which is explicitly tested by software, but similar in effect. SPARC also has a specification for similar barrier instructions). And there are no branch delay slots like in the R2000, since they produce scheduling problems in superscalar execution, and compatibility problems with extended pipelines. Instead speculative execution (branch instructions include hint bits) and a branch cache are used.
The EV4 (21064) was introduced with one integer, one floating point, and one load/store unit. The EV5 (21164, Early 1995) began expanding instruction parallelism by adding one integer/load/store unit with byte vector (multimedia-type) instructions (replacing the load/store unit) and one floating point unit, and increased clock speed from 200 MHz to 300 MHz (still roughly twice that of competing CPUs), and introduced the idea of a level 2 cache on chip (8K each inst/data level 1, 96K combined level 2).
The EV6 (21264, mid 1998) expanded this to four integer units (two add/logic/shift/branch (one also with multiply, one with multimedia) and two add/logic/load/store), two different floating point units (one for add/div/square root and one for multiply), with the ability to load four, dispatch six, and retire eight instructions per cycle (and for the first time including 40 integer and 40 floating point rename registers and out of order execution), at up to 500MHz. Integer registers are duplicated, with two independent pipelines, to simplify register access, similarly to the Sun MAJC functional units, except that instructions are grouped by the dispatching logic into two streams. Values transferred between register banks (instruction streams) add a delay.
Multimedia extensions introduced with the EV6 are simple, but include VIS-type motion estimation (MPEG).
The EV7 (21364, expected 2003) was to begin the multiprocessor strategy by adding five high speed interconnects (four CPU (10 GB/s), similar in concept to the Transputer CPUs, and one I/O (3 GB/s)) to an enhanced EV6 core. The EV8 (21464) was to add an eight- or ten-issue CPU core, and support for multithreading like the IBM Northstar POWER CPU.
In 1998, DEC was purchased by Compaq. The transition was blamed by some for the EV6 being unable to exceed 833MHz clock speeds, while CPUs from Intel, AMD, and SiByte (MIPS) gained attention by exceeding 1GHz (Alpha performance remained near the top of the competition, but it had previously also had the highest clock speeds). Apparently a design flaw was to blame, but rather than redesigning the existing chip, the resources were spent developing the EV7 and EV8 instead (though rumour has it Compaq partners Samsung and IBM were able to produce 1GHz Alphas in 1999, but were forbidden by Compaq because they did not have a support chip set able to handle that speed). In 2001, Compaq cancelled completion of the EV8, deciding to adopt the IA-64 instead, and sold all Alpha intellectual property (from circuits to compilers, and even the Alpha design team) to Intel, shortly before the announcement of a controversial merger with Hewlett-Packard (nasty accusations claimed it was pressure from HP, others claimed Intel was embarrassed by both AMD's Athlon performance (due partly to DEC Alpha engineers who moved to AMD when Compaq bought DEC) and the Intel-designed Itanium's poor performance when compared to almost all competitors, and especially to the Itanium 2 processor designed by HP). Including the eventual completion of the EV7, the Alpha's life is effectively little more than 10 years.
Alpha Processor Inc. renamed itself API Networks, and switched to chip interconnect products based on an AMD-initiated HyperTransport standard.
DEC's Alpha was in many ways the antithesis of IBM's POWER design, which gains performance from complexity, at the expense of a large transistor count, while the Alpha concentrated on the original RISC idea of simplicity and a higher clock rate - though that also has its drawback, in terms of very high power consumption.