Ken Shirriff has an interesting article on reverse engineering the original ARM1 processor (as designed by ARM, and implemented by VLSI). He goes right to the silicon to form a transistor-level model/emulator of the chip. Back in 1986 when the ARM was designed and released, it wasn’t very well known, being used in very few devices. Surprisingly, this continued for over a decade, with the architecture confined to niche markets (the Apple Newton, the DEC StrongARM on RAID cards, etc). It wasn’t until the 2000’s that this processor startup from England became the powerhouse it is today. Two major developments drove this: mobile and multimedia. The ARM architecture was powerful, small, and easy on the power budget. This was an obvious benefit for mobile, but it also proved very useful for multimedia processing, such as controllers in DVD players, digital picture frames, MP3 players and the like. Today, hundreds of companies license and use the architecture, and it is found in devices now numbering in the billions.
In 1994 Intel had a bit of an issue. The newly released Pentium processor, replacement for the now 5-year-old i486, had a problem: it couldn’t properly compute floating-point division in some cases. The FDIV instruction on the Pentium used a lookup table (a Programmable Logic Array) to speed calculation. This PLA had 1066 entries, which were mostly correct; due to a programming error, however, 5 of the 1066 never got written to the PLA, so any calculation that hit one of those 5 cells returned an erroneous result. A fairly significant error, but not at all uncommon; bugs in processors are fairly common. They are found, documented as errata, and, if serious enough and practical, fixed in the next silicon revision.
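The bug is easy to illustrate with the test case that circulated widely at the time, 4195835 / 3145727. A quick sketch below computes the correct quotient; the flawed-Pentium figure is the commonly reported one, reproduced as a constant (not computed), since the bug only manifests on the affected silicon:

```python
# The widely circulated test case for the Pentium FDIV bug.
# A correct FPU returns ~1.333820449136241; affected Pentiums
# reportedly returned ~1.333739068902037.
numerator = 4195835.0
denominator = 3145727.0

correct = numerator / denominator        # any non-flawed FPU
reported_flawed = 1.333739068902037      # documented flawed-Pentium result

relative_error = abs(correct - reported_flawed) / correct
print(f"correct:        {correct:.15f}")
print(f"flawed (rep.):  {reported_flawed:.15f}")
print(f"relative error: {relative_error:.2e}")
```

A relative error of roughly 6e-5 may sound small, but for a double-precision divide, where errors should sit near 1e-16, it is enormous.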
What made the FDIV bug infamous was that, in 21st-century terms, it went viral. The media, who really had little understanding of such things, caught wind of it and reported it as if it were the end of computing. Intel was forced to enact a lifetime replacement program for affected chips. Now the FDIV bug is the stuff of computer history, a lesson in bad PR more than bad silicon.
Current Intel processors also suffer from bad math, though in this case it’s the FSIN (and FCOS) instructions. These instructions calculate the sine (and cosine) of floating-point numbers. The big problem here is that Intel’s documentation says the instruction is nearly perfect over a VERY wide range of inputs. It turns out, according to extensive research by Bruce Dawson of Google, to be very inaccurate, and not just for a limited set of inputs.
Interestingly, the root cause is another lookup table, in this case the hard-coded value of pi, which Intel, for whatever reason, limited to just 66 bits, a value much too inaccurate for an 80-bit FPU.
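You can get a feel for the size of the problem without an x87 FPU at hand by redoing the range reduction with a 66-bit pi in exact rational arithmetic. This is only a sketch: it assumes sin(x) ≈ pi − x for x very near pi (true to dozens of digits), and mimics, rather than reproduces, the FPU’s internal reduction:

```python
from fractions import Fraction
import math

# pi to 50 decimal places (well-known constant)
PI = Fraction("3.14159265358979323846264338327950288419716939937510")

# pi rounded to 66 significant bits (2 integer bits + 64 fraction bits),
# standing in for the hard-coded constant FSIN uses for range reduction
PI_66 = Fraction(round(PI * 2**64), 2**64)

x = Fraction(math.pi)    # the double closest to pi, represented exactly

# For x this close to pi, sin(x) = sin(pi - x) ~= pi - x, so the
# accuracy of the result is entirely the accuracy of the "pi" used.
true_sin = PI - x        # reduction against (nearly) exact pi
fsin_like = PI_66 - x    # reduction against the 66-bit pi

rel_err = abs(float((fsin_like - true_sin) / true_sin))
print(f"sin(double nearest pi) ~= {float(true_sin):.17e}")
print(f"66-bit-pi reduction    ~= {float(fsin_like):.17e}")
print(f"relative error: {rel_err:.2e}")
```

The relative error lands around 1e-5, many orders of magnitude worse than the roughly 1-ulp (~1e-16) accuracy Intel’s documentation implied.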
Ken Shirriff has an excellent write-up about the Zilog Z80 and why its pin-out, specifically the data lines, is a bit convoluted. Rather than being in order (D0-D7), the original Z80’s data pins run D4, D3, D5, D6, D2, D7, D0, D1. It’s functional, but it’s not pretty and can lead to some interesting PCB layout issues. Ken uses data/imaging from the Visual6502 project to look at the on-die reasons for this. Essentially it came down to saving die space: there literally was not enough room to route the data connections within the confines of the die size. Keeping the die size small allowed Zilog (and its many second sources) to keep prices down. In the early days Zilog contracted Mostek to make many of their processors, so die size and the associated cost were a big issue.
In 1983 Stephen Colley, Dave Jurasek, John Palmer and 3 others from Intel’s Systems Group left Intel, frustrated by Intel’s seeming reluctance to enter the then emerging parallel computing market. They founded a company in Beaverton, Oregon known as nCube, with the goal of producing MIMD (Multiple Instruction, Multiple Data) parallel computers. In 1985 they released their first computer, known as the nCube/10. The nCube/10 was built using a custom 32-bit CMOS processor containing 160,000 transistors and running initially at 8MHz (later increased to 10). IEEE 754 64-bit floating-point support (including hardware sqrt) was included on chip. Each processor was on a module with its own 128KB of ECC DRAM memory (implemented as six 64K x 4-bit DRAMs). A full system, with 1024 processor nodes, had 128MB of usable memory (160MB of DRAM counting that used for ECC). From the outset the nCube systems were designed for reliability, with MTBFs of full systems running in the 6-month range, extremely good at the time.
The nCube/10 system was organized in a hypercube geometry, with the 10 signifying its ability to scale to a 10-way hypercube, also known as a dekeract. This architecture allows any processor to be a maximum of 10 hops from any other processor. The benefit is greatly reduced latency in cross-processor communication. The downside is that expansion is restricted to powers of 2 (64, 128, 256, 512, etc), making upgrades a bit expensive as the size scaled up. Each processor contained 22 DMA channels, with a pair reserved for I/O to the host processor and the remaining 20 (10 in + 10 out) used for interprocessor communication. This focus on a general-purpose CPU with built-in networking support is very similar to the Inmos Transputer, which at the time was making similar inroads in the European market. System management was run by similar nCube processors on graphics, disk, and I/O cards. Programming was via Fortran 77 and later C/C++. At the time it was one of the fastest computers on the planet, even challenging the almighty Cray. And it was about to get faster.
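The hop-count property falls straight out of how a hypercube is addressed: nodes get n-bit addresses, and two nodes are directly linked when their addresses differ in exactly one bit, so the distance between any two nodes is just the number of differing bits. A small sketch:

```python
def hops(a: int, b: int) -> int:
    """Minimum hops between two hypercube nodes: count of differing address bits."""
    return bin(a ^ b).count("1")

# In a 10-cube (1024 nodes, 10-bit addresses), no two nodes are more
# than 10 hops apart, and every node has exactly 10 direct neighbors.
dim = 10
worst = max(hops(0, n) for n in range(2**dim))
neighbors = [n for n in range(2**dim) if hops(0, n) == 1]
print(worst)            # 10 (node 0 to node 0b1111111111)
print(len(neighbors))   # 10
```

This also shows why expansion comes in powers of 2: adding one address bit doubles the node count and adds one link per node.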
Part 2 of my abbreviated biography of Chuck H. Moore’s processor designs. Part 1 covered the early days of Novix, and the RTX2000.
Moore was not content to create just one processor design, or one company. In the 1980’s he also ran Computer Cowboys, a consulting/design company. In 1985 he designed the Sh-Boom processor with Russell H. Fish III. This was a 32-bit stack processor, though with 16 general-purpose registers, that was again designed with Forth in mind. It was capable of running much faster than the rest of the system, so Moore devised a way to run the processor faster than the rest of the board and still keep things in sync, innovative at the time and now standard practice. The Sh-Boom was not a particularly wide success and was later licensed by Patriot Scientific through a company called Nanotronics, to which Fish had transferred his rights to the Sh-Boom in 1991. Patriot rebranded and reworked the Sh-Boom as the PSC1000 and targeted it at the Java market. Java byte code could be translated to run in similar fashion to Forth on the PSC1000, and at 100MHz, it was quick. In the early 2000’s Patriot again rebranded the Sh-Boom, calling the design IGNITE. Patriot no longer makes or sells processors, concentrating only on Intellectual Property (patent licensing).
After designing the Sh-Boom and the Novix series, Moore developed yet another processor in 1990, called the MuP21. This was the beginning of what would be a common thread in Moore’s designs: MISC (Minimal Instruction Set Computer), essentially an even simpler take on RISC. Minimal instruction sets, multiprocessor/multicore layouts, and efficiency have become the hallmarks of his work. The MuP21 was a 21-bit processor with only 24 instructions. At 20MHz performance was 80 MIPS, as it could fetch four 5-bit instructions in a 20-bit word. It was manufactured in a 40-pin DIP on a 1.2 micron process with 7,000 transistors.
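The 80-MIPS-at-20MHz figure follows directly from the packing: each 20-bit fetch carries four 5-bit opcodes. A sketch of unpacking such a word (the slot ordering here, most-significant slot first, is an illustrative assumption, not a verified MuP21 detail):

```python
def unpack_slots(word: int) -> list:
    """Split a 20-bit instruction word into four 5-bit opcodes.

    Slot order (most-significant slot first) is assumed for
    illustration, not taken from MuP21 documentation.
    """
    assert 0 <= word < 2**20
    return [(word >> shift) & 0x1F for shift in (15, 10, 5, 0)]

# Example: a word holding opcodes 3, 17, 0, 31 in successive slots
word = (3 << 15) | (17 << 10) | (0 << 5) | 31
print(unpack_slots(word))   # [3, 17, 0, 31]

# One fetch supplies 4 instructions, so at 20 MHz:
print(20_000_000 * 4)       # 80000000 -- the quoted 80 MIPS
```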
In 1993 Moore designed the F21, again a 21-bit CPU based on the MuP21, designed to run Forth, and including 27 instructions. It was fab’d by MOSIS on a 0.8u process. The F21 microprocessor contains a stack-machine CPU (with a pair of stacks like the NC4000), a video I/O coprocessor, an analog I/O coprocessor, a serial network I/O coprocessor, a parallel port, a real-time clock, some on-chip ROM and an external memory interface. Performance was 500 MIPS (this was an asynchronous design, so ‘clock speed’ is a bit of a misnomer) and the transistor count had risen to about 15,000. The F21 was made through 1998, though the design continued to evolve. A version of the F21 called the i21 was developed, originally for Chuck Moore’s iTV Corporation, one of the very first set-top Internet appliance companies. It integrated additional features such as an infrared remote interface, a modem DMA interface and a keyboard DMA interface. The F21 scaled well, and was tiny (remember, only 15,000 transistors, which at 0.18u takes up a VERY small die), allowing performance to hit 2400 MIPS @ 1.8V. One could put a very large number of these on a single die…
There are many greats of the CPU industry. Some, such as Federico Faggin (designer of the 4004, who worked on the 8008 and then founded Zilog), are fairly well known. Others include Gelsinger and Meyer (of x86 fame), and perhaps even Gordon Moore, for whom a ‘law’ is named. Chuck Peddle and Bill Mensch designed the ubiquitous 6502 processor, but there were more, many more: engineers whose names have been oft forgotten, but whose work has not. The 1970’s and 80’s were the fast and the furious era of processor design. Some designs were developed, sold, or canceled in weeks or months; years were not a period of time available to these designers, for in a year a new technology would dictate a new design.
One of these designers is Charles H. Moore (aka Chuck Moore). Chuck is perhaps best known for inventing the FORTH programming language in 1968, originally to control telescopes. It is a stack-based language, and lent itself well to small microcomputers and microcontrollers; some microcontrollers even embedded a FORTH kernel in ROM. It was also designed to be easily ported to different architectures. FORTH continues to be used today for a variety of applications. However, Chuck did not just invent a programming language.
2012 marked the 30th anniversary of the introduction of the Intel 80186 and 80188 microprocessors. These were the first, and arguably only, x86 processors designed from the beginning as embedded processors. They included many on-chip peripherals, such as DMA channels, timers and other features previously handled by external chips. Initially released at 6MHz, clock for clock many instructions were faster than on the 8086 it was based on, due to hardware improvements.
In 1987 Intel moved the 186 to a CMOS process and added more enhancements, including math co-processor support, power-down modes and a DRAM refresh controller. Speeds increased to 25MHz (from the 10MHz max of the NMOS version). Through the years Intel continued to develop new versions of the 186 with added features, lower voltages, and different packages. It was not until 2007 that Intel finally stopped production of the 186 series. It continued to be made by others under license, including AMD, who made versions running up to 50MHz; Fujitsu and Siemens also produced the 186 series. Like the 8051, the 186 gained significant support, being embedded in millions of devices. The instruction set is familiar, and debugging and development systems were (and are) plentiful, so the 186 core continues to be in wide use.
As IC complexity and transistor counts increased, the need for a processor core that could be embedded not just into a system, but into a custom ASIC or SoC, became apparent. IC’s were being designed to handle things like DVD playback, set-top boxes, flat-panel control and more. These applications still required some sort of processor, but having a separate IC for it was not economical.
VAutomation (founded in 1994) designed Verilog and VHDL synthesizable cores (meaning they could be ‘dropped’ into an IC design or FPGA). In November 1996 VAutomation licensed the 8086/8, 80186/8 and their CMOS versions from Intel. This gave them the ability to design their own compatible models of these processors without fear of litigation. More importantly, it allowed them to sub-license these designs to others. In 1997 VAutomation demo’d their first 186, the V186 core. This was an Intel 80186 compatible core that could be synthesized into a customer’s design. It was ‘technology independent’, meaning it was not restricted to a certain process or even technology; it could be used in CMOS, ECL, 0.35u, 1 micron, whatever the client needed. On a 0.35u CMOS process it was capable of speeds in excess of 60MHz, and did so with less than 28,000 gates. One of the first licensees was Pixelworks, which made controllers for monitors. Typical licensing was a $25,000 fee up front plus royalties on a per-device basis, usually split into high-volume (over 500,000 units) and low-volume tiers. Typical price per chip was $0.25-$2.00, cheaper than the $15 Intel was charging for a discrete 80C186.
The introduction of the iPhone 5 was also the introduction of Apple’s first truly original application processor design. The original iPhone, the 3G and the 3GS all featured designs by Samsung. The iPhone 4 introduced the A4, which was closely based on the Hummingbird Cortex-A8 core developed with Samsung and Intrinsity, again not a truly Apple design. The iPhone 4S introduced the A5 (and the A5X used in the third-generation iPad). The A5 is based on the ARM Cortex-A9 MPCore, a standard ARM design, albeit with many added features; architecturally, the processor is not original, just customized.
ARM provides core designs for use by developers, such as the Cortex-A9, A8, etc. These are complete processor designs that you can drop into your system design as a block; add your own functions, such as a graphics system, audio processing, image handling, radio control, etc, and you have your processor. This is the way many processor vendors go about things. They do not have to spend the time and effort to design a processor core, just pick one that meets their needs (power budget, speed, die area) and add any peripherals. Many of these peripherals are also licensed as Intellectual Property (IP) blocks, making building a processor in some ways similar to construction with Legos. This is not to say that this is easy, or the wrong way to go about things; it is in fact the only way to get a design to market in a matter of weeks, rather than years. It allows for a wide product portfolio that can meet many customers’ needs. The blocks are often offered for a specific process, so not only can you purchase a license to a Cortex-A9 MPCore, you can purchase one that is hardware-ready for a TSMC 32nm high-k metal gate process, or a 28nm GlobalFoundries process. This greatly reduces the amount of work needed to make a design work with a chosen process. This is what ARM calls the Processor Foundry Program.
A Brief History
Long before the mess of Apple vs. Samsung (and seemingly everyone else), there was another famous company, with a patent in hand, that it seemed everyone was violating. The issue of Intellectual Property (IP), and its associated patents, has long been an issue in the technology business, and certainly in the business of CPU’s. There are many, many functions inside a CPU: different structures for handling instructions, memory access, cache algorithms, branch prediction, etc. All of these are unique intellectual property. It doesn’t matter if you implement them with a slightly different transistor structure; as long as the end product is relatively the same, there is the risk of violating a patent. Patents are tricky things, and litigating them can be very risky. You must balance the desire to keep competition from violating your IP against the risk that your patent is declared invalid. This is why most cases end in an out-of-court settlement, usually via arbitration. Actual patent jury trials are fairly rare, as they are very expensive and very risky to all parties involved.
In the early days (the 1970’s and early 1980’s) there was routine and widespread cross-licensing in the industry. Many companies didn’t have the fab capacity to reliably meet demand (IBM wouldn’t purchase a device unless it was made by at least 2 companies, for this very reason), so they would contract with other manufacturers to make their designs. Having other companies manufacture your design, or compatible parts, also increased the market share of your architecture (8086, 68k, etc). For years AMD made and licensed most everything Intel made; AMD also licensed various peripheral chips to Intel (notably the 9511/2 FPU). As the market grew larger and the competition increased, Intel (and others) began to have enough reliable fab capacity to safely single-source devices. Meanwhile other companies continued to make compatible products, based on previous licensing; AMD notably made x86 CPU’s that ate into Intel’s market share. In the 1970’s Intel had cross-license agreements with AMD, IBM, National, Texas Instruments, Mostek, Siemens, NEC and many others.
The talent at Visual6502.org continues to impress. After imaging and building a complete simulator for the MOS 6502, they did the same for the Motorola 6800 (on which the 6502 was based).
We have sent Visual6502.org several chips, and they have now imaged the RCA 1802 that we sent. What is very interesting is how few markings are on the die; the only one I could see was the number ‘10824.’ This particular chip was dated early 1981, though the 1802 COSMAC was designed in 1976 and was one of the first CMOS microprocessors. The 1802 had around 5,000 transistors (Visual6502 will let us know exactly how many once they are done, and of course what each and every one of them does). For higher-res shots and more info see here.