Archive for the 'CPU of the Day' Category

September 30th, 2018 ~ by admin

Peavey and the Motorola DSP56000

Motorola XSP56001ZL20 – 20.5MHz 1990

In 1985 Motorola was looking to create a DSP (Digital Signal Processor) line of processors to go with their very popular 68000 series of general purpose processors.  DSP’s are similar to a normal processor but, as their name implies, are designed to work on signals, versus data stored in memory.  Typical signal data is audio, video, RF (such as RADAR information) and anything else that comes in via an ADC.  These signals are processed via algorithm such as FFTs (Fast Fourier Transforms) to manipulate, change or analyse them.  In audio, this can be used for cleaning up an audio stream, adding effects to it, or even generating audio.

In the 1980’s the main single chip DSP competitors was the still in use TI TMS320 series. the ATT/WE DSP16 series, and some DSP’s from OKI/NEC.  When Motorola began work on what would become the DSP56000 they asked one of their long time customers, Peavey, what they would like to see in a DSP. Peavey is an audio equipment manufacturer, making such things as guitar amps and keyboards, so would have a good idea of what would be useful in a DSP designed for audio signals.

These were packaged in a ‘SLAM’ package. The contacts/traces were easily damaged by leaking batteries.

The DSP5600 is a 24-bit processor made on a 1.5u HCMOS process with around 150,000 transistors.  24-bits were selected as that was ideal for audio sampling at the time (and most ADS/DACs at the time max’d out at 20-bits of resolution anyways.  These DSP’s had a 3-stage pipeline and ran at 20.5MHz, 27MHz and 33MHz.  This provided around 10.25 MIPS of performance (at 20.5MHz).  They were a fixed point (no floating point support in hardware) design, which was adequate at the time.  A total of 62-instructions were provided.

The DSP56001 is identical to the DSP56000 except that it has 512×24-bits of on-chip program
RAM instead of 3.75K of program ROM and a 32×24-bit bootstrap ROM for loading the program RAM.  This is the version that became most popular.  Peavey used the 560001 (3 of them actually) to power the DPM3 SE keyboard back in 1990.  Recently J. Acorn, from Crasno Electronics in Canada sent The CPU Shack Museum an e-mail inquiring if I had a few of these now obsolete 56001 DSPs spare, to rebuild some dead Peavey keyboards.   As a Museum, I not only like to collect and present vintage IC’s but also regularly help people with project such as this, and have thousands of CPU’s sitting around that have been acquired through the years (really its a bit crazy how much I have collected lol).  Mr. Acorn needed 2 of these DSPs to replace ones destroyed by a leaking battery in a keyboard, and two is exactly what I had spare.  I dug them out, packaged them, and off to Canada they went.  The result?  A restored and working Peavey keyboard.  You can read about the restoration process on Crasno’s site.

The 56000 series continued to be made by Motorola (and then Freescale) up until 2012 when it was announced it would be discontinued as a standalone product.  The 56000 series cores though live on, inside of other Freescale (now NXP) products.

 

August 25th, 2018 ~ by admin

CPU of the Day: FOCUS on 32-bits

1983 HP FOCUS Board set – Pre FPU. Top left: Memory. Top Right: I/O and CPU bottom center

The year is 1981, Intel is making the 8/16-bit 8086/8088, and Motorola has released the 16/32-bit 68000 processor to much fanfare.  Motorola marketed this as the first 32-bit processor, but while it supports 32-bit instructions/data it does so with a 16-bit ALU.  HP, always used the MC68000 in their 9000 Series 200 line of computers, providing rather good performance for 1981. But this was the 1980’s and HP wasn’t satisfied with good, they wanted more, they wanted to implement a full 32-bit computer on something less then the 5,000 IC’s typically used to implement one at that time.  This meant making a processor like nothing else before, something with more then the 68,000 transistors of the MC68000 or even the 134,000 transistors of the new i286 Intel had announced.  What HP made is simply remarkable, in 1981 they announced the HP 9000 Series 500 computers, powered by an all new fully 32-bit processor called the FOCUS.  FOCUS was made on HP’s high density NMOS-III process, a 1.5u process, and used 450,000 transistors.  Thats 450,000 transistors on a single 40.8mm2 piece of 1.5u silicon in 1981, a smaller die than the Intel 286.

Read More »

Posted in:
CPU of the Day

August 15th, 2018 ~ by admin

CPU of the Day: The 61 Knights of the Intel Xeon Phi

Xeon Phi – Knights Corner – Engineering Sample

In June of 2013, 20 years after the release of the Intel Pentium Processor, Intel released a new processor, technically a co-processor that Intel referred to as a MIC (Many Integrated Core).  It was branded as a Xeon, specifically the Xeon Phi 7000 series but at its core, it was nothing like a Xeon of 2013.  Code named Knights Corner, it built on the Knights Ferry.  Knights Ferry used many Larrabee GPGPU cores and was not designed as a commercial product.  Knights Corner , however, was, and to do so, Intel stuck with an architecture that customers were very familiar with, x86.  The Knights Corner integrated 61 Pentium P54CS cores onto a single chip.  The original Pentium P54CS was made on a 0.35u process and topped out at 200MHz.  They included 16K of L1 cache on die, and typically 256-512K of L2 Cache off chip.  The implementation of the Pentium on the Phi gets a bit of an upgrade.  The cores are made on a 22nm process (16 times smaller) and clocked at up to 1.2GHz.  L1 cache has been increased to 64K per core (32K Instruction  32K Data).  L2 cache remains at 512K

Knights Corner Die. – 62 Cores – 8 GDDR5 Memory Controllers

per core, but at 22nm, integrating all 30.5MB of cache on the same die becomes relatively easy.  The biggest change to the cores is adding support for 64 bit instructions, as well as adding a new execution unit called the VPU. This VPU (Vector Processing Unit) has its own 512-bit wide SIMD instruction set, integer support, Fused Multiply/Add, and other advanced features that are more commonly found in GPU’s. The VPU is the result of Intel’s work with Larrabee, the precursor to Knights Corner.  Interestingly MMX/SSE are not supported by the cores natively, this is handled in software (using virtualization) and leveraging the VPU included with the 61x Pentium Cores.  With the VPU, each core has 4 execution units (VPU, FXU, and 2 x Integer units). This allows the cores to support 4-way multi-threading; in practice, 2 threads are most common as 2 execution units are usually tied up calculating memory addresses.

Knights Corner Sample – This is a 1.09GHz part while production versions were bumped to 1.1GHz – Elpida 2Gbit GDDR5 RAM chips surround the core.

For some reason Intel was very vague about information on die sizes/transistor count on the Phi.  Many sources claim 350mm2 die with 5 Billion transistors.  Taking apart a Phi shows that the die is actually much larger.  In fact the Xeon Phi die is 705mm2 and has 5.1 Billion transistors.  A 22nm Haswell Xeon with 18 cores has a die area of 622mm2 containing 5.6 Billion transistors. This means the Xeon Phi die wasn’t the most efficient is its use of space, likely due to the amount of room needed for the very large rings used to connect all the cores.  Looking at the die you can also see a lot of unused space.   There are actually 62 cores per die (with only 61 used max.)  This means 31MB of L2 cache which at 6 transistors per cell (bit) accounts for 1.5 Billion of the transistors.  L1 Cache is 64K per core so another 190 Million transistors there.  That leaves the bulk of the die for the cores, memory controllers, and the 3 interprocessor communication rings that handle communication between cores, MC’s (8 GDDR5 Memory Controllers per die), and the outside world.

Each Xeon Phi board includes the processor, as well as 6-16GB of GDDR5 Memory (8GB on the Engineering Sample here).  Memory is handled by 32 Elpida EDW2032BBBG-6 2Gbit GDDR5 6 Gbps chips. This gives the card is 352 Gbps memory bandwidth and 1 TFLOPS of computing performance.  All in a PCI-E car that dissipates around 300W.   Card/System management is provided by a NXP LPC2365FBD100 72MHz ARM7TDMI processor.

Knights Corner Xeon Phi with cooler removed. 16x 2Gbit GDDR5 (+16 on the back)

In January of 2013 the Texas Advanced Computing Center in Austin, TX announced the Stampede Supercomputer, the first large scale deployment of Xeon Phi Processors.  It used 6880 of them in its 6400 compute nodes and could hit nearly 10PFLOPS of performance. In June of 2013 the Chinese supercomputer Tianhe-2 became the fastest supercomputer in the world, a title it held until the end of 2015.  It was powered by 32,000 Intel Xeon E5-2692 2.2GHz 12C Ivy Bridge processors and a massive 48,000 Xeon Phi co-processors resulting in over 33PFLOPs.

Tianhe 2 Super Computer with 48,000 Knights Corner Processors.

Intel made a successor to Knights Corner, known as Knights Landing, that was based on the Atom core, but then began to wind down the project.   Avinash Sodani, chief architect of the Knights Landing chip took a job at Cavium Networks (who make multicore MIPS networking processors), and Intel then hired Raja Koduri, the chief architect of AMD’s GPU processors.  Intel’s future seems to be one based on Xeon, and GPU’s.

Like the Knights of old, the the Xeon Phi has been passed up by other technologies, certainly still useful, but destined to the halls of museums and history books.  It came, and it conquered the Top500 Supercomputer list, and then quietly fades away.  On July 27th Intel quietly announced the discontinuation of the Xeon Phi line, with last orders accepted the end of this August (2018).

 

 

July 23rd, 2018 ~ by admin

A Sampling of Sample Processors

AMD K6-2 Marketing Sample

During the development of most any given processor many chips are produced before it is released for commercial use.  These pre-production chips serve a wide variety of purposes in the design and debugging of the processor to ensure that the final CPU work well, sells well, and is compatible with all the vendors parts (motherboards, cooling solutions, power supplies, etc).  These chips are generally referred to as samples, and there is several types of them.  We’ll use Intel/AMD as the main examples but most all processor companies work in similar ways.

When a processor design is first being developed, the package for it is also often being developed as well, what will the new processors silicon die reside in?  How many pins? How will it dissipate heat?  This type of testing is often handled with Mechanical Samples.  Mechanical Samples are exactly as they sound, they test the mechanical aspects of the processor, the physical fit of it.  THese are often sent to board/socket manufacturers to ensure the processor will fit in sockets/boards, and with the automated equipment used to build systems.  Cooling solution companies may also receive these to test how a heatsink fits on the CPU. Mechanical samples may not contain a die at all, or may be chips that were tested as bad, or simply just untested chips (Intel used a lot of untested Mechanical Samples in their educational kits).

Thermal Sample for the LGA2011 Sandy Bridge Xeon

The next samples typically made are Electrical/Thermal Samples.  These again do not have an actually processor die in them, but electrically do work.  Electrical/Thermal samples are used to test the power draw and heat dissipation of a processor.  They often use a daisy chain transistor design, which serves to draw/dissipate power.  If a processor is expected to dissipate 135W of heat, a Thermal sample can be made to draw/dissipate exactly that.  These can test the the power supplies on motherboards, as well as the heat dissipation abilities of cooling solutions.  Some Thermal Samples have a temperature sensor added directly to the package to help see what temps they achieve.  Electrical Samples and Thermal Samples could also be used as purely Mechanical Samples too, and this is sometimes seen marked on the sample.

The first samples made that actually contain a functioning processor die are Engineering Samples.  Engineering Samples (also known as ES) are the most well known samples.  Overclockers often like to find ES CPUs as they will often allow for easier overclocking due to some not having locked in speed (multiplier locked).  Engineering Sample CPUs themselves come in several types as well.  Usually the first run is known as ES1, these can be thought of as an ‘Alpha’ version.  They are very likely to be buggy, and rarely run at the same speed as a production chip would.  These exist to test the overall processor design, or some subset of it.  Some are made to test just one part of the CPU, for example , the memory controller, or the cache.  Later versions of

Motorola PowerPC 8260 Engineering Sample (note the PPC prefix)

Engineering Samples are often called ‘ES2.’ These processors are getting closer to final production and are a lot less buggy, these would be considered ‘Beta’ Samples.  Most of the time these are quite usable chips, and often are very similar in clock speed/features to a production processor.   Intel denoted these chips with a Q-spec (such as QBGC) rather then production processor having an S-spec (such as SL5G8).  AMD typically uses part numbers starting with ‘1’ for ES1 CPUs and ‘2’ for ES2 CPUs. (such as Opterons 1S160805L4BGC or 2S16….).  Other companies have similar methodologies.  Motorola (Freescale) used the PPC prefix for most ES CPUs and Texas Instruments uses ‘TMP’ (not to be confused with Toshiba who also uses the TMP pre-fix, but for processors in general). Once a company is fairly confident a design is ready for release one final version is made.

These are known as Qualification Samples (QS).  QS processors almost always have a one to one equivalence with a production part, since that is their purpose, to make sure the design is ready for release.  These processors are by far the most widely made chips, as they are shipped by

Alchemy Au1000 MIPS Processor – Qualification Sample

the thousands to vendors, system builders/integrations, and even the media outlets for review.  The hope is that nothing major wrong is found with them, and that any bugs that are found can be dealt with in software or firmware, not requiring an entire silicon fix.  Intel continues to use Q-specs for these as well, leading to some confusion with the previously mentioned ES CPU’s.  AMD usually uses part numbers beginning with ‘Z’ for QS CPU’s and like Intel, does not offer these CPU’s for sale to the general public, they are either given to vendors, or sold exclusively to them for testing.   Motorola uses XC, or XPC for these, and unlike AMD/Intel, mass produces these and sells them, often for years, before they decide that a part/design is truly fully qualified/characterized (in which case the prefixe is changed to MC. or MPC).  Texas Instruments uses the ‘TMX” prefix for their Qual. Samples. and tended to make/sell them like Motorola did with theirs, changing the prefix to TMS for fully qualified production parts.

Read More »

July 3rd, 2018 ~ by admin

CPU of the Day: The Intel Everest Series

Mt. Everest – Tallest on Earth

Mt. Everest is the tallest mountain here on Earth, the pinnacle of climbing challenges.  There is no going higher then Mt. Everest.  At Intel the pseudo-unofficial codename for the absolute fastest speed bin of a particular processor is…Everest.  Everest processors are the fastest an architecture will so reliably.  Sometimes these processors end up an normal products, available for consumers to purchase.  The first good example of this is the Core 2 Extreme QX9775 Yorkfield core (Core Architecture).  They were a quad-core processor running at 3.2GHz, fast but not mind blazingly so.  The Xeon equivalent was the X5492 (Harpertown) 4-core at 3.4GHz.

Xeon X5698 – Westmere – 4.4GHz – Mid 2010

The next well know Everest was a chip based on the Westmere (shrink of Nehalem) architecture.  The Westmere Everest became known as the Xeon X5698, and was available for OEMs only, in fact it was a special order processor made with one particular type of client in mind. These were to be used for High Frequency Stock traders, and other such high speed transactional processing, where the ability to complete trades as fast, and reliability as possible is the entire nature of the business.  This means that single thread performance is far more important then having multiple core, and as such, the X5698 uses a 6-core die with only 2 cores active, but retaining access to the entire 12MB of L3 cache.  Clock speed was fixed at 4.4GHz, the cores did not reduce frequency as processing demands changed, as this would introduce uncertainty in how fast it would complete a given task. Doing task ‘X’ should take a predictable amount of time and not depend on what speed the processor chose to run at.  The next fastest Westmere processor was the X5690, which was a 6-core (all cores enabled) running at 3.46GHz (the same chip essentially as the Core i7 990X).  The X5698 was nearly 1GHz faster.  The X5690 cost around $1800, where as the X5698 cost around $20,000 EACH (based on costs OEMs charged to add a 2nd one so they may have marked it up some).  The impressive thing is that these chips would go faster.  Intel sampled 4.66GHz versions and Supermicro built systems using X5698’s overclocked to 4.8GHz.  All this back in 2011.

4.4GHz Jaketown (Sandy Bridge) Everest Sample 2010-2011

Intel’s next architecture was known as Sandy Bridge.  Sandy Bridge topped at at 3.5GHz (6-cores) for the Core i7 Extreme 3970X and 3.6GHz for the 4-core i7-3820 and similar Xeon E5-1620.  Intel demo’d an air cooled Sandy Bridge running on stage for a presentation at 4.9GHz, so the core certainly had some room to spare.  There is no documentation (that I could find) that Intel actually released anything faster then 3.6GHz, at least that I could find, but evidence suggests that they at least were thinking about it.  The picture is a Sandy Bridge Xeon in LGA2011 marked JKT EVEREST SS 4.4GHZ INTERNAL USE ONLY. JKT is short for Jaketown, Intel’s codename for the 32nm Xeon E5-2600 series.  That gives a very good idea what this processor was to be.  SS is likely to be a Single Socket (as often at those speeds getting dual systems working can be tricky).  Sandy was certainly capable of hitting 4.4GHz, with 4-core, and even air cooling, so perhaps these were samples for a limited OEM run, much like the previous Westmere X5698 processors.

Read More »

April 11th, 2018 ~ by admin

PowerPC Processor for TESS Planet Hunter – Updated

TESS Orbiter – Freescale (now NXP) 2010  PowerPC e500

UPDATE: I received a note from a NASA engineer that the final flight DHU was made by SEAKR Engineering rather then Space Micro.  It turns out MIT pursued 2 different DHU systems in the design of TESS.  The Space Micro IPC 7000 was referred to as the DHU and a system by SEAKR (the Athena-3) was selected as the ADHU (Alternate Data Handling Unit).  Apparently MIT wasn’t sure which would be best so essentially characterized both (and most documentation from early on shows the Space Micro system).  In the end however, the SEAKR Athena-3 Single Board computer was selected.

If all goes well, in a few days the NASA TESS (Transiting Exoplanet Survey Satellite) will be launched on a SpaceX Falcon 9 rocket to startits mission to survey a large portion of the sky for possibly Earth-like planets.  TESS’s finds will make great candidates for further study by either Hubble, or JWST (when it finally launches).  While TESS can see transiting planets (the dimming of a star as an exoplanet passes in front of it) it cannot determine much about its composition, or the composition of its atmosphere.  However, having a list of exoplanets to further check out, especially Earth-sized ones, it’s a big help.  TESS was created as part of the NASA Medium Class Explorers Program (MIDEX) which is for mission up to around $200 Million total cost to NASA (not including launch).  TESS itself cost about $75 million (developed in large part by MIT and built by Orbital-ATK on their LEOStar-2 Platform) and the launch services contract was $87 Million with the remainder taken by operations and contingency funding.

Space Micro Proton 400k with Freescale 2020 processor

That makes this one of the least expensive NASA missions, but one that has engendered much more public interest then its cost suggests.  Finding alien worlds captivates people hearts and minds.  So what is at the heart of the TESS orbiter?  Obviously the premier technology is its 4 cameras that will scan the sky, but the computer that powers these is no less interesting.

The 4 cameras are interfaced to a Data Handling Unit (DHU).  Initially the DHU was to be the Space Micro IPC-7000 computer.  The IPC-7000 consists of a TI TMS320C67xx 32-bit DSP and a pair of Xilinx Virtex-7 FPGAS.  They handle all the pre-processing of the imagery collected by the cameras, making it into a format that is easily transmitted back to earth.  The rest of the spacecraft functions (such as actually sending/storing the data and other space craft house-keeping) is handled by a Space Micro Proton 400k SBC.  The Proton 400k is based on a Freescale 2020 1GHz Dual Core PowerPC processor made on a 0.45u process..  Each PowerPC e500v2 core has a 7-stage pipeline with 32K of I-cache and 32K of D-Cache and shares a single 512K L2 Cache.  The computer also containing a pair of 192GB solid state memory boards for buffering imagery data (data is relayed to Earth only once per orbit, so it needs to store data from around 14 days).

Athena-3 SBC – Powered by a 1.067GHz Freescale P2010 Processor

The final flight version of TESS switched to an ADHU made by SEAKR Engineering.  This uses a very similar setup but a bit less powerful processor.  The heart of the ADHU is the Freescale P2010 e500 processor at 1066MHz with 1GB of DDR2 RAM and 1-4GB of Flash.  This is the single core version of the P2020 used in the initial Proton 400k.  The ADHU also includes a RCC5 triple Xilinx Virtex-5 FPGA board to handle additional camera processing functions (and anything else not handled by the P2010 processor).  Solid state storage is a Gen 3 FMC also by SEAKR, containing 3 boards with a total of 192GB of Flash.  The ADHU handled all of the science, processing the raw camera data into useful science data and handling the sending of data to the 100-125MBit/sec Ka-band transmitter.  It also supplies some star reference information used by the MAU (Master Avionics Unit) computer to provide finer attitude control of the satellite.  The MAU is the LeoStar-2 Satellites main computer, and handles all the mechanics of flying the spacecraft outside of the science work done by the ADHU.

Freescale P2020 Processor

In many ways this is a very advanced processor compared to the RAD750 processors we often see on large scale NASA missions.  The Freescale 2020/2010 is not an inherently radiation hardened design, however both Space Micro and SEAKR  implements many radiation mitigating designs in the system design to compensate for this.  It is not as robust as the RAD750 but it is a $75 million earth satellite with a target mission life of 2-years so it doesn’t need to be. The 2020 processor does give TESS tremendous processing power for a scientific satellite, allowing for a lot of pre-processing of the imagery.  This allows TESS to handle much of the grunt work, and send scientists here on Earth only the very best data, in a format that is the most useful to them.

 

March 24th, 2018 ~ by admin

Making MultiCore: A Slice of Sandy

Intel Sandy Bridge-EP 8-core dies with 6 cores enabled. Note the TOP and BOTTOM markings (click image for large version)

Recently a pair of interesting Intel Engineering Samples came to The CPU Shack.  They are in a LGA2011 package and dated week 33 of 2010.  The part number is CM8062103008562 which makes them some rather early Sandy Bridge-EP samples.  The original Sandy Bridge was demo’d in 2009 and released in early 2011.  So Intel was making the next version, even before the original made it to market.  The ‘EP’ was finally released in late 2011, over a year after these samples were made.  Sandy Bridge-EP brought some enhancements to the architecture, including support for 8-core processors (doubling the original 4).  The layout was also rather different, with the cores and peripherals laid out such that a bi-direction communications ring could handle all inter-chip communication.

Sandy Bridge-EP 8-core die layout. Note the ring around the inside that provides communications between the peripherals on the top and bottom, and the 8-cores. (image originally from pc.watch.impress.co.jp)

Sandy Bridge EP supports 2, 4, 6 and 8 cores but Intel only produced two die versions, one with 4 cores, and one with 8 cores.  A die with 4 cores could be made to work as a dual core or quad, and an 8-core die could conceivably be used to handle any of the core counts.  This greatly simplifies manufacturing.  The less physical versions of a wafer you are making, the better optimized the process can be made.  If a bug or errata is found only 2 mask-sets need updated, rather then one for every core count/cache combination.  This however presents an interesting question..What happens when you disable cores?

That is the purpose of the above samples, testing the effects of disabling a pair of cores on an 8-core die.  Both of the samples are a 6-core processor, but with 2 different cores disabled in each.  One has the ‘TOP’ six cores active, and the other the ‘BOTTOM’ six cores are active.  This may seem redundant but here the physical position of the cores really matters.  With 2 cores disabled this changes the timing in the ring bus around the die, and this may effect performance, so had to be tested.  Timing may have been changed slightly to account for the differences, and it may have been found that disabling 2 on the bottom resulted in different timings then disabling the 2 on the top.

Ideally Intel wants to have the ability to disable ANY combination of cores/cache on the die.  If a core or cache segment is defective, it should not result in the entire die being wasted, so a lot of testing was done to determine how to make the design as adaptable as possible.  Its rare we get to see a part from this testng, but we all get to enjoy its results.

March 15th, 2018 ~ by admin

CPU of the Day: Intel Jayhawk – The Bird that Never Was

Intel Jayhawk Thermal Sample – 80548KZ000000 QBGC TV ES – Made in April 2004 Just 3 weeks before it was canceled

Perhaps fittingly the Jayhawk is not a bird, but rather a term used for guerilla fighters in Kansas during the American Civil War.   It is also the name of a small town in California 150 miles Northeast of Intel’s headquarters in Santa Clara.  It was also the chosen code name for a Processor Intel was working on back in 2003.  In 2003 Intel was working on the Pentium 4 Prescott processor, to be released in 2004 and its Xeon sibling, the Nocona (and related Irwindale),  The Prescott was a 31 stage design made on a 90nm process.  There was hopes it would hit 4+ GHz but in production it never did, though overclockers, with the help of LN2 cooling were able to achieve around 8GHz.  Increasing the length of the pipeline helps allow higher clock speeds, the Northwood core had a 20-stage pipeline so the Prescott was a rather big change.  There is a cost of lengthening the pipe, processors don’t always execute instructions in order, often guessing what will come next to speed up execution.  This is called speculative execution, processors also guess what data is to be needed next, and stick it in cache.  If either of these ‘guesses’ is wrong, the processor needs to flush the pipeline and start over, at a comparatively massive hit in performance.  This is what performance doesn’t always scale very linearly with clock speed.

Intel figured that this wouldn’t be an issue and so the Prescotts successor was to have a 40-50 stage pipeline.   THe hopes were for 5GHz at 90nm and 10GHz at 65nm. The desktop version was known as Tejas, and the server version, Jayhawk.  Initially these were to be made on the 90nm process, same as Prescott, before being transitioned to a 65nm process.  It increased the L1 cache to 24k (some sources say 32k) from the Prescotts 16k.  The Instruction trace cache was still 16k micro-ops, though this could have been increased.  L2 cache would have been 1MB at introduction and 2MB once the processor moved to a 65nm process.  Eight new instructions were to be added called ‘Tejas New Instructions’ or TNI, these later would become part of the SSSE3 instructions released with the Core 2 processor.  It also would bring ‘Azalia’ Intel’s High definition audio codec, DDR2 support, a 1066MHz bus, and PCI-Express support.  It turns out there was a problem….

Read More »

January 28th, 2018 ~ by admin

CPU of the Day: Tandem CLX 800 – It Takes 2 To Tango

TANDEM CLX 800 Processor – VLSI CMOS 1u process – 16MHz.

Tandem Computers was established way back in 1974, and was one of the first (if not the first) dedicated fault-tolerant computing companies.  They designed completely custom computers designed for use in high reliability transaction processing environments.  These were used for support of stock exchanges, banks, ATM networks, telephone/communications interchanges, and other areas where a computer failure would result in significant, costly, disruptions to business services.  Tandem was started by James Treybig, formally of HP, and a team he lured away from HP’s 3000 computer line.

Tandem computers are designed to do two things well, fail-over quickly when a failed part is detected.  This means that if a faulty processor or memory element is found, it can be automatically disabled, and processing continues, uninterrupted, on the rest of the system.  The other design element that Tandem perfected was allowing the computer to find and isolate intermittent problems.  If a processor or storage element ceases to work, that is relatively easy to figure out, but if a processor is glitchy, causing errors only occasionally, that can be much harder to find and can result in serious problems for the user.  This is known as ‘Fast Fail’ and today is a pretty standard concept, find the error, catch it, and prevent erroneous data from ever making it back into the database.  Tandem computers were designed from the ground up to be fault tolerant, disks were mirrors, power supplies, busses,

Tandem CLX 600 PCB (click for larger)

processors,all were redundant, but unlike some other systems, components were not kept as ‘hot spares’ sitting idle until something failed.  This kept hardware from being ‘wasted.’ Under normal operation if it was in the system, it was contributing to system performance.  A failed component then would reduce system performance until it was replaced/fixed, but a customer would not be paying for hardware that served them no purpose unless something broke.

To support these goals Tandem designed their own processors and instruction set architecture know as TNS (Tandem NonStop).  The first processors were a 16-bit design call the T/16 (later branded NonStop I) made out of TTL and SRAM chips spanning 2 PCBs.  Performance was around 0.7MIPS in 1976.  They were a stack based design similar to the HP3000 with added registers as well.  T/16 systems supported 2-16 processors. NonStop II, released in 1981, was similar, but supported the occasional 32-bit addressing, increasing accessible memory form 1 to 2MB per CPU and performance to 0.8MIPS.

The 1983 introduction of TXP saw a great performance improvement, up to 2.0 MIPS, but kept the same form factor.  The processor was implemented in TTL, with the addition of many PALs and added much better support for 32-bit addressing.  In 1986 the NonStop VLX was released, which moved to an ECL based processor.  This was a full 32-bit design, running at 12MHz (3MIPS) but still using discrete components and a new bus system as well.  This was to be the high end of the NonStop line, it was fast reliable, and rather large.  The desire for a more economical system to fit the needs of smaller customers led to a first for Tandem…

Read More »

December 19th, 2017 ~ by admin

Chip of the Day: TRW MPY-16AJ – Making Multiplication Manageable

TRW MPY16AJ – 1978

In Primary School students are tasked with memorizing their multiplication tables.  Taking the time to manually calculate 6×5 is much slower than simply committing the result to memory.  This allows more complex math to be processed quicker as the students skills develop.  Typically this is limited to numbers up to 12×12, resulting in 144 results to ‘store.’  In computing the same can be done.  A ROM can be used as a lookup table for multiplication.  The problem is it does not scale well.  Handling 4×4-bit multiplication requires a 256×8 ROM (2m+n addresses and m+n outputs). This could be handled by a many ROMs available in the 1970’s.  Anything more than 4-bits though was simply not possible.  This gave rise to the need for multipliers to calculate the result.

TRW MPY16AJ – Large Heatsink affixed to package to dissipate its 5W

This was a problem TRW set out to rectify in 1976.  TRW LSI Products was formed in the 1960’s to commercialize the transistor products that had been developed by Pacific Semiconductors, a division of TRW.  It was James Buie who invented the TTL logic gate in 1961 while working for TRW.  TTL went on to become the logic standard throughout the 1970’s and 80’s.   TRW was involved in aerospace, helping design planes, satellites, and missiles, fields that required processing of signals data, what became known as Digital Signal Processing (DSP).  In the 1980’s processors were designed to handle this, such as the TI TMS320 series, but in the 1970’s it had to be done with discrete components.  DSP systems had several needed blocks, Fast ADCs, ALUs, and multipliers.  TRW invented fast ADCs to handle the inputs, and ALUs were available such as AMDs Am2901 or even the TTL series 74181s.  Multipliers however were not widely available, especially for large bit-widths.

MPY-16 die

TRW’s first multiplier was a custom device to work with their own avionics processing system.  It was made on a Bipolar process, and multiplexed the entire product, using around 40 pins total (the entire product was multiplexed with the operands).  It could handle a multiply in 330ns worst case.  Interestingly yields of the device were considered ‘excellent’ at 3 working devices per wafer (out of 19 per wafer (most likely a 2″ wafer)).  Today, yields like that would be completely unacceptable.

TRW designed the MPY-16AJ as a brute-force 16×16 multiplier.  It was designed on a Bipolar process with around 3600 gates.  It implements a series of AND gates and CARRY-SAVE-ADDERS to implement the multiplication.  There are faster methods, but they come at the cost of complexity and power draw.  As designed the the MPY-16AJ dissipates 5 Watts while handling a signed (2’s complement) multiplication in a worse case 230ns).  They MPY16 was packaged in a large 64-pin package to limit the # of pins that had to be multiplexed.  The lower 16-bits of the product are multiplexed with one of the operands.  This is acceptable as in many applications the upper 16-bits of the product are sufficient accuracy.  The 64-pin package allowed for not less multiplexing, but also a much larger surface for heat dissipation.  A heatsink was also affixed to the package as well.

Micron (Russia) 1802VR5 – MPY16HJ Clone made in 1992

Later versions of the MPY-16 added support for unsigned multiplication as well (the MPY16H) and became the standard for 16-bit multipliers.  Compatible multipliers were made by Analog Devices (ADSP1016, 40-50ns at 150mW) and LOGIC LMU16/216) in CMOS, by Weitek (WTL1516/A/B, 50-100ns at 0.9-1.8W) in NMOS, by Synertek (SY66016 100ns at 1.5W) in HMOS, by AMD (Am29516 38ns at 4W) in ECL, as well as many others.  These were implemented internally with different processes, and different multiplier algorithms but externally they all mimicked the standard TRW MPY16J and served as the basis of many signal processing and high end math computers.  As a testament to their usefulness, the MPY16 was also copied by the USSR as the 1802VR5.  The TRW MPY16 was last made in the mid-1980’s but its clones continued to be made into the 1990’s.  Today its functions can be handled by any DSP, CPU or even coded into a FPGA, but for a time, the MPY16 multiplied the efficiency of many processing systems.