September 5th, 2014 ~ by admin

MasPar: Massively Parallel Computers – 32 cores on a chip

MasPar PE3232 – 32 12.5MHz 32 bit Processing Elements – 1992

In the 1980s, DEC researchers were designing a supercomputer based on the Goodyear MPP from 1983. Jeff Kalb was in charge of the DEC division involved in this work. The original Goodyear MPP was based on a 1-bit processing element (PE). DEC widened that to a 4-bit PE and increased the connectivity between PEs. When DEC decided not to commercialize the supercomputer design, Kalb left (with DEC’s blessing) to start a company of his own that would. The result was MasPar, founded in 1987.

MasPar derives its name from the product it sought to create, a Massively Parallel supercomputer. These types of computers, sometimes also referred to as vector processors, are SIMD (Single Instruction, Multiple Data) machines: they perform the same operation on a very large set of data. SIMD instructions are now found on nearly all desktop processors, where they can greatly speed up multimedia processing. In the late 1980s there were several companies making MPP computers. Perhaps the most famous was Cray, but there were also Thinking Machines’ Connection Machine, Intel’s Paragon (i860 based), nCUBE’s hypercube, Meiko Scientific’s CS-1 (Transputer based) and several others. Such systems cost upwards of $100,000 each, so sales were not vast; companies typically sold a few hundred to a few thousand systems.

MasPar’s first design, the MP-1, was based directly on the research done at DEC. Each processing element contained a 4-bit ALU, a 1-bit logic unit, and a 64/16 (mantissa/exponent) unit for handling floating point. Each PE also had 48 32-bit registers. The PEs were designed as 32-bit RISC processors, which means that, with the 4-bit ALU, any 32-bit ALU operation would take at least 8 cycles. This was considered acceptable in an MPP-type system. Each custom VLSI CMOS MP-1 chip contained 32 individual PEs. The chips were made on a 1.6 micron process and contained 400,000 transistors. Clock speed was a fairly low 12.5MHz, but this allowed the chips to be air cooled with no special cooling systems. They were packaged in an inexpensive 208-pin PQFP; nothing special was needed due to the low heat dissipation. A 1024 PE board (32 chips) dissipated only 50 watts, and an entire 16k processor system dissipated less than 1,000 watts.
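To make the "at least 8 cycles" figure concrete, here is a minimal Python sketch of a 32-bit add performed one 4-bit slice at a time, followed by the board and power arithmetic from the paragraph above. It is only an illustration of the idea, not a model of MasPar's actual datapath: the function name `nibble_serial_add` and the constants are hypothetical, and "16k" is assumed to mean 16,384 PEs.

```python
def nibble_serial_add(a: int, b: int, width: int = 32, slice_bits: int = 4):
    """Add two width-bit integers one slice_bits-wide slice per cycle.

    Returns (sum modulo 2**width, number of ALU cycles used).
    """
    mask = (1 << slice_bits) - 1
    carry = 0
    result = 0
    cycles = 0
    for shift in range(0, width, slice_bits):
        # One pass through the narrow ALU: add two slices plus carry-in.
        s = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (s & mask) << shift
        carry = s >> slice_bits  # carry-out feeds the next slice
        cycles += 1
    return result, cycles


# System-level arithmetic using the figures quoted in the text.
PES_PER_CHIP = 32
PES_PER_BOARD = 1024
WATTS_PER_BOARD = 50
TOTAL_PES = 16 * 1024  # assumption: "16k" means 16,384 PEs

chips_per_board = PES_PER_BOARD // PES_PER_CHIP
boards = TOTAL_PES // PES_PER_BOARD
pe_board_watts = boards * WATTS_PER_BOARD

if __name__ == "__main__":
    total, cycles = nibble_serial_add(0x12345678, 0x11111111)
    print(hex(total), cycles)
    print(chips_per_board, boards, pe_board_watts)
```

Under these assumptions a full system needs 16 PE boards drawing about 800 watts between them, which fits the "less than 1,000 watts" figure quoted above.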


November 1st, 2013 ~ by admin

nCube and the Rise of the HyperCubes

nCube/2 Processor – 20MHz
The logo is a Tesseract – a 4-way Hypercube

In 1983 Stephen Colley, Dave Jurasek, John Palmer and three others from Intel’s Systems Group left Intel, frustrated by Intel’s seeming reluctance to enter the then-emerging parallel computing market. They founded a company in Beaverton, Oregon, known as nCube, with the goal of producing MIMD (Multiple Instruction, Multiple Data) parallel computers. In 1985 they released their first computer, the nCube/10. The nCube/10 was built using a custom 32-bit CMOS processor containing 160,000 transistors and running initially at 8MHz (later increased to 10MHz). IEEE 754 64-bit floating point support (including hardware sqrt) was included on chip. Each processor was on a module with its own 128KB of ECC DRAM (implemented as six 64K x 4-bit DRAMs). A full system, with 1024 processor nodes, had 128MB of usable memory (192MB of DRAM counting the bits used for ECC). From the outset the nCube systems were designed for reliability, with MTBFs for full systems in the six-month range, extremely good at the time.

The nCube/10 system was organized in a Hypercube geometry, with the 10 signifying its ability to scale to a 10-way Hypercube, also known as a dekeract. In this architecture any processor is at most 10 hops from any other processor. The benefit is greatly reduced latency in cross-processor communication. The downside is that expansion is restricted to powers of 2 (64, 128, 256, 512, etc.), so every upgrade means doubling the processor count, which made upgrades expensive as systems scaled up. Each processor contained 22 DMA channels, with a pair reserved for I/O to the host processor and the remaining 20 (10 in + 10 out) used for interprocessor communication. This focus on a general-purpose CPU with built-in networking support is very similar to the Inmos Transputer, which at the time was making similar inroads in the European market. System management was run by similar nCube processors on Graphics, Disk, and I/O cards. Programming was via Fortran 77 and later C/C++. At the time it was one of the fastest computers on the planet, even challenging the almighty Cray. And it was about to get faster.
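The 10-hop limit follows from how a hypercube is wired: if the 1024 nodes are numbered with 10-bit addresses, two nodes are directly linked when their addresses differ in exactly one bit. The Python sketch below illustrates that property. The function names are hypothetical, and the routing shown (fixing differing bits from lowest to highest) is just one simple scheme chosen for illustration, not a description of how nCube routed messages.

```python
DIMENSIONS = 10  # a 10-way hypercube has 2**10 = 1024 nodes


def neighbors(node: int, dims: int = DIMENSIONS) -> list[int]:
    """Nodes directly linked to `node`: flip each address bit in turn."""
    return [node ^ (1 << bit) for bit in range(dims)]


def hop_count(src: int, dst: int) -> int:
    """Minimum hops between two nodes = number of differing address bits."""
    return bin(src ^ dst).count("1")


def route(src: int, dst: int, dims: int = DIMENSIONS) -> list[int]:
    """One shortest path: correct differing bits from lowest to highest."""
    path = [src]
    current = src
    for bit in range(dims):
        if (current ^ dst) & (1 << bit):
            current ^= 1 << bit  # traverse the link along this dimension
            path.append(current)
    return path


if __name__ == "__main__":
    print(len(neighbors(0)))                 # links per node
    print(hop_count(0, 2**DIMENSIONS - 1))   # worst case: all bits differ
    print(route(0b0000000000, 0b0000000101))
```

Each node has exactly 10 neighbors, which matches the 10 in + 10 out DMA channels used for interprocessor communication, and the worst case of 10 hops occurs only between nodes whose addresses differ in every bit.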
