November 5th, 2016 ~ by admin

GRAPE-6 Processor: A Gravitational Force of Reckoning

GRAPE-6 Processor - 90MHz

GRAPE-6 Processor – 90MHz -2000

Understanding the movements of the stars has been on mankinds mind probably since we first stared into the sky.  Through the ages we can predict where a star or planet will be in the sky in the next few months, years, even hundreds of years, but to be able to predict the exact orbital details for ALL time is rather more tricky.

This helps understand how planetary systems form, and the conditions that make that possible.  It allows us to see what happens when two massive black holes pass each other by, will the merge? will they orbit? will one go rogue?  These are interactions that take millions of years, and thus we need to calculate the gravitational forces very accurately. This isnt a terribly hard problem for two bodies, and is doable for three with little fuss, but for numbers of bodies greater then that, the calculations grow rapidly, on the order of N2/2.

In the late 1980’s Tokyo University began work on developing a computer to calculate these forces.  Every gravitational force had to be be calculated with its effects on every other body in the system.  These results were then fed to a commodity computer for summation and final results.  This made the Tokyo project a sort of Gravity co-processor, or as they called it a Gravity Pipeline, GRAPE for short.  The GRAPE would do the main calculations and feed its results to another computer.

Name Year Performance/chip Clock Notes
GRAPE-1/1A 1989 310MFlops 8MHz ALU + EPROMs
GRAPE-2 1990 50MFlops 4MHz Commercial FPUs
GRAPE-2A 1990 180MFlops 6MHz Commercial FPUs
GRAPE-3 1991 300MFlops 10MHz Nat. Semi. 1um ASIC
GRAPE-3A 1992 600MFlops 20MHz Nat. Semi. 1um ASIC
GRAPE-4 1995 625MFlops 32MHz LSI 1um ASIC
GRAPE-5 1998 5GFlops 80MHz NEC 0.35um ASIC
GRAPE-6 2000 31GFlops 90MHz Toshiba 0.25u ASIC
GRAPE-7 2006 100GFlops 100MHz FPGA
GRAPE-DR 2008 250GFlops 400MHz ASIC
GRAPE-8 2012 480GFlops 250MHz 45 nm N2X740 eASIC
GRAPE-1 Processor (Multi chip board with ALU + EPROMs) - 4MHz

GRAPE-1 Processor (Multi chip board with ALU + EPROMs) – 4MHz

The first GRAPE-1 in 1989 was based on several 16-bit ALU (SM5833AF by Nippon Precision Circuits, aka Seiko) and EPROMs used for the rest of the math. EPROMs were used as look up tables as that was faster then doing the actual math, modern processors have continued to use this method as well, as do people, its no different then memorizing multiplication tables, versus doing the actual math. The GRAPE-1 was fairly fast but since it estimated things, not super accurate.  It could handle 310Mflops.  GRAPE-1A added some more capabilities (particles could now be different sizes) but speed was about the same.

The second iteration, the GRAPE-2 used 2 TI SN74ACT8847 64-bit FPUs for 64-bit operation, 3 ADSP3201 and 3x ADSP3202 for 32-bit FP operations.  This increased the accuracy greatly at the cost of speed, but still could attain 50MFlops running at 4MHz. GRAPE-2A increased clock speed to 6MHz and could handle a single particle at 180Mflops.

GRAPE-3 Chip - 10MHz single pipeline - National Semiconductor ASIC -1991

GRAPE-3 Chip – 10MHz single pipeline – National Semiconductor ASIC -1991

GRAPE-3 was the first custom GRAPE chip, fab’d by National Semi using 110,000 transistors and clocked at 10MHz on a 1um process. Each chip could handle 0.3GFlops, and with 24 on a board,   a system consisted of a pair of these boards with 48 chips, resulting performance was around 14.4 Gflops at moderate accuracy. The 48 chips become important later, as each chip is one gravity pipe, thus allows 48 particles to be calculated at once. GRAPE-3A increased the clock speed to 20MHz but used only 4 chips, resulting in performance of 2.4GFlops in 1992.  GRAPE-3 systems were the first to be sold commercially, with over 80 of them sold worldwide.

GRAPE-4 was a high accuracy design (if you have noticed, odd numbers were low accuracy, even numbers high accuracy design).  It was designed with an LSI 1um ASIC of over 400,000 transistors running at 32MHz.  48 of these chips resided on one board, handling 48 particles.  A 36 board system (1728 chips) could achieve 1.08TFlops (625MFlops per chip). GRAPE-4 was completed in 1995 and was the first scientific computer to exceed 1TFlops, beating Intel’s ASCI Red by a year.

GRAPE-5 was the next lower accuracy design, it was an improvement on the GRAPE-3.  It was based on a 0.35um NEC ASIC running at 80MHz and containing 2 gravity pipes. This resulted in a single chip performance of around 5GFlops, 8 times faster then the GRAPE-3

grape-6_header

That brings us to the GRAPE-6, really the pinnacle of development of the GRAPE.  (there have been a few other designs but the program seems to have wound down a bit after the GRAPE-6).  The -6 is a high accuracy design, an upgrade on the the GRAPE-4.  A single GRAPE-6 chip handles what an entire 48-chip GRAPE board did before. One chip, running at 90MHz can calculate 48 different particles at once.  The chips are

GRAPE-6 Module - 4 Processors + 4MB of SSRAM

GRAPE-6 Module – 4 Processors + 4MB of SSRAM

made using a Toshiba TC240E series 0.25u ASIC and contain over 1.8 million gates.  Each chip contains 6 gravity pipes (each of which supports 8 virtual pipes thus giving 48 pipes) and is supported by 1MB of 166MHz SSRAM. What exactly is a Gravity pipe? Its a series of Floating Point ALUs, adders, and multipliers.  In the GRAPE-6 each chip contains around 60 ALUs, so about 350 per chip.

While specialized, the GRAPE-6 is VERY efficient. A GRAPE-6 chip consumes about 13 W and provides 30 Gflops, compared to 95 W and 48 Gflops for 45 nm Xeon chips. GRAPE-6 is about five times more power efficient. Compared to the x86 processors of 1999, a GRAPE-6 chip was more than 20 times faster and more than 50 times more power efficient.  This performance per Watt gather the GRAPE-6 a Gordon Bell prize for efficiency in computing.

The GRAPE-7 was a 2006 experiment in moving the GRAPE to an FPGA.  It is much less efficient, but easier and faster to design, saving a large amount of cost.  Performance was around 100 Gflops per chip running at 100MHz.

GRAPE-6 Die - Easily seen are the 6 Gravity Pipes. THe middle is the predictor pipeline

GRAPE-6 Die – Easily seen are the 6 Gravity Pipes. THe middle is the predictor pipeline

The next evolution of the GRAPE took a fairly radical turn.  The GRAPE-DR ( Greatly Reduced Array of Processor Elements with Data Reduction) is made up of 512 Processor elements which are very simple processors (ALU, FMUL, FADD, and Registers/Cache) operating in an SIMD fashion.  Performance is around 250 GFlops per chip, and with 4 perboard, 1Tfliops per board.  A single GRAPE-DR board has as much performance as an entire GRAPE-4 1728 chip system. In 2010 a 400MHz 6400 core GRAPE-DR ranked 281 in the Top 500 Supercomputer list.

In 2008 the GRAPE-8 was released, this was based on a structured 45nm ASIC from eASIC (essentially a flat ASIC, with a single mask layer for logic, resulting in much more affordable design/test).  Running at 250MHz performance hit 480Glops per chip, double that of the GRAPE-DR, but consuming only 10 Watts, compared to the 60 Watts of the GRAPE-DR.   There was a GRAPE-9 planned but information on it is scarce.  Thus is the nature of University research projects, they are somewhat dependent on continued justification of funding.

Posted in:
CPU of the Day

Leave a Reply