When vector computing is mentioned, the first company that comes to mind is Cray. Cray was the leading designer and builder of vector supercomputers from the 1970s onward. Vector computing is a bit different than general-purpose computing. Simply put, a vector computer is designed to perform a single instruction on a large set of data at the same time. Such vector support has been added to x86 (in the form of SSE) as well as the PowerPC architecture (AltiVec), but those architectures were not originally designed for it.
Cray, however, is not the only such company. In 1983 NEC announced the SX architecture. The SX-1/2 operated at up to 1.3 GFLOPS and supported 256MB of RAM per processor. By 2001, with the SX-5 and SX-6, performance had increased to 8 GFLOPS and memory support to 8GB of RAM per CPU. For a short while Cray itself marketed and sold NEC SX computers. Each of the processors, from the SX-1 to the SX-9, was a single-core processor, but with the SX-ACE, that changed.
In 2013 NEC released the SX-ACE. The ACE is a quad-core vector processor running at 1GHz. Each core can push 64 GFLOPS, resulting in 256 GFLOPS per chip (node). Each node can address 1TB of memory. These chips are rather enormous, though surprisingly, they are smaller than the SX-9. The SX-9 was packaged in a BGA package with 8,960 balls, while the SX-ACE has reduced this to about 4,300, still impressive on a single package. Transistor count grew from the 350 million of the SX-9 to a staggering 2 billion transistors, all on a 570 mm² die made on a 28nm process. What didn’t grow was power consumption. Power consumption has become a very real concern, and even a metric, for supercomputers. By integrating more cores and more functionality onto a single chip, NEC was able to greatly reduce power usage for a given performance point.
For example, a single 16-processor SX-9 node could achieve 1.6 TFLOPS, but required 560 LSIs and 30 kW of power, more than enough for several average houses. An SX-ACE system with six nodes can push 1.5 TFLOPS, slightly less, but at 2.8 kW and using only 6 LSIs (one per node). There are of course processors that are more efficient, but for vector processing, memory bandwidth becomes much more important than just TFLOPS. TFLOPS are meaningless if the processor can’t get the data to operate on, and the SX-ACE is one of the most efficient processors when looking at memory bandwidth per watt. By that measure it is more efficient than an Ivy Bridge Xeon, POWER7, or SPARC FX10.
That is perhaps fitting, as one of the major uses of the SX-ACE is climate and environmental modeling.
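To put the SX-9 and SX-ACE figures quoted above side by side, here is a small Python sketch that works out peak GFLOPS per watt from them. The numbers are the peak values given in the text, not measured results, and the names (`systems`, `gflops_per_watt`) are made up purely for illustration.

```python
# Peak figures as quoted in the article; sustained performance will differ.
CORES_PER_CHIP = 4
GFLOPS_PER_CORE = 64

systems = {
    "SX-9 (one 16-CPU node)": {"tflops": 1.6, "kw": 30.0, "lsis": 560},
    "SX-ACE (six nodes)": {"tflops": 1.5, "kw": 2.8, "lsis": 6},
}


def gflops_per_watt(tflops, kw):
    """Convert a TFLOPS / kW pair into GFLOPS per watt."""
    return (tflops * 1000.0) / (kw * 1000.0)


# Peak per SX-ACE chip: cores multiplied by per-core throughput.
chip_gflops = CORES_PER_CHIP * GFLOPS_PER_CORE

efficiency = {
    name: gflops_per_watt(s["tflops"], s["kw"]) for name, s in systems.items()
}

print(f"SX-ACE chip peak: {chip_gflops} GFLOPS")
for name, value in efficiency.items():
    print(f"{name}: {value:.3f} GFLOPS/W using {systems[name]['lsis']} LSIs")

# How many times more compute per watt the SX-ACE delivers at peak.
improvement = efficiency["SX-ACE (six nodes)"] / efficiency["SX-9 (one 16-CPU node)"]
print(f"Improvement: roughly {improvement:.1f}x")
```

On these peak numbers the SX-ACE configuration comes out at roughly ten times the compute per watt of the SX-9 node. The same calculation could be repeated for memory bandwidth per watt, but the text gives no bandwidth figures, so that is left out here.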