

The title of the talk:

The 21264: A Superscalar Alpha Processor with Out-of-Order Execution

So, well, Alpha has finally joined the brainiac club. The trick is, we
didn't sacrifice the trademark clock speed while doing so. And while we
were at it, we fixed a couple of nuisances that occasionally give us
surprises with the older Alpha implementations.

The marketing highlights:

Estimated SPECint95 of 30+, SPECfp95 of 50+
Much better cache and memory system
500 MHz+ operation in a 0.35 um process
4-way out-of-order execution
MPEG-2 MP@ML *encode* in real time

Now the details.

Physical:

Same 0.35 um CMOS process as used for the 500 MHz 21164, but with two
additional metal layers for power distribution (so it's a 6-layer metal
process). Die size approx. 300 square mm, 15.2 million transistors. Speed bins
start at 500 MHz; power is 60 watts @ 500 MHz. 588-pin Pin Grid Array package.

Logical:

64 KByte 2-way set-associative instruction cache
64 KByte 2-way set-associative data cache
4 Integer Units (2 of which are also load-store units)
2 Floating Point Units
7 Stage Integer Pipeline
10 Stage Floating Point Pipeline

Branching:

Next line predictor (allows branches without fetch bubbles, and
allows dynamic prediction of computed jumps)
Set predictor (allows 2-way associativity at high speed)
Two level branch predictor (runs a traditional 2-bit counter predictor
and a global pattern-detecting branch predictor in parallel, and
dynamically picks the one that is right more often)
Branch predictor about twice as good as the one in the 21164
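
The "pick the one that is right more often" scheme can be sketched in a few
lines of Python. This is only an illustration of the idea as described in the
talk: the table sizes, the indexing by PC and by global history, and the names
TwoBitCounter, TournamentPredictor and accuracy are assumptions made for this
sketch, not details of the 21264.

```python
class TwoBitCounter:
    """Saturating counter: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, value=1):
        self.value = value

    def predict(self):
        return self.value >= 2

    def update(self, taken):
        if taken:
            self.value = min(3, self.value + 1)
        else:
            self.value = max(0, self.value - 1)


class TournamentPredictor:
    """Two predictors run in parallel; a chooser picks between them."""

    def __init__(self, index_bits=10):  # table size is an assumption
        size = 1 << index_bits
        self.mask = size - 1
        self.per_pc = [TwoBitCounter() for _ in range(size)]   # indexed by PC
        self.glob = [TwoBitCounter() for _ in range(size)]     # indexed by history
        self.chooser = [TwoBitCounter() for _ in range(size)]  # >= 2: trust global
        self.history = 0  # outcomes of the most recent branches, one bit each

    def predict(self, pc):
        hist_index = self.history & self.mask
        if self.chooser[hist_index].predict():
            return self.glob[hist_index].predict()
        return self.per_pc[pc & self.mask].predict()

    def update(self, pc, taken):
        pc_index = pc & self.mask
        hist_index = self.history & self.mask
        pc_ok = self.per_pc[pc_index].predict() == taken
        glob_ok = self.glob[hist_index].predict() == taken
        # Train the chooser only when the two predictors disagree.
        if pc_ok != glob_ok:
            self.chooser[hist_index].update(glob_ok)
        self.per_pc[pc_index].update(taken)
        self.glob[hist_index].update(taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy(predictor, outcomes, pc=0x40):
    """Fraction of correct predictions for one branch at a hypothetical PC."""
    hits = 0
    for taken in outcomes:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(outcomes)
```

On a strictly alternating branch, a lone 2-bit counter that starts in the
weakly-not-taken state mispredicts every time, while the global-history side
learns the pattern after a short warm-up and the chooser drifts toward it.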

Out-of-Order execution:

80 physical integer registers
- 32 architectural
- 8 PAL-code shadow
- 40 rename registers
72 physical floating point registers
- 32 architectural
- 40 rename registers

20 entry integer queue, quad-issue
15 entry floating point queue, dual-issue

Out-of-Order mapper is a 500K transistor structure and is one
of the critical paths in the chip. 80 entry CAM for mapping
up to 80 instructions in flight. Backing out to any state takes
1 cycle.
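
To make the register counts and the "back out to any state" remark concrete,
here is a toy rename map in Python. It is a software sketch only: the real
mapper is a CAM, and the class name RenameMap, the snapshot-based recovery and
the choice of which physical registers start out free are all assumptions.
Returning registers to the free list at retire is left out.

```python
ARCH_REGS = 32
PAL_SHADOW = 8
RENAME_REGS = 40
PHYS_INT_REGS = ARCH_REGS + PAL_SHADOW + RENAME_REGS  # 80, as listed above
PHYS_FP_REGS = ARCH_REGS + RENAME_REGS                # 72, no PAL shadows


class RenameMap:
    """Maps architectural integer registers to physical ones."""

    def __init__(self):
        # Assumption: architectural register i starts in physical register i,
        # and the 40 rename registers form the initial free list.
        self.table = list(range(ARCH_REGS))
        self.free = list(range(ARCH_REGS + PAL_SHADOW, PHYS_INT_REGS))

    def rename(self, dest, srcs):
        """Look up the sources, then give dest a fresh physical register."""
        phys_srcs = [self.table[s] for s in srcs]
        phys_dest = self.free.pop(0)
        self.table[dest] = phys_dest
        return phys_dest, phys_srcs

    def checkpoint(self):
        """Snapshot the mapping, e.g. before a predicted branch."""
        return list(self.table), list(self.free)

    def restore(self, snapshot):
        """Back out to a saved state, e.g. after a mispredict."""
        table, free = snapshot
        self.table, self.free = list(table), list(free)
```

Renaming r1 = r2 + r3 followed by r4 = r1 + r1 gives the second instruction
the new physical name of r1 as both sources, which is what lets it issue as
soon as the first result is ready regardless of later writes to r1.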

Integer units:

4 units:

add/logic/motion-video/shift/branch
add/logic/multiply/shift/branch
add/logic/memory
add/logic/memory

In order to get that many register ports, this is implemented
as two identical copies of an 80 register file with two units
attaching to each copy. The two register files are kept identical
with a 1-cycle delay between clusters.

Floating point units:

add/div/square root
multiply

4 cycle latency, fully pipelined. Divide is not pipelined, retires
6 bits/cycle (compared to 2 bits/cycle in the 21164). The new
SQRT retires 2 bits/cycle (and also isn't pipelined).
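
As a rough feel for what the bits/cycle figures mean, the sketch below counts
the iteration cycles needed to produce a 53-bit double-precision significand.
It deliberately ignores setup, normalization and rounding cycles, which the
talk did not break out, so the real instruction latencies will be longer. The
function name iterative_cycles is made up for this example.

```python
import math

DOUBLE_SIGNIFICAND_BITS = 53  # IEEE double precision (Alpha T-floating)


def iterative_cycles(result_bits, bits_per_cycle):
    """Cycles spent retiring result bits; overhead cycles are not counted."""
    return math.ceil(result_bits / bits_per_cycle)


div_21264 = iterative_cycles(DOUBLE_SIGNIFICAND_BITS, 6)   # 6 bits/cycle
div_21164 = iterative_cycles(DOUBLE_SIGNIFICAND_BITS, 2)   # 2 bits/cycle
sqrt_21264 = iterative_cycles(DOUBLE_SIGNIFICAND_BITS, 2)  # 2 bits/cycle

print(div_21264, div_21164, sqrt_21264)
```

Under these assumptions the divide loop shrinks from 27 iteration cycles to 9,
and the new square root takes about as long as the old divide did.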

Data Cache, load-store reorder buffers:

2 loads/stores per cycle, any combination
implemented as a single-ported cache running at 1 GHz...
32 entry load reorder buffer
32 entry store reorder buffer
Stores check load buffer to enforce ordering
Fine grain cache control through cache prefetch instructions
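
The 1 GHz figure follows from serving two accesses per 500 MHz CPU cycle
through a single port. The arithmetic below also shows how that lines up with
the 8+ GByte/sec. quoted for the L1 Dcache, assuming each access moves one
64-bit quadword; the access width is an assumption, not something stated in
the talk.

```python
CPU_CLOCK_HZ = 500_000_000   # 500 MHz speed bin
ACCESSES_PER_CYCLE = 2       # 2 loads/stores per cycle, any combination
BYTES_PER_ACCESS = 8         # assumption: one 64-bit quadword per access

# A single-ported array has to cycle twice per CPU clock to serve both.
array_rate_hz = CPU_CLOCK_HZ * ACCESSES_PER_CYCLE
peak_bytes_per_sec = array_rate_hz * BYTES_PER_ACCESS

print(array_rate_hz / 1e9, "GHz array rate")
print(peak_bytes_per_sec / 1e9, "GByte/sec. peak")
```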

Board level cache:

L1 Dcache 8+ Gbyte/sec. sustained, 3 cycle load-to-use (like 21064)
L2 cache 4+ GByte/sec. sustained, 128 bit separate port,
12 cycles load-to-use

Board level cache can be built in 4 ways from 3 types of SRAM:

1. No board level cache
2. 133 MHz Klamath-type Burst-RAM, 2.1 GByte/sec. bandwidth
3. 250 MHz Late-write SSRAM, 4.0 GByte/sec. bandwidth
4. 333 MHz Dual-data clock forwarding FSRAM, 5.3 GByte/sec. bandwidth

The board level cache can be 0, 1, 2, 4, 8 or 16 MByte in size.
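
The three bandwidth figures are consistent with one 128-bit transfer per
quoted MHz. The check below assumes exactly that, i.e. that the MHz numbers
are data transfer rates (for the dual-data part, 333 is treated as the
transfer rate rather than the clock). The function name and the dictionary
are illustrative only.

```python
PORT_BYTES = 128 // 8  # 128 bit separate port = 16 bytes per transfer


def peak_gbytes_per_sec(transfer_rate_mhz, port_bytes=PORT_BYTES):
    """Peak bandwidth assuming one full-width transfer per quoted MHz."""
    return transfer_rate_mhz * 1e6 * port_bytes / 1e9


sram_options_mhz = {
    "Burst-RAM": 133,
    "Late-write SSRAM": 250,
    "Dual-data FSRAM": 333,
}

for name, mhz in sram_options_mhz.items():
    print(f"{name}: {peak_gbytes_per_sec(mhz):.1f} GByte/sec.")
```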


Memory System:

System Interface 2+ GByte/sec. sustained, 64 bit separate port,
80 cycles load-to-use (with Tsunami desktop chip set).

16 outstanding memory references, 64 bytes each:
- 8 reads
- 8 writes

With Tsunami system chip set and SDRAMs, effective McCalpin
STREAM bandwidth is 1.6 Gbyte/sec.
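
One way to sanity-check these numbers is a latency/concurrency bound: with a
fixed number of reads in flight, bandwidth cannot exceed bytes in flight
divided by latency. The sketch assumes a 500 MHz part (so 80 cycles is 160 ns)
and that all 8 read slots stay full; neither the bound nor the variable names
come from the talk.

```python
CPU_CLOCK_MHZ = 500          # assumption: 500 MHz speed bin
LOAD_TO_USE_CYCLES = 80      # with the Tsunami chip set
OUTSTANDING_READS = 8
BYTES_PER_REFERENCE = 64

latency_ns = LOAD_TO_USE_CYCLES * 1000 / CPU_CLOCK_MHZ
bytes_in_flight = OUTSTANDING_READS * BYTES_PER_REFERENCE

# Bytes per nanosecond is numerically the same as GByte/sec.
read_bound_gbytes_per_sec = bytes_in_flight / latency_ns

print(latency_ns, "ns latency")
print(read_bound_gbytes_per_sec, "GByte/sec. upper bound on reads")
```

Under those assumptions the bound works out to 3.2 GByte/sec., comfortably
above both the 2+ GByte/sec. interface figure and the 1.6 GByte/sec. STREAM
result, so the 8 outstanding reads would not be the limiting factor.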

Availability:

Samples Q1/97
Volume H2/97

So, it's vapor right now, but if you want to sell vapor in 1997 you had
better have damn fast vapor...

Burkhard Neidecker-Lutz

 

 
