
Transcript of HOTCHIPS VI presentation of
the 21164 microprocessor

Key attributes:

New design (not an incremental revision like 21064 -> 21064A)
4-way issue superscalar
Large on-chip L2 cache
7-stage integer pipeline
9-stage floating point pipeline
low latencies at high clock rate
high-throughput memory subsystem

Other properties:

40b physical address (1 Terabyte)
43b virtual address (8 Terabytes)
128b external cache interface
L3 cache controller integrated
Instruction translation buffer 48 entries
Data translation buffer 64 entries
16.5 mm x 18.1 mm die size (slightly smaller than original Pentium)
0.5 micron, 4-layer metal CMOS5 process
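
As a quick check of the address-space and die-size figures above, the following sketch converts address widths to capacities and multiplies out the die dimensions. It assumes binary terabytes (2^40 bytes); the function names are illustrative and do not come from the presentation.

```python
TERABYTE = 2 ** 40  # assumption: binary terabyte, which matches the quoted figures


def addressable_terabytes(address_bits: int) -> float:
    """Bytes reachable with `address_bits` address bits, expressed in terabytes."""
    return 2 ** address_bits / TERABYTE


def die_area_mm2(width_mm: float, height_mm: float) -> float:
    """Area of a rectangular die in square millimetres."""
    return width_mm * height_mm


if __name__ == "__main__":
    print("physical:", addressable_terabytes(40), "TB")
    print("virtual: ", addressable_terabytes(43), "TB")
    print("die area:", round(die_area_mm2(16.5, 18.1), 2), "mm^2")
```

With these inputs the 40-bit physical space comes to 1 TB, the 43-bit virtual space to 8 TB, and the die area to roughly 298.65 mm^2.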

Execution pipelines:

Integer Pipeline 0: arith, logical, ld/st, shift
Integer Pipeline 1: arith, logical, ld, br/jmp, int mul
FP Pipeline 0: add, subtract, compare, FP branch
FP Pipeline 1: multiply
FP div hangs off FP pipe 0, but runs independently
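
The pipeline assignment above can be written down as a small lookup table. This is only a sketch: the operation names are made up for illustration, and FP divide is left out because the slide says only that it hangs off FP pipe 0 and runs independently.

```python
# Operation classes each pipeline accepts, transcribed from the list above.
# The key and operation names are hypothetical labels, not hardware terms.
PIPELINES = {
    "int0": {"arith", "logical", "load", "store", "shift"},
    "int1": {"arith", "logical", "load", "branch", "jump", "int_mul"},
    "fp0": {"fp_add", "fp_sub", "fp_cmp", "fp_branch"},
    "fp1": {"fp_mul"},
}


def pipes_for(op: str) -> list:
    """Return the sorted names of the pipelines that can execute `op`."""
    return sorted(name for name, ops in PIPELINES.items() if op in ops)


if __name__ == "__main__":
    for op in ("load", "store", "shift", "int_mul", "fp_mul"):
        print(op, "->", pipes_for(op))
```

Read this way, loads can go down either integer pipe, while stores and shifts are restricted to pipe 0 and integer multiplies and branches to pipe 1.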

Latencies (cycles):

Most int ops                                        1
CMOV                                                2
Int mul                                             8 - 16
Float ops                                           4
Loads (L1 cache hit)                                2
Compare or logical op to CMOV or conditional BR     0
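
To see what the latencies listed above mean for a chain of dependent instructions, here is a deliberately simplified model. It assumes that compare and logical ops have the 1-cycle latency of "most int ops" toward ordinary consumers, and it ignores issue restrictions, cache misses and the variable integer-multiply latency. All names are illustrative.

```python
# Producer latencies in cycles, taken from the list above.
LATENCY = {
    "int_op": 1,    # "most int ops"
    "compare": 1,   # assumption: an ordinary integer op for other consumers
    "logical": 1,   # assumption: an ordinary integer op for other consumers
    "cmov": 2,
    "float_op": 4,
    "load": 2,      # L1 cache hit
}

# The special case from the list: compare/logical feeding CMOV or a conditional branch.
ZERO_LATENCY_PRODUCERS = {"compare", "logical"}
ZERO_LATENCY_CONSUMERS = {"cmov", "cond_branch"}


def pair_latency(producer: str, consumer: str) -> int:
    """Cycles between issuing `producer` and the earliest issue of `consumer`."""
    if producer in ZERO_LATENCY_PRODUCERS and consumer in ZERO_LATENCY_CONSUMERS:
        return 0
    return LATENCY[producer]


def dependent_issue_delay(ops: list) -> int:
    """Cycles from issuing the first op to the earliest issue of the last op,
    when every op depends on the one before it."""
    return sum(pair_latency(p, c) for p, c in zip(ops, ops[1:]))


if __name__ == "__main__":
    print(dependent_issue_delay(["load", "compare", "cond_branch"]))
    print(dependent_issue_delay(["load", "int_op", "int_op"]))
```

Under these assumptions a load feeding a compare feeding a conditional branch costs only the 2-cycle load latency, since the compare-to-branch step is free.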

Onchip data caches:

dual-ported L1 data cache (8Kbyte, write through, non-blocking)
On-Chip L2 cache (96Kbyte, 3-way set assoc., write back, pipelined)
Miss Address File (MAF), 6 entry, between L1 and L2
MAF merges loads to the same cache block
Up to 21 loads, multiple loads merge regardless of order
Up to two register file fills per cycle
Bus Address File (BAF), 2 entry, between L2 and external memory
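
The merging behaviour of the Miss Address File can be illustrated with a toy model. This is a sketch under stated assumptions, not a description of the real hardware: it assumes merging happens at 32-byte block granularity, that the 21-load figure is a cap on the total number of loads pending in the MAF, and that a request which cannot be accepted simply stalls. The class and method names are hypothetical.

```python
class MissAddressFile:
    """Toy model of a 6-entry miss address file that merges loads to the same block."""

    def __init__(self, entries: int = 6, max_loads: int = 21, block_bytes: int = 32):
        self.entries = entries          # number of distinct outstanding blocks
        self.max_loads = max_loads      # assumed cap on total pending loads
        self.block_bytes = block_bytes  # assumed merge granularity
        self.pending = {}               # block number -> list of load addresses

    def record_miss(self, address: int) -> str:
        """Register an L1 load miss; return 'allocated', 'merged' or 'stall'."""
        total_loads = sum(len(loads) for loads in self.pending.values())
        if total_loads >= self.max_loads:
            return "stall"
        block = address // self.block_bytes
        if block in self.pending:
            # Same cache block as an outstanding miss: merge, in any order.
            self.pending[block].append(address)
            return "merged"
        if len(self.pending) >= self.entries:
            return "stall"
        self.pending[block] = [address]
        return "allocated"

    def fill(self, address: int) -> list:
        """Data for the block containing `address` arrived; release its loads."""
        return self.pending.pop(address // self.block_bytes, [])


if __name__ == "__main__":
    maf = MissAddressFile()
    print(maf.record_miss(0x100))  # first miss to a block
    print(maf.record_miss(0x108))  # same 32-byte block
    print(maf.fill(0x100))         # both loads are released together
```

The point of the sketch is that a second miss to an already outstanding block does not use up another of the six entries.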

L3 cache (off-chip)

Direct-mapped write-back superset of L2 cache
Up to 2 outstanding reads
Programmable wave pipelining
L3 cache is optional

Instruction prefetching

Aggressive prefetching from L2 cache,
At least three 32-byte blocks ahead of the current issue point
Continuous integer instruction issue out of L2 cache (2 per cycle)
60% of peak issue rate possible out of L2 cache (2.4 per cycle)
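
The issue-rate figures above can be tied back to fetch bandwidth with a little arithmetic. The sketch assumes 4-byte instructions (Alpha instructions are fixed 32-bit words, a fact not stated on the slide); the names are illustrative.

```python
PEAK_ISSUE_PER_CYCLE = 4    # 4-way issue
INSTRUCTION_BYTES = 4       # assumption: fixed 32-bit Alpha instructions
PREFETCH_BLOCK_BYTES = 32
PREFETCH_BLOCKS_AHEAD = 3   # "at least three" blocks, so this is a lower bound


def fraction_of_peak(issue_rate: float) -> float:
    """Issue rate as a fraction of the 4-per-cycle peak."""
    return issue_rate / PEAK_ISSUE_PER_CYCLE


def fetch_bytes_per_cycle(issue_rate: float) -> float:
    """Instruction bytes consumed per cycle at the given issue rate."""
    return issue_rate * INSTRUCTION_BYTES


def prefetch_depth_instructions() -> int:
    """Minimum number of instructions fetched ahead of the issue point."""
    return PREFETCH_BLOCKS_AHEAD * PREFETCH_BLOCK_BYTES // INSTRUCTION_BYTES


if __name__ == "__main__":
    print(fraction_of_peak(2.4))
    print(fetch_bytes_per_cycle(2.4))
    print(prefetch_depth_instructions())
```

Under that assumption, 2.4 instructions per cycle is 60% of peak and needs about 9.6 bytes of instruction fetch per cycle, and three 32-byte blocks hold 24 instructions.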

Latency and bandwidth of memory operations

Level   Latency (cycles)   Bandwidth (bytes/cycle)

L1      2                  16
L2      8                  16
L3      >= 12              <= 4

L1 cache block size 32 bytes
L2, L3 cache block sizes 64 bytes (with 32-byte block size option)
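
Combining the bandwidth figures with the block sizes gives the number of cycles needed to stream one block, and the L2 size and associativity give its set count. The sketch below is plain arithmetic on the numbers above; the function names are illustrative, and the L3 result is a lower bound because its bandwidth is quoted as "at most 4 bytes/cycle".

```python
import math


def transfer_cycles(block_bytes: int, bytes_per_cycle: float) -> int:
    """Cycles needed to move one block at the given bandwidth."""
    return math.ceil(block_bytes / bytes_per_cycle)


def num_sets(cache_bytes: int, ways: int, block_bytes: int) -> int:
    """Number of sets in a set-associative cache."""
    return cache_bytes // (ways * block_bytes)


if __name__ == "__main__":
    print("L1 32-byte block:", transfer_cycles(32, 16), "cycles")
    print("L2 64-byte block:", transfer_cycles(64, 16), "cycles")
    print("L3 64-byte block: at least", transfer_cycles(64, 4), "cycles")
    print("L2 sets (64-byte blocks):", num_sets(96 * 1024, 3, 64))
    print("L2 sets (32-byte blocks):", num_sets(96 * 1024, 3, 32))
```

These are transfer times only; the access latencies in the table come on top of them.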

Cycle count improvements over the 21064/21064A

                   21164    21064/21064A
shifts/byte ops    1        2
int mul            8-16     19-23
cmp->branch        0        1
float ops          4        6
L1 data cache      2        3
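
The cycle counts in the comparison above can be turned into cycles saved per operation. This is a minimal sketch with illustrative names; it describes per-operation latency only and says nothing about overall performance, which also depends on clock rate and issue width.

```python
# (21164 min, 21164 max), (21064/21064A min, 21064/21064A max), in cycles.
CYCLES = {
    "shifts/byte ops": ((1, 1), (2, 2)),
    "int mul": ((8, 16), (19, 23)),
    "cmp->branch": ((0, 0), (1, 1)),
    "float ops": ((4, 4), (6, 6)),
    "L1 data cache": ((2, 2), (3, 3)),
}


def cycles_saved(operation: str) -> tuple:
    """Return (smallest, largest) number of cycles saved by the 21164."""
    (new_min, new_max), (old_min, old_max) = CYCLES[operation]
    return (old_min - new_max, old_max - new_min)


if __name__ == "__main__":
    for name in CYCLES:
        low, high = cycles_saved(name)
        print(f"{name}: saves {low} to {high} cycles")
```

For integer multiply, the ranges overlap only slightly: the saving is anywhere from 3 to 15 cycles depending on which end of each range applies.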


 
Copyright © 2006 CPUShack.Net All pictures and content are property of CPUShack.Net. All rights reserved. This material may not be published, broadcast, rewritten, or redistributed without the express written permission of CPUShack.Net
