A 64-bit Instruction Set Architecture (ISA) Based on EPIC Technology

The Next Generation of Microprocessor Architecture: A 64-bit Instruction Set Architecture (ISA) Based on EPIC Technology

October 1997

Background:

Traditional microprocessor architectures have fundamental attributes that limit performance. To achieve higher performance, processors must not only execute instructions faster, but also execute more instructions per cycle, which is referred to as "parallel execution". Greater parallel execution allows more information to be processed concurrently, thereby increasing overall processor performance. In traditional architectures, the processor is often underutilized because of the compiler’s limited ability to organize instructions. Branches (instructions that change the flow of execution within the program) and memory latency (the time for data to arrive from memory) further restrict the already limited ability of today’s processors to achieve parallel execution.

To overcome these limitations, a new architecture was required. Traditional architectures communicate parallelism through sequential machine code that only "implies" parallelism to the processor. Intel and Hewlett-Packard jointly defined a new architecture technology called EPIC (Explicitly Parallel Instruction Computing), named for the ability of the software to extract maximum parallelism (the potential to do work in parallel) from the original code and "explicitly" describe it to the hardware. Intel and HP have also jointly defined a new 64-bit instruction set architecture (ISA), based on EPIC technology, which Intel has incorporated into IA-64, Intel’s 64-bit microprocessor architecture. The new 64-bit ISA takes an innovative approach, combining explicit parallelism with techniques called predication and speculation to progress well beyond the limitations of traditional architectures.

Inherently Scalable:

One of the key objectives for the new instruction set architecture was to enable a wide range of implementations that balance different performance and cost requirements. The architectural features of IA-64, along with the instruction format, allow compatibility to be maintained, transparently to the user, across a spectrum of implementations. As the IA-64 product family evolves, additional execution units (the parts of the processor that perform calculations) and other processor resources can be added to increase the width of the machine (the number of instructions executing simultaneously) and thereby increase performance. This capability was designed into IA-64 from the start and is "inherent" to the architecture. This "inherent scalability" is what enables great performance gains through the use of explicit parallelism, predication and speculation.

Explicit Parallelism:

To illustrate the limitations of traditional architectures and the benefits of the new 64-bit ISA, think of a processor as operating like a bank lobby. Imagine that our bank has a greeter who points people to certain service lines based on their needs: loans, withdrawals, etc. This is similar to what the compiler does for code; it organizes it for processing. However, in traditional architectures, the greeter is slow and can only organize a few people at a time. Moreover, imagine the greeter isn’t really sure what each of the lines does. As a result, the tellers at the counter have to redirect customers. This requires extra work by the tellers, and overall it’s not very efficient. This is analogous to how traditional architectures operate.

In explicit parallelism, our bank greeter really understands the operation. The greeter directs customers to the right tellers and is so efficient that he can call customers at home to schedule appointments well in advance. Customers know exactly where they need to go, removing a significant burden from the tellers. The greeter has greater freedom to schedule customers, maximizing the number that can be serviced. That’s the concept behind explicit parallelism: the compiler organizes code efficiently and makes the ordering explicit so that the processor can focus on executing instructions in the most effective manner.
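
As a rough illustration of what "making the ordering explicit" could mean, the sketch below packs a short instruction sequence into groups whose members do not depend on one another, so each group could in principle be issued together. This is a toy model written for this explanation, not the IA-64 instruction format: the Instr type, the register names, and the group_independent function are all hypothetical, and only read-after-write and write-after-write dependences are checked.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str      # human-readable form, for display only
    dest: str      # register written by the instruction
    srcs: tuple    # registers read by the instruction


def group_independent(instrs):
    """Greedily pack in-order instructions into groups with no internal dependences.

    Hypothetical helper: a new group starts whenever an instruction reads or
    rewrites a register that was already written in the current group.
    """
    groups = []
    current = []
    written = set()  # registers written so far in the current group
    for ins in instrs:
        raw = any(s in written for s in ins.srcs)  # read-after-write
        waw = ins.dest in written                  # write-after-write
        if current and (raw or waw):
            groups.append(current)
            current, written = [], set()
        current.append(ins)
        written.add(ins.dest)
    if current:
        groups.append(current)
    return groups


program = [
    Instr("r1 = ra + rb", "r1", ("ra", "rb")),
    Instr("r2 = rc + rd", "r2", ("rc", "rd")),
    Instr("r3 = r1 * r2", "r3", ("r1", "r2")),  # needs r1 and r2, so it starts a new group
    Instr("r4 = re - rf", "r4", ("re", "rf")),
]

groups = group_independent(program)
for number, group in enumerate(groups):
    print(number, [ins.text for ins in group])
```

In this model the software does the dependence analysis once, ahead of time, and the resulting grouping is what would be handed to the hardware; a real compiler would also reorder instructions, which this sketch does not attempt.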

Predication:

Another major performance limiter for traditional architectures is branches. Branches represent a decision between two sets of instructions, like whether to fill out a withdrawal or deposit slip in our bank example. Assume, in an effort to respond to customers quickly, that a bank teller wishes to prepare the appropriate deposit or withdrawal slip in advance for the next customer. Since most customers make deposits, the teller predicts that a deposit slip will be needed. When he’s right, no problem; when he’s wrong, the customers in line must wait while he completes a withdrawal slip. Similarly, today’s architectures use a method called branch prediction to predict which set of instructions to load. When branches are mispredicted (like when the teller predicted the customer wanted to make a deposit, when they really wanted to withdraw money) the whole execution path suffers a time delay. While current architectures may only mispredict 5-10% of the time, the penalties may slow down the processor by as much as 30-40%. Branches also constrain compiler efficiency and leave the capabilities of the microprocessor underutilized.
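
A back-of-the-envelope calculation shows how a small misprediction rate can turn into a large slowdown. The model and every number in it are assumptions chosen for illustration, not figures from this document: one instruction in five is taken to be a branch, each misprediction is taken to cost 15 cycles, and the processor is otherwise assumed to complete one instruction per cycle.

```python
def mispredict_slowdown(branch_fraction, mispredict_rate, penalty_cycles, base_cpi=1.0):
    """Fractional increase in cycles per instruction caused by mispredicted branches.

    All parameters are illustrative assumptions:
      branch_fraction  - share of executed instructions that are branches
      mispredict_rate  - share of branches that are predicted wrongly
      penalty_cycles   - cycles lost on each misprediction
      base_cpi         - cycles per instruction with perfect prediction
    """
    extra_cpi = branch_fraction * mispredict_rate * penalty_cycles
    return extra_cpi / base_cpi


for rate in (0.05, 0.10):
    slowdown = mispredict_slowdown(0.20, rate, 15)
    print(f"mispredict rate {rate:.0%}: about {slowdown:.0%} more cycles")
```

Under these assumed inputs, mispredicting one branch in ten already adds roughly 30% to the cycle count, which is the kind of effect the 30-40% figure above describes. A longer penalty or a higher share of branches pushes the number higher.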

The new 64-bit ISA uses a concept called predication. Assume the teller develops a better strategy for responding quickly to customers in line. When he has a spare moment, he prepares both a deposit slip and a withdrawal slip for each customer. Depending on the needs of the next customer, he uses only the appropriate slip and discards the other. The teller now works more efficiently without causing the customers in line to wait. This is similar to predication. The predicates allow both sets of instructions to be executed, and then only the results that are needed get used.
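
The idea can be sketched in a few lines. In the toy model below, the branchy function is the traditional form, and the predicated function executes both paths, each guarded by one of two complementary predicates, and keeps only the result whose predicate is true. The register names, the run_predicated helper, and the instruction encoding are hypothetical and are not IA-64 syntax.

```python
def branchy(balance, amount, is_deposit):
    """Traditional form: a branch selects one of two paths."""
    if is_deposit:
        return balance + amount
    return balance - amount


def run_predicated(regs, instrs):
    """Execute every instruction; commit a result only if its predicate is true."""
    for pred, dest, op in instrs:
        value = op(regs)      # both paths are always computed
        if regs[pred]:        # models the hardware commit gate, not a program branch
            regs[dest] = value
    return regs


def predicated(balance, amount, is_deposit):
    """Predicated form: no branch in the instruction stream, two guarded instructions."""
    regs = {
        "balance": balance,
        "amount": amount,
        "p_dep": is_deposit,       # predicate guarding the deposit path
        "p_wd": not is_deposit,    # complementary predicate guarding the withdrawal path
        "result": None,
    }
    instrs = [
        ("p_dep", "result", lambda r: r["balance"] + r["amount"]),
        ("p_wd", "result", lambda r: r["balance"] - r["amount"]),
    ]
    return run_predicated(regs, instrs)["result"]


print(predicated(100, 30, True), predicated(100, 30, False))
```

Because the instruction stream no longer contains a branch, there is nothing to mispredict; the cost is that work on the unused path is thrown away, which is the trade the teller makes by preparing both slips.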

Predication can remove many branches and reduce mispredicts significantly. A study presented at ISCA ’95 by S. Mahlke et al. demonstrated that predication removed over 50% of the branches and 40% of the mispredicted branches from several popular benchmark programs. This enables increased performance resulting from greater parallelism and better utilization of an IA-64 based processor’s performance capabilities.

Speculation:

Memory latency, the time to retrieve data from memory, is yet another performance limitation for traditional architectures. If retrieving data from memory were a bank operation, it would be analogous to opening a new account: it takes a relatively long time. When a new patron opens an account, they hold up the entire line while they fill out the paperwork at the counter. Similarly, memory latency stalls the processor, leaving it idle until the data arrives from memory. Because memory speed is not keeping up with processor speed, loads (the retrieval of data from memory) need to be initiated earlier to ensure that data arrives in time for its use. The new 64-bit ISA uses speculation, a method of allowing the compiler to initiate a load earlier, even before it is known to be needed. What if the bank greeter could spot new customers in the bank? If the greeter provides the paperwork to open an account in advance to all new customers who enter the bank, then customers can finish the paperwork by the time they reach the front of the line. If they don’t need the form, they return it to the bank unused. This is comparable to how speculation works. Loads from memory are initiated ahead of time to make sure data is available when it is needed. As a result, the compiler can schedule loads to allow more time for data to arrive without stalling the processor or slowing its performance.
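
The ordering involved in speculation can be sketched as follows. The load is started before the test that decides whether its result is needed; if the early load cannot complete, the failure is recorded rather than raised, and it only surfaces if the value turns out to be required. The DEFERRED marker and the speculative_load, check, and lookup functions are hypothetical names invented for this sketch, and the dictionary stands in for memory; none of this is taken from the ISA definition.

```python
DEFERRED = object()  # hypothetical marker: the early load could not complete


def speculative_load(memory, addr):
    """Start a load early. Never raises; a failed load is recorded as DEFERRED."""
    return memory.get(addr, DEFERRED)


def check(value, memory, addr):
    """At the point where the value is really needed, redo a deferred load normally."""
    if value is DEFERRED:
        return memory[addr]  # raises KeyError here, as the original load would have
    return value


def lookup(memory, addr, wanted):
    """The load is hoisted above the test of `wanted`, then checked or discarded."""
    early = speculative_load(memory, addr)  # issued before we know it is needed
    if not wanted:
        return None                         # like the unused form: simply discarded
    return check(early, memory, addr)


print(lookup({0x10: 42}, 0x10, True))
print(lookup({}, 0x10, False))
```

Python cannot show the latency being hidden, so the sketch captures only the two properties that make early loads safe: an unneeded load has no visible effect, and a needed load behaves exactly as it would have at its original position.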

Additional Microprocessor Resources:

The next-generation 64-bit ISA incorporates many innovative features to enable industry-leading performance. To capitalize on this, Intel’s IA-64 based processors are massively resourced (e.g., 128 general registers and 128 floating-point registers, instead of approximately 32 registers in traditional designs, and replicated functional units) to supply the processor with a steady stream of instructions and data to make full use of its capabilities.

In Summary:

The IA-64 implementation of EPIC technology enables new levels of parallelism and breaks the sequential execution paradigm (one instruction at a time) that exists with traditional architectures. The innovative use of predication and speculation, uniquely combined with explicit parallelism, has allowed EPIC technology to progress well beyond the limitations of traditional architectures (like mispredicted branches and memory latency), enabling industry-leading performance. These concepts can’t be added to existing architectures in an effective manner; they required the creation of a new architecture, IA-64. IA-64 is the first mainstream architecture that was designed from the start to take advantage of parallel execution. The massive resources, inherent scalability and full compatibility of IA-64 (Intel’s 64-bit architecture that utilizes EPIC technology) make it the next-generation processor architecture for high-performance servers and workstations.

Intel, the world's largest chip maker, is also a leading manufacturer of personal computer, networking and communications products. Additional information is available at www.intel.com/pressroom.