Alpha 21264 - AbsoluteAstronomy.com

The Alpha 21264 was a Digital Equipment Corporation

Digital Equipment Corporation

Digital Equipment Corporation was a major American company in the computer industry and a leading vendor of computer systems, software and peripherals from the 1960s to the 1990s...

RISC microprocessor

Microprocessor

A microprocessor incorporates the functions of a computer's central processing unit on a single integrated circuit, or at most a few integrated circuits. It is a multipurpose, programmable device that accepts digital data as input, processes it according to instructions stored in its memory, and...

introduced in October, 1996. The 21264 implemented the Alpha

DEC Alpha

Alpha, originally known as Alpha AXP, is a 64-bit reduced instruction set computer instruction set architecture developed by Digital Equipment Corporation , designed to replace the 32-bit VAX complex instruction set computer ISA and its implementations. Alpha was implemented in microprocessors...

instruction set architecture (ISA).

Description

The Alpha 21264 was a four-issue superscalar

Superscalar

A superscalar CPU architecture implements a form of parallelism called instruction level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate...

microprocessor with out-of-order execution

Out-of-order execution

In computer engineering, out-of-order execution is a paradigm used in most high-performance microprocessors to make use of instruction cycles that would otherwise be wasted by a certain type of costly delay...

and speculative execution

Speculative execution

Speculative execution in computer systems is doing work, the result of which may not be needed. This performance optimization technique is used in pipelined processors and other systems.-Main idea:...

. It has a peak execution rate of six instructions per cycle and can sustain four instructions per cycle. It has a seven-stage instruction pipeline

Instruction pipeline

An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput ....

Out of order execution

At any given stage, the microprocessor could have up to 80 instructions in various stages of execution, surpassing every other contemporary microprocessor.

Decoded instructions are queued in instruction queues and are issued when their operands are available. The integer queue contained 20 entries and the floating-point queue 15. Each queue could issue as many instructions as there were pipelines.

Ebox

The Ebox executed integer, load and store instructions. It has two integer units, two load store units and two integer register file

A register file is an array of processor registers in a central processing unit . Modern integrated circuit-based register files are usually implemented by way of fast static RAMs with multiple ports...

s. Each integer register file contained 80 entries, of which 31 are architectural registers, 41 are rename registers and 8 are PALshadow registers. There was no entry for register R31 because in the Alpha architecture, R31 is hardwired to zero and can only be read from.

Each register file served an integer unit and a load store unit, and the register file and its two units are referred to as a "cluster". The two clusters were designated U0 and U1. This scheme was used as it reduced the number of write and read ports required to serve operands and receive results, thus reducing the physical size of the register file, enabling the microprocessor to operate at higher clock frequencies. Writes to any of the register files thus have to be synchronized, which required a clock cycle to complete, negatively impacting performance by one percent. The reduction of performance resulting from the synchronization was compensated in two ways. Firstly, the higher clock frequency achievable offset the loss. Secondly, the logic responsible for instruction issue avoided creating situations where the register file had to be synchronized by issuing instructions that were not dependent on data held in other register file where possible.

The clusters are near identical except for two differences: U1 has a seven-cycle pipelined multiplier while U0 has a three-cycle pipeline for executing Motion Video Instructions (MVI), an extension to the Alpha Architecture defining single instruction multiple data (SIMD) instructions for multimedia.

The load store units are simple arithmetic logic unit

Arithmetic logic unit

In computing, an arithmetic logic unit is a digital circuit that performs arithmetic and logical operations.The ALU is a fundamental building block of the central processing unit of a computer, and even the simplest microprocessors contain one for purposes such as maintaining timers...

s used to calculate virtual address

Virtual address

In computer technology, a virtual address is an address identifying a virtual, i.e. non-physical, entity.-Description:The term virtual address is most commonly used for an address pointing to virtual memory or, in networking, when referring to a virtual network address...

es for memory access. They are also capable of executing simple arithmetic and logic instructions. The Alpha 21264 instruction issue logic utilized this capability, issuing instructions to these units when they were available for use (not performing address arithmetic).

The Ebox therefore has four 64-bit adder

Adder (electronics)

In electronics, an adder or summer is a digital circuit that performs addition of numbers.In many computers and other kinds of processors, adders are used not only in the arithmetic logic unit, but also in other parts of the processor, where they are used to calculate addresses, table indices, and...

s, four logic units, two barrel shifter

Barrel shifter

A barrel shifter is a digital circuit that can shift a data word by a specified number of bits in one clock cycle. It can be implemented as a sequence of multiplexers , and in such an implementation the output of one mux is connected to the input of the next mux in a way that depends on the shift...

s, byte-manipulation logic, two sets of conditional branch logic equally divided between U1 and U0.

Fbox

The Fbox is responsible for executing floating-point instructions. It consists of two floating-point pipelines and a floating-point register file. The pipelines are not identical, one executes the majority of instructions and the other only multiply instructions. The adder pipeline has two non-pipelined units connected to it, a divide unit and a square root unit. Adds, multiplies and most other instructions have a 4-cycle latency, a double-precision divide has 16-cycle latency and a double-precision square root has a 33-cycle latency. The floating point register file contains 72 entries, of which 32 are architectural registers and 40 are rename registers.

Cache

The Alpha 21264 has two levels of cache

CPU cache

A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations...

, a primary cache and secondary cache. The three-level cache of the Alpha 21164

Alpha 21164

The Alpha 21164, also known by its code name, EV5, is a microprocessor developed and fabricated by Digital Equipment Corporation that implemented the Alpha instruction set architecture . It was introduced in January 1995, succeeding the Alpha 21064A as Digital's flagship microprocessor...

was not used due to problems with bandwidth.

Primary caches

The primary cache is split into separate caches for instructions and data, the I-cache and D-cache respectively. Both caches have a capacity of 64 KB. The D-cache is dual-ported by transferring data on both the rising and falling edges of the clock signal. This method of dual-porting enabled any combination of reads or writes to the cache every processor cycle. It also avoided duplication the cache so there are two, as in the Alpha 21164. Duplicating the cache restricted the capacity of the cache, as it required more transistors to provide the same amount of capacity, and in turn increased the area required and power consumed.

B-cache

The secondary cache, termed the B-cache, is an external cache with a capacity of 1 to 16 MB. It is controlled by the microprocessor and is implemented by synchronous static random access memory

Static random access memory

Static random-access memory is a type of semiconductor memory where the word static indicates that, unlike dynamic RAM , it does not need to be periodically refreshed, as SRAM uses bistable latching circuitry to store each bit...

(SSRAM) chips that operate at two thirds, half, one-third or one-fourth the internal clock frequency, or 133 to 333 MHz at 500 MHz. The B-cache was accessed with a dedicated 128-bit bus that operates at the same clock frequency as the SSRAM or at twice the clock frequency if double data rate

Double data rate

In computing, a computer bus operating with double data rate transfers data on both the rising and falling edges of the clock signal. This is also known as double pumped, dual-pumped, and double transition....

SSRAM is used. The B-cache is direct-mapped.

Branch prediction

Branch prediction is performed by a tournament branch prediction algorithm. The algorithm was developed by Scott McFarling at Digital's Western Research Laboratory (WRL) and was described in a 1993 paper. This predictor was used as the Alpha 21264 has a minimum branch misprediction penalty of seven cycles. Due to the instruction cache's two cycle latency and the instruction queues, the average branch misprediction penalty is 11 cycles. The algorithm maintains two history tables, Local and Global, and the table used to predict the outcome of a branch is determined by a Choice predictor.

The local predictor is a two-level table which records the history of individual branches. It consists of a 1,024-entry by 10-bit branch history table. A two-level table was used as the prediction accuracy is similar to that of a larger single-level table while requiring fewer bits of storage. It has a 1,024-entry branch history table. Each entry is a 3-bit saturating counter. The value of the counter determines whether the current branch is taken or not taken.

The choice predictor records the history of the local and global predictors to determine which predictor is the best for a particular branch. It has a 4,096-entry branch history table. Each entry is a 2-bit saturating counter. The value of the counter determines if the local or global predictor is used.

External interface

The external interface consisted of a bidirectional 64-bit double data rate

Double data rate

(DDR) data bus and two 15-bit unidirectional time-multiplexed address

Address bus

An address bus is a computer bus that is used to specify a physical address. When a processor or DMA-enabled device needs to read or write to a memory location, it specifies that memory location on the address bus...

and control

Control bus

A control bus is a computer bus, used by CPUs for communicating with other devices within the computer. While the address bus carries the information on which device the CPU is communicating with and the data bus carries the actual data being processed, the control bus carries commands from the...

buses, one for signals originating from the Alpha 21264 and one for signals originating from the system. Digital licensed the bus to Advanced Micro Devices

Advanced Micro Devices

Advanced Micro Devices, Inc. or AMD is an American multinational semiconductor company based in Sunnyvale, California, that develops computer processors and related technologies for commercial and consumer markets...

(AMD), and it was subsequently used in their Athlon

Athlon

Athlon is the brand name applied to a series of x86-compatible microprocessors designed and manufactured by Advanced Micro Devices . The original Athlon was the first seventh-generation x86 processor and, in a first, retained the initial performance lead it had over Intel's competing processors...

microprocessors, where it was known as the EV6 bus.

Fabrication

The Alpha 21264 contained 15.2 million transistors. The logic consisted of approximately six million transistors, with the rest contained in the caches and branch history tables. The die measured 16.7 mm by 18.8 mm (313.96 mm²). It was fabricated in a 0.35 µm complementary metal–oxide–semiconductor

CMOS

Complementary metal–oxide–semiconductor is a technology for constructing integrated circuits. CMOS technology is used in microprocessors, microcontrollers, static RAM, and other digital logic circuits...

(CMOS) process with six levels of interconnect.

Packaging

The Alpha 21264 was packaged in a 587-pin ceramic interstitial

Interstitial

An interstitial space or interstice is an empty space or gap between spaces full of structure or matter.In particular, interstitial may refer to:-Physical sciences:...

pin grid array

Pin grid array

A pin grid array, often abbreviated PGA, is a type of integrated circuit packaging. In a PGA, the package is square or roughly square, and the pins are arranged in a regular array on the underside of the package...

(IPGA).

Alpha Processor, Inc. later sold the Alpha 21264 in a Slot B package containing the microprocessor mounted on a printed circuit board with the B-cache and voltage regulators. The design was intended to use the success of slot-based microprocessors from Intel and AMD. Slot B was originally developed to be used by AMD's Athlon as well, so that API could obtain materials for the Slot B at commodity prices in order to reduce the cost of the Alpha 21264 to gain a wider market share. This never materialized as AMD chose to use Slot A for their slot-based Athlons.

Alpha 21264A

The Alpha 21264A, code-named EV67 was a shrink of the Alpha 21264 introduced in late 1999. There were six versions: 600, 667, 700, 733, 750, 833 MHz. The EV67 was the first Alpha microprocessor to implement the count extension (CIX), which extended the instruction set with instructions for performing population count

Hamming weight

The Hamming weight of a string is the number of symbols that are different from the zero-symbol of the alphabet used. It is thus equivalent to the Hamming distance from the all-zero string of the same length. For the most typical case, a string of bits, this is the number of 1's in the string...

. It was fabricated by Samsung Electronics in a 0.25 µm CMOS process that had 0.25 µm transistors but 0.35 µm metal layers. The die had an area of 210 mm². The EV68 used a 2.0 V power supply. It dissipated a maximum of 73 W at 600 MHz, 80 W at 667 MHz, 85 W at 700 MHz, 88 W at 733 MHz and 90 W at 750 MHz.

Alpha 21264B

The Alpha 21264B is a further development for increased clock frequencies. There were two models, one fabricated by IBM, code-named EV68C, and one by Samsung, code-named EV68A.

The EV68A was fabricated in a 0.18 µm CMOS process with aluminium interconnects. It had a die size of 125 mm², a third smaller than the Alpha 21264A, and used a 1.7 V power supply. It was available in volume in 2001 at clock frequencies of 750, 833, 875 and 940 MHz. The EV68A dissipated a maximum of 60 W at 750 MHz, 67 W at 833 MHz, 70 W at 875 MHz and 75 W at 940 MHz.

The EV68C was fabricated in a 0.18 µm CMOS process with copper interconnects. It was sampled in early 2000 and achieved a maximum clock frequency of 1.25 GHz.

In September 1998, Samsung announced they would fabricate a variant of the Alpha 21264B in a 0.18 µm fully depleted silicon-on-insulator (SOI) process with copper interconnects

Copper-based chips

Copper-based chips are semiconductor integrated circuits, usually microprocessors, which use copper for interconnections. Since copper is a better conductor than aluminium, chips using this technology can have smaller metal components, and use less energy to pass electricity through them...

that was capable of achieving a clock frequency of 1.5 GHz. This version never materialized.

Alpha 21264C

The Alpha 21264C, code-named EV68CB was a derivative of the Alpha 21264. It was available at clock frequencies of 1.0, 1.25 and 1.33 GHz. The EV68CB contained 15.5 million transistors and measured 120 mm². It was fabricated by IBM in a 0.18 µm CMOS process with seven levels of copper interconnect and low-K

Low-K

In semiconductor manufacturing, a low-κ dielectric is a material with a small dielectric constant relative to silicon dioxide. Although the proper symbol for the dielectric constant is the Greek letter κ , in conversation such materials are referred to as being "low-k" rather than "low-κ"...

dielectric

Dielectric

A dielectric is an electrical insulator that can be polarized by an applied electric field. When a dielectric is placed in an electric field, electric charges do not flow through the material, as in a conductor, but only slightly shift from their average equilibrium positions causing dielectric...

. It was packaged in a 675-pad flip-chip ceramic land grid array (CLGA) measuring 49.53 by 49.53 mm. The EV68A used a 1.7 V power supply, dissipating a maximum of 64 W at 1.0 GHz, 75 W at 1.25 GHz and 80 W at 1.33 GHz.

Alpha 21264D

The Alpha 21264D, code-named EV68CD is a faster derivative fabricated by IBM.

Alpha 21264E

The Alpha 21264E, code-named EV68E, was a cancelled derivative developed by Samsung first announced on 10 October 2000 at Microprocessor Forum 2000 slated for introduction at around mid-2001. Improvements were a higher operating frequency of 1.25 GHz and the addition of an on-die 1.85 MB secondary cache. It was to be fabricated in a 0.18 micrometre CMOS process with copper interconnects.

Chipsets

Digital and Advanced Micro Devices

Advanced Micro Devices

(AMD) both developed chipsets for the Alpha 21264.

21272

The Digital 21272, also known as the Tsunami and Typhoon was the first chipset for the Alpha 21264. The 21272 chipset supported two-, three- or four-way multiprocessing and one or two 64-bit 33 MHz PCI

Peripheral Component Interconnect

Conventional PCI is a computer bus for attaching hardware devices in a computer...

buses. It had 128- to 512-bit memory bus which operated at 83 MHz, yielding a maximum bandwidth of 5,312 MB/s. The chipset supported 100 MHz registered ECC SDRAM.

The chipset consisted of three devices, a C-chip, a D-chip and a P-chip. The number of devices which made up the chipset varied as it was determined by the configuration of the chipset. The C-chip is the control chip containing the memory controller. One C-chip was required for every microprocessor.

The P-chip is the PCI controller, implementing a 33 MHz PCI bus. The 21272 could have one or two P-chips.

The 21272 was used extensively by Digital, Compaq and Hewlett Packard in their entry-level to mid-range AlphaServers and in all models of the AlphaStation. It was also used in third-party products from Alpha Processor, Inc. (later known as API NetWorks) such as their UP2000+ motherboard.

Irongate

AMD developed two Alpha 21264-compatible chipsets, the Irongate, also known as the AMD-751, and its successor, Irongate-2, also known as the AMD-761. These chipsets were developed for their Athlon microprocessors but due to AMD licensing the EV6 bus used in the Alpha from Digital, the Athlon and Alpha 21264 were compatible in terms of bus protocol. The Irongate was used by Samsung in their UP1000 and UP1100 motherboards. The Irongate-2 was used by Samsung in their UP1500 motherboard.

Description

Out of order execution

Ebox

Fbox

Cache

Primary caches

B-cache

Branch prediction

External interface

Fabrication

Packaging

Alpha 21264A

Alpha 21264B

Alpha 21264C

Alpha 21264D

Alpha 21264E

Chipsets

21272

Irongate

Further reading