Alpha 21264
The Alpha 21264 is a microprocessor developed and fabricated by Digital Equipment Corporation that implemented the Alpha instruction set architecture (ISA).

Description
The Alpha 21264 is a four-issue superscalar microprocessor with out-of-order execution and speculative execution. It has a peak execution rate of six instructions per cycle and can sustain four instructions per cycle. It has a seven-stage instruction pipeline.
Out-of-order execution
At any given time, the microprocessor could have up to 80 instructions in various stages of execution, more than any other contemporary microprocessor.
Decoded instructions are queued in instruction queues and are issued when their operands are available. The integer queue contained 20 entries and the floating-point queue 15. Each queue could issue as many instructions as there were pipelines.
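The issue rule described above (an instruction leaves its queue once all of its operands are available, with at most one instruction per pipeline each cycle) can be sketched as follows. This is a minimal illustration, not the 21264's actual issue logic: the `Instr` type, the `issue_cycle` helper, the oldest-first selection order and the single-cycle result latency are all assumptions made for the example.

```python
from dataclasses import dataclass

INT_QUEUE_ENTRIES = 20   # integer queue capacity, from the text
INT_PIPELINES = 4        # two integer units plus two load store units


@dataclass
class Instr:
    """Hypothetical decoded instruction: a name, one destination, some sources."""
    name: str
    dest: str
    srcs: tuple


def issue_cycle(queue, ready_regs, width):
    """Remove and return up to `width` instructions whose sources are all ready.

    Scanning oldest-first is an assumption of this sketch.
    """
    issued = []
    for instr in list(queue):              # iterate over a copy while removing
        if len(issued) == width:
            break
        if all(src in ready_regs for src in instr.srcs):
            queue.remove(instr)
            issued.append(instr)
    return issued


queue = [
    Instr("add", "r3", ("r1", "r2")),
    Instr("sub", "r5", ("r3", "r4")),      # must wait for the add to produce r3
    Instr("and", "r6", ("r1", "r4")),
]
assert len(queue) <= INT_QUEUE_ENTRIES

ready = {"r1", "r2", "r4"}
first = issue_cycle(queue, ready, INT_PIPELINES)
ready.update(i.dest for i in first)        # results become available (latency ignored)
second = issue_cycle(queue, ready, INT_PIPELINES)

print([i.name for i in first])   # ['add', 'and']
print([i.name for i in second])  # ['sub']
```

The younger `and` issues ahead of the older `sub` because its operands are ready first, which is the essence of out-of-order issue.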
Ebox
The Ebox is responsible for the execution of integer and load store instructions. It has two integer units, two load store units and two integer register files. Each register file serves one integer unit and one load store unit, and a register file together with its two units is referred to as a "cluster". This scheme was used because it reduced the number of read and write ports required to supply operands and receive results, which reduced the physical size of the register file and enabled the microprocessor to operate at higher clock frequencies. A write to either register file therefore has to be synchronized with the other, which takes one clock cycle and costs about one percent in performance. This loss was compensated by the higher clock frequency achievable with this scheme, and the issue logic avoided it where possible by trying to reduce the number of operations that required data to be synchronized.
The clusters are nearly identical. However, U1 has a seven-cycle pipelined multiplier while U0 has a three-cycle pipeline for executing Motion Video Instructions (MVI), an extension to the Alpha Architecture defining single instruction multiple data (SIMD) instructions for multimedia.
The load store units are simple arithmetic logic units used to calculate virtual addresses for memory access. They are also capable of executing simple arithmetic and logic instructions and the instruction issue logic in the Alpha 21264 utilized this capability, issuing instructions to these units when they were available.
Each integer register file contains 80 entries, of which 31 are architectural registers, 41 are rename registers and 8 are PALshadow registers. There is no entry for register R31 as it is hardwired to zero and can only be read from.
The Ebox therefore has four 64-bit adders, four logic units, two barrel shifters, byte logic and two sets of conditional branch logic, divided equally between U1 and U0.
Fbox
The Fbox is responsible for executing floating-point instructions. It consists of two floating-point pipelines and a floating-point register file. The pipelines are not identical: one executes the majority of instructions and the other executes only multiply instructions. The adder pipeline has two non-pipelined units connected to it, a divide unit and a square root unit. Adds, multiplies and most other instructions have a 4-cycle latency, a double-precision divide has a 16-cycle latency and a double-precision square root has a 33-cycle latency. The floating-point register file contains 72 entries, of which 32 are architectural registers and 40 are rename registers.
Cache
The Alpha 21264 has two levels of cache, a primary cache and a secondary cache. The three-level cache of the Alpha 21164 was not used due to problems with bandwidth.
Primary caches
The primary cache is split into separate caches for instructions and data, the I-cache and D-cache respectively. The I-cache and D-cache are each 64 KB.
The D-cache is dual-ported by transferring data on both the rising and falling edges of the clock signal. This method of dual-porting enabled any combination of two reads or writes to the cache every processor cycle. It also avoided having to duplicate the cache, as was done in the Alpha 21164. Duplicating the cache restricted its capacity, because more transistors were required to provide the same amount of capacity, which in turn increased the area required and the power consumed.
B-cache
The secondary cache, termed the B-cache, is an external cache with a capacity of 1 to 16 MB. It is controlled by the microprocessor and is implemented with synchronous static random access memory (SSRAM) chips that operate at two-thirds, one-half, one-third or one-fourth of the internal clock frequency, or 125 to 333 MHz with a 500 MHz core clock. The B-cache is accessed over a dedicated 128-bit bus that operates at the same clock frequency as the SSRAM, or at twice that frequency if double data rate SSRAM is used. The B-cache is direct-mapped.
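As a rough check on these figures, the SSRAM clock and the peak transfer rate of the 128-bit bus can be worked out for each clock ratio. The sketch below is only arithmetic on the numbers given above; the function name `bcache_options` is hypothetical, peak rates ignore all protocol overhead, and 1 MB is taken as 10^6 bytes.

```python
from fractions import Fraction

CORE_MHZ = 500                      # example core clock used in the text
RATIOS = [Fraction(2, 3), Fraction(1, 2), Fraction(1, 3), Fraction(1, 4)]
BUS_BYTES = 128 // 8                # 128-bit dedicated B-cache bus


def bcache_options(core_mhz, ddr=False):
    """Return (ratio, SSRAM MHz, peak MB/s) for each supported clock ratio."""
    rows = []
    for ratio in RATIOS:
        sram_mhz = float(core_mhz * ratio)
        # With double data rate SSRAM the bus transfers twice per SSRAM clock.
        transfers_per_us = sram_mhz * (2 if ddr else 1)
        rows.append((str(ratio), round(sram_mhz, 1),
                     round(transfers_per_us * BUS_BYTES)))
    return rows


for ratio, mhz, mb_s in bcache_options(CORE_MHZ):
    print(f"{ratio:>4} of core clock: {mhz:6.1f} MHz SSRAM, {mb_s} MB/s peak")
```

By these ratios, one-fourth of a 500 MHz core clock comes out at 125 MHz and two-thirds at roughly 333 MHz.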
Branch prediction
Branch prediction is performed by a tournament branch prediction algorithm. The algorithm was developed by Scott McFarling at Digital's Western Research Laboratory (WRL) and was described in a 1993 paper. This predictor was used as the Alpha 21264 has a minimum branch misprediction penalty of seven cycles. Due to the instruction cache's two cycle latency and the instruction queues, the average branch misprediction penalty is 11 cycles. The algorithm maintains two history tables, Local and Global, and the table used to predict the outcome of a branch is determined by a Choice predictor.
The local predictor is a two-level table which records the history of individual branches. The first level is a 1,024-entry by 10-bit branch history table. A two-level table was used as the prediction accuracy is similar to that of a larger single-level table while requiring fewer bits of storage. The second level is a 1,024-entry branch prediction table in which each entry is a 3-bit saturating counter. The value of the counter determines whether the current branch is predicted taken or not taken.
The choice predictor records the history of the local and global predictors to determine which predictor is the best for a particular branch. It has a 4,096-entry branch history table. Each entry is a 2-bit saturating counter. The value of the counter determines if the local or global predictor is used.
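The way the three structures cooperate can be sketched in a few lines. This is an illustrative model only, not a description of the 21264's circuits: the class and function names are hypothetical, and several details are assumptions that the text does not state, namely the size and indexing of the global table (4,096 two-bit counters indexed by a 12-bit global history), indexing the choice table by that same history, the initial counter values, and training the chooser only when the two components disagree.

```python
def _sat(counter, taken, max_value):
    """Move a saturating counter up on taken, down on not taken."""
    return min(max_value, counter + 1) if taken else max(0, counter - 1)


class TournamentPredictor:
    def __init__(self):
        self.local_hist = [0] * 1024    # 10-bit history per branch (from the text)
        self.local_ctr = [3] * 1024     # 3-bit counters, 0..7 (from the text)
        self.global_ctr = [1] * 4096    # 2-bit counters; size is an assumption
        self.choice_ctr = [1] * 4096    # 2-bit counters; >= 2 selects global
        self.ghist = 0                  # 12-bit global history; an assumption

    def _components(self, pc):
        slot = (pc >> 2) % 1024         # instructions are 4 bytes wide
        lh = self.local_hist[slot]
        local = self.local_ctr[lh] >= 4
        glob = self.global_ctr[self.ghist] >= 2
        return slot, lh, local, glob

    def predict(self, pc):
        _, _, local, glob = self._components(pc)
        return glob if self.choice_ctr[self.ghist] >= 2 else local

    def update(self, pc, taken):
        slot, lh, local, glob = self._components(pc)
        if local != glob:               # chooser learns only from disagreements
            self.choice_ctr[self.ghist] = _sat(
                self.choice_ctr[self.ghist], glob == taken, 3)
        self.local_ctr[lh] = _sat(self.local_ctr[lh], taken, 7)
        self.global_ctr[self.ghist] = _sat(self.global_ctr[self.ghist], taken, 3)
        self.local_hist[slot] = ((lh << 1) | int(taken)) & 0x3FF
        self.ghist = ((self.ghist << 1) | int(taken)) & 0xFFF


def accuracy(outcomes, pc=0x1000, warmup=100):
    """Fraction of correct predictions after a warm-up period."""
    predictor = TournamentPredictor()
    hits = 0
    for i, taken in enumerate(outcomes):
        if i >= warmup and predictor.predict(pc) == taken:
            hits += 1
        predictor.update(pc, taken)
    return hits / (len(outcomes) - warmup)


# A loop branch that is taken three times and then falls through, repeatedly.
loop_branch = [True, True, True, False] * 100
print(accuracy(loop_branch))
```

A short repeating pattern such as this loop branch is exactly what the two-level local table is good at: each 10-bit history value ends up selecting a counter that has saturated toward the outcome that always follows it.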
External interface
The external interface consisted of a bidirectional 64-bit double data rate (DDR) data bus and two 15-bit unidirectional time-multiplexed address and control buses, one for signals originating from the Alpha 21264 and one for signals originating from the system. Digital licensed the bus to Advanced Micro Devices (AMD), and it was subsequently used in their Athlon microprocessors, where it was known as the EV6 bus.
Fabrication
The Alpha 21264 contained 15.2 million transistors. The logic consisted of approximately six million transistors, with the rest contained in the caches and branch history tables. It was fabricated in a 0.35 µm complementary metal–oxide–semiconductor (CMOS) process with six levels of interconnect.
Packaging
The Alpha 21264 was packaged in a 587-pin ceramic interstitial pin grid array (IPGA).
Alpha Processor, Inc. (API) later sold the Alpha 21264 in a Slot B package containing the microprocessor mounted on a printed circuit board with the B-cache and voltage regulators. The design was intended to capitalize on the success of slot-based microprocessors from Intel and AMD. Slot B was originally developed to be used by AMD's Athlon as well, so that API could obtain materials for Slot B at commodity prices, reducing the cost of the Alpha 21264 and helping it gain a wider market share. This never materialized, as AMD chose to use Slot A for their slot-based Athlons.
Derivatives
Alpha 21264A
The Alpha 21264A, code-named EV67, was a shrink of the Alpha 21264 introduced in late 1999. The microarchitecture did not change, although the circuits required modification to be fabricated in the new process. It was fabricated by Samsung in a 0.25 µm CMOS process, giving a die with an area of 210 mm². Power supply voltage was reduced to 2.0 V and TDP to 70 to 100 W at clock frequencies of 600 to 833 MHz.
Alpha 21264B
The Alpha 21264B is a further development for increased clock frequencies. There were two models, one fabricated by IBM, code-named EV68C, and one by Samsung, code-named EV68A.
The IBM model was fabricated in a 0.18 µm CMOS process with copper interconnects. It was sampled in early 2000 and achieved a maximum clock frequency of 1.25 GHz.
The Samsung model was fabricated in a 0.18 µm CMOS process with aluminium interconnects. It had a die size of 125 mm², a third smaller than the Alpha 21264A, and required a 1.7 V power supply. It was available in volume in 2001 and achieved clock frequencies in the range of 750 to 940 MHz with a TDP between 60 and 75 W.
In September 1998, Samsung announced they would fabricate a variant of the Alpha 21264B in a 0.18 µm fully depleted silicon-on-insulator (SOI) process with copper interconnects that was capable of achieving a clock frequency of 1.5 GHz. This version never materialized.
Alpha 21264C
The Alpha 21264C, code-named EV68CB, is a faster derivative. It was available at clock frequencies of 1.0, 1.25 and 1.33 GHz and was fabricated by IBM. The packaging was changed to a 675-pad ceramic land grid array (CLGA) measuring 49.53 by 49.53 mm.
Alpha 21264D
The Alpha 21264D, code-named EV68CD, is a faster derivative fabricated by IBM.
Chipsets
Digital and Advanced Micro Devices (AMD) both developed chipsets for the Alpha 21264.
21272
The Digital 21272, also known as the Tsunami and Typhoon, was the first chipset for the Alpha 21264. The 21272 chipset supported two-, three- or four-way multiprocessing and one or two 33 MHz PCI-X buses. It had a 128- to 512-bit memory bus which operated at 83 MHz, yielding a maximum bandwidth of 5,312 MB/s. The chipset supported 100 MHz registered ECC SDRAM.
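The quoted maximum bandwidth follows directly from the bus width and clock: 512 bits is 64 bytes per transfer, and 64 bytes at 83 MHz is 5,312 MB/s. A small sketch of that arithmetic, with a hypothetical helper name and assuming one transfer per clock and 1 MB = 10^6 bytes:

```python
def peak_bandwidth_mb_s(bus_bits, clock_mhz):
    """Peak transfer rate in MB/s for one transfer per clock cycle."""
    return (bus_bits // 8) * clock_mhz


narrow = peak_bandwidth_mb_s(128, 83)   # narrowest memory bus configuration
wide = peak_bandwidth_mb_s(512, 83)     # widest memory bus configuration
print(narrow, wide)                     # 1328 5312
```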
The chipset consisted of three kinds of device: a C-chip, a D-chip and a P-chip. The number of devices that made up the chipset varied with its configuration. The C-chip is the control chip containing the memory controller. One C-chip was required for every microprocessor.
The P-chip is the PCI controller, implementing a 33 MHz PCI-X bus. The 21272 could have one or two P-chips.
The 21272 was used extensively by Digital, Compaq and Hewlett Packard in their entry-level to mid-range AlphaServers and in all models of the AlphaStation. It was also used in third-party products from Alpha Processor, Inc. (later known as API NetWorks) such as their UP2000+ motherboard.
Irongate
AMD developed two Alpha 21264-compatible chipsets, the Irongate, also known as the AMD-751, and its successor, Irongate-2, also known as the AMD-761. These chipsets were developed for AMD's Athlon microprocessors, but because AMD had licensed the EV6 bus used in the Alpha from Digital, the Athlon and Alpha 21264 were compatible in terms of bus protocol. The Irongate was used by Samsung in their UP1000 and UP1100 motherboards. The Irongate-2 was used by Samsung in their UP1500 motherboard.