Energy efficient floating-point unit design



Abstract
Energy-efficient computation is critical for increasing performance in power-limited systems. Floating-point performance is of particular interest because of its importance in scientific computing, graphics, and multimedia processing. For floating-point applications with large amounts of data parallelism, one should optimize throughput per mm² under a power-density constraint. We present a method for creating a trade-off curve that can be used to estimate the maximum floating-point performance given a set of area and power constraints. These throughput-optimized designs turn out to differ from latency-optimized ones and to be more energy efficient. Looking at floating-point multiply-add units and ignoring register and memory overheads, we find that in a 90 nm CMOS technology at 1 W/mm², one can achieve a performance of 27 GFlops/mm² single-precision and 7.5 GFlops/mm² double-precision. Adding register-file overheads reduces the throughput by less than 50% if the compute intensity is high. Since the energy of the basic gates is no longer scaling rapidly, maintaining constant power density with scaling requires moving the overall floating-point architecture to a lower energy/performance point using a lower supply voltage, shallower pipelines, and more relaxed gate sizing. A 1 W/mm² design at 90 nm is a "high-energy" design, so scaling it to a lower-energy design in 45 nm still yields a 7x performance gain, while a more balanced 0.1 W/mm² design speeds up by only 3.5x when scaled to 45 nm. Performance scaling below 45 nm falls off rapidly, with a projected improvement of only 2-3x for both power densities when scaling to a 22 nm technology. On the other hand, some floating-point units, such as those in CPU designs built for single-threaded performance, are latency sensitive. For such designs, a different optimization in the implementation of fused floating-point multiply-add operations can be utilized.
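The headline density figures translate into per-operation energies by simple arithmetic. The sketch below (not part of the thesis; the pJ/flop values are derived here, not quoted) makes that conversion explicit:

```python
def energy_per_flop_pj(power_density_w_mm2, gflops_per_mm2):
    """Energy per floating-point op, in picojoules.

    (W/mm^2) / (GFlop/s per mm^2) gives nJ/flop; multiply by
    1000 to express the result in pJ/flop.
    """
    return power_density_w_mm2 / gflops_per_mm2 * 1000.0

# At 1 W per mm^2 in 90 nm (densities from the abstract):
print(energy_per_flop_pj(1.0, 27.0))  # single precision: ~37 pJ/flop
print(energy_per_flop_pj(1.0, 7.5))   # double precision: ~133 pJ/flop
```

At the same power density, the double-precision unit thus spends roughly 3.6x more energy per operation, which is consistent with the 27 vs. 7.5 GFlops/mm² throughput gap.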
By recognizing that what matters most is the average latency of all operations passing through the unit, an optimized cascade design can reduce the accumulation-dependent latency by 2x over a fused design, at the cost of a 13% increase in non-accumulation-dependent latency. A simple in-order execution model shows this design is superior in most applications, providing a 12% average reduction in floating-point instruction stalls and improving performance by up to 6%. Simulations of superscalar out-of-order machines show a 4% average CPI improvement in 2-way machines and 4.6% in 4-way machines. This is achieved by a design architecture called cascade, in which the addition operation is cascaded after the multiplication, in contrast to traditional fused architectures. The cascade design has the same area and energy budget as a traditional FMA.
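A minimal average-latency model shows why this trade-off pays off. The 2x reduction and 13% penalty are taken from the abstract; normalizing the fused latency to 1.0 and the break-even calculation are illustrative assumptions, not figures from the thesis:

```python
def cascade_avg_latency(frac_accum, fused_latency=1.0):
    """Average latency of the cascade design, in units of fused latency.

    Accumulation-dependent ops complete in half the fused latency;
    all other ops pay a 13% penalty (both figures from the abstract).
    """
    return (frac_accum * 0.5 * fused_latency
            + (1.0 - frac_accum) * 1.13 * fused_latency)

def breakeven_accum_fraction():
    """Fraction of accumulation-dependent ops at which cascade == fused.

    Solve 0.5*f + 1.13*(1 - f) = 1  ->  f = 0.13 / 0.63.
    """
    return 0.13 / 0.63

# Cascade wins once more than ~21% of ops are accumulation-dependent:
print(breakeven_accum_fraction())      # ~0.206
print(cascade_avg_latency(0.5) < 1.0)  # cascade ahead at a 50/50 mix
```

Under this toy model, workloads dominated by dependent accumulation chains (dot products, reductions) sit well above the break-even fraction, which is consistent with the reported stall and CPI improvements.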

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2012
Issuance monographic
Language English

Creators/Contributors

Associated with Galal, Sameh Rady Sayed
Associated with Stanford University, Department of Electrical Engineering
Primary advisor Dally, William
Primary advisor Horowitz, Mark (Mark Alan)
Thesis advisor Olukotun, Oyekunle Ayinde

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Sameh Galal.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2012.
Location electronic resource

Access conditions

Copyright
© 2012 by Sameh Rady Sayed Galal
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
