Energy efficient floating-point unit design



Abstract
Energy-efficient computation is critical for increasing performance in power-limited systems. Floating-point performance is of particular interest because of its importance in scientific computing, graphics, and multimedia processing. For floating-point applications with large amounts of data parallelism, one should optimize throughput per mm² under a power-density constraint. We present a method for creating a trade-off curve that can be used to estimate the maximum floating-point performance given a set of area and power constraints. These throughput-optimized designs turn out to differ from latency-optimized ones and to be more energy efficient. Looking at floating-point multiply-add units and ignoring register and memory overheads, we find that in a 90 nm CMOS technology at 1 W/mm², one can achieve a performance of 27 GFlops/mm² single-precision and 7.5 GFlops/mm² double-precision. Adding register-file overheads reduces the throughput by less than 50% if the compute intensity is high. Since the energy of the basic gates is no longer scaling rapidly, maintaining constant power density with scaling requires moving the overall floating-point architecture to a lower energy/performance point using a lower supply voltage, shallower pipelines, and more relaxed gate sizing. A 1 W/mm² design at 90 nm is a "high-energy" design, so scaling it to a lower-energy design in 45 nm still yields a 7x performance gain, while a more balanced 0.1 W/mm² design speeds up by only 3.5x when scaled to 45 nm. Performance scaling below 45 nm falls off rapidly, with a projected improvement of only 2-3x for both power densities when scaling to a 22 nm technology. On the other hand, some floating-point units, such as those in CPU designs built for single-threaded performance, are latency sensitive. For such designs, a different optimization in the implementation of fused floating-point multiply-add operations can be utilized.
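The headline density figures translate into per-operation energies by simple arithmetic. The sketch below (not part of the thesis; the pJ/flop values are derived here, not quoted) makes that conversion explicit:

```python
def energy_per_flop_pj(power_density_w_mm2, gflops_per_mm2):
    """Energy per floating-point op, in picojoules.

    (W/mm^2) / (GFlop/s per mm^2) gives nJ/flop; multiply by
    1000 to express the result in pJ/flop.
    """
    return power_density_w_mm2 / gflops_per_mm2 * 1000.0

# At 1 W per mm^2 in 90 nm (densities from the abstract):
print(energy_per_flop_pj(1.0, 27.0))  # single precision: ~37 pJ/flop
print(energy_per_flop_pj(1.0, 7.5))   # double precision: ~133 pJ/flop
```

At the same power density, the double-precision unit thus spends roughly 3.6x more energy per operation, which is consistent with the 27 vs. 7.5 GFlops/mm² throughput gap.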
By recognizing that what matters most is the average latency of all operations passing through the unit, an optimized cascade design can reduce the accumulation-dependent latency by 2x over a fused design, at the cost of a 13% increase in non-accumulation-dependent latency. A simple in-order execution model shows this design is superior in most applications, providing a 12% average reduction in floating-point instruction stalls and improving performance by up to 6%. Simulations of superscalar out-of-order machines show a 4% average CPI improvement in 2-way machines and 4.6% in 4-way machines. This is achieved by a design architecture called cascade, in which the addition operation is cascaded after the multiplication, in contrast to traditional fused architectures. The cascade design has the same area and energy budget as a traditional FMA.
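A minimal average-latency model shows why this trade-off pays off. The 2x reduction and 13% penalty are taken from the abstract; normalizing the fused latency to 1.0 and the break-even calculation are illustrative assumptions, not figures from the thesis:

```python
def cascade_avg_latency(frac_accum, fused_latency=1.0):
    """Average latency of the cascade design, in units of fused latency.

    Accumulation-dependent ops complete in half the fused latency;
    all other ops pay a 13% penalty (both figures from the abstract).
    """
    return (frac_accum * 0.5 * fused_latency
            + (1.0 - frac_accum) * 1.13 * fused_latency)

def breakeven_accum_fraction():
    """Fraction of accumulation-dependent ops at which cascade == fused.

    Solve 0.5*f + 1.13*(1 - f) = 1  ->  f = 0.13 / 0.63.
    """
    return 0.13 / 0.63

# Cascade wins once more than ~21% of ops are accumulation-dependent:
print(breakeven_accum_fraction())      # ~0.206
print(cascade_avg_latency(0.5) < 1.0)  # cascade ahead at a 50/50 mix
```

Under this toy model, workloads dominated by dependent accumulation chains (dot products, reductions) sit well above the break-even fraction, which is consistent with the reported stall and CPI improvements.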

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2012
Issuance monographic
Language English

Creators/Contributors

Associated with Galal, Sameh Rady Sayed
Associated with Stanford University, Department of Electrical Engineering
Primary advisor Dally, William
Primary advisor Horowitz, Mark (Mark Alan)
Thesis advisor Olukotun, Oyekunle Ayinde

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Sameh Galal.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2012.
Location electronic resource

Access conditions

Copyright
© 2012 by Sameh Rady Sayed Galal
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
