Accelerator architectures for matrix applications

Abstract/Contents

Abstract
Matrices are a well-known data representation used extensively in a wide range of applications. Numerous applications across many domains use matrix operations to implement their core algorithms. Improving matrix operation performance is therefore critical to a vast variety of fields: it not only allows existing applications to run faster, but also enables computations with larger matrices. Modern GPUs and CPUs with SIMD support have been very effective at accelerating matrix operations. However, these architectures only work well on dense, fat matrices. Skinny dense matrices tend to underutilize SIMD resources when the width of a matrix is less than the number of SIMD lanes, and they may limit scalability because they have less computation with which to hide communication overhead. Sparse matrices are also difficult to accelerate on current architectures, because their memory accesses are irregular and their workload imbalance is severe. This thesis introduces two specialized hardware designs, targeting narrow dense matrices and sparse matrices respectively.

The first part of this thesis focuses on accelerating a Restricted Boltzmann Machine (RBM), a popular machine learning algorithm used in deep learning. The RBM accelerator was designed with a modular approach to achieve linear scalability across transistor technologies, as well as across chip boundaries. The accelerator was implemented on FPGAs to demonstrate its performance improvements over high-end CPUs and GPUs. Both fat and skinny matrices were shown to fully utilize the computation resources during learning, which allows the training algorithm to converge in fewer iterations.

The second part of this thesis describes how sparse matrix applications can be accelerated with domain-specific hardware. We studied three sparse matrix applications that conventional hardware cannot easily accelerate. Based on our findings, we devised an accelerator architecture that targets certain sparse and dense matrix operations. The accelerator exploits the fine-grained parallelism within sparse matrices, despite their irregularity, through buffering and work-stealing. To cover a wider range of applications, a small general-purpose core was added to the accelerator for non-critical execution flows. The sparse matrix accelerator was implemented on an FPGA board as an ASIC prototype to evaluate its performance on real-world data. Our accelerator shows performance comparable to GPUs on dense matrix operations, and excels over conventional hardware on sparse matrix operations.
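The thesis text itself is not part of this record, so the following is only a minimal NumPy sketch of the kind of computation the first part of the abstract describes: one contrastive-divergence (CD-1) update for a Bernoulli-Bernoulli RBM, whose cost is dominated by dense matrix products. All names, dimensions, and hyperparameters here are illustrative assumptions, not the accelerator's actual design; note that when the mini-batch is small, the products involve skinny matrices of exactly the shape the abstract says underutilizes SIMD hardware.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, b_v, b_h, v0, lr=0.1, rng=np.random):
    """One CD-1 update for a Bernoulli RBM (hypothetical sketch).

    v0: (batch, n_visible) mini-batch. The dominant cost is the dense
    matrix products below; a small batch makes them skinny."""
    # Positive phase: hidden activations given the data
    h0_prob = sigmoid(v0 @ W + b_h)                      # (batch, n_hidden)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(v0.dtype)
    # Negative phase: one Gibbs step back to visible, then hidden
    v1_prob = sigmoid(h0 @ W.T + b_v)                    # (batch, n_visible)
    h1_prob = sigmoid(v1_prob @ W + b_h)                 # (batch, n_hidden)
    # Gradient estimates and parameter updates
    batch = v0.shape[0]
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / batch
    b_v += lr * (v0 - v1_prob).mean(axis=0)
    b_h += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, b_v, b_h

# Illustrative usage on random data (dimensions are arbitrary)
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((784, 256))
b_v, b_h = np.zeros(784), np.zeros(256)
v0 = (rng.random((64, 784)) < 0.5).astype(np.float64)
W, b_v, b_h = cd1_step(W, b_v, b_h, v0, rng=rng)
```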
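Likewise, a minimal sketch of sparse matrix-vector multiplication over CSR storage makes concrete the irregularity the second part of the abstract refers to. The CSR layout is standard, but the function name and data are assumptions; the accelerator's buffering and work-stealing scheme is only noted in the comments, not implemented here.

```python
import numpy as np

def spmv_csr(indptr, indices, data, x):
    """y = A @ x for A in CSR form (illustrative sketch).

    The indirect reads x[indices[...]] and the variable row lengths
    (indptr[i+1] - indptr[i]) are what make SpMV irregular: memory
    accesses are data-dependent and per-row work is unbalanced, which
    the thesis addresses with buffering and work-stealing across lanes."""
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows, dtype=data.dtype)
    for i in range(n_rows):
        start, end = indptr[i], indptr[i + 1]    # row i's nonzeros
        # Gather: x entries are read through the column-index array
        y[i] = np.dot(data[start:end], x[indices[start:end]])
    return y

# A = [[10, 0, 0, 2], [0, 0, 0, 0], [3, 4, 0, 0]] in CSR form
indptr  = np.array([0, 2, 2, 4])
indices = np.array([0, 3, 0, 1])
data    = np.array([10.0, 2.0, 3.0, 4.0])
x = np.array([1.0, 2.0, 3.0, 4.0])
print(spmv_csr(indptr, indices, data, x))  # [18.  0. 11.]
```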

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2013
Issuance monographic
Language English

Creators/Contributors

Associated with Kim, Sang Kyun
Associated with Stanford University, Department of Electrical Engineering
Primary advisor Olukotun, Oyekunle Ayinde
Thesis advisor Olukotun, Oyekunle Ayinde
Thesis advisor Kozyrakis, Christoforos, 1974-
Thesis advisor Ng, Andrew Y., 1976-

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Sang Kyun Kim.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2013.
Location electronic resource

Access conditions

Copyright
© 2013 by Sang Kyun Kim
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
