Accelerating chemical similarity search using GPUs and metric embeddings
Abstract/Contents
- Abstract
- Fifteen years ago, the advent of modern high-throughput sequencing revolutionized computational genetics with a flood of data. Today, high-throughput biochemical assays promise to make biochemistry the next data-rich domain for machine learning. However, existing computational methods, built for small analyses of about 1,000 molecules, do not scale to emerging multi-million molecule datasets. For many algorithms, pairwise similarity comparisons between molecules are a critical bottleneck, presenting a 1,000x-1,000,000x scaling barrier. In this dissertation, I describe the design of SIML and PAPER, our GPU implementations of 2D and 3D chemical similarities, as well as SCISSORS, our metric embedding algorithm. On a model problem of interest, combining these techniques allows up to 274,000x speedup in time and up to 2.8 million-fold reduction in space while retaining excellent accuracy. I further discuss how these high-speed techniques have allowed insight into chemical shape similarity and the behavior of machine learning kernel methods in the presence of noise.
Description
Type of resource | text |
---|---|
Form | electronic; electronic resource; remote |
Extent | 1 online resource. |
Publication date | 2011 |
Issuance | monographic |
Language | English |
Creators/Contributors
Associated with | Haque, Imran Saeedul |
---|---|
Associated with | Stanford University, Computer Science Department |
Primary advisor | Pande, Vijay |
Thesis advisor | Pande, Vijay |
Thesis advisor | Altman, Russ |
Thesis advisor | Koller, Daphne |
Advisor | Altman, Russ |
Advisor | Koller, Daphne |
Subjects
Genre | Theses |
---|---|
Bibliographic information
Statement of responsibility | Imran Saeedul Haque. |
---|---|
Note | Submitted to the Department of Computer Science. |
Thesis | Thesis (Ph.D.)--Stanford University, 2011. |
Location | electronic resource |
Access conditions
- Copyright
- © 2011 by Imran Saeedul Haque