Accelerating chemical similarity search using GPUs and metric embeddings
- Fifteen years ago, the advent of modern high-throughput sequencing revolutionized computational genetics with a flood of data. Today, high-throughput biochemical assays promise to make biochemistry the next data-rich domain for machine learning. However, existing computational methods, built for small analyses of about 1,000 molecules, do not scale to emerging multi-million molecule datasets. For many algorithms, pairwise similarity comparisons between molecules are a critical bottleneck, presenting a 1,000x-1,000,000x scaling barrier. In this dissertation, I describe the design of SIML and PAPER, our GPU implementations of 2D and 3D chemical similarities, as well as SCISSORS, our metric embedding algorithm. On a model problem of interest, combining these techniques allows up to 274,000x speedup in time and up to 2.8 million-fold reduction in space while retaining excellent accuracy. I further discuss how these high-speed techniques have allowed insight into chemical shape similarity and the behavior of machine learning kernel methods in the presence of noise.
Type of resource: electronic; electronic resource; remote
1 online resource.
Haque, Imran Saeedul
Stanford University, Computer Science Department
Statement of responsibility: Imran Saeedul Haque.
Submitted to the Department of Computer Science.
Thesis (Ph.D.)--Stanford University, 2011.
- © 2011 by Imran Saeedul Haque