Accelerating chemical similarity search using GPUs and metric embeddings

Abstract/Contents

Abstract
Fifteen years ago, the advent of modern high-throughput sequencing revolutionized computational genetics with a flood of data. Today, high-throughput biochemical assays promise to make biochemistry the next data-rich domain for machine learning. However, existing computational methods, built for small analyses of about 1,000 molecules, do not scale to emerging multi-million molecule datasets. For many algorithms, pairwise similarity comparisons between molecules are a critical bottleneck, presenting a 1,000x-1,000,000x scaling barrier. In this dissertation, I describe the design of SIML and PAPER, our GPU implementations of 2D and 3D chemical similarities, as well as SCISSORS, our metric embedding algorithm. On a model problem of interest, combining these techniques allows up to 274,000x speedup in time and up to 2.8 million-fold reduction in space while retaining excellent accuracy. I further discuss how these high-speed techniques have allowed insight into chemical shape similarity and the behavior of machine learning kernel methods in the presence of noise.

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2011
Issuance monographic
Language English

Creators/Contributors

Associated with Haque, Imran Saeedul
Associated with Stanford University, Computer Science Department
Primary advisor Pande, Vijay
Thesis advisor Pande, Vijay
Thesis advisor Altman, Russ
Thesis advisor Koller, Daphne
Advisor Altman, Russ
Advisor Koller, Daphne

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Imran Saeedul Haque.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2011.
Location electronic resource

Access conditions

Copyright
© 2011 by Imran Saeedul Haque