Accelerating chemical similarity search using GPUs and metric embeddings

Abstract/Contents

Abstract: Fifteen years ago, the advent of modern high-throughput sequencing revolutionized computational genetics with a flood of data. Today, high-throughput biochemical assays promise to make biochemistry the next data-rich domain for machine learning. However, existing computational methods, built for small analyses of about 1,000 molecules, do not scale to emerging multi-million molecule datasets. For many algorithms, pairwise similarity comparisons between molecules are a critical bottleneck, presenting a 1,000x-1,000,000x scaling barrier. In this dissertation, I describe the design of SIML and PAPER, our GPU implementations of 2D and 3D chemical similarities, as well as SCISSORS, our metric embedding algorithm. On a model problem of interest, combining these techniques allows up to 274,000x speedup in time and up to 2.8 million-fold reduction in space while retaining excellent accuracy. I further discuss how these high-speed techniques have allowed insight into chemical shape similarity and the behavior of machine learning kernel methods in the presence of noise.

Type of resource	text
Form	electronic; electronic resource; remote
Extent	1 online resource.
Publication date	2011
Issuance	monographic
Language	English

Associated with	Haque, Imran Saeedul
Associated with	Stanford University, Computer Science Department
Primary advisor	Pande, Vijay
Thesis advisor	Pande, Vijay
Thesis advisor	Altman, Russ
Thesis advisor	Koller, Daphne

Genre	Theses

Statement of responsibility	Imran Saeedul Haque.
Note	Submitted to the Department of Computer Science.
Thesis	Thesis (Ph.D.)--Stanford University, 2011.
Location	electronic resource
