Adaptive algorithms for data science and computational genomics

Placeholder Show Content

Abstract/Contents

Abstract
Recent years have seen a sustained exponential growth in the volume of data generated within the domains of data science and biology. This dramatic pace of data generation has outstripped the growth in computational power predicted by Moore's Law, creating a computational bottleneck. Existing algorithmic techniques are computationally insufficient, and cannot scale to meet this demand. My research has focused on developing new algorithmic techniques to handle this massive scale of data, using probabilistic modeling to design algorithms that adapt to the input dataset's structure. In this thesis, I present my work on data-adaptive algorithms, highlighting their application in computational genomics. The first part of this thesis focuses on how we can utilize methods from multi-armed bandits to construct computationally adaptive methods for data-scientific problems. We show that adaptivity can be used to design algorithms that are more efficient than their brute-force counterparts, while still providing theoretical guarantees discarded by common heuristic approaches. These methods leverage the fact that practical problems have structured data that is generated by nature, so classical worst-case analyses are overly pessimistic. To this end, we develop a general framework for the use of adaptivity in a class of computational tasks arising in data-science, including in clustering and randomized numerical linear algebra. Tackling each separate problem highlights the modularity of this framework, which we show can be further improved to exploit the structure and constraints of the specific task, which sometimes yield new multi-armed bandit settings requiring the development of new bandit algorithms. The second part of this thesis focuses on the application of these ideas to computational genomics. Nowhere is the computational bottleneck of exploding data more apparent than in genomics, where the rapidly decreasing cost of sequencing has fallen orders of magnitude faster than Moore's Law. The first problem I discuss is that of estimating pairwise sequence similarity, a common first step in genome assembly, for which we develop a novel statistical measure of similarity which adapts to the overall sequence similarity in the dataset, and a computationally efficient adaptive SVD to estimate it. The second problem I highlight is that of reference-free inference, analyzing sequencing data without the use of a reference genome. For this task we develop a novel method for analyzing sequencing data without a reference genome, and a novel finite-sample valid test on contingency tables. In both of these problems, we highlight the importance of improved probabilistic modeling and show how it can be used to develop new algorithms that adapt to structure in the data and are both statistically and computationally efficient.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2023; ©2023
Publication date 2023; 2023
Issuance monographic
Language English

Creators/Contributors

Author Baharav, Tavor Zvi
Degree supervisor Tse, David
Thesis advisor Tse, David
Thesis advisor Pilanci, Mert
Thesis advisor Salzman, Julia
Degree committee member Pilanci, Mert
Degree committee member Salzman, Julia
Associated with Stanford University, School of Engineering
Associated with Stanford University, Department of Electrical Engineering

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Tavor Zvi Baharav.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis Ph.D. Stanford University 2023.
Location https://purl.stanford.edu/wk002pz2491

Access conditions

Copyright
© 2023 by Tavor Zvi Baharav
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).

Also listed in

Loading usage metrics...