Adaptive algorithms for data science and computational genomics
Abstract/Contents
- Abstract
- Recent years have seen a sustained exponential growth in the volume of data generated within the domains of data science and biology. This dramatic pace of data generation has outstripped the growth in computational power predicted by Moore's Law, creating a computational bottleneck. Existing algorithmic techniques are computationally insufficient, and cannot scale to meet this demand. My research has focused on developing new algorithmic techniques to handle this massive scale of data, using probabilistic modeling to design algorithms that adapt to the input dataset's structure. In this thesis, I present my work on data-adaptive algorithms, highlighting their application in computational genomics. The first part of this thesis focuses on how we can utilize methods from multi-armed bandits to construct computationally adaptive methods for data-scientific problems. We show that adaptivity can be used to design algorithms that are more efficient than their brute-force counterparts, while still providing theoretical guarantees discarded by common heuristic approaches. These methods leverage the fact that practical problems have structured data that is generated by nature, so classical worst-case analyses are overly pessimistic. To this end, we develop a general framework for the use of adaptivity in a class of computational tasks arising in data-science, including in clustering and randomized numerical linear algebra. Tackling each separate problem highlights the modularity of this framework, which we show can be further improved to exploit the structure and constraints of the specific task, which sometimes yield new multi-armed bandit settings requiring the development of new bandit algorithms. The second part of this thesis focuses on the application of these ideas to computational genomics. Nowhere is the computational bottleneck of exploding data more apparent than in genomics, where the rapidly decreasing cost of sequencing has fallen orders of magnitude faster than Moore's Law. The first problem I discuss is that of estimating pairwise sequence similarity, a common first step in genome assembly, for which we develop a novel statistical measure of similarity which adapts to the overall sequence similarity in the dataset, and a computationally efficient adaptive SVD to estimate it. The second problem I highlight is that of reference-free inference, analyzing sequencing data without the use of a reference genome. For this task we develop a novel method for analyzing sequencing data without a reference genome, and a novel finite-sample valid test on contingency tables. In both of these problems, we highlight the importance of improved probabilistic modeling and show how it can be used to develop new algorithms that adapt to structure in the data and are both statistically and computationally efficient.
Description
Type of resource | text |
---|---|
Form | electronic resource; remote; computer; online resource |
Extent | 1 online resource. |
Place | California |
Place | [Stanford, California] |
Publisher | [Stanford University] |
Copyright date | 2023; ©2023 |
Publication date | 2023; 2023 |
Issuance | monographic |
Language | English |
Creators/Contributors
Author | Baharav, Tavor Zvi |
---|---|
Degree supervisor | Tse, David |
Thesis advisor | Tse, David |
Thesis advisor | Pilanci, Mert |
Thesis advisor | Salzman, Julia |
Degree committee member | Pilanci, Mert |
Degree committee member | Salzman, Julia |
Associated with | Stanford University, School of Engineering |
Associated with | Stanford University, Department of Electrical Engineering |
Subjects
Genre | Theses |
---|---|
Genre | Text |
Bibliographic information
Statement of responsibility | Tavor Zvi Baharav. |
---|---|
Note | Submitted to the Department of Electrical Engineering. |
Thesis | Thesis Ph.D. Stanford University 2023. |
Location | https://purl.stanford.edu/wk002pz2491 |
Access conditions
- Copyright
- © 2023 by Tavor Zvi Baharav
- License
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
Also listed in
Loading usage metrics...