Adaptive algorithms for data science and computational genomics

Baharav, Tavor Zvi

Adaptive algorithms for data science and computational genomics

<a href="https://embed.stanford.edu/iframe/?url=https%3A%2F%2Fpurl.stanford.edu%2Fwk002pz2491" class="su-underline">Show Content</a>

Abstract/Contents

Abstract: Recent years have seen a sustained exponential growth in the volume of data generated within the domains of data science and biology. This dramatic pace of data generation has outstripped the growth in computational power predicted by Moore's Law, creating a computational bottleneck. Existing algorithmic techniques are computationally insufficient, and cannot scale to meet this demand. My research has focused on developing new algorithmic techniques to handle this massive scale of data, using probabilistic modeling to design algorithms that adapt to the input dataset's structure. In this thesis, I present my work on data-adaptive algorithms, highlighting their application in computational genomics. The first part of this thesis focuses on how we can utilize methods from multi-armed bandits to construct computationally adaptive methods for data-scientific problems. We show that adaptivity can be used to design algorithms that are more efficient than their brute-force counterparts, while still providing theoretical guarantees discarded by common heuristic approaches. These methods leverage the fact that practical problems have structured data that is generated by nature, so classical worst-case analyses are overly pessimistic. To this end, we develop a general framework for the use of adaptivity in a class of computational tasks arising in data-science, including in clustering and randomized numerical linear algebra. Tackling each separate problem highlights the modularity of this framework, which we show can be further improved to exploit the structure and constraints of the specific task, which sometimes yield new multi-armed bandit settings requiring the development of new bandit algorithms. The second part of this thesis focuses on the application of these ideas to computational genomics. Nowhere is the computational bottleneck of exploding data more apparent than in genomics, where the rapidly decreasing cost of sequencing has fallen orders of magnitude faster than Moore's Law. The first problem I discuss is that of estimating pairwise sequence similarity, a common first step in genome assembly, for which we develop a novel statistical measure of similarity which adapts to the overall sequence similarity in the dataset, and a computationally efficient adaptive SVD to estimate it. The second problem I highlight is that of reference-free inference, analyzing sequencing data without the use of a reference genome. For this task we develop a novel method for analyzing sequencing data without a reference genome, and a novel finite-sample valid test on contingency tables. In both of these problems, we highlight the importance of improved probabilistic modeling and show how it can be used to develop new algorithms that adapt to structure in the data and are both statistically and computationally efficient.

Description

Type of resource	text
Form	electronic resource; remote; computer; online resource
Extent	1 online resource.
Place	California
Place	[Stanford, California]
Publisher	[Stanford University]
Copyright date	2023; ©2023
Publication date	2023; 2023
Issuance	monographic
Language	English

Creators/Contributors

Author	Baharav, Tavor Zvi
Degree supervisor	Tse, David
Thesis advisor	Tse, David
Thesis advisor	Pilanci, Mert
Thesis advisor	Salzman, Julia
Degree committee member	Pilanci, Mert
Degree committee member	Salzman, Julia
Associated with	Stanford University, School of Engineering
Associated with	Stanford University, Department of Electrical Engineering

Subjects

Genre	Theses
Genre	Text

Bibliographic information

Statement of responsibility	Tavor Zvi Baharav.
Note	Submitted to the Department of Electrical Engineering.
Thesis	Thesis Ph.D. Stanford University 2023.
Location	https://purl.stanford.edu/wk002pz2491

Access conditions

License: This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).

Also listed in

View in SearchWorks

Loading usage metrics...