Almost linear time algorithms for problems of computational genomics
- In this thesis we design fast algorithms for three problems in computational genomics. 1) Genome assembly: With the proliferation of DNA sequencing and falling sequencing costs, genome assembly has become a fundamental problem in computational genomics. Every organism has a genome, which can be abstracted as a long string over the alphabet {A, C, G, T}. The length of the genome varies from millions of characters in bacteria to hundreds of billions in some plants. We currently do not have the technology to read genomes end to end; instead, we obtain short, noisy substrings of the genome called reads. These reads are around 10,000 bases long, with an error rate of around 15%. Reconstructing the genome from the reads is the problem of genome assembly. The conditions under which the genome can be recovered exactly depend on the repeat structure of the genome and have been explored extensively in the literature; in practice, however, many datasets violate these conditions. We develop algorithms that, given a set of reads, recover the genome optimally up to the information-theoretic limits. We also implement these in an open-source assembler, HINGE, which is widely used in bacterial genomics. This is joint work with Ilan Shomorony, Fei Xia, Tom Courtade and David Tse. 2) Haplotype assembly: This problem arises in reference-based sequence assembly. Humans are diploid: they have two copies of each chromosome. Roughly speaking, the two chromosomes are identical apart from a small fraction of positions (around 1%) called single nucleotide polymorphisms, or SNPs. A read can be a substring of either chromosome, and which one is unknown; the goal is to infer the sequence of SNPs on each chromosome. We show that this problem can be viewed as decoding a convolutional code, and reduce it to a graph clustering problem. We then develop a spectral algorithm to solve it.
We also derive theoretically tight estimates of the number of measurements necessary. This is joint work with Eren Sasoglu, Yuxin Chen and David Tse. 3) Speeding up machine learning algorithms: The celebrated Monte Carlo method estimates a quantity that is expensive to compute by random sampling. We propose adaptive Monte Carlo optimization: a general framework for discrete optimization of an expensive-to-compute function by adaptive random sampling. Applications of this framework have already appeared in machine learning, but tied to their specific contexts and developed in isolation. We take a unified view and show that the framework has broad applicability by applying it to several common machine learning problems: k-nearest neighbors, k-medoids, hierarchical clustering and maximum mutual information feature selection. On real data, the resulting algorithms run one to two orders of magnitude faster than exact computation. We also characterize the performance gain theoretically under regularity assumptions on the data, which we verify on real-world data. We stumbled onto this problem and approach while trying to run k-medoids on a single-cell RNA-seq dataset. This is joint work with Vivek Bagaria, Tavor Baharav, Vasilis Ntranos, Martin Zhang, and David Tse.
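The spectral route from haplotype phasing to graph clustering can be illustrated with a toy sketch (not the thesis's actual algorithm): record, for each pair of reads, whether they agree or disagree at the SNPs they share, and split the reads into the two haplotype groups using the sign pattern of the leading eigenvector of that agreement matrix. The function name, the noise-free setup, and the power-iteration details below are all illustrative assumptions.

```python
import math

def spectral_phase(W, iters=200):
    """Toy spectral partition of reads into two haplotype groups.

    W[i][j] is +1 if reads i and j agree at the SNPs they share,
    -1 if they disagree, and 0 if they do not overlap (W[i][i] = 0).
    The sign pattern of the leading eigenvector of W splits the reads
    into two groups; we approximate that eigenvector by power iteration.
    """
    n = len(W)
    # Deterministic alternating start vector; it must not be orthogonal
    # to the leading eigenvector, which is fine for this illustration.
    v = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]
    for _ in range(iters):
        w = [sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    return [1 if x >= 0 else -1 for x in v]
```

On a noise-free agreement matrix built from true haplotype labels, the recovered signs match the labels up to a global sign flip; with noisy entries the leading eigenvector still correlates with the labels when the noise is weak enough.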
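The adaptive Monte Carlo optimization idea can be sketched on the 1-medoid problem (find the point minimizing average distance to the rest): instead of computing every average exactly, sample distances, maintain confidence intervals on each candidate's mean, and eliminate candidates whose interval lies entirely above the best candidate's. The function names, batch size, and the assumption that distances lie in [0, 1] are illustrative choices, not the thesis's implementation.

```python
import math
import random

def adaptive_medoid(points, dist, batch=16, delta=0.01, seed=0):
    """Find the medoid by adaptive sampling instead of exact computation.

    Successive elimination: repeatedly sample random reference points,
    keep Hoeffding-style confidence intervals on each candidate's mean
    distance (distances assumed bounded by 1 -- an illustrative
    assumption), and drop candidates whose lower bound exceeds the
    smallest upper bound.
    """
    rng = random.Random(seed)
    n = len(points)
    active = list(range(n))
    sums = [0.0] * n
    counts = [0] * n
    while len(active) > 1 and max(counts[i] for i in active) < n:
        for i in active:
            for _ in range(batch):
                j = rng.randrange(n)
                sums[i] += dist(points[i], points[j])
                counts[i] += 1
        means = {i: sums[i] / counts[i] for i in active}
        rad = {i: math.sqrt(math.log(2 * n / delta) / (2 * counts[i]))
               for i in active}
        best_ucb = min(means[i] + rad[i] for i in active)
        active = [i for i in active if means[i] - rad[i] <= best_ucb]
    # Fall back to exact evaluation on the surviving candidates.
    def mean_dist(i):
        return sum(dist(points[i], points[j]) for j in range(n)) / n
    return min(active, key=mean_dist)
```

The gain comes from elimination: most candidates are discarded after a few cheap samples, so only a small surviving set ever needs (near-)exact evaluation, which is where the order-of-magnitude speedups on problems like k-medoids come from.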
|Type of resource
|electronic resource; remote; computer; online resource
|1 online resource.
|Kamath, Govinda Mangalore
|Degree committee member
|Degree committee member
|Stanford University, Department of Electrical Engineering.
|Statement of responsibility
|Govinda Mangalore Kamath.
|Submitted to the Department of Electrical Engineering.
|Thesis (Ph.D.)--Stanford University, 2019.
- © 2019 by Govinda Mangalore Kamath
- This work is licensed under a Creative Commons Attribution 3.0 Unported license (CC BY).