Almost linear time algorithms for problems of computational genomics

Placeholder Show Content

Abstract/Contents

Abstract
In this thesis we discuss designing fast algorithms for three problems in computational genomics. 1) Genome Assembly: With the proliferation of DNA sequencing and decreasing prices, genome assembly is the fundamental problem in computational genomics. Every organism has a genome, which can be abstracted as a long string of A, C, G, T. The length of the genome varies from millions in bacteria to hundreds of billions in some plants. We currently do not have the technology to read genomes end to end and just get short-noisy substrings from the genome called reads. The lengths of these reads are around 10k and error rate is around 15%. Reconstructing the genome from the reads is the problem of genome assembly. The conditions for being able to recover the genome exactly depends on the repeat structure of the genome and are explored extensively in literature. But in practice, we find that there are many datasets where this is violated. We develop algorithms that can optimally recover the genome to within informationally possible limits given a set of reads. We also implement this in the form of an open source software: HINGE, which is used widely in bacterial genomics. This work is joint work with Ilan Shomorony, Fei Xia, Tom Courtade and David Tse. 2) Haplotype Assembly: This is a problem associated with reference based Sequence assembly. Humans are diploid: that is, they have two copies of each chromosome. Roughly speaking, the two chromosome are identical on all apart from a few positions (around 1\% of positions) called single nucleotide polymorphisms or SNPs. In this case reads can be substrings of either chromosome which is not known, and one wants to infer the sequence of SNPs on each chromosome. We show that this problem can be seen as decoding a convolutional code, and reduce it to a graph clustering problem. We then come up with a spectral machine learning algorithm to solve the problem. We show derive tight estimates of the number of measurements necessary theoretically. This work is joint work with Eren Sasoglu, Yuxin Chen and David Tse. 3) Speeding up Machine learning algorithms: The celebrated Monte Carlo method estimates a quantity that is expensive to compute by random sampling. We propose adaptive Monte Carlo optimization: a general framework for discrete optimization of an expensive-to-compute function by adaptive random sampling. Applications of this framework have already appeared in machine learning but are tied to their specific contexts and developed in isolation. We take a unified view and show that the framework has broad applicability by applying it on several common machine learning problems: k-nearest neighbors, k-medoids, hierarchical clustering and maximum mutual information feature selection. On real data we show that this framework allows us to develop algorithms that confer a gain of a magnitude or two over exact computation. We also characterize the performance gain theoretically under regularity assumptions on the data that we verify in real world data. We stumbled onto this problem and approach while trying to run k-medoids on a single-cell RNA-seq dataset. This is joint work with Vivek Bagaria, Tavor Baharav, Vasilis Ntranos, Martin Zhang, and David Tse.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2019; ©2019
Publication date 2019; 2019
Issuance monographic
Language English

Creators/Contributors

Author Kamath, Govinda Mangalore
Degree supervisor Tse, David
Thesis advisor Tse, David
Thesis advisor Montanari, Andrea
Thesis advisor Sidford, Aaron
Degree committee member Montanari, Andrea
Degree committee member Sidford, Aaron
Associated with Stanford University, Department of Electrical Engineering.

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Govinda Mangalore Kamath.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis Ph.D. Stanford University 2019.
Location electronic resource

Access conditions

Copyright
© 2019 by Govinda Mangalore Kamath
License
This work is licensed under a Creative Commons Attribution 3.0 Unported license (CC BY).

Also listed in

Loading usage metrics...