Addressing the DNA deluge : algorithmic methods for large-scale genome analysis


Abstract/Contents

Abstract
Ongoing advances in DNA sequencing technologies have undoubtedly shifted genomics into the realm of big data. To cope with this data explosion and enable rapid advances in biology and medicine, we must develop scalable and efficient methods for genomic data analysis that leverage both domain-specific knowledge and the latest computing platforms, such as multi-core and accelerator-based architectures and cloud computing. This thesis presents several algorithms developed with this goal in mind, focusing on key tasks in resequencing studies, metagenomics, and cancer genomics.

First, we consider the ubiquitous and compute-intensive task of read mapping. We propose the read mapping algorithm BALAUR, which outsources a significant portion of the computation to the public cloud while preserving the privacy of genomic data. BALAUR uses the MinHash technique and a coarse-grained voting scheme to map the reads, relying on the high degree of similarity between the reads and their best reference matches. BALAUR matches the runtime of state-of-the-art read mappers on short reads and achieves significant speedups over existing approaches on long reads. We also propose the read mapping algorithm BWBBLE, which maps reads to a large collection of genomes in order to eliminate reference bias and boost mapping accuracy. BWBBLE creates a compressed linear representation of the collection, exploiting the high redundancy across the genomes, and then uses this representation together with an adaptation of the Burrows-Wheeler Transform search algorithm to map the reads efficiently (since all reads can be treated independently, this is an embarrassingly parallel task). The result is a significant reduction in space and runtime requirements compared to mapping against each genome individually.

Next, we consider several resource-intensive tasks in metagenomics. We introduce the framework GATTACA for fast and accurate metagenomic binning. GATTACA uses a lightweight approach for estimating co-abundance profiles across a cohort of metagenomic samples (subsequently used as features for binning), replacing read mapping with a k-mer counting procedure. This approach can yield up to several orders of magnitude speedup in abundance estimation and has also been parallelized for shared-memory systems. By creating compact indices of metagenomic samples, GATTACA enables easy reuse across studies via fast downloads and small disk footprints. GATTACA also incorporates a fast, MinHash-based procedure for metagenomic sample comparison that can be used for cohort selection.

Finally, we focus on cancer genomics, specifically the critical task of inferring tumor heterogeneity and evolution. We introduce a novel algorithm, LICHeE, which uses variant frequencies from deep sequencing data across multiple tumor samples to infer the underlying cancer lineage tree and decompose the samples into subclonal populations. LICHeE is fast and scales efficiently with the number of input samples and variants, relying on a combinatorial formulation of the task as a search for spanning trees in a constraint network.
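The abstract repeatedly relies on MinHash-based similarity estimation (in BALAUR's read mapping and in GATTACA's sample comparison). As a rough illustration of that underlying idea only, and not code from the thesis, the following Python sketch estimates the Jaccard similarity of two DNA sequences from k-mer MinHash signatures; the k-mer length, signature size, and function names are arbitrary choices made here for illustration.

    # Illustrative MinHash sketch (assumptions: K, NUM_HASHES, salted MD5 hashing).
    import hashlib

    K = 16            # k-mer length, chosen arbitrarily for this example
    NUM_HASHES = 128  # signature length, chosen arbitrarily for this example

    def kmers(seq, k=K):
        # All length-k substrings of the sequence.
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def minhash_signature(seq, num_hashes=NUM_HASHES):
        # For each of num_hashes salted hash functions, keep the minimum
        # hash value over all k-mers of the sequence.
        kmer_set = kmers(seq)
        signature = []
        for i in range(num_hashes):
            salt = str(i).encode()
            signature.append(min(
                int.from_bytes(hashlib.md5(salt + kmer.encode()).digest()[:8], "big")
                for kmer in kmer_set))
        return signature

    def estimated_jaccard(sig_a, sig_b):
        # The fraction of hash functions whose minima agree estimates the
        # Jaccard similarity of the underlying k-mer sets.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    # Example: two largely overlapping fragments yield a high estimate.
    seq_x = "ACGTACGTAGGCTTACGGATCCGATCGATTACGCGTATATGCGC"
    seq_y = "ACGTACGTAGGCTTACGGATCCGATCGATTACGCGTATATGCAT"
    print(estimated_jaccard(minhash_signature(seq_x), minhash_signature(seq_y)))

Comparing such short, fixed-size signatures instead of full k-mer sets (or full read alignments) is what makes MinHash-based steps cheap enough to run at the scale described above; the specific sketch construction and parameters used by BALAUR and GATTACA are detailed in the thesis itself.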

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2017
Issuance monographic
Language English

Creators/Contributors

Associated with Popic, Victoria
Associated with Stanford University, Computer Science Department.
Primary advisor Batzoglou, Serafim
Thesis advisor Batzoglou, Serafim
Thesis advisor Kundaje, Anshul, 1980-
Thesis advisor Olukotun, Oyekunle Ayinde

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Victoria Popic.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2017.
Location electronic resource

Access conditions

Copyright
© 2017 by Victoria Popic
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
