Genomic data compression and processing : theory, models, algorithms, and experiments

Abstract/Contents

Abstract
Recently, there has been growing interest in genome sequencing, driven by advances in sequencing technology. Although early sequencing technologies required several years to capture a 3-billion-nucleotide genome, genomes as large as 22 billion nucleotides are now being sequenced within days using next-generation sequencing technologies. Further, the cost of sequencing a whole human genome has dropped from billions of dollars to merely $1000 within the past 15 years. These developments in efficiency and affordability have allowed many to envision whole-genome sequencing as an invaluable tool to be used in both personalized medical care and public health. As a result, increasingly large and ubiquitous genomic datasets are being generated. These datasets need to be stored, transmitted, and analyzed, which poses significant challenges. In the first part of the thesis, we investigate methods and algorithms to ease the storage and distribution of these datasets. In particular, we present lossless compression schemes tailored to the raw sequencing data, which significantly decrease the size of the files, allowing for both storage and transmission savings. In addition, we show that lossy compression can be applied to some of the genomic data, boosting the compression performance beyond the lossless limit while maintaining similar -- and sometimes superior -- performance in downstream analyses. These results are possible due to the inherent noise present in the genomic data. However, lossy compressors are not explicitly designed to reduce the noise present in the data. With that in mind, we introduce a denoising scheme tailored to these data, and demonstrate that it can result in better inference. Moreover, we show that reducing the noise leads to smaller entropy, and thus a significant boost in compression is also achieved. In the second part of the thesis, we investigate methods to facilitate access to genomic data in databases. Specifically, we study the problem of compressing a database so that similarity queries can still be performed in the compressed domain. Compressing the database allows it to be replicated in several locations, thus providing easier and faster access to the data, and reducing the time needed to execute a query.

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2016
Issuance monographic
Language English

Creators/Contributors

Associated with Ochoa-Alvarez, Idoia
Associated with Stanford University, Department of Electrical Engineering.
Primary advisor Weissman, Tsachy
Thesis advisor Weissman, Tsachy
Thesis advisor Goldsmith, Andrea, 1964-
Thesis advisor Tse, David
Advisor Goldsmith, Andrea, 1964-
Advisor Tse, David

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Idoia Ochoa-Alvarez.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis (Ph.D.)--Stanford University, 2016.
Location electronic resource

Access conditions

Copyright
© 2016 by Idoia Ochoa-Alvarez
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
