Genomic data compression and processing : theory, models, algorithms, and experiments

Ochoa-Alvarez, Idoia; Stanford University, Department of Electrical Engineering.

Abstract/Contents

Abstract: Recently, there has been growing interest in genome sequencing, driven by advances in sequencing technology. Although early sequencing technologies required several years to capture a 3 billion nucleotide genome, genomes as large as 22 billion nucleotides are now being sequenced within days using next-generation sequencing technologies. Further, the cost of sequencing a whole human genome has dropped from billions of dollars to merely $1,000 within the past 15 years. These developments in efficiency and affordability have allowed many to envision whole-genome sequencing as an invaluable tool for both personalized medical care and public health. As a result, increasingly large and ubiquitous genomic datasets are being generated. These datasets need to be stored, transmitted, and analyzed, which poses significant challenges.

In the first part of the thesis, we investigate methods and algorithms to ease the storage and distribution of these datasets. In particular, we present lossless compression schemes tailored to the raw sequencing data, which significantly decrease the size of the files, allowing for both storage and transmission savings. In addition, we show that lossy compression can be applied to some of the genomic data, boosting the compression performance beyond the lossless limit while maintaining similar -- and sometimes superior -- performance in downstream analyses. These results are possible due to the inherent noise present in the genomic data. However, lossy compressors are not explicitly designed to reduce the noise present in the data. With that in mind, we introduce a denoising scheme tailored to these data, and demonstrate that it can result in better inference. Moreover, we show that reducing the noise leads to smaller entropy, and thus a significant boost in compression is also achieved.

In the second part of the thesis, we investigate methods to facilitate access to genomic data in databases. Specifically, we study the problem of compressing a database so that similarity queries can still be performed in the compressed domain. Compressing the database allows it to be replicated in several locations, thus providing easier and faster access to the data, and reducing the time needed to execute a query.

Description

Type of resource	text
Form	electronic; electronic resource; remote
Extent	1 online resource.
Publication date	2016
Issuance	monographic
Language	English

Creators/Contributors

Associated with	Ochoa-Alvarez, Idoia
Associated with	Stanford University, Department of Electrical Engineering.
Primary advisor	Weissman, Tsachy
Thesis advisor	Weissman, Tsachy
Thesis advisor	Goldsmith, Andrea, 1964-
Thesis advisor	Tse, David

Subjects

Genre	Theses

Bibliographic information

Statement of responsibility	Idoia Ochoa-Alvarez.
Note	Submitted to the Department of Electrical Engineering.
Thesis	Thesis (Ph.D.)--Stanford University, 2016.
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
