Genomic data compression and processing : theory, models, algorithms, and experiments
- Recently, there has been growing interest in genome sequencing, driven by advances in sequencing technology. Although early sequencing technologies required several years to capture a 3-billion-nucleotide genome, genomes as large as 22 billion nucleotides are now being sequenced within days using next-generation sequencing technologies. Further, the cost of sequencing a whole human genome has dropped from billions of dollars to merely \$1000 within the past 15 years. These developments in efficiency and affordability have allowed many to envision whole-genome sequencing as an invaluable tool for both personalized medical care and public health. As a result, increasingly large and ubiquitous genomic datasets are being generated. These datasets need to be stored, transmitted, and analyzed, which poses significant challenges. In the first part of the thesis, we investigate methods and algorithms to ease the storage and distribution of these datasets. In particular, we present lossless compression schemes tailored to the raw sequencing data, which significantly decrease the size of the files, allowing for both storage and transmission savings. In addition, we show that lossy compression can be applied to some of the genomic data, boosting the compression performance beyond the lossless limit while maintaining similar -- and sometimes superior -- performance in downstream analyses. These results are possible due to the inherent noise present in the genomic data. However, lossy compressors are not explicitly designed to reduce the noise present in the data. With that in mind, we introduce a denoising scheme tailored to these data, and demonstrate that it can result in better inference. Moreover, we show that reducing the noise lowers the entropy of the data, and thus a significant boost in compression is also achieved. In the second part of the thesis, we investigate methods to facilitate access to genomic data in databases. Specifically, we study the problem of compressing a database so that similarity queries can still be performed in the compressed domain. Compressing the database allows it to be replicated in several locations, thus providing easier and faster access to the data and reducing the time needed to execute a query.
|Type of resource
|electronic; electronic resource; remote
|1 online resource.
|Stanford University, Department of Electrical Engineering.
|Goldsmith, Andrea, 1964-
|Statement of responsibility
|Submitted to the Department of Electrical Engineering.
|Thesis (Ph.D.)--Stanford University, 2016.
- © 2016 by Idoia Ochoa-Alvarez
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).