Compression of raw genomic data
Abstract/Contents
- Abstract
- With rapid advances in genomic sequencing, the amount of genomic data being produced is growing exponentially. Several large-scale sequencing projects for humans and other species are expected to further increase the volume of this data. While the initial progress was led by second-generation high-throughput sequencers such as Illumina, more recently there has been increasing interest in third-generation sequencers like Oxford Nanopore that enable real-time and portable sequencing of long reads. In this context, compression techniques play a crucial role in enabling efficient storage and transfer of this data. Unfortunately, traditional general-purpose compressors like Gzip are unable to fully exploit the inherent redundancy in this data. Furthermore, in many cases the data is noisy, and it is possible to deploy lossy compression algorithms that reduce the storage space without adverse impacts on the data quality for downstream analysis. This thesis presents two specialized compressors for genomic data, focusing on raw genomic data, which consists of sequencing reads (FASTQ format) as well as raw signal data produced by nanopore sequencing (FAST5 format). We first describe SPRING, an efficient compressor for unaligned single and paired-end genomic reads that supports various lossless and lossy compression modes. Next, we discuss lossy compression of nanopore raw signal data using LFZip, a general-purpose lossy compressor for time series and sensor data. We also evaluate the impact of lossy compression on the performance of downstream applications such as basecalling, consensus, and methylation calling.
Description
Type of resource | text |
---|---|
Form | electronic resource; remote; computer; online resource |
Extent | 1 online resource. |
Place | California |
Place | [Stanford, California] |
Publisher | [Stanford University] |
Copyright date | 2021; ©2021 |
Publication date | 2021 |
Issuance | monographic |
Language | English |
Creators/Contributors
Author | Chandak, Shubham | |
---|---|---|
Degree supervisor | Weissman, Tsachy | |
Thesis advisor | Weissman, Tsachy | |
Thesis advisor | Ji, Hanlee | |
Thesis advisor | Wootters, Mary | |
Degree committee member | Ji, Hanlee | |
Degree committee member | Wootters, Mary | |
Associated with | Stanford University, Department of Electrical Engineering |
Subjects
Genre | Theses |
---|---|
Genre | Text |
Bibliographic information
Statement of responsibility | Shubham Chandak. |
---|---|
Note | Submitted to the Department of Electrical Engineering. |
Thesis | Thesis Ph.D. Stanford University 2021. |
Location | https://purl.stanford.edu/yx427br7566 |
Access conditions
- Copyright
- © 2021 by Shubham Chandak
- License
- This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).