Compression of raw genomic data

Abstract/Contents

Abstract
With rapid advances in genomic sequencing, the amount of genomic data being produced is growing exponentially. Several large-scale sequencing projects for humans and other species are expected to further increase the volume of this data. While the initial progress was led by second-generation high-throughput sequencers such as Illumina, more recently there has been increasing interest in third-generation sequencers such as Oxford Nanopore, which enable real-time and portable sequencing of long reads. In this context, compression techniques play a crucial role in enabling efficient storage and transfer of this data. Unfortunately, traditional general-purpose compressors such as Gzip are unable to fully exploit the inherent redundancy in this data. Furthermore, in many cases the data is noisy, so lossy compression algorithms can reduce the storage space without adversely affecting the results of downstream analysis. This thesis presents two specialized compressors for raw genomic data, which consists of sequencing reads (FASTQ format) as well as the raw signal data produced by nanopore sequencing (FAST5 format). We first describe SPRING, an efficient compressor for unaligned single-end and paired-end genomic reads that supports various lossless and lossy compression modes. Next, we discuss lossy compression of nanopore raw signal data using LFZip, a general-purpose lossy compressor for time series and sensor data. We also evaluate the impact of lossy compression on the performance of downstream applications such as basecalling, consensus calling, and methylation calling.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource.
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2021; ©2021
Publication date 2021
Issuance monographic
Language English

Creators/Contributors

Author Chandak, Shubham
Degree supervisor Weissman, Tsachy
Thesis advisor Weissman, Tsachy
Thesis advisor Ji, Hanlee
Thesis advisor Wootters, Mary
Degree committee member Ji, Hanlee
Degree committee member Wootters, Mary
Associated with Stanford University, Department of Electrical Engineering

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Shubham Chandak.
Note Submitted to the Department of Electrical Engineering.
Thesis Thesis Ph.D. Stanford University 2021.
Location https://purl.stanford.edu/yx427br7566

Access conditions

Copyright
© 2021 by Shubham Chandak
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC 3.0).
