Algorithms for analyzing third generation sequencing data
- Sequencing technologies have evolved since the first human genome was released in 2003. First generation sequencing emerged in the mid-1970s, generating reads with high accuracy. In the mid-2000s, second generation sequencing became commercially available, with higher throughput and lower cost per base. In recent years, third generation sequencing technologies have been able to generate reads tens of kilobases long, though with relatively lower accuracy. These distinctive characteristics of third generation sequencing data, longer reads and higher error rates, have required the development of new algorithms and tools to process the data efficiently. In this thesis, we present two methods for analyzing third generation sequencing reads. The first method (COSINE) is a conceptually novel technique for mapping long DNA sequences with high error rates. As a proof of concept, COSINE is applied to both simulated and real datasets, where it achieves high sensitivity and specificity across a wide range of read accuracies with minimal tuning. The second method (IDP-fusion) is a new approach to accurately characterizing fusion genes using hybrid RNA sequencing. As a proof of concept, IDP-fusion is applied to real PacBio and Illumina datasets from the MCF-7 cell line, where it achieves higher sensitivity and specificity than existing tools. The results also show that IDP-fusion can resolve multiple fusion splices and fusion isoforms within tumorigenesis-relevant fusion genes.
|Type of resource
|electronic; electronic resource; remote
|1 online resource.
|Tootoonchi Afshar, Pegah
|Stanford University, Department of Electrical Engineering.
|Wong, Wing Hung
|Statement of responsibility
|Pegah Tootoonchi Afshar.
|Submitted to the Department of Electrical Engineering.
|Thesis (Ph.D.)--Stanford University, 2016.
- © 2016 by Pegah Tootoonchi Afshar
- This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license (CC BY-NC-SA).