Genome assembly methods using read clouds

Abstract/Contents

Abstract
The field of genomics has been revolutionized by the advent of high-throughput DNA sequencing techniques, which allow us to efficiently obtain billions of sequenced DNA fragments from a biological specimen in a single experiment. Computational genome assembly methods have been developed to reconstruct the source genomic sequence from these billions of DNA fragments, or reads. Applications of these sequencing and assembly techniques have enabled us to study the genetic underpinnings of human diseases and have also allowed us to learn about the existence of vast prokaryotic life in nature beyond the small minority of culturable microorganisms. However, the completeness and contiguity of genome reconstructions are limited by the short read lengths currently produced by these high-throughput methods. Recent "read cloud" techniques mitigate this limitation by partitioning groups of large DNA molecules, then barcoding short fragments derived from them, to produce short fragment sequences tagged with long-range information. However, existing sequence assembly algorithms fail to take advantage of the long-range information provided by these platforms. Consequently, read clouds have had limited adoption in large-scale sequencing projects. In this thesis, we present novel computational techniques and applications of read clouds in the following areas: (1) We introduce a novel alignment algorithm, Random Field Aligner (RFA), which uses a Markov random field to capture the relationships among short reads that arise from the read cloud generative process. This probabilistic model allows us to confidently align short reads and discover variants within 155Mb of repeats in the human genome (6% of GRCh37) that were previously dark to short reads.
(2) Nearly 131Mb of the human genome (4.35% of GRCh37) lies within large, high-sequence-identity repeats (segmental duplications), which can vary in sequence structure across individuals and are not amenable to alignment approaches. We show how to reformulate the problem of assembling a subset of these loci as a read cloud haplotype phasing problem, which we solve approximately with a Markov chain Monte Carlo inference procedure that uses merge-split proposals. We apply this methodology to assemble a repeat family consisting of ten very large (~200kbp), nearly identical (~99.9%) repeats within the human genome. (3) We provide the first application of read clouds to sequence a metagenome, and introduce an assembler, Athena, which uses read clouds to produce near-complete genome drafts from metagenomic samples.

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2017
Issuance monographic
Language English

Creators/Contributors

Associated with Bishara, Alex
Associated with Stanford University, Computer Science Department.
Primary advisor Batzoglou, Serafim
Thesis advisor Batzoglou, Serafim
Thesis advisor Bhatt, Ami (Ami Siddharth)
Thesis advisor Kundaje, Anshul, 1980-
Advisor Bhatt, Ami (Ami Siddharth)
Advisor Kundaje, Anshul, 1980-

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Alex Bishara.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2017.
Location electronic resource

Access conditions

Copyright
© 2017 by Alex Bishara
License
This work is licensed under a Creative Commons Attribution 3.0 Unported license (CC BY).
