Genome assembly methods using read clouds

Bishara, Alex; Stanford University, Computer Science Department.

Abstract/Contents

Abstract: The field of genomics has been revolutionized by the advent of high-throughput DNA sequencing techniques, which allow us to efficiently obtain billions of sequenced DNA fragments from a biological specimen in a single experiment. Computational genome assembly methods have been developed to reconstruct the source genomic sequence from these billions of DNA fragments, or reads. Applications of these sequencing and assembly techniques have enabled us to study the genetic underpinnings of human diseases and have also allowed us to learn about the existence of vast prokaryotic life in nature beyond the small minority of culturable microorganisms. However, the completeness and contiguity of genome reconstructions are limited by the short read lengths currently produced by these high-throughput methods. Recent "read cloud" techniques mitigate this limitation by partitioning groups of large DNA molecules, then barcoding short fragments derived from them, to produce short fragment sequences tagged with long-range information. However, existing sequence assembly algorithms fail to take advantage of the long-range information provided by these platforms. Consequently, read clouds have had limited adoption in large-scale sequencing projects. In this thesis we present applications of these read clouds and novel computational techniques to the following areas:
(1) We introduce a novel alignment algorithm, Random Field Aligner (RFA), which captures the relationships among the short reads governed by the read cloud generative process via a Markov Random Field. Utilization of this probabilistic model allows us to confidently align short reads and discover variants within 155Mb of repeats within the human genome (6% of GRCh37) that were previously dark to short reads.
(2) Nearly 131Mb of the human genome (4.35% of GRCh37) lies within large, high-sequence-identity repeats (or segmental duplications), which can vary in sequence structure across individuals and are not amenable to alignment approaches. We show how to reformulate the problem of assembling a subset of these loci as a read cloud haplotype phasing problem, which we solve approximately with a merge-split proposal Markov chain Monte Carlo inference procedure. We apply this methodology to assemble a repeat family consisting of ten nearly identical (~99.9%) and very large (~200kbp) repeats within the human genome.
(3) We provide the first application of read clouds to sequence a metagenome, and introduce an assembler, Athena, which uses read clouds to produce near-complete genome drafts from metagenomic samples.

Description

Type of resource	text
Form	electronic; electronic resource; remote
Extent	1 online resource.
Publication date	2017
Issuance	monographic
Language	English

Creators/Contributors

Associated with	Bishara, Alex
Associated with	Stanford University, Computer Science Department.
Primary advisor	Batzoglou, Serafim
Thesis advisor	Batzoglou, Serafim
Thesis advisor	Bhatt, Ami (Ami Siddharth)
Thesis advisor	Kundaje, Anshul, 1980-
Advisor	Bhatt, Ami (Ami Siddharth)
Advisor	Kundaje, Anshul, 1980-

Subjects

Genre	Theses

Bibliographic information

Statement of responsibility	Alex Bishara.
Note	Submitted to the Department of Computer Science.
Thesis	Thesis (Ph.D.)--Stanford University, 2017.
Location	electronic resource

Access conditions

License: This work is licensed under a Creative Commons Attribution 3.0 Unported license (CC BY).
