Computational methods using large-scale population whole-genome sequencing data


Abstract/Contents

Abstract
Advancements in next-generation low-cost, high-throughput DNA sequencing technologies have made it possible to sequence a large number of human genomes. To date, at least tens of thousands of individuals have been whole-genome sequenced. Even more large-scale population sequencing projects are actively underway or will be launched in the foreseeable future. This vast amount of genomic data undoubtedly advances the characterization of human genome variation and supports disease studies across diverse cohorts. However, efficiently and precisely determining individual-level genomic differences from this huge amount of sequencing data remains a challenging problem. This natural first step in leveraging sequencing data for genomic analyses can be computationally intensive, and the quality of the restored genomes influences a wide variety of downstream applications, such as association studies, personalized medicine, and population genomics. In this dissertation, I present computational methods that address this fundamental problem in the context of ever-increasing sequencing data volume, and I demonstrate the effectiveness and efficiency of these methods using real data from the latest population sequencing projects. First, I present a new method that maps reads from a newly sequenced human genome to a large collection of genomes, aiming to reduce the inherent biases induced by aligning to any single reference genome. Second, I introduce an approach, named Reveel, for single nucleotide variant calling and genotype calling in large cohorts sequenced at low coverage; it aims for computational efficiency as well as accuracy in capturing the linkage disequilibrium patterns present in rare haplotypes. Third, building on the Reveel framework, I present a reference-based approach that effectively incorporates genotypes from completed projects to improve the genotyping quality of new datasets while maintaining low computational costs. Finally, I demonstrate an application of genotype information for improving the efficiency of identity-by-descent detection in a large cohort.

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2017
Issuance monographic
Language English

Creators/Contributors

Associated with Huang, Lin
Associated with Stanford University, Computer Science Department.
Primary advisor Batzoglou, Serafim
Thesis advisor Batzoglou, Serafim
Thesis advisor Kundaje, Anshul, 1980-
Thesis advisor Pritchard, Jonathan D

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Lin Huang.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2017.
Location electronic resource

Access conditions

Copyright
© 2017 by Lin Huang
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
