Evaluating and improving clinical genome sequencing

Abstract/Contents

Abstract
Most human diseases have underlying genetic causes. In order to better understand the impact of genes on disease and their implications for medicine, researchers have pursued methods to sequence human genomes. Next-generation DNA sequencing technology, where DNA is sheared into smaller pieces, sequenced in parallel, and then computationally ordered and analyzed, enables fast and affordable sequencing of full human genomes. Recent advancements including reductions in sequencing turnaround time and cost, President Obama's announcement of the Precision Medicine Initiative, and the FDA's increased involvement in the space (including precisionFDA and draft guidelines for using genome sequencing in the clinic) have catalyzed the use of next-generation DNA sequencing as a clinical tool for patient diagnosis, prognosis, and treatment. However, thus far, the majority of genome sequencing has been performed on a research basis, where incomplete genome coverage or minor inaccuracies are tolerated. The stakes for clinical genome sequencing are much higher; clinicians and patients may make medical decisions based on sequencing results, so false positive or false negative variants can have serious consequences. Therefore, genetic variant detection must achieve high levels of accuracy, and methods for evaluating genome sequencing accuracy and completeness are critical. To address these needs, my thesis work, described in this dissertation, consists of the following three aims: 1) Create metrics and methods to assess completeness of clinical genome sequencing; 2) Develop a framework to evaluate the accuracy of variant calls in clinical genome sequencing; 3) Implement a method to improve insertion and deletion (INDEL) variant detection. I performed a number of studies to complete these aims. First, I created a metric for assessing the coverage of key genes in clinical genome sequencing. 
This metric is defined as the number of bases, per gene, with a depth of coverage below a given threshold. I employed this metric to show that clinical exome sequencing platforms have better coverage of key disease genes than whole genome sequencing, but whole genome sequencing performs better across all genes in RefSeq than clinical exome sequencing. I also characterized the Genome in a Bottle Consortium's High Confidence regions of the genome, and found that these regions are enriched for easy-to-sequence portions of the genome and do not encompass all portions of all key disease genes. These findings have particularly critical implications for benchmarking: accuracy reported within the Genome in a Bottle High Confidence regions may not generalize to the rest of the genome. Therefore, evaluation of genome sequencing analysis requires simulated genomes. Towards this end, I developed a framework to simulate genomes with biologically realistic variants and artificial sequence reads with technologically realistic error profiles. I used this framework to show that INDEL detection can be optimized for small INDELs at high coverage and longer read lengths. This evaluation also showed that current INDEL detection approaches have low sensitivity for larger INDELs. To address this, I created a method, Scotch, that leverages signatures of poorly aligned reads with a machine learning approach to identify larger INDELs. This method yields high positive predictive value when tested on a human sample (10 out of 10 validation rate) and high accuracy (> 99%) when tested on a simulated genome containing thousands of INDELs. This method will be useful for detecting INDELs in patient genomes. For example, I ran Scotch on an undiagnosed patient's exome sequencing data and found 140 INDELs within 180 key genes that were not reported by previous INDEL detection methods. Taken together, the methods and results of these studies will help move genome sequencing towards the clinic. 
In turn, this will enable physicians to incorporate genomic information into medical diagnosis and treatment for more patients.
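
The coverage metric defined in the abstract (the number of bases, per gene, with a depth of coverage below a given threshold) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function name `bases_below_threshold`, the input layout (gene intervals as 0-based half-open coordinates, per-base depth as a dictionary), the gene names, and the threshold of 20 are all assumptions made for illustration. Positions absent from the depth table are treated as depth 0.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (chromosome, start, end), 0-based, half-open.
Interval = Tuple[str, int, int]


def bases_below_threshold(
    gene_intervals: Dict[str, List[Interval]],
    depth: Dict[Tuple[str, int], int],
    threshold: int,
) -> Dict[str, int]:
    """Count, per gene, the bases whose depth of coverage is below threshold."""
    counts: Dict[str, int] = {}
    for gene, intervals in gene_intervals.items():
        # Collect positions in a set so overlapping intervals of the same
        # gene are not counted twice.
        positions = {
            (chrom, pos)
            for chrom, start, end in intervals
            for pos in range(start, end)
        }
        # A position missing from the depth table is assumed to be uncovered.
        counts[gene] = sum(
            1 for key in positions if depth.get(key, 0) < threshold
        )
    return counts


# Toy data (invented for illustration only).
genes = {
    "GENE_A": [("chr1", 100, 105)],
    "GENE_B": [("chr2", 10, 13)],
}
per_base_depth = {
    ("chr1", 100): 30,
    ("chr1", 101): 8,
    ("chr1", 102): 25,
    ("chr1", 103): 19,
    ("chr1", 104): 40,
    ("chr2", 10): 50,
    ("chr2", 11): 45,
    # ("chr2", 12) is intentionally missing and is treated as depth 0.
}

result = bases_below_threshold(genes, per_base_depth, threshold=20)
print(result)
```

In practice the per-base depths would come from an alignment file rather than a hand-built dictionary, and the choice of threshold depends on the assay; the sketch only shows how the per-gene count is formed.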

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Publication date 2017
Issuance monographic
Language English

Creators/Contributors

Associated with Goldfeder, Rachel
Associated with Stanford University, Department of Biomedical Informatics
Primary advisor Altman, Russ
Primary advisor Ashley, Euan A
Thesis advisor Altman, Russ
Thesis advisor Ashley, Euan A
Thesis advisor Wall, Dennis Paul
Advisor Wall, Dennis Paul

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Rachel Goldfeder.
Note Submitted to the Department of Biomedical Informatics.
Thesis Thesis (Ph.D.)--Stanford University, 2017.
Location https://purl.stanford.edu/zx371dm3416

Access conditions

Copyright
© 2017 by Rachel Lynn Goldfeder
License
This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported license (CC BY-NC).
