Unmapped or untapped? Leveraging the unmapped read space of whole genome sequences to understand the human genome, virome, and contaminome

Abstract/Contents

Abstract
Whole Genome Sequencing (WGS) has become an incredibly popular method for understanding the genetic contributors to human health and disease. In a typical WGS pipeline, millions of sequencing reads are aligned to the human reference genome in order to identify potential variants linked to disease. However, in most WGS pipelines between 1% and 10% of these reads will not align to the reference genome, because they come either from non-human genomes or from diverse human sequences not well represented on the current single reference genome. These reads, collectively known as the unmapped read space, are often discarded in downstream analysis, despite their potential role in human health. Using WGS from over 1,000 nuclear families, I interrogate the three contributors to the unmapped read space (bacteria, viruses, and non-reference human sequences) in order to better understand these important components of human genomic and metagenomic diversity. In this thesis, I present my findings from studying these contributors. (1) I identify common blood DNA viruses and their transmission patterns, including never-before-seen integration and latency mechanisms of Human Herpesvirus 6B and 7 in lymphocytes. (2) I characterize signatures of experimental and computational contamination that may pervade WGS pipelines. Of interest to the field of metagenomics, the latter includes Y chromosome fragments that frequently mismap to bacterial references. (3) Finally, in order to identify non-reference human DNA sequences and localize them to their appropriate location on the human genome, I present a maximum-likelihood algorithm, ASLAN (an Algorithm for Sequence Location Approximation using Nuclear families). After validating ASLAN to show its high accuracy and resolution, I use it to localize unmapped sequences from WGS to the human genome. I compare these localizations to alignments to the recently released CHM13 genome (the first assembly of a full human genome, produced by the Telomere-to-Telomere consortium). I identify potentially problematic regions on this assembly, which is set to become the new human reference genome, and characterize hotspots for genetic diversity.

Description

Type of resource text
Form electronic resource; remote; computer; online resource
Extent 1 online resource
Place California
Place [Stanford, California]
Publisher [Stanford University]
Copyright date 2022; ©2022
Publication date 2022; 2022
Issuance monographic
Language English

Creators/Contributors

Author Chrisman, Brianna Sierra
Degree supervisor Wall, Dennis Paul
Thesis advisor Wall, Dennis Paul
Thesis advisor Altman, Russ
Thesis advisor Bhatt, Ami (Ami Siddharth)
Degree committee member Altman, Russ
Degree committee member Bhatt, Ami (Ami Siddharth)
Associated with Stanford University, Department of Bioengineering

Subjects

Genre Theses
Genre Text

Bibliographic information

Statement of responsibility Brianna Sierra Chrisman
Note Submitted to the Department of Bioengineering
Thesis Thesis Ph.D. Stanford University 2022
Location https://purl.stanford.edu/mb573bk1550

Access conditions

Copyright
© 2022 by Brianna Sierra Chrisman
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
