Unmapped or untapped? Leveraging the unmapped read space of whole genome sequences to understand the human genome, virome, and contaminome

Chrisman, Brianna Sierra

Abstract/Contents

Abstract: Whole Genome Sequencing (WGS) has become a widely used method for understanding the genetic contributors to human health and disease. In a typical WGS pipeline, millions of sequencing reads are aligned to the human reference genome to identify potential variants linked to disease. However, in most WGS pipelines between 1% and 10% of these reads will not align to the reference genome, because they come either from non-human genomes or from diverse human sequences not well represented on the current single reference genome. These reads, known as the unmapped read space, are often discarded in downstream analysis, despite their potential role in human health. Using WGS from over 1,000 nuclear families, I interrogate the three contributors to the unmapped read space - bacteria, viruses, and non-reference human sequences - to better understand these important components of human genomic and metagenomic diversity. In this thesis, I present my findings from studying these contributors. (1) I identify common blood DNA viruses and their transmission patterns, including never-before-seen integration and latency mechanisms of Human Herpesvirus 6B and 7 in lymphocytes. (2) I characterize signatures of experimental and computational contamination that may pervade WGS pipelines. Of interest to the field of metagenomics, the latter includes Y chromosome fragments that frequently mismap to bacterial references. (3) Finally, to identify non-reference human DNA sequences and localize them to their appropriate location on the human genome, I present a maximum-likelihood algorithm, ASLAN - an Algorithm for Sequence Location Approximation using Nuclear families. After validating ASLAN to show its high accuracy and resolution, I use ASLAN to localize unmapped sequences from WGS to the human genome.
I compare these localizations to alignments to the recently released CHM13 genome (the first assembly of a full human genome, produced by the Telomere-to-Telomere consortium). I identify potentially problematic regions on this assembly, which is set to become the new human reference genome, and characterize hotspots for genetic diversity.
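
Illustrative note (not part of the thesis): the unmapped read space described in the abstract can be pictured with a minimal Python sketch. It assumes alignments are available as plain-text SAM records, where bit 0x4 of the FLAG field marks a read as unmapped. The function names and the example records are hypothetical and do not reflect the pipeline used in the thesis.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def split_reads(sam_lines):
    """Return (mapped, unmapped) read names from SAM text lines."""
    mapped, unmapped = [], []
    for line in sam_lines:
        # Skip blank lines and header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # second column is the bitwise FLAG
        target = unmapped if flag & UNMAPPED_FLAG else mapped
        target.append(fields[0])  # first column is the read name
    return mapped, unmapped


def unmapped_fraction(sam_lines):
    """Fraction of reads whose FLAG marks them as unmapped."""
    mapped, unmapped = split_reads(sam_lines)
    total = len(mapped) + len(unmapped)
    return len(unmapped) / total if total else 0.0


# Hypothetical records, for illustration only.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGCA\tIIIII",
    "read3\t16\tchr2\t200\t60\t5M\t*\t0\t0\tGGCAT\tIIIII",
    "read4\t77\t*\t0\t0\t*\t*\t0\t0\tCCATG\tIIIII",
]

if __name__ == "__main__":
    print(unmapped_fraction(example))
```

In practice, alignments are usually stored as compressed BAM or CRAM files and filtered with dedicated tools; the sketch only shows the flag test that separates the two groups of reads.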

Description

Type of resource	text
Form	electronic resource; remote; computer; online resource
Extent	1 online resource
Place	California
Place	[Stanford, California]
Publisher	[Stanford University]
Copyright date	2022; ©2022
Publication date	2022; 2022
Issuance	monographic
Language	English

Creators/Contributors

Author	Chrisman, Brianna Sierra
Degree supervisor	Wall, Dennis Paul
Thesis advisor	Wall, Dennis Paul
Thesis advisor	Altman, Russ
Thesis advisor	Bhatt, Ami (Ami Siddharth)
Degree committee member	Altman, Russ
Degree committee member	Bhatt, Ami (Ami Siddharth)
Associated with	Stanford University, Department of Bioengineering

Subjects

Genre	Theses
Genre	Text

Bibliographic information

Statement of responsibility	Brianna Sierra Chrisman
Note	Submitted to the Department of Bioengineering
Thesis	Thesis Ph.D. Stanford University 2022
Location	https://purl.stanford.edu/mb573bk1550

Access conditions

License: This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
