Unmapped or untapped? Leveraging the unmapped read space of whole genome sequences to understand the human genome, virome, and contaminome
Abstract/Contents
- Abstract
- Whole Genome Sequencing (WGS) has become an incredibly popular method to understand the genetic contributors to human health and disease. In a typical WGS pipeline, millions of sequencing reads are aligned to the human reference genome in order to identify potential variants linked to disease. However, in most WGS pipelines, between 1% and 10% of these reads will not align to the reference genome, because they come either from non-human genomes or from diverse human sequences not well represented on the current single reference genome. These reads, known as the unmapped read space, are often discarded in downstream analysis, despite their potential role in human health. Using WGS from over 1,000 nuclear families, I interrogate the three contributors to the unmapped read space - bacteria, viruses, and non-reference human sequences - in order to better understand these important components of human genomic and metagenomic diversity. In this thesis, I present my findings from studying these contributors to the unmapped read space. (1) I identify common blood DNA viruses and their transmission patterns, including never-before-seen integration and latency mechanisms of Human Herpesvirus 6B and 7 in lymphocytes. (2) I characterize signatures of experimental and computational contamination that may pervade WGS pipelines. Of interest to the field of metagenomics, the latter includes Y chromosome fragments that frequently mismap to bacterial references. (3) Finally, in order to identify non-reference human DNA sequences and localize them to their appropriate location on the human genome, I present a maximum-likelihood algorithm, ASLAN - an Algorithm for Sequence Location Approximation using Nuclear families. After validating ASLAN to show its high accuracy and resolution, I use ASLAN to localize unmapped sequences from WGS to the human genome.
I compare these localizations to alignments to the recently released CHM13 genome (the first assembly of a full human genome, produced by the Telomere-to-Telomere consortium). I identify potentially problematic regions on this assembly, which is set to become the new human reference genome, and characterize hotspots for genetic diversity.
Description
Type of resource | text |
---|---|
Form | electronic resource; remote; computer; online resource |
Extent | 1 online resource |
Place | California |
Place | [Stanford, California] |
Publisher | [Stanford University] |
Copyright date | 2022; ©2022 |
Publication date | 2022; 2022 |
Issuance | monographic |
Language | English |
Creators/Contributors
Author | Chrisman, Brianna Sierra |
---|---|
Degree supervisor | Wall, Dennis Paul |
Thesis advisor | Wall, Dennis Paul |
Thesis advisor | Altman, Russ |
Thesis advisor | Bhatt, Ami (Ami Siddharth) |
Degree committee member | Altman, Russ |
Degree committee member | Bhatt, Ami (Ami Siddharth) |
Associated with | Stanford University, Department of Bioengineering |
Subjects
Genre | Theses |
---|---|
Genre | Text |
Bibliographic information
Statement of responsibility | Brianna Sierra Chrisman |
---|---|
Note | Submitted to the Department of Bioengineering |
Thesis | Thesis Ph.D. Stanford University 2022 |
Location | https://purl.stanford.edu/mb573bk1550 |
Access conditions
- Copyright
- © 2022 by Brianna Sierra Chrisman
- License
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).