An Algorithm for Sequence Location Approximation using Nuclear Families (ASLAN) Validates Regions of the Telomere-to-Telomere Assembly and Identifies New Hotspots for Genetic Diversity: ASLAN Localization and T2T Mapping Results

Chrisman, Brianna; Paskov, Kelley; Jung, Jae-Yoon; Stockham, Nate; Wall, Dennis; He, Chloe

doi:10.25740/sx779pk7425


Abstract/Contents

Abstract

Although it is heavily relied on to study genetic contributors to health and disease, the current human reference genome (GRCh38) is incomplete in two major ways: first, it is missing large sections of heterochromatic sequence, and second, as a single linear reference genome, it does not represent the full spectrum of genetic diversity that exists in the human species. To better understand and characterize gaps in GRCh38 and genetic diversity, we developed ASLAN (an Algorithm for Sequence Location Approximation using Nuclear families), a method that identifies the region of origin of short reads that do not align to GRCh38. Using unmapped reads and variant calls from whole genome sequencing (WGS) data from nuclear families, ASLAN relies on a maximum likelihood model to identify the region of the genome that a subsequence most likely belongs to, given the phasing information of the family and the distribution of the subsequence in the unmapped reads. Validating ASLAN on a synthetically generated dataset, and on true reads originating from the alternative haplotypes in the decoy genome, we show that ASLAN can localize more than 90% of 100-basepair sequences with greater than 92% accuracy and a resolution of around 1 megabase. We then run ASLAN on 100-mers from unmapped reads from WGS of over 700 families, and compare ASLAN localizations to alignments of the 100-mers to the T2T-CHM13 assembly, recently released by the Telomere-to-Telomere (T2T) Consortium. We find that many reads that do not map to GRCh38 originate from telomeres and centromeres that are gaps in the GRCh38 reference. We also confirm that ASLAN localizations are in high concordance with T2T-CHM13 alignments, except in the centromeres of the acrocentric chromosomes. Comparing ASLAN localizations and T2T-CHM13 alignments, we identify sequences that are missing from T2T-CHM13 or that diverge strongly from their aligned region in T2T-CHM13, thus highlighting new hotspots for genetic diversity.
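
To make the idea of localizing a sequence by maximum likelihood concrete, the following is a toy sketch and not the ASLAN implementation. It assumes a single family, reduces the distribution of the subsequence in the unmapped reads to a present/absent call per child, and uses a fixed per-child error rate. All names (`localize`, `phasing`, `error_rate`) and the region labels are hypothetical.

```python
import math

def localize(kmer_presence, phasing, error_rate=0.05):
    """Return (region, log_likelihood) for the best-fitting region.

    kmer_presence: dict child -> bool, True if the k-mer was seen in
        that child's unmapped reads (assumed simplification).
    phasing: dict region -> dict child -> (paternal_hap, maternal_hap),
        where each haplotype label is 0 or 1 (assumed encoding).
    """
    best_region, best_ll = None, -math.inf
    for region, inherited in phasing.items():
        # Try each of the four parental haplotypes as the carrier of the k-mer.
        for parent in (0, 1):      # 0 = father, 1 = mother
            for hap in (0, 1):
                ll = 0.0
                for child, seen in kmer_presence.items():
                    # A child is expected to carry the k-mer only if it
                    # inherited the candidate haplotype in this region.
                    expected = inherited[child][parent] == hap
                    p = 1 - error_rate if seen == expected else error_rate
                    ll += math.log(p)
                if ll > best_ll:
                    best_region, best_ll = region, ll
    return best_region, best_ll

# Hypothetical three-child family and two candidate regions.
phasing = {
    "chr1:0-1Mb": {"c1": (0, 0), "c2": (1, 1), "c3": (0, 1)},
    "chr2:0-1Mb": {"c1": (0, 1), "c2": (0, 1), "c3": (1, 0)},
}
presence = {"c1": True, "c2": True, "c3": False}
print(localize(presence, phasing))
```

In this sketch, a region scores highly when some parental haplotype is inherited by exactly the children in whom the k-mer was observed. Under the same assumptions, evidence from several families could be combined by summing their log-likelihoods per region.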

This deposit consists of:
(1) A list of non-singleton 100-mers extracted from the unmapped reads of the iHART cohort (kmer_sequences.txt.zip)
(2) The corresponding regions of the genome to which ASLAN localized each sequence, in GRCh38 coordinates (ASLAN_localizations.bed)
(3) The loci to which each sequence aligned on the T2T-CHM13 assembly, in T2T-CHM13 coordinates (T2T_mappings.bed)
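
The two BED files can be read with a few lines of Python. The sketch below assumes the standard tab-separated BED layout (chromosome, 0-based start, exclusive end, then any optional columns). The exact columns of the deposited files are not documented here, so inspect them before relying on this. The names `Interval`, `parse_bed`, and `width` are illustrative.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple

class Interval(NamedTuple):
    chrom: str
    start: int              # 0-based, inclusive (BED convention)
    end: int                # exclusive (BED convention)
    extra: Tuple[str, ...]  # any remaining columns, kept as strings

def parse_bed(lines: Iterable[str]) -> Iterator[Interval]:
    """Yield one Interval per data line, skipping blanks and header lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        yield Interval(fields[0], int(fields[1]), int(fields[2]), tuple(fields[3:]))

def width(interval: Interval) -> int:
    """Length of the interval in base pairs."""
    return interval.end - interval.start

# Example with in-memory lines; for the deposit, pass an open file instead,
# e.g. `with open("ASLAN_localizations.bed") as fh: ... parse_bed(fh)`.
sample = ["# comment\n", "chr1\t1000000\t2000000\tkmer_17\n", "\n"]
for iv in parse_bed(sample):
    print(iv.chrom, iv.start, iv.end, width(iv))
```

Because ASLAN localizations are given in GRCh38 coordinates and the alignments in T2T-CHM13 coordinates, the two files cannot be compared position by position without first converting between assemblies.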

Description

Type of resource	Dataset, text
Date created	[ca. February 2022]
Date modified	March 22, 2023
Publication date	January 17, 2023

Creators/Contributors

Author	Chrisman, Brianna
Author	Paskov, Kelley
Author	Jung, Jae-Yoon
Author	Stockham, Nate
Author	Wall, Dennis
Author	He, Chloe

Subjects

Subject	Genomics
Subject	Human genome
Genre	Data
Genre	Tabular data
Genre	Data sets
Genre	Dataset
Genre	Tables (data)

Bibliographic information

Access conditions

Use and reproduction: User agrees that, where applicable, content will not be used to identify or to otherwise infringe the privacy or confidentiality rights of individuals. Content distributed via the Stanford Digital Repository may be subject to additional license and use restrictions applied by the depositor.
License: This work is licensed under a Creative Commons Attribution 4.0 International license (CC BY).

Preferred citation

Preferred citation: Chrisman, Brianna, et al. "An Algorithm for Sequence Location Approximation using Nuclear Families (ASLAN) Validates Regions of the Telomere-to-Telomere Assembly and Identifies New Hotspots for Genetic Diversity." bioRxiv (2022).

Collection

Stanford Research Data


Contact information

Contact: briannac@stanford.edu; brianna.chrisman@gmail.com; dpwall@stanford.edu
