An Algorithm for Sequence Location Approximation using Nuclear Families (ASLAN) Validates Regions of the Telomere-to-Telomere Assembly and Identifies New Hotspots for Genetic Diversity: ASLAN Localization and T2T Mapping Results


Abstract/Contents

Abstract

Although it is heavily relied on to study genetic contributors to health and disease, the current human reference genome (GRCh38) is incomplete in two major ways: firstly, it is missing large sections of heterochromatic sequence, and secondly, as a single, linear reference genome, it does not represent the full spectrum of genetic diversity that exists in the human species. In order to better understand and characterize gaps in GRCh38 and genetic diversity, we developed a method - ASLAN, an Algorithm for Sequence Location Approximation using Nuclear families - that identifies the region of origin of short reads that do not align to GRCh38. Using unmapped reads and variant calls from whole genome sequencing (WGS) data from nuclear families, ASLAN relies on a maximum likelihood model to identify the most likely region of the genome that a subsequence belongs to, given the phasing information of the family and the distribution of the subsequence in the unmapped reads. Validating ASLAN on a synthetically generated dataset, and on true reads originating from the alternative haplotypes in the decoy genome, we show that ASLAN can localize more than 90% of 100-basepair sequences with above 92% accuracy and around 1 megabase of resolution. We then run ASLAN on 100-mers from unmapped reads from WGS data from over 700 families, and compare ASLAN localizations to alignments of the 100-mers to the T2T-CHM13 assembly, recently released by the Telomere-to-Telomere (T2T) Consortium. We find that many reads that do not map to GRCh38 originate from telomeres and centromeres that are gaps in the GRCh38 reference. We also confirm that ASLAN localizations are in high concordance with T2T-CHM13 alignments, except in the centromeres of the acrocentric chromosomes. Comparing ASLAN localizations and T2T-CHM13 alignments, we identify sequences missing from T2T-CHM13 or sequences with high divergence from their aligned region in T2T-CHM13, thus highlighting new hotspots for genetic diversity.

This deposit consists of:
(1) A list of non-singleton 100-mers extracted from the unmapped reads of the iHART cohort (kmer_sequences.txt.zip)
(2) The corresponding regions in the genome to which ASLAN mapped each sequence (GRCh38 coordinates) (ASLAN_localizations.bed)
(3) The loci to which each sequence aligned on the CHM13-T2T assembly (CHM13-T2T coordinates) (T2T_mappings.bed)
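
For readers who want to load the two BED files, the following is a minimal sketch rather than a documented loader. It assumes the files follow the standard tab-separated BED layout (chromosome, 0-based start, exclusive end, and an optional name in the fourth column). This record does not describe the actual column layout of ASLAN_localizations.bed or T2T_mappings.bed, so it should be checked against the files themselves. The names BedInterval and parse_bed_lines are hypothetical, and the example lines are made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class BedInterval(NamedTuple):
    """One BED record (hypothetical helper type, not part of the deposit)."""
    chrom: str
    start: int  # 0-based, inclusive (standard BED convention)
    end: int    # 0-based, exclusive (standard BED convention)
    name: str   # ASSUMPTION: column 4, if present, identifies the k-mer


def parse_bed_lines(lines: Iterable[str]) -> Iterator[BedInterval]:
    """Yield BedInterval records from an iterable of BED-formatted lines.

    Blank lines and comment/track/browser header lines are skipped.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 tab-separated fields: {line!r}")
        name = fields[3] if len(fields) > 3 else ""
        yield BedInterval(fields[0], int(fields[1]), int(fields[2]), name)


def interval_length(interval: BedInterval) -> int:
    """Length in base pairs of a half-open BED interval."""
    return interval.end - interval.start


# Made-up example lines, for illustration only (not taken from the deposit).
example_lines = [
    "# illustrative header comment\n",
    "chr1\t1000000\t2000000\tkmer_0\n",
    "chr13\t0\t500000\n",
]
example_intervals = list(parse_bed_lines(example_lines))
```

To read a real file, the same function can be passed an open file handle, for example `with open("ASLAN_localizations.bed") as fh: intervals = list(parse_bed_lines(fh))`. Because the two BED files use different coordinate systems (GRCh38 and CHM13-T2T), their coordinates are not directly comparable without a coordinate conversion step.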

Description

Type of resource Dataset, text
Date created [ca. February 2022]
Date modified March 22, 2023
Publication date January 17, 2023

Creators/Contributors

Author Chrisman, Brianna
Author Paskov, Kelley
Author Jung, Jae-Yoon
Author Stockham, Nate
Author Wall, Dennis
Author He, Chloe

Subjects

Subject Genomics
Subject Human genome
Genre Data
Genre Tabular data
Genre Data sets
Genre Dataset
Genre Tables (data)

Bibliographic information

Related item
DOI https://doi.org/10.25740/sx779pk7425
Location https://purl.stanford.edu/sx779pk7425

Access conditions

Use and reproduction
User agrees that, where applicable, content will not be used to identify or to otherwise infringe the privacy or confidentiality rights of individuals. Content distributed via the Stanford Digital Repository may be subject to additional license and use restrictions applied by the depositor.
License
This work is licensed under a Creative Commons Attribution 4.0 International license (CC BY).

Preferred citation

Preferred citation
Chrisman, Brianna, et al. "An Algorithm for Sequence Location Approximation using Nuclear Families (ASLAN) Validates Regions of the Telomere-to-Telomere Assembly and Identifies New Hotspots for Genetic Diversity." bioRxiv (2022).
