Human diploid assembly and variant detection methods based on linked-reads technology
- Determining the full genomes of individuals is critical for uncovering the causes of diseases and the sources of variability between individuals. Recent technical breakthroughs in third-generation long-read and linked-read sequencing have brought us closer to this goal, but many laboratory and computational problems remain unresolved. The purpose of this work was to develop methods to address a number of current issues relating to the generation and assembly of personal genomes based on linked-reads; the detection of genetic variation, such as single nucleotide polymorphisms (SNPs), small insertions and deletions (indels), and large structural variants (SVs), in genomes produced in this manner; and the correct arrangement of alleles on the two homologous chromosomes (phasing). We first implemented a simulator, LRTK-SIM, which models the characteristics of 10x linked-reads sequencing, to determine the optimal parameters required to generate high-quality human assemblies. We prepared and sequenced eight real linked-reads libraries with a diverse set of parameters from the standard cell lines NA12878 and NA24385 and performed whole-genome assembly on both simulated and real data sets. We found that assembly quality could be optimized with a physical coverage between 332X and 823X and a length-weighted fragment length (WμFL) of approximately 50 to 150 kb. We then addressed whether variants can be detected precisely in personalized genomes produced by de novo assembly with Supernova, the only currently existing assembly software for 10x linked-reads libraries. To this end, we examined variants in six assemblies with diverse experimental parameters. We found that assemblies are effective for detecting mid-size structural variants with high breakpoint accuracy, with a diploid fraction of around 80%.
Motivated by the current limitations in generating high-quality diploid assemblies and detecting variants through assemblies, we developed a new software tool, Aquila, to take full advantage of 10x linked-reads sequencing technology. The key insight of Aquila was to leverage the strengths of linked-read technology (long-range connectivity and inherent phasing of variants) for reference-assisted local de novo assembly. We showed that Aquila achieved contiguity for both haplotypes on a genome-wide scale, with over 98% of a human Aquila-assembled genome being diploid. The truly diploid nature of the assemblies facilitated detection of the most prevalent types of human genetic variation, including SNPs, small indels, and SVs, in all but the most difficult regions. All heterozygous variants were phased in blocks that can approach arm-level length. An extension of the software, Aquila stLFR, took advantage of another newly developed linked-read sequencing technology, single tube long fragment read (stLFR). Furthermore, hybrid assembly based on both stLFR and 10x linked-reads libraries was possible, taking advantage of the complementary strengths of the two technologies. Finally, we improved detection of de novo mutations, a difficult problem because of their small number relative to the genome-wide false positives in next-generation sequencing, by developing HAPDeNovo, a program that leveraged phasing information from linked-read sequencing. Collectively, this work breaks new ground in computational applications that take advantage of third-generation, linked-reads sequencing approaches to generate personal genomes and detect genetic variation. Wide adoption of the tools we developed and made available is expected to facilitate discovery and help explain the genetic causes of complex conditions.
|Type of resource
|electronic resource; remote; computer; online resource
|1 online resource.
|Kundaje, Anshul, 1980-
|Dill, David L
|Degree committee member
|Stanford University, Department of Computer Science.
|Statement of responsibility
|Submitted to the Computer Science Department.
|Thesis Ph.D. Stanford University 2019.
- © 2019 by Xin Zhou
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).