Algorithms for analysis of multiple biological sequences : theory and practice
- Availability of massive amounts of genomic data from hundreds of species has introduced many challenging computational problems as well as the need for efficient algorithmic tools that leverage multiple species information to facilitate biological analysis. This dissertation discusses two such problems: noncoding RNA multiple structural alignment and constrained element detection. Noncoding RNA genes (ncRNAs) are regions of the genome that are transcribed but not translated into protein, and fold directly into secondary and tertiary structures which can have a variety of important biological functions. Because their function depends closely on the secondary structure, ncRNAs often do not exhibit enough primary sequence conservation to be properly aligned using standard sequence-based methods. I therefore consider the problem of RNA multiple structural alignment, i.e., performing sequence alignment and secondary structure prediction simultaneously. In the first part of this dissertation I introduce a novel graph theoretic framework for analyzing this problem and prove that when the number of sequences is not fixed it is NP-complete. I also provide a polynomial time algorithm that approximates the optimal solution to within a factor of O(log^2 n). Constrained elements are regions of the human genome exhibiting evidence of purifying selection and therefore biological function. Computational identification of such elements is one of the major goals of comparative genomics. In the second part of this dissertation I present GERP++, a new tool for efficient constrained element detection that significantly improves on one of the current leading methods, GERP. While retaining GERP's biological transparency and metric for quantifying position-specific constraint, GERP++ uses a more rigorous method for computing evolutionary rates and a novel algorithm for element identification that uses statistical significance directly to evaluate and rank candidate elements. These algorithmic improvements decrease the running time by several orders of magnitude in practice, enabling high-throughput analysis of large data sets. Furthermore, I present analysis and biological interpretation of constrained elements identified by GERP++ in the human genome from recently available multiple species alignments.
|Type of resource
|electronic; electronic resource; remote
|1 online resource.
|2009, c2010; 2009
|Davydov, Eugene V
|Stanford University, Computer Science Department.
|Dill, David L
|Dill, David L
|Statement of responsibility
|Eugene V. Davydov.
|Submitted to the Department of Computer Science.
|Thesis (Ph.D.)--Stanford University, 2010.
- © 2010 by Eugene V Davydov
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
Also listed in
Loading usage metrics...