Algorithms for analysis of multiple biological sequences : theory and practice

Placeholder Show Content

Abstract/Contents

Abstract
Availability of massive amounts of genomic data from hundreds of species has introduced many challenging computational problems as well as the need for efficient algorithmic tools that leverage multiple species information to facilitate biological analysis. This dissertation discusses two such problems: noncoding RNA multiple structural alignment and constrained element detection. Noncoding RNA genes (ncRNAs) are regions of the genome that are transcribed but not translated into protein, and fold directly into secondary and tertiary structures which can have a variety of important biological functions. Because their function depends closely on the secondary structure, ncRNAs often do not exhibit enough primary sequence conservation to be properly aligned using standard sequence-based methods. I therefore consider the problem of RNA multiple structural alignment, i.e., performing sequence alignment and secondary structure prediction simultaneously. In the first part of this dissertation I introduce a novel graph theoretic framework for analyzing this problem and prove that when the number of sequences is not fixed it is NP-complete. I also provide a polynomial time algorithm that approximates the optimal solution to within a factor of O(log^2 n). Constrained elements are regions of the human genome exhibiting evidence of purifying selection and therefore biological function. Computational identification of such elements is one of the major goals of comparative genomics. In the second part of this dissertation I present GERP++, a new tool for efficient constrained element detection that significantly improves on one of the current leading methods, GERP. While retaining GERP's biological transparency and metric for quantifying position-specific constraint, GERP++ uses a more rigorous method for computing evolutionary rates and a novel algorithm for element identification that uses statistical significance directly to evaluate and rank candidate elements. These algorithmic improvements decrease the running time by several orders of magnitude in practice, enabling high-throughput analysis of large data sets. Furthermore, I present analysis and biological interpretation of constrained elements identified by GERP++ in the human genome from recently available multiple species alignments.

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Copyright date 2010
Publication date 2009, c2010; 2009
Issuance monographic
Language English

Creators/Contributors

Associated with Davydov, Eugene V
Associated with Stanford University, Computer Science Department.
Primary advisor Batzoglou, Serafim
Thesis advisor Batzoglou, Serafim
Thesis advisor Dill, David L
Thesis advisor Sidow, Arend
Advisor Dill, David L
Advisor Sidow, Arend

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Eugene V. Davydov.
Note Submitted to the Department of Computer Science.
Thesis Thesis (Ph.D.)--Stanford University, 2010.
Location electronic resource

Access conditions

Copyright
© 2010 by Eugene V Davydov
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).

Also listed in

Loading usage metrics...