Non-parametric energy potentials : a compressed sensing approach
The foundation of molecular analyses of chemical and biological phenomena is the energy potential, a mathematical description of the energy of every possible interaction in a molecular system. The accuracy of computational and laboratory studies of phenomena ranging from pharmaceutical drug interactions and protein folding to material phase transitions and thin film growth is often limited by the accuracy of these energy potentials. Currently, energy potentials are inferred using a mixture of theoretical modeling and experimental data. So-called "physical potentials" rely on theoretical models to specify the potential's mathematical form and use experimental data to fit a small number of model parameters. In contrast, "statistical potentials", also known as knowledge-based potentials, fit many parameters to experimental data and use theoretical models for the expected statistics of interactions under randomness to infer a potential. In both approaches, theoretical models shape and constrain the inferred potential, resulting in a so-called parametric model. This has several drawbacks: (i) the a priori assumptions underlying the inferred potentials may be inaccurate; (ii) substantial domain knowledge is required, often exceeding what is known; and (iii) potential modeling is lengthy and technically difficult, with the theoretical development of some potentials having taken decades. To overcome these problems, potentials could in principle be determined strictly from experimental data, without recourse to theoretical modeling, by experimentally measuring the energies of all distinct interactions, resulting in a "de novo" energy potential. In practice, direct measurement of interatomic potentials has only been possible for the simplest systems, because a combinatorial explosion in the number of possible interactions renders experiment-based inference intractable.
In this thesis, we develop a tractable methodology for the inference of de novo potentials that circumvents the experimental intractability barrier. Our methodology does so by synthesizing concepts from structural biology, statistical mechanics, and recent discoveries in information theory known as compressed sensing. The result is a non-parametric potential that does not require an a priori assumption of a theoretical model, overcoming a fundamental limitation of both physical and statistical potentials. To explore the utility of this new methodology and the role it can play in molecular analysis, we focus on an important and long-standing problem in biology: the sequence specificity of DNA-binding proteins for their DNA targets. We develop a method that uses energy potentials to predict the DNA binding sites of proteins. In the first part of the thesis, we use existing statistical potentials. We develop three novel enhancements of statistical potentials tailored specifically for the protein-DNA binding application. These enhancements exploit certain molecular features of protein-DNA binding, as well as the unique experimental data sets available in this domain, to improve the performance of statistical potentials in predicting protein-DNA binding. In the second part of the thesis, we introduce the notion of a de novo potential, i.e., an energy potential derived exclusively from experimental data, without relying on any theoretical modeling. We discuss the challenges that have previously prevented the development of such non-parametric potentials, and then introduce our methodology for inferring de novo potentials. We first present a general methodology that is applicable to any domain of molecular interactions, then specialize it to the specific application of protein-DNA interactions.
Using this specialized version, we infer a number of different protein-DNA energy potentials and apply them to the problem of predicting the DNA binding sites of proteins. We show that our new method achieves near-experimental accuracy for over half of the tests performed, significantly outperforming the state of the art in this field.
Type of resource: electronic; electronic resource; remote
1 online resource.
AlQuraishi, Mohammed Nazar
Stanford University, Department of Genetics
Statement of responsibility: Mohammed Nazar AlQuraishi.
Submitted to the Department of Genetics.
Thesis (Ph.D.)--Stanford University, 2011.
- © 2011 by Mohammed Nazar AlQuraishi