Text mining of the scientific literature to identify pharmacogenomic interactions


Abstract/Contents

Abstract
Pharmacogenomics is the study of how variation in the human genome impacts drug response in patients. It is a major driving force of "personalized medicine," in which drug choice and dosing decisions are informed by individual information such as DNA genotype. The field of pharmacogenomics is in an era of explosive growth; massive amounts of data are being collected and knowledge discovered, which promises to push forward the reality of individualized clinical care. However, this large amount of data is dispersed across many journals in the scientific literature, and pharmacogenomic findings are discussed in a variety of non-standardized ways. It is thus challenging to identify important associations between drugs and molecular entities, particularly genes and gene variants, and these critical connections are not easily available to investigators or clinicians who wish to survey the state of knowledge for any particular gene, drug, disease, or variant. Manual efforts have attempted to catalog this information; however, the rapid expansion of the pharmacogenomic literature has made this approach infeasible. Natural language processing and text mining techniques allow us to convert free-form text to a computable, searchable format in which pharmacogenomic concepts such as genes, drugs, polymorphisms, and diseases are identified, and important links between these concepts are recorded. My dissertation describes novel computational methods to extract and predict pharmacogenomic relationships from text. In the first project, we extract pharmacogenomic relationships from the primary literature using text mining. We process information at the fine-grained sentence level, using full text when available. In the second project, we investigate the use of these extracted relationships in place of manually curated relationships as input to an algorithm that predicts pharmacogenes for a drug of interest. We show that for this application we can perform as well with text-mined relationships as with manually curated information. This approach holds great promise, as it is cheaper, faster, and more scalable than manual curation. Our method provides us with interesting drug-gene relationship predictions that warrant further experimental investigation. In the third project, we describe knowledge inference in the context of pharmacogenomic relationships. Using cutting-edge natural language processing tools and automated reasoning, we create a rich semantic network of 40,000 pharmacogenomic relationships distilled from 17 million Medline abstracts. This network connects over 200 entity types with clear semantics using more than 70 unique types of relationships. We use this network to create collections of precise and specific types of knowledge, and to infer relationships that are not stated explicitly in the text but can be derived from the large number of related sentences found in the literature. This is exciting because it demonstrates that we are able to overcome the heterogeneity of written language and infer the correct semantics of the relationship described by authors. Finally, we can use this network to identify conflicting facts described in the literature, to study change in language use over time, and to predict drug-drug interactions. These achievements provide us with new ways of interacting with the literature and the knowledge embedded within it. They help ensure that we do not bury the knowledge embodied in the publications, but rather connect the often fragmented and disconnected pieces of knowledge spread across millions of articles in hundreds of journals. This brings us one step closer to the realization of personalized medicine and helps ensure that, as scientists, we continue to build on the knowledge discovered by past generations and truly stand on the shoulders of giants.

Description

Type of resource text
Form electronic; electronic resource; remote
Extent 1 online resource.
Copyright date 2011
Publication date 2010, c2011; 2010
Issuance monographic
Language English

Creators/Contributors

Associated with Garten, Yael
Associated with Stanford University, Department of Biomedical Informatics.
Primary advisor Altman, Russ
Thesis advisor Altman, Russ
Thesis advisor Manning, Christopher D
Thesis advisor Mochly-Rosen, Daria
Thesis advisor Peltz, Gary, 1956-
Advisor Manning, Christopher D
Advisor Mochly-Rosen, Daria
Advisor Peltz, Gary, 1956-

Subjects

Genre Theses

Bibliographic information

Statement of responsibility Yael Garten.
Note Submitted to the Department of Biomedical Informatics.
Thesis Ph.D. Stanford University 2011
Location electronic resource

Access conditions

Copyright
© 2011 by Yael Garten
License
This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).
